[jira] [Commented] (HDFS-14498) LeaseManager can loop forever on the file for which create has failed
[ https://issues.apache.org/jira/browse/HDFS-14498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16845100#comment-16845100 ]

Sergey Shelukhin commented on HDFS-14498:
-----------------------------------------

Fixing it to reduce logging (to log rarely) might make it easier to get any useful info for a repro. We've hit it a few times and every time all useful logs are gone.

> LeaseManager can loop forever on the file for which create has failed
> ---------------------------------------------------------------------
>
>                 Key: HDFS-14498
>                 URL: https://issues.apache.org/jira/browse/HDFS-14498
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 2.9.0
>            Reporter: Sergey Shelukhin
>            Priority: Major
>
> The logs from file creation are long gone due to infinite lease logging, however it presumably failed... the client who was trying to write this file is definitely long dead.
> The version includes HDFS-4882.
> We get this log pattern repeating infinitely:
> {noformat}
> 2019-05-16 14:00:16,893 INFO [org.apache.hadoop.hdfs.server.namenode.LeaseManager$Monitor@b27557f] org.apache.hadoop.hdfs.server.namenode.LeaseManager: [Lease. Holder: DFSClient_NONMAPREDUCE_-20898906_61, pending creates: 1] has expired hard limit
> 2019-05-16 14:00:16,893 INFO [org.apache.hadoop.hdfs.server.namenode.LeaseManager$Monitor@b27557f] org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Recovering [Lease. Holder: DFSClient_NONMAPREDUCE_-20898906_61, pending creates: 1], src=
> 2019-05-16 14:00:16,893 WARN [org.apache.hadoop.hdfs.server.namenode.LeaseManager$Monitor@b27557f] org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.internalReleaseLease: Failed to release lease for file . Committed blocks are waiting to be minimally replicated. Try again later.
> 2019-05-16 14:00:16,893 WARN [org.apache.hadoop.hdfs.server.namenode.LeaseManager$Monitor@b27557f] org.apache.hadoop.hdfs.server.namenode.LeaseManager: Cannot release the path in the lease [Lease. Holder: DFSClient_NONMAPREDUCE_-20898906_61, pending creates: 1]. It will be retried.
> org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: DIR* NameSystem.internalReleaseLease: Failed to release lease for file . Committed blocks are waiting to be minimally replicated. Try again later.
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.internalReleaseLease(FSNamesystem.java:3357)
>     at org.apache.hadoop.hdfs.server.namenode.LeaseManager.checkLeases(LeaseManager.java:573)
>     at org.apache.hadoop.hdfs.server.namenode.LeaseManager$Monitor.run(LeaseManager.java:509)
>     at java.lang.Thread.run(Thread.java:745)
> $ grep -c "Recovering.*DFSClient_NONMAPREDUCE_-20898906_61, pending creates: 1" hdfs_nn*
> hdfs_nn.log:1068035
> hdfs_nn.log.2019-05-16-14:1516179
> hdfs_nn.log.2019-05-16-15:1538350
> {noformat}
> Aside from an actual bug fix, it might make sense to make LeaseManager not log so much, in case there are more bugs like this...

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
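The suggestion to make LeaseManager "log rarely" could be sketched as a small time-based throttle. This is only an illustration under stated assumptions, not LeaseManager code: the class `LogThrottle`, the method `shouldLog`, and the choice of keying on a string such as the lease holder are all hypothetical, and the clock is passed in explicitly so the behaviour is deterministic.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical helper (not part of Hadoop): lets a message for a given key
 * through at most once per interval and counts what was suppressed.
 */
class LogThrottle {
  private static final class State {
    long lastLoggedMs;
    long suppressed;
  }

  private final long minIntervalMs;
  private final Map<String, State> states = new HashMap<>();

  LogThrottle(long minIntervalMs) {
    this.minIntervalMs = minIntervalMs;
  }

  /**
   * Returns -1 if the message should be suppressed; otherwise returns the
   * number of messages suppressed for this key since the last one logged.
   */
  synchronized long shouldLog(String key, long nowMs) {
    State s = states.get(key);
    if (s == null) {
      s = new State();
      s.lastLoggedMs = nowMs;
      states.put(key, s);
      return 0; // the first occurrence is always logged
    }
    if (nowMs - s.lastLoggedMs < minIntervalMs) {
      s.suppressed++;
      return -1;
    }
    long skipped = s.suppressed;
    s.suppressed = 0;
    s.lastLoggedMs = nowMs;
    return skipped;
  }
}
```

A caller would log only when the result is non-negative and could append the suppressed count to the message, so a retry loop like the one in the report leaves a trace without crowding out older log entries.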
[jira] [Commented] (HDFS-14498) LeaseManager can loop forever on the file for which create has failed
[ https://issues.apache.org/jira/browse/HDFS-14498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16844169#comment-16844169 ]

Sergey Shelukhin commented on HDFS-14498:
-----------------------------------------

The version is a somewhat modified 2.9 (no modifications in this area as far as I can see). Unfortunately, when this thing starts spamming logs, our NN logs roll to the limit in about 2-3h, so all the useful logs for this file are lost.
[jira] [Commented] (HDFS-14498) LeaseManager can loop forever on the file for which create has failed
[ https://issues.apache.org/jira/browse/HDFS-14498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16842339#comment-16842339 ]

Sergey Shelukhin commented on HDFS-14498:
-----------------------------------------

I don't have a repro because this issue also basically destroys namenode logs. Judging by the fact that the file's blocks cannot be replicated, I'm assuming that the write failed at some stage and/or the client died. The file only has one block, and that block is the one without replicas.
[jira] [Commented] (HDFS-14498) LeaseManager can loop forever on the file for which create has failed
[ https://issues.apache.org/jira/browse/HDFS-14498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16841839#comment-16841839 ]

Sergey Shelukhin commented on HDFS-14498:
-----------------------------------------

cc [~elgoiri] [~ashlhud] [~raviprak]
[jira] [Created] (HDFS-14498) LeaseManager can loop forever on the file for which create has failed
Sergey Shelukhin created HDFS-14498:
---------------------------------------

             Summary: LeaseManager can loop forever on the file for which create has failed
                 Key: HDFS-14498
                 URL: https://issues.apache.org/jira/browse/HDFS-14498
             Project: Hadoop HDFS
          Issue Type: Bug
    Affects Versions: 2.9.0
            Reporter: Sergey Shelukhin

The logs from file creation are long gone due to infinite lease logging, however it presumably failed... the client who was trying to write this file is definitely long dead.
The version includes HDFS-4882.
We get this log pattern repeating infinitely:
{noformat}
2019-05-16 14:00:16,893 INFO [org.apache.hadoop.hdfs.server.namenode.LeaseManager$Monitor@b27557f] org.apache.hadoop.hdfs.server.namenode.LeaseManager: [Lease. Holder: DFSClient_NONMAPREDUCE_-20898906_61, pending creates: 1] has expired hard limit
2019-05-16 14:00:16,893 INFO [org.apache.hadoop.hdfs.server.namenode.LeaseManager$Monitor@b27557f] org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Recovering [Lease. Holder: DFSClient_NONMAPREDUCE_-20898906_61, pending creates: 1], src=
2019-05-16 14:00:16,893 WARN [org.apache.hadoop.hdfs.server.namenode.LeaseManager$Monitor@b27557f] org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.internalReleaseLease: Failed to release lease for file . Committed blocks are waiting to be minimally replicated. Try again later.
2019-05-16 14:00:16,893 WARN [org.apache.hadoop.hdfs.server.namenode.LeaseManager$Monitor@b27557f] org.apache.hadoop.hdfs.server.namenode.LeaseManager: Cannot release the path in the lease [Lease. Holder: DFSClient_NONMAPREDUCE_-20898906_61, pending creates: 1]. It will be retried.
org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: DIR* NameSystem.internalReleaseLease: Failed to release lease for file . Committed blocks are waiting to be minimally replicated. Try again later.
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.internalReleaseLease(FSNamesystem.java:3357)
    at org.apache.hadoop.hdfs.server.namenode.LeaseManager.checkLeases(LeaseManager.java:573)
    at org.apache.hadoop.hdfs.server.namenode.LeaseManager$Monitor.run(LeaseManager.java:509)
    at java.lang.Thread.run(Thread.java:745)
$ grep -c "Recovering.*DFSClient_NONMAPREDUCE_-20898906_61, pending creates: 1" hdfs_nn*
hdfs_nn.log:1068035
hdfs_nn.log.2019-05-16-14:1516179
hdfs_nn.log.2019-05-16-15:1538350
{noformat}
Aside from an actual bug fix, it might make sense to make LeaseManager not log so much, in case there are more bugs like this...
[jira] [Created] (HDFS-14387) create a client-side override for dfs.namenode.block-placement-policy.default.prefer-local-node
Sergey Shelukhin created HDFS-14387:
---------------------------------------

             Summary: create a client-side override for dfs.namenode.block-placement-policy.default.prefer-local-node
                 Key: HDFS-14387
                 URL: https://issues.apache.org/jira/browse/HDFS-14387
             Project: Hadoop HDFS
          Issue Type: Bug
            Reporter: Sergey Shelukhin

It should be possible for a service to decide whether it wants to use the local node preference. As it stands, if dfs.namenode.block-placement-policy.default.prefer-local-node is enabled, services that run far fewer instances than there are DNs in the cluster unnecessarily concentrate their write load on the few DNs they run on. The only way around it seems to be to disable prefer-local-node globally.
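One way to read the request is that a per-client setting, when present, takes precedence over the cluster-wide default. The resolution rule can be sketched as below; the class `PlacementPreference` and its method are hypothetical and do not correspond to an existing HDFS API, and how the override would reach the NameNode is left open.

```java
import java.util.Optional;

/** Hypothetical illustration of a client-side override; not an HDFS API. */
class PlacementPreference {
  /**
   * @param clusterDefault value of
   *        dfs.namenode.block-placement-policy.default.prefer-local-node
   * @param clientOverride what the writing service asked for, if anything
   * @return whether the first replica should prefer the writer's local node
   */
  static boolean preferLocalNode(boolean clusterDefault,
                                 Optional<Boolean> clientOverride) {
    // An explicit client choice wins; otherwise fall back to the default.
    return clientOverride.orElse(clusterDefault);
  }
}
```

With such a rule, a service with few instances could pass `Optional.of(false)` to spread its writes while every other client keeps the global setting.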
[jira] [Commented] (HDFS-7878) API - expose a unique file identifier
[ https://issues.apache.org/jira/browse/HDFS-7878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16227160#comment-16227160 ]

Sergey Shelukhin commented on HDFS-7878:
----------------------------------------

Thank you for all the work getting this in! I thought it wouldn't actually ever happen :)

> API - expose a unique file identifier
> -------------------------------------
>
>                 Key: HDFS-7878
>                 URL: https://issues.apache.org/jira/browse/HDFS-7878
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Sergey Shelukhin
>            Assignee: Chris Douglas
>              Labels: BB2015-05-TBR
>             Fix For: 3.0.0
>
>         Attachments: HDFS-7878.01.patch, HDFS-7878.02.patch, HDFS-7878.03.patch, HDFS-7878.04.patch, HDFS-7878.05.patch, HDFS-7878.06.patch, HDFS-7878.07.patch, HDFS-7878.08.patch, HDFS-7878.09.patch, HDFS-7878.10.patch, HDFS-7878.11.patch, HDFS-7878.12.patch, HDFS-7878.13.patch, HDFS-7878.14.patch, HDFS-7878.15.patch, HDFS-7878.16.patch, HDFS-7878.17.patch, HDFS-7878.18.patch, HDFS-7878.19.patch, HDFS-7878.20.patch, HDFS-7878.21.patch, HDFS-7878.patch
>
>
> See HDFS-487.
> Even though that is resolved as duplicate, the ID is actually not exposed by the JIRA it supposedly duplicates.
> INode ID for the file should be easy to expose; alternatively ID could be derived from block IDs, to account for appends...
> This is useful e.g. for cache key by file, to make sure cache stays correct when file is overwritten.
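The cache-key use case from the issue description can be illustrated with a toy cache. This is a sketch under assumptions, not the API that was committed: `FileKey` and `FileDataCache` are hypothetical names, and the file ID is assumed to change whenever the file at a path is replaced.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

/** Hypothetical cache key: a path alone is ambiguous once a file is overwritten. */
final class FileKey {
  final String path;
  final long fileId; // assumed to change when the file is replaced

  FileKey(String path, long fileId) {
    this.path = path;
    this.fileId = fileId;
  }

  @Override
  public boolean equals(Object o) {
    if (!(o instanceof FileKey)) {
      return false;
    }
    FileKey other = (FileKey) o;
    return fileId == other.fileId && path.equals(other.path);
  }

  @Override
  public int hashCode() {
    return Objects.hash(path, fileId);
  }
}

/** Toy in-memory cache of file contents keyed by FileKey. */
class FileDataCache {
  private final Map<FileKey, byte[]> entries = new HashMap<>();

  void put(FileKey key, byte[] data) {
    entries.put(key, data);
  }

  /** Returns cached bytes, or null if this exact file version is not cached. */
  byte[] get(FileKey key) {
    return entries.get(key);
  }
}
```

Because the ID is part of the key, data cached for the old file is simply never found after an overwrite, so no explicit invalidation is needed for correctness.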
[jira] [Commented] (HDFS-7878) API - expose an unique file identifier
[ https://issues.apache.org/jira/browse/HDFS-7878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16138848#comment-16138848 ]

Sergey Shelukhin commented on HDFS-7878:
----------------------------------------

I'd like to attempt to resurrect this. However, from reading the comments above I don't sense a consensus here... At this point, I really don't care at all about the class and method structure anymore as long as there's a usable API.
Can someone please summarize the above discussion or guide one of the patch versions through? It reads to me like every approach taken w.r.t. the classes/etc. was rejected by at least one person, so I'm not sure which one to take.

> API - expose an unique file identifier
> --------------------------------------
>
>                 Key: HDFS-7878
>                 URL: https://issues.apache.org/jira/browse/HDFS-7878
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Sergey Shelukhin
>            Assignee: Sergey Shelukhin
>              Labels: BB2015-05-TBR
>         Attachments: HDFS-7878.01.patch, HDFS-7878.02.patch, HDFS-7878.03.patch, HDFS-7878.04.patch, HDFS-7878.05.patch, HDFS-7878.06.patch, HDFS-7878.patch
>
>
> See HDFS-487.
> Even though that is resolved as duplicate, the ID is actually not exposed by the JIRA it supposedly duplicates.
> INode ID for the file should be easy to expose; alternatively ID could be derived from block IDs, to account for appends...
> This is useful e.g. for cache key by file, to make sure cache stays correct when file is overwritten.
[jira] [Commented] (HDFS-7878) API - expose an unique file identifier
[ https://issues.apache.org/jira/browse/HDFS-7878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15456174#comment-15456174 ]

Sergey Shelukhin commented on HDFS-7878:
----------------------------------------

I meant when callers are in different processes/nodes, e.g. Hive generating splits for remote tasks. We can work around it by getting the status and verifying the file ID, then opening from the status. However, a fileId-based open would be more convenient (and would potentially save a call, at least from the caller's perspective; I'm not sure whether it makes the same difference in terms of NN calls)...
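The workaround described in this comment (look up the status, verify the file ID, then open) can be sketched as follows. `StatusSource` is a hypothetical stand-in for the file system's metadata lookup and `ConsistentOpen` is an illustrative name; neither is a Hadoop type.

```java
/** Hypothetical stand-in for the file system call that returns a file's ID. */
interface StatusSource {
  /** Returns the current file ID for the path (one metadata lookup). */
  long currentFileId(String path);
}

class ConsistentOpen {
  /**
   * Checks that the path still refers to the file the split was planned for.
   * Throws if the file was replaced between planning and reading.
   */
  static void verifySameFile(StatusSource fs, String path, long expectedFileId) {
    long actual = fs.currentFileId(path);
    if (actual != expectedFileId) {
      throw new IllegalStateException("File " + path + " was replaced: expected ID "
          + expectedFileId + " but found " + actual);
    }
  }
}
```

The remote task performs the lookup itself before reading, which is the extra call that an open-by-ID API would hide from the caller.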
[jira] [Comment Edited] (HDFS-7878) API - expose an unique file identifier
[ https://issues.apache.org/jira/browse/HDFS-7878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15453828#comment-15453828 ]

Sergey Shelukhin edited comment on HDFS-7878 at 9/1/16 12:13 AM:
-----------------------------------------------------------------

One idea behind open(long/InodeId) is to be able to open files consistently; e.g. for partial caching (one needs to be sure that the cached data and the data read from FS are for the same file, guarding against overwrites). File ID is easy to propagate between different readers for this purpose, but it seems that FileStatus would be rather inconvenient. It forces the caller who is dealing with the FS to get the status by name first (which also only works if the name is known; in our case we do know the name) and verify that fileId is consistent. Is it possible to keep both APIs?

was (Author: sershe):
One idea behind open(long/InodeId) is to be able to open files consistently; e.g. for partial caching (one needs to be sure that the cached data and the data read from FS are for the same file, guarding against overwrites). File ID is easy to propagate between different readers for this purpose, but it seems that FileStatus would be rather inconvenient. It forces the caller to get the status by name first (which also only works if the name is known; in our case we do know the name) and verify that fileId is consistent. Is it possible to keep both APIs?
[jira] [Commented] (HDFS-7878) API - expose an unique file identifier
[ https://issues.apache.org/jira/browse/HDFS-7878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15453828#comment-15453828 ]

Sergey Shelukhin commented on HDFS-7878:
----------------------------------------

One idea behind open(long/InodeId) is to be able to open files consistently; e.g. for partial caching (one needs to be sure that the cached data and the data read from FS are for the same file, guarding against overwrites). File ID is easy to propagate between different readers for this purpose, but it seems that FileStatus would be rather inconvenient. It forces the caller to get the status by name first (which also only works if the name is known; in our case we do know the name) and verify that fileId is consistent. Is it possible to keep both APIs?
[jira] [Commented] (HDFS-7878) API - expose an unique file identifier
[ https://issues.apache.org/jira/browse/HDFS-7878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15453830#comment-15453830 ]

Sergey Shelukhin commented on HDFS-7878:
----------------------------------------

Thanks for picking this up by the way!
[jira] [Updated] (HDFS-10757) KMSClientProvider combined with KeyProviderCache can result in wrong UGI being used
[ https://issues.apache.org/jira/browse/HDFS-10757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sergey Shelukhin updated HDFS-10757:
------------------------------------
    Summary: KMSClientProvider combined with KeyProviderCache can result in wrong UGI being used  (was: KMSClientProvider combined with KeyProviderCache results in wrong UGI being used)

> KMSClientProvider combined with KeyProviderCache can result in wrong UGI being used
> -----------------------------------------------------------------------------------
>
>                 Key: HDFS-10757
>                 URL: https://issues.apache.org/jira/browse/HDFS-10757
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin
>            Priority: Critical
>
> ClientContext::get gets the context from CACHE via a config setting based name, then KeyProviderCache stored in ClientContext gets the key provider cached by URI from the configuration, too. These would return the same KeyProvider regardless of current UGI.
> KMSClientProvider caches the UGI (actualUgi) in ctor; that means in particular that all the users of DFS with KMSClientProvider in a process will get the KMS token (along with other credentials) of the first user, via the above cache.
> Either KMSClientProvider shouldn't store the UGI, or one of the caches should be UGI-aware, like the FS object cache.
> Side note: the comment in createConnection that purports to handle the different UGI doesn't seem to cover what it says it covers. In our case, we have two unrelated UGIs with no auth (createRemoteUser) with bunch of tokens, including a KMS token, added.
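The "UGI-aware cache" option from the description can be sketched with a cache whose key includes the user as well as the provider URI. This is a simplified illustration, not Hadoop code: `PerUserProviderCache` is a hypothetical class, and a plain user-name string stands in for a UGI, whose real equality semantics are more involved.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical sketch; class and method names are not from Hadoop. */
class PerUserProviderCache<P> {
  // Composite key (uri, user); List gives value-based equals/hashCode.
  private final Map<List<String>, P> cache = new ConcurrentHashMap<>();

  /**
   * Returns the provider cached for this (uri, user) pair, creating it with
   * the factory on first use. Keying on the URI alone would hand every user
   * the provider (and credentials) of whoever asked first.
   */
  P get(String uri, String user, Function<String, P> factoryForUser) {
    List<String> key = Arrays.asList(uri, user);
    return cache.computeIfAbsent(key, k -> factoryForUser.apply(user));
  }
}
```

The alternative named in the description, having the provider not store the UGI at all, would avoid the need for a per-user key but changes the provider rather than the cache.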
[jira] [Updated] (HDFS-10757) KMSClientProvider combined with KeyProviderCache results in wrong UGI being used
[ https://issues.apache.org/jira/browse/HDFS-10757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sergey Shelukhin updated HDFS-10757:
------------------------------------
    Description:
ClientContext::get gets the context from CACHE via a config setting based name, then KeyProviderCache stored in ClientContext gets the key provider cached by URI from the configuration, too. These would return the same KeyProvider regardless of current UGI.
KMSClientProvider caches the UGI (actualUgi) in ctor; that means in particular that all the users of DFS with KMSClientProvider in a process will get the KMS token (along with other credentials) of the first user, via the above cache.

Either KMSClientProvider shouldn't store the UGI, or one of the caches should be UGI-aware, like the FS object cache.

Side note: the comment in createConnection that purports to handle the different UGI doesn't seem to cover what it says it covers. In our case, we have two unrelated UGIs with no auth (createRemoteUser) with bunch of tokens, including a KMS token, added.

  was:
ClientContext::get gets the context from CACHE via a config setting based name, then KeyProviderCache stored in ClientContext gets the key provider cached by URI from the configuration, too. These would return the same KeyProvider regardless of current UGI.
KMSClientProvider caches the UGI (actualUgi) in ctor; that means in particular that all the users of DFS with KMSClientProvider in a process will get the KMS token (along with other credentials) of the first user, via the above cache.

Either KMSClientProvider shouldn't store the UGI, or one of the caches should be UGI-aware, like the FS object cache.

Side note: the comment in createConnection that purports to handle the different UGI doesn't seem to cover it says it covers. In our case, we have two unrelated UGIs with no auth (createRemoteUser) with bunch of tokens, including a KMS token, added.
[jira] [Updated] (HDFS-10757) KMSClientProvider combined with KeyProviderCache results in wrong UGI being used
[ https://issues.apache.org/jira/browse/HDFS-10757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergey Shelukhin updated HDFS-10757: Description: ClientContext::get gets the context from CACHE via a config setting based name, then KeyProviderCache stored in ClientContext gets the key provider cached by URI from the configuration, too. These would return the same KeyProvider regardless of current UGI. KMSClientProvider caches the UGI (actualUgi) in ctor; that means in particular that all the users of DFS with KMSClientProvider in a process will get the KMS token (along with other credentials) of the first user, via the above cache. Either KMSClientProvider shouldn't store the UGI, or one of the caches should be UGI-aware, like the FS object cache. Side note: the comment in createConnection that purports to handle the different UGI doesn't seem to cover it says it covers. In our case, we have two unrelated UGIs with no auth (createRemoteUser) with bunch of tokens, including a KMS token, added. was: ClientContext::get gets the context from CACHE via a config setting based name, then KeyProviderCache stored in ClientContext gets the key provider cached by URI from the configuration, too. These would return the same KeyProvider regardless of current UGI. KMSClientProvider caches the UGI (actualUgi) in ctor; that means in particular that all the users of DFS with KMSClientProvider in a process will get the KMS token (along with other credentials) of the first user, via the above cache. Either KMSClientProvider shouldn't store the UGI, or one of the caches should be UGI-aware, like the FS object cache. 
> KMSClientProvider combined with KeyProviderCache results in wrong UGI being > used > > > Key: HDFS-10757 > URL: https://issues.apache.org/jira/browse/HDFS-10757 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Sergey Shelukhin >Priority: Critical > > ClientContext::get gets the context from CACHE via a config setting based > name, then KeyProviderCache stored in ClientContext gets the key provider > cached by URI from the configuration, too. These would return the same > KeyProvider regardless of current UGI. > KMSClientProvider caches the UGI (actualUgi) in ctor; that means in > particular that all the users of DFS with KMSClientProvider in a process will > get the KMS token (along with other credentials) of the first user, via the > above cache. > Either KMSClientProvider shouldn't store the UGI, or one of the caches should > be UGI-aware, like the FS object cache. > Side note: the comment in createConnection that purports to handle the > different UGI doesn't seem to cover it says it covers. In our case, we have > two unrelated UGIs with no auth (createRemoteUser) with bunch of tokens, > including a KMS token, added. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-10757) KMSClientProvider combined with KeyProviderCache results in wrong UGI being used
[ https://issues.apache.org/jira/browse/HDFS-10757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergey Shelukhin updated HDFS-10757: Description: ClientContext::get gets the context from CACHE via a config setting based name, then KeyProviderCache stored in ClientContext gets the key provider cached by URI from the configuration, too. These would return the same KeyProvider regardless of current UGI. KMSClientProvider caches the UGI (actualUgi) in ctor; that means in particular that all the users of DFS with KMSClientProvider in a process will get the KMS token (along with other credentials) of the first user, via the above cache. Either KMSClientProvider shouldn't store the UGI, or one of the caches should be UGI-aware, like the FS object cache. was: ClientContext::get gets the context from cache via a config setting based name, then KeyProviderCache stored in ClientContext gets the key provider cached by URI stored in configuration, too. KMSClientProvider caches the UGI (actualUgi) in ctor; that means in particular that all the users of DFS with KMSClientProvider in a process will get the KMS token (along with other credentials) of the first user... Either KMSClientProvider shouldn't store the UGI, or one of the caches should be UGI-aware, like the FS object cache. > KMSClientProvider combined with KeyProviderCache results in wrong UGI being > used > > > Key: HDFS-10757 > URL: https://issues.apache.org/jira/browse/HDFS-10757 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Sergey Shelukhin >Priority: Critical > > ClientContext::get gets the context from CACHE via a config setting based > name, then KeyProviderCache stored in ClientContext gets the key provider > cached by URI from the configuration, too. These would return the same > KeyProvider regardless of current UGI. 
> KMSClientProvider caches the UGI (actualUgi) in ctor; that means in > particular that all the users of DFS with KMSClientProvider in a process will > get the KMS token (along with other credentials) of the first user, via the > above cache. > Either KMSClientProvider shouldn't store the UGI, or one of the caches should > be UGI-aware, like the FS object cache. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-10757) KMSClientProvider combined with KeyProviderCache results in wrong UGI being used
Sergey Shelukhin created HDFS-10757: --- Summary: KMSClientProvider combined with KeyProviderCache results in wrong UGI being used Key: HDFS-10757 URL: https://issues.apache.org/jira/browse/HDFS-10757 Project: Hadoop HDFS Issue Type: Bug Reporter: Sergey Shelukhin Priority: Critical ClientContext::get gets the context from cache via a config setting based name, then KeyProviderCache stored in ClientContext gets the key provider cached by URI stored in configuration, too. KMSClientProvider caches the UGI (actualUgi) in ctor; that means in particular that all the users of DFS with KMSClientProvider in a process will get the KMS token (along with other credentials) of the first user... Either KMSClientProvider shouldn't store the UGI, or one of the caches should be UGI-aware, like the FS object cache. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-10414) allow disabling trash on per-directory basis
[ https://issues.apache.org/jira/browse/HDFS-10414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15288060#comment-15288060 ] Sergey Shelukhin commented on HDFS-10414: - 1) Is it true for all APIs? 2) This might be hard to modify for existing tools or scripts that are outside of one's control. > allow disabling trash on per-directory basis > > > Key: HDFS-10414 > URL: https://issues.apache.org/jira/browse/HDFS-10414 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Sergey Shelukhin > > For ETL, it might be useful to disable trash for certain directories only to > avoid the overhead, while keeping it enabled for rest of the cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-10414) allow disabling trash on per-directory basis
[ https://issues.apache.org/jira/browse/HDFS-10414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15287091#comment-15287091 ] Sergey Shelukhin commented on HDFS-10414: - Suppose there's ETL process consisting of a sequence of Hive queries/other tools that write intermediate data to an HDFS directory hierarchy; it would be nice to disable trash for the root of that hierarchy, so that the intermediate data is not preserved in the trash if it's deleted or moved to a different FS, for example. However, we don't want to disable the trash for the entire cluster, cause there is also production data there for which it should be enabled. > allow disabling trash on per-directory basis > > > Key: HDFS-10414 > URL: https://issues.apache.org/jira/browse/HDFS-10414 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Sergey Shelukhin > > For ETL, it might be useful to disable trash for certain directories only to > avoid the overhead, while keeping it enabled for rest of the cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-10414) allow disabling trash on per-directory basis
Sergey Shelukhin created HDFS-10414: --- Summary: allow disabling trash on per-directory basis Key: HDFS-10414 URL: https://issues.apache.org/jira/browse/HDFS-10414 Project: Hadoop HDFS Issue Type: Improvement Reporter: Sergey Shelukhin For ETL, it might be useful to disable trash for certain directories only to avoid the overhead, while keeping it enabled for rest of the cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Resolved] (HDFS-9567) LlapServiceDriver can fail if only the packaged logger config is present
[ https://issues.apache.org/jira/browse/HDFS-9567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergey Shelukhin resolved HDFS-9567. Resolution: Invalid Wrong project > LlapServiceDriver can fail if only the packaged logger config is present > > > Key: HDFS-9567 > URL: https://issues.apache.org/jira/browse/HDFS-9567 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Sergey Shelukhin > > I was incrementally updating my setup on some VM and didn't have the logger > config file, so the packaged one was picked up apparently, which caused this: > {noformat} > java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative > path in absolute URI: > jar:file:/home/vagrant/llap/apache-hive-2.0.0-SNAPSHOT-bin/lib/hive-llap-server-2.0.0-SNAPSHOT.jar!/llap-daemon-log4j2.properties > at org.apache.hadoop.fs.Path.initialize(Path.java:205) > at org.apache.hadoop.fs.Path.<init>(Path.java:171) > at > org.apache.hadoop.hive.llap.cli.LlapServiceDriver.run(LlapServiceDriver.java:234) > at > org.apache.hadoop.hive.llap.cli.LlapServiceDriver.main(LlapServiceDriver.java:58) > Caused by: java.net.URISyntaxException: Relative path in absolute URI: > jar:file:/home/vagrant/llap/apache-hive-2.0.0-SNAPSHOT-bin/lib/hive-llap-server-2.0.0-SNAPSHOT.jar!/llap-daemon-log4j2.properties > at java.net.URI.checkPath(URI.java:1823) > at java.net.URI.<init>(URI.java:745) > at org.apache.hadoop.fs.Path.initialize(Path.java:202) > ... 3 more > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HDFS-9567) LlapServiceDriver can fail if only the packaged logger config is present
Sergey Shelukhin created HDFS-9567: -- Summary: LlapServiceDriver can fail if only the packaged logger config is present Key: HDFS-9567 URL: https://issues.apache.org/jira/browse/HDFS-9567 Project: Hadoop HDFS Issue Type: Bug Reporter: Sergey Shelukhin I was incrementally updating my setup on some VM and didn't have the logger config file, so the packaged one was picked up apparently, which caused this: {noformat} java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: jar:file:/home/vagrant/llap/apache-hive-2.0.0-SNAPSHOT-bin/lib/hive-llap-server-2.0.0-SNAPSHOT.jar!/llap-daemon-log4j2.properties at org.apache.hadoop.fs.Path.initialize(Path.java:205) at org.apache.hadoop.fs.Path.<init>(Path.java:171) at org.apache.hadoop.hive.llap.cli.LlapServiceDriver.run(LlapServiceDriver.java:234) at org.apache.hadoop.hive.llap.cli.LlapServiceDriver.main(LlapServiceDriver.java:58) Caused by: java.net.URISyntaxException: Relative path in absolute URI: jar:file:/home/vagrant/llap/apache-hive-2.0.0-SNAPSHOT-bin/lib/hive-llap-server-2.0.0-SNAPSHOT.jar!/llap-daemon-log4j2.properties at java.net.URI.checkPath(URI.java:1823) at java.net.URI.<init>(URI.java:745) at org.apache.hadoop.fs.Path.initialize(Path.java:202) ... 3 more {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
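The "Relative path in absolute URI" message in the trace above comes from the multi-argument java.net.URI constructor, which rejects a non-empty path that does not start with "/" whenever a scheme is present. A small JDK-only sketch of that check follows. It assumes, without reproducing the Hadoop source, that the jar: string ended up split into scheme "jar" and path "file:/..." before reaching that constructor; the file names are shortened placeholders and tryBuild is a hypothetical helper.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class JarUriSketch {
    /** Builds a URI from a scheme and a path, reporting the rejection reason instead of throwing. */
    static String tryBuild(String scheme, String path) {
        try {
            return new URI(scheme, null, path, null, null).toString();
        } catch (URISyntaxException e) {
            return "rejected: " + e.getReason();
        }
    }

    public static void main(String[] args) {
        // Assumed split of a jar: URL: the remainder "file:/..." does not begin with '/'.
        System.out.println(tryBuild("jar", "file:/tmp/example.jar!/llap-daemon-log4j2.properties"));
        // A plain file path with a leading '/' is accepted.
        System.out.println(tryBuild("file", "/tmp/llap-daemon-log4j2.properties"));
    }
}
```

This suggests the failure is specific to the resource being resolved from inside a jar; a logger config that exists as a regular file would not hit this check.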
[jira] [Updated] (HDFS-7878) API - expose an unique file identifier
[ https://issues.apache.org/jira/browse/HDFS-7878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergey Shelukhin updated HDFS-7878: --- Attachment: (was: HDFS-7878.05.patch) API - expose an unique file identifier -- Key: HDFS-7878 URL: https://issues.apache.org/jira/browse/HDFS-7878 Project: Hadoop HDFS Issue Type: Improvement Reporter: Sergey Shelukhin Assignee: Sergey Shelukhin Attachments: HDFS-7878.01.patch, HDFS-7878.02.patch, HDFS-7878.03.patch, HDFS-7878.04.patch, HDFS-7878.05.patch, HDFS-7878.patch See HDFS-487. Even though that is resolved as duplicate, the ID is actually not exposed by the JIRA it supposedly duplicates. INode ID for the file should be easy to expose; alternatively ID could be derived from block IDs, to account for appends... This is useful e.g. for cache key by file, to make sure cache stays correct when file is overwritten. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7878) API - expose an unique file identifier
[ https://issues.apache.org/jira/browse/HDFS-7878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergey Shelukhin updated HDFS-7878: --- Attachment: HDFS-7878.05.patch API - expose an unique file identifier -- Key: HDFS-7878 URL: https://issues.apache.org/jira/browse/HDFS-7878 Project: Hadoop HDFS Issue Type: Improvement Reporter: Sergey Shelukhin Assignee: Sergey Shelukhin Attachments: HDFS-7878.01.patch, HDFS-7878.02.patch, HDFS-7878.03.patch, HDFS-7878.04.patch, HDFS-7878.05.patch, HDFS-7878.patch See HDFS-487. Even though that is resolved as duplicate, the ID is actually not exposed by the JIRA it supposedly duplicates. INode ID for the file should be easy to expose; alternatively ID could be derived from block IDs, to account for appends... This is useful e.g. for cache key by file, to make sure cache stays correct when file is overwritten. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7878) API - expose an unique file identifier
[ https://issues.apache.org/jira/browse/HDFS-7878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergey Shelukhin updated HDFS-7878: --- Attachment: HDFS-7878.05.patch changed to nullable field and added to a couple more places API - expose an unique file identifier -- Key: HDFS-7878 URL: https://issues.apache.org/jira/browse/HDFS-7878 Project: Hadoop HDFS Issue Type: Improvement Reporter: Sergey Shelukhin Assignee: Sergey Shelukhin Attachments: HDFS-7878.01.patch, HDFS-7878.02.patch, HDFS-7878.03.patch, HDFS-7878.04.patch, HDFS-7878.05.patch, HDFS-7878.patch See HDFS-487. Even though that is resolved as duplicate, the ID is actually not exposed by the JIRA it supposedly duplicates. INode ID for the file should be easy to expose; alternatively ID could be derived from block IDs, to account for appends... This is useful e.g. for cache key by file, to make sure cache stays correct when file is overwritten. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7878) API - expose an unique file identifier
[ https://issues.apache.org/jira/browse/HDFS-7878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376519#comment-14376519 ] Sergey Shelukhin commented on HDFS-7878: [~cmccabe] are you ok with the latest patch? API - expose an unique file identifier -- Key: HDFS-7878 URL: https://issues.apache.org/jira/browse/HDFS-7878 Project: Hadoop HDFS Issue Type: Improvement Reporter: Sergey Shelukhin Assignee: Sergey Shelukhin Attachments: HDFS-7878.01.patch, HDFS-7878.02.patch, HDFS-7878.03.patch, HDFS-7878.04.patch, HDFS-7878.patch See HDFS-487. Even though that is resolved as duplicate, the ID is actually not exposed by the JIRA it supposedly duplicates. INode ID for the file should be easy to expose; alternatively ID could be derived from block IDs, to account for appends... This is useful e.g. for cache key by file, to make sure cache stays correct when file is overwritten. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7878) API - expose an unique file identifier
[ https://issues.apache.org/jira/browse/HDFS-7878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergey Shelukhin updated HDFS-7878: --- Attachment: HDFS-7878.04.patch Removed open, removed field from serialization. I think inheritance is a much cleaner way to express an FS-specific field compared to a null field in the shared FileStatus... API - expose an unique file identifier -- Key: HDFS-7878 URL: https://issues.apache.org/jira/browse/HDFS-7878 Project: Hadoop HDFS Issue Type: Improvement Reporter: Sergey Shelukhin Assignee: Sergey Shelukhin Attachments: HDFS-7878.01.patch, HDFS-7878.02.patch, HDFS-7878.03.patch, HDFS-7878.04.patch, HDFS-7878.patch See HDFS-487. Even though that is resolved as duplicate, the ID is actually not exposed by the JIRA it supposedly duplicates. INode ID for the file should be easy to expose; alternatively ID could be derived from block IDs, to account for appends... This is useful e.g. for cache key by file, to make sure cache stays correct when file is overwritten. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7878) API - expose an unique file identifier
[ https://issues.apache.org/jira/browse/HDFS-7878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14369660#comment-14369660 ] Sergey Shelukhin commented on HDFS-7878: 1) Sure 2) That seems error prone; the code operating on FileStatus will have to know whether fileId is supposed to be there (if it's not, it's ok to be null, if it's supposed to be there and is null, it's a bug); serialization will lose data. Writables don't work well with optional fields... API - expose an unique file identifier -- Key: HDFS-7878 URL: https://issues.apache.org/jira/browse/HDFS-7878 Project: Hadoop HDFS Issue Type: Improvement Reporter: Sergey Shelukhin Assignee: Sergey Shelukhin Attachments: HDFS-7878.01.patch, HDFS-7878.02.patch, HDFS-7878.03.patch, HDFS-7878.patch See HDFS-487. Even though that is resolved as duplicate, the ID is actually not exposed by the JIRA it supposedly duplicates. INode ID for the file should be easy to expose; alternatively ID could be derived from block IDs, to account for appends... This is useful e.g. for cache key by file, to make sure cache stays correct when file is overwritten. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7878) API - expose an unique file identifier
[ https://issues.apache.org/jira/browse/HDFS-7878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergey Shelukhin updated HDFS-7878: --- Attachment: HDFS-7878.03.patch Attaching the patch with the class changes (actually it's the same as very first patch with some cleanup, javadoc etc.). Additionally, I added the call on DFS to open file by ID. API - expose an unique file identifier -- Key: HDFS-7878 URL: https://issues.apache.org/jira/browse/HDFS-7878 Project: Hadoop HDFS Issue Type: Improvement Reporter: Sergey Shelukhin Assignee: Sergey Shelukhin Attachments: HDFS-7878.01.patch, HDFS-7878.02.patch, HDFS-7878.03.patch, HDFS-7878.patch See HDFS-487. Even though that is resolved as duplicate, the ID is actually not exposed by the JIRA it supposedly duplicates. INode ID for the file should be easy to expose; alternatively ID could be derived from block IDs, to account for appends... This is useful e.g. for cache key by file, to make sure cache stays correct when file is overwritten. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7878) API - expose an unique file identifier
[ https://issues.apache.org/jira/browse/HDFS-7878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14364104#comment-14364104 ] Sergey Shelukhin commented on HDFS-7878: Note: confusingly, HdfsFileStatus is not FileStatus, that's an internal class [~cmccabe] can you clarify what you mean by two RPCs? The API makes one RPC. Two RPCs will only happen if the user calls both get ID and get status, which is not necessary. API - expose an unique file identifier -- Key: HDFS-7878 URL: https://issues.apache.org/jira/browse/HDFS-7878 Project: Hadoop HDFS Issue Type: Improvement Reporter: Sergey Shelukhin Assignee: Sergey Shelukhin Attachments: HDFS-7878.01.patch, HDFS-7878.02.patch, HDFS-7878.patch See HDFS-487. Even though that is resolved as duplicate, the ID is actually not exposed by the JIRA it supposedly duplicates. INode ID for the file should be easy to expose; alternatively ID could be derived from block IDs, to account for appends... This is useful e.g. for cache key by file, to make sure cache stays correct when file is overwritten. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7878) API - expose an unique file identifier
[ https://issues.apache.org/jira/browse/HDFS-7878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14363709#comment-14363709 ] Sergey Shelukhin commented on HDFS-7878: That is approximately what was done in the first version of the patch (see attached) and then replaced in favor of new API... can you guys ([~cmccabe] and [~jingzhao]) decide which way is better :) I can add the open-by-id method then... API - expose an unique file identifier -- Key: HDFS-7878 URL: https://issues.apache.org/jira/browse/HDFS-7878 Project: Hadoop HDFS Issue Type: Improvement Reporter: Sergey Shelukhin Assignee: Sergey Shelukhin Attachments: HDFS-7878.01.patch, HDFS-7878.02.patch, HDFS-7878.patch See HDFS-487. Even though that is resolved as duplicate, the ID is actually not exposed by the JIRA it supposedly duplicates. INode ID for the file should be easy to expose; alternatively ID could be derived from block IDs, to account for appends... This is useful e.g. for cache key by file, to make sure cache stays correct when file is overwritten. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7878) API - expose an unique file identifier
[ https://issues.apache.org/jira/browse/HDFS-7878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14357646#comment-14357646 ] Sergey Shelukhin commented on HDFS-7878: [~cmccabe] ping? API - expose an unique file identifier -- Key: HDFS-7878 URL: https://issues.apache.org/jira/browse/HDFS-7878 Project: Hadoop HDFS Issue Type: Improvement Reporter: Sergey Shelukhin Assignee: Sergey Shelukhin Attachments: HDFS-7878.01.patch, HDFS-7878.02.patch, HDFS-7878.patch See HDFS-487. Even though that is resolved as duplicate, the ID is actually not exposed by the JIRA it supposedly duplicates. INode ID for the file should be easy to expose; alternatively ID could be derived from block IDs, to account for appends... This is useful e.g. for cache key by file, to make sure cache stays correct when file is overwritten. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7878) API - expose an unique file identifier
[ https://issues.apache.org/jira/browse/HDFS-7878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353579#comment-14353579 ] Sergey Shelukhin commented on HDFS-7878: ping? API - expose an unique file identifier -- Key: HDFS-7878 URL: https://issues.apache.org/jira/browse/HDFS-7878 Project: Hadoop HDFS Issue Type: Improvement Reporter: Sergey Shelukhin Assignee: Sergey Shelukhin Attachments: HDFS-7878.01.patch, HDFS-7878.patch See HDFS-487. Even though that is resolved as duplicate, the ID is actually not exposed by the JIRA it supposedly duplicates. INode ID for the file should be easy to expose; alternatively ID could be derived from block IDs, to account for appends... This is useful e.g. for cache key by file, to make sure cache stays correct when file is overwritten. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7878) API - expose an unique file identifier
[ https://issues.apache.org/jira/browse/HDFS-7878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergey Shelukhin updated HDFS-7878: --- Attachment: HDFS-7878.02.patch added javadoc API - expose an unique file identifier -- Key: HDFS-7878 URL: https://issues.apache.org/jira/browse/HDFS-7878 Project: Hadoop HDFS Issue Type: Improvement Reporter: Sergey Shelukhin Assignee: Sergey Shelukhin Attachments: HDFS-7878.01.patch, HDFS-7878.02.patch, HDFS-7878.patch See HDFS-487. Even though that is resolved as duplicate, the ID is actually not exposed by the JIRA it supposedly duplicates. INode ID for the file should be easy to expose; alternatively ID could be derived from block IDs, to account for appends... This is useful e.g. for cache key by file, to make sure cache stays correct when file is overwritten. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7878) API - expose an unique file identifier
[ https://issues.apache.org/jira/browse/HDFS-7878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353801#comment-14353801 ] Sergey Shelukhin commented on HDFS-7878: in particular I wonder why getFileStatus API is privileged and has to be consistent with getFileId. If you call getFileStatus and open currently, you can have the same problem - status from one file, open from different file. ID allows to overcome this by getting ID *first*, then using ID-based path. Of course if ID is obtained separately there's no guarantee but there's no way to overcome this. I don't care either way about subclass or method approach. API - expose an unique file identifier -- Key: HDFS-7878 URL: https://issues.apache.org/jira/browse/HDFS-7878 Project: Hadoop HDFS Issue Type: Improvement Reporter: Sergey Shelukhin Assignee: Sergey Shelukhin Attachments: HDFS-7878.01.patch, HDFS-7878.02.patch, HDFS-7878.patch See HDFS-487. Even though that is resolved as duplicate, the ID is actually not exposed by the JIRA it supposedly duplicates. INode ID for the file should be easy to expose; alternatively ID could be derived from block IDs, to account for appends... This is useful e.g. for cache key by file, to make sure cache stays correct when file is overwritten. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
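The "get the ID first, then work by ID" pattern from the comment above, combined with the cache-key use case from the issue description, can be sketched with a toy in-memory namespace. Everything here (ToyNamespace, IdKeyedCache, resolveId, readById) is hypothetical and JDK-only; it is not the HDFS API and not the attached patch. The one assumption carried over from the discussion is that overwriting a path yields a file with a new ID.

```java
import java.util.HashMap;
import java.util.Map;

final class FileIdSketch {
    /** Toy namespace mapping path -> file ID and file ID -> content. */
    static final class ToyNamespace {
        private final Map<String, Long> idByPath = new HashMap<>();
        private final Map<Long, String> contentById = new HashMap<>();
        private long nextId = 1;

        /** Creating or overwriting a path always allocates a fresh ID. */
        long write(String path, String content) {
            Long old = idByPath.get(path);
            if (old != null) {
                contentById.remove(old); // the overwritten file is gone
            }
            long id = nextId++;
            idByPath.put(path, id);
            contentById.put(id, content);
            return id;
        }

        long resolveId(String path) {
            Long id = idByPath.get(path);
            if (id == null) {
                throw new IllegalArgumentException("No such path: " + path);
            }
            return id;
        }

        /** Returns the content for the ID, or null if that file no longer exists. */
        String readById(long id) {
            return contentById.get(id);
        }
    }

    /** Cache keyed by file ID, so an overwritten file can never be served from a stale entry. */
    static final class IdKeyedCache {
        private final Map<Long, String> cache = new HashMap<>();

        String read(ToyNamespace ns, String path) {
            long id = ns.resolveId(path);                       // resolve the ID first
            return cache.computeIfAbsent(id, k -> ns.readById(k)); // then read by ID
        }
    }

    public static void main(String[] args) {
        ToyNamespace ns = new ToyNamespace();
        IdKeyedCache cache = new IdKeyedCache();
        ns.write("/data/part-0", "v1");
        System.out.println(cache.read(ns, "/data/part-0"));
        ns.write("/data/part-0", "v2"); // overwrite: same path, new ID
        System.out.println(cache.read(ns, "/data/part-0"));
    }
}
```

A cache keyed by path alone would keep returning the first content after the overwrite; keying by ID turns the overwrite into a cache miss.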
[jira] [Commented] (HDFS-7878) API - expose an unique file identifier
[ https://issues.apache.org/jira/browse/HDFS-7878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353766#comment-14353766 ] Sergey Shelukhin commented on HDFS-7878: For file opening case, cannot the file be opened by getting ID first, then using the ID-based path as indicated above? API - expose an unique file identifier -- Key: HDFS-7878 URL: https://issues.apache.org/jira/browse/HDFS-7878 Project: Hadoop HDFS Issue Type: Improvement Reporter: Sergey Shelukhin Assignee: Sergey Shelukhin Attachments: HDFS-7878.01.patch, HDFS-7878.02.patch, HDFS-7878.patch See HDFS-487. Even though that is resolved as duplicate, the ID is actually not exposed by the JIRA it supposedly duplicates. INode ID for the file should be easy to expose; alternatively ID could be derived from block IDs, to account for appends... This is useful e.g. for cache key by file, to make sure cache stays correct when file is overwritten. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7895) open and getFileInfo APIs treat paths inconsistently wrt protocol
[ https://issues.apache.org/jira/browse/HDFS-7895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergey Shelukhin updated HDFS-7895: --- Summary: open and getFileInfo APIs treat paths inconsistently wrt protocol (was: open and getFileInfo APIs treat paths inconsistently) open and getFileInfo APIs treat paths inconsistently wrt protocol - Key: HDFS-7895 URL: https://issues.apache.org/jira/browse/HDFS-7895 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 2.6.0 Reporter: Sergey Shelukhin Assignee: Jing Zhao Priority: Minor When open() is called with regular HDFS path, hdfs://blah/blah/blah, it appears to work. However, getFileInfo doesn't {noformat} Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.fs.InvalidPathException): Invalid path name Invalid file name: hdfs://localhost:9000/apps/hive/warehouse/tpch_2.db/lineitem_orc/01_0 at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:4128) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileInfo(NameNodeRpcServer.java:838) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:821) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033) at org.apache.hadoop.ipc.Client.call(Client.java:1468) at 
org.apache.hadoop.ipc.Client.call(Client.java:1399) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232) at com.sun.proxy.$Proxy16.getFileInfo(Unknown Source) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:752) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) at com.sun.proxy.$Proxy17.getFileInfo(Unknown Source) at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1988) {noformat} 1) this seems inconsistent. 2) not clear why the validation should reject what looks like a good HDFS path. At least, client code should clean this stuff up on the way. [~prasanth_j] has the details, I just filed a bug so I could mention how buggy HDFS is to [~jingzhao] :) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
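One way client code could "clean this stuff up on the way", as suggested above, is to drop the scheme and authority before the path is sent to the NameNode. The sketch below does this with java.net.URI only; toNameNodePath is a hypothetical helper, not an existing HDFS method, and it assumes the input is a syntactically valid URI (URI.create throws IllegalArgumentException on characters such as unescaped spaces).

```java
import java.net.URI;

final class PathNormalizeSketch {
    /** Returns only the path component, dropping any scheme and authority. */
    static String toNameNodePath(String pathOrUri) {
        URI uri = URI.create(pathOrUri);
        String path = uri.getPath();
        if (path == null || !path.startsWith("/")) {
            throw new IllegalArgumentException("Not an absolute path: " + pathOrUri);
        }
        return path;
    }

    public static void main(String[] args) {
        // Fully qualified form, as in the exception message above.
        System.out.println(toNameNodePath(
            "hdfs://localhost:9000/apps/hive/warehouse/tpch_2.db/lineitem_orc/01_0"));
        // Already a bare path: returned unchanged.
        System.out.println(toNameNodePath("/apps/hive/warehouse/tpch_2.db/lineitem_orc/01_0"));
    }
}
```

In Hadoop client code the analogous call would be Path.toUri().getPath(); whether open() and getFileInfo() should both apply it is the question the issue raises.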
[jira] [Created] (HDFS-7895) open and getFileInfo APIs treat paths inconsistently
Sergey Shelukhin created HDFS-7895: -- Summary: open and getFileInfo APIs treat paths inconsistently Key: HDFS-7895 URL: https://issues.apache.org/jira/browse/HDFS-7895 Project: Hadoop HDFS Issue Type: Bug Reporter: Sergey Shelukhin Assignee: Jing Zhao Priority: Minor When open() is called with regular HDFS path, hdfs://blah/blah/blah, it appears to work. However, getFileInfo doesn't {noformat} Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.fs.InvalidPathException): Invalid path name Invalid file name: hdfs://localhost:9000/apps/hive/warehouse/tpch_2.db/lineitem_orc/01_0 at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:4128) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileInfo(NameNodeRpcServer.java:838) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:821) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033) at org.apache.hadoop.ipc.Client.call(Client.java:1468) at org.apache.hadoop.ipc.Client.call(Client.java:1399) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232) at com.sun.proxy.$Proxy16.getFileInfo(Unknown Source) at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:752) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) at com.sun.proxy.$Proxy17.getFileInfo(Unknown Source) at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1988) {noformat} 1) this seems inconsistent. 2) not clear why the validation should reject what looks like a good HDFS path. At least, client code should clean this stuff up on the way. [~prasanth_j] has the details, I just filed a bug so I could mention how buggy HDFS is to [~jingzhao] :) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
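A minimal sketch of the client-side cleanup suggested in point 2, assuming the rejected name differs from an acceptable one only by its scheme and authority. `PathNormalizer` and `toPathOnly` are hypothetical names, not Hadoop APIs; the sketch uses only `java.net.URI` from the JDK.

```java
import java.net.URI;

// Hypothetical helper, not part of Hadoop: keeps only the path component
// of a fully qualified URI such as hdfs://host:port/some/path.
final class PathNormalizer {
    static String toPathOnly(String pathOrUri) {
        // URI.create throws IllegalArgumentException on characters that are
        // illegal in a URI (for example spaces); callers must handle that.
        URI uri = URI.create(pathOrUri);
        String path = uri.getPath();
        // Opaque or path-less URIs have no usable path; return the input as-is.
        return (path == null || path.isEmpty()) ? pathOrUri : path;
    }
}
```

Applied to the name in the stack trace above, this yields `/apps/hive/warehouse/tpch_2.db/lineitem_orc/01_0`. Whether such stripping belongs in the client or in NameNode validation is the open question of this issue.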
[jira] [Updated] (HDFS-7895) open and getFileInfo APIs treat paths inconsistently
[ https://issues.apache.org/jira/browse/HDFS-7895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergey Shelukhin updated HDFS-7895: --- Affects Version/s: 2.6.0 open and getFileInfo APIs treat paths inconsistently Key: HDFS-7895 URL: https://issues.apache.org/jira/browse/HDFS-7895 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 2.6.0 Reporter: Sergey Shelukhin Assignee: Jing Zhao Priority: Minor When open() is called with regular HDFS path, hdfs://blah/blah/blah, it appears to work. However, getFileInfo doesn't {noformat} Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.fs.InvalidPathException): Invalid path name Invalid file name: hdfs://localhost:9000/apps/hive/warehouse/tpch_2.db/lineitem_orc/01_0 at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:4128) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileInfo(NameNodeRpcServer.java:838) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:821) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033) at org.apache.hadoop.ipc.Client.call(Client.java:1468) at org.apache.hadoop.ipc.Client.call(Client.java:1399) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232) at 
com.sun.proxy.$Proxy16.getFileInfo(Unknown Source) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:752) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) at com.sun.proxy.$Proxy17.getFileInfo(Unknown Source) at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1988) {noformat} 1) this seems inconsistent. 2) not clear why the validation should reject what looks like a good HDFS path. At least, client code should clean this stuff up on the way. [~prasanth_j] has the details, I just filed a bug so I could mention how buggy HDFS is to [~jingzhao] :) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7878) API - expose an unique file identifier
[ https://issues.apache.org/jira/browse/HDFS-7878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14349859#comment-14349859 ] Sergey Shelukhin commented on HDFS-7878: [~jingzhao] actually getFileId already normalizes the file path (responding to gtalk discussion) So this is ready for +1 ;) API - expose an unique file identifier -- Key: HDFS-7878 URL: https://issues.apache.org/jira/browse/HDFS-7878 Project: Hadoop HDFS Issue Type: Improvement Reporter: Sergey Shelukhin Assignee: Sergey Shelukhin Attachments: HDFS-7878.01.patch, HDFS-7878.patch See HDFS-487. Even though that is resolved as duplicate, the ID is actually not exposed by the JIRA it supposedly duplicates. INode ID for the file should be easy to expose; alternatively ID could be derived from block IDs, to account for appends... This is useful e.g. for cache key by file, to make sure cache stays correct when file is overwritten. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (HDFS-7878) API - expose an unique file identifier
[ https://issues.apache.org/jira/browse/HDFS-7878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergey Shelukhin reassigned HDFS-7878: -- Assignee: Sergey Shelukhin API - expose an unique file identifier -- Key: HDFS-7878 URL: https://issues.apache.org/jira/browse/HDFS-7878 Project: Hadoop HDFS Issue Type: Improvement Reporter: Sergey Shelukhin Assignee: Sergey Shelukhin Attachments: HDFS-7878.patch See HDFS-487. Even though that is resolved as duplicate, the ID is actually not exposed by the JIRA it supposedly duplicates. INode ID for the file should be easy to expose; alternatively ID could be derived from block IDs, to account for appends... This is useful e.g. for cache key by file, to make sure cache stays correct when file is overwritten. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7878) API - expose an unique file identifier
[ https://issues.apache.org/jira/browse/HDFS-7878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergey Shelukhin updated HDFS-7878: --- Attachment: HDFS-7878.patch this patch exposes fileId via normal FileStatus. I can add separate API instead, which will be a smaller change, but it will add a separate API... please advise API - expose an unique file identifier -- Key: HDFS-7878 URL: https://issues.apache.org/jira/browse/HDFS-7878 Project: Hadoop HDFS Issue Type: Improvement Reporter: Sergey Shelukhin Attachments: HDFS-7878.patch See HDFS-487. Even though that is resolved as duplicate, the ID is actually not exposed by the JIRA it supposedly duplicates. INode ID for the file should be easy to expose; alternatively ID could be derived from block IDs, to account for appends... This is useful e.g. for cache key by file, to make sure cache stays correct when file is overwritten. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7878) API - expose an unique file identifier
[ https://issues.apache.org/jira/browse/HDFS-7878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14347653#comment-14347653 ] Sergey Shelukhin commented on HDFS-7878: [~jingzhao] can you please review? API - expose an unique file identifier -- Key: HDFS-7878 URL: https://issues.apache.org/jira/browse/HDFS-7878 Project: Hadoop HDFS Issue Type: Improvement Reporter: Sergey Shelukhin Attachments: HDFS-7878.patch See HDFS-487. Even though that is resolved as duplicate, the ID is actually not exposed by the JIRA it supposedly duplicates. INode ID for the file should be easy to expose; alternatively ID could be derived from block IDs, to account for appends... This is useful e.g. for cache key by file, to make sure cache stays correct when file is overwritten. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7878) API - expose an unique file identifier
[ https://issues.apache.org/jira/browse/HDFS-7878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergey Shelukhin updated HDFS-7878: --- Status: Patch Available (was: Open) API - expose an unique file identifier -- Key: HDFS-7878 URL: https://issues.apache.org/jira/browse/HDFS-7878 Project: Hadoop HDFS Issue Type: Improvement Reporter: Sergey Shelukhin Attachments: HDFS-7878.patch See HDFS-487. Even though that is resolved as duplicate, the ID is actually not exposed by the JIRA it supposedly duplicates. INode ID for the file should be easy to expose; alternatively ID could be derived from block IDs, to account for appends... This is useful e.g. for cache key by file, to make sure cache stays correct when file is overwritten. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7878) API - expose an unique file identifier
[ https://issues.apache.org/jira/browse/HDFS-7878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergey Shelukhin updated HDFS-7878: --- Attachment: HDFS-7878.01.patch Updated to just have the API... API - expose an unique file identifier -- Key: HDFS-7878 URL: https://issues.apache.org/jira/browse/HDFS-7878 Project: Hadoop HDFS Issue Type: Improvement Reporter: Sergey Shelukhin Assignee: Sergey Shelukhin Attachments: HDFS-7878.01.patch, HDFS-7878.patch See HDFS-487. Even though that is resolved as duplicate, the ID is actually not exposed by the JIRA it supposedly duplicates. INode ID for the file should be easy to expose; alternatively ID could be derived from block IDs, to account for appends... This is useful e.g. for cache key by file, to make sure cache stays correct when file is overwritten. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7878) API - expose an unique file identifier
[ https://issues.apache.org/jira/browse/HDFS-7878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14347889#comment-14347889 ] Sergey Shelukhin commented on HDFS-7878: This QA was for old patch... API - expose an unique file identifier -- Key: HDFS-7878 URL: https://issues.apache.org/jira/browse/HDFS-7878 Project: Hadoop HDFS Issue Type: Improvement Reporter: Sergey Shelukhin Assignee: Sergey Shelukhin Attachments: HDFS-7878.01.patch, HDFS-7878.patch See HDFS-487. Even though that is resolved as duplicate, the ID is actually not exposed by the JIRA it supposedly duplicates. INode ID for the file should be easy to expose; alternatively ID could be derived from block IDs, to account for appends... This is useful e.g. for cache key by file, to make sure cache stays correct when file is overwritten. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7878) API - expose an unique file identifier
[ https://issues.apache.org/jira/browse/HDFS-7878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14345824#comment-14345824 ] Sergey Shelukhin commented on HDFS-7878: HdfsFileStatus is not inherited from FileStatus. Do you mean dfs.getClient().getFileInfo(path)? API - expose an unique file identifier -- Key: HDFS-7878 URL: https://issues.apache.org/jira/browse/HDFS-7878 Project: Hadoop HDFS Issue Type: Improvement Reporter: Sergey Shelukhin See HDFS-487. Even though that is resolved as duplicate, the ID is actually not exposed by the JIRA it supposedly duplicates. INode ID for the file should be easy to expose; alternatively ID could be derived from block IDs, to account for appends... This is useful e.g. for cache key by file, to make sure cache stays correct when file is overwritten. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HDFS-7878) API - expose an unique file identifier
Sergey Shelukhin created HDFS-7878: -- Summary: API - expose an unique file identifier Key: HDFS-7878 URL: https://issues.apache.org/jira/browse/HDFS-7878 Project: Hadoop HDFS Issue Type: Improvement Reporter: Sergey Shelukhin See HDFS-487. Even though that is resolved as duplicate, the ID is actually not exposed by the JIRA it supposedly duplicates. INode ID for the file should be easy to expose; alternatively ID could be derived from block IDs, to account for appends... This is useful e.g. for cache key by file, to make sure cache stays correct when file is overwritten. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
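To illustrate the cache use case, here is a rough sketch of a cache keyed by file ID rather than by path. It assumes, without confirmation from this thread, that an overwritten file receives a new ID; `FileIdCache` and the numeric IDs are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical cache keyed by a unique file ID (for example an inode ID)
// instead of by path.
final class FileIdCache {
    private final Map<Long, String> byFileId = new HashMap<>();

    void put(long fileId, String data) {
        byFileId.put(fileId, data);
    }

    // Returns null when nothing is cached for this ID.
    String get(long fileId) {
        return byFileId.get(fileId);
    }
}
```

Under that assumption, an entry cached for the old ID is never returned for the new file at the same path, so stale data is not served after an overwrite.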
[jira] [Commented] (HDFS-7878) API - expose an unique file identifier
[ https://issues.apache.org/jira/browse/HDFS-7878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14345930#comment-14345930 ] Sergey Shelukhin commented on HDFS-7878: Can you make it public? Or better yet, add an API to FileSystem. API - expose an unique file identifier -- Key: HDFS-7878 URL: https://issues.apache.org/jira/browse/HDFS-7878 Project: Hadoop HDFS Issue Type: Improvement Reporter: Sergey Shelukhin See HDFS-487. Even though that is resolved as duplicate, the ID is actually not exposed by the JIRA it supposedly duplicates. INode ID for the file should be easy to expose; alternatively ID could be derived from block IDs, to account for appends... This is useful e.g. for cache key by file, to make sure cache stays correct when file is overwritten. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7825) HdfsDataInputStream::read(ByteBuffer) method doesn't conform to its API
[ https://issues.apache.org/jira/browse/HDFS-7825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14334103#comment-14334103 ] Sergey Shelukhin commented on HDFS-7825: [~hagleitn] [~sseth] fyi :) HdfsDataInputStream::read(ByteBuffer) method doesn't conform to its API --- Key: HDFS-7825 URL: https://issues.apache.org/jira/browse/HDFS-7825 Project: Hadoop HDFS Issue Type: Bug Reporter: Sergey Shelukhin ByteBufferReadable::read(ByteBuffer) javadoc says: {noformat} After a successful call, buf.position() and buf.limit() should be unchanged, and therefore any data can be immediately read from buf. buf.mark() may be cleared or updated. {noformat} I have the following code: {noformat} ByteBuffer directBuf = ByteBuffer.allocateDirect(len); int pos = directBuf.position(); int count = file.read(directBuf); if (count < 0) throw new EOFException(); if (directBuf.position() != pos) { RecordReaderImpl.LOG.info("Warning - position mismatch from " + file.getClass() + ": after reading " + count + ", expected " + pos + " but got " + directBuf.position()); } {noformat} and I get: {noformat} 15/02/23 15:30:56 [pool-4-thread-1] INFO orc.RecordReaderImpl : Warning - position mismatch from class org.apache.hadoop.hdfs.client.HdfsDataInputStream: after reading 6, expected 0 but got 6 {noformat} So the position is changed, unlike the API doc indicates. Also, while I haven't verified yet, it may be that the 0-length read is not handled properly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HDFS-7825) read(ByteBuffer) method doesn't conform to its API
Sergey Shelukhin created HDFS-7825: -- Summary: read(ByteBuffer) method doesn't conform to its API Key: HDFS-7825 URL: https://issues.apache.org/jira/browse/HDFS-7825 Project: Hadoop HDFS Issue Type: Bug Reporter: Sergey Shelukhin ByteBufferReadable::read(ByteBuffer) javadoc says: {noformat} After a successful call, buf.position() and buf.limit() should be unchanged, and therefore any data can be immediately read from buf. buf.mark() may be cleared or updated. {noformat} I have the following code: {noformat} ByteBuffer directBuf = ByteBuffer.allocateDirect(len); int pos = directBuf.position(); int count = file.read(directBuf); if (count < 0) throw new EOFException(); if (directBuf.position() != pos) { RecordReaderImpl.LOG.info("Warning - position mismatch from " + file.getClass() + ": after reading " + count + ", expected " + pos + " but got " + directBuf.position()); } {noformat} and I get: {noformat} 15/02/23 15:30:56 [pool-4-thread-1] INFO orc.RecordReaderImpl : Warning - position mismatch from class org.apache.hadoop.hdfs.client.HdfsDataInputStream: after reading 6, expected 0 but got 6 {noformat} So the position is changed, unlike the API doc indicates. Also, while I haven't verified yet, it may be that the 0-length read is not handled properly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
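The same before/after position check can be reproduced without a cluster. The sketch below is only a stand-in: it reads through a plain `java.nio` channel over an in-memory array rather than through `HdfsDataInputStream`, and `PositionCheck` and `readOnce` are hypothetical names. It shows how to observe the buffer position around a single read call, nothing more.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;

// Stand-in for the HDFS stream: a JDK channel over an in-memory byte array.
final class PositionCheck {
    // Returns {positionBefore, bytesRead, positionAfter} for one read call.
    static int[] readOnce(byte[] source, int capacity) {
        ByteBuffer buf = ByteBuffer.allocateDirect(capacity);
        int before = buf.position();
        try (ReadableByteChannel ch =
                 Channels.newChannel(new ByteArrayInputStream(source))) {
            int count = ch.read(buf);
            return new int[] {before, count, buf.position()};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With a 6-byte source, the JDK channel also advances the position from 0 to 6, matching the log line above. A caller that sees this behavior has to call `flip()` before consuming the data, which is the opposite of what the quoted javadoc promises.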
[jira] [Commented] (HDFS-7825) HdfsDataInputStream::read(ByteBuffer) method doesn't conform to its API
[ https://issues.apache.org/jira/browse/HDFS-7825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14334098#comment-14334098 ] Sergey Shelukhin commented on HDFS-7825: [~jingzhao] can you take a look? HdfsDataInputStream::read(ByteBuffer) method doesn't conform to its API --- Key: HDFS-7825 URL: https://issues.apache.org/jira/browse/HDFS-7825 Project: Hadoop HDFS Issue Type: Bug Reporter: Sergey Shelukhin ByteBufferReadable::read(ByteBuffer) javadoc says: {noformat} After a successful call, buf.position() and buf.limit() should be unchanged, and therefore any data can be immediately read from buf. buf.mark() may be cleared or updated. {noformat} I have the following code: {noformat} ByteBuffer directBuf = ByteBuffer.allocateDirect(len); int pos = directBuf.position(); int count = file.read(directBuf); if (count < 0) throw new EOFException(); if (directBuf.position() != pos) { RecordReaderImpl.LOG.info("Warning - position mismatch from " + file.getClass() + ": after reading " + count + ", expected " + pos + " but got " + directBuf.position()); } {noformat} and I get: {noformat} 15/02/23 15:30:56 [pool-4-thread-1] INFO orc.RecordReaderImpl : Warning - position mismatch from class org.apache.hadoop.hdfs.client.HdfsDataInputStream: after reading 6, expected 0 but got 6 {noformat} So the position is changed, unlike the API doc indicates. Also, while I haven't verified yet, it may be that the 0-length read is not handled properly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7825) HdfsDataInputStream::read(ByteBuffer) method doesn't conform to its API
[ https://issues.apache.org/jira/browse/HDFS-7825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergey Shelukhin updated HDFS-7825: --- Summary: HdfsDataInputStream::read(ByteBuffer) method doesn't conform to its API (was: read(ByteBuffer) method doesn't conform to its API) HdfsDataInputStream::read(ByteBuffer) method doesn't conform to its API --- Key: HDFS-7825 URL: https://issues.apache.org/jira/browse/HDFS-7825 Project: Hadoop HDFS Issue Type: Bug Reporter: Sergey Shelukhin ByteBufferReadable::read(ByteBuffer) javadoc says: {noformat} After a successful call, buf.position() and buf.limit() should be unchanged, and therefore any data can be immediately read from buf. buf.mark() may be cleared or updated. {noformat} I have the following code: {noformat} ByteBuffer directBuf = ByteBuffer.allocateDirect(len); int pos = directBuf.position(); int count = file.read(directBuf); if (count < 0) throw new EOFException(); if (directBuf.position() != pos) { RecordReaderImpl.LOG.info("Warning - position mismatch from " + file.getClass() + ": after reading " + count + ", expected " + pos + " but got " + directBuf.position()); } {noformat} and I get: {noformat} 15/02/23 15:30:56 [pool-4-thread-1] INFO orc.RecordReaderImpl : Warning - position mismatch from class org.apache.hadoop.hdfs.client.HdfsDataInputStream: after reading 6, expected 0 but got 6 {noformat} So the position is changed, unlike the API doc indicates. Also, while I haven't verified yet, it may be that the 0-length read is not handled properly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-5916) provide API to bulk delete directories/files
[ https://issues.apache.org/jira/browse/HDFS-5916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13896902#comment-13896902 ] Sergey Shelukhin commented on HDFS-5916: Points 1, 2, and 3 are all up to you. For the case I have in mind it should operate like a sequence of regular deletes: (1) probably best-effort, (2) no, (3) non-atomically. But that could be controlled by parameters. (4) What do other operations do? As far as I recall, some of them can recover. Can you provide details on how to enforce multiple RPC calls in one for this case? We currently use the FileSystem/DistributedFileSystem interface. The workaround wouldn't work, both because of legacy users and because the files/dirs are already under the same path; it's just that we don't want to delete all of them. E.g. from /path/A, /path/B/, /path/C/ and /path/D we only want to delete B and D (of course with longer lists). provide API to bulk delete directories/files Key: HDFS-5916 URL: https://issues.apache.org/jira/browse/HDFS-5916 Project: Hadoop HDFS Issue Type: Improvement Reporter: Sergey Shelukhin It would be nice to have an API to delete directories and files in bulk - for example, when deleting Hive partitions or HBase regions in large numbers, the code could avoid many trips to NN. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
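The semantics described in this comment (a best-effort, non-atomic sequence of regular deletes) could look roughly like the sketch below. It is only an illustration of the failure handling: `BulkDelete`, `deleteAll`, and the `deleteOne` callback are hypothetical, and the callback stands in for whatever single-path delete the file system provides.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch of best-effort, non-atomic bulk delete semantics.
final class BulkDelete {
    // deleteOne stands in for a single-path delete and returns true on success.
    // Returns the paths that could not be deleted; earlier successes are kept.
    static List<String> deleteAll(List<String> paths, Predicate<String> deleteOne) {
        List<String> failed = new ArrayList<>();
        for (String path : paths) {
            boolean ok;
            try {
                ok = deleteOne.test(path);
            } catch (RuntimeException e) {
                // Best-effort: record the failure and continue with the rest.
                ok = false;
            }
            if (!ok) {
                failed.add(path);
            }
        }
        return failed;
    }
}
```

This loop still issues one delete per path, so it does not save any trips to the NN by itself; the point of the requested API is to get the same outcome from a single call.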
[jira] [Created] (HDFS-5916) provide API to bulk delete directories/files
Sergey Shelukhin created HDFS-5916: -- Summary: provide API to bulk delete directories/files Key: HDFS-5916 URL: https://issues.apache.org/jira/browse/HDFS-5916 Project: Hadoop HDFS Issue Type: Improvement Reporter: Sergey Shelukhin It would be nice to have an API to delete directories and files in bulk - for example, when deleting Hive partitions or HBase regions in large numbers, the code could avoid many trips to NN. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (HDFS-5916) provide API to bulk delete directories/files
[ https://issues.apache.org/jira/browse/HDFS-5916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13896195#comment-13896195 ] Sergey Shelukhin commented on HDFS-5916: I mean the programmatic API (e.g. on FileSystem class) for an arbitrary list of directories (which have common parent sub-tree in these cases, but don't have to I guess). i.e. List<Path> dirList = ...; FileSystem fs = ...; fs.deleteAll(dirList); provide API to bulk delete directories/files Key: HDFS-5916 URL: https://issues.apache.org/jira/browse/HDFS-5916 Project: Hadoop HDFS Issue Type: Improvement Reporter: Sergey Shelukhin It would be nice to have an API to delete directories and files in bulk - for example, when deleting Hive partitions or HBase regions in large numbers, the code could avoid many trips to NN. -- This message was sent by Atlassian JIRA (v6.1.5#6160)