[jira] [Commented] (HIVE-14165) Enable faster S3 Split Computation
[ https://issues.apache.org/jira/browse/HIVE-14165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15425825#comment-15425825 ]

Abdullah Yousufi commented on HIVE-14165:
-----------------------------------------

Actually, on closer look, FileInputFormat's listStatus specifically throws an InvalidInputException in those two cases, rather than a generic IOException, so I can catch that.

> Enable faster S3 Split Computation
> ----------------------------------
>
>            Key: HIVE-14165
>            URL: https://issues.apache.org/jira/browse/HIVE-14165
>        Project: Hive
>     Issue Type: Sub-task
> Affects Versions: 2.1.0
>       Reporter: Abdullah Yousufi
>       Assignee: Abdullah Yousufi
>
> Split size computation may be improved by the optimizations for listFiles()
> in HADOOP-13208

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
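[Editor's note] The handling described in the comment above can be sketched in plain, self-contained Java. This is an illustration of the control flow only, not actual Hadoop or Hive code; the class and method names below are hypothetical stand-ins.

```java
import java.io.IOException;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for Hadoop's InvalidInputException, which (like the
// real class) extends IOException, so it can be caught specifically.
class InvalidInputException extends IOException {
    InvalidInputException(String msg) { super(msg); }
}

class SplitComputation {
    // Simulates FileInputFormat#listStatus: it throws InvalidInputException
    // for a non-existent path (modeled here as null) or a path matching 0 files.
    static List<String> listStatus(List<String> filesUnderPath)
            throws InvalidInputException {
        if (filesUnderPath == null) {
            throw new InvalidInputException("Input path does not exist");
        }
        if (filesUnderPath.isEmpty()) {
            throw new InvalidInputException("Input pattern matches 0 files");
        }
        return filesUnderPath;
    }

    // Caller side: catch the specific subclass, not every IOException,
    // and continue with zero splits instead of failing the query.
    static List<String> computeSplits(List<String> filesUnderPath) {
        try {
            return listStatus(filesUnderPath);
        } catch (InvalidInputException e) {
            return Collections.emptyList();
        }
    }
}
```

Catching the narrow subclass rather than IOException keeps genuine I/O failures (network errors, permission problems) fatal while treating only the empty/missing-path cases as "no rows here".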
[jira] [Commented] (HIVE-14165) Enable faster S3 Split Computation
[ https://issues.apache.org/jira/browse/HIVE-14165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15423120#comment-15423120 ]

Abdullah Yousufi commented on HIVE-14165:
-----------------------------------------

It calls FileSystem.java#listStatus(Path p, PathFilter filter). And that's correct: it verifies that there is at least one FileStatus under the current path, at which point it begins the logic of determining splits, primarily by calling InputFormat#getSplits(JobConf job, int numSplits). But FileInputFormat#getSplits(JobContext job) is going to call listStatus() anyway. When I remove this listing, I get a 2x speedup on a 500-partition S3 table.

Could FileInputFormat#getSplits(job) be modified to short-circuit and throw a FileNotFoundException in the cases of a non-existent path and 0 files found, so that Hive could catch that and continue?
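[Editor's note] The 2x figure follows from the pre-check and the split computation each listing every partition once. A minimal sketch (plain Java, hypothetical names, not Hive code) that counts simulated S3 LIST round trips makes the duplication visible:

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

class ListingCost {
    // Counts simulated S3 LIST round trips.
    static final AtomicInteger listCalls = new AtomicInteger();

    // Stand-in for a listStatus()-style call: one round trip per invocation.
    static List<String> listStatus(List<String> filesUnderPath) {
        listCalls.incrementAndGet();
        return filesUnderPath;
    }

    // Current flow: Hive's existence pre-check lists the path, then split
    // computation (FileInputFormat#getSplits) lists the same path again.
    static int splitsWithPrecheck(List<String> filesUnderPath) {
        if (listStatus(filesUnderPath).isEmpty()) {
            return 0; // pre-check found nothing under the path
        }
        return listStatus(filesUnderPath).size(); // getSplits() relists
    }

    // Proposed flow: list once, inside split computation only.
    static int splitsWithoutPrecheck(List<String> filesUnderPath) {
        return listStatus(filesUnderPath).size();
    }
}
```

With one round trip per partition instead of two, halving the LIST traffic is consistent with the reported 2x speedup when listing dominates query planning time.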
[jira] [Commented] (HIVE-14165) Enable faster S3 Split Computation
[ https://issues.apache.org/jira/browse/HIVE-14165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15422796#comment-15422796 ]

Steve Loughran commented on HIVE-14165:
---------------------------------------

Which filesystem list calls does {{listStatusUnderPath()}} invoke? I'd expect it to throw a FileNotFoundException; catching that would avoid one check.

There's another point: it is looking just to see whether there is any entry under the path. Is that right?
[jira] [Commented] (HIVE-14165) Enable faster S3 Split Computation
[ https://issues.apache.org/jira/browse/HIVE-14165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421901#comment-15421901 ]

Abdullah Yousufi commented on HIVE-14165:
-----------------------------------------

So I did try the listFiles() optimization locally and modified Hive to call the function on the root directory of a partitioned table. While this does give a speedup for a select * query on a partitioned table, this approach is not really extensible to queries that do partition elimination, since in those cases it makes sense to just pass in the relevant partitions, as Hive currently does.

I'm thinking it might make sense to remove the following list call in Hive in the case of S3 partitioned tables, since the listing for the split computation is going to happen later anyway in Hadoop's FileInputFormat.java.

FetchOperator.java#getNextPath()
{code}
if (fs.exists(currPath)) {
  for (FileStatus fStat : listStatusUnderPath(fs, currPath)) {
    if (fStat.getLen() > 0) {
      return true;
    }
  }
}
{code}

My question is whether it sounds reasonable to remove this check. FileInputFormat.java#getSplits() may raise errors if the partition directory does not contain any files; is there a better way to handle that?
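[Editor's note] One way the empty/missing-partition case could be signaled without the pre-check is for the split computation itself to throw and for the caller to treat that as "no rows". A minimal, self-contained sketch of that pattern (plain Java; the names are illustrative, not Hadoop's API):

```java
import java.io.FileNotFoundException;
import java.util.Collections;
import java.util.List;

class ShortCircuitSplits {
    // Sketch: getSplits signals a non-existent or empty path by throwing,
    // instead of relying on an up-front existence check by the caller.
    static List<String> getSplits(List<String> filesUnderPath)
            throws FileNotFoundException {
        if (filesUnderPath == null || filesUnderPath.isEmpty()) {
            throw new FileNotFoundException("no input files under path");
        }
        return filesUnderPath; // one split per file, purely for illustration
    }

    // Caller side: treat the exception as an empty partition and move on,
    // avoiding the extra listStatusUnderPath() round trip shown above.
    static List<String> fetchSplits(List<String> filesUnderPath) {
        try {
            return getSplits(filesUnderPath);
        } catch (FileNotFoundException e) {
            return Collections.emptyList();
        }
    }
}
```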
[jira] [Commented] (HIVE-14165) Enable faster S3 Split Computation by listing files in blocks
[ https://issues.apache.org/jira/browse/HIVE-14165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15392423#comment-15392423 ]

Abdullah Yousufi commented on HIVE-14165:
-----------------------------------------

Thanks for the clarification, Steve. Looking forward to that O(files/1000) recursive list.

> Enable faster S3 Split Computation by listing files in blocks
> -------------------------------------------------------------
>
>            Key: HIVE-14165
>            URL: https://issues.apache.org/jira/browse/HIVE-14165
>        Project: Hive
>     Issue Type: Sub-task
> Affects Versions: 2.1.0
>       Reporter: Abdullah Yousufi
>       Assignee: Abdullah Yousufi
>
> During split computation, when a large number of files must be listed from
> S3, instead of executing 1 API call per file, one can optimize by listing
> 1000 files in each API call. This reduces the time required to list files.
> Qubole has this optimization in place, as detailed here:
> https://www.qubole.com/blog/product/optimizing-hadoop-for-s3-part-1/?nabe=5695374637924352:0
[jira] [Commented] (HIVE-14165) Enable faster S3 Split Computation by listing files in blocks
[ https://issues.apache.org/jira/browse/HIVE-14165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15392338#comment-15392338 ]

Steve Loughran commented on HIVE-14165:
---------------------------------------

If you look at the cost of listing in S3, you'll see that Hadoop already grabs 5000 objects at a time. What hurts is directory tree walking, as each subdirectory needs to be recursively probed.

s3a will soon have an O(files/1000) recursive list. If you can use listFiles(path, recursive=true), you will get that speed.
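[Editor's note] The O(files/1000) claim can be made concrete with a toy cost model (plain Java; 1000 is S3's maximum keys per LIST request, and the request counts here are a simplification, not measured figures). Because S3 keys form a flat namespace, a fully recursive listing under a prefix is just paging through keys, while a directory tree walk pays at least one request per subdirectory:

```java
class PagedListing {
    // Flat-namespace recursive listing, as in listFiles(path, recursive=true)
    // backed by S3 LIST paging: ceil(files / pageSize) requests in total,
    // i.e. O(files/1000) when pageSize is S3's 1000-key maximum.
    static int flatListRequests(int fileCount, int pageSize) {
        return (fileCount + pageSize - 1) / pageSize;
    }

    // A directory tree walk, by contrast, issues at least one LIST request
    // per (sub)directory, regardless of how few files each one holds.
    static int treeWalkRequests(int directoryCount) {
        return directoryCount;
    }
}
```

Under this model, a 500-partition table with 10 files per partition costs 500 tree-walk LISTs but only 5 flat LISTs, which is why the recursive listing helps most on heavily partitioned tables.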