[jira] [Commented] (HIVE-14165) Enable faster S3 Split Computation

2016-08-17 Thread Abdullah Yousufi (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15425825#comment-15425825
 ] 

Abdullah Yousufi commented on HIVE-14165:
-

Actually, on closer look, FileInputFormat's listStatus specifically throws an 
InvalidInputException in those two cases, instead of a plain IOException, so I 
can catch that.
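For illustration only (the Hadoop classes aren't in scope here, and the call site is hypothetical), the catch-the-specific-exception pattern looks like this in plain JDK terms, with java.nio's NoSuchFileException standing in for Hadoop's InvalidInputException, both narrow subclasses of IOException:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class CatchSpecificListing {
    // Attempt the listing directly and catch only the narrow "no input"
    // case, rather than pre-checking existence with a separate call.
    static List<Path> listOrEmpty(Path dir) throws IOException {
        try (Stream<Path> entries = Files.list(dir)) {
            return entries.collect(Collectors.toList());
        } catch (NoSuchFileException missing) {
            // Specific subclass of IOException; other I/O failures still propagate.
            return Collections.emptyList();
        }
    }

    public static void main(String[] args) throws IOException {
        Path missing = Paths.get("no-such-dir-for-demo-xyz");
        // Prints 0 when the directory is absent, instead of failing.
        System.out.println(listOrEmpty(missing).size());
    }
}
```

The point of the pattern is that the success path costs one filesystem round trip instead of two (exists + list), which matters on S3 where every probe is an HTTP request.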

> Enable faster S3 Split Computation
> --
>
> Key: HIVE-14165
> URL: https://issues.apache.org/jira/browse/HIVE-14165
> Project: Hive
>  Issue Type: Sub-task
>Affects Versions: 2.1.0
>Reporter: Abdullah Yousufi
>Assignee: Abdullah Yousufi
>
> Split size computation may be improved by the optimizations for listFiles() 
> in HADOOP-13208



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-14165) Enable faster S3 Split Computation

2016-08-16 Thread Abdullah Yousufi (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15423120#comment-15423120
 ] 

Abdullah Yousufi commented on HIVE-14165:
-

It calls FileSystem.java#listStatus(Path p, PathFilter filter). And that's 
correct: it verifies that there is at least one FileStatus under the current 
path, at which point it begins the logic of determining splits, primarily by 
calling InputFormat#getSplits(JobConf job, int numSplits). But 
FileInputFormat#getSplits(JobContext job) is going to call listStatus() anyway.

When I remove this listing, I get a 2x speed increase on a 500-partition S3 
table. Could FileInputFormat#getSplits(job) be modified to short-circuit and 
throw a FileNotFoundException in the cases of a non-existent path and 0 files 
found, so that Hive could catch that and continue?
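A minimal sketch of the proposed control flow, with both sides reduced to hypothetical stand-ins (the real FileInputFormat#getSplits signature and the Hive call site differ):

```java
import java.io.FileNotFoundException;
import java.util.Collections;
import java.util.List;

public class ShortCircuitSplits {
    // Hypothetical stand-in for FileInputFormat#getSplits: short-circuits
    // by throwing when the input path yields no files, instead of
    // requiring the caller to pre-check with its own listing.
    static List<String> getSplits(List<String> files) throws FileNotFoundException {
        if (files.isEmpty()) {
            throw new FileNotFoundException("no input files under path");
        }
        return files; // real split computation elided
    }

    // Hive-side caller: catch the short-circuit and continue with zero
    // splits, so the separate up-front listing becomes unnecessary.
    static List<String> splitsOrNone(List<String> files) {
        try {
            return getSplits(files);
        } catch (FileNotFoundException empty) {
            return Collections.emptyList();
        }
    }

    public static void main(String[] args) {
        System.out.println(splitsOrNone(Collections.emptyList()).size()); // 0
    }
}
```

Under this flow the empty-partition case costs one failed listing inside getSplits rather than a guard listing in Hive plus a second listing in Hadoop.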



[jira] [Commented] (HIVE-14165) Enable faster S3 Split Computation

2016-08-16 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15422796#comment-15422796
 ] 

Steve Loughran commented on HIVE-14165:
---

Which filesystem list calls does {{listStatusUnderPath()}} invoke? I'd expect 
it to throw a FileNotFoundException; catching that would avoid one check.

There's another point: it looks just to see whether there is any entry under 
the path. Is that right?



[jira] [Commented] (HIVE-14165) Enable faster S3 Split Computation

2016-08-15 Thread Abdullah Yousufi (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421901#comment-15421901
 ] 

Abdullah Yousufi commented on HIVE-14165:
-

So I did try the listFiles() optimization locally and modified Hive to call the 
function on the root directory of a partitioned table. While this does give a 
speedup for a select * query on a partitioned table, this approach is not 
really extensible to queries that do partition elimination, since in those 
cases it makes sense to just pass in the relevant partitions, as Hive currently 
does.

I'm thinking it might make sense to remove the following list call in Hive in 
the case of S3 partitioned tables, since the listing for the split computation 
is going to happen later anyway in Hadoop's FileInputFormat.java.

FetchOperator.java#getNextPath()
{code}
if (fs.exists(currPath)) {
  for (FileStatus fStat : listStatusUnderPath(fs, currPath)) {
if (fStat.getLen() > 0) {
  return true;
}
  }
}
{code}

My question is whether it sounds reasonable to remove this check. It seems that 
FileInputFormat.java#getSplits() may throw errors if the partition directory 
does not contain any files, but is there a better way to handle that?



[jira] [Commented] (HIVE-14165) Enable faster S3 Split Computation by listing files in blocks

2016-07-25 Thread Abdullah Yousufi (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15392423#comment-15392423
 ] 

Abdullah Yousufi commented on HIVE-14165:
-

Thanks for the clarification, Steve; looking forward to that O(files/1000) 
recursive list.

> Enable faster S3 Split Computation by listing files in blocks
> -
>
> Key: HIVE-14165
> URL: https://issues.apache.org/jira/browse/HIVE-14165
> Project: Hive
>  Issue Type: Sub-task
>Affects Versions: 2.1.0
>Reporter: Abdullah Yousufi
>Assignee: Abdullah Yousufi
>
> During split computation when a large number of files are required to be 
> listed from S3, instead of executing 1 API call per file, one can optimize by 
> listing 1000 files in each API call. This would reduce the amount of time 
> required for listing files.
> Qubole has this optimization in place as detailed here: 
> https://www.qubole.com/blog/product/optimizing-hadoop-for-s3-part-1/?nabe=5695374637924352:0
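As a rough pure-JDK illustration of the batching idea described above (the page size of 1000 matches the S3 LIST API limit; everything else here is a made-up stand-in, not Hive or Qubole code): paging keys 1000 at a time turns one round trip per file into one round trip per 1000 files.

```java
import java.util.ArrayList;
import java.util.List;

public class BatchedListing {
    // Split `keys` into pages of at most `pageSize`, the way a paged
    // S3 LIST request returns up to 1000 keys per round trip.
    static List<List<String>> pages(List<String> keys, int pageSize) {
        List<List<String>> out = new ArrayList<>();
        for (int i = 0; i < keys.size(); i += pageSize) {
            out.add(keys.subList(i, Math.min(i + pageSize, keys.size())));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < 2500; i++) keys.add("part-" + i);
        // 2500 files need only 3 paged list calls instead of 2500 probes.
        System.out.println(pages(keys, 1000).size()); // 3
    }
}
```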





[jira] [Commented] (HIVE-14165) Enable faster S3 Split Computation by listing files in blocks

2016-07-25 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15392338#comment-15392338
 ] 

Steve Loughran commented on HIVE-14165:
---

If you look at the cost of listing in S3, you'll see that Hadoop already grabs 
5000 objects at a time. What hurts is directory tree walking, as each subdir 
needs to be recursively probed.

s3a will soon have an O(files/1000) recursive list. If you can use 
listFiles(path, recursive=true), you will get that speedup.
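The shape of the call is a single recursive enumeration from the root rather than one list call per subdirectory. Since Hadoop's FileSystem#listFiles isn't available here, this JDK sketch mirrors the idea with Files.walk (the partition layout is invented for the demo):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class RecursiveList {
    // Analog of FileSystem#listFiles(path, true): one recursive traversal
    // that yields every file under the root, instead of probing each
    // subdirectory (year=.../month=...) with its own list call.
    static List<Path> listFilesRecursive(Path root) throws IOException {
        try (Stream<Path> walk = Files.walk(root)) {
            return walk.filter(Files::isRegularFile).collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("demo");
        Files.createDirectories(root.resolve("year=2016/month=08"));
        Files.createFile(root.resolve("year=2016/month=08/part-00000"));
        System.out.println(listFilesRecursive(root).size()); // 1
    }
}
```

On a local filesystem the two shapes cost about the same; the win is on an object store, where flat paged listing avoids a round trip per directory.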
