[jira] [Commented] (HIVE-26699) Iceberg: S3 fadvise can hurt JSON parsing significantly in DWX

2022-12-15 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-26699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17648183#comment-17648183
 ] 

Steve Loughran commented on HIVE-26699:
---

in the builder pattern we use in hadoop, .opt() options are ignored by 
filesystems which don't recognise them; it's only the .must() ones which MUST 
be understood, so it's safe to use .opt() here.

passing a FileStatus in to the openFile() call saves a HEAD request on s3a and 
abfs, but it was a bit brittle in the past until it stabilised. You can instead 
just pass in the file length with the option fs.option.openfile.length and have 
it picked up where it is understood.
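
for example, something like this - a sketch assuming a hadoop 3.3.5+ client 
where the standard fs.option.openfile.* keys exist; knownLength stands for 
whatever length the caller already holds:

{code:java}
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public final class OpenFileSketch {

  // Open a file with hints which are safe to send to any filesystem.
  public static FSDataInputStream open(FileSystem fs, Path path, long knownLength)
      throws Exception {
    return fs.openFile(path)
        // .opt() hints are ignored by stores which don't recognise them
        .opt("fs.option.openfile.read.policy", "whole-file")
        // a known length lets s3a skip the HEAD request
        .opt("fs.option.openfile.length", Long.toString(knownLength))
        .build()
        .get();   // build() returns a CompletableFuture<FSDataInputStream>
  }
}
{code}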

> Iceberg: S3 fadvise can hurt JSON parsing significantly in DWX
> --
>
> Key: HIVE-26699
> URL: https://issues.apache.org/jira/browse/HIVE-26699
> Project: Hive
>  Issue Type: Improvement
>Reporter: Rajesh Balamohan
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Hive reads JSON metadata information (TableMetadataParser::read()) multiple 
> times, e.g. during query compilation, AM split computation, stats computation, 
> and during commits.
>  
> With large JSON files (due to multiple inserts), reads take a lot longer (on 
> the order of 10x) on the S3 filesystem with "fs.s3a.experimental.input.fadvise" 
> set to "random". To be on the safe side, it would be good to set this to 
> "normal" mode in the configs when reading Iceberg tables.





[jira] [Commented] (HIVE-26699) Iceberg: S3 fadvise can hurt JSON parsing significantly in DWX

2022-12-14 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-26699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17647671#comment-17647671
 ] 

Steve Loughran commented on HIVE-26699:
---

the api itself went into hadoop earlier, in 3.3.0 (HADOOP-15229).

if you are only building on 3.3+ you have the api; just the read policy is an 
s3a-only option, set with opt("fs.s3a.experimental.input.fadvise", "sequential").

HADOOP-16202
* defined some standard options for all filesystems to recognise and optionally 
support.
* defined the idea that the read policy should be an ordered list of "policies 
to understand", so we could put in new ones later.
* added the file length as an option, rather than just a FileStatus.
* added split start/end (nothing uses it yet, but prefetchers should know not to 
prefetch past the split end).
* fixed every use in hadoop itself to say "whole-file" when reading the whole 
file and "sequential" when doing sequential reads. that addresses a bug where, 
on a hive cluster with s3a fixed to random, distcp and yarn localization are 
both underperformant.

is hive hadoop 3.3.x+ only yet?



> Iceberg: S3 fadvise can hurt JSON parsing significantly in DWX
> --
>
> Key: HIVE-26699
> URL: https://issues.apache.org/jira/browse/HIVE-26699
> Project: Hive
>  Issue Type: Improvement
>Reporter: Rajesh Balamohan
>Priority: Major
>
> Hive reads JSON metadata information (TableMetadataParser::read()) multiple 
> times, e.g. during query compilation, AM split computation, stats computation, 
> and during commits.
>  
> With large JSON files (due to multiple inserts), reads take a lot longer (on 
> the order of 10x) on the S3 filesystem with "fs.s3a.experimental.input.fadvise" 
> set to "random". To be on the safe side, it would be good to set this to 
> "normal" mode in the configs when reading Iceberg tables.





[jira] [Commented] (HIVE-26699) Iceberg: S3 fadvise can hurt JSON parsing significantly in DWX

2022-11-12 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-26699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17632676#comment-17632676
 ] 

Steve Loughran commented on HIVE-26699:
---

you should be using the openFile() api call and set the read policy option to 
whole-file (assuming that is the intent), and ideally pass in the file 
status...or at least the file length, which is enough for s3a to skip the HEAD, 
though not abfs.
see org.apache.hadoop.util.JsonSerialization for its max-performance JSON load, 
which the s3a and manifest committers both use.
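
something like this, as a sketch against a client recent enough to have 
withFileStatus() and the standard fs.option.openfile.* keys (3.3.5+); the 
caller is assumed to already hold the FileStatus:

{code:java}
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;

public final class WholeFileJsonRead {

  // Read an entire JSON metadata file in one go, reusing the FileStatus
  // so that s3a can skip its HEAD probe.
  public static byte[] readAll(FileSystem fs, FileStatus status) throws Exception {
    try (FSDataInputStream in = fs.openFile(status.getPath())
        .withFileStatus(status)
        .opt("fs.option.openfile.read.policy", "whole-file")
        .build()
        .get()) {
      byte[] json = new byte[(int) status.getLen()];
      in.readFully(0, json);   // positioned read of the whole file
      return json;             // hand these bytes to the JSON parser
    }
  }
}
{code}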

> Iceberg: S3 fadvise can hurt JSON parsing significantly in DWX
> --
>
> Key: HIVE-26699
> URL: https://issues.apache.org/jira/browse/HIVE-26699
> Project: Hive
>  Issue Type: Improvement
>Reporter: Rajesh Balamohan
>Priority: Major
>
> Hive reads JSON metadata information (TableMetadataParser::read()) multiple 
> times, e.g. during query compilation, AM split computation, stats computation, 
> and during commits.
>  
> With large JSON files (due to multiple inserts), reads take a lot longer (on 
> the order of 10x) on the S3 filesystem with "fs.s3a.experimental.input.fadvise" 
> set to "random". To be on the safe side, it would be good to set this to 
> "normal" mode in the configs when reading Iceberg tables.





[jira] [Commented] (HIVE-16983) getFileStatus on accessible s3a://[bucket-name]/folder: throws com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden (Service: Amazon S3; Status Code: 403; Error

2022-10-21 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-16983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17622163#comment-17622163
 ] 

Steve Loughran commented on HIVE-16983:
---

it's fixed in hadoop-3.0+ with the move to shaded AWS binaries; maybe close as 
WORKSFORME?

> getFileStatus on accessible s3a://[bucket-name]/folder: throws 
> com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden (Service: Amazon 
> S3; Status Code: 403; Error Code: 403 Forbidden;
> -
>
> Key: HIVE-16983
> URL: https://issues.apache.org/jira/browse/HIVE-16983
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Affects Versions: 2.1.1
> Environment: Hive 2.1.1 on Ubuntu 14.04 AMI in AWS EC2, connecting to 
> S3 using s3a:// protocol
>Reporter: Alex Baretto
>Assignee: Vlad Gudikov
>Priority: Major
> Attachments: HIVE-16983-branch-2.1.patch
>
>
> I've followed various published documentation on integrating Apache Hive 
> 2.1.1 with AWS S3 using the `s3a://` scheme, configuring `fs.s3a.access.key` 
> and 
> `fs.s3a.secret.key` for `hadoop/etc/hadoop/core-site.xml` and 
> `hive/conf/hive-site.xml`.
> I am at the point where I am able to get `hdfs dfs -ls s3a://[bucket-name]/` 
> to work properly (it returns the s3 ls of that bucket), so I know my creds, 
> bucket access, and overall Hadoop setup are valid. 
> hdfs dfs -ls s3a://[bucket-name]/
> 
> drwxrwxrwx   - hdfs hdfs  0 2017-06-27 22:43 
> s3a://[bucket-name]/files
> ...etc. 
> hdfs dfs -ls s3a://[bucket-name]/files
> 
> drwxrwxrwx   - hdfs hdfs  0 2017-06-27 22:43 
> s3a://[bucket-name]/files/my-csv.csv
> However, when I attempt to access the same s3 resources from hive, e.g. run 
> any `CREATE SCHEMA` or `CREATE EXTERNAL TABLE` statements using `LOCATION 
> 's3a://[bucket-name]/files/'`, it fails. 
> for example:
> >CREATE EXTERNAL TABLE IF NOT EXISTS mydb.my_table ( my_table_id string, 
> >my_tstamp timestamp, my_sig bigint ) ROW FORMAT DELIMITED FIELDS TERMINATED 
> >BY ',' LOCATION 's3a://[bucket-name]/files/';
> I keep getting this error:
> >FAILED: Execution Error, return code 1 from 
> >org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:Got exception: 
> >java.nio.file.AccessDeniedException s3a://[bucket-name]/files: getFileStatus 
> >on s3a://[bucket-name]/files: 
> >com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden (Service: 
> >Amazon S3; Status Code: 403; Error Code: 403 Forbidden; Request ID: 
> >C9CF3F9C50EF08D1), S3 Extended Request ID: 
> >T2xZ87REKvhkvzf+hdPTOh7CA7paRpIp6IrMWnDqNFfDWerkZuAIgBpvxilv6USD0RSxM9ymM6I=)
> This makes no sense. I have access to the bucket as one can see in the hdfs 
> test. And I've added the proper creds to hive-site.xml. 
> Anyone have any idea what's missing from this equation?





[jira] [Commented] (HIVE-26063) Upgrade Apache parent POM to version 25

2022-10-13 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-26063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17617189#comment-17617189
 ] 

Steve Loughran commented on HIVE-26063:
---

apparently this, or an explicit update to the maven-shade-plugin, is needed to 
work with recent versions of Bouncy Castle, such as the one coming with 
HADOOP-17563

> Upgrade Apache parent POM to version 25
> ---
>
> Key: HIVE-26063
> URL: https://issues.apache.org/jira/browse/HIVE-26063
> Project: Hive
>  Issue Type: Improvement
>Reporter: Sylwester Lachiewicz
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> [https://maven.apache.org/pom/] 25 has been released on 2022-02-20





[jira] [Commented] (HIVE-24484) Upgrade Hadoop to 3.3.1 And Tez to 0.10.2

2022-08-03 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-24484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17574635#comment-17574635
 ] 

Steve Loughran commented on HIVE-24484:
---

nice!

> Upgrade Hadoop to 3.3.1 And Tez to 0.10.2 
> --
>
> Key: HIVE-24484
> URL: https://issues.apache.org/jira/browse/HIVE-24484
> Project: Hive
>  Issue Type: Improvement
>Reporter: David Mollitor
>Assignee: Ayush Saxena
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0-alpha-2
>
>  Time Spent: 15.05h
>  Remaining Estimate: 0h
>






[jira] [Commented] (HIVE-25827) Parquet file footer is read multiple times, when multiple splits are created in same file

2022-06-14 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-25827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17554217#comment-17554217
 ] 

Steve Loughran commented on HIVE-25827:
---

thanks. next question: do you have one or more of
* a FileStatus struct for the file (the abfs or s3a one, not one wrapped by hive)
* the length of the file

if you have either of them, then once hive moves to a 3.3.x hadoop dependency 
you will get to save a HEAD request, at least if you can get them down to the 
file opening operation.

> Parquet file footer is read multiple times, when multiple splits are created 
> in same file
> -
>
> Key: HIVE-25827
> URL: https://issues.apache.org/jira/browse/HIVE-25827
> Project: Hive
>  Issue Type: Improvement
>Reporter: Rajesh Balamohan
>Assignee: Ádám Szita
>Priority: Major
>  Labels: performance, pull-request-available
> Attachments: image-2021-12-21-03-19-38-577.png
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> With large files, it is possible that multiple splits are created in the same 
> file. With current codebase, "ParquetRecordReaderBase" ends up reading file 
> footer for each split. 
> It can be optimized not to read footer information multiple times for the 
> same file.
>  
> [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/vector/VectorizedParquetRecordReader.java#L160]
>  
> [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/ParquetRecordReaderBase.java#L91]
>  
>  
> !image-2021-12-21-03-19-38-577.png|width=1363,height=1256!
>  





[jira] [Commented] (HIVE-25980) Reduce fs calls in HiveMetaStoreChecker.checkTable

2022-06-13 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-25980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17553577#comment-17553577
 ] 

Steve Loughran commented on HIVE-25980:
---

ok. I'd still recommend the method {{listStatusIterator}} for paginated 
retrieval of wide listings from hdfs, s3a and abfs, especially if you can do 
any useful work during that iteration
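
for example, a sketch of the pattern; processPartition() is a hypothetical 
stand-in for whatever per-entry work hive wants to do:

{code:java}
import java.io.IOException;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public final class PaginatedListing {

  // Walk a wide directory page by page rather than blocking for the full listing.
  public static void scan(FileSystem fs, Path dir) throws IOException {
    RemoteIterator<FileStatus> it = fs.listStatusIterator(dir);
    while (it.hasNext()) {
      processPartition(it.next());   // useful work overlaps with fetching the next page
    }
  }

  // stand-in for per-entry work, e.g. comparing against metastore partition metadata
  private static void processPartition(FileStatus status) {
  }
}
{code}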

> Reduce fs calls in HiveMetaStoreChecker.checkTable
> --
>
> Key: HIVE-25980
> URL: https://issues.apache.org/jira/browse/HIVE-25980
> Project: Hive
>  Issue Type: Improvement
>  Components: Standalone Metastore
>Affects Versions: 3.1.2, 4.0.0
>Reporter: Chiran Ravani
>Assignee: Chiran Ravani
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 8h 20m
>  Remaining Estimate: 0h
>
> MSCK REPAIR TABLE for a table with many partitions can perform slowly on cloud 
> storage such as S3; one of the cases where we found slowness was in 
> HiveMetaStoreChecker.checkTable.
> {code:java}
> "HiveServer2-Background-Pool: Thread-382" #382 prio=5 os_prio=0 
> tid=0x7f97fc4a4000 nid=0x5c2a runnable [0x7f97c41a8000]
>java.lang.Thread.State: RUNNABLE
>   at java.net.SocketInputStream.socketRead0(Native Method)
>   at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
>   at java.net.SocketInputStream.read(SocketInputStream.java:171)
>   at java.net.SocketInputStream.read(SocketInputStream.java:141)
>   at 
> sun.security.ssl.SSLSocketInputRecord.read(SSLSocketInputRecord.java:464)
>   at 
> sun.security.ssl.SSLSocketInputRecord.bytesInCompletePacket(SSLSocketInputRecord.java:68)
>   at 
> sun.security.ssl.SSLSocketImpl.readApplicationRecord(SSLSocketImpl.java:1341)
>   at sun.security.ssl.SSLSocketImpl.access$300(SSLSocketImpl.java:73)
>   at 
> sun.security.ssl.SSLSocketImpl$AppInputStream.read(SSLSocketImpl.java:957)
>   at 
> com.amazonaws.thirdparty.apache.http.impl.io.SessionInputBufferImpl.streamRead(SessionInputBufferImpl.java:137)
>   at 
> com.amazonaws.thirdparty.apache.http.impl.io.SessionInputBufferImpl.fillBuffer(SessionInputBufferImpl.java:153)
>   at 
> com.amazonaws.thirdparty.apache.http.impl.io.SessionInputBufferImpl.readLine(SessionInputBufferImpl.java:280)
>   at 
> com.amazonaws.thirdparty.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:138)
>   at 
> com.amazonaws.thirdparty.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:56)
>   at 
> com.amazonaws.thirdparty.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:259)
>   at 
> com.amazonaws.thirdparty.apache.http.impl.DefaultBHttpClientConnection.receiveResponseHeader(DefaultBHttpClientConnection.java:163)
>   at 
> com.amazonaws.thirdparty.apache.http.impl.conn.CPoolProxy.receiveResponseHeader(CPoolProxy.java:157)
>   at 
> com.amazonaws.thirdparty.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:273)
>   at 
> com.amazonaws.http.protocol.SdkHttpRequestExecutor.doReceiveResponse(SdkHttpRequestExecutor.java:82)
>   at 
> com.amazonaws.thirdparty.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:125)
>   at 
> com.amazonaws.thirdparty.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:272)
>   at 
> com.amazonaws.thirdparty.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186)
>   at 
> com.amazonaws.thirdparty.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
>   at 
> com.amazonaws.thirdparty.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
>   at 
> com.amazonaws.thirdparty.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
>   at 
> com.amazonaws.http.apache.client.impl.SdkHttpClient.execute(SdkHttpClient.java:72)
>   at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1331)
>   at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1145)
>   at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:802)
>   at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:770)
>   at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:744)
>   at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:704)
>   at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:686)
>   at 
> com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:550)

[jira] [Commented] (HIVE-25827) Parquet file footer is read multiple times, when multiple splits are created in same file

2022-04-08 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-25827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17519777#comment-17519777
 ] 

Steve Loughran commented on HIVE-25827:
---

is this per input stream, or are separate streams opened to read it?

if it's the same opened file, HADOOP-18028 will mitigate this on s3.

> Parquet file footer is read multiple times, when multiple splits are created 
> in same file
> -
>
> Key: HIVE-25827
> URL: https://issues.apache.org/jira/browse/HIVE-25827
> Project: Hive
>  Issue Type: Improvement
>Reporter: Rajesh Balamohan
>Priority: Major
>  Labels: performance
> Attachments: image-2021-12-21-03-19-38-577.png
>
>
> With large files, it is possible that multiple splits are created in the same 
> file. With current codebase, "ParquetRecordReaderBase" ends up reading file 
> footer for each split. 
> It can be optimized not to read footer information multiple times for the 
> same file.
>  
> [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/vector/VectorizedParquetRecordReader.java#L160]
>  
> [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/ParquetRecordReaderBase.java#L91]
>  
>  
> !image-2021-12-21-03-19-38-577.png|width=1363,height=1256!
>  





[jira] [Commented] (HIVE-25980) Reduce fs calls in HiveMetaStoreChecker.checkTable

2022-03-28 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-25980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17513422#comment-17513422
 ] 

Steve Loughran commented on HIVE-25980:
---

use listStatusIterator for incremental listing, page by page, rather than 
blocking for the complete result. if you can switch to listFiles(path, 
recursive=true) then on s3 you avoid the need to mimic a treewalk entirely; 
see the sketch below.
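
a sketch of that deep listing, assuming the caller only cares about the files 
under the table or partition root:

{code:java}
import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public final class DeepListing {

  // One recursive listing instead of a directory-by-directory treewalk;
  // against s3 this becomes paged LIST calls under the prefix.
  public static long countDataFiles(FileSystem fs, Path root) throws IOException {
    long files = 0;
    RemoteIterator<LocatedFileStatus> it = fs.listFiles(root, true);
    while (it.hasNext()) {
      LocatedFileStatus status = it.next();
      if (!status.getPath().getName().startsWith("_")) {   // skip markers like _SUCCESS
        files++;
      }
    }
    return files;
  }
}
{code}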

> Reduce fs calls in HiveMetaStoreChecker.checkTable
> --
>
> Key: HIVE-25980
> URL: https://issues.apache.org/jira/browse/HIVE-25980
> Project: Hive
>  Issue Type: Improvement
>  Components: Standalone Metastore
>Affects Versions: 3.1.2, 4.0.0
>Reporter: Chiran Ravani
>Assignee: Chiran Ravani
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 4h 50m
>  Remaining Estimate: 0h
>
> MSCK REPAIR TABLE for a table with many partitions can perform slowly on cloud 
> storage such as S3; one of the cases where we found slowness was in 
> HiveMetaStoreChecker.checkTable.
> {code:java}
> "HiveServer2-Background-Pool: Thread-382" #382 prio=5 os_prio=0 
> tid=0x7f97fc4a4000 nid=0x5c2a runnable [0x7f97c41a8000]
>java.lang.Thread.State: RUNNABLE
>   at java.net.SocketInputStream.socketRead0(Native Method)
>   at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
>   at java.net.SocketInputStream.read(SocketInputStream.java:171)
>   at java.net.SocketInputStream.read(SocketInputStream.java:141)
>   at 
> sun.security.ssl.SSLSocketInputRecord.read(SSLSocketInputRecord.java:464)
>   at 
> sun.security.ssl.SSLSocketInputRecord.bytesInCompletePacket(SSLSocketInputRecord.java:68)
>   at 
> sun.security.ssl.SSLSocketImpl.readApplicationRecord(SSLSocketImpl.java:1341)
>   at sun.security.ssl.SSLSocketImpl.access$300(SSLSocketImpl.java:73)
>   at 
> sun.security.ssl.SSLSocketImpl$AppInputStream.read(SSLSocketImpl.java:957)
>   at 
> com.amazonaws.thirdparty.apache.http.impl.io.SessionInputBufferImpl.streamRead(SessionInputBufferImpl.java:137)
>   at 
> com.amazonaws.thirdparty.apache.http.impl.io.SessionInputBufferImpl.fillBuffer(SessionInputBufferImpl.java:153)
>   at 
> com.amazonaws.thirdparty.apache.http.impl.io.SessionInputBufferImpl.readLine(SessionInputBufferImpl.java:280)
>   at 
> com.amazonaws.thirdparty.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:138)
>   at 
> com.amazonaws.thirdparty.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:56)
>   at 
> com.amazonaws.thirdparty.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:259)
>   at 
> com.amazonaws.thirdparty.apache.http.impl.DefaultBHttpClientConnection.receiveResponseHeader(DefaultBHttpClientConnection.java:163)
>   at 
> com.amazonaws.thirdparty.apache.http.impl.conn.CPoolProxy.receiveResponseHeader(CPoolProxy.java:157)
>   at 
> com.amazonaws.thirdparty.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:273)
>   at 
> com.amazonaws.http.protocol.SdkHttpRequestExecutor.doReceiveResponse(SdkHttpRequestExecutor.java:82)
>   at 
> com.amazonaws.thirdparty.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:125)
>   at 
> com.amazonaws.thirdparty.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:272)
>   at 
> com.amazonaws.thirdparty.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186)
>   at 
> com.amazonaws.thirdparty.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
>   at 
> com.amazonaws.thirdparty.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
>   at 
> com.amazonaws.thirdparty.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
>   at 
> com.amazonaws.http.apache.client.impl.SdkHttpClient.execute(SdkHttpClient.java:72)
>   at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1331)
>   at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1145)
>   at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:802)
>   at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:770)
>   at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:744)
>   at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:704)
>   at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:686)
>   at 
> 

[jira] [Updated] (HIVE-25912) Drop external table at root of s3 bucket throws NPE

2022-02-02 Thread Steve Loughran (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated HIVE-25912:
--
Summary: Drop external table at root of s3 bucket throws NPE  (was: Drop 
external table throw NPE)

> Drop external table at root of s3 bucket throws NPE
> ---
>
> Key: HIVE-25912
> URL: https://issues.apache.org/jira/browse/HIVE-25912
> Project: Hive
>  Issue Type: Bug
>  Components: Metastore
>Affects Versions: 3.1.2
> Environment: Hive version: 3.1.2
>Reporter: Fachuan Bai
>Assignee: Fachuan Bai
>Priority: Major
>  Labels: metastore, pull-request-available
> Attachments: hive bugs.png
>
>   Original Estimate: 96h
>  Time Spent: 10m
>  Remaining Estimate: 95h 50m
>
> I create the external hive table using this command:
>  
> {code:java}
> CREATE EXTERNAL TABLE `fcbai`(
> `inv_item_sk` int,
> `inv_warehouse_sk` int,
> `inv_quantity_on_hand` int)
> PARTITIONED BY (
> `inv_date_sk` int) STORED AS ORC
> LOCATION
> 'hdfs://emr-master-1:8020/';
> {code}
>  
> The table was created successfully, but when I drop the table it throws an NPE:
>  
> {code:java}
> Error: Error while processing statement: FAILED: Execution Error, return code 
> 1 from org.apache.hadoop.hive.ql.exec.DDLTask. 
> MetaException(message:java.lang.NullPointerException) 
> (state=08S01,code=1){code}
>  
> The same bug can be reproduced on other object storage file systems, such 
> as S3 or TOS:
> {code:java}
> CREATE EXTERNAL TABLE `fcbai`(
> `inv_item_sk` int,
> `inv_warehouse_sk` int,
> `inv_quantity_on_hand` int)
> PARTITIONED BY (
> `inv_date_sk` int) STORED AS ORC
> LOCATION
> 's3a://bucketname/'; // 'tos://bucketname/'{code}
>  
> Looking at the source code, I found this in
>  common/src/java/org/apache/hadoop/hive/common/FileUtils.java:
> {code:java}
> // check if sticky bit is set on the parent dir
> FileStatus parStatus = fs.getFileStatus(path.getParent());
> if (!shims.hasStickyBit(parStatus.getPermission())) {
>   // no sticky bit, so write permission on parent dir is sufficient
>   // no further checks needed
>   return;
> }{code}
>  
> Because I set the table location to the HDFS root path 
> (hdfs://emr-master-1:8020/), path.getParent() returns null, which causes the 
> NPE.
> I think there are four possible fixes for the bug:
>  # modify the create table function: if the location is the root dir, fail 
> the create table.
>  # modify the FileUtils.checkDeletePermission function: check 
> path.getParent() and, if it is null, return so the drop succeeds.
>  # modify the RangerHiveAuthorizer.checkPrivileges function of the hive 
> ranger plugin (in the ranger repo): if the location is the root dir, fail the 
> create table.
>  # modify the HDFS Path object so that, if the URI is the root dir, 
> path.getParent() does not return null.
> I recommend the first or second method; any suggestions? thx.
>  
>  





[jira] [Commented] (HIVE-24852) Add support for Snapshots during external table replication

2021-11-01 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-24852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17436757#comment-17436757
 ] 

Steve Loughran commented on HIVE-24852:
---

# Does this downgrade properly when the destination FS is not hdfs?
# has anyone discussed with the HDFS team the possibility of providing an 
interface in hadoop-common for this?



> Add support for Snapshots during external table replication
> ---
>
> Key: HIVE-24852
> URL: https://issues.apache.org/jira/browse/HIVE-24852
> Project: Hive
>  Issue Type: Improvement
>Reporter: Ayush Saxena
>Assignee: Ayush Saxena
>Priority: Critical
>  Labels: pull-request-available
> Attachments: Design Doc HDFS Snapshots for External Table 
> Replication-01.pdf, Design Doc HDFS Snapshots for External Table 
> Replication-02.pdf
>
>  Time Spent: 7h
>  Remaining Estimate: 0h
>
> Add support for use of snapshot diff for external table replication.





[jira] [Commented] (HIVE-24484) Upgrade Hadoop to 3.3.1

2021-09-09 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-24484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17412556#comment-17412556
 ] 

Steve Loughran commented on HIVE-24484:
---

HADOOP-17313 actually went in to deal with hive processes having problems 
instantiating ABFS clients across many threads... every thread would create its 
own client, only for all but one of these to be discarded; there was enough 
contention in that creation process that things would get really slow. Most 
noticeable when a service like Ranger was involved in FileSystem.initialize().

The patch treats being interrupted as a failure... what is Tez expecting?

> Upgrade Hadoop to 3.3.1
> ---
>
> Key: HIVE-24484
> URL: https://issues.apache.org/jira/browse/HIVE-24484
> Project: Hive
>  Issue Type: Improvement
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 5h 43m
>  Remaining Estimate: 0h
>






[jira] [Commented] (HIVE-24546) Avoid unwanted cloud storage call during dynamic partition load

2021-07-13 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-24546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17380090#comment-17380090
 ] 

Steve Loughran commented on HIVE-24546:
---

I'd recommend
* skip the dest path check
* call mkdirs() without any probe, and if it returns false, check that there's 
a dir at the far end

{code:java}
if (!fs.mkdirs(dpStagingPath) && !fs.isDirectory(dpStagingPath)) {
  throw new IOException("Failed to create dir " + dpStagingPath);
}
{code}

This relies on mkdirs() returning false if the dir is already there.

> Avoid unwanted cloud storage call during dynamic partition load
> ---
>
> Key: HIVE-24546
> URL: https://issues.apache.org/jira/browse/HIVE-24546
> Project: Hive
>  Issue Type: Improvement
>Reporter: Rajesh Balamohan
>Priority: Major
> Attachments: simple_test.sql
>
>
> {code:java}
>  private void createDpDirCheckSrc(final Path dpStagingPath, final Path 
> dpFinalPath) throws IOException {
> if (!fs.exists(dpStagingPath) && !fs.exists(dpFinalPath)) {
>   fs.mkdirs(dpStagingPath);
>   // move task will create dp final path
>   if (reporter != null) {
> reporter.incrCounter(counterGroup, 
> Operator.HIVE_COUNTER_CREATED_DYNAMIC_PARTITIONS, 1);
>   }
> }
>   }
>  {code}
>  
>  
> {noformat}
> at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:370)
>   at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.listObjects(S3AFileSystem.java:1960)
>   at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:3164)
>   at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:3031)
>   at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:2899)
>   at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1723)
>   at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.exists(S3AFileSystem.java:4157)
>   at 
> org.apache.hadoop.hive.ql.exec.FileSinkOperator.createDpDir(FileSinkOperator.java:948)
>   at 
> org.apache.hadoop.hive.ql.exec.FileSinkOperator.updateDPCounters(FileSinkOperator.java:916)
>   at 
> org.apache.hadoop.hive.ql.exec.FileSinkOperator.createBucketForFileIdx(FileSinkOperator.java:849)
>   at 
> org.apache.hadoop.hive.ql.exec.FileSinkOperator.createBucketFiles(FileSinkOperator.java:814)
>   at 
> org.apache.hadoop.hive.ql.exec.FileSinkOperator.createNewPaths(FileSinkOperator.java:1200)
>   at 
> org.apache.hadoop.hive.ql.exec.FileSinkOperator.getDynOutPaths(FileSinkOperator.java:1324)
>   at 
> org.apache.hadoop.hive.ql.exec.FileSinkOperator.process(FileSinkOperator.java:1036)
>   at 
> org.apache.hadoop.hive.ql.exec.vector.VectorFileSinkOperator.process(VectorFileSinkOperator.java:111)
>   at 
> org.apache.hadoop.hive.ql.exec.Operator.vectorForward(Operator.java:969)
>  {noformat}





[jira] [Commented] (HIVE-24849) Create external table socket timeout when location has large number of files

2021-06-30 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-24849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17372145#comment-17372145
 ] 

Steve Loughran commented on HIVE-24849:
---


Something like this:

* the existence check is integrated into the LIST call, saving 1 request against 
all stores, 2 for s3
* the listing is incremental, so it can determine that a dir is not empty without 
processing all of the output, at least on those stores which do listings 
incrementally (hdfs, webhdfs, s3a, abfs); on the others it is no slower than 
listStatus, which is what it calls internally.

{code:java}
  public boolean isEmpty() throws HiveException {
    Preconditions.checkNotNull(getPath());
    try {
      FileSystem fs = FileSystem.get(getPath().toUri(),
          SessionState.getSessionConf());
      RemoteIterator<FileStatus> it = fs.listStatusIterator(getPath());
      while (it.hasNext()) {
        FileStatus status = it.next();
        if (FileUtils.HIDDEN_FILES_PATH_FILTER.accept(status.getPath())) {
          // a visible (non-hidden) entry exists, so the path is not empty
          return false;
        }
      }
      return true;
    } catch (FileNotFoundException e) {
      // the path does not exist, so treat it as empty
      return true;
    } catch (IOException e) {
      throw new HiveException(e);
    }
  }
{code}

For isDir(), I'd just call FileSystem.isDirectory(path) and ignore the 
deprecation warning, which is there to make people look at their uses and wonder 
if there are more efficient ways (too often app code calls isFile(), 
isDirectory() or exists() before some API call which would just raise a 
FileNotFoundException anyway). 

What I would recommend is looking at the uses of isDir() and wondering whether 
they could be eliminated entirely.

> Create external table socket timeout when location has large number of files
> 
>
> Key: HIVE-24849
> URL: https://issues.apache.org/jira/browse/HIVE-24849
> Project: Hive
>  Issue Type: Bug
>  Components: Metastore
>Affects Versions: 2.3.4
> Environment: AWS EMR 5.23 with default Hive metastore and external 
> location S3
>  
>Reporter: Mithun Antony
>Priority: Major
>
> # The create table API call times out during external table creation on 
> a location where the number of files in the S3 location is large (i.e. ~10K 
> objects).
> The default timeout `hive.metastore.client.socket.timeout` is `600s`; the 
> current workaround is to increase the timeout to a higher value.
> {code:java}
> 2021-03-04T01:37:42,761 ERROR [66b8024b-e52f-42b8-8629-a45383bcac0c 
> main([])]: exec.DDLTask (DDLTask.java:failed(639)) - 
> org.apache.hadoop.hive.ql.metadata.HiveException: 
> org.apache.thrift.transport.TTransportException: 
> java.net.SocketTimeoutException: Read timed out
>  at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:873)
>  at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:878)
>  at org.apache.hadoop.hive.ql.exec.DDLTask.createTable(DDLTask.java:4356)
>  at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:354)
>  at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:199)
>  at 
> org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:100)
>  at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:2183)
>  at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1839)
>  at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1526)
>  at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1237)
>  at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1227)
>  at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:233)
>  at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:184)
>  at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:403)
>  at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:336)
>  at org.apache.hadoop.hive.cli.CliDriver.processReader(CliDriver.java:474)
>  at org.apache.hadoop.hive.cli.CliDriver.processFile(CliDriver.java:490)
>  at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:793)
>  at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:759)
>  at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:686)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at org.apache.hadoop.util.RunJar.run(RunJar.java:239)
>  at org.apache.hadoop.util.RunJar.main(RunJar.java:153)
> Caused by: org.apache.thrift.transport.TTransportException: 
> java.net.SocketTimeoutException: Read timed out
>  at 
> org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:129)
>  at 

[jira] [Commented] (HIVE-24849) Create external table socket timeout when location has large number of files

2021-06-29 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-24849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17371535#comment-17371535
 ] 

Steve Loughran commented on HIVE-24849:
---

How does tbl.isEmpty() work? Does it do a listStatus call (blocking for all 
direct children) or use an iterator (listStatusIterator(), listFiles()) and 
return false if the returned iterator's hasNext() is true? Using an iterator is 
better, as the wait time will be that of a single S3 or ABFS LIST call rather 
than blocking for all the results.

> Create external table socket timeout when location has large number of files
> 
>
> Key: HIVE-24849
> URL: https://issues.apache.org/jira/browse/HIVE-24849
> Project: Hive
>  Issue Type: Bug
>  Components: Metastore
>Affects Versions: 2.3.4
> Environment: AWS EMR 5.23 with default Hive metastore and external 
> location S3
>  
>Reporter: Mithun Antony
>Priority: Major
>
> # The create table API call times out during external table creation on 
> a location where the number of files in the S3 location is large (i.e. ~10K 
> objects).
> The default timeout `hive.metastore.client.socket.timeout` is `600s`; the 
> current workaround is to increase the timeout to a higher value.
> {code:java}
> 2021-03-04T01:37:42,761 ERROR [66b8024b-e52f-42b8-8629-a45383bcac0c 
> main([])]: exec.DDLTask (DDLTask.java:failed(639)) - 
> org.apache.hadoop.hive.ql.metadata.HiveException: 
> org.apache.thrift.transport.TTransportException: 
> java.net.SocketTimeoutException: Read timed out
>  at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:873)
>  at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:878)
>  at org.apache.hadoop.hive.ql.exec.DDLTask.createTable(DDLTask.java:4356)
>  at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:354)
>  at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:199)
>  at 
> org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:100)
>  at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:2183)
>  at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1839)
>  at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1526)
>  at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1237)
>  at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1227)
>  at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:233)
>  at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:184)
>  at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:403)
>  at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:336)
>  at org.apache.hadoop.hive.cli.CliDriver.processReader(CliDriver.java:474)
>  at org.apache.hadoop.hive.cli.CliDriver.processFile(CliDriver.java:490)
>  at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:793)
>  at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:759)
>  at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:686)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at org.apache.hadoop.util.RunJar.run(RunJar.java:239)
>  at org.apache.hadoop.util.RunJar.main(RunJar.java:153)
> Caused by: org.apache.thrift.transport.TTransportException: 
> java.net.SocketTimeoutException: Read timed out
>  at 
> org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:129)
>  at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
>  at 
> org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429)
>  at 
> org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318)
>  at 
> org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219)
>  at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:77)
>  at 
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_create_table_with_environment_context(ThriftHiveMetastore.java:1199)
>  at 
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.create_table_with_environment_context(ThriftHiveMetastore.java:1185)
>  at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.create_table_with_environment_context(HiveMetaStoreClient.java:2399)
>  at 
> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.create_table_with_environment_context(SessionHiveMetaStoreClient.java:93)
>  at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.createTable(HiveMetaStoreClient.java:752)
>  at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.createTable(HiveMetaStoreClient.java:740)
>  at 

[jira] [Commented] (HIVE-24484) Upgrade Hadoop to 3.3.1

2021-06-23 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-24484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17368073#comment-17368073
 ] 

Steve Loughran commented on HIVE-24484:
---

bq. Would be great if folks could work on syncing the version of Guava which 
these products use, especially upgrading Druid.

hadoop trunk is trying to rip out a lot of its uses of Guava, as it's too 
brittle a dependency for everything

> Upgrade Hadoop to 3.3.1
> ---
>
> Key: HIVE-24484
> URL: https://issues.apache.org/jira/browse/HIVE-24484
> Project: Hive
>  Issue Type: Improvement
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>






[jira] [Commented] (HIVE-24916) EXPORT TABLE command to ADLS Gen2/s3 failing

2021-06-22 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-24916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17367447#comment-17367447
 ] 

Steve Loughran commented on HIVE-24916:
---

If the hadoop version is recent, then calling 
hasPathCapability(path, "fs.capability.paths.xattrs") will tell you if the store 
supports xattrs.

currently only HDFS and webHDFS support the full XAttr API, so they are the 
only ones which return true for this probe.

the other tactic is: probe and downgrade. However, be aware that while S3A 
rejects all setXAttr calls, it does support the getters (as a way of getting at 
HTTP object headers).
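
a sketch of both tactics; the attribute name below is purely illustrative, and 
the downgrade assumes the store signals rejection with 
UnsupportedOperationException:

{code:java}
import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public final class XAttrProbe {

  // Tactic 1: ask the store up front (hadoop 3.3+ hasPathCapability).
  public static boolean storeHasXAttrs(FileSystem fs, Path path) throws IOException {
    return fs.hasPathCapability(path, "fs.capability.paths.xattrs");
  }

  // Tactic 2: probe and downgrade - attempt the write and swallow a rejection.
  public static void trySetMarker(FileSystem fs, Path path) throws IOException {
    try {
      fs.setXAttr(path, "user.hive.export.marker", new byte[0]);   // illustrative name
    } catch (UnsupportedOperationException e) {
      // the store rejects setXAttr (s3a does); skip xattr preservation
    }
  }
}
{code}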

> EXPORT TABLE command to ADLS Gen2/s3 failing
> 
>
> Key: HIVE-24916
> URL: https://issues.apache.org/jira/browse/HIVE-24916
> Project: Hive
>  Issue Type: Bug
>  Components: repl
>Affects Versions: 4.0.0
>Reporter: Rajkumar Singh
>Assignee: Rajkumar Singh
>Priority: Major
>
> "EXPORT TABLE" command invoked using distcp command failed with following 
> error -
> org.apache.hadoop.tools.CopyListing$XAttrsNotSupportedException: XAttrs not 
> supported for file system: abfs://storage...@xx.core.windows.net





[jira] [Commented] (HIVE-24849) Create external table socket timeout when location has large number of files

2021-06-22 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-24849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17367439#comment-17367439
 ] 

Steve Loughran commented on HIVE-24849:
---

[~glapark]

bq. Now, HiveServer2 does not send ListObjectV2 requests. Still Metastore sends 
a ListObjectV2 request from Warehouse.isDir().

don't worry, that will be size limited and is the probe for "is this a 
directory?"

> Create external table socket timeout when location has large number of files
> 
>
> Key: HIVE-24849
> URL: https://issues.apache.org/jira/browse/HIVE-24849
> Project: Hive
>  Issue Type: Bug
>  Components: Metastore
>Affects Versions: 2.3.4
> Environment: AWS EMR 5.23 with default Hive metastore and external 
> location S3
>  
>Reporter: Mithun Antony
>Priority: Major
>
> # The create table API call times out during external table creation on 
> a location where the number of files in the S3 location is large (i.e. ~10K 
> objects).
> The default timeout `hive.metastore.client.socket.timeout` is `600s`; the 
> current workaround is to increase the timeout to a higher value.
> {code:java}
> 2021-03-04T01:37:42,761 ERROR [66b8024b-e52f-42b8-8629-a45383bcac0c 
> main([])]: exec.DDLTask (DDLTask.java:failed(639)) - 
> org.apache.hadoop.hive.ql.metadata.HiveException: 
> org.apache.thrift.transport.TTransportException: 
> java.net.SocketTimeoutException: Read timed out
>  at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:873)
>  at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:878)
>  at org.apache.hadoop.hive.ql.exec.DDLTask.createTable(DDLTask.java:4356)
>  at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:354)
>  at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:199)
>  at 
> org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:100)
>  at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:2183)
>  at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1839)
>  at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1526)
>  at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1237)
>  at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1227)
>  at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:233)
>  at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:184)
>  at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:403)
>  at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:336)
>  at org.apache.hadoop.hive.cli.CliDriver.processReader(CliDriver.java:474)
>  at org.apache.hadoop.hive.cli.CliDriver.processFile(CliDriver.java:490)
>  at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:793)
>  at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:759)
>  at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:686)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at org.apache.hadoop.util.RunJar.run(RunJar.java:239)
>  at org.apache.hadoop.util.RunJar.main(RunJar.java:153)
> Caused by: org.apache.thrift.transport.TTransportException: 
> java.net.SocketTimeoutException: Read timed out
>  at 
> org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:129)
>  at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
>  at 
> org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429)
>  at 
> org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318)
>  at 
> org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219)
>  at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:77)
>  at 
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_create_table_with_environment_context(ThriftHiveMetastore.java:1199)
>  at 
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.create_table_with_environment_context(ThriftHiveMetastore.java:1185)
>  at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.create_table_with_environment_context(HiveMetaStoreClient.java:2399)
>  at 
> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.create_table_with_environment_context(SessionHiveMetaStoreClient.java:93)
>  at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.createTable(HiveMetaStoreClient.java:752)
>  at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.createTable(HiveMetaStoreClient.java:740)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> 

[jira] [Commented] (HIVE-24849) Create external table socket timeout when location has large number of files

2021-06-22 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-24849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17367438#comment-17367438
 ] 

Steve Loughran commented on HIVE-24849:
---

is hive doing its own recursive treewalk or calling listFiles(path, 
recursive=true)? That call is much, much faster against S3, and potentially 
faster for HDFS/webhdfs too.

> Create external table socket timeout when location has large number of files
> 
>
> Key: HIVE-24849
> URL: https://issues.apache.org/jira/browse/HIVE-24849
> Project: Hive
>  Issue Type: Bug
>  Components: Metastore
>Affects Versions: 2.3.4
> Environment: AWS EMR 5.23 with default Hive metastore and external 
> location S3
>  
>Reporter: Mithun Antony
>Priority: Major
>
> # The create table API call times out during external table creation on 
> a location where the number of files in the S3 location is large (i.e. ~10K 
> objects).
> The default timeout `hive.metastore.client.socket.timeout` is `600s`; the 
> current workaround is to increase the timeout to a higher value.
> {code:java}
> 2021-03-04T01:37:42,761 ERROR [66b8024b-e52f-42b8-8629-a45383bcac0c 
> main([])]: exec.DDLTask (DDLTask.java:failed(639)) - 
> org.apache.hadoop.hive.ql.metadata.HiveException: 
> org.apache.thrift.transport.TTransportException: 
> java.net.SocketTimeoutException: Read timed out
>  at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:873)
>  at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:878)
>  at org.apache.hadoop.hive.ql.exec.DDLTask.createTable(DDLTask.java:4356)
>  at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:354)
>  at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:199)
>  at 
> org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:100)
>  at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:2183)
>  at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1839)
>  at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1526)
>  at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1237)
>  at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1227)
>  at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:233)
>  at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:184)
>  at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:403)
>  at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:336)
>  at org.apache.hadoop.hive.cli.CliDriver.processReader(CliDriver.java:474)
>  at org.apache.hadoop.hive.cli.CliDriver.processFile(CliDriver.java:490)
>  at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:793)
>  at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:759)
>  at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:686)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at org.apache.hadoop.util.RunJar.run(RunJar.java:239)
>  at org.apache.hadoop.util.RunJar.main(RunJar.java:153)
> Caused by: org.apache.thrift.transport.TTransportException: 
> java.net.SocketTimeoutException: Read timed out
>  at 
> org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:129)
>  at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
>  at 
> org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429)
>  at 
> org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318)
>  at 
> org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219)
>  at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:77)
>  at 
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_create_table_with_environment_context(ThriftHiveMetastore.java:1199)
>  at 
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.create_table_with_environment_context(ThriftHiveMetastore.java:1185)
>  at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.create_table_with_environment_context(HiveMetaStoreClient.java:2399)
>  at 
> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.create_table_with_environment_context(SessionHiveMetaStoreClient.java:93)
>  at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.createTable(HiveMetaStoreClient.java:752)
>  at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.createTable(HiveMetaStoreClient.java:740)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> 

[jira] [Commented] (HIVE-17133) NoSuchMethodError in Hadoop FileStatus.compareTo

2021-06-22 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-17133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17367301#comment-17367301
 ] 

Steve Loughran commented on HIVE-17133:
---

Is this ready to go in? even without a new test?

> NoSuchMethodError in Hadoop FileStatus.compareTo
> 
>
> Key: HIVE-17133
> URL: https://issues.apache.org/jira/browse/HIVE-17133
> Project: Hive
>  Issue Type: Bug
>Reporter: Rui Li
>Assignee: Rui Li
>Priority: Major
> Attachments: HIVE-17133.1.patch
>
>
> The stack trace is:
> {noformat}
> Caused by: java.lang.NoSuchMethodError: 
> org.apache.hadoop.fs.FileStatus.compareTo(Lorg/apache/hadoop/fs/FileStatus;)I
>   at 
> org.apache.hadoop.hive.ql.io.AcidUtils.lambda$getAcidState$0(AcidUtils.java:931)
>   at java.util.TimSort.countRunAndMakeAscending(TimSort.java:355)
>   at java.util.TimSort.sort(TimSort.java:234)
>   at java.util.Arrays.sort(Arrays.java:1512)
>   at java.util.ArrayList.sort(ArrayList.java:1454)
>   at java.util.Collections.sort(Collections.java:175)
>   at 
> org.apache.hadoop.hive.ql.io.AcidUtils.getAcidState(AcidUtils.java:929)
> {noformat}
> I'm on Hive master and using Hadoop 2.7.2. The method signature in Hadoop 
> 2.7.2 is:
> https://github.com/apache/hadoop/blob/release-2.7.2-RC2/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileStatus.java#L336
> In Hadoop 2.8.0 it becomes:
> https://github.com/apache/hadoop/blob/release-2.8.0-RC3/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileStatus.java#L332
> I think that breaks binary compatibility.





[jira] [Commented] (HIVE-24717) Migrate to listStatusIterator in moving files

2021-02-03 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-24717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277866#comment-17277866
 ] 

Steve Loughran commented on HIVE-24717:
---

happy to review a hadoop PR with the relevant fix backported

> Migrate to listStatusIterator in moving files
> -
>
> Key: HIVE-24717
> URL: https://issues.apache.org/jira/browse/HIVE-24717
> Project: Hive
>  Issue Type: Improvement
>Reporter: Mustafa İman
>Assignee: Mustafa İman
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Hive.java has various calls to hdfs listStatus call when moving 
> files/directories around. These codepaths are used for insert overwrite 
> table/partition queries.
> listStatus is a blocking call, whereas listStatusIterator is backed by a 
> RemoteIterator and fetches pages in the background. Hive should take 
> advantage of that, since Hadoop recently implemented listStatusIterator for 
> S3: https://issues.apache.org/jira/browse/HADOOP-17074





[jira] [Updated] (HIVE-23492) Remove unnecessary FileSystem#exists calls from ql module

2020-07-15 Thread Steve Loughran (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated HIVE-23492:
--
Description: Wherever there is an exists() call before open() or delete(), 
remove it and infer from the FileNotFoundException raised in open/delete that 
the file does not exist. Exists() just checks for a FileNotFoundException so 
it's a waste of time, especially on higher-latency stores  (was: Wherever there 
is an exists() call before open() or delete(), remove it and infer from the 
FileNotFoundException raised in open/delete that the file does not exist. 
Exists() just checks for a FileNotFoundException so it's a waste of time, 
especially on clunkier FSes)
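
for example, a sketch of what the call sites end up looking like:

{code:java}
import java.io.FileNotFoundException;
import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public final class OpenWithoutExists {

  // Instead of "if (fs.exists(path)) { in = fs.open(path); }", just open
  // and treat FileNotFoundException as "the file is not there".
  public static FSDataInputStream openIfPresent(FileSystem fs, Path path) throws IOException {
    try {
      return fs.open(path);
    } catch (FileNotFoundException e) {
      return null;   // caller handles the missing file
    }
  }

  // delete() already returns false when there was nothing to delete,
  // so an exists() probe before it only adds latency.
  public static boolean deleteIfPresent(FileSystem fs, Path path) throws IOException {
    return fs.delete(path, false);
  }
}
{code}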

> Remove unnecessary FileSystem#exists calls from ql module
> -
>
> Key: HIVE-23492
> URL: https://issues.apache.org/jira/browse/HIVE-23492
> Project: Hive
>  Issue Type: Improvement
>Reporter: Karen Coppage
>Assignee: Karen Coppage
>Priority: Major
> Fix For: 4.0.0
>
> Attachments: HIVE-23492.01.patch, HIVE-23492.02.patch, 
> HIVE-23492.03.patch, HIVE-23492.04.patch, HIVE-23492.05.patch
>
>
> Wherever there is an exists() call before open() or delete(), remove it and 
> infer from the FileNotFoundException raised in open/delete that the file does 
> not exist. Exists() just checks for a FileNotFoundException so it's a waste 
> of time, especially on higher-latency stores



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HIVE-22819) Refactor Hive::listFilesCreatedByQuery to make it faster for object stores

2020-02-25 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-22819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17044471#comment-17044471
 ] 

Steve Loughran commented on HIVE-22819:
---

LGTM - this saves two round trips to HDFS, S3 or ABFS.

> Refactor Hive::listFilesCreatedByQuery to make it faster for object stores
> --
>
> Key: HIVE-22819
> URL: https://issues.apache.org/jira/browse/HIVE-22819
> Project: Hive
>  Issue Type: Improvement
>Reporter: Marton Bod
>Assignee: Marton Bod
>Priority: Major
> Attachments: HIVE-22819.1.patch, HIVE-22819.2.patch, 
> HIVE-22819.3.patch, HIVE-22819.4.patch
>
>
> {{Hive::listFilesCreatedByQuery}} does an exists(), an isDir() and then a 
> listing call. This can be expensive in object stores. We should instead 
> directly list the files in the directory (we'd have to handle an exception 
> if the directory does not exist, but issuing a single call to the object 
> store would most likely still end up being more performant). 
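A hedged sketch of what "one listing call plus exception handling" could look like; the helper names are illustrative, not the actual listFilesCreatedByQuery code:

{code:java}
import java.io.FileNotFoundException;
import java.io.IOException;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListInsteadOfProbesExample {
  // One listStatus() call replaces the exists() + isDir() + list sequence;
  // a missing directory surfaces as FileNotFoundException.
  static FileStatus[] listOrEmpty(FileSystem fs, Path dir) throws IOException {
    try {
      return fs.listStatus(dir);
    } catch (FileNotFoundException e) {
      return new FileStatus[0];   // directory does not exist: nothing created
    }
  }
}
{code}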



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HIVE-14165) Remove Hive file listing during split computation

2020-02-10 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-14165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17033607#comment-17033607
 ] 

Steve Loughran commented on HIVE-14165:
---

What is the current status of this? Is it a de facto WONTFIX? Or is someone 
keeping the patch up to date?

> Remove Hive file listing during split computation
> -
>
> Key: HIVE-14165
> URL: https://issues.apache.org/jira/browse/HIVE-14165
> Project: Hive
>  Issue Type: Sub-task
>Affects Versions: 2.1.0
>Reporter: Abdullah Yousufi
>Assignee: Sahil Takiar
>Priority: Major
> Attachments: HIVE-14165.02.patch, HIVE-14165.03.patch, 
> HIVE-14165.04.patch, HIVE-14165.05.patch, HIVE-14165.06.patch, 
> HIVE-14165.07.patch, HIVE-14165.patch
>
>
> The Hive side listing in FetchOperator.java is unnecessary, since Hadoop's 
> FileInputFormat.java will list the files during split computation anyway to 
> determine their size. One way to remove this is to catch the 
> InvalidInputFormat exception thrown by FileInputFormat#getSplits() on the 
> Hive side instead of doing the file listing beforehand.
> For S3 select queries on partitioned tables, this results in a 2x speedup.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HIVE-16295) Add support for using Hadoop's S3A OutputCommitter

2020-01-03 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-16295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17007496#comment-17007496
 ] 

Steve Loughran commented on HIVE-16295:
---

Yeah, where are we with this? Is anyone active on it?

> Add support for using Hadoop's S3A OutputCommitter
> --
>
> Key: HIVE-16295
> URL: https://issues.apache.org/jira/browse/HIVE-16295
> Project: Hive
>  Issue Type: Sub-task
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
>Priority: Major
> Attachments: HIVE-16295.1.WIP.patch, HIVE-16295.2.WIP.patch, 
> HIVE-16295.3.WIP.patch, HIVE-16295.4.patch, HIVE-16295.5.patch, 
> HIVE-16295.6.patch, HIVE-16295.7.patch, HIVE-16295.8.patch, HIVE-16295.9.patch
>
>
> Hive doesn't have integration with Hadoop's {{OutputCommitter}}, it uses a 
> {{NullOutputCommitter}} and uses its own commit logic spread across 
> {{FileSinkOperator}}, {{MoveTask}}, and {{Hive}}.
> The Hadoop community is building an {{OutputCommitter}} that integrates with 
> S3Guard and does a safe, coordinated commit of data on S3 inside individual 
> tasks (HADOOP-13786). If Hive can integrate with this new {{OutputCommitter}} 
> there would be a lot of benefits to Hive-on-S3:
> * Data is only written once; directly committing data at a task level means 
> no renames are necessary
> * The commit is done safely, in a coordinated manner; duplicate tasks (from 
> task retries or speculative execution) should not step on each other



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HIVE-22548) Optimise Utilities.removeTempOrDuplicateFiles when moving files to final location

2019-12-06 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-22548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16989710#comment-16989710
 ] 

Steve Loughran commented on HIVE-22548:
---

OK.

BTW, if you call toString() on the S3A connector you get a dump of its metrics; 
you can just do a LOG.debug("FS {}", fs) to get this at debug level.
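For illustration, a tiny sketch of that logging pattern; the class name is made up, and it only assumes SLF4J plus the Hadoop FileSystem type:

{code:java}
import org.apache.hadoop.fs.FileSystem;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class FsMetricsLoggingExample {
  private static final Logger LOG = LoggerFactory.getLogger(FsMetricsLoggingExample.class);

  // S3AFileSystem.toString() includes its statistics, so logging the FileSystem
  // object at debug level is a cheap way to capture a metrics snapshot.
  static void logFsMetrics(FileSystem fs) {
    LOG.debug("FS {}", fs);   // toString() is only invoked when debug logging is enabled
  }
}
{code}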

> Optimise Utilities.removeTempOrDuplicateFiles when moving files to final 
> location
> -
>
> Key: HIVE-22548
> URL: https://issues.apache.org/jira/browse/HIVE-22548
> Project: Hive
>  Issue Type: Improvement
>  Components: Hive
>Affects Versions: 3.1.2
>Reporter: Rajesh Balamohan
>Assignee: mahesh kumar behera
>Priority: Major
> Attachments: HIVE-22548.01.patch, HIVE-22548.02.patch
>
>
> {{Utilities.removeTempOrDuplicateFiles}}
> is very slow with cloud storage, as it executes {{listStatus}} twice and also 
> runs in single threaded mode.
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java#L1629



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HIVE-22548) Optimise Utilities.removeTempOrDuplicateFiles when moving files to final location

2019-12-03 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-22548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16987093#comment-16987093
 ] 

Steve Loughran commented on HIVE-22548:
---

Do you need that return code from removeEmptyDpDirectory()? You are still 
doing listStatus calls which you can avoid: if you replaced the entire 
function with delete(path, false), then only an empty dir will be deleted, so 
you save the cost of a listing.
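A small sketch of that suggestion; the helper name is illustrative, and the exception handling is the "swallow the rejection" behaviour discussed in this thread:

{code:java}
import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DeleteIfEmptyExample {
  // Non-recursive delete: the filesystem itself refuses to remove a non-empty
  // directory, so no listing is needed to check for emptiness first.
  static boolean deleteIfEmpty(FileSystem fs, Path dir) {
    try {
      return fs.delete(dir, false);   // true only if the (empty) dir was removed
    } catch (IOException e) {
      // a non-empty directory is rejected with an exception; treat as "not deleted"
      return false;
    }
  }
}
{code}

The boolean return value here stands in for the return code mentioned above: true only when the empty directory was actually removed.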

> Optimise Utilities.removeTempOrDuplicateFiles when moving files to final 
> location
> -
>
> Key: HIVE-22548
> URL: https://issues.apache.org/jira/browse/HIVE-22548
> Project: Hive
>  Issue Type: Improvement
>  Components: Hive
>Affects Versions: 3.1.2
>Reporter: Rajesh Balamohan
>Assignee: mahesh kumar behera
>Priority: Major
> Attachments: HIVE-22548.01.patch
>
>
> {{Utilities.removeTempOrDuplicateFiles}}
> is very slow with cloud storage, as it executes {{listStatus}} twice and also 
> runs in single threaded mode.
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java#L1629



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HIVE-22548) Optimise Utilities.removeTempOrDuplicateFiles when moving files to final location

2019-11-27 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-22548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16983416#comment-16983416
 ] 

Steve Loughran commented on HIVE-22548:
---

Also, at L1644 it calls path.exists() before the listFiles. Has anyone noticed 
that it is marked as deprecated? There's a reason we warn people about it, and 
it's this recurrent code path of exists + operation, which duplicates the 
expensive check for files or directories existing.

*just call listStatus and treat a FileNotFoundException as a sign that the path 
doesn't exist*

It is exactly what exists() does after all.

While I'm looking at that class
h3. removeEmptyDpDirectory

[https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java#L1601]

This contains a needless listFiles just to see if directory is empty.

If you use delete(path, false) (i.e. the non-recursive one), it does the check 
for having children internally *and rejects the call*. Just swallow any 
exception it raises telling you off about this fact.
 * we have a test for this for every single file system; it is the same as "rm 
dir" on the command line. You do not need to worry about it being implemented 
wrong.

h3. removeTempOrDuplicateFiles

[https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java#L1757]

delete() returns false in only two conditions:
 # you've tried to delete root
 # the file wasn't actually there

You shouldn't need to check the result, and if there is any chance that some 
other process could delete the temp file, checking would convert a no-op into 
a failure.
h3. getFileSizeRecursively()

[https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java#L1840]
 getFileSizeRecursively() is potentially really expensive too.

[https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java#L1853]
this swallows all exception details. Please include the message and the nested 
exception; everyone who fields support calls will appreciate this.
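On the last point, a hypothetical helper showing the kind of re-throw that keeps the message and the nested exception; this is not the actual Utilities code:

{code:java}
import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PreserveExceptionDetailsExample {
  // Keep the original message and chain the cause so support logs stay useful,
  // rather than swallowing the exception details.
  static long fileSize(FileSystem fs, Path path) {
    try {
      return fs.getFileStatus(path).getLen();
    } catch (IOException e) {
      throw new RuntimeException("Failed to get size of " + path + ": " + e, e);
    }
  }
}
{code}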

> Optimise Utilities.removeTempOrDuplicateFiles when moving files to final 
> location
> -
>
> Key: HIVE-22548
> URL: https://issues.apache.org/jira/browse/HIVE-22548
> Project: Hive
>  Issue Type: Improvement
>  Components: Hive
>Affects Versions: 3.1.2
>Reporter: Rajesh Balamohan
>Priority: Major
>
> {{Utilities.removeTempOrDuplicateFiles}}
> is very slow with cloud storage, as it executes {{listStatus}} twice and also 
> runs in single threaded mode.
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java#L1629



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HIVE-22411) Performance degradation on single row inserts

2019-10-31 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-22411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16964170#comment-16964170
 ] 

Steve Loughran commented on HIVE-22411:
---

Patch looks functional to me at a glance.

There is still a cost to all these list operations. Is there actually a way to 
avoid them - such as having whatever commits the output pass up the details of 
what has changed?

> Performance degradation on single row inserts
> -
>
> Key: HIVE-22411
> URL: https://issues.apache.org/jira/browse/HIVE-22411
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Reporter: Attila Magyar
>Assignee: Attila Magyar
>Priority: Major
> Fix For: 4.0.0
>
> Attachments: HIVE-22411.1.patch, Screen Shot 2019-10-17 at 8.40.50 
> PM.png
>
>
> Executing single insert statements on a transactional table affects write 
> performance on an S3 file system. Each insert creates a new delta directory. 
> After each insert Hive calculates statistics such as the number of files in 
> the table and the total size of the table. In order to calculate these, it 
> traverses the directory recursively. During the recursion a separate 
> listStatus call is executed for each path. In the end, the more delta 
> directories you have, the more time it takes to calculate the statistics.
> Therefore insertion time goes up linearly:
> !Screen Shot 2019-10-17 at 8.40.50 PM.png|width=601,height=436!
> The fix is to use fs.listFiles(path, /**recursive**/ true) instead of the 
> handcrafted recursive method.
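A minimal sketch of the fs.listFiles(path, true) approach, with illustrative names rather than the actual Hive statistics code:

{code:java}
import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class RecursiveListingExample {
  // Single recursive listing instead of one listStatus() call per directory;
  // on S3A this maps to paged bulk LIST requests rather than a tree walk.
  static long[] countAndSize(FileSystem fs, Path root) throws IOException {
    long files = 0;
    long bytes = 0;
    RemoteIterator<LocatedFileStatus> it = fs.listFiles(root, true);
    while (it.hasNext()) {
      LocatedFileStatus status = it.next();
      files++;
      bytes += status.getLen();
    }
    return new long[] {files, bytes};
  }
}
{code}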



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HIVE-22411) Performance degradation on single row inserts

2019-10-29 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-22411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16962280#comment-16962280
 ] 

Steve Loughran commented on HIVE-22411:
---

FYI [~gabor.bota][~rajesh.balamohan]

> Performance degradation on single row inserts
> -
>
> Key: HIVE-22411
> URL: https://issues.apache.org/jira/browse/HIVE-22411
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Reporter: Attila Magyar
>Assignee: Attila Magyar
>Priority: Major
> Fix For: 4.0.0
>
> Attachments: Screen Shot 2019-10-17 at 8.40.50 PM.png
>
>
> Executing single insert statements on a transactional table affects write 
> performance on an S3 file system. Each insert creates a new delta directory. 
> After each insert Hive calculates statistics such as the number of files in 
> the table and the total size of the table. In order to calculate these, it 
> traverses the directory recursively. During the recursion a separate 
> listStatus call is executed for each path. In the end, the more delta 
> directories you have, the more time it takes to calculate the statistics.
> Therefore insertion time goes up linearly:
> !Screen Shot 2019-10-17 at 8.40.50 PM.png|width=601,height=436!
> The fix is to use fs.listFiles(path, /**recursive**/ true) instead of the 
> handcrafted recursive method.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HIVE-22411) Performance degradation on single row inserts

2019-10-29 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-22411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16962271#comment-16962271
 ] 

Steve Loughran commented on HIVE-22411:
---

Why do you need to list every single file under a directory tree just to 
update the counter? That is a very expensive operation. On S3 it is 
O(files/5000) and you are billed for it; with S3Guard it is slightly faster and 
you are billed more for it. In both cases it lines you up for throttling by the 
service.

Can't Hive count the amount of data during job commit?


> Performance degradation on single row inserts
> -
>
> Key: HIVE-22411
> URL: https://issues.apache.org/jira/browse/HIVE-22411
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Reporter: Attila Magyar
>Assignee: Attila Magyar
>Priority: Major
> Fix For: 4.0.0
>
> Attachments: Screen Shot 2019-10-17 at 8.40.50 PM.png
>
>
> Executing single insert statements on a transactional table affects write 
> performance on an S3 file system. Each insert creates a new delta directory. 
> After each insert Hive calculates statistics such as the number of files in 
> the table and the total size of the table. In order to calculate these, it 
> traverses the directory recursively. During the recursion a separate 
> listStatus call is executed for each path. In the end, the more delta 
> directories you have, the more time it takes to calculate the statistics.
> Therefore insertion time goes up linearly:
> !Screen Shot 2019-10-17 at 8.40.50 PM.png|width=601,height=436!
> The fix is to use fs.listFiles(path, /**recursive**/ true) instead of the 
> handcrafted recursive method.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HIVE-22054) Avoid recursive listing to check if a directory is empty

2019-07-30 Thread Steve Loughran (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-22054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16896009#comment-16896009
 ] 

Steve Loughran commented on HIVE-22054:
---

You are correct, the getContentSummary call will be horribly bad on S3; I 
didn't know anyone used it. Filed HADOOP-16468 to speed it up, but it'll still 
be issuing {{descendants/1000}} LIST calls, which costs $ as well as time.

For directories where the parent is deleted, things are low cost today; this 
patch will deliver significant speedups in the state where the parent directory 
is not empty and one or more subdirectories have a deep tree - it's the depth 
which is potentially more expensive than the number of entries in a directory.
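For comparison, a hedged sketch of an emptiness check that needs at most one paged listing instead of getContentSummary()'s full recursive walk; the helper name is illustrative:

{code:java}
import java.io.FileNotFoundException;
import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CheapEmptinessCheckExample {
  // "Does this directory have any children?" answered with a single listing
  // request, rather than walking the whole tree under the path.
  static boolean isEmptyDirectory(FileSystem fs, Path dir) throws IOException {
    try {
      return !fs.listStatusIterator(dir).hasNext();
    } catch (FileNotFoundException e) {
      return false;   // path is missing, so "empty directory" does not apply
    }
  }
}
{code}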



> Avoid recursive listing to check if a directory is empty
> 
>
> Key: HIVE-22054
> URL: https://issues.apache.org/jira/browse/HIVE-22054
> Project: Hive
>  Issue Type: Bug
>  Components: Metastore
>Affects Versions: 0.13.0, 1.2.0, 2.1.0, 3.1.1, 2.3.5
>Reporter: Prabhas Kumar Samanta
>Assignee: Prabhas Kumar Samanta
>Priority: Major
> Attachments: HIVE-22054.2.patch, HIVE-22054.patch
>
>
> During drop partition on a managed table, first we delete the directory 
> corresponding to the partition. After that we recursively delete the parent 
> directory as well if parent directory becomes empty. To do this emptiness 
> check, we call Warehouse::getContentSummary(), which in turn recursively 
> check all files and subdirectories. This is a costly operation when a 
> directory has a lot of files or subdirectories. This overhead is even more 
> prominent for cloud based file systems like s3. And for emptiness check, this 
> is unnecessary too.
> This recursive listing was introduced as part of HIVE-5220. Code snippet 
> for reference:
> {code:java}
> // Warehouse.java
> public boolean isEmpty(Path path) throws IOException, MetaException {
>   ContentSummary contents = getFs(path).getContentSummary(path);
>   if (contents != null && contents.getFileCount() == 0 && 
> contents.getDirectoryCount() == 1) {
> return true;
>   }
>   return false;
> }
> // HiveMetaStore.java
> private void deleteParentRecursive(Path parent, int depth, boolean mustPurge, 
> boolean needRecycle)
>   throws IOException, MetaException {
>   if (depth > 0 && parent != null && wh.isWritable(parent)) {
> if (wh.isDir(parent) && wh.isEmpty(parent)) {
>   wh.deleteDir(parent, true, mustPurge, needRecycle);
> }
> deleteParentRecursive(parent.getParent(), depth - 1, mustPurge, 
> needRecycle);
>   }
> }
> // Note: FileSystem::getContentSummary() performs a recursive listing.{code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Resolved] (HIVE-19580) Hive 2.3.2 with ORC files & stored on S3 are case sensitive on EMR

2019-02-20 Thread Steve Loughran (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-19580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran resolved HIVE-19580.
---
Resolution: Not A Problem

OK. closing.

Trying hard to think of the best way to classify this, e.g. cannot reproduce, 
invalid, etc. Closing as "not a problem" as it's not an ASF issue.

> Hive 2.3.2 with ORC files & stored on S3 are case sensitive on EMR
> --
>
> Key: HIVE-19580
> URL: https://issues.apache.org/jira/browse/HIVE-19580
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 2.3.2
> Environment: EMR s3:// connector
> Spark 2.3 but also true for lower versions
> Hive 2.3.2
>Reporter: Arthur Baudry
>Priority: Major
> Fix For: 2.3.2
>
>
> Original file is csv:
> COL1,COL2
>  1,2
> ORC files are created with Spark 2.3:
> scala> val df = spark.read.option("header","true").csv("/user/hadoop/file")
> scala> df.printSchema
>  root
> |– COL1: string (nullable = true)|
> |– COL2: string (nullable = true)|
> scala> df.write.orc("s3://bucket/prefix")
> In Hive:
> hive> CREATE EXTERNAL TABLE test_orc(COL1 STRING, COL2 STRING) STORED AS ORC 
> LOCATION ("s3://bucket/prefix");
> hive> SELECT * FROM test_orc;
>  OK
>  NULL NULL
> *Every field is null. However, if fields are generated using lower case in 
> Spark schemas then everything works.*
> The reason why I'm raising this bug is that we have customers using Hive 
> 2.3.2 to read files we generate through Spark and all our code base is 
> addressing fields using upper case while this is incompatible with their Hive 
> instance.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-19580) Hive 2.3.2 with ORC files & stored on S3 are case sensitive on EMR

2019-02-20 Thread Steve Loughran (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-19580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated HIVE-19580:
--
Summary: Hive 2.3.2 with ORC files & stored on S3 are case sensitive on EMR 
 (was: Hive 2.3.2 with ORC files stored on S3 are case sensitive)

> Hive 2.3.2 with ORC files & stored on S3 are case sensitive on EMR
> --
>
> Key: HIVE-19580
> URL: https://issues.apache.org/jira/browse/HIVE-19580
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 2.3.2
> Environment: EMR s3:// connector
> Spark 2.3 but also true for lower versions
> Hive 2.3.2
>Reporter: Arthur Baudry
>Priority: Major
> Fix For: 2.3.2
>
>
> Original file is csv:
> COL1,COL2
>  1,2
> ORC files are created with Spark 2.3:
> scala> val df = spark.read.option("header","true").csv("/user/hadoop/file")
> scala> df.printSchema
>  root
> |– COL1: string (nullable = true)|
> |– COL2: string (nullable = true)|
> scala> df.write.orc("s3://bucket/prefix")
> In Hive:
> hive> CREATE EXTERNAL TABLE test_orc(COL1 STRING, COL2 STRING) STORED AS ORC 
> LOCATION ("s3://bucket/prefix");
> hive> SELECT * FROM test_orc;
>  OK
>  NULL NULL
> *Every field is null. However, if fields are generated using lower case in 
> Spark schemas then everything works.*
> The reason why I'm raising this bug is that we have customers using Hive 
> 2.3.2 to read files we generate through Spark and all our code base is 
> addressing fields using upper case while this is incompatible with their Hive 
> instance.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (HIVE-19580) Hive 2.3.2 with ORC files stored on S3 are case sensitive

2019-02-19 Thread Steve Loughran (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-19580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16772255#comment-16772255
 ] 

Steve Loughran edited comment on HIVE-19580 at 2/19/19 7:21 PM:


If this is EMR then AWS are the only team who can deal with it. Their S3 
connector is not the ASF one. Changing the environment to make clear that both 
sightings are with EMR URLs.


was (Author: ste...@apache.org):
If this is EMR then AWS are the only person who can deal with it. Their S3 
connector is not the ASF one. Changing the environment to make clear that both 
sightings are with EMR URLs.

> Hive 2.3.2 with ORC files stored on S3 are case sensitive
> -
>
> Key: HIVE-19580
> URL: https://issues.apache.org/jira/browse/HIVE-19580
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 2.3.2
> Environment: EMR s3:// connector
> Spark 2.3 but also true for lower versions
> Hive 2.3.2
>Reporter: Arthur Baudry
>Priority: Major
> Fix For: 2.3.2
>
>
> Original file is csv:
> COL1,COL2
>  1,2
> ORC files are created with Spark 2.3:
> scala> val df = spark.read.option("header","true").csv("/user/hadoop/file")
> scala> df.printSchema
>  root
> |– COL1: string (nullable = true)|
> |– COL2: string (nullable = true)|
> scala> df.write.orc("s3://bucket/prefix")
> In Hive:
> hive> CREATE EXTERNAL TABLE test_orc(COL1 STRING, COL2 STRING) STORED AS ORC 
> LOCATION ("s3://bucket/prefix");
> hive> SELECT * FROM test_orc;
>  OK
>  NULL NULL
> *Every field is null. However, if fields are generated using lower case in 
> Spark schemas then everything works.*
> The reason why I'm raising this bug is that we have customers using Hive 
> 2.3.2 to read files we generate through Spark and all our code base is 
> addressing fields using upper case while this is incompatible with their Hive 
> instance.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-19580) Hive 2.3.2 with ORC files stored on S3 are case sensitive

2019-02-19 Thread Steve Loughran (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-19580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16772255#comment-16772255
 ] 

Steve Loughran commented on HIVE-19580:
---

If this is EMR then AWS are the only person who can deal with it. Their S3 
connector is not the ASF one. Changing the environment to make clear that both 
sightings are with EMR URLs.

> Hive 2.3.2 with ORC files stored on S3 are case sensitive
> -
>
> Key: HIVE-19580
> URL: https://issues.apache.org/jira/browse/HIVE-19580
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 2.3.2
> Environment: AWS S3 to store files
> Spark 2.3 but also true for lower versions
> Hive 2.3.2
>Reporter: Arthur Baudry
>Priority: Major
> Fix For: 2.3.2
>
>
> Original file is csv:
> COL1,COL2
>  1,2
> ORC files are created with Spark 2.3:
> scala> val df = spark.read.option("header","true").csv("/user/hadoop/file")
> scala> df.printSchema
>  root
> |– COL1: string (nullable = true)|
> |– COL2: string (nullable = true)|
> scala> df.write.orc("s3://bucket/prefix")
> In Hive:
> hive> CREATE EXTERNAL TABLE test_orc(COL1 STRING, COL2 STRING) STORED AS ORC 
> LOCATION ("s3://bucket/prefix");
> hive> SELECT * FROM test_orc;
>  OK
>  NULL NULL
> *Every field is null. However, if fields are generated using lower case in 
> Spark schemas then everything works.*
> The reason why I'm raising this bug is that we have customers using Hive 
> 2.3.2 to read files we generate through Spark and all our code base is 
> addressing fields using upper case while this is incompatible with their Hive 
> instance.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-19580) Hive 2.3.2 with ORC files stored on S3 are case sensitive

2019-02-19 Thread Steve Loughran (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-19580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated HIVE-19580:
--
Environment: 
EMR s3:// connector

Spark 2.3 but also true for lower versions

Hive 2.3.2

  was:
AWS S3 to store files

Spark 2.3 but also true for lower versions

Hive 2.3.2


> Hive 2.3.2 with ORC files stored on S3 are case sensitive
> -
>
> Key: HIVE-19580
> URL: https://issues.apache.org/jira/browse/HIVE-19580
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 2.3.2
> Environment: EMR s3:// connector
> Spark 2.3 but also true for lower versions
> Hive 2.3.2
>Reporter: Arthur Baudry
>Priority: Major
> Fix For: 2.3.2
>
>
> Original file is csv:
> COL1,COL2
>  1,2
> ORC files are created with Spark 2.3:
> scala> val df = spark.read.option("header","true").csv("/user/hadoop/file")
> scala> df.printSchema
>  root
> |– COL1: string (nullable = true)|
> |– COL2: string (nullable = true)|
> scala> df.write.orc("s3://bucket/prefix")
> In Hive:
> hive> CREATE EXTERNAL TABLE test_orc(COL1 STRING, COL2 STRING) STORED AS ORC 
> LOCATION ("s3://bucket/prefix");
> hive> SELECT * FROM test_orc;
>  OK
>  NULL NULL
> *Every field is null. However, if fields are generated using lower case in 
> Spark schemas then everything works.*
> The reason why I'm raising this bug is that we have customers using Hive 
> 2.3.2 to read files we generate through Spark and all our code base is 
> addressing fields using upper case while this is incompatible with their Hive 
> instance.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-16913) Support per-session S3 credentials

2018-11-02 Thread Steve Loughran (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-16913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16672912#comment-16672912
 ] 

Steve Loughran commented on HIVE-16913:
---

DTs aren't sufficient here as Hive uses its granted superuser rights to request 
DTs as a specific user from HDFS and YARN; you can't do this with object 
stores. Instead users will somehow have to be able to submit DTs with their 
queries

> Support per-session S3 credentials
> --
>
> Key: HIVE-16913
> URL: https://issues.apache.org/jira/browse/HIVE-16913
> Project: Hive
>  Issue Type: Improvement
>Reporter: Vihang Karajgaonkar
>Assignee: Vihang Karajgaonkar
>Priority: Major
>
> Currently, the credentials needed to support Hive-on-S3 (or any other 
> cloud storage) need to be added to the hive-site.xml, either using a Hadoop 
> credential provider or by adding the keys in the hive-site.xml in plain text 
> (insecure).
> This limits the use case to using a single S3 key. If we configure per-bucket 
> S3 keys as described [here | 
> http://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html#Configurations_different_S3_buckets]
>  it exposes the access to all the buckets to all the Hive users.
> It is possible that there are different sets of users who would not like to 
> share their buckets and still be able to process the data using Hive. 
> Enabling session-level credentials will help solve such use cases. For 
> example, currently this doesn't work:
> {noformat}
> set fs.s3a.secret.key=my_secret_key;
> set fs.s3a.access.key=my_access.key;
> {noformat}
> Because metastore is unaware of the the keys. This doesn't work either
> {noformat}
> set fs.s3a.secret.key=my_secret_key;
> set fs.s3a.access.key=my_access.key;
> set metaconf:fs.s3a.secret.key=my_secret_key;
> set metaconf:fs.s3a.access.key=my_access_key;
> {noformat}
> This is because only a certain metastore configurations defined in 
> {{HiveConf.MetaVars}} are allowed to be set by the user. If we enable the 
> above approaches we could potentially allow multiple S3 credentials on a 
> per-session level basis.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-16295) Add support for using Hadoop's S3A OutputCommitter

2018-07-24 Thread Steve Loughran (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-16295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16554709#comment-16554709
 ] 

Steve Loughran commented on HIVE-16295:
---

w.r.t. Maven dependencies: if you are building against hadoop-3.x, the 
"hadoop-cloud-storage" POM artifact will pull in all the cloud modules of the 
specific version of Hadoop, while downgrading the hadoop-common dependency to 
"provided". You'll inevitably have to exclude Jackson versions which will creep 
in somewhere, but its goal is a lower-maintenance way to pull in cloud 
connectors.

> Add support for using Hadoop's S3A OutputCommitter
> --
>
> Key: HIVE-16295
> URL: https://issues.apache.org/jira/browse/HIVE-16295
> Project: Hive
>  Issue Type: Sub-task
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
>Priority: Major
> Attachments: HIVE-16295.1.WIP.patch, HIVE-16295.2.WIP.patch, 
> HIVE-16295.3.WIP.patch, HIVE-16295.4.patch, HIVE-16295.5.patch, 
> HIVE-16295.6.patch, HIVE-16295.7.patch, HIVE-16295.8.patch, HIVE-16295.9.patch
>
>
> Hive doesn't have integration with Hadoop's {{OutputCommitter}}, it uses a 
> {{NullOutputCommitter}} and uses its own commit logic spread across 
> {{FileSinkOperator}}, {{MoveTask}}, and {{Hive}}.
> The Hadoop community is building an {{OutputCommitter}} that integrates with 
> S3Guard and does a safe, coordinated commit of data on S3 inside individual 
> tasks (HADOOP-13786). If Hive can integrate with this new {{OutputCommitter}} 
> there would be a lot of benefits to Hive-on-S3:
> * Data is only written once; directly committing data at a task level means 
> no renames are necessary
> * The commit is done safely, in a coordinated manner; duplicate tasks (from 
> task retries or speculative execution) should not step on each other



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-16391) Publish proper Hive 1.2 jars (without including all dependencies in uber jar)

2018-06-08 Thread Steve Loughran (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-16391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16506017#comment-16506017
 ] 

Steve Loughran commented on HIVE-16391:
---

I'm pleased to see the kryo version stuff isn't an issue any more...what do the 
hive team have to say here?

> Publish proper Hive 1.2 jars (without including all dependencies in uber jar)
> -
>
> Key: HIVE-16391
> URL: https://issues.apache.org/jira/browse/HIVE-16391
> Project: Hive
>  Issue Type: Task
>  Components: Build Infrastructure
>Affects Versions: 1.2.2
>Reporter: Reynold Xin
>Assignee: Saisai Shao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.2.3
>
> Attachments: HIVE-16391.1.patch, HIVE-16391.patch
>
>
> Apache Spark currently depends on a forked version of Apache Hive. AFAIK, the 
> only change in the fork is to work around the issue that Hive publishes only 
> two sets of jars: one set with no dependency declared, and another with all 
> the dependencies included in the published uber jar. That is to say, Hive 
> doesn't publish a set of jars with the proper dependencies declared.
> There is general consensus on both sides that we should remove the forked 
> Hive.
> The change in the forked version is recorded here 
> https://github.com/JoshRosen/hive/tree/release-1.2.1-spark2
> Note that the fork in the past included other fixes but those have all become 
> unnecessary.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-16295) Add support for using Hadoop's S3A OutputCommitter

2018-06-06 Thread Steve Loughran (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-16295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16503216#comment-16503216
 ] 

Steve Loughran commented on HIVE-16295:
---


* PathOutputCommitterFactory; you can ask for that to become limited private + 
unstable and add Hive into the mix, add a MAPREDUCE patch
* for the other, again, a limited private + unstable for the internal commit 
constant, so we know to leave it alone, under HADOOP

bq. For the _SUCCESS file, is it something that is common to all 
PathOutputCommitter implementations

It's done in the S3A one, not done for FileOutputCommitter. The IBM Stocator 
committer also does a JSON manifest, just a different one (i.e. I don't know 
the details). We explicitly stuck a version marker on the one the S3A committer 
currently uses so as to allow for change, that is: the deser code will fail if 
that's not there/the wrong version.

FWIW, I do parse the file in my spark tests. Originally I had my own copy & 
paste of the file format, now I just import the s3a one.


> Add support for using Hadoop's S3A OutputCommitter
> --
>
> Key: HIVE-16295
> URL: https://issues.apache.org/jira/browse/HIVE-16295
> Project: Hive
>  Issue Type: Sub-task
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
>Priority: Major
> Attachments: HIVE-16295.1.WIP.patch, HIVE-16295.2.WIP.patch, 
> HIVE-16295.3.WIP.patch, HIVE-16295.4.patch, HIVE-16295.5.patch, 
> HIVE-16295.6.patch, HIVE-16295.7.patch
>
>
> Hive doesn't have integration with Hadoop's {{OutputCommitter}}, it uses a 
> {{NullOutputCommitter}} and uses its own commit logic spread across 
> {{FileSinkOperator}}, {{MoveTask}}, and {{Hive}}.
> The Hadoop community is building an {{OutputCommitter}} that integrates with 
> S3Guard and does a safe, coordinated commit of data on S3 inside individual 
> tasks (HADOOP-13786). If Hive can integrate with this new {{OutputCommitter}} 
> there would be a lot of benefits to Hive-on-S3:
> * Data is only written once; directly committing data at a task level means 
> no renames are necessary
> * The commit is done safely, in a coordinated manner; duplicate tasks (from 
> task retries or speculative execution) should not step on each other



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-16391) Publish proper Hive 1.2 jars (without including all dependencies in uber jar)

2018-06-05 Thread Steve Loughran (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-16391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16502129#comment-16502129
 ] 

Steve Loughran commented on HIVE-16391:
---

bq. The problem with that is that it changes the meaning of Hive's artifacts, 
so anybody currently importing hive-exec would see a breakage, and that's 
probably not desired.

probably true.

Obviously, its up to the hive team, but yes, the "purist" approach is unshaded 
with a shaded option.

One issue I recall from building that 1.2.1-spark JAR was that a very small bit 
of the Hive API used by Spark passed Kryo objects around. It wasn't enough to 
shade; we had to tweak the Hive source to import the previous Kryo package so 
that all was in sync. If that is now fixed through API changes or Spark/Hive 
version changes, life is simpler. Ideally: an API which didn't pass shaded 
classes around.

Where do things stand there?

> Publish proper Hive 1.2 jars (without including all dependencies in uber jar)
> -
>
> Key: HIVE-16391
> URL: https://issues.apache.org/jira/browse/HIVE-16391
> Project: Hive
>  Issue Type: Task
>  Components: Build Infrastructure
>Affects Versions: 1.2.2
>Reporter: Reynold Xin
>Assignee: Saisai Shao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.2.3
>
> Attachments: HIVE-16391.patch
>
>
> Apache Spark currently depends on a forked version of Apache Hive. AFAIK, the 
> only change in the fork is to work around the issue that Hive publishes only 
> two sets of jars: one set with no dependency declared, and another with all 
> the dependencies included in the published uber jar. That is to say, Hive 
> doesn't publish a set of jars with the proper dependencies declared.
> There is general consensus on both sides that we should remove the forked 
> Hive.
> The change in the forked version is recorded here 
> https://github.com/JoshRosen/hive/tree/release-1.2.1-spark2
> Note that the fork in the past included other fixes but those have all become 
> unnecessary.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-16391) Publish proper Hive 1.2 jars (without including all dependencies in uber jar)

2018-06-05 Thread Steve Loughran (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-16391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16501534#comment-16501534
 ] 

Steve Loughran commented on HIVE-16391:
---

This project generally uses .patch files attached to the JIRA.

> Publish proper Hive 1.2 jars (without including all dependencies in uber jar)
> -
>
> Key: HIVE-16391
> URL: https://issues.apache.org/jira/browse/HIVE-16391
> Project: Hive
>  Issue Type: Task
>  Components: Build Infrastructure
>Reporter: Reynold Xin
>Priority: Major
>  Labels: pull-request-available
>
> Apache Spark currently depends on a forked version of Apache Hive. AFAIK, the 
> only change in the fork is to work around the issue that Hive publishes only 
> two sets of jars: one set with no dependency declared, and another with all 
> the dependencies included in the published uber jar. That is to say, Hive 
> doesn't publish a set of jars with the proper dependencies declared.
> There is general consensus on both sides that we should remove the forked 
> Hive.
> The change in the forked version is recorded here 
> https://github.com/JoshRosen/hive/tree/release-1.2.1-spark2
> Note that the fork in the past included other fixes but those have all become 
> unnecessary.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-19580) Hive 2.3.2 with ORC files stored on S3 are case sensitive

2018-05-30 Thread Steve Loughran (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-19580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16495447#comment-16495447
 ] 

Steve Loughran commented on HIVE-19580:
---

Don't see why this should be S3-related.
* Can you replicate it on a normal Hadoop FS?
* If not, given s3:// is Amazon's closed-source connector, can you replicate it 
with the ASF's own s3a connector?

> Hive 2.3.2 with ORC files stored on S3 are case sensitive
> -
>
> Key: HIVE-19580
> URL: https://issues.apache.org/jira/browse/HIVE-19580
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 2.3.2
> Environment: AWS S3 to store files
> Spark 2.3 but also true for lower versions
> Hive 2.3.2
>Reporter: Arthur Baudry
>Priority: Major
> Fix For: 2.3.2
>
>
> Original file is csv:
> COL1,COL2
>  1,2
> ORC files are created with Spark 2.3:
> scala> val df = spark.read.option("header","true").csv("/user/hadoop/file")
> scala> df.printSchema
>  root
> |– COL1: string (nullable = true)|
> |– COL2: string (nullable = true)|
> scala> df.write.orc("s3://bucket/prefix")
> In Hive:
> hive> CREATE EXTERNAL TABLE test_orc(COL1 STRING, COL2 STRING) STORED AS ORC 
> LOCATION ("s3://bucket/prefix");
> hive> SELECT * FROM test_orc;
>  OK
>  NULL NULL
> *Every field is null. However, if fields are generated using lower case in 
> Spark schemas then everything works.*
> The reason why I'm raising this bug is that we have customers using Hive 
> 2.3.2 to read files we generate through Spark and all our code base is 
> addressing fields using upper case while this is incompatible with their Hive 
> instance.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-16295) Add support for using Hadoop's S3A OutputCommitter

2018-04-25 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16452645#comment-16452645
 ] 

Steve Loughran commented on HIVE-16295:
---

bq. is there a reason PathOutputCommitterFactory doesn't provide a way to 
construct a PathOutputCommitter using a JobContext rather than a 
TaskAttemptContext

I think it's because the only bits in Hadoop & Spark where committers were 
being constructed with JobContext alone were the v1 committers, which these 
committers don't (currently) support. It just kept things simpler all round not 
to have to worry about two similar-but-slightly-different constructors.

bq. does the DirectoryOutputCommitter work with Spark SQL or just Spark? I'

It should work as a drop-in replacement for a normal Hadoop 
FileOutputCommitter; it's not being clever the way the partitioned one is.

Regarding dynamic partitioning, the S3A committers do know which files they've 
created; that is stuff that goes in the manifest. If you load in the _SUCCESS 
file and read that section, you can infer it. If that works, then create a 
Hadoop JIRA "stabilize _SUCCESS format" and we'll think about what we can say 
"will always be retained".

Or is this file being created too late in your workflow?

> Add support for using Hadoop's S3A OutputCommitter
> --
>
> Key: HIVE-16295
> URL: https://issues.apache.org/jira/browse/HIVE-16295
> Project: Hive
>  Issue Type: Sub-task
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
>Priority: Major
> Attachments: HIVE-16295.1.WIP.patch, HIVE-16295.2.WIP.patch
>
>
> Hive doesn't have integration with Hadoop's {{OutputCommitter}}, it uses a 
> {{NullOutputCommitter}} and uses its own commit logic spread across 
> {{FileSinkOperator}}, {{MoveTask}}, and {{Hive}}.
> The Hadoop community is building an {{OutputCommitter}} that integrates with 
> S3Guard and does a safe, coordinate commit of data on S3 inside individual 
> tasks (HADOOP-13786). If Hive can integrate with this new {{OutputCommitter}} 
> there would be a lot of benefits to Hive-on-S3:
> * Data is only written once; directly committing data at a task level means 
> no renames are necessary
> * The commit is done safely, in a coordinated manner; duplicate tasks (from 
> task retries or speculative execution) should not step on each other



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-16295) Add support for using Hadoop's S3A OutputCommitter

2018-04-25 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16452112#comment-16452112
 ] 

Steve Loughran commented on HIVE-16295:
---

One other comment: you can rely on _SUCCESS being a JSON file of size > 0 once 
you've switched to the new committer; it's got internal structures listing the 
committer used and the files created. This is invaluable for testing.

> Add support for using Hadoop's S3A OutputCommitter
> --
>
> Key: HIVE-16295
> URL: https://issues.apache.org/jira/browse/HIVE-16295
> Project: Hive
>  Issue Type: Sub-task
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
>Priority: Major
> Attachments: HIVE-16295.1.WIP.patch, HIVE-16295.2.WIP.patch
>
>
> Hive doesn't have integration with Hadoop's {{OutputCommitter}}, it uses a 
> {{NullOutputCommitter}} and uses its own commit logic spread across 
> {{FileSinkOperator}}, {{MoveTask}}, and {{Hive}}.
> The Hadoop community is building an {{OutputCommitter}} that integrates with 
> S3Guard and does a safe, coordinated commit of data on S3 inside individual 
> tasks (HADOOP-13786). If Hive can integrate with this new {{OutputCommitter}} 
> there would be a lot of benefits to Hive-on-S3:
> * Data is only written once; directly committing data at a task level means 
> no renames are necessary
> * The commit is done safely, in a coordinated manner; duplicate tasks (from 
> task retries or speculative execution) should not step on each other



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-16295) Add support for using Hadoop's S3A OutputCommitter

2018-04-24 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16450480#comment-16450480
 ] 

Steve Loughran commented on HIVE-16295:
---

Impressive. I'm not knowledgeable enough about Hive to review this.

One thing I'll highlight is that you shouldn't have to be doing the reflection 
stuff: part of the committer design is the notion of a per-store committer 
factory which you can connect to so as to dynamically get the right one for 
your store (so if wasb added its own, you'd get that one too). Is there a 
reason this didn't work (i.e. is there something we can do to get it working)?
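A hedged sketch of how that factory lookup can be wired, assuming the Hadoop 3.1+ PathOutputCommitterFactory API; the class and method names here are illustrative and this is not a claim about what the patch does:

{code:java}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.PathOutputCommitter;
import org.apache.hadoop.mapreduce.lib.output.PathOutputCommitterFactory;

public class CommitterFactoryExample {
  // Look up whichever committer factory is registered for the destination
  // filesystem/scheme and let it build the committer - no reflection needed.
  static PathOutputCommitter committerFor(Path output, TaskAttemptContext context)
      throws IOException {
    Configuration conf = context.getConfiguration();
    PathOutputCommitterFactory factory =
        PathOutputCommitterFactory.getCommitterFactory(output, conf);
    return factory.createOutputCommitter(output, context);
  }
}
{code}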

> Add support for using Hadoop's S3A OutputCommitter
> --
>
> Key: HIVE-16295
> URL: https://issues.apache.org/jira/browse/HIVE-16295
> Project: Hive
>  Issue Type: Sub-task
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
>Priority: Major
> Attachments: HIVE-16295.1.WIP.patch, HIVE-16295.2.WIP.patch
>
>
> Hive doesn't have integration with Hadoop's {{OutputCommitter}}, it uses a 
> {{NullOutputCommitter}} and uses its own commit logic spread across 
> {{FileSinkOperator}}, {{MoveTask}}, and {{Hive}}.
> The Hadoop community is building an {{OutputCommitter}} that integrates with 
> S3Guard and does a safe, coordinated commit of data on S3 inside individual 
> tasks (HADOOP-13786). If Hive can integrate with this new {{OutputCommitter}} 
> there would be a lot of benefits to Hive-on-S3:
> * Data is only written once; directly committing data at a task level means 
> no renames are necessary
> * The commit is done safely, in a coordinated manner; duplicate tasks (from 
> task retries or speculative execution) should not step on each other



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-18861) druid-hdfs-storage is pulling in hadoop-aws-2.7.x and aws SDK, creating classpath problems on hadoop 3.x

2018-03-09 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16392751#comment-16392751
 ] 

Steve Loughran commented on HIVE-18861:
---

Thanks for your help nurturing this in.

> druid-hdfs-storage is pulling in hadoop-aws-2.7.x and aws SDK, creating 
> classpath problems on hadoop 3.x
> 
>
> Key: HIVE-18861
> URL: https://issues.apache.org/jira/browse/HIVE-18861
> Project: Hive
>  Issue Type: Sub-task
>  Components: Druid integration
>Affects Versions: 3.0.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: HIVE-18861-001.patch, HIVE-18861-001.patch, 
> HIVE-18861.patch, HIVE-18861.patch
>
>
> druid-hdfs-storage JAR is transitively pulling in hadoop-aws JAR 2.7.3, which 
> creates classpath problems as a set of aws-sdk 1.10.77 JARs get on the CP, 
> even with Hadoop 3 & its move to a full aws-sdk-bundle JAR.
> Two options
> # exclude the dependency
> # force it up to whatever ${hadoop.version} is, so make it consistent



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-18861) druid-hdfs-storage is pulling in hadoop-aws-2.7.x and aws SDK, creating classpath problems on hadoop 3.x

2018-03-08 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16391555#comment-16391555
 ] 

Steve Loughran commented on HIVE-18861:
---

I don't see these tests as being related; there's nothing Druid- or AWS-related 
directly in the stacks, e.g. TestSequenceFileHCatStorer.testWriteChar:
{code}
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 6017: 
JobID: job_local1096640948_0048 Reason: ExitCodeException exitCode=1: chmod: 
cannot access 
‘/tmp/hadoop/mapred/staging/hiveptest1096640948/.staging/job_local1096640948_0048/job.split’:
 No such file or directory
{code}

However, without  experience running the Hive tests, I could be utterly wrong. 

What to do now?

> druid-hdfs-storage is pulling in hadoop-aws-2.7.x and aws SDK, creating 
> classpath problems on hadoop 3.x
> 
>
> Key: HIVE-18861
> URL: https://issues.apache.org/jira/browse/HIVE-18861
> Project: Hive
>  Issue Type: Sub-task
>  Components: Druid integration
>Affects Versions: 3.0.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Major
> Attachments: HIVE-18861-001.patch, HIVE-18861-001.patch, 
> HIVE-18861.patch, HIVE-18861.patch
>
>
> druid-hdfs-storage JAR is transitively pulling in hadoop-aws JAR 2.7.3, which 
> creates classpath problems as a set of aws-sdk 1.10.77 JARs get on the CP, 
> even with Hadoop 3 & its move to a full aws-sdk-bundle JAR.
> Two options
> # exclude the dependency
> # force it up to whatever ${hadoop.version} is, so make it consistent



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-18861) druid-hdfs-storage is pulling in hadoop-aws-2.7.x and aws SDK, creating classpath problems on hadoop 3.x

2018-03-07 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-18861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated HIVE-18861:
--
Status: Patch Available  (was: Open)

> druid-hdfs-storage is pulling in hadoop-aws-2.7.x and aws SDK, creating 
> classpath problems on hadoop 3.x
> 
>
> Key: HIVE-18861
> URL: https://issues.apache.org/jira/browse/HIVE-18861
> Project: Hive
>  Issue Type: Sub-task
>  Components: Druid integration
>Affects Versions: 3.0.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Major
> Attachments: HIVE-18861-001.patch, HIVE-18861-001.patch, 
> HIVE-18861.patch, HIVE-18861.patch
>
>
> druid-hdfs-storage JAR is transitively pulling in hadoop-aws JAR 2.7.3, which 
> creates classpath problems as a set of aws-sdk 1.10.77 JARs get on the CP, 
> even with Hadoop 3 & its move to a full aws-sdk-bundle JAR.
> Two options
> # exclude the dependency
> # force it up to whatever ${hadoop.version} is, so make it consistent



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-18861) druid-hdfs-storage is pulling in hadoop-aws-2.7.x and aws SDK, creating classpath problems on hadoop 3.x

2018-03-07 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-18861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated HIVE-18861:
--
Attachment: HIVE-18861.patch

> druid-hdfs-storage is pulling in hadoop-aws-2.7.x and aws SDK, creating 
> classpath problems on hadoop 3.x
> 
>
> Key: HIVE-18861
> URL: https://issues.apache.org/jira/browse/HIVE-18861
> Project: Hive
>  Issue Type: Sub-task
>  Components: Druid integration
>Affects Versions: 3.0.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Major
> Attachments: HIVE-18861-001.patch, HIVE-18861-001.patch, 
> HIVE-18861.patch, HIVE-18861.patch
>
>
> druid-hdfs-storage JAR is transitively pulling in hadoop-aws JAR 2.7.3, which 
> creates classpath problems as a set of aws-sdk 1.10.77 JARs get on the CP, 
> even with Hadoop 3 & its move to a full aws-sdk-bundle JAR.
> Two options
> # exclude the dependency
> # force it up to whatever ${hadoop.version} is, so make it consistent



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-18861) druid-hdfs-storage is pulling in hadoop-aws-2.7.x and aws SDK, creating classpath problems on hadoop 3.x

2018-03-07 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390366#comment-16390366
 ] 

Steve Loughran commented on HIVE-18861:
---

Not seeing any updates after 9h. Cancelling and reattaching

> druid-hdfs-storage is pulling in hadoop-aws-2.7.x and aws SDK, creating 
> classpath problems on hadoop 3.x
> 
>
> Key: HIVE-18861
> URL: https://issues.apache.org/jira/browse/HIVE-18861
> Project: Hive
>  Issue Type: Sub-task
>  Components: Druid integration
>Affects Versions: 3.0.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Major
> Attachments: HIVE-18861-001.patch, HIVE-18861-001.patch, 
> HIVE-18861.patch, HIVE-18861.patch
>
>
> druid-hdfs-storage JAR is transitively pulling in hadoop-aws JAR 2.7.3, which 
> creates classpath problems as a set of aws-sdk 1.10.77 JARs get on the CP, 
> even with Hadoop 3 & its move to a full aws-sdk-bundle JAR.
> Two options
> # exclude the dependency
> # force it up to whatever ${hadoop.version} is, so make it consistent



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-18861) druid-hdfs-storage is pulling in hadoop-aws-2.7.x and aws SDK, creating classpath problems on hadoop 3.x

2018-03-07 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-18861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated HIVE-18861:
--
Status: Open  (was: Patch Available)

> druid-hdfs-storage is pulling in hadoop-aws-2.7.x and aws SDK, creating 
> classpath problems on hadoop 3.x
> 
>
> Key: HIVE-18861
> URL: https://issues.apache.org/jira/browse/HIVE-18861
> Project: Hive
>  Issue Type: Sub-task
>  Components: Druid integration
>Affects Versions: 3.0.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Major
> Attachments: HIVE-18861-001.patch, HIVE-18861-001.patch, 
> HIVE-18861.patch, HIVE-18861.patch
>
>
> druid-hdfs-storage JAR is transitively pulling in hadoop-aws JAR 2.7.3, which 
> creates classpath problems as a set of aws-sdk 1.10.77 JARs get on the CP, 
> even with Hadoop 3 & its move to a full aws-sdk-bundle JAR.
> Two options
> # exclude the dependency
> # force it up to whatever ${hadoop.version} is, so make it consistent



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-18861) druid-hdfs-storage is pulling in hadoop-aws-2.7.x and aws SDK, creating classpath problems on hadoop 3.x

2018-03-07 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-18861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated HIVE-18861:
--
Status: Patch Available  (was: Open)

Got it; cut the -version marker. You must be using a different Yetus version.

> druid-hdfs-storage is pulling in hadoop-aws-2.7.x and aws SDK, creating 
> classpath problems on hadoop 3.x
> 
>
> Key: HIVE-18861
> URL: https://issues.apache.org/jira/browse/HIVE-18861
> Project: Hive
>  Issue Type: Sub-task
>  Components: Druid integration
>Affects Versions: 3.0.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Major
> Attachments: HIVE-18861-001.patch, HIVE-18861-001.patch, 
> HIVE-18861.patch
>
>
> druid-hdfs-storage JAR is transitively pulling in hadoop-aws JAR 2.7.3, which 
> creates classpath problems as a set of aws-sdk 1.10.77 JARs get on the CP, 
> even with Hadoop 3 & its move to a full aws-sdk-bundle JAR.
> Two options
> # exclude the dependency
> # force it up to whatever ${hadoop.version} is, so make it consistent



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-18861) druid-hdfs-storage is pulling in hadoop-aws-2.7.x and aws SDK, creating classpath problems on hadoop 3.x

2018-03-07 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-18861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated HIVE-18861:
--
Status: Open  (was: Patch Available)

> druid-hdfs-storage is pulling in hadoop-aws-2.7.x and aws SDK, creating 
> classpath problems on hadoop 3.x
> 
>
> Key: HIVE-18861
> URL: https://issues.apache.org/jira/browse/HIVE-18861
> Project: Hive
>  Issue Type: Sub-task
>  Components: Druid integration
>Affects Versions: 3.0.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Major
> Attachments: HIVE-18861-001.patch, HIVE-18861-001.patch, 
> HIVE-18861.patch
>
>
> druid-hdfs-storage JAR is transitively pulling in hadoop-aws JAR 2.7.3, which 
> creates classpath problems as a set of aws-sdk 1.10.77 JARs get on the CP, 
> even with Hadoop 3 & its move to a full aws-sdk-bundle JAR.
> Two options
> # exclude the dependency
> # force it up to whatever ${hadoop.version} is, so make it consistent



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-18861) druid-hdfs-storage is pulling in hadoop-aws-2.7.x and aws SDK, creating classpath problems on hadoop 3.x

2018-03-07 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-18861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated HIVE-18861:
--
Attachment: HIVE-18861.patch

> druid-hdfs-storage is pulling in hadoop-aws-2.7.x and aws SDK, creating 
> classpath problems on hadoop 3.x
> 
>
> Key: HIVE-18861
> URL: https://issues.apache.org/jira/browse/HIVE-18861
> Project: Hive
>  Issue Type: Sub-task
>  Components: Druid integration
>Affects Versions: 3.0.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Major
> Attachments: HIVE-18861-001.patch, HIVE-18861-001.patch, 
> HIVE-18861.patch
>
>
> druid-hdfs-storage JAR is transitively pulling in hadoop-aws JAR 2.7.3, which 
> creates classpath problems as a set of aws-sdk 1.10.77 JARs get on the CP, 
> even with Hadoop 3 & its move to a full aws-sdk-bundle JAR.
> Two options
> # exclude the dependency
> # force it up to whatever ${hadoop.version} is, so make it consistent



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-18861) druid-hdfs-storage is pulling in hadoop-aws-2.7.x and aws SDK, creating classpath problems on hadoop 3.x

2018-03-06 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-18861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated HIVE-18861:
--
Status: Open  (was: Patch Available)

> druid-hdfs-storage is pulling in hadoop-aws-2.7.x and aws SDK, creating 
> classpath problems on hadoop 3.x
> 
>
> Key: HIVE-18861
> URL: https://issues.apache.org/jira/browse/HIVE-18861
> Project: Hive
>  Issue Type: Sub-task
>  Components: Druid integration
>Affects Versions: 3.0.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Major
> Attachments: HIVE-18861-001.patch, HIVE-18861-001.patch
>
>
> druid-hdfs-storage JAR is transitively pulling in hadoop-aws JAR 2.7.3, which 
> creates classpath problems as a set of aws-sdk 1.10.77 JARs get on the CP, 
> even with Hadoop 3 & its move to a full aws-sdk-bundle JAR.
> Two options
> # exclude the dependency
> # force it up to whatever ${hadoop.version} is, so make it consistent



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-18861) druid-hdfs-storage is pulling in hadoop-aws-2.7.x and aws SDK, creating classpath problems on hadoop 3.x

2018-03-06 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-18861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated HIVE-18861:
--
Attachment: HIVE-18861-001.patch

> druid-hdfs-storage is pulling in hadoop-aws-2.7.x and aws SDK, creating 
> classpath problems on hadoop 3.x
> 
>
> Key: HIVE-18861
> URL: https://issues.apache.org/jira/browse/HIVE-18861
> Project: Hive
>  Issue Type: Sub-task
>  Components: Druid integration
>Affects Versions: 3.0.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Major
> Attachments: HIVE-18861-001.patch, HIVE-18861-001.patch
>
>
> druid-hdfs-storage JAR is transitively pulling in hadoop-aws JAR 2.7.3, which 
> creates classpath problems as a set of aws-sdk 1.10.77 JARs get on the CP, 
> even with Hadoop 3 & its move to a full aws-sdk-bundle JAR.
> Two options
> # exclude the dependency
> # force it up to whatever ${hadoop.version} is, so make it consistent



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-18861) druid-hdfs-storage is pulling in hadoop-aws-2.7.x and aws SDK, creating classpath problems on hadoop 3.x

2018-03-06 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-18861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated HIVE-18861:
--
Status: Patch Available  (was: Open)

> druid-hdfs-storage is pulling in hadoop-aws-2.7.x and aws SDK, creating 
> classpath problems on hadoop 3.x
> 
>
> Key: HIVE-18861
> URL: https://issues.apache.org/jira/browse/HIVE-18861
> Project: Hive
>  Issue Type: Sub-task
>  Components: Druid integration
>Affects Versions: 3.0.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Major
> Attachments: HIVE-18861-001.patch, HIVE-18861-001.patch
>
>
> druid-hdfs-storage JAR is transitively pulling in hadoop-aws JAR 2.7.3, which 
> creates classpath problems as a set of aws-sdk 1.10.77 JARs get on the CP, 
> even with Hadoop 3 & its move to a full aws-sdk-bundle JAR.
> Two options
> # exclude the dependency
> # force it up to whatever ${hadoop.version} is, so make it consistent



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-18861) druid-hdfs-storage is pulling in hadoop-aws-2.7.x and aws SDK, creating classpath problems on hadoop 3.x

2018-03-06 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16387697#comment-16387697
 ] 

Steve Loughran commented on HIVE-18861:
---

[~ashutoshc]: I don't see jira running tests here... is there something I need to 
do?

> druid-hdfs-storage is pulling in hadoop-aws-2.7.x and aws SDK, creating 
> classpath problems on hadoop 3.x
> 
>
> Key: HIVE-18861
> URL: https://issues.apache.org/jira/browse/HIVE-18861
> Project: Hive
>  Issue Type: Sub-task
>  Components: Druid integration
>Affects Versions: 3.0.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Major
> Attachments: HIVE-18861-001.patch
>
>
> druid-hdfs-storage JAR is transitively pulling in hadoop-aws JAR 2.7.3, which 
> creates classpath problems as a set of aws-sdk 1.10.77 JARs get on the CP, 
> even with Hadoop 3 & its move to a full aws-sdk-bundle JAR.
> Two options
> # exclude the dependency
> # force it up to whatever ${hadoop.version} is, so make it consistent



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-18861) druid-hdfs-storage is pulling in hadoop-aws-2.7.x and aws SDK, creating classpath problems on hadoop 3.x

2018-03-05 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386345#comment-16386345
 ] 

Steve Loughran commented on HIVE-18861:
---

Thanks! If this goes in, it will be my first contribution to Hive... removing my 
own code from the classpath :)

> druid-hdfs-storage is pulling in hadoop-aws-2.7.x and aws SDK, creating 
> classpath problems on hadoop 3.x
> 
>
> Key: HIVE-18861
> URL: https://issues.apache.org/jira/browse/HIVE-18861
> Project: Hive
>  Issue Type: Sub-task
>  Components: Druid integration
>Affects Versions: 3.0.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Major
> Attachments: HIVE-18861-001.patch
>
>
> druid-hdfs-storage JAR is transitively pulling in hadoop-aws JAR 2.7.3, which 
> creates classpath problems as a set of aws-sdk 1.10.77 JARs get on the CP, 
> even with Hadoop 3 & its move to a full aws-sdk-bundle JAR.
> Two options
> # exclude the dependency
> # force it up to whatever ${hadoop.version} is, so make it consistent



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-18861) druid-hdfs-storage is pulling in hadoop-aws-2.7.x and aws SDK, creating classpath problems on hadoop 3.x

2018-03-05 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-18861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated HIVE-18861:
--
Status: Patch Available  (was: Open)

> druid-hdfs-storage is pulling in hadoop-aws-2.7.x and aws SDK, creating 
> classpath problems on hadoop 3.x
> 
>
> Key: HIVE-18861
> URL: https://issues.apache.org/jira/browse/HIVE-18861
> Project: Hive
>  Issue Type: Sub-task
>  Components: Druid integration
>Affects Versions: 3.0.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Major
> Attachments: HIVE-18861-001.patch
>
>
> druid-hdfs-storage JAR is transitively pulling in hadoop-aws JAR 2.7.3, which 
> creates classpath problems as a set of aws-sdk 1.10.77 JARs get on the CP, 
> even with Hadoop 3 & its move to a full aws-sdk-bundle JAR.
> Two options
> # exclude the dependency
> # force it up to whatever ${hadoop.version} is, so make it consistent



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-18861) druid-hdfs-storage is pulling in hadoop-aws-2.7.x and aws SDK, creating classpath problems on hadoop 3.x

2018-03-05 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386228#comment-16386228
 ] 

Steve Loughran commented on HIVE-18861:
---

Dependencies before the patch when built against hadoop branch-3.1
{code}
[INFO] |  +- io.druid.extensions:druid-hdfs-storage:jar:0.11.0:compile
[INFO] |  |  +- (org.apache.hadoop:hadoop-client:jar:3.1.0-SNAPSHOT:compile - 
version managed from 2.7.3; omitted for duplicate)
[INFO] |  |  +- org.apache.hadoop:hadoop-aws:jar:2.7.3:compile
[INFO] |  |  |  \- (org.apache.hadoop:hadoop-common:jar:3.1.0-SNAPSHOT:compile 
- version managed from 2.7.3; omitted for duplicate)
[INFO] |  |  \- com.amazonaws:aws-java-sdk-s3:jar:1.10.77:compile
[INFO] |  | +- com.amazonaws:aws-java-sdk-kms:jar:1.10.77:compile
[INFO] |  | |  \- (com.amazonaws:aws-java-sdk-core:jar:1.10.77:compile - 
omitted for duplicate)
[INFO] |  | \- com.amazonaws:aws-java-sdk-core:jar:1.10.77:compile
[INFO] |  |+- (commons-logging:commons-logging:jar:1.1.3:compile - 
omitted for conflict with 1.0.4)
[INFO] |  |+- (org.apache.httpcomponents:httpclient:jar:4.5.2:compile - 
version managed from 4.3.6; omitted for duplicate)
[INFO] |  |+- 
com.fasterxml.jackson.dataformat:jackson-dataformat-cbor:jar:2.5.3:compile
{code}

> druid-hdfs-storage is pulling in hadoop-aws-2.7.x and aws SDK, creating 
> classpath problems on hadoop 3.x
> 
>
> Key: HIVE-18861
> URL: https://issues.apache.org/jira/browse/HIVE-18861
> Project: Hive
>  Issue Type: Sub-task
>  Components: Druid integration
>Affects Versions: 3.0.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Major
> Attachments: HIVE-18861-001.patch
>
>
> druid-hdfs-storage JAR is transitively pulling in hadoop-aws JAR 2.7.3, which 
> creates classpath problems as a set of aws-sdk 1.10.77 JARs get on the CP, 
> even with Hadoop 3 & its move to a full aws-sdk-bundle JAR.
> Two options
> # exclude the dependency
> # force it up to whatever ${hadoop.version} is, so make it consistent



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-18861) druid-hdfs-storage is pulling in hadoop-aws-2.7.x and aws SDK, creating classpath problems on hadoop 3.x

2018-03-05 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386231#comment-16386231
 ] 

Steve Loughran commented on HIVE-18861:
---

And after
{code}
[INFO] |  +- io.druid.extensions:druid-hdfs-storage:jar:0.11.0:compile
[INFO] |  |  \- (org.apache.hadoop:hadoop-client:jar:3.1.0-SNAPSHOT:compile - 
version managed from 2.7.3; omitted for duplicate)
{code}


> druid-hdfs-storage is pulling in hadoop-aws-2.7.x and aws SDK, creating 
> classpath problems on hadoop 3.x
> 
>
> Key: HIVE-18861
> URL: https://issues.apache.org/jira/browse/HIVE-18861
> Project: Hive
>  Issue Type: Sub-task
>  Components: Druid integration
>Affects Versions: 3.0.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Major
> Attachments: HIVE-18861-001.patch
>
>
> druid-hdfs-storage JAR is transitively pulling in hadoop-aws JAR 2.7.3, which 
> creates classpath problems as a set of aws-sdk 1.10.77 JARs get on the CP, 
> even with Hadoop 3 & its move to a full aws-sdk-bundle JAR.
> Two options
> # exclude the dependency
> # force it up to whatever ${hadoop.version} is, so make it consistent



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-18861) druid-hdfs-storage is pulling in hadoop-aws-2.7.x and aws SDK, creating classpath problems on hadoop 3.x

2018-03-05 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-18861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated HIVE-18861:
--
Attachment: HIVE-18861-001.patch

> druid-hdfs-storage is pulling in hadoop-aws-2.7.x and aws SDK, creating 
> classpath problems on hadoop 3.x
> 
>
> Key: HIVE-18861
> URL: https://issues.apache.org/jira/browse/HIVE-18861
> Project: Hive
>  Issue Type: Sub-task
>  Components: Druid integration
>Affects Versions: 3.0.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Major
> Attachments: HIVE-18861-001.patch
>
>
> druid-hdfs-storage JAR is transitively pulling in hadoop-aws JAR 2.7.3, which 
> creates classpath problems as a set of aws-sdk 1.10.77 JARs get on the CP, 
> even with Hadoop 3 & its move to a full aws-sdk-bundle JAR.
> Two options
> # exclude the dependency
> # force it up to whatever ${hadoop.version} is, so make it consistent



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-18861) druid-hdfs-storage is pulling in hadoop-aws-2.7.x and aws SDK, creating classpath problems on hadoop 3.x

2018-03-05 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386225#comment-16386225
 ] 

Steve Loughran commented on HIVE-18861:
---

Patch 001: pulls out the hadoop-aws JAR and the AWS SDK, which is imported in 
parallel.

> druid-hdfs-storage is pulling in hadoop-aws-2.7.x and aws SDK, creating 
> classpath problems on hadoop 3.x
> 
>
> Key: HIVE-18861
> URL: https://issues.apache.org/jira/browse/HIVE-18861
> Project: Hive
>  Issue Type: Sub-task
>  Components: Druid integration
>Affects Versions: 3.0.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Major
> Attachments: HIVE-18861-001.patch
>
>
> druid-hdfs-storage JAR is transitively pulling in hadoop-aws JAR 2.7.3, which 
> creates classpath problems as a set of aws-sdk 1.10.77 JARs get on the CP, 
> even with Hadoop 3 & its move to a full aws-sdk-bundle JAR.
> Two options
> # exclude the dependency
> # force it up to whatever ${hadoop.version} is, so make it consistent



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-18861) druid-hdfs-storage is pulling in hadoop-aws-2.7.x and aws SDK, creating classpath problems on hadoop 3.x

2018-03-05 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-18861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated HIVE-18861:
--
Summary: druid-hdfs-storage is pulling in hadoop-aws-2.7.x and aws SDK, 
creating classpath problems on hadoop 3.x  (was: druid-hdfs-storage is pulling 
in hadoop-aws-2.7.2, creating classpath problems on hadoop 3.x)

> druid-hdfs-storage is pulling in hadoop-aws-2.7.x and aws SDK, creating 
> classpath problems on hadoop 3.x
> 
>
> Key: HIVE-18861
> URL: https://issues.apache.org/jira/browse/HIVE-18861
> Project: Hive
>  Issue Type: Sub-task
>  Components: Druid integration
>Affects Versions: 3.0.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Major
>
> druid-hdfs-storage JAR is transitively pulling in hadoop-aws JAR 2.7.3, which 
> creates classpath problems as a set of aws-sdk 1.10.77 JARs get on the CP, 
> even with Hadoop 3 & its move to a full aws-sdk-bundle JAR.
> Two options
> # exclude the dependency
> # force it up to whatever ${hadoop.version} is, so make it consistent



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-18861) druid-hdfs-storage is pulling in hadoop-aws-2.7.2, creating classpath problems on hadoop 3.x

2018-03-05 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-18861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated HIVE-18861:
--
Description: 
druid-hdfs-storage JAR is transitively pulling in hadoop-aws JAR 2.7.3, which 
creates classpath problems as a set of aws-sdk 1.10.77 JARs get on the CP, even 
with Hadoop 3 & its move to a full aws-sdk-bundle JAR.

Two options
# exclude the dependency
# force it up to whatever ${hadoop.version} is, so make it consistent



  was:
Druid server JAR is transitively pulling in hadoop-aws JAR 2.7.3, which creates 
classpath problems as a set of aws-sdk 1.10.77 JARs get on the CP, even with 
Hadoop 3 & its move to a full aws-sdk-bundle JAR.

Two options
# exclude the dependency
# force it up to whatever ${hadoop.version} is, so make it consistent




> druid-hdfs-storage is pulling in hadoop-aws-2.7.2, creating classpath 
> problems on hadoop 3.x
> 
>
> Key: HIVE-18861
> URL: https://issues.apache.org/jira/browse/HIVE-18861
> Project: Hive
>  Issue Type: Sub-task
>  Components: Druid integration
>Affects Versions: 3.0.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Major
>
> druid-hdfs-storage JAR is transitively pulling in hadoop-aws JAR 2.7.3, which 
> creates classpath problems as a set of aws-sdk 1.10.77 JARs get on the CP, 
> even with Hadoop 3 & its move to a full aws-sdk-bundle JAR.
> Two options
> # exclude the dependency
> # force it up to whatever ${hadoop.version} is, so make it consistent



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-18861) druid-hdfs-storage is pulling in hadoop-aws-2.7.2, creating classpath problems on hadoop 3.x

2018-03-05 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-18861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated HIVE-18861:
--
Summary: druid-hdfs-storage is pulling in hadoop-aws-2.7.2, creating 
classpath problems on hadoop 3.x  (was: druid-server is pulling in 
hadoop-aws-2.7.2, creating classpath problems on hadoop 3.x)

> druid-hdfs-storage is pulling in hadoop-aws-2.7.2, creating classpath 
> problems on hadoop 3.x
> 
>
> Key: HIVE-18861
> URL: https://issues.apache.org/jira/browse/HIVE-18861
> Project: Hive
>  Issue Type: Sub-task
>  Components: Druid integration
>Affects Versions: 3.0.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Major
>
> Druid server JAR is transitively pulling in hadoop-aws JAR 2.7.3, which 
> creates classpath problems as a set of aws-sdk 1.10.77 JARs get on the CP, 
> even with Hadoop 3 & its move to a full aws-sdk-bundle JAR.
> Two options
> # exclude the dependency
> # force it up to whatever ${hadoop.version} is, so make it consistent



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (HIVE-18861) druid-server is pulling in hadoop-aws-2.7.2, creating classpath problems on hadoop 3.x

2018-03-05 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-18861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran reassigned HIVE-18861:
-


> druid-server is pulling in hadoop-aws-2.7.2, creating classpath problems on 
> hadoop 3.x
> --
>
> Key: HIVE-18861
> URL: https://issues.apache.org/jira/browse/HIVE-18861
> Project: Hive
>  Issue Type: Sub-task
>  Components: Druid integration
>Affects Versions: 3.0.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Major
>
> Druid server JAR is transitively pulling in hadoop-aws JAR 2.7.3, which 
> creates classpath problems as a set of aws-sdk 1.10.77 JARs get on the CP, 
> even with Hadoop 3 & its move to a full aws-sdk-bundle JAR.
> Two options
> # exclude the dependency
> # force it up to whatever ${hadoop.version} is, so make it consistent



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-1620) Patch to write directly to S3 from Hive

2018-03-05 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386192#comment-16386192
 ] 

Steve Loughran commented on HIVE-1620:
--

This is the wrong way to handle variations in FS semantics; once we add the 
ability to query FS capabilities (Hadoop 3.2?) then all filesystems could be 
probed for their semantics. Even so, I don't think this is correct. What we've 
done in HADOOP-13786 gives you atomic task commit and fast job-commit semantics 
without playing any rename games at all.

I'd recommend closing this as a WONTFIX, but re-emphasise that the underlying 
problem, "how to commit work to a store with neither consistency nor O(1) 
atomic renames", remains, at least for S3 & OpenStack Swift.

> Patch to write directly to S3 from Hive
> ---
>
> Key: HIVE-1620
> URL: https://issues.apache.org/jira/browse/HIVE-1620
> Project: Hive
>  Issue Type: New Feature
>Reporter: Vaibhav Aggarwal
>Assignee: Vaibhav Aggarwal
>Priority: Major
> Attachments: HIVE-1620.patch
>
>
> We want to submit a patch to Hive which allows user to write files directly 
> to S3.
> This patch allow user to specify an S3 location as the table output location 
> and hence eliminates the need  of copying data from HDFS to S3.
> Users can run Hive queries directly over the data stored in S3.
> This patch helps integrate hive with S3 better and quicker.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-16983) getFileStatus on accessible s3a://[bucket-name]/folder: throws com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden (Service: Amazon S3; Status Code: 403; Error

2017-07-11 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16082117#comment-16082117
 ] 

Steve Loughran commented on HIVE-16983:
---

* The Joda-Time update will be mandatory for S3A to authenticate.
* Even there, it won't like v4 endpoints. Which version are you up against?

Like I said, we're somewhat constrained by the requirement of "don't log 
secrets". Sometimes putting up a debugger and setting a breakpoint on the auth 
pipeline is the tactic. For Hadoop 2.8, that's 
{{org.apache.hadoop.fs.s3a.BasicAWSCredentialsProvider.getCredentials()}}.

> getFileStatus on accessible s3a://[bucket-name]/folder: throws 
> com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden (Service: Amazon 
> S3; Status Code: 403; Error Code: 403 Forbidden;
> -
>
> Key: HIVE-16983
> URL: https://issues.apache.org/jira/browse/HIVE-16983
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Affects Versions: 2.1.1
> Environment: Hive 2.1.1 on Ubuntu 14.04 AMI in AWS EC2, connecting to 
> S3 using s3a:// protocol
>Reporter: Alex Baretto
>Assignee: Vlad Gudikov
> Fix For: 2.1.1
>
> Attachments: HIVE-16983-branch-2.1.patch
>
>
> I've followed various published documentation on integrating Apache Hive 
> 2.1.1 with AWS S3 using the `s3a://` scheme, configuring `fs.s3a.access.key` 
> and 
> `fs.s3a.secret.key` for `hadoop/etc/hadoop/core-site.xml` and 
> `hive/conf/hive-site.xml`.
> I am at the point where I am able to get `hdfs dfs -ls s3a://[bucket-name]/` 
> to work properly (it returns s3 ls of that bucket). So I know my creds, 
> bucket access, and overall Hadoop setup is valid. 
> hdfs dfs -ls s3a://[bucket-name]/
> 
> drwxrwxrwx   - hdfs hdfs  0 2017-06-27 22:43 
> s3a://[bucket-name]/files
> ...etc. 
> hdfs dfs -ls s3a://[bucket-name]/files
> 
> drwxrwxrwx   - hdfs hdfs  0 2017-06-27 22:43 
> s3a://[bucket-name]/files/my-csv.csv
> However, when I attempt to access the same s3 resources from hive, e.g. run 
> any `CREATE SCHEMA` or `CREATE EXTERNAL TABLE` statements using `LOCATION 
> 's3a://[bucket-name]/files/'`, it fails. 
> for example:
> >CREATE EXTERNAL TABLE IF NOT EXISTS mydb.my_table ( my_table_id string, 
> >my_tstamp timestamp, my_sig bigint ) ROW FORMAT DELIMITED FIELDS TERMINATED 
> >BY ',' LOCATION 's3a://[bucket-name]/files/';
> I keep getting this error:
> >FAILED: Execution Error, return code 1 from 
> >org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:Got exception: 
> >java.nio.file.AccessDeniedException s3a://[bucket-name]/files: getFileStatus 
> >on s3a://[bucket-name]/files: 
> >com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden (Service: 
> >Amazon S3; Status Code: 403; Error Code: 403 Forbidden; Request ID: 
> >C9CF3F9C50EF08D1), S3 Extended Request ID: 
> >T2xZ87REKvhkvzf+hdPTOh7CA7paRpIp6IrMWnDqNFfDWerkZuAIgBpvxilv6USD0RSxM9ymM6I=)
> This makes no sense. I have access to the bucket as one can see in the hdfs 
> test. And I've added the proper creds to hive-site.xml. 
> Anyone have any idea what's missing from this equation?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-16983) getFileStatus on accessible s3a://[bucket-name]/folder: throws com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden (Service: Amazon S3; Status Code: 403; Error

2017-07-10 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16080062#comment-16080062
 ] 

Steve Loughran commented on HIVE-16983:
---

The patch itself LGTM from an S3A perspective.

One thing for Hive to consider: adding some explicit tests against S3, run 
iff it's set up with credentials. That's how I do all my S3 and Azure 
integration tests.
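
As an illustration only, here is a minimal JUnit 4 sketch of that pattern, 
assuming an invented test property ({{test.hive.s3a.location}}) that points at a 
bucket set aside for testing; the test skips rather than fails when nothing is 
configured:
{code}
// Sketch only: skip an S3A integration test unless a test location is configured.
// "test.hive.s3a.location" is an invented property name for this example.
import static org.junit.Assume.assumeTrue;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.junit.Test;

public class TestHiveS3AIntegration {

  @Test
  public void testListTableLocation() throws Exception {
    Configuration conf = new Configuration();
    String location = conf.getTrimmed("test.hive.s3a.location", "");
    // Skip, rather than fail, on machines without an S3 bucket/credentials set up.
    assumeTrue("No S3A test location configured", !location.isEmpty());

    Path base = new Path(location);
    FileSystem fs = base.getFileSystem(conf);
    fs.listStatus(base);  // simple smoke test against the real store
  }
}
{code}
How this would hook into the Hive test modules is left open here; the point is 
just the assume-then-skip shape.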

> getFileStatus on accessible s3a://[bucket-name]/folder: throws 
> com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden (Service: Amazon 
> S3; Status Code: 403; Error Code: 403 Forbidden;
> -
>
> Key: HIVE-16983
> URL: https://issues.apache.org/jira/browse/HIVE-16983
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Affects Versions: 2.1.1
> Environment: Hive 2.1.1 on Ubuntu 14.04 AMI in AWS EC2, connecting to 
> S3 using s3a:// protocol
>Reporter: Alex Baretto
>Assignee: Vlad Gudikov
> Fix For: 2.1.1
>
> Attachments: HIVE-16983-branch-2.1.patch
>
>
> I've followed various published documentation on integrating Apache Hive 
> 2.1.1 with AWS S3 using the `s3a://` scheme, configuring `fs.s3a.access.key` 
> and 
> `fs.s3a.secret.key` for `hadoop/etc/hadoop/core-site.xml` and 
> `hive/conf/hive-site.xml`.
> I am at the point where I am able to get `hdfs dfs -ls s3a://[bucket-name]/` 
> to work properly (it returns s3 ls of that bucket). So I know my creds, 
> bucket access, and overall Hadoop setup is valid. 
> hdfs dfs -ls s3a://[bucket-name]/
> 
> drwxrwxrwx   - hdfs hdfs  0 2017-06-27 22:43 
> s3a://[bucket-name]/files
> ...etc. 
> hdfs dfs -ls s3a://[bucket-name]/files
> 
> drwxrwxrwx   - hdfs hdfs  0 2017-06-27 22:43 
> s3a://[bucket-name]/files/my-csv.csv
> However, when I attempt to access the same s3 resources from hive, e.g. run 
> any `CREATE SCHEMA` or `CREATE EXTERNAL TABLE` statements using `LOCATION 
> 's3a://[bucket-name]/files/'`, it fails. 
> for example:
> >CREATE EXTERNAL TABLE IF NOT EXISTS mydb.my_table ( my_table_id string, 
> >my_tstamp timestamp, my_sig bigint ) ROW FORMAT DELIMITED FIELDS TERMINATED 
> >BY ',' LOCATION 's3a://[bucket-name]/files/';
> I keep getting this error:
> >FAILED: Execution Error, return code 1 from 
> >org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:Got exception: 
> >java.nio.file.AccessDeniedException s3a://[bucket-name]/files: getFileStatus 
> >on s3a://[bucket-name]/files: 
> >com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden (Service: 
> >Amazon S3; Status Code: 403; Error Code: 403 Forbidden; Request ID: 
> >C9CF3F9C50EF08D1), S3 Extended Request ID: 
> >T2xZ87REKvhkvzf+hdPTOh7CA7paRpIp6IrMWnDqNFfDWerkZuAIgBpvxilv6USD0RSxM9ymM6I=)
> This makes no sense. I have access to the bucket as one can see in the hdfs 
> test. And I've added the proper creds to hive-site.xml. 
> Anyone have any idea what's missing from this equation?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-16983) getFileStatus on accessible s3a://[bucket-name]/folder: throws com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden (Service: Amazon S3; Status Code: 403; Error

2017-07-03 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16072760#comment-16072760
 ] 

Steve Loughran commented on HIVE-16983:
---

good point

Everyone: look at the S3A troubleshooting docs before filing bugreps, thanks: 
[http://hadoop.apache.org/docs/r2.8.0/hadoop-aws/tools/hadoop-aws/index.html#Troubleshooting_S3A]

> getFileStatus on accessible s3a://[bucket-name]/folder: throws 
> com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden (Service: Amazon 
> S3; Status Code: 403; Error Code: 403 Forbidden;
> -
>
> Key: HIVE-16983
> URL: https://issues.apache.org/jira/browse/HIVE-16983
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Affects Versions: 2.1.1
> Environment: Hive 2.1.1 on Ubuntu 14.04 AMI in AWS EC2, connecting to 
> S3 using s3a:// protocol
>Reporter: Alex Baretto
>
> I've followed various published documentation on integrating Apache Hive 
> 2.1.1 with AWS S3 using the `s3a://` scheme, configuring `fs.s3a.access.key` 
> and 
> `fs.s3a.secret.key` for `hadoop/etc/hadoop/core-site.xml` and 
> `hive/conf/hive-site.xml`.
> I am at the point where I am able to get `hdfs dfs -ls s3a://[bucket-name]/` 
> to work properly (it returns s3 ls of that bucket). So I know my creds, 
> bucket access, and overall Hadoop setup is valid. 
> hdfs dfs -ls s3a://[bucket-name]/
> 
> drwxrwxrwx   - hdfs hdfs  0 2017-06-27 22:43 
> s3a://[bucket-name]/files
> ...etc. 
> hdfs dfs -ls s3a://[bucket-name]/files
> 
> drwxrwxrwx   - hdfs hdfs  0 2017-06-27 22:43 
> s3a://[bucket-name]/files/my-csv.csv
> However, when I attempt to access the same s3 resources from hive, e.g. run 
> any `CREATE SCHEMA` or `CREATE EXTERNAL TABLE` statements using `LOCATION 
> 's3a://[bucket-name]/files/'`, it fails. 
> for example:
> >CREATE EXTERNAL TABLE IF NOT EXISTS mydb.my_table ( my_table_id string, 
> >my_tstamp timestamp, my_sig bigint ) ROW FORMAT DELIMITED FIELDS TERMINATED 
> >BY ',' LOCATION 's3a://[bucket-name]/files/';
> I keep getting this error:
> >FAILED: Execution Error, return code 1 from 
> >org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:Got exception: 
> >java.nio.file.AccessDeniedException s3a://[bucket-name]/files: getFileStatus 
> >on s3a://[bucket-name]/files: 
> >com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden (Service: 
> >Amazon S3; Status Code: 403; Error Code: 403 Forbidden; Request ID: 
> >C9CF3F9C50EF08D1), S3 Extended Request ID: 
> >T2xZ87REKvhkvzf+hdPTOh7CA7paRpIp6IrMWnDqNFfDWerkZuAIgBpvxilv6USD0RSxM9ymM6I=)
> This makes no sense. I have access to the bucket as one can see in the hdfs 
> test. And I've added the proper creds to hive-site.xml. 
> Anyone have any idea what's missing from this equation?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-16983) getFileStatus on accessible s3a://[bucket-name]/folder: throws com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden (Service: Amazon S3; Status Code: 403; Error

2017-07-03 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16072239#comment-16072239
 ] 

Steve Loughran commented on HIVE-16983:
---

Clearly, somehow, your credentials aren't getting picked up. One problem here 
is that the S3A code can't log what's going on in any detail for security 
reasons (logging secrets is considered harmful), so I'm not sure what could be 
done here.

> getFileStatus on accessible s3a://[bucket-name]/folder: throws 
> com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden (Service: Amazon 
> S3; Status Code: 403; Error Code: 403 Forbidden;
> -
>
> Key: HIVE-16983
> URL: https://issues.apache.org/jira/browse/HIVE-16983
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Affects Versions: 2.1.1
> Environment: Hive 2.1.1 on Ubuntu 14.04 AMI in AWS EC2, connecting to 
> S3 using s3a:// protocol
>Reporter: Alex Baretto
>
> I've followed various published documentation on integrating Apache Hive 
> 2.1.1 with AWS S3 using the `s3a://` scheme, configuring `fs.s3a.access.key` 
> and 
> `fs.s3a.secret.key` for `hadoop/etc/hadoop/core-site.xml` and 
> `hive/conf/hive-site.xml`.
> I am at the point where I am able to get `hdfs dfs -ls s3a://[bucket-name]/` 
> to work properly (it returns s3 ls of that bucket). So I know my creds, 
> bucket access, and overall Hadoop setup is valid. 
> hdfs dfs -ls s3a://[bucket-name]/
> 
> drwxrwxrwx   - hdfs hdfs  0 2017-06-27 22:43 
> s3a://[bucket-name]/files
> ...etc. 
> hdfs dfs -ls s3a://[bucket-name]/files
> 
> drwxrwxrwx   - hdfs hdfs  0 2017-06-27 22:43 
> s3a://[bucket-name]/files/my-csv.csv
> However, when I attempt to access the same s3 resources from hive, e.g. run 
> any `CREATE SCHEMA` or `CREATE EXTERNAL TABLE` statements using `LOCATION 
> 's3a://[bucket-name]/files/'`, it fails. 
> for example:
> >CREATE EXTERNAL TABLE IF NOT EXISTS mydb.my_table ( my_table_id string, 
> >my_tstamp timestamp, my_sig bigint ) ROW FORMAT DELIMITED FIELDS TERMINATED 
> >BY ',' LOCATION 's3a://[bucket-name]/files/';
> I keep getting this error:
> >FAILED: Execution Error, return code 1 from 
> >org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:Got exception: 
> >java.nio.file.AccessDeniedException s3a://[bucket-name]/files: getFileStatus 
> >on s3a://[bucket-name]/files: 
> >com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden (Service: 
> >Amazon S3; Status Code: 403; Error Code: 403 Forbidden; Request ID: 
> >C9CF3F9C50EF08D1), S3 Extended Request ID: 
> >T2xZ87REKvhkvzf+hdPTOh7CA7paRpIp6IrMWnDqNFfDWerkZuAIgBpvxilv6USD0RSxM9ymM6I=)
> This makes no sense. I have access to the bucket as one can see in the hdfs 
> test. And I've added the proper creds to hive-site.xml. 
> Anyone have any idea what's missing from this equation?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-9012) Not able to move and populate the data fully on to the table when the scratch directory is on S3

2017-07-03 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16072238#comment-16072238
 ] 

Steve Loughran commented on HIVE-9012:
--

This is just rename() being emulated in S3 with a copy-and-delete.
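
To make that concrete, here is a rough sketch (not the actual Hive or S3 
connector code, just the shape of the operation) of what a directory "rename" 
amounts to against the AWS SDK: every object under the source prefix is copied 
server-side and then deleted, so the move takes time proportional to the data 
rather than being a metadata operation.
{code}
// Illustrative only: how a "rename" has to be emulated on S3.
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.ObjectListing;
import com.amazonaws.services.s3.model.S3ObjectSummary;

public class EmulatedRename {
  // Copy every object under srcPrefix to dstPrefix, then delete the originals.
  public static void rename(AmazonS3 s3, String bucket, String srcPrefix, String dstPrefix) {
    ObjectListing listing = s3.listObjects(bucket, srcPrefix);
    while (true) {
      for (S3ObjectSummary obj : listing.getObjectSummaries()) {
        String dstKey = dstPrefix + obj.getKey().substring(srcPrefix.length());
        s3.copyObject(bucket, obj.getKey(), bucket, dstKey); // O(bytes) per object
        s3.deleteObject(bucket, obj.getKey());
      }
      if (!listing.isTruncated()) {
        break;
      }
      listing = s3.listNextBatchOfObjects(listing);
    }
  }
}
{code}
With ~550GB sitting in the scratch directory, that copy phase alone dominates, 
which is why the final move appears to hang.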

> Not able to move and populate the data fully on to the table when the scratch 
> directory is on S3
> 
>
> Key: HIVE-9012
> URL: https://issues.apache.org/jira/browse/HIVE-9012
> Project: Hive
>  Issue Type: Bug
>  Components: Query Processor
>Affects Versions: 0.13.1
> Environment: Amazon AMI and S3 as storage service
>Reporter: Kolluru Som Shekhar Sharma
>Priority: Blocker
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> I have set the hive.exec.scratchDir to point to a directory on S3 and 
> external table is on S3 level. 
> I ran a simple query which extracts the key value pairs from JSON string 
> without any WHERE clause, and the about of data is ~500GB.  The query ran 
> fine, but when it is trying to move the data from the scratch directory it 
> doesn't complete. So i need to kill the process and manually need to move the 
> data.
> The data size in the scratch directory was nearly ~550GB
> I tried the same scenario with less data and putting where clause, it 
> completed successfully and data also gets populated in the table. I checked 
> the size in the table and in the scratch directory. The data in the table was 
> showing 2MB and the data in the scratch directory is 48.6GB



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-16913) Support per-session S3 credentials

2017-06-29 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16068127#comment-16068127
 ] 

Steve Loughran commented on HIVE-16913:
---

You are going to need a multi-tenant Hive service, such as LLAP.  Or start a 
new Hive Tez app within a YARN cluster, as a new user.

The workflow would be the same as passing HDFS delegation tokens around:

# client starts query
# client enumerates all the filesystems used in the query
# For each FS, if they support delegation tokens, their DTs are requested and 
added to the list of tokens
# This list of tokens is serialized and sent with the query
# At the far end, these are unmarshalled and added to the UGI user entry for 
the caller (Hadoop RPC does this)
# When the service then does {{currentUser.doAs()}}, those DTs will be 
available. 
# When a new FS instance is looked up, it will be mapped to (user, URL), so no 
user shares a filesystem instance
# Hence the S3 session tokens will be available for auth by a new S3A/AWS 
authenticator *only* for that user's FS instance
# And when the call is finished, if the filesystems for that user are released, 
they get cleaned up.

This is almost exactly what's done with HDFS access today, the big difference 
being that the delegation token is actually forwarded to HDFS itself (same for 
HBase). Here I'm saying "we only need to get it to the other client".
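
As a sketch of the client side of steps 2-4 above, using stock Hadoop APIs 
(the surrounding class and method names are invented for the example):
{code}
// Sketch: gather delegation tokens for every filesystem a query touches.
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.Credentials;

public class QueryTokenCollector {

  public static Credentials collectTokens(Configuration conf, List<Path> queryPaths,
      String renewer) throws Exception {
    Credentials creds = new Credentials();
    for (Path p : queryPaths) {
      FileSystem fs = p.getFileSystem(conf);
      // No-op for filesystems which don't issue delegation tokens.
      fs.addDelegationTokens(renewer, creds);
    }
    // creds can now be serialized and shipped with the query; the far end adds
    // it to the caller's UGI before the doAs() work starts.
    return creds;
  }
}
{code}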

> Support per-session S3 credentials
> --
>
> Key: HIVE-16913
> URL: https://issues.apache.org/jira/browse/HIVE-16913
> Project: Hive
>  Issue Type: Improvement
>Reporter: Vihang Karajgaonkar
>Assignee: Vihang Karajgaonkar
>
> Currently, the credentials needed to support Hive-on-S3 (or any other 
> cloud-storage) need to be to the hive-site.xml. Either using a hadoop 
> credential provider or by adding the keys in the hive-site.xml in plain text 
> (unsecure)
> This limits the usecase to using a single S3 key. If we configure per bucket 
> s3 keys like described [here | 
> http://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html#Configurations_different_S3_buckets]
>  it exposes the access to all the buckets to all the hive users.
> It is possible that there are different sets of users who would not like to 
> share there buckets and still be able to process the data using Hive. 
> Enabling session level credentials will help solve such use-cases. For 
> example, currently this doesn't work
> {noformat}
> set fs.s3a.secret.key=my_secret_key;
> set fs.s3a.access.key=my_access.key;
> {noformat}
> Because metastore is unaware of the the keys. This doesn't work either
> {noformat}
> set fs.s3a.secret.key=my_secret_key;
> set fs.s3a.access.key=my_access.key;
> set metaconf:fs.s3a.secret.key=my_secret_key;
> set metaconf:fs.s3a.access.key=my_access_key;
> {noformat}
> This is because only a certain metastore configurations defined in 
> {{HiveConf.MetaVars}} are allowed to be set by the user. If we enable the 
> above approaches we could potentially allow multiple S3 credentials on a 
> per-session level basis.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-16913) Support per-session S3 credentials

2017-06-28 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16066255#comment-16066255
 ] 

Steve Loughran commented on HIVE-16913:
---

Note that if you try and be clever about key names, then you need to consider 
that JCEKS files need to come down too, and per-bucket config options like 
{{fs.s3a.bucket.test3.secret.key}}.

> Support per-session S3 credentials
> --
>
> Key: HIVE-16913
> URL: https://issues.apache.org/jira/browse/HIVE-16913
> Project: Hive
>  Issue Type: Improvement
>Reporter: Vihang Karajgaonkar
>Assignee: Vihang Karajgaonkar
>
> Currently, the credentials needed to support Hive-on-S3 (or any other 
> cloud-storage) need to be to the hive-site.xml. Either using a hadoop 
> credential provider or by adding the keys in the hive-site.xml in plain text 
> (unsecure)
> This limits the usecase to using a single S3 key. If we configure per bucket 
> s3 keys like described [here | 
> http://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html#Configurations_different_S3_buckets]
>  it exposes the access to all the buckets to all the hive users.
> It is possible that there are different sets of users who would not like to 
> share there buckets and still be able to process the data using Hive. 
> Enabling session level credentials will help solve such use-cases. For 
> example, currently this doesn't work
> {noformat}
> set fs.s3a.secret.key=my_secret_key;
> set fs.s3a.access.key=my_access.key;
> {noformat}
> Because metastore is unaware of the the keys. This doesn't work either
> {noformat}
> set fs.s3a.secret.key=my_secret_key;
> set fs.s3a.access.key=my_access.key;
> set metaconf:fs.s3a.secret.key=my_secret_key;
> set metaconf:fs.s3a.access.key=my_access_key;
> {noformat}
> This is because only a certain metastore configurations defined in 
> {{HiveConf.MetaVars}} are allowed to be set by the user. If we enable the 
> above approaches we could potentially allow multiple S3 credentials on a 
> per-session level basis.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-16913) Support per-session S3 credentials

2017-06-28 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16066253#comment-16066253
 ] 

Steve Loughran commented on HIVE-16913:
---

# Credentials on Hadoop 2.7+ can go in JCEKS files too; this is the recommended 
best practice (a sketch of reading them follows below). Consult your Hadoop 
supplier about backporting that feature if required.
# Filesystems which support delegation tokens (Azure may) can have them handled 
automatically. HADOOP-14556 discusses the possibility of adding them to S3 so 
that a user with full credentials (not session, not IAM) may create a triple of 
session credentials and pass them in a DT for later auth.
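
A minimal sketch of point 1: resolving the secret through the Hadoop credential 
provider API rather than from plain-text config. The jceks:// path is a 
placeholder for a store created beforehand with the {{hadoop credential}} command.
{code}
// Sketch: resolve fs.s3a.secret.key from a JCEKS credential store.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.alias.CredentialProviderFactory;

public class ReadS3ASecret {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Placeholder path: point at the store holding the S3A access/secret keys.
    conf.set(CredentialProviderFactory.CREDENTIAL_PROVIDER_PATH,
        "jceks://hdfs@namenode:8020/user/hive/s3.jceks");

    // getPassword() consults the providers first, then falls back to the config.
    char[] secret = conf.getPassword("fs.s3a.secret.key");
    System.out.println("secret resolved: " + (secret != null && secret.length > 0));
  }
}
{code}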

> Support per-session S3 credentials
> --
>
> Key: HIVE-16913
> URL: https://issues.apache.org/jira/browse/HIVE-16913
> Project: Hive
>  Issue Type: Improvement
>Reporter: Vihang Karajgaonkar
>Assignee: Vihang Karajgaonkar
>
> Currently, the credentials needed to support Hive-on-S3 (or any other 
> cloud-storage) need to be to the hive-site.xml. Either using a hadoop 
> credential provider or by adding the keys in the hive-site.xml in plain text 
> (unsecure)
> This limits the usecase to using a single S3 key. If we configure per bucket 
> s3 keys like described [here | 
> http://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html#Configurations_different_S3_buckets]
>  it exposes the access to all the buckets to all the hive users.
> It is possible that there are different sets of users who would not like to 
> share there buckets and still be able to process the data using Hive. 
> Enabling session level credentials will help solve such use-cases. For 
> example, currently this doesn't work
> {noformat}
> set fs.s3a.secret.key=my_secret_key;
> set fs.s3a.access.key=my_access.key;
> {noformat}
> Because metastore is unaware of the the keys. This doesn't work either
> {noformat}
> set fs.s3a.secret.key=my_secret_key;
> set fs.s3a.access.key=my_access.key;
> set metaconf:fs.s3a.secret.key=my_secret_key;
> set metaconf:fs.s3a.access.key=my_access_key;
> {noformat}
> This is because only a certain metastore configurations defined in 
> {{HiveConf.MetaVars}} are allowed to be set by the user. If we enable the 
> above approaches we could potentially allow multiple S3 credentials on a 
> per-session level basis.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-16446) org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified by setting t

2017-05-19 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16017265#comment-16017265
 ] 

Steve Loughran commented on HIVE-16446:
---

# try with s3a URLs and the fs.s3a secret and access keys (a minimal sketch is 
below)
# do not put secrets in your URIs; it's a security leak waiting to be 
discovered. That's why you get told off about it. See HADOOP-3733
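
A minimal sketch of point 1, with the keys set on the Hadoop configuration 
rather than embedded in the URI; the bucket name and environment variables are 
placeholders:
{code}
// Sketch: s3a with fs.s3a.* keys, and no secrets anywhere in the URI.
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class S3ASmokeTest {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // In practice these belong in core-site.xml or a credential provider.
    conf.set("fs.s3a.access.key", System.getenv("AWS_ACCESS_KEY_ID"));
    conf.set("fs.s3a.secret.key", System.getenv("AWS_SECRET_ACCESS_KEY"));

    FileSystem fs = FileSystem.get(URI.create("s3a://my-bucket/"), conf);
    for (FileStatus st : fs.listStatus(new Path("/"))) {
      System.out.println(st.getPath());
    }
  }
}
{code}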

> org.apache.hadoop.hive.ql.exec.DDLTask. 
> MetaException(message:java.lang.IllegalArgumentException: AWS Access Key ID 
> and Secret Access Key must be specified by setting the fs.s3n.awsAccessKeyId 
> and fs.s3n.awsSecretAccessKey properties
> -
>
> Key: HIVE-16446
> URL: https://issues.apache.org/jira/browse/HIVE-16446
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Affects Versions: 1.1.0
>Reporter: Kalexin Baoerjiin
>Assignee: Vihang Karajgaonkar
>
> After upgrading our Cloudera cluster to CDH 5.10.1 we are experiencing the 
> following problem during some Hive DDL.
> 
> SET fs.s3n.awsSecretAccessKey=;
> SET fs.s3n.awsAccessKeyId=;
> 
> ALTER TABLE hive_1k_partitions ADD IF NOT EXISTS partition (year='2014', 
> month='2014-01', dt='2014-01-01', hours='00', minutes='16', seconds='22') 
> location 's3n://'
> 
> Stack trace I was able to recover: 
> [ Message content over the limit has been removed. ]
> at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:383)
> at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:318)
> at org.apache.hadoop.hive.cli.CliDriver.processReader(CliDriver.java:416)
> at org.apache.hadoop.hive.cli.CliDriver.processFile(CliDriver.java:432)
> at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:726)
> at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:693)
> at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:628)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
> Job Submission failed with exception ‘java.lang.IllegalArgumentException(AWS 
> Access Key ID and Secret Access Key must be specified by setting the 
> fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey properties 
> (respectively).)’
> FAILED: Execution Error, return code 1 from 
> org.apache.hadoop.hive.ql.exec.mr.MapRedTask
> [9:31] 
> Logging initialized using configuration in 
> jar:file:/opt/cloudera/parcels/CDH-5.10.1-1.cdh5.10.1.p0.10/jars/hive-common-1.1.0-cdh5.10.1.jar!/hive-log4j.properties
> In the past we did not have to set s3 key and ID in core-site.xml because we 
> were using them dynamically inside our hive DDL scripts.
> After setting S3 secret key and Access ID in core-site.xml this problem goes 
> away. However this is an incompatibility change from the previous Hive 
> shipped in CDH 5.9. 
> Cloudera 5.10.x release note mentioned (HIVE-14269 : Enhanced write 
> performance for Hive tables stored on Amazon S3.) is the only Hive related 
> changes. 
> https://www.cloudera.com/documentation/enterprise/release-notes/topics/cdh_rn_new_in_cdh_510.html
> https://issues.apache.org/jira/browse/HIVE-14269



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HIVE-16446) org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified by setting t

2017-04-22 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15979869#comment-15979869
 ] 

Steve Loughran commented on HIVE-16446:
---

You should switch to using s3a:// URLs in anything based on Hadoop 2.7+, which 
the latest CDH versions are.

Set up security as per: 
https://hortonworks.github.io/hdp-aws/s3-security/index.html
Then test on the command line before worrying about Hive: 
https://hortonworks.github.io/hdp-aws/s3-s3aclient/index.html

Setting up core-site.xml & testing via the hdfs fs commands will let you get up 
and running faster.


> org.apache.hadoop.hive.ql.exec.DDLTask. 
> MetaException(message:java.lang.IllegalArgumentException: AWS Access Key ID 
> and Secret Access Key must be specified by setting the fs.s3n.awsAccessKeyId 
> and fs.s3n.awsSecretAccessKey properties
> -
>
> Key: HIVE-16446
> URL: https://issues.apache.org/jira/browse/HIVE-16446
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Affects Versions: 1.1.0
>Reporter: Kalexin Baoerjiin
>Assignee: Vihang Karajgaonkar
>
> After upgrading our Cloudera cluster to CDH 5.10.1 we are experiencing the 
> following problem during some Hive DDL.
> 
> SET fs.s3n.awsSecretAccessKey=;
> SET fs.s3n.awsAccessKeyId=;
> 
> ALTER TABLE hive_1k_partitions ADD IF NOT EXISTS partition (year='2014', 
> month='2014-01', dt='2014-01-01', hours='00', minutes='16', seconds='22') 
> location 's3n://'
> 
> Stack trace I was able to recover: 
> [ Message content over the limit has been removed. ]
> at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:383)
> at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:318)
> at org.apache.hadoop.hive.cli.CliDriver.processReader(CliDriver.java:416)
> at org.apache.hadoop.hive.cli.CliDriver.processFile(CliDriver.java:432)
> at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:726)
> at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:693)
> at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:628)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
> Job Submission failed with exception ‘java.lang.IllegalArgumentException(AWS 
> Access Key ID and Secret Access Key must be specified by setting the 
> fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey properties 
> (respectively).)’
> FAILED: Execution Error, return code 1 from 
> org.apache.hadoop.hive.ql.exec.mr.MapRedTask
> [9:31] 
> Logging initialized using configuration in 
> jar:file:/opt/cloudera/parcels/CDH-5.10.1-1.cdh5.10.1.p0.10/jars/hive-common-1.1.0-cdh5.10.1.jar!/hive-log4j.properties
> In the past we did not have to set s3 key and ID in core-site.xml because we 
> were using them dynamically inside our hive DDL scripts.
> After setting S3 secret key and Access ID in core-site.xml this problem goes 
> away. However this is an incompatibility change from the previous Hive 
> shipped in CDH 5.9. 
> Cloudera 5.10.x release note mentioned (HIVE-14269 : Enhanced write 
> performance for Hive tables stored on Amazon S3.) is the only Hive related 
> changes. 
> https://www.cloudera.com/documentation/enterprise/release-notes/topics/cdh_rn_new_in_cdh_510.html
> https://issues.apache.org/jira/browse/HIVE-14269



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HIVE-16295) Add support for using Hadoop's OutputCommitter

2017-04-12 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965660#comment-15965660
 ] 

Steve Loughran commented on HIVE-16295:
---

Thanks for starting this

1. We're making changes to FileOutputFormat so that it doesn't require an 
instance of {{FileOutputCommitter}}, just any committer which also supplies a 
working directory. This lets us add new committers alongside the existing one, 
without playing games trying to subclass what is already a complex class.
1. All work is focused on getting the netflix "staging" committer out the door 
first; the other one, which I'd started before netflix offered theirs, does 
things inside S3A which could best be viewed as "dark magic". It will offer 
even more performance, but I'm neglecting it for now. The netflix one is in use 
in production, and has all its failure/abort algorithms thought out and 
implemented.
1. I'm keeping the magic committer tests working, but not going to consider 
that one ready to use until it passes lots of tests. Consider it a speedup for 
the future.

The netflix committer itself has two subclasses, "directory" and "partitioned": 
the directory one propagates a directory tree, while the partitioned one 
expects paths like "dateint=20161116/hour=14" and has a different conflict 
policy than the directory one.

The algorithm for the staging committer is:

# tasks write to a local temp dir
# task abort: delete the files
# task commit: PUT the files as multipart uploads to their final destinations, 
*do not commit the put*. Instead the data needed for the commit is saved to the 
cluster FS, and committed using the normal algorithm
# job commit: load in the output of all committed tasks, commit them. Failure 
to commit triggers revert: delete all files already committed, abort the rest 
of the list.
# job abort: abort the output of all uncommitted tasks by reading in the files 
and aborting those uploads.
# retry logic? Whatever is implemented by the AWS SDK (multiple attempts to 
POST/PUT parts) and in S3A (retries of that final commit POST)

Nothing is visible until job commit; there's still a window of non-atomicity 
there, but it's the time for N POSTs where N = # of files; this can be 
parallelised easily as it uses little bandwidth per POST (unlike the uploads).
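
As an illustration of that "upload now, complete later" idea, here is a rough 
sketch against the AWS SDK; the names are invented, and the real committer 
splits files into multiple parts, persists the pending-upload metadata to the 
cluster FS and adds retry handling:
{code}
// Sketch of the deferred-visibility trick: task commit uploads the data as an
// uncompleted multipart upload; only job commit issues the complete call.
import java.io.File;
import java.util.ArrayList;
import java.util.List;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.AbortMultipartUploadRequest;
import com.amazonaws.services.s3.model.CompleteMultipartUploadRequest;
import com.amazonaws.services.s3.model.InitiateMultipartUploadRequest;
import com.amazonaws.services.s3.model.PartETag;
import com.amazonaws.services.s3.model.UploadPartRequest;

public class PendingUpload {
  private String bucket, key, uploadId;
  private final List<PartETag> parts = new ArrayList<>();

  /** Task commit: push the bytes, but do not complete the upload. */
  public static PendingUpload stage(AmazonS3 s3, File local, String bucket, String key) {
    PendingUpload p = new PendingUpload();
    p.bucket = bucket;
    p.key = key;
    p.uploadId = s3.initiateMultipartUpload(
        new InitiateMultipartUploadRequest(bucket, key)).getUploadId();
    p.parts.add(s3.uploadPart(new UploadPartRequest()
        .withBucketName(bucket).withKey(key).withUploadId(p.uploadId)
        .withPartNumber(1).withFile(local).withPartSize(local.length()))
        .getPartETag());
    return p;  // metadata persisted (e.g. to the cluster FS) for the job committer
  }

  /** Job commit: the object only becomes visible here. */
  public void commit(AmazonS3 s3) {
    s3.completeMultipartUpload(
        new CompleteMultipartUploadRequest(bucket, key, uploadId, parts));
  }

  /** Job/task abort: discard the uploaded parts. */
  public void abort(AmazonS3 s3) {
    s3.abortMultipartUpload(new AbortMultipartUploadRequest(bucket, key, uploadId));
  }
}
{code}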

In tests, the dir committer works for the intermediate output of MR jobs saving 
data to part-000x directories; the partitioned one is good for Spark output 
which doesn't save the intermediate data and wants partitioned-style output.


> Add support for using Hadoop's OutputCommitter
> --
>
> Key: HIVE-16295
> URL: https://issues.apache.org/jira/browse/HIVE-16295
> Project: Hive
>  Issue Type: Sub-task
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
>
> Hive doesn't have integration with Hadoop's {{OutputCommitter}}, it uses a 
> {{NullOutputCommitter}} and uses its own commit logic spread across 
> {{FileSinkOperator}}, {{MoveTask}}, and {{Hive}}.
> The Hadoop community is building an {{OutputCommitter}} that integrates with 
> S3Guard and does a safe, coordinate commit of data on S3 inside individual 
> tasks (HADOOP-13786). If Hive can integrate with this new {{OutputCommitter}} 
> there would be a lot of benefits to Hive-on-S3:
> * Data is only written once; directly committing data at a task level means 
> no renames are necessary
> * The commit is done safely, in a coordinated manner; duplicate tasks (from 
> task retries or speculative execution) should not step on each other



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HIVE-14864) Distcp is not called from MoveTask when src is a directory

2017-03-04 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15895818#comment-15895818
 ] 

Steve Loughran commented on HIVE-14864:
---

{{FileSystem.getContentSummary()}} does a recursive treewalk, so it is 
pathologically bad on a blobstore which has to mock directories through many, 
many HTTP requests.

If you need to use it, could you actually supply a patch (+ FS contract tests) 
for the method so that it uses listFiles(path, recursive=true)? That does the 
same treewalk against HDFS, but blobstores can do it as an O(1) listing call 
instead. If you can get that patch in, then enumerating the size of a blobstore 
tree will be fast.
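
A small sketch of what that replacement looks like, summing lengths from a 
single recursive {{listFiles()}} iterator (the helper class name is invented):
{code}
// Sketch: size of a directory tree via one recursive listing instead of a treewalk.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class TreeSize {
  public static long sizeOf(Path src, Configuration conf) throws Exception {
    FileSystem fs = src.getFileSystem(conf);
    long bytes = 0;
    RemoteIterator<LocatedFileStatus> files = fs.listFiles(src, true /* recursive */);
    while (files.hasNext()) {
      bytes += files.next().getLen();
    }
    return bytes;  // comparable to getContentSummary(src).getLength()
  }
}
{code}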

> Distcp is not called from MoveTask when src is a directory
> --
>
> Key: HIVE-14864
> URL: https://issues.apache.org/jira/browse/HIVE-14864
> Project: Hive
>  Issue Type: Bug
>Reporter: Vihang Karajgaonkar
>Assignee: Sahil Takiar
> Attachments: HIVE-14864.1.patch, HIVE-14864.2.patch, 
> HIVE-14864.3.patch, HIVE-14864.patch
>
>
> In FileUtils.java the following code does not get executed even when src 
> directory size is greater than HIVE_EXEC_COPYFILE_MAXSIZE because 
> srcFS.getFileStatus(src).getLen() returns 0 when src is a directory. We 
> should use srcFS.getContentSummary(src).getLength() instead.
> {noformat}
> /* Run distcp if source file/dir is too big */
> if (srcFS.getUri().getScheme().equals("hdfs") &&
> srcFS.getFileStatus(src).getLen() > 
> conf.getLongVar(HiveConf.ConfVars.HIVE_EXEC_COPYFILE_MAXSIZE)) {
>   LOG.info("Source is " + srcFS.getFileStatus(src).getLen() + " bytes. 
> (MAX: " + conf.getLongVar(HiveConf.ConfVars.HIVE_EXEC_COPYFILE_MAXSIZE) + 
> ")");
>   LOG.info("Launch distributed copy (distcp) job.");
>   HiveConfUtil.updateJobCredentialProviders(conf);
>   copied = shims.runDistCp(src, dst, conf);
>   if (copied && deleteSource) {
> srcFS.delete(src, true);
>   }
> }
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HIVE-15502) CTAS on S3 is broken with credentials exception

2017-03-01 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15890221#comment-15890221
 ] 

Steve Loughran commented on HIVE-15502:
---

Probably comes down to the ordering of the FS creation vs. when the first 
HiveConf is instantiated. It's only after the latter that hive-site.xml is 
added as a default resource for all configs, hence picked up.

Best to just put the credentials in core-site.xml.
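
A minimal sketch of the ordering issue, assuming (as described above) that 
hive-site.xml only becomes a default resource once HiveConf is first loaded; 
this is not Hive's actual bootstrap code.

{code}
// Configurations built before hive-site.xml is registered never see fs.s3a.*
// keys that live only in that file; core-site.xml avoids the ordering problem.
import org.apache.hadoop.conf.Configuration;

public final class ResourceOrdering {
  public static void main(String[] args) {
    Configuration beforeHive = new Configuration();
    // Any FileSystem created from this conf misses the s3a credentials.
    System.out.println("before: " + beforeHive.get("fs.s3a.access.key"));

    // Roughly what loading HiveConf is described as doing:
    Configuration.addDefaultResource("hive-site.xml");

    Configuration afterHive = new Configuration();
    System.out.println("after:  " + afterHive.get("fs.s3a.access.key"));
  }
}
{code}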

> CTAS on S3 is broken with credentials exception
> ---
>
> Key: HIVE-15502
> URL: https://issues.apache.org/jira/browse/HIVE-15502
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
>
> Simple CTAS queries that read from S3, and write to the local fs throw the 
> following exception:
> {code}
> com.amazonaws.AmazonClientException: Unable to load AWS credentials from any 
> provider in the chain
>   at 
> com.amazonaws.auth.AWSCredentialsProviderChain.getCredentials(AWSCredentialsProviderChain.java:117)
>   at 
> com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3521)
>   at 
> com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1031)
>   at 
> com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:994)
>   at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:297)
>   at 
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2653)
>   at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:92)
>   at 
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2687)
>   at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2669)
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:371)
>   at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
>   at 
> org.apache.hadoop.hive.ql.exec.Utilities.isEmptyPath(Utilities.java:2308)
>   at 
> org.apache.hadoop.hive.ql.exec.Utilities.isEmptyPath(Utilities.java:2304)
>   at 
> org.apache.hadoop.hive.ql.exec.Utilities.getInputPaths(Utilities.java:3013)
>   at 
> org.apache.hadoop.hive.ql.exec.mr.ExecDriver.execute(ExecDriver.java:342)
>   at 
> org.apache.hadoop.hive.ql.exec.mr.MapRedTask.execute(MapRedTask.java:151)
>   at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:199)
>   at 
> org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:100)
>   at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:2168)
>   at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1824)
>   at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1511)
>   at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1222)
>   at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1212)
>   at 
> org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:233)
>   at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:184)
>   at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:400)
>   at 
> org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:777)
>   at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:715)
>   at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:642)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
>   at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
> Job Submission failed with exception 
> 'com.amazonaws.AmazonClientException(Unable to load AWS credentials from any 
> provider in the chain)'
> {code}
> Seems to only happen when trying to connect to S3 from map tasks. My 
> {{hive-site.xml}} has the following entries:
> {code}
> <configuration>
>   <property>
>     <name>mapreduce.framework.name</name>
>     <value>local</value>
>   </property>
>   <property>
>     <name>mapred.job.tracker</name>
>     <value>local</value>
>   </property>
>   <property>
>     <name>fs.default.name</name>
>     <value>file:///</value>
>   </property>
>   <property>
>     <name>fs.s3a.access.key</name>
>     <value>[ACCESS-KEY]</value>
>   </property>
>   <property>
>     <name>fs.s3a.secret.key</name>
>     <value>[SECRET-KEY]</value>
>   </property>
> </configuration>
> {code}
> I've also noticed that now I need to copy the AWS S3 SDK jars into the lib 
> folder before running Hive locally.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HIVE-15368) consider optimizing Utilities::handleMmTableFinalPath

2017-03-01 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15890216#comment-15890216
 ] 

Steve Loughran commented on HIVE-15368:
---

If you can use {{FileSystem.listFiles(path, recursive=true)}} then s3a on Hadoop 
2.8 can give you an O(files/5000) listing, rather than the treewalk you can get 
today. I'd recommend trying to use it; the other object stores can no doubt 
implement the same operation.
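
As a sketch of what that could look like for the partitioned-table case (names 
are illustrative, not Hive's actual Utilities code): one recursive listing over 
the table root, grouped by parent directory, instead of one listing per 
partition.

{code}
// One recursive listFiles() call over the table root, grouped by parent
// directory; object stores serve this as a paged flat listing.
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public final class PartitionListing {
  public static Map<Path, List<LocatedFileStatus>> filesByDirectory(
      FileSystem fs, Path tableRoot) throws IOException {
    Map<Path, List<LocatedFileStatus>> byDir = new HashMap<>();
    RemoteIterator<LocatedFileStatus> it = fs.listFiles(tableRoot, true);
    while (it.hasNext()) {
      LocatedFileStatus status = it.next();
      byDir.computeIfAbsent(status.getPath().getParent(),
          p -> new ArrayList<>()).add(status);
    }
    return byDir;
  }
}
{code}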

> consider optimizing Utilities::handleMmTableFinalPath
> -
>
> Key: HIVE-15368
> URL: https://issues.apache.org/jira/browse/HIVE-15368
> Project: Hive
>  Issue Type: Sub-task
>Affects Versions: 0.3.0
>Reporter: Rajesh Balamohan
> Attachments: HIVE-15368.branch.14535.1.patch
>
>
> Branch: hive-14535
> https://github.com/apache/hive/blob/hive-14535/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java#L4049
> When running "insert overwrite...on partitioned table" with 2000+ partitions, 
> good amount of time (~245 seconds) was spent in iterating every mmDirectory 
> entry and checking its file listings in S3. Creating this jira to consider 
> optimizing this codepath, as information from {{getMmDirectoryCandidates}} 
> could be used in terms of reducing the number of times S3 needs to be 
> contacted.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HIVE-15016) Run tests with Hadoop 3.0.0-alpha1

2016-12-29 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15785633#comment-15785633
 ] 

Steve Loughran commented on HIVE-15016:
---

don't think Hadoop is making much use of codahale or committed to any specific 
JAR, if you need to update the one there

> Run tests with Hadoop 3.0.0-alpha1
> --
>
> Key: HIVE-15016
> URL: https://issues.apache.org/jira/browse/HIVE-15016
> Project: Hive
>  Issue Type: Task
>  Components: Hive
>Reporter: Sergio Peña
>Assignee: Sergio Peña
> Attachments: Hadoop3Upstream.patch
>
>
> Hadoop 3.0.0-alpha1 was released back on Sep/16 to allow other components run 
> tests against this new version before GA.
> We should start running tests with Hive to validate compatibility against 
> Hadoop 3.0.
> NOTE: The patch used to test must not be committed to Hive until Hadoop 3.0 
> GA is released.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-15016) Run tests with Hadoop 3.0.0-alpha1

2016-12-17 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15756877#comment-15756877
 ] 

Steve Loughran commented on HIVE-15016:
---

if you check out hadoop trunk, all you need to do is make a build with the 
declared version changed.
{code}
mvn install -DskipTests -Ddeclared.hadoop.version=2.11
{code}

This *does not* change the version numbers enough to bring up HDFS; all it does 
is trick hive into thinking it knows about Hadoop 3. The real fix will have to 
be in hive & cherry picked into the Spark fork.


> Run tests with Hadoop 3.0.0-alpha1
> --
>
> Key: HIVE-15016
> URL: https://issues.apache.org/jira/browse/HIVE-15016
> Project: Hive
>  Issue Type: Task
>  Components: Hive
>Reporter: Sergio Peña
>Assignee: Sergio Peña
> Attachments: Hadoop3Upstream.patch
>
>
> Hadoop 3.0.0-alpha1 was released back on Sep/16 to allow other components run 
> tests against this new version before GA.
> We should start running tests with Hive to validate compatibility against 
> Hadoop 3.0.
> NOTE: The patch used to test must not be committed to Hive until Hadoop 3.0 
> GA is released.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-15326) Hive shims report Unrecognized Hadoop major version number: 3.0.0-alpha2-SNAPSHOT

2016-12-02 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15714928#comment-15714928
 ] 

Steve Loughran commented on HIVE-15326:
---

HIVE-15016 includes a fix for that, simply by changing the case statement to 
consider 3.x as needing the same shims as 2.x
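
For reference, a sketch of the kind of change meant here; class and constant 
names are illustrative rather than copied from Hive's real ShimLoader, and the 
actual patch lives in HIVE-15016.

{code}
// Treat a Hadoop 3.x version string the same way as 2.x when picking shims.
public final class ShimVersionCheck {
  private static final String HADOOP23_SHIMS = "0.23";   // assumed shim key

  static String majorVersionFor(String versionString) {
    // "3.0.0-alpha2-SNAPSHOT" -> "3"
    String major = versionString.split("\\.")[0];
    switch (major) {
      case "2":
      case "3":                 // Hadoop 3.x reuses the Hadoop 2 shims
        return HADOOP23_SHIMS;
      default:
        throw new IllegalArgumentException(
            "Unrecognized Hadoop major version number: " + versionString);
    }
  }

  public static void main(String[] args) {
    System.out.println(majorVersionFor("3.0.0-alpha2-SNAPSHOT"));
  }
}
{code}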

> Hive shims report Unrecognized Hadoop major version number: 
> 3.0.0-alpha2-SNAPSHOT
> -
>
> Key: HIVE-15326
> URL: https://issues.apache.org/jira/browse/HIVE-15326
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Affects Versions: 1.2.1
> Environment: Hadoop trunk branch
>Reporter: Steve Loughran
>
> Hive built against Hadoop 2 fails to run against Hadoop 3.x, 
> declaring:{{Unrecognized Hadoop major version number: 3.0.0-alpha2-SNAPSHOT}}
> Refusing to play on Hadoop 3.x may actually be the correct behaviour, though 
> ideally we've retained API compatibility to everything works (maybe with some 
> CP tweaking).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-15016) Run tests with Hadoop 3.0.0-alpha1

2016-12-01 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15712671#comment-15712671
 ] 

Steve Loughran commented on HIVE-15016:
---

What's the issue with the codahale JAR? Incompatible with something already on 
the CP?

> Run tests with Hadoop 3.0.0-alpha1
> --
>
> Key: HIVE-15016
> URL: https://issues.apache.org/jira/browse/HIVE-15016
> Project: Hive
>  Issue Type: Task
>  Components: Hive
>Reporter: Sergio Peña
>Assignee: Sergio Peña
> Attachments: Hadoop3Upstream.patch
>
>
> Hadoop 3.0.0-alpha1 was released back on Sep/16 to allow other components run 
> tests against this new version before GA.
> We should start running tests with Hive to validate compatibility against 
> Hadoop 3.0.
> NOTE: The patch used to test must not be committed to Hive until Hadoop 3.0 
> GA is released.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-15326) Hive shims report Unrecognized Hadoop major version number: 3.0.0-alpha2-SNAPSHOT

2016-12-01 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15711687#comment-15711687
 ] 

Steve Loughran commented on HIVE-15326:
---

Test is easy; attempt to instantiate a HiveConf

{code}

*** RUN ABORTED ***
  java.lang.ExceptionInInitializerError:
  at org.apache.hadoop.hive.conf.HiveConf.(HiveConf.java:105)
  at 
com.hortonworks.spark.cloud.s3.S3DependencyCheckSuite$$anonfun$7.apply$mcV$sp(S3DependencyCheckSuite.scala:62)
  at 
com.hortonworks.spark.cloud.s3.S3DependencyCheckSuite$$anonfun$7.apply(S3DependencyCheckSuite.scala:62)
  at 
com.hortonworks.spark.cloud.s3.S3DependencyCheckSuite$$anonfun$7.apply(S3DependencyCheckSuite.scala:62)
  at 
org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
  at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
  at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
  at org.scalatest.Transformer.apply(Transformer.scala:22)
  at org.scalatest.Transformer.apply(Transformer.scala:20)
  at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
  ...
  Cause: java.lang.IllegalArgumentException: Unrecognized Hadoop major version 
number: 3.0.0-alpha2-SNAPSHOT
  at 
org.apache.hadoop.hive.shims.ShimLoader.getMajorVersion(ShimLoader.java:174)
  at org.apache.hadoop.hive.shims.ShimLoader.loadShims(ShimLoader.java:139)
  at org.apache.hadoop.hive.shims.ShimLoader.getHadoopShims(ShimLoader.java:100)
  at org.apache.hadoop.hive.conf.HiveConf$ConfVars.(HiveConf.java:368)
  at org.apache.hadoop.hive.conf.HiveConf.(HiveConf.java:105)
  at 
com.hortonworks.spark.cloud.s3.S3DependencyCheckSuite$$anonfun$7.apply$mcV$sp(S3DependencyCheckSuite.scala:62)
  at 
com.hortonworks.spark.cloud.s3.S3DependencyCheckSuite$$anonfun$7.apply(S3DependencyCheckSuite.scala:62)
  at 
com.hortonworks.spark.cloud.s3.S3DependencyCheckSuite$$anonfun$7.apply(S3DependencyCheckSuite.scala:62)
  at 
org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
  at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
{code}
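
In code, the whole reproduction is just the constructor call (a sketch; the 
shim check fires during HiveConf's class initialisation):

{code}
// Constructing a HiveConf is enough to trigger the shim version check, giving
// the ExceptionInInitializerError / IllegalArgumentException shown above.
import org.apache.hadoop.hive.conf.HiveConf;

public final class ShimCheckRepro {
  public static void main(String[] args) {
    new HiveConf();
  }
}
{code}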


> Hive shims report Unrecognized Hadoop major version number: 
> 3.0.0-alpha2-SNAPSHOT
> -
>
> Key: HIVE-15326
> URL: https://issues.apache.org/jira/browse/HIVE-15326
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Affects Versions: 1.2.1
> Environment: Hadoop trunk branch
>Reporter: Steve Loughran
>
> Hive built against Hadoop 2 fails to run against Hadoop 3.x, 
> declaring:{{Unrecognized Hadoop major version number: 3.0.0-alpha2-SNAPSHOT}}
> Refusing to play on Hadoop 3.x may actually be the correct behaviour, though 
> ideally we've retained API compatibility to everything works (maybe with some 
> CP tweaking).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-15199) INSERT INTO data on S3 is replacing the old rows with the new ones

2016-11-22 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15687108#comment-15687108
 ] 

Steve Loughran commented on HIVE-15199:
---

I do think I'd rather fix this in s3, because it is adding 2 GET calls and a 
LIST before each rename, calls which take place in the rename itself. And of 
course, when Hadoop 2.8 or derivatives change s3a's rename to == HDFS, the 
check will be superfluous. Similarly, once you have a consistent FS view 
(s3guard, etc),  you are less likely to see a mismatch between listing and 
stat-ing. If you are, it means something else is writing to the same dir, 
putting you in trouble.

Would it be possible to set this up to make it easy to turn off in future? For 
example: create a JIRA on stripping the exists check out.

> INSERT INTO data on S3 is replacing the old rows with the new ones
> --
>
> Key: HIVE-15199
> URL: https://issues.apache.org/jira/browse/HIVE-15199
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Reporter: Sergio Peña
>Assignee: Sergio Peña
>Priority: Critical
> Attachments: HIVE-15199.1.patch, HIVE-15199.2.patch, 
> HIVE-15199.3.patch, HIVE-15199.4.patch, HIVE-15199.5.patch, 
> HIVE-15199.6.patch, HIVE-15199.7.patch, HIVE-15199.8.patch
>
>
> Any INSERT INTO statement run on S3 tables and when the scratch directory is 
> saved on S3 is deleting old rows of the table.
> {noformat}
> hive> set hive.blobstore.use.blobstore.as.scratchdir=true;
> hive> create table t1 (id int, name string) location 's3a://spena-bucket/t1';
> hive> insert into table t1 values (1,'name1');
> hive> select * from t1;
> 1   name1
> hive> insert into table t1 values (2,'name2');
> hive> select * from t1;
> 2   name2
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (HIVE-15199) INSERT INTO data on S3 is replacing the old rows with the new ones

2016-11-22 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15684283#comment-15684283
 ] 

Steve Loughran edited comment on HIVE-15199 at 11/22/16 3:19 PM:
-

you are right, I am wrong: serves me right for commenting without staring at 
the code; I got confused by the naming.

You should be invoking

{code}
RemoteIterator<LocatedFileStatus> listFiles(Path f, boolean recursive)
{code}

with recursive = true.

this defaults to a standard recursive treewalk; on object stores we can do an 
O(1) listing of all child files, irrespective of directory depth and width. For 
anything other than a flat directory, this is a significant speedup


was (Author: ste...@apache.org):
you are right, I am wrong: serves me right for commenting without staring at 
the code. you should be calling; I got confused by naming.

you should be invoking

{code}
RemoteIterator listFiles(Path f,  boolean recursive) 
{Code}
with recursive = true.

this defaults to a standard recursive treewalk; on object stores we can do an 
O(1) listing of all child files, irrespective of directory depth and width. For 
anything other than a flat directory, this is a significant speedup

> INSERT INTO data on S3 is replacing the old rows with the new ones
> --
>
> Key: HIVE-15199
> URL: https://issues.apache.org/jira/browse/HIVE-15199
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Reporter: Sergio Peña
>Assignee: Sergio Peña
>Priority: Critical
> Attachments: HIVE-15199.1.patch, HIVE-15199.2.patch, 
> HIVE-15199.3.patch, HIVE-15199.4.patch, HIVE-15199.5.patch, 
> HIVE-15199.6.patch, HIVE-15199.7.patch, HIVE-15199.8.patch
>
>
> Any INSERT INTO statement run on S3 tables and when the scratch directory is 
> saved on S3 is deleting old rows of the table.
> {noformat}
> hive> set hive.blobstore.use.blobstore.as.scratchdir=true;
> hive> create table t1 (id int, name string) location 's3a://spena-bucket/t1';
> hive> insert into table t1 values (1,'name1');
> hive> select * from t1;
> 1   name1
> hive> insert into table t1 values (2,'name2');
> hive> select * from t1;
> 2   name2
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-15199) INSERT INTO data on S3 is replacing the old rows with the new ones

2016-11-21 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15684283#comment-15684283
 ] 

Steve Loughran commented on HIVE-15199:
---

you are right, I am wrong: serves me right for commenting without staring at 
the code; I got confused by the naming.

You should be invoking

{code}
RemoteIterator<LocatedFileStatus> listFiles(Path f, boolean recursive)
{code}

with recursive = true.

this defaults to a standard recursive treewalk; on object stores we can do an 
O(1) listing of all child files, irrespective of directory depth and width. For 
anything other than a flat directory, this is a significant speedup

> INSERT INTO data on S3 is replacing the old rows with the new ones
> --
>
> Key: HIVE-15199
> URL: https://issues.apache.org/jira/browse/HIVE-15199
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Reporter: Sergio Peña
>Assignee: Sergio Peña
>Priority: Critical
> Attachments: HIVE-15199.1.patch, HIVE-15199.2.patch, 
> HIVE-15199.3.patch, HIVE-15199.4.patch, HIVE-15199.5.patch, 
> HIVE-15199.6.patch, HIVE-15199.7.patch
>
>
> Any INSERT INTO statement run on S3 tables and when the scratch directory is 
> saved on S3 is deleting old rows of the table.
> {noformat}
> hive> set hive.blobstore.use.blobstore.as.scratchdir=true;
> hive> create table t1 (id int, name string) location 's3a://spena-bucket/t1';
> hive> insert into table t1 values (1,'name1');
> hive> select * from t1;
> 1   name1
> hive> insert into table t1 values (2,'name2');
> hive> select * from t1;
> 2   name2
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-15199) INSERT INTO data on S3 is replacing the old rows with the new ones

2016-11-19 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15679190#comment-15679190
 ] 

Steve Loughran commented on HIVE-15199:
---

if you call listFiles(path, recursive=true) you don't get back a FileStatus 
array, you get an iterator back; on s3a branch 2.8+ this goes through the 
results of the list, triggering new listing requests on demand: 
https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/Listing.java#L171

To make effective use of this feature, you do have to iterate through the 
results, otherwise it won't do the listing operation... you may as well build 
the set up from that iteration

{code}
// walk the (lazily paged) listing and collect every file status
RemoteIterator<LocatedFileStatus> it = fs.listFiles(path, true);
while (it.hasNext()) {
  LocatedFileStatus s = it.next();
  if (!fileSet.contains(s)) {
    fileSet.add(s);
  }
}
{code}

On other filesystems listFiles does a recursive treewalk, no more/less 
expensive than doing it in your own code


> INSERT INTO data on S3 is replacing the old rows with the new ones
> --
>
> Key: HIVE-15199
> URL: https://issues.apache.org/jira/browse/HIVE-15199
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Reporter: Sergio Peña
>Assignee: Sergio Peña
>Priority: Critical
> Attachments: HIVE-15199.1.patch, HIVE-15199.2.patch, 
> HIVE-15199.3.patch, HIVE-15199.4.patch, HIVE-15199.5.patch
>
>
> Any INSERT INTO statement run on S3 tables and when the scratch directory is 
> saved on S3 is deleting old rows of the table.
> {noformat}
> hive> set hive.blobstore.use.blobstore.as.scratchdir=true;
> hive> create table t1 (id int, name string) location 's3a://spena-bucket/t1';
> hive> insert into table t1 values (1,'name1');
> hive> select * from t1;
> 1   name1
> hive> insert into table t1 values (2,'name2');
> hive> select * from t1;
> 2   name2
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-15199) INSERT INTO data on S3 is replacing the old rows with the new ones

2016-11-17 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15673317#comment-15673317
 ] 

Steve Loughran commented on HIVE-15199:
---

# as Sahil notes, blobstore copy calls must be done inside the object store to 
get internal bandwidth *and avoid download costs*.
# that can currently only be done in rename(); if we ever added a copy() 
command to the FS API, your life would be better
# the fact that rename returns "false" without details makes things worse. FWIW 
I'm modifying Spark's inner rename to throw exceptions, but as that isn't the 
public FS API, it's of no use here. Though I could add an option to s3a to 
always throw those exceptions (a subclass of IOE) if the caller needed it. 
Would that help? It'd only be broadly useful if HDFS also did that

Now, what needs to be done here?
# handle the situation where a listing of an object store path can be out of 
sync with the actual contents, something that surfaces in S3 and Swift. The 
exists/getFileStatus() call is more robust there.
# handle the situation where rename(src, dest), with src a file and dest an 
existing file, will copy src onto dest.

Short term: check the destination. I know some people will worry about the cost 
of the existence check, but remember this is going to trigger a copy operation, 
which is O(data) at about 6-8 MB/s: if you are committing large files, things 
will be slow.
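
A minimal sketch of that short-term check, using only the public FileSystem 
API; the helper name is illustrative, not Hive's actual code.

{code}
// Refuse to rename onto an existing file so a file-over-file rename cannot
// silently overwrite committed data.
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.io.IOException;

public final class SafeRename {
  public static void renameNoClobber(FileSystem fs, Path src, Path dest)
      throws IOException {
    if (fs.exists(dest)) {                 // the extra existence check
      throw new IOException("Destination already exists: " + dest);
    }
    if (!fs.rename(src, dest)) {           // rename() only returns false
      throw new IOException("Failed to rename " + src + " to " + dest);
    }
  }
}
{code}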

Looking at the rename semantics as tested in {{AbstractContractRenameTest}}, 
the test {{testRenameFileOverExistingFile}} has the comment "handles 
filesystems that will overwrite the destination as well as those that do not 
(i.e. HDFS)". It's not just s3a which lets you rename over an existing file; 
so, apparently, does local file://. Interestingly, Azure wasb:// doesn't, nor 
does s3n.


I think we can consider the fact that s3a lets you overwrite an existing 
destination to be a bug, based on its inconsistency with HDFS, Azure, s3n, etc. 
Indeed, the difference between s3a and s3n makes it harder to say "s3a is a 
drop-in replacement for s3n". Created HADOOP-13823 for you.


> INSERT INTO data on S3 is replacing the old rows with the new ones
> --
>
> Key: HIVE-15199
> URL: https://issues.apache.org/jira/browse/HIVE-15199
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Reporter: Sergio Peña
>Assignee: Sergio Peña
>Priority: Critical
> Attachments: HIVE-15199.1.patch, HIVE-15199.2.patch, 
> HIVE-15199.3.patch
>
>
> Any INSERT INTO statement run on S3 tables and when the scratch directory is 
> saved on S3 is deleting old rows of the table.
> {noformat}
> hive> set hive.blobstore.use.blobstore.as.scratchdir=true;
> hive> create table t1 (id int, name string) location 's3a://spena-bucket/t1';
> hive> insert into table t1 values (1,'name1');
> hive> select * from t1;
> 1   name1
> hive> insert into table t1 values (2,'name2');
> hive> select * from t1;
> 2   name2
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-15199) INSERT INTO data on S3 is replacing the old rows with the new ones

2016-11-15 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15667931#comment-15667931
 ] 

Steve Loughran commented on HIVE-15199:
---

sounds related to HADOOP-13402

I am not going to express any opinion about what is "the correct" behaviour we 
should expect from rename, as I don't think anyone knows that. If you look at 
the [FS 
Specification|https://hadoop.apache.org/docs/stable2/hadoop-project-dist/hadoop-common/filesystem/filesystem.html]
 we're pretty explicit that rename is hard, and that different filesystems 
behave differently.

I'm not defending S3A here, just noting I'm not 100% sure of what HDFS does 
itself here, and how that compares to the semantics of posix's rename call 
(which is different from the unix command line {{mv}} operation).

> INSERT INTO data on S3 is replacing the old rows with the new ones
> --
>
> Key: HIVE-15199
> URL: https://issues.apache.org/jira/browse/HIVE-15199
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Reporter: Sergio Peña
>Assignee: Sergio Peña
>Priority: Critical
>
> Any INSERT INTO statement run on S3 tables and when the scratch directory is 
> saved on S3 is deleting old rows of the table.
> {noformat}
> hive> set hive.blobstore.use.blobstore.as.scratchdir=true;
> hive> create table t1 (id int, name string) location 's3a://spena-bucket/t1';
> hive> insert into table t1 values (1,'name1');
> hive> select * from t1;
> 1   name1
> hive> insert into table t1 values (2,'name2');
> hive> select * from t1;
> 2   name2
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (HIVE-15093) S3-to-S3 Renames: Files should be moved individually rather than at a directory level

2016-11-10 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15653694#comment-15653694
 ] 

Steve Loughran edited comment on HIVE-15093 at 11/10/16 10:32 AM:
--

# I've just started HADOOP-13600, though being busy with preparation for and 
attendance at ApacheCon Big Data means you should expect no real progress for 
the next 10 days

as far as HDP goes, all the s3a phase II read pipeline work is in HDP-2.5; the 
HDP-cloud in AWS product adds the HADOOP-13560 write pipeline; with a faster 
update cycle it'd be out the door fairly rapidly too (disclaimer, no forward 
looking statements, etc etc). CDH hasn't shipped with any of the phase II 
changes in yet, that's something to discuss with your colleagues. Given the 
emphasis on Impala & S3, I'd expect it sooner rather than later

Here's [the work in 
progress|https://github.com/steveloughran/hadoop/blob/s3/HADOOOP-13600-rename/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java#L802];
 as I note in the code, I'm not doing it right. We should have the list and 
delete operations working in parallel too, because list is pretty slow too, and 
I want to eliminate all sequential points in the code.

I know it's complicated, but it shows why this routine is so much better down 
in the layers beneath: we can optimise every single HTTP request to S3a, order 
the copy calls for maximum overlapping operations, *and write functional tests 
against real s3 endpoints*. object stores are so different from filesystems 
that testing against localfs is misleading.


was (Author: ste...@apache.org):
#. I've just started HADOOP-13600, though busy with preparation and attendance 
at ApacheCon big data means expect no real progress for the next 10 days
# discussion on common dev about when 2.8 RC comes out

as far as HDP goes, all the s3a phase II read pipeline work is in HDP-2.5; the 
HDP-cloud in AWS product adds the HADOOP-13560 write pipeline; with a faster 
update cycle it'd be out the door fairly rapidly too (disclaimer, no forward 
looking statements, etc etc). CDH hasn't shipped with any of the phase II 
changes in yet, that's something to discuss with your colleagues. Given the 
emphasis on Impala & S3, I'd expect it sooner rather than later

Here's [the work in 
progress|https://github.com/steveloughran/hadoop/blob/s3/HADOOOP-13600-rename/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java#L802];
 as I note in the code, I'm not doing it right. We should have the list and 
delete operations working in parallel too, because list is pretty slow too, and 
I want to eliminate all sequential points in the code.

I know it's complicated, but it shows why this routine is so much better down 
in the layers beneath: we can optimise every single HTTP request to S3a, order 
the copy calls for maximum overlapping operations, *and write functional tests 
against real s3 endpoints*. object stores are so different from filesystems 
that testing against localfs is misleading.

> S3-to-S3 Renames: Files should be moved individually rather than at a 
> directory level
> -
>
> Key: HIVE-15093
> URL: https://issues.apache.org/jira/browse/HIVE-15093
> Project: Hive
>  Issue Type: Sub-task
>  Components: Hive
>Affects Versions: 2.1.0
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
> Attachments: HIVE-15093.1.patch, HIVE-15093.2.patch, 
> HIVE-15093.3.patch, HIVE-15093.4.patch, HIVE-15093.5.patch, 
> HIVE-15093.6.patch, HIVE-15093.7.patch, HIVE-15093.8.patch, HIVE-15093.9.patch
>
>
> Hive's MoveTask uses the Hive.moveFile method to move data within a 
> distributed filesystem as well as blobstore filesystems.
> If the move is done within the same filesystem:
> 1: If the source path is a subdirectory of the destination path, files will 
> be moved one by one using a threadpool of workers
> 2: If the source path is not a subdirectory of the destination path, a single 
> rename operation is used to move the entire directory
> The second option may not work well on blobstores such as S3. Renames are not 
> metadata operations and require copying all the data. Client connectors to 
> blobstores may not efficiently rename directories. Worst case, the connector 
> will copy each file one by one, sequentially rather than using a threadpool 
> of workers to copy the data (e.g. HADOOP-13600).
> Hive already has code to rename files using a threadpool of workers, but this 
> only occurs in case number 1.
> This JIRA aims to modify the code so that case 1 is triggered when copying 
> within a blobstore. The focus is on copies within a blobstore because 
> needToCopy will return true if the src and target filesystems are different, 
> in which case a different code path is triggered.

[jira] [Commented] (HIVE-15093) S3-to-S3 Renames: Files should be moved individually rather than at a directory level

2016-11-10 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15653694#comment-15653694
 ] 

Steve Loughran commented on HIVE-15093:
---

#. I've just started HADOOP-13600, though busy with preparation and attendance 
at ApacheCon big data means expect no real progress for the next 10 days
# discussion on common dev about when 2.8 RC comes out

as far as HDP goes, all the s3a phase II read pipeline work is in HDP-2.5; the 
HDP-cloud in AWS product adds the HADOOP-13560 write pipeline; with a faster 
update cycle it'd be out the door fairly rapidly too (disclaimer, no forward 
looking statements, etc etc). CDH hasn't shipped with any of the phase II 
changes in yet, that's something to discuss with your colleagues. Given the 
emphasis on Impala & S3, I'd expect it sooner rather than later

Here's [the work in 
progress|https://github.com/steveloughran/hadoop/blob/s3/HADOOOP-13600-rename/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java#L802];
 as I note in the code, I'm not doing it right. We should have the list and 
delete operations working in parallel too, because list is pretty slow too, and 
I want to eliminate all sequential points in the code.

I know it's complicated, but it shows why this routine is so much better down 
in the layers beneath: we can optimise every single HTTP request to S3a, order 
the copy calls for maximum overlapping operations, *and write functional tests 
against real s3 endpoints*. object stores are so different from filesystems 
that testing against localfs is misleading.

> S3-to-S3 Renames: Files should be moved individually rather than at a 
> directory level
> -
>
> Key: HIVE-15093
> URL: https://issues.apache.org/jira/browse/HIVE-15093
> Project: Hive
>  Issue Type: Sub-task
>  Components: Hive
>Affects Versions: 2.1.0
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
> Attachments: HIVE-15093.1.patch, HIVE-15093.2.patch, 
> HIVE-15093.3.patch, HIVE-15093.4.patch, HIVE-15093.5.patch, 
> HIVE-15093.6.patch, HIVE-15093.7.patch, HIVE-15093.8.patch, HIVE-15093.9.patch
>
>
> Hive's MoveTask uses the Hive.moveFile method to move data within a 
> distributed filesystem as well as blobstore filesystems.
> If the move is done within the same filesystem:
> 1: If the source path is a subdirectory of the destination path, files will 
> be moved one by one using a threadpool of workers
> 2: If the source path is not a subdirectory of the destination path, a single 
> rename operation is used to move the entire directory
> The second option may not work well on blobstores such as S3. Renames are not 
> metadata operations and require copying all the data. Client connectors to 
> blobstores may not efficiently rename directories. Worst case, the connector 
> will copy each file one by one, sequentially rather than using a threadpool 
> of workers to copy the data (e.g. HADOOP-13600).
> Hive already has code to rename files using a threadpool of workers, but this 
> only occurs in case number 1.
> This JIRA aims to modify the code so that case 1 is triggered when copying 
> within a blobstore. The focus is on copies within a blobstore because 
> needToCopy will return true if the src and target filesystems are different, 
> in which case a different code path is triggered.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-15093) S3-to-S3 Renames: Files should be moved individually rather than at a directory level

2016-11-09 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15651394#comment-15651394
 ] 

Steve Loughran commented on HIVE-15093:
---

-1 (non binding)

Doing parallel rename is a stop-gap solution which will be obsolete the moment 
someone sits down to do it in s3a with an implementation that is more 
efficient in its scheduling of copy calls and, with tests and broader use, 
better tested.

HADOOP-13600 proposes parallel renames. Nobody has written that yet, but I 
promise to review a patch people provide, with tests. Get that patch into 
Hadoop and there's only one place to maintain this stuff, no need to 
document/test another switch, maintain the option, have another codepath to 
keep alive, etc. 
The algorithm I proposed there would initially sort the files by size, so the 
larger renames are scheduled first. Given a thread pool smaller than the list 
of files to rename, this should ensure that the scheduling is closer to 
optimal. If you really, really want to do this in a separate piece of code, 
you should do the same.
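
If someone does sit down to write it, the scheduling idea above amounts to 
something like this sketch: plain FileSystem plus an executor, with error 
handling and directory recursion trimmed, and illustrative names throughout.

{code}
// Sort largest-first, then hand per-file renames to a bounded thread pool so
// the big copies start early. Assumes a flat directory of files.
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public final class ParallelRename {
  public static void renameAll(FileSystem fs, Path srcDir, Path destDir,
      int threads) throws Exception {
    FileStatus[] files = fs.listStatus(srcDir);
    Arrays.sort(files, Comparator.comparingLong(FileStatus::getLen).reversed());

    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      List<Future<Boolean>> results = new ArrayList<>();
      for (FileStatus f : files) {
        Path dest = new Path(destDir, f.getPath().getName());
        Callable<Boolean> task = () -> fs.rename(f.getPath(), dest);
        results.add(pool.submit(task));
      }
      for (Future<Boolean> r : results) {
        if (!r.get()) {
          throw new IOException("rename failed under " + srcDir);
        }
      }
    } finally {
      pool.shutdown();
    }
  }
}
{code}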

Also, there are enough other s3a speedups that you should be testing against 
Hadoop 2.8+, both to avoid optimising against a now-obsolete codepath and to 
help find and report any problems in our code.

To summarise: go on, fix the code in Hadoop, simplify everyone's lives. 

> S3-to-S3 Renames: Files should be moved individually rather than at a 
> directory level
> -
>
> Key: HIVE-15093
> URL: https://issues.apache.org/jira/browse/HIVE-15093
> Project: Hive
>  Issue Type: Sub-task
>  Components: Hive
>Affects Versions: 2.1.0
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
> Attachments: HIVE-15093.1.patch, HIVE-15093.2.patch, 
> HIVE-15093.3.patch, HIVE-15093.4.patch, HIVE-15093.5.patch, 
> HIVE-15093.6.patch, HIVE-15093.7.patch, HIVE-15093.8.patch, HIVE-15093.9.patch
>
>
> Hive's MoveTask uses the Hive.moveFile method to move data within a 
> distributed filesystem as well as blobstore filesystems.
> If the move is done within the same filesystem:
> 1: If the source path is a subdirectory of the destination path, files will 
> be moved one by one using a threadpool of workers
> 2: If the source path is not a subdirectory of the destination path, a single 
> rename operation is used to move the entire directory
> The second option may not work well on blobstores such as S3. Renames are not 
> metadata operations and require copying all the data. Client connectors to 
> blobstores may not efficiently rename directories. Worst case, the connector 
> will copy each file one by one, sequentially rather than using a threadpool 
> of workers to copy the data (e.g. HADOOP-13600).
> Hive already has code to rename files using a threadpool of workers, but this 
> only occurs in case number 1.
> This JIRA aims to modify the code so that case 1 is triggered when copying 
> within a blobstore. The focus is on copies within a blobstore because 
> needToCopy will return true if the src and target filesystems are different, 
> in which case a different code path is triggered.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

