[jira] [Updated] (HDDS-1889) Add support for verifying multiline log entry

2020-08-06 Thread Istvan Fajth (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-1889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Istvan Fajth updated HDDS-1889:
---
Labels: newbie pull-request-available  (was: newbie)

> Add support for verifying multiline log entry
> -
>
> Key: HDDS-1889
> URL: https://issues.apache.org/jira/browse/HDDS-1889
> Project: Hadoop Distributed Data Store
>  Issue Type: Test
>  Components: test
>Reporter: Dinesh Chitlangia
>Assignee: Peter Orova
>Priority: Major
>  Labels: newbie, pull-request-available
> Attachments: image.png
>
>
> This jira aims to test the failure scenario where a multi-line stack trace 
> is added to the audit log. Currently the test assumes that even in a failure 
> scenario we don't have a multi-line log entry.
> Example:
> {code:java}
> private static final AuditMessage READ_FAIL_MSG =
>   new AuditMessage.Builder()
>   .setUser("john")
>   .atIp("192.168.0.1")
>   .forOperation(DummyAction.READ_VOLUME.name())
>   .withParams(PARAMS)
>   .withResult(FAILURE)
>   .withException(null).build();
> {code}
> Therefore, in verifyLog() we only compare the first line of the log file with 
> the expected message.
> The test would fail if, in the future, someone were to create a scenario with 
> a multi-line log entry.
> 1. Update READ_FAIL_MSG so that it has multiple lines of exception stack 
> trace.
> This is what multi-line log entry could look like:
> {code:java}
> ERROR | OMAudit | user=dchitlangia | ip=127.0.0.1 | op=GET_ACL 
> {volume=volume80100, bucket=bucket83878, key=null, aclType=CREATE, 
> resourceType=volume, storeType=ozone} | ret=FAILURE
> org.apache.hadoop.ozone.om.exceptions.OMException: User dchitlangia doesn't 
> have CREATE permission to access volume
>  at org.apache.hadoop.ozone.om.OzoneManager.checkAcls(OzoneManager.java:1809) 
> ~[classes/:?]
>  at org.apache.hadoop.ozone.om.OzoneManager.checkAcls(OzoneManager.java:1769) 
> ~[classes/:?]
>  at 
> org.apache.hadoop.ozone.om.OzoneManager.createBucket(OzoneManager.java:2092) 
> ~[classes/:?]
>  at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.createBucket(OzoneManagerRequestHandler.java:526)
>  ~[classes/:?]
>  at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.handle(OzoneManagerRequestHandler.java:185)
>  ~[classes/:?]
>  at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequestDirectlyToOM(OzoneManagerProtocolServerSideTranslatorPB.java:192)
>  ~[classes/:?]
>  at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:110)
>  ~[classes/:?]
>  at 
> org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java)
>  ~[classes/:?]
>  at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
>  ~[hadoop-common-3.2.0.jar:?]
>  at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025) 
> ~[hadoop-common-3.2.0.jar:?]
>  at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876) 
> ~[hadoop-common-3.2.0.jar:?]
>  at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822) 
> ~[hadoop-common-3.2.0.jar:?]
>  at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_144]
>  at javax.security.auth.Subject.doAs(Subject.java:422) ~[?:1.8.0_144]
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>  ~[hadoop-common-3.2.0.jar:?]
>  at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682) 
> ~[hadoop-common-3.2.0.jar:?]
> {code}
> 2. Update the verifyLog method to accept a variable number of arguments.
> 3. Update the assertion so that it compares beyond the first line when the 
> expected message is a multi-line log entry. The current single-line assertion 
> is:
> {code:java}
> assertTrue(expected.equalsIgnoreCase(lines.get(0)));
> {code}
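> A hedged sketch of what the varargs-based verifyLog could look like (the log 
> file path and the surrounding test helpers are assumptions for illustration, 
> not the actual test code):
> {code:java}
> private void verifyLog(String... expectedLines) throws IOException {
>   // read the whole audit log; "audit.log" is a placeholder for the test's log file
>   List<String> lines = Files.readAllLines(Paths.get("audit.log"));
>   assertTrue(lines.size() >= expectedLines.length);
>   // compare every expected line, not just the first one
>   for (int i = 0; i < expectedLines.length; i++) {
>     assertTrue(expectedLines[i].equalsIgnoreCase(lines.get(i)));
>   }
> }
> {code}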






[jira] [Commented] (HDDS-4020) ACL commands like getacl and setacl should return a response only when Native Authorizer is enabled

2020-07-23 Thread Istvan Fajth (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-4020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17163785#comment-17163785
 ] 

Istvan Fajth commented on HDDS-4020:


I would like to suggest a few things for consideration here.

If we have an external authorizer, like Ranger, then we should fail any ACL 
creation or modification command with a proper error message saying that ACL 
modifications should happen via the external authorizer in use.
On the other hand, read operations should not fail.
Currently we get this error message on a getacl when an external authorizer is 
enabled:
{{# ozone sh volume getacl o3://ozone1/test}}
{{PERMISSION_DENIED User u...@example.com doesn't have READ_ACL permission to 
access volume}}

I think reading the ACLs from the external authorizer and showing them to the 
users would be a much nicer approach, though I agree this should probably go 
into a separate JIRA, as it might need modifications in IAccessAuthorizer that 
have to be followed up by the Ranger plugin itself as well.
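
A rough sketch of the kind of check I have in mind on the OM side; the class 
and result-code names below are from memory, so treat them as assumptions 
rather than the exact API:
{code:java}
// fail ACL-modifying requests when ACLs are managed externally (e.g. Ranger),
// while read operations such as getacl are allowed to proceed
private void checkNativeAuthorizerForAclWrite(IAccessAuthorizer authorizer)
    throws OMException {
  if (!(authorizer instanceof OzoneNativeAuthorizer)) {
    throw new OMException("ACLs are managed by an external authorizer; "
        + "please modify ACLs through that authorizer instead.",
        OMException.ResultCodes.NOT_SUPPORTED_OPERATION);
  }
}
{code}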

> ACL commands like getacl and setacl should return a response only when Native 
> Authorizer is enabled
> ---
>
> Key: HDDS-4020
> URL: https://issues.apache.org/jira/browse/HDDS-4020
> Project: Hadoop Distributed Data Store
>  Issue Type: Task
>  Components: Ozone CLI, Ozone Manager
>Affects Versions: 0.5.0
>Reporter: Vivek Ratnavel Subramanian
>Assignee: Bharat Viswanadham
>Priority: Major
>
> Currently, the getacl and setacl commands return wrong information when an 
> external authorizer such as Ranger is enabled. There should be a check to 
> verify whether the Native Authorizer is enabled before returning any response 
> for these two commands.
> If an external authorizer is enabled, they should show a clear message about 
> managing ACLs in the external authorizer.






[jira] [Resolved] (HDDS-671) Hive HSI insert tries to create data in Hdfs for Ozone external table

2020-07-23 Thread Istvan Fajth (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Istvan Fajth resolved HDDS-671.
---
Resolution: Not A Bug

This does not seem to be a bug, but only a permission issue, which has probably 
already changed.

When you run a Hive query, the underlying YARN job has to have the Hive-related 
classpath elements and therefore needs to access them. As far as I know, these 
resources usually reside on HDFS, and are sometimes copied together into a 
temporary directory so that all the containers can access the runtime 
dependencies. Based on the code path visible in the exception, I think this 
logic is collecting things into the user's home directory on HDFS, most likely, 
as Arpit said, because the default FS is still HDFS in this case.
I am not sure why the username became anonymous here, but that is probably no 
longer the case.

I am closing this as not a bug for now; feel free to reopen if anyone disagrees.

> Hive HSI insert tries to create data in Hdfs for Ozone external table
> -
>
> Key: HDDS-671
> URL: https://issues.apache.org/jira/browse/HDDS-671
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>Reporter: Namit Maheshwari
>Priority: Major
>  Labels: app-compat
>
> Hive HSI insert tries to create data in Hdfs for Ozone external table, when 
> "hive.server2.enable.doAs" is set to true 
> Exception details in comment below.






[jira] [Commented] (HDDS-3925) SCM Pipeline DB should directly use UUID bytes for key rather than rely on proto serialization for key.

2020-07-12 Thread Istvan Fajth (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-3925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17156338#comment-17156338
 ] 

Istvan Fajth commented on HDDS-3925:


After discussions with [~avijayan] and [~nanda619], we decided that 
implementing remove via the iterator should be a good approach.

This way the final solution can preserve the pipelines that were in the 
database while swapping the key of each stored value, so the in-memory 
structures after a restart are the same as they would be without the patch, 
while we keep the ability to properly delete pipelines from RocksDB in SCM when 
it is appropriate.

The problem in a bit more detail:
Protobuf does not guarantee that the byte[] serialization of an object always 
yields the same array of bytes, therefore using a protobuf-serialized object's 
byte array representation as a DB table's key is unstable and can lead to 
unwanted behaviour.
The pipeline table is accessed in two ways during the pipeline lifecycle:
 - a read via an iterator, where we read all the elements from the table; keys 
are not used during this operation, as the key is present in the stored value, 
which is a Pipeline object serialized and deserialized via protobuf (an 
operation that is guaranteed to be backward compatible), and the PipelineID 
used in the in-memory SCM structures comes from this deserialization
 - deletion of a particular Pipeline based on its ID, in which case we 
serialize the ID via the PipelineIDCodec implementation and delete the Pipeline 
from RocksDB using that key

It is inevitable to change the key serialization to a stable implementation, 
and to do so we serialize the byte representation of the UUID inside the 
PipelineID.
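
A minimal sketch of the stable key serialization, using java.nio.ByteBuffer 
(simplified; the real change lives in PipelineIDCodec):
{code:java}
// derive a stable 16-byte key from the UUID inside the PipelineID,
// independent of any protobuf serialization details
public static byte[] toPersistedFormat(UUID uuid) {
  ByteBuffer buffer = ByteBuffer.allocate(Long.BYTES * 2);
  buffer.putLong(uuid.getMostSignificantBits());
  buffer.putLong(uuid.getLeastSignificantBits());
  return buffer.array();
}
{code}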

With the proposed solution we do not lose any functionality: we can read the 
Pipelines from RocksDB, fix the keys during initialization, and store the 
values back into the DB table with the new key format while removing the old 
format. If this operation fails, we still let SCM start, as the value (the 
Pipeline object) was read properly and we can use it, and we can attempt to 
clean up the Pipeline again at the next startup. If the pipeline should have 
already been deleted before that next startup, we will attempt to replace the 
key again at that startup and it will get into the in-memory structures again, 
but it will be cleaned up later once SCM realizes that the DataNodes have 
already removed the pipeline, so it will not cause any problem and will 
eventually be cleaned up.
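
And a hedged sketch of the startup re-keying step, written against the plain 
RocksDB Java API purely to illustrate the idea (the real code goes through 
Ozone's Table abstraction, and the helper below is hypothetical):
{code:java}
// re-key every pipeline entry to the new UUID-byte key format; deleting while
// iterating is fine because the iterator reads from an implicit snapshot
static void rekeyPipelines(RocksDB db, ColumnFamilyHandle pipelines)
    throws RocksDBException {
  try (RocksIterator it = db.newIterator(pipelines)) {
    for (it.seekToFirst(); it.isValid(); it.next()) {
      byte[] value = it.value();                   // protobuf-serialized Pipeline
      byte[] newKey = uuidKeyFromPipeline(value);  // hypothetical helper built on
                                                   // toPersistedFormat() above
      db.delete(pipelines, it.key());
      db.put(pipelines, newKey, value);
    }
  }
}
{code}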

> SCM Pipeline DB should directly use UUID bytes for key rather than rely on 
> proto serialization for key.
> ---
>
> Key: HDDS-3925
> URL: https://issues.apache.org/jira/browse/HDDS-3925
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>  Components: SCM
>Reporter: Aravindan Vijayan
>Assignee: Istvan Fajth
>Priority: Major
>  Labels: pull-request-available, upgrade-p0
>
> Relying on Protobuf serialization for exact match is unreliable according to 
> the docs. Hence, we have to move away from using proto.toByteArray() for on 
> disk RocksDB keys. For more details, check parent JIRA.
> cc [~nanda619]






[jira] [Commented] (HDDS-3925) SCM Pipeline DB should directly use UUID bytes for key rather than rely on proto serialization for key.

2020-07-09 Thread Istvan Fajth (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-3925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17154884#comment-17154884
 ] 

Istvan Fajth commented on HDDS-3925:


OK, so the problem here is the following:
- in SCM, during initialisation, we read the pipelines into in-memory 
structures; these structures hold the PipelineID as the key and the Pipeline as 
the value. This initialisation iterates through the pipeline table and reads 
just the values, as the Pipeline itself contains the PipelineID as well.
- if we change the codec that transforms PipelineID objects to byte arrays, 
then after reading the database we cannot do anything with the pipelines stored 
in RocksDB under the old key format, as any access will use the new 
transformation.

We have two possibilities:
- either we add a remove() function to the TableIterator and remove the values 
via the iterator while iterating over the table; for this we need a table 
reference in the TableIterator, not just the RocksIterator. As far as I read, 
this should work, as the RocksIterator itself iterates over an implicit 
snapshot, so we should be fine removing values from the table in the meantime
- or we detect the change in the PipelineID storage format and, if there is a 
change, drop the whole table and re-create it in RocksDB

Either way, we need to accept the fact that we are losing pipeline data, which 
should be fine; however, if we do not delete the data at startup when the keys 
do not match, we won't be able to close and remove those pipelines from 
RocksDB, and they will be read every time SCM starts up, which can possibly 
lead to issues later during normal operations. Note that the deletion is not 
possible otherwise, as it works based on the PipelineID and its transformation 
to byte[] in the codec, and inside the codec there is unfortunately no way to 
detect an access failure.

It is also a possibility that after this change we require users to manually 
delete the pipeline table of SCM, but I don't feel that this is realistic. If 
we went with this requirement, then only the codec would require the change, 
but that seems to be the least user-friendly way. I am open to any other ideas 
though.

> SCM Pipeline DB should directly use UUID bytes for key rather than rely on 
> proto serialization for key.
> ---
>
> Key: HDDS-3925
> URL: https://issues.apache.org/jira/browse/HDDS-3925
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>  Components: SCM
>Reporter: Aravindan Vijayan
>Assignee: Prashant Pogde
>Priority: Major
>  Labels: upgrade-p0
>
> Relying on Protobuf serialization for exact match is unreliable according to 
> the docs. Hence, we have to move away from using proto.toByteArray() for on 
> disk RocksDB keys. For more details, check parent JIRA.
> cc [~nanda619]






[jira] [Commented] (HDDS-3722) Network Topology improvements post 0.6.0

2020-06-26 Thread Istvan Fajth (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-3722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17146436#comment-17146436
 ] 

Istvan Fajth commented on HDDS-3722:


I have done some research and considered a few options for the way forward, 
taking the EC design doc and the suggested Storage class into account as well.

Please feel free to share your thoughts on this. I am planning to put some 
effort into this after 0.6.0 is out, but I am happy to receive comments and 
further thoughts.

> Network Topology improvements post 0.6.0
> 
>
> Key: HDDS-3722
> URL: https://issues.apache.org/jira/browse/HDDS-3722
> Project: Hadoop Distributed Data Store
>  Issue Type: Improvement
>Affects Versions: 0.6.0
>Reporter: Siddharth Wagle
>Priority: Major
> Attachments: Placement policies in Ozone.pdf
>
>
> This is an umbrella Jira to address improvements suggested in HDDS-698.






[jira] [Updated] (HDDS-3722) Network Topology improvements post 0.6.0

2020-06-26 Thread Istvan Fajth (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-3722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Istvan Fajth updated HDDS-3722:
---
Attachment: Placement policies in Ozone.pdf

> Network Topology improvements post 0.6.0
> 
>
> Key: HDDS-3722
> URL: https://issues.apache.org/jira/browse/HDDS-3722
> Project: Hadoop Distributed Data Store
>  Issue Type: Improvement
>Affects Versions: 0.6.0
>Reporter: Siddharth Wagle
>Priority: Major
> Attachments: Placement policies in Ozone.pdf
>
>
> This is an umbrella Jira to address improvements suggested in HDDS-698.






[jira] [Comment Edited] (HDDS-3721) Implement getContentSummary to provide replicated size properly to dfs -du command

2020-06-21 Thread Istvan Fajth (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-3721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17141477#comment-17141477
 ] 

Istvan Fajth edited comment on HDDS-3721 at 6/21/20, 1:15 PM:
--

As I see it, the use cases where it is important to have this value correct 
are:
- defining and validating quota
- discovering how much data a cluster has in certain places and how much it 
grows over time
- operators reviewing and reporting data usage of different tenants in order to 
help negotiate how to optimize space between them

For quota we need exact values at any point in time, however for that we need 
it on the server side.
For the other use cases we might be fine with close-to-accurate data, but I 
guess we still need some kind of summary anyway, and I would definitely avoid 
the way it works in HDFS.

I think it might be best if we hold the data together with the keys in RocksDB. 
We might not need immediate updates; instead, after any write we could update 
the occupied space in a background thread outside of the locks, and provide a 
re-calculation or validation after restart or on demand. I am unsure how much 
extra load this puts on the DB, but with the prefix table we can aggregate the 
data for all the directories on write, making the summary a single RPC call...
Does this sound viable? It means two extra long values, so 16 bytes for all 
keys in the DB, and a calculation overhead outside the locks for all 
directories, if we do not require strong consistency between the real and 
stored data; if we do require strong consistency, then we need it inside the 
lock, which means an update for all parent directories...
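
Purely as an illustration of the aggregation idea, with an in-memory map 
standing in for the RocksDB-backed prefix table (all names here are made up, 
this is not OM code):
{code:java}
// walk up the directory prefixes of a key and add the size delta to each,
// outside of the OM locks; a background thread could apply these updates
static void propagateSizeDelta(Map<String, Long> prefixSizes,
    String keyPath, long deltaBytes) {
  for (int slash = keyPath.lastIndexOf('/'); slash > 0;
       slash = keyPath.lastIndexOf('/', slash - 1)) {
    prefixSizes.merge(keyPath.substring(0, slash), deltaBytes, Long::sum);
  }
}
{code}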


was (Author: pifta):
As I see the call, the use cases where it is important to have the value 
correctly are:
- define and validate quota
- discover how much data a cluster have in certain places and how much it grows 
over time
- operators to review and report data usage of different tenants in order to 
help negotiate how to optimize space between them

For quota we need exact values at any point in time, hiwever for that we need 
it on the server side.
For other use cases we might be fine semi accurate data, but still I guess we 
need some kind of summary anyway but I would definitely avoid the way how it 
works in HDFS.

I think this might be best if we hold the data with the keys in rocksdb, we 
might not need immediate updates but probably after any writes we might update 
the space occupied in a bg thread outside of the locks, and provide a 
re-calculation or validation after restart or on demand. I am unsure how much 
extra load it puts to the db, but with the prefix table we can aggregate the 
data for all the directories on write that way making it a single rpc call...
Is it sounds viable? This means an extra long value so 8 bytes for all keys in 
the db, and a calculation overhead outside locks for all directories, if we do 
not require strong consistency between the real and stored data.

> Implement getContentSummary to provide replicated size properly to dfs -du 
> command
> --
>
> Key: HDDS-3721
> URL: https://issues.apache.org/jira/browse/HDDS-3721
> Project: Hadoop Distributed Data Store
>  Issue Type: Improvement
>Reporter: Istvan Fajth
>Assignee: Istvan Fajth
>Priority: Major
>  Labels: Triaged
>
> Currently when you run the hdfs dfs -du command against a path on Ozone, it 
> uses the default implementation from the FileSystem class in the Hadoop 
> project, which does not take the replication factor into account by default. 
> In DistributedFileSystem and in a couple of other FileSystem implementations 
> there is an override to calculate the full replicated size properly.
> Currently the output is something like this for a folder that has files with a 
> replication factor of 3:
> {code}
> hdfs dfs -du -s -h o3fs://perfbucket.volume.ozone1/terasort/datagen
> 931.3 G  931.3 G  o3fs://perfbucket.volume.ozone1/terasort/datagen
> {code}
> In Ozone's case the command should also report the replicated size as the 
> second number, so something around 2.7 TB in this case.
> In order to do so, we should implement getContentSummary and calculate the 
> replicated size in the response properly.






[jira] [Commented] (HDDS-3721) Implement getContentSummary to provide replicated size properly to dfs -du command

2020-06-21 Thread Istvan Fajth (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-3721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17141477#comment-17141477
 ] 

Istvan Fajth commented on HDDS-3721:


As I see it, the use cases where it is important to have this value correct 
are:
- defining and validating quota
- discovering how much data a cluster has in certain places and how much it 
grows over time
- operators reviewing and reporting data usage of different tenants in order to 
help negotiate how to optimize space between them

For quota we need exact values at any point in time, however for that we need 
it on the server side.
For the other use cases we might be fine with semi-accurate data, but I guess 
we still need some kind of summary anyway, and I would definitely avoid the way 
it works in HDFS.

I think it might be best if we hold the data together with the keys in RocksDB. 
We might not need immediate updates; instead, after any write we could update 
the occupied space in a background thread outside of the locks, and provide a 
re-calculation or validation after restart or on demand. I am unsure how much 
extra load this puts on the DB, but with the prefix table we can aggregate the 
data for all the directories on write, making the summary a single RPC call...
Does this sound viable? It means an extra long value, so 8 bytes for all keys 
in the DB, and a calculation overhead outside the locks for all directories, if 
we do not require strong consistency between the real and stored data.

> Implement getContentSummary to provide replicated size properly to dfs -du 
> command
> --
>
> Key: HDDS-3721
> URL: https://issues.apache.org/jira/browse/HDDS-3721
> Project: Hadoop Distributed Data Store
>  Issue Type: Improvement
>Reporter: Istvan Fajth
>Assignee: Istvan Fajth
>Priority: Major
>  Labels: Triaged
>
> Currently when you run the hdfs dfs -du command against a path on Ozone, it 
> uses the default implementation from the FileSystem class in the Hadoop 
> project, which does not take the replication factor into account by default. 
> In DistributedFileSystem and in a couple of other FileSystem implementations 
> there is an override to calculate the full replicated size properly.
> Currently the output is something like this for a folder that has files with a 
> replication factor of 3:
> {code}
> hdfs dfs -du -s -h o3fs://perfbucket.volume.ozone1/terasort/datagen
> 931.3 G  931.3 G  o3fs://perfbucket.volume.ozone1/terasort/datagen
> {code}
> In Ozone's case the command should also report the replicated size as the 
> second number, so something around 2.7 TB in this case.
> In order to do so, we should implement getContentSummary and calculate the 
> replicated size in the response properly.






[jira] [Commented] (HDDS-3721) Implement getContentSummary to provide replicated size properly to dfs -du command

2020-06-18 Thread Istvan Fajth (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-3721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17139500#comment-17139500
 ] 

Istvan Fajth commented on HDDS-3721:


Discussed this with [~weichiu] and he brought up another point: snapshots, 
which we cannot deal with on the client side. Another possible issue is quota, 
and EC containers might also be a problem when calculating the occupied space 
for them on the client side... (although EC, depending on the implementation, 
might be too complex or impossible on the server side as well :) )

> Implement getContentSummary to provide replicated size properly to dfs -du 
> command
> --
>
> Key: HDDS-3721
> URL: https://issues.apache.org/jira/browse/HDDS-3721
> Project: Hadoop Distributed Data Store
>  Issue Type: Improvement
>Reporter: Istvan Fajth
>Assignee: Istvan Fajth
>Priority: Major
>  Labels: Triaged
>
> Currently when you run the hdfs dfs -du command against a path on Ozone, it 
> uses the default implementation from the FileSystem class in the Hadoop 
> project, which does not take the replication factor into account by default. 
> In DistributedFileSystem and in a couple of other FileSystem implementations 
> there is an override to calculate the full replicated size properly.
> Currently the output is something like this for a folder that has files with a 
> replication factor of 3:
> {code}
> hdfs dfs -du -s -h o3fs://perfbucket.volume.ozone1/terasort/datagen
> 931.3 G  931.3 G  o3fs://perfbucket.volume.ozone1/terasort/datagen
> {code}
> In Ozone's case the command should also report the replicated size as the 
> second number, so something around 2.7 TB in this case.
> In order to do so, we should implement getContentSummary and calculate the 
> replicated size in the response properly.






[jira] [Commented] (HDDS-2695) SCM is not able to start under certain conditions

2020-06-16 Thread Istvan Fajth (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17136559#comment-17136559
 ] 

Istvan Fajth commented on HDDS-2695:


It was probably specific to an environment back when I ran into this problem, 
but it did not get investigated deeply at that time. I am unsure whether it is 
still reproducible; I can give it a try, however there have been a lot of 
improvements around writes and Ratis recently, so I assume this was probably 
already fixed, if it wasn't strictly related to our filesystem implementation. 
We have plans to test this further, but it is low on the list, so I am open to 
closing this issue now if no one else has seen it, and it seems no one has; we 
can reopen it if we get back to that testing.

> SCM is not able to start under certain conditions
> -
>
> Key: HDDS-2695
> URL: https://issues.apache.org/jira/browse/HDDS-2695
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: SCM
>Reporter: Istvan Fajth
>Assignee: Istvan Fajth
>Priority: Critical
>  Labels: Triaged
>
> Given
> - a cluster where RATIS-677 happened, and DataNodes are already failing to 
> start properly due to the issue
> When
> - I restart the cluster and start to see the exceptions as described in 
> RATIS-677
> - I stop the 3 DN that has the failing pipeline
> - remove the ratis metadata for the pipeline
> - close the pipeline with scmcli
> - restart the 3 DN
> Then
> - SCM is unable to come out of safe mode, the log shows the following 
> possible reason:
> {code}
> 2019-12-09 01:13:38,437 INFO 
> org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager: SCM in safe mode. 
> Pipelines with at least one datanode reported count is 0, required at least 
> one datanode reported per pipeline count is 4
> {code}
> If after this I restart the SCM, it fails without logging any exception, and 
> the standard error contains the following message as the last one:
> {code}
> PipelineID= not found
> {code}
> Also scmcli did not list the closed pipeline after I closed it and checked 
> the active pipelines.






[jira] [Comment Edited] (HDDS-3721) Implement getContentSummary to provide replicated size properly to dfs -du command

2020-06-12 Thread Istvan Fajth (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-3721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17134151#comment-17134151
 ] 

Istvan Fajth edited comment on HDDS-3721 at 6/12/20, 11:49 AM:
---

At first I thought this was simply a client-side problem, but after going into 
the details a bit, I realised that there might be a reason why HDFS has this on 
the server side, and started to look into it, but then I had to put this one 
aside for a bit.

The benefit of approaching this from the client side is that it stays on the 
client side and avoids a heavy implementation on the OM side. On the other 
hand, it is painfully slow, and the runtime scales with the number of elements 
in a directory: it was running for ~25 seconds on a folder with 82k files in 
3.5k subfolders.
The problem with the client-side approach is that it leads to 4 calls per 
subdirectory (14k calls in this case)... 1 READ_BUCKET, then 1 GET_FILE_STATUS 
(to see whether it is a file or a directory), then if it is a directory 1 
READ_BUCKET again, and finally a LIST_STATUS, which cannot be controlled or 
throttled much by the server side, as these calls come from the client side and 
possibly from multiple clients at the same time. Perhaps READ_BUCKET is just a 
separate audit log message that does not represent an RPC, I haven't checked 
this yet, but it still means at least 7k RPC calls.


The benefit of having something similar in the OM API is that it is just one 
call, and we can do throttling and any kind of optimisation on the OM side as 
needed, and we might ultimately cache the values if that becomes necessary.
The problem with this approach is that it possibly requires a lock and is an 
operation that blocks OM for too long... I am unsure though whether we even 
need the read lock.


[~arp], can you give some insight into why you would like to avoid implementing 
this on the OM side, and perhaps why, in the end, it was implemented on the 
server side for HDFS?


was (Author: pifta):
At first I thought this one is simply a client side problem, but after going 
into the details a bit, I realised that there might be a reason why HDFS has 
this on the server side, and started to check into it, but then I had to put 
this one aside a bit.

The benefits of approaching this from the client side, is that it stays on the 
client side, and avoids a heavy implementation on the OM side, on the other 
hand on the OM side, on the other hand, it is painfully slow, and the runtime 
scales up with the number of elements in a directory, it was running for ~25 
seconds on a folder with 82k files in 3.5k subfolders.
The problem of approaching this from the client side, is that it leads to 4 
calls per subdirectory (14k calls in this case)... 1 READ_BUCKET, then 1 
GET_FILE_STATUS (to see if this is a file or a dir), then if it is a directory 
1 READ_BUCKET again, and finally a LIST_STATUS, which then can not be 
controlled or throttled by the server side much as these are coming from the 
client side and from possibly multiple clients at some times.


The benefit of having something similar in the OM API, is to have just one 
call, and we can do throttling and any kind of optimisation on the OM side as 
needed, and we might ultimately cache the values even if that becomes necessary.
The problem of this approach is that it possibly requires a lock, and is an 
operation that is blocking OM for too long... I am unsure though whether we 
need even the read lock.


[~arp], can you give some insight why you would like to avoid implementing this 
on OM side, perhaps why at the end was it implemented on the server side for 
HDFS?

> Implement getContentSummary to provide replicated size properly to dfs -du 
> command
> --
>
> Key: HDDS-3721
> URL: https://issues.apache.org/jira/browse/HDDS-3721
> Project: Hadoop Distributed Data Store
>  Issue Type: Improvement
>Reporter: Istvan Fajth
>Assignee: Istvan Fajth
>Priority: Major
>  Labels: Triaged
>
> Currently when you run the hdfs dfs -du command against a path on Ozone, it 
> uses the default implementation from the FileSystem class in the Hadoop 
> project, which does not take the replication factor into account by default. 
> In DistributedFileSystem and in a couple of other FileSystem implementations 
> there is an override to calculate the full replicated size properly.
> Currently the output is something like this for a folder that has files with a 
> replication factor of 3:
> {code}
> hdfs dfs -du -s -h o3fs://perfbucket.volume.ozone1/terasort/datagen
> 931.3 G  931.3 G  o3fs://perfbucket.volume.ozone1/terasort/datagen
> {code}
> In Ozone's case the command should also report the replicated size as the 
> second 

[jira] [Comment Edited] (HDDS-3721) Implement getContentSummary to provide replicated size properly to dfs -du command

2020-06-12 Thread Istvan Fajth (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-3721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17134151#comment-17134151
 ] 

Istvan Fajth edited comment on HDDS-3721 at 6/12/20, 11:48 AM:
---

At first I thought this was simply a client-side problem, but after going into 
the details a bit, I realised that there might be a reason why HDFS has this on 
the server side, and started to look into it, but then I had to put this one 
aside for a bit.

The benefit of approaching this from the client side is that it stays on the 
client side and avoids a heavy implementation on the OM side. On the other 
hand, it is painfully slow, and the runtime scales with the number of elements 
in a directory: it was running for ~25 seconds on a folder with 82k files in 
3.5k subfolders.
The problem with the client-side approach is that it leads to 4 calls per 
subdirectory (14k calls in this case)... 1 READ_BUCKET, then 1 GET_FILE_STATUS 
(to see whether it is a file or a directory), then if it is a directory 1 
READ_BUCKET again, and finally a LIST_STATUS, which cannot be controlled or 
throttled much by the server side, as these calls come from the client side and 
possibly from multiple clients at the same time.


The benefit of having something similar in the OM API is that it is just one 
call, and we can do throttling and any kind of optimisation on the OM side as 
needed, and we might ultimately cache the values if that becomes necessary.
The problem with this approach is that it possibly requires a lock and is an 
operation that blocks OM for too long... I am unsure though whether we even 
need the read lock.


[~arp], can you give some insight into why you would like to avoid implementing 
this on the OM side, and perhaps why, in the end, it was implemented on the 
server side for HDFS?


was (Author: pifta):
At first I thought this one is simply a client side problem, but after going 
into the details a bit, I realised that there might be a reason why HDFS has 
this on the server side, and started to check into it, but then I had to put 
this one aside a bit.

The benefits of approaching this from the client side, is that it stays on the 
client side, and avoids a heavy implementation on the OM side, on the other 
hand on the OM side, on the other hand, it is painfully slow, and the runtime 
scales up with the number of elements in a directory, it was running for ~25 
seconds on a folder with 82k files in 3.5k subfolders.
The problem of approaching this from the client side, is that it leads to 4 
calls per subdirectory... 1 READ_BUCKET, then 1 GET_FILE_STATUS (to see if this 
is a file or a dir), then if it is a directory 1 READ_BUCKET again, and finally 
a LIST_STATUS, which then can not be controlled or throttled by the server side 
much as these are coming from the client side and from possibly multiple 
clients at some times.


The benefit of having something similar in the OM API, is to have just one 
call, and we can do throttling and any kind of optimisation on the OM side as 
needed, and we might ultimately cache the values even if that becomes necessary.
The problem of this approach is that it possibly requires a lock, and is an 
operation that is blocking OM for too long... I am unsure though whether we 
need even the read lock.


[~arp], can you give some insight why you would like to avoid implementing this 
on OM side, perhaps why at the end was it implemented on the server side for 
HDFS?

> Implement getContentSummary to provide replicated size properly to dfs -du 
> command
> --
>
> Key: HDDS-3721
> URL: https://issues.apache.org/jira/browse/HDDS-3721
> Project: Hadoop Distributed Data Store
>  Issue Type: Improvement
>Reporter: Istvan Fajth
>Assignee: Istvan Fajth
>Priority: Major
>  Labels: Triaged
>
> Currently when you run the hdfs dfs -du command against a path on Ozone, it 
> uses the default implementation from the FileSystem class in the Hadoop 
> project, which does not take the replication factor into account by default. 
> In DistributedFileSystem and in a couple of other FileSystem implementations 
> there is an override to calculate the full replicated size properly.
> Currently the output is something like this for a folder that has files with a 
> replication factor of 3:
> {code}
> hdfs dfs -du -s -h o3fs://perfbucket.volume.ozone1/terasort/datagen
> 931.3 G  931.3 G  o3fs://perfbucket.volume.ozone1/terasort/datagen
> {code}
> In Ozone's case the command should also report the replicated size as the 
> second number, so something around 2.7 TB in this case.
> In order to do so, we should implement getContentSummary and calculate the 
> replicated size in the response properly in order to 

[jira] [Commented] (HDDS-3721) Implement getContentSummary to provide replicated size properly to dfs -du command

2020-06-12 Thread Istvan Fajth (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-3721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17134151#comment-17134151
 ] 

Istvan Fajth commented on HDDS-3721:


At first I thought this was simply a client-side problem, but after going into 
the details a bit, I realised that there might be a reason why HDFS has this on 
the server side, and started to look into it, but then I had to put this one 
aside for a bit.

The benefit of approaching this from the client side is that it stays on the 
client side and avoids a heavy implementation on the OM side. On the other 
hand, it is painfully slow, and the runtime scales with the number of elements 
in a directory: it was running for ~25 seconds on a folder with 82k files in 
3.5k subfolders.
The problem with the client-side approach is that it leads to 4 calls per 
subdirectory... 1 READ_BUCKET, then 1 GET_FILE_STATUS (to see whether it is a 
file or a directory), then if it is a directory 1 READ_BUCKET again, and 
finally a LIST_STATUS, which cannot be controlled or throttled much by the 
server side, as these calls come from the client side and possibly from 
multiple clients at the same time.


The benefit of having something similar in the OM API is that it is just one 
call, and we can do throttling and any kind of optimisation on the OM side as 
needed, and we might ultimately cache the values if that becomes necessary.
The problem with this approach is that it possibly requires a lock and is an 
operation that blocks OM for too long... I am unsure though whether we even 
need the read lock.


[~arp], can you give some insight into why you would like to avoid implementing 
this on the OM side, and perhaps why, in the end, it was implemented on the 
server side for HDFS?

> Implement getContentSummary to provide replicated size properly to dfs -du 
> command
> --
>
> Key: HDDS-3721
> URL: https://issues.apache.org/jira/browse/HDDS-3721
> Project: Hadoop Distributed Data Store
>  Issue Type: Improvement
>Reporter: Istvan Fajth
>Assignee: Istvan Fajth
>Priority: Major
>  Labels: Triaged
>
> Currently when you run the hdfs dfs -du command against a path on Ozone, it 
> uses the default implementation from the FileSystem class in the Hadoop 
> project, which does not take the replication factor into account by default. 
> In DistributedFileSystem and in a couple of other FileSystem implementations 
> there is an override to calculate the full replicated size properly.
> Currently the output is something like this for a folder that has files with a 
> replication factor of 3:
> {code}
> hdfs dfs -du -s -h o3fs://perfbucket.volume.ozone1/terasort/datagen
> 931.3 G  931.3 G  o3fs://perfbucket.volume.ozone1/terasort/datagen
> {code}
> In Ozone's case the command should also report the replicated size as the 
> second number, so something around 2.7 TB in this case.
> In order to do so, we should implement getContentSummary and calculate the 
> replicated size in the response properly.






[jira] [Assigned] (HDDS-3721) Implement getContentSummary to provide replicated size properly to dfs -du command

2020-06-04 Thread Istvan Fajth (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-3721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Istvan Fajth reassigned HDDS-3721:
--

Assignee: Istvan Fajth

> Implement getContentSummary to provide replicated size properly to dfs -du 
> command
> --
>
> Key: HDDS-3721
> URL: https://issues.apache.org/jira/browse/HDDS-3721
> Project: Hadoop Distributed Data Store
>  Issue Type: Improvement
>Reporter: Istvan Fajth
>Assignee: Istvan Fajth
>Priority: Major
>
> Currently when you run the hdfs dfs -du command against a path on Ozone, it 
> uses the default implementation from the FileSystem class in the Hadoop 
> project, which does not take the replication factor into account by default. 
> In DistributedFileSystem and in a couple of other FileSystem implementations 
> there is an override to calculate the full replicated size properly.
> Currently the output is something like this for a folder that has files with a 
> replication factor of 3:
> {code}
> hdfs dfs -du -s -h o3fs://perfbucket.volume.ozone1/terasort/datagen
> 931.3 G  931.3 G  o3fs://perfbucket.volume.ozone1/terasort/datagen
> {code}
> In Ozone's case the command should also report the replicated size as the 
> second number, so something around 2.7 TB in this case.
> In order to do so, we should implement getContentSummary and calculate the 
> replicated size in the response properly.






[jira] [Created] (HDDS-3721) Implement getContentSummary to provide replicated size properly to dfs -du command

2020-06-04 Thread Istvan Fajth (Jira)
Istvan Fajth created HDDS-3721:
--

 Summary: Implement getContentSummary to provide replicated size 
properly to dfs -du command
 Key: HDDS-3721
 URL: https://issues.apache.org/jira/browse/HDDS-3721
 Project: Hadoop Distributed Data Store
  Issue Type: Improvement
Reporter: Istvan Fajth


Currently when you run the hdfs dfs -du command against a path on Ozone, it 
uses the default implementation from the FileSystem class in the Hadoop 
project, which does not take the replication factor into account by default. In 
DistributedFileSystem and in a couple of other FileSystem implementations there 
is an override to calculate the full replicated size properly.

Currently the output is something like this for a folder that has files with a 
replication factor of 3:
{code}
hdfs dfs -du -s -h o3fs://perfbucket.volume.ozone1/terasort/datagen
931.3 G  931.3 G  o3fs://perfbucket.volume.ozone1/terasort/datagen
{code}

In Ozone's case the command should also report the replicated size as the 
second number, so something around 2.7 TB in this case.
In order to do so, we should implement getContentSummary and calculate the 
replicated size in the response properly.
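
A hedged sketch of what a client-side override could look like in the Ozone 
FileSystem implementation (directory counting is omitted for brevity, and the 
final implementation may well end up on the OM side instead):
{code:java}
@Override
public ContentSummary getContentSummary(Path f) throws IOException {
  long length = 0;
  long spaceConsumed = 0;
  long fileCount = 0;
  RemoteIterator<LocatedFileStatus> files = listFiles(f, true);
  while (files.hasNext()) {
    LocatedFileStatus status = files.next();
    length += status.getLen();
    // replicated size: logical length multiplied by the replication factor
    spaceConsumed += status.getLen() * status.getReplication();
    fileCount++;
  }
  return new ContentSummary.Builder()
      .length(length)
      .spaceConsumed(spaceConsumed)
      .fileCount(fileCount)
      .build();
}
{code}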






[jira] [Assigned] (HDDS-729) OzoneFileSystem doesn't support modifyAclEntries

2020-06-02 Thread Istvan Fajth (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Istvan Fajth reassigned HDDS-729:
-

Assignee: Istvan Fajth

> OzoneFileSystem doesn't support modifyAclEntries
> 
>
> Key: HDDS-729
> URL: https://issues.apache.org/jira/browse/HDDS-729
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Filesystem
>Affects Versions: 0.3.0
>Reporter: Soumitra Sulav
>Assignee: Istvan Fajth
>Priority: Major
>  Labels: TriagePending
>
> The Hive service performs a modifyAcl operation while starting, and as that 
> isn't supported, it fails to start.
> {code:java}
> hdfs dfs -setfacl -m default:user:hive:rwx 
> /warehouse/tablespace/external/hive{code}
> Exception encountered :
> {code:java}
> [hdfs@ctr-e138-1518143905142-541600-02-02 ~]$ hdfs dfs -setfacl -m 
> default:user:hive:rwx /warehouse/tablespace/external/hive
> 18/10/24 08:39:35 INFO conf.Configuration: Removed undeclared tags:
> 18/10/24 08:39:37 INFO conf.Configuration: Removed undeclared tags:
> -setfacl: Fatal internal error
> java.lang.UnsupportedOperationException: OzoneFileSystem doesn't support 
> modifyAclEntries
> at org.apache.hadoop.fs.FileSystem.modifyAclEntries(FileSystem.java:2926)
> at 
> org.apache.hadoop.fs.shell.AclCommands$SetfaclCommand.processPath(AclCommands.java:256)
> at org.apache.hadoop.fs.shell.Command.processPathInternal(Command.java:367)
> at org.apache.hadoop.fs.shell.Command.processPaths(Command.java:331)
> at org.apache.hadoop.fs.shell.Command.processPathArgument(Command.java:304)
> at org.apache.hadoop.fs.shell.Command.processArgument(Command.java:286)
> at org.apache.hadoop.fs.shell.Command.processArguments(Command.java:270)
> at 
> org.apache.hadoop.fs.shell.FsCommand.processRawArguments(FsCommand.java:120)
> at org.apache.hadoop.fs.shell.Command.run(Command.java:177)
> at org.apache.hadoop.fs.FsShell.run(FsShell.java:328)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
> at org.apache.hadoop.fs.FsShell.main(FsShell.java:391)
> 18/10/24 08:39:37 INFO conf.Configuration: Removed undeclared tags:
> {code}






[jira] [Commented] (HDDS-729) OzoneFileSystem doesn't support modifyAclEntries

2020-06-02 Thread Istvan Fajth (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17123507#comment-17123507
 ] 

Istvan Fajth commented on HDDS-729:
---

Hi [~arp],

I am unsure yet, but I can take a look at it.


> OzoneFileSystem doesn't support modifyAclEntries
> 
>
> Key: HDDS-729
> URL: https://issues.apache.org/jira/browse/HDDS-729
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Filesystem
>Affects Versions: 0.3.0
>Reporter: Soumitra Sulav
>Priority: Major
>  Labels: TriagePending
>
> The Hive service performs a modifyAcl operation while starting, and as that 
> isn't supported, it fails to start.
> {code:java}
> hdfs dfs -setfacl -m default:user:hive:rwx 
> /warehouse/tablespace/external/hive{code}
> Exception encountered :
> {code:java}
> [hdfs@ctr-e138-1518143905142-541600-02-02 ~]$ hdfs dfs -setfacl -m 
> default:user:hive:rwx /warehouse/tablespace/external/hive
> 18/10/24 08:39:35 INFO conf.Configuration: Removed undeclared tags:
> 18/10/24 08:39:37 INFO conf.Configuration: Removed undeclared tags:
> -setfacl: Fatal internal error
> java.lang.UnsupportedOperationException: OzoneFileSystem doesn't support 
> modifyAclEntries
> at org.apache.hadoop.fs.FileSystem.modifyAclEntries(FileSystem.java:2926)
> at 
> org.apache.hadoop.fs.shell.AclCommands$SetfaclCommand.processPath(AclCommands.java:256)
> at org.apache.hadoop.fs.shell.Command.processPathInternal(Command.java:367)
> at org.apache.hadoop.fs.shell.Command.processPaths(Command.java:331)
> at org.apache.hadoop.fs.shell.Command.processPathArgument(Command.java:304)
> at org.apache.hadoop.fs.shell.Command.processArgument(Command.java:286)
> at org.apache.hadoop.fs.shell.Command.processArguments(Command.java:270)
> at 
> org.apache.hadoop.fs.shell.FsCommand.processRawArguments(FsCommand.java:120)
> at org.apache.hadoop.fs.shell.Command.run(Command.java:177)
> at org.apache.hadoop.fs.FsShell.run(FsShell.java:328)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
> at org.apache.hadoop.fs.FsShell.main(FsShell.java:391)
> 18/10/24 08:39:37 INFO conf.Configuration: Removed undeclared tags:
> {code}






[jira] [Commented] (HDDS-664) Creating hive table on Ozone fails

2020-06-02 Thread Istvan Fajth (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17123505#comment-17123505
 ] 

Istvan Fajth commented on HDDS-664:
---

I guess so, based on the fact that the cause is not fixed yet.

In Cloudera Manager this seems to be set automatically, but in a general 
deployment the workaround has to be in place in non-kerberized environments, as 
I understand the issue, [~arp].

> Creating hive table on Ozone fails
> --
>
> Key: HDDS-664
> URL: https://issues.apache.org/jira/browse/HDDS-664
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: documentation
>Reporter: Namit Maheshwari
>Assignee: Hanisha Koneru
>Priority: Major
>  Labels: TriagePending, app-compat
>
> Modified HIVE_AUX_JARS_PATH to include Ozone jars. Tried creating Hive 
> external table on Ozone. It fails with "Error: Error while compiling 
> statement: FAILED: HiveAuthzPluginException Error getting permissions for 
> o3://bucket2.volume2/testo3: User: hive is not allowed to impersonate 
> anonymous (state=42000,code=4)"
> {code:java}
> -bash-4.2$ beeline
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in 
> [jar:file:/usr/hdp/3.0.3.0-63/hive/lib/log4j-slf4j-impl-2.10.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/usr/hdp/3.0.3.0-63/hadoop/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an 
> explanation.
> SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
> Connecting to 
> jdbc:hive2://ctr-e138-1518143905142-510793-01-11.hwx.site:2181,ctr-e138-1518143905142-510793-01-06.hwx.site:2181,ctr-e138-1518143905142-510793-01-08.hwx.site:2181,ctr-e138-1518143905142-510793-01-10.hwx.site:2181,ctr-e138-1518143905142-510793-01-07.hwx.site:2181/default;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2
> Enter username for 
> jdbc:hive2://ctr-e138-1518143905142-510793-01-11.hwx.site:2181,ctr-e138-1518143905142-510793-01-06.hwx.site:2181,ctr-e138-1518143905142-510793-01-08.hwx.site:2181,ctr-e138-1518143905142-510793-01-10.hwx.site:2181,ctr-e138-1518143905142-510793-01-07.hwx.site:2181/default:
> Enter password for 
> jdbc:hive2://ctr-e138-1518143905142-510793-01-11.hwx.site:2181,ctr-e138-1518143905142-510793-01-06.hwx.site:2181,ctr-e138-1518143905142-510793-01-08.hwx.site:2181,ctr-e138-1518143905142-510793-01-10.hwx.site:2181,ctr-e138-1518143905142-510793-01-07.hwx.site:2181/default:
> 18/10/15 21:36:55 [main]: INFO jdbc.HiveConnection: Connected to 
> ctr-e138-1518143905142-510793-01-04.hwx.site:1
> Connected to: Apache Hive (version 3.1.0.3.0.3.0-63)
> Driver: Hive JDBC (version 3.1.0.3.0.3.0-63)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> Beeline version 3.1.0.3.0.3.0-63 by Apache Hive
> 0: jdbc:hive2://ctr-e138-1518143905142-510793> create external table testo3 ( 
> i int, s string, d float) location "o3://bucket2.volume2/testo3";
> Error: Error while compiling statement: FAILED: HiveAuthzPluginException 
> Error getting permissions for o3://bucket2.volume2/testo3: User: hive is not 
> allowed to impersonate anonymous (state=42000,code=4)
> 0: jdbc:hive2://ctr-e138-1518143905142-510793> {code}
>  






[jira] [Commented] (HDDS-2939) Ozone FS namespace

2020-05-29 Thread Istvan Fajth (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17119653#comment-17119653
 ] 

Istvan Fajth commented on HDDS-2939:


Hi [~sdeka] and [~rakeshr],

I have gone through the doc, and one question arise during reading about 
deletion. As it seems, that one still won't be atomic for directories, as the 
doc says deletes will be a client driven thing, and directories can not be 
removed until there are entries in them.
This got me an idea, what if we introduce a preserved prefix for deleted stuff, 
and implement a garbage collector background thread that removes unnecessary 
stuff, just as with block deletion with this we can defer the heavy work to a 
place where there is no need for locking and with that we can change deletes to 
a rename first which happens atomically, while we can ensure that there won't 
be collisions or problems from removing a folder while in parallel creating a 
new file in it that is not removed.

So for example if we choose the 0 prefix id to be reserved for everything that 
was deleted, then we can implement delete as a rename to under the 0 prefix id. 
With this all things can be made unavailable immediately under any other 
prefix, if we do not assign a path to this prefix, or the path is 0 bytes for 
example. As in these cases if the path translation algorithm does not allow a 
path element to be 0 length, the prefixes and keys under this prefix will not 
be available anymore. Now with this a background thread can clean up all the 
elements periodically under this prefix predefined prefix.
Pros I see:
- atomic delete through rename
- we do not need extensive locking under the special prefix, as we can allow 
any possible name collision there because the contents of this prefix should 
not be accessible anyways, the deletion on the other hand in the background 
work based on ids, which is not ambiguous
- we do not need locking during the actual key/prefix removals, as that does 
not affect other parts of the prefix/key space, and even if there are multiple 
background cleanup thread, the operation can be seen as idempotent, but with 
just one background thread the operations are serial anyways

Cons I see:
- orphan prefixes/keys may appear if we do not introduce locking here, so we 
will definitely need orphan cleanup in the background as well, probably on the 
same thread; alternatively, as in the original proposal, we need to synchronize 
the renames to this special prefix with the deletions happening on the 
background thread, and fail the prefix removal if there are new entries under 
the prefix. I am not sure which one is easier or more beneficial... orphan 
detection, or synchronization?

Was this possibility considered? If so, why was the idea rejected in the end? 
If not, what do you think about this approach?

> Ozone FS namespace
> --
>
> Key: HDDS-2939
> URL: https://issues.apache.org/jira/browse/HDDS-2939
> Project: Hadoop Distributed Data Store
>  Issue Type: Improvement
>  Components: Ozone Manager
>Reporter: Supratim Deka
>Assignee: Rakesh Radhakrishnan
>Priority: Major
> Attachments: Ozone FS Namespace Proposal v1.0.docx
>
>
> Create the structures and metadata layout required to support efficient FS 
> namespace operations in Ozone - operations involving folders/directories 
> required to support the Hadoop compatible Filesystem interface.
> The details are described in the attached document. The work is divided up 
> into sub-tasks as per the task list in the document.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDDS-3267) Replace ContainerCache in BlockUtils by LoadingCache

2020-05-28 Thread Istvan Fajth (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-3267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Istvan Fajth resolved HDDS-3267.

Resolution: Won't Fix

> Replace ContainerCache in BlockUtils by LoadingCache
> 
>
> Key: HDDS-3267
> URL: https://issues.apache.org/jira/browse/HDDS-3267
> Project: Hadoop Distributed Data Store
>  Issue Type: Improvement
>Reporter: Isa Hekmatizadeh
>Assignee: Isa Hekmatizadeh
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> As discussed in [here|https://github.com/apache/hadoop-ozone/pull/705] 
> current version of ContainerCache is just used by BlockUtils and has several 
> architectural issues. for example:
>  * It uses a ReentrantLock which could be replaced by synchronized methods
>  * It should maintain a referenceCount for each DBHandler
>  * It extends LRUMap while it would be better to hide it by the composition 
> and not expose LRUMap related methods.
> As [~pifta] suggests, we could replace all ContainerCache functionality by 
> using Guava LoadingCache.
> This new LoadingCache could be configured to evict by size, by this 
> configuration the functionality would be slightly different as it may evict 
> DBHandlers while they are in use (referenceCount>0) but we can configure it 
> to use reference base eviction based on CacheBuilder.weakValues() 
> I want to open this discussion here instead of Github so I created this 
> ticket.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Commented] (HDDS-3630) Merge rocksdb in datanode

2020-05-28 Thread Istvan Fajth (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-3630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17118639#comment-17118639
 ] 

Istvan Fajth commented on HDDS-3630:


One important thing that came out of HDDS-3267 is that the ContainerCache 
implementation is used to cache RocksDB connections inside the DataNode. If we 
merge the RocksDB instances, we most probably should drop the ContainerCache 
and the underlying RocksDB connection caching/pooling implemented there, as it 
will not be needed anymore.

NOTE
If for some reason we do not drop the implementation, then please file a 
follow-up JIRA to fix the following flaw in that code before closing this one:
The ContainerCache code uses this.put() to put values into the LRUMap, but put 
is not overridden there; LRUMap provides addMapping(), which properly checks 
the size boundary, and since the ContainerCache code does not check the 
boundary itself, the current implementation is effectively an unlimited cache 
that claims to be limited by maxSize.

> Merge rocksdb in datanode
> -
>
> Key: HDDS-3630
> URL: https://issues.apache.org/jira/browse/HDDS-3630
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>Reporter: runzhiwang
>Assignee: runzhiwang
>Priority: Major
> Attachments: Merge RocksDB in Datanode-v1.pdf
>
>
> Currently there is one RocksDB instance per container, and one container has 
> 5GB capacity, so 10TB of data needs more than 2000 RocksDB instances on one 
> datanode. It's difficult to limit the memory of 2000 RocksDB instances, so 
> maybe we should limit the number of RocksDB instances per disk.
> The design of the improvement is in the link below, but it is still a draft. 
> TODO: 
>  1. compatibility with current logic i.e. one rocksdb for each container
>  2. measure the memory usage before and after improvement
>  3. effect on efficiency of read and write.
> https://docs.google.com/document/d/18Ybg-NjyU602c-MYXaJHP6yrg-dVMZKGyoK5C_pp1mM/edit#



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Commented] (HDDS-3267) Replace ContainerCache in BlockUtils by LoadingCache

2020-05-28 Thread Istvan Fajth (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-3267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17118635#comment-17118635
 ] 

Istvan Fajth commented on HDDS-3267:


As discussed in the PR, the problem to solve with this cache is covered by the 
following four requirements:
- we want to cache db connections and keep them live
- we want to keep the size restriction on the cache
- we want to close the db connection when an unused cached connection is 
evicted from the cache
- when a connection that is still in use gets evicted from the cache, we want 
to close it only at the end of the current use, or - god forbid - at the end of 
the last of several concurrent uses on multiple threads.

As we realised, this is not easily solved by LoadingCache alone; a sketch of 
the direction follows.
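A hedged sketch of what a Guava-based replacement could look like, using a 
hypothetical RefCountedHandle wrapper rather than the actual Ozone classes; the 
hard part is the last requirement, deferring the close until the final user 
releases the handle:

{code:java}
// Size-bounded Guava cache; eviction only marks the handle for closing, and the
// actual close happens when the last in-flight user releases it.
Cache<String, RefCountedHandle> cache = CacheBuilder.newBuilder()
    .maximumSize(maxSize)   // maxSize: the configured cache capacity
    .removalListener((RemovalListener<String, RefCountedHandle>) notification ->
        notification.getValue().closeWhenUnreferenced())
    .build();

// Callers would have to acquire/release explicitly, e.g.:
// RefCountedHandle handle = cache.get(containerDbPath, () -> openDb(containerDbPath));
// handle.acquire();
// try { ... use the db ... } finally { handle.release(); }
{code}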
Also during the discussion we realised the following flaw in the current 
caching implementation in ContainerCache:
The ContainerCache code uses this.put() to put values into the LRUMap, but put 
is not overridden there; LRUMap provides addMapping(), which properly checks 
the size boundary, and since the ContainerCache code does not check the 
boundary itself, the current implementation is effectively an unlimited cache 
that claims to be limited by maxSize.

At the end of the discussion we realised that with HDDS-3630 we probably won't 
need this whole cache anymore, so if [~esa.hekmat] agrees with [~elek] and me, 
we can close the PR and this JIRA as well.

> Replace ContainerCache in BlockUtils by LoadingCache
> 
>
> Key: HDDS-3267
> URL: https://issues.apache.org/jira/browse/HDDS-3267
> Project: Hadoop Distributed Data Store
>  Issue Type: Improvement
>Reporter: Isa Hekmatizadeh
>Assignee: Isa Hekmatizadeh
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> As discussed in [here|https://github.com/apache/hadoop-ozone/pull/705] 
> current version of ContainerCache is just used by BlockUtils and has several 
> architectural issues. for example:
>  * It uses a ReentrantLock which could be replaced by synchronized methods
>  * It should maintain a referenceCount for each DBHandler
>  * It extends LRUMap while it would be better to hide it by the composition 
> and not expose LRUMap related methods.
> As [~pifta] suggests, we could replace all ContainerCache functionality by 
> using Guava LoadingCache.
> This new LoadingCache could be configured to evict by size, by this 
> configuration the functionality would be slightly different as it may evict 
> DBHandlers while they are in use (referenceCount>0) but we can configure it 
> to use reference base eviction based on CacheBuilder.weakValues() 
> I want to open this discussion here instead of Github so I created this 
> ticket.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDDS-3607) Lot of warnings at DN startup

2020-05-18 Thread Istvan Fajth (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-3607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Istvan Fajth reassigned HDDS-3607:
--

Assignee: Istvan Fajth

> Lot of warnings at DN startup
> -
>
> Key: HDDS-3607
> URL: https://issues.apache.org/jira/browse/HDDS-3607
> Project: Hadoop Distributed Data Store
>  Issue Type: Improvement
>Reporter: Istvan Fajth
>Assignee: Istvan Fajth
>Priority: Minor
>  Labels: pull-request-available
>
> During DataNode startup, when we replay the edits to the StateMachine, a lot 
> of warnings are emitted to the log depending on the amount of replayed 
> transactions, one warning for each. The warning states the edit is ignored at 
> this point in the 
> org.apache.hadoop.ozone.container.keyvalue.impl.BlockManagerImpl class:
> {code}
> WARN org.apache.hadoop.ozone.container.keyvalue.impl.BlockManagerImpl: 
> blockCommitSequenceId 1128422 in the Container Db is greater than the 
> supplied value 1120656. Ignoring it
> {code}
> I think as this does not show any problem, and there can be an awful lot from 
> this message just spamming the log at startup, we should move this to DEBUG 
> level instead of WARN.
> Attaching a PR with the change.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Created] (HDDS-3607) Lot of warnings at DN startup

2020-05-18 Thread Istvan Fajth (Jira)
Istvan Fajth created HDDS-3607:
--

 Summary: Lot of warnings at DN startup
 Key: HDDS-3607
 URL: https://issues.apache.org/jira/browse/HDDS-3607
 Project: Hadoop Distributed Data Store
  Issue Type: Improvement
Reporter: Istvan Fajth


During DataNode startup, when we replay the edits to the StateMachine, a lot of 
warnings are emitted to the log depending on the amount of replayed 
transactions, one warning for each. The warning states the edit is ignored at 
this point in the 
org.apache.hadoop.ozone.container.keyvalue.impl.BlockManagerImpl class:

{code}
2020-05-14 18:22:53,672 WARN 
org.apache.hadoop.ozone.container.keyvalue.impl.BlockManagerImpl: 
blockCommitSequenceId 1128422 in the Container Db is greater than the supplied 
value 1120656. Ignoring it
{code}

I think, as this does not indicate any problem and there can be an awful lot 
of these messages spamming the log at startup, we should move this message to 
DEBUG level instead of WARN.
Attaching a PR with the change.
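For reference, a minimal sketch of the kind of change meant here (guarding the 
message at DEBUG level); the surrounding variable names are assumed, and the 
actual PR may differ:

{code:java}
// Lower the replay message from WARN to DEBUG so that edit replay does not
// flood the DataNode log at startup.
if (LOG.isDebugEnabled()) {
  LOG.debug("blockCommitSequenceId {} in the Container Db is greater than"
      + " the supplied value {}. Ignoring it", dbBcsId, suppliedBcsId);
}
{code}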



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-3607) Lot of warnings at DN startup

2020-05-18 Thread Istvan Fajth (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-3607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Istvan Fajth updated HDDS-3607:
---
Description: 
During DataNode startup, when we replay the edits to the StateMachine, a lot of 
warnings are emitted to the log depending on the amount of replayed 
transactions, one warning for each. The warning states the edit is ignored at 
this point in the 
org.apache.hadoop.ozone.container.keyvalue.impl.BlockManagerImpl class:

{code}
WARN org.apache.hadoop.ozone.container.keyvalue.impl.BlockManagerImpl: 
blockCommitSequenceId 1128422 in the Container Db is greater than the supplied 
value 1120656. Ignoring it
{code}

I think, as this does not indicate any problem and there can be an awful lot 
of these messages spamming the log at startup, we should move this message to 
DEBUG level instead of WARN.
Attaching a PR with the change.

  was:
During DataNode startup, when we replay the edits to the StateMachine, a lot of 
warnings are emitted to the log depending on the amount of replayed 
transactions, one warning for each. The warning states the edit is ignored at 
this point in the 
org.apache.hadoop.ozone.container.keyvalue.impl.BlockManagerImpl class:

{code}
2020-05-14 18:22:53,672 WARN 
org.apache.hadoop.ozone.container.keyvalue.impl.BlockManagerImpl: 
blockCommitSequenceId 1128422 in the Container Db is greater than the supplied 
value 1120656. Ignoring it
{code}

I think as this does not show any problem, and there can be an awful lot from 
this message just spamming the log at startup, we should move this to DEBUG 
level instead of WARN.
Attaching a PR with the change.


> Lot of warnings at DN startup
> -
>
> Key: HDDS-3607
> URL: https://issues.apache.org/jira/browse/HDDS-3607
> Project: Hadoop Distributed Data Store
>  Issue Type: Improvement
>Reporter: Istvan Fajth
>Priority: Minor
>
> During DataNode startup, when we replay the edits to the StateMachine, a lot 
> of warnings are emitted to the log depending on the amount of replayed 
> transactions, one warning for each. The warning states the edit is ignored at 
> this point in the 
> org.apache.hadoop.ozone.container.keyvalue.impl.BlockManagerImpl class:
> {code}
> WARN org.apache.hadoop.ozone.container.keyvalue.impl.BlockManagerImpl: 
> blockCommitSequenceId 1128422 in the Container Db is greater than the 
> supplied value 1120656. Ignoring it
> {code}
> I think as this does not show any problem, and there can be an awful lot from 
> this message just spamming the log at startup, we should move this to DEBUG 
> level instead of WARN.
> Attaching a PR with the change.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDDS-3506) TestOzoneFileInterfaces is flaky

2020-04-28 Thread Istvan Fajth (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-3506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Istvan Fajth reassigned HDDS-3506:
--

Assignee: Istvan Fajth

> TestOzoneFileInterfaces is flaky
> 
>
> Key: HDDS-3506
> URL: https://issues.apache.org/jira/browse/HDDS-3506
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>Reporter: Marton Elek
>Assignee: Istvan Fajth
>Priority: Critical
> Attachments: 
> TEST-org.apache.hadoop.fs.ozone.TestOzoneFileInterfaces.xml
>
>
> TestOzoneFileInterfaces.testOzoneManagerLocatedFileStatusBlockOffsetsWithMultiBlockFile
>  is flaky and failed multiple times on master:
> {code}
> ./2020/04/24/822/it-filesystem/hadoop-ozone/integration-test/org.apache.hadoop.fs.ozone.TestOzoneFileInterfaces.txt
> ./2020/04/24/822/it-filesystem/hadoop-ozone/integration-test/TEST-org.apache.hadoop.fs.ozone.TestOzoneFileInterfaces.xml
> ./2020/04/24/822/it-filesystem/output.log
> ./2020/04/27/830/it-filesystem/hadoop-ozone/integration-test/org.apache.hadoop.fs.ozone.TestOzoneFileInterfaces.txt
> ./2020/04/27/830/it-filesystem/hadoop-ozone/integration-test/TEST-org.apache.hadoop.fs.ozone.TestOzoneFileInterfaces.xml
> ./2020/04/27/830/it-filesystem/output.log
> ./2020/04/28/831/it-filesystem/hadoop-ozone/integration-test/org.apache.hadoop.fs.ozone.TestOzoneFileInterfaces.txt
> ./2020/04/28/831/it-filesystem/hadoop-ozone/integration-test/TEST-org.apache.hadoop.fs.ozone.TestOzoneFileInterfaces.xml
> ./2020/04/28/831/it-filesystem/output.log
> ./2020/04/28/833/it-filesystem/hadoop-ozone/integration-test/org.apache.hadoop.fs.ozone.TestOzoneFileInterfaces.txt
> ./2020/04/28/833/it-filesystem/hadoop-ozone/integration-test/TEST-org.apache.hadoop.fs.ozone.TestOzoneFileInterfaces.xml
> ./2020/04/28/833/it-filesystem/output.log
> {code}
> I am disabling it until the fix



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-3033) Remove TestContainerPlacement add comprehensive unit level tests to container placement policies

2020-04-20 Thread Istvan Fajth (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-3033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Istvan Fajth updated HDDS-3033:
---
Summary: Remove TestContainerPlacement add comprehensive unit level tests 
to container placement policies  (was: Remove TestContainerPlacement add 
comprehensive junit level tests to container placement policies)

> Remove TestContainerPlacement add comprehensive unit level tests to container 
> placement policies
> 
>
> Key: HDDS-3033
> URL: https://issues.apache.org/jira/browse/HDDS-3033
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>Reporter: Istvan Fajth
>Assignee: Istvan Fajth
>Priority: Major
>
> The TestContainerPlacement test is currently ignored. Its internals after 
> thoroughly reviewing it are heavily dependent on internal implementations of 
> the container placement logic and related classes, hence it makes the test 
> dependent on internals of unrelated stuff, and hard to maintain due to 
> constant need to tune the test to how the internals behave.
> Based on this the suggestion is to remove this test, instead of fixing and 
> re-enabling it, and complement the missing test with unit level tests and 
> test the container placement policies' logic from closer and more thoroughly.
> During review it turned out that the SCMContainerPlacementRackAware class, 
> along with the PipelinePlacementPolicy inherits from the 
> SCMCommonPlacementPolicy, but misses the unified checks performed by all 
> other placement policies via the implementation in the 
> SCMCommonPlacementPolicy. The checks are:
> - the policy is there and can be loaded (Factory level)
> - the policy construction was successful
> - the number of available healthy nodes are sufficient to serve the requested 
> amount of nodes to the pipeline.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Commented] (HDDS-3033) Remove TestContainerPlacement add comprehensive junit level tests to container placement policies

2020-04-20 Thread Istvan Fajth (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-3033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088131#comment-17088131
 ] 

Istvan Fajth commented on HDDS-3033:


Besides adding the unit level tests for all the placement policies, the 
following were found and are targeted by this JIRA (the list may evolve as I am 
working on the PR):
- The PipelinePlacementPolicy and SCMContainerPlacementRackAware classes do not 
use the unified checks, and do not conform with the error messages of the other 
policies (a sketch of such a check follows this list).
- The rack aware policies, in case of fallback, do not fall back under certain 
conditions, as shown by some tests that were failing with the original code.
- A few error messages did not supply a unique error code with the thrown 
SCMException.
- Small refactoring in how the Factory creates policies, plus some code 
reorganization.
- Some changes were needed to fix failures in newly added tests for the 
SCMContainerPlacementCapacity class.
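A hedged sketch of the kind of unified pre-check referred to above; the method 
shape is hypothetical, not the actual SCMCommonPlacementPolicy code:

{code:java}
// Fail early with a descriptive SCMException when the number of healthy nodes
// cannot satisfy the request, before any placement logic runs.
void validateRequest(List<DatanodeDetails> healthyNodes, int nodesRequired)
    throws SCMException {
  if (healthyNodes.size() < nodesRequired) {
    String msg = String.format(
        "Not enough healthy datanodes to choose from. Required: %d, healthy: %d",
        nodesRequired, healthyNodes.size());
    throw new SCMException(msg,
        SCMException.ResultCodes.FAILED_TO_FIND_SUITABLE_NODE);
  }
}
{code}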

> Remove TestContainerPlacement add comprehensive junit level tests to 
> container placement policies
> -
>
> Key: HDDS-3033
> URL: https://issues.apache.org/jira/browse/HDDS-3033
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>Reporter: Istvan Fajth
>Assignee: Istvan Fajth
>Priority: Major
>
> The TestContainerPlacement test is currently ignored. Its internals after 
> thoroughly reviewing it are heavily dependent on internal implementations of 
> the container placement logic and related classes, hence it makes the test 
> dependent on internals of unrelated stuff, and hard to maintain due to 
> constant need to tune the test to how the internals behave.
> Based on this the suggestion is to remove this test, instead of fixing and 
> re-enabling it, and complement the missing test with unit level tests and 
> test the container placement policies' logic from closer and more thoroughly.
> During review it turned out that the SCMContainerPlacementRackAware class, 
> along with the PipelinePlacementPolicy inherits from the 
> SCMCommonPlacementPolicy, but misses the unified checks performed by all 
> other placement policies via the implementation in the 
> SCMCommonPlacementPolicy. The checks are:
> - the policy is there and can be loaded (Factory level)
> - the policy construction was successful
> - the number of available healthy nodes are sufficient to serve the requested 
> amount of nodes to the pipeline.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-3033) Remove TestContainerPlacement add comprehensive junit level tests to container placement policies

2020-04-20 Thread Istvan Fajth (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-3033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Istvan Fajth updated HDDS-3033:
---
Description: 
The TestContainerPlacement test is currently ignored. After thoroughly 
reviewing it, its internals turn out to be heavily dependent on the internal 
implementation of the container placement logic and related classes; this makes 
the test dependent on the internals of unrelated code, and hard to maintain due 
to the constant need to tune the test to how those internals behave.
Based on this, the suggestion is to remove this test instead of fixing and 
re-enabling it, and to complement the missing coverage with unit level tests 
that exercise the container placement policies' logic more closely and more 
thoroughly.

During review it turned out that the SCMContainerPlacementRackAware class, 
along with the PipelinePlacementPolicy, inherits from the 
SCMCommonPlacementPolicy but misses the unified checks performed by all other 
placement policies via the implementation in the SCMCommonPlacementPolicy. The 
checks are:
- the policy is there and can be loaded (Factory level)
- the policy construction was successful
- the number of available healthy nodes is sufficient to serve the requested 
amount of nodes for the pipeline.


  was:Remove Ignore annotation and fix the test.


> Remove TestContainerPlacement add comprehensive junit level tests to 
> container placement policies
> -
>
> Key: HDDS-3033
> URL: https://issues.apache.org/jira/browse/HDDS-3033
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>Reporter: Istvan Fajth
>Assignee: Istvan Fajth
>Priority: Major
>
> The TestContainerPlacement test is currently ignored. Its internals after 
> thoroughly reviewing it are heavily dependent on internal implementations of 
> the container placement logic and related classes, hence it makes the test 
> dependent on internals of unrelated stuff, and hard to maintain due to 
> constant need to tune the test to how the internals behave.
> Based on this the suggestion is to remove this test, instead of fixing and 
> re-enabling it, and complement the missing test with unit level tests and 
> test the container placement policies' logic from closer and more thoroughly.
> During review it turned out that the SCMContainerPlacementRackAware class, 
> along with the PipelinePlacementPolicy inherits from the 
> SCMCommonPlacementPolicy, but misses the unified checks performed by all 
> other placement policies via the implementation in the 
> SCMCommonPlacementPolicy. The checks are:
> - the policy is there and can be loaded (Factory level)
> - the policy construction was successful
> - the number of available healthy nodes are sufficient to serve the requested 
> amount of nodes to the pipeline.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-3033) Remove TestContainerPlacement add comprehensive junit level tests to container placement policies

2020-04-20 Thread Istvan Fajth (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-3033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Istvan Fajth updated HDDS-3033:
---
Summary: Remove TestContainerPlacement add comprehensive junit level tests 
to container placement policies  (was: Remove TestContainerPlaement add 
comprehensive junit level tests to container placement policies)

> Remove TestContainerPlacement add comprehensive junit level tests to 
> container placement policies
> -
>
> Key: HDDS-3033
> URL: https://issues.apache.org/jira/browse/HDDS-3033
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>Reporter: Istvan Fajth
>Assignee: Istvan Fajth
>Priority: Major
>
> Remove Ignore annotation and fix the test.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-3033) Remove TestContainerPlaement add comprehensive junit level tests to container placement policies

2020-04-20 Thread Istvan Fajth (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-3033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Istvan Fajth updated HDDS-3033:
---
Summary: Remove TestContainerPlaement add comprehensive junit level tests 
to container placement policies  (was: Fix TestContainerPlacement)

> Remove TestContainerPlaement add comprehensive junit level tests to container 
> placement policies
> 
>
> Key: HDDS-3033
> URL: https://issues.apache.org/jira/browse/HDDS-3033
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>Reporter: Istvan Fajth
>Assignee: Istvan Fajth
>Priority: Major
>
> Remove Ignore annotation and fix the test.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Created] (HDDS-3386) Remove unnecessary transitive hadoop-common dependencies on server side (addendum)

2020-04-14 Thread Istvan Fajth (Jira)
Istvan Fajth created HDDS-3386:
--

 Summary: Remove unnecessary transitive hadoop-common dependencies 
on server side (addendum)
 Key: HDDS-3386
 URL: https://issues.apache.org/jira/browse/HDDS-3386
 Project: Hadoop Distributed Data Store
  Issue Type: Improvement
Reporter: Istvan Fajth
Assignee: Istvan Fajth


In HDDS-3353 we created two new modules, hadoop-hdds-dependency-test and 
hadoop-hdds-dependency-server, to manage the hadoop-common dependency and the 
excludes related to it in a common place, similarly to the already existing 
hadoop-hdds-dependency-client.

The following modules still depend on hadoop-common for tests instead of the 
new test dependency:
hadoop-hdds/client
hadoop-hdds/container-service
hadoop-ozone/common
hadoop-ozone/fault-injection-test/mini-chaos-tests
hadoop-ozone/insight
hadoop-ozone/integration-test
hadoop-ozone/tools

In hadoop-dependency-client we exclude curator packages by name; similarly to 
the new modules, we should instead exclude all curator packages.

In TestVolumeSetDiskChecks.java we still have an accidental shaded import from 
curator:  import 
org.apache.curator.shaded.com.google.common.collect.ImmutableSet;



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Created] (HDDS-3351) Remove unnecessary dependency Curator

2020-04-06 Thread Istvan Fajth (Jira)
Istvan Fajth created HDDS-3351:
--

 Summary: Remove unnecessary dependency Curator
 Key: HDDS-3351
 URL: https://issues.apache.org/jira/browse/HDDS-3351
 Project: Hadoop Distributed Data Store
  Issue Type: Improvement
  Components: build
Reporter: Istvan Fajth


It seems that when we separated the main pom.xml from the Hadoop pom, we copied 
most of the dependencies blindly to the Ozone pom, and we still carry some of 
them.

During an internal dependency check it turned out that we still have curator 
defined as a dependency, even though we no longer have Zookeeper and we do not 
use it.

I am posting a PR to remove curator from the dependencies; locally, mvn clean 
install ran fine with it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Commented] (HDDS-3077) Lost volume after changing access key

2020-03-30 Thread Istvan Fajth (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-3077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17070956#comment-17070956
 ] 

Istvan Fajth commented on HDDS-3077:


Hi [~bharat],

the scenario we were trying out with Beata is to set up an HBase cluster that 
uses Ozone via the S3 Gateway as the object store behind the HBase data files, 
while HBase still uses HDFS for the WAL files.

We ran into some problems during the trials, and we were unsure whether the 
access key or the secret key format was causing the problem; we then realized 
that we are not allowed to change the access key if we do not want to lose 
reachability of the data in the S3 buckets we have already created.
I suggested reporting this behaviour because, in S3 at least, the user is 
allowed to re-generate the access key, but if we bind the volume that contains 
the buckets to the access key, we effectively make the user unable to 
re-generate it, and this might be unexpected from the perspective of a user who 
wants to use Ozone the same way they use S3.

> Lost volume after changing access key
> -
>
> Key: HDDS-3077
> URL: https://issues.apache.org/jira/browse/HDDS-3077
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>Reporter: Beata Sudi
>Priority: Critical
>
> When using the S3 API, Ozone generates the volume depending on the user's 
> access key.  When the access key is changed, it becomes no longer reachable 
> by the user.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Created] (HDDS-3033) Fix TestContainerPlacement

2020-02-19 Thread Istvan Fajth (Jira)
Istvan Fajth created HDDS-3033:
--

 Summary: Fix TestContainerPlacement
 Key: HDDS-3033
 URL: https://issues.apache.org/jira/browse/HDDS-3033
 Project: Hadoop Distributed Data Store
  Issue Type: Sub-task
Reporter: Istvan Fajth
Assignee: Istvan Fajth


Remove Ignore annotation and fix the test.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDDS-2914) Certain Hive queries started to fail on generating splits

2020-02-18 Thread Istvan Fajth (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Istvan Fajth reassigned HDDS-2914:
--

Assignee: Istvan Fajth  (was: Aravindan Vijayan)

> Certain Hive queries started to fail on generating splits
> -
>
> Key: HDDS-2914
> URL: https://issues.apache.org/jira/browse/HDDS-2914
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>Reporter: Istvan Fajth
>Assignee: Istvan Fajth
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.5.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> After updating a cluster where I am running TPCDS queries, some queries 
> started to fail.
> The update happened from an early dec state to the jan 10 state of master.
> Most likely the addition of HDDS-2188 is related to the problem, but it is 
> still under investigation.
> The exception I see in the queries:
> {code}
> [ERROR] [Dispatcher thread {Central}] |impl.VertexImpl|: Vertex Input: 
> inventory initializer failed, vertex=vertex_ [Map 1]
> org.apache.tez.dag.app.dag.impl.AMUserCodeException: 
> java.lang.RuntimeException: ORC split generation failed with exception: 
> java.io.IOException: File 
> o3fs://hive.warehouse.fqdn:9862/warehouse/tablespace/managed/hive/100/inventory/delta_001_001_/bucket_0
>  should have had overlap on block starting at 0
>   at 
> org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallback.onFailure(RootInputInitializerManager.java:328)
>   at com.google.common.util.concurrent.Futures$6.run(Futures.java:1764)
>   at 
> com.google.common.util.concurrent.MoreExecutors$DirectExecutor.execute(MoreExecutors.java:456)
>   at 
> com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:817)
>   at 
> com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:753)
>   at 
> com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:634)
>   at 
> com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:110)
>   at 
> com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:41)
>   at 
> com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:77)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.RuntimeException: ORC split generation failed with 
> exception: java.io.IOException: File 
> o3fs://hive.warehouse.fqdn:9862/warehouse/tablespace/managed/hive/100/inventory/delta_001_001_/bucket_0
>  should have had overlap on block starting at 0
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1915)
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:2002)
>   at 
> org.apache.hadoop.hive.ql.io.HiveInputFormat.addSplitsForGroup(HiveInputFormat.java:532)
>   at 
> org.apache.hadoop.hive.ql.io.HiveInputFormat.getSplits(HiveInputFormat.java:789)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:243)
>   at 
> org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:278)
>   at 
> org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:269)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876)
>   at 
> org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:269)
>   at 
> org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:253)
>   at 
> com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:108)
>   ... 5 more
> Caused by: java.util.concurrent.ExecutionException: java.io.IOException: File 
> o3fs://hive.warehouse.fqdn:9862/warehouse/tablespace/managed/hive/100/inventory/delta_001_001_/bucket_0
>  should have had overlap on block starting at 0
>   at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>   at java.util.concurrent.FutureTask.get(FutureTask.java:192)
>   at 
> 

[jira] [Commented] (HDDS-2914) Certain Hive queries started to fail on generating splits

2020-02-18 Thread Istvan Fajth (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17039295#comment-17039295
 ] 

Istvan Fajth commented on HDDS-2914:


After some testing we realised that the problem is not that the clients of the 
LocatedFileStatus API expect the index of the block in the array of blocks; 
they expect the offset of the block inside the file, as the documentation 
states.

The problem is that the OmKeyLocationInfo instances we get back from Ozone 
Manager in the BasicOzoneClientAdapterImpl code contain an offset that always 
evaluates to zero. My understanding - based on what we discussed with Aravindan 
- is that the offset in Ozone Manager is semantically different and is valid to 
be zero, but in this case we cannot use it, as the BlockLocation needs to carry 
the offset within the file.
If the initial assumption holds, that a 0 offset for all blocks of a particular 
file is what is needed and considered correct inside Ozone Manager's metadata, 
then we need to solve this on the client side. For this approach I am 
submitting a PR that has been tested in my environment and works.
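A hedged sketch of the client-side approach described above; variable and 
helper names (locationInfos, hostsOf) are illustrative, not the actual 
BasicOzoneClientAdapterImpl code:

{code:java}
// Derive each block's offset inside the file by accumulating the lengths of the
// preceding blocks, instead of relying on the OM-provided offset field.
long offsetInFile = 0L;
List<BlockLocation> blockLocations = new ArrayList<>();
for (OmKeyLocationInfo block : locationInfos) {
  String[] hosts = hostsOf(block);   // hypothetical helper resolving datanode hosts
  blockLocations.add(new BlockLocation(hosts, hosts, offsetInFile, block.getLength()));
  offsetInFile += block.getLength();
}
{code}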

> Certain Hive queries started to fail on generating splits
> -
>
> Key: HDDS-2914
> URL: https://issues.apache.org/jira/browse/HDDS-2914
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>Reporter: Istvan Fajth
>Assignee: Aravindan Vijayan
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.5.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> After updating a cluster where I am running TPCDS queries, some queries 
> started to fail.
> The update happened from an early dec state to the jan 10 state of master.
> Most likely the addition of HDDS-2188 is related to the problem, but it is 
> still under investigation.
> The exception I see in the queries:
> {code}
> [ERROR] [Dispatcher thread {Central}] |impl.VertexImpl|: Vertex Input: 
> inventory initializer failed, vertex=vertex_ [Map 1]
> org.apache.tez.dag.app.dag.impl.AMUserCodeException: 
> java.lang.RuntimeException: ORC split generation failed with exception: 
> java.io.IOException: File 
> o3fs://hive.warehouse.fqdn:9862/warehouse/tablespace/managed/hive/100/inventory/delta_001_001_/bucket_0
>  should have had overlap on block starting at 0
>   at 
> org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallback.onFailure(RootInputInitializerManager.java:328)
>   at com.google.common.util.concurrent.Futures$6.run(Futures.java:1764)
>   at 
> com.google.common.util.concurrent.MoreExecutors$DirectExecutor.execute(MoreExecutors.java:456)
>   at 
> com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:817)
>   at 
> com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:753)
>   at 
> com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:634)
>   at 
> com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:110)
>   at 
> com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:41)
>   at 
> com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:77)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.RuntimeException: ORC split generation failed with 
> exception: java.io.IOException: File 
> o3fs://hive.warehouse.fqdn:9862/warehouse/tablespace/managed/hive/100/inventory/delta_001_001_/bucket_0
>  should have had overlap on block starting at 0
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1915)
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:2002)
>   at 
> org.apache.hadoop.hive.ql.io.HiveInputFormat.addSplitsForGroup(HiveInputFormat.java:532)
>   at 
> org.apache.hadoop.hive.ql.io.HiveInputFormat.getSplits(HiveInputFormat.java:789)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:243)
>   at 
> org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:278)
>   at 
> org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:269)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at 

[jira] [Created] (HDDS-2998) Improve test coverage of audit logging

2020-02-11 Thread Istvan Fajth (Jira)
Istvan Fajth created HDDS-2998:
--

 Summary: Improve test coverage of audit logging
 Key: HDDS-2998
 URL: https://issues.apache.org/jira/browse/HDDS-2998
 Project: Hadoop Distributed Data Store
  Issue Type: Improvement
Reporter: Istvan Fajth


Review the audit logging tests, and add assertions about the different audit 
log contents we expect to see in the audit log.
A good place to start is TestOMKeyRequest, where we create an audit logger 
mock; via that, the assertions can most likely be done for all the requests.

This is a follow-up on HDDS-2946.
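A hedged sketch of the kind of assertion meant here; the mock wiring and the 
AuditLogger/AuditMessage method names are assumptions based on the Ozone audit 
package, and the real TestOMKeyRequest setup may differ:

{code:java}
// Mock the audit logger, run the request under test, then capture and inspect
// the audit message that was emitted.
AuditLogger auditLogger = Mockito.mock(AuditLogger.class);
// ... execute the OM request handler under test with this mocked logger ...
ArgumentCaptor<AuditMessage> captor = ArgumentCaptor.forClass(AuditMessage.class);
Mockito.verify(auditLogger, Mockito.atLeastOnce()).logWriteSuccess(captor.capture());
Assert.assertTrue(captor.getValue().getFormattedMessage().contains("op=CREATE_KEY"));
{code}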



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDDS-2998) Improve test coverage of audit logging

2020-02-11 Thread Istvan Fajth (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Istvan Fajth reassigned HDDS-2998:
--

Assignee: Istvan Fajth

> Improve test coverage of audit logging
> --
>
> Key: HDDS-2998
> URL: https://issues.apache.org/jira/browse/HDDS-2998
> Project: Hadoop Distributed Data Store
>  Issue Type: Improvement
>Reporter: Istvan Fajth
>Assignee: Istvan Fajth
>Priority: Major
>  Labels: newbie++
>
> Review audit logging tests, and add assertions about the different audit log 
> contents we expect to have in the audit log.
> A good place to start with is TestOMKeyRequest where we create an audit 
> logger mock, via that one most likely the assertions can be done for all the 
> requests.
> This is a follow up on HDDS-2946.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDDS-2946) Rename audit log should contain both srcKey and dstKey not just key

2020-01-29 Thread Istvan Fajth (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Istvan Fajth reassigned HDDS-2946:
--

Assignee: Istvan Fajth

> Rename audit log should contain both srcKey and dstKey not just key
> ---
>
> Key: HDDS-2946
> URL: https://issues.apache.org/jira/browse/HDDS-2946
> Project: Hadoop Distributed Data Store
>  Issue Type: Improvement
>Reporter: Istvan Fajth
>Assignee: Istvan Fajth
>Priority: Major
>  Labels: newbie
>
> Currently a rename key operation logs just the key to be renamed, in the 
> audit log there should be the source and destination present as well for a 
> rename operation if we want to have traceability over a file properly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Created] (HDDS-2946) Rename audit log should contain both srcKey and dstKey not just key

2020-01-27 Thread Istvan Fajth (Jira)
Istvan Fajth created HDDS-2946:
--

 Summary: Rename audit log should contain both srcKey and dstKey 
not just key
 Key: HDDS-2946
 URL: https://issues.apache.org/jira/browse/HDDS-2946
 Project: Hadoop Distributed Data Store
  Issue Type: Improvement
Reporter: Istvan Fajth


Currently a rename key operation logs just the key to be renamed; for proper 
traceability of a file, the audit log should contain both the source and the 
destination of a rename operation.
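For illustration, a hedged sketch of the parameter map such a rename audit 
entry could carry; the variable names are hypothetical, not the actual OM code:

{code:java}
// Audit parameters for a rename: record where the key came from and where it
// went, alongside the usual volume/bucket context.
Map<String, String> auditParams = new LinkedHashMap<>();
auditParams.put("volume", volumeName);
auditParams.put("bucket", bucketName);
auditParams.put("srcKey", fromKeyName);
auditParams.put("dstKey", toKeyName);
{code}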



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Commented] (HDDS-2936) Hive queries fail at readFully

2020-01-26 Thread Istvan Fajth (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17023762#comment-17023762
 ] 

Istvan Fajth commented on HDDS-2936:


According to what we have found so far, it seems that the problem stems from 
the write path, and it is not the read that is causing the failure.
We were able to read back the same data that was written to the DataNodes via 
standard CLI calls, but with ORC tools both the metadata verification and 
reading the data from the file written to the DataNodes fail with the same 
readFully exception, or another ORC-layer exception, that we saw in the Hive 
queries as well.

Examining the question further, it seems to be an ORC-related internal detail 
that determines which exception we get at the end. We compared the sizes of the 
files written by Hive to HDFS and to Ozone from the same original dataset, and 
we see differences in the size of all files. The Hive devs I have asked so far 
could not confirm whether the ORC partitions have to be the same size in this 
case, so this might not mean much; however, the readFully exception above is 
thrown if we read from a partition that differs from HDFS by 128KiB or more in 
size, and we see a buffer size problem reported from the ORC layer if the 
difference is over 8KiB but under 128KiB.

The next steps concentrate on finding a way to prove whether a file is written 
with the corruption at data generation time, but the files that exhibit the 
problem with an exception are only a few percent of all the written files at 
most, as far as we see so far.

> Hive queries fail at readFully
> --
>
> Key: HDDS-2936
> URL: https://issues.apache.org/jira/browse/HDDS-2936
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>Affects Versions: 0.5.0
>Reporter: Istvan Fajth
>Assignee: Istvan Fajth
>Priority: Critical
>
> When running Hive queries on a 1TB dataset for TPC-DS tests, we started to 
> see an exception coming out from FSInputStream.readFully.
> This does not happen with a smaller 100GB dataset, so possibly multi block 
> long files are the cause of the trouble, and the issue was not seen with a 
> build from early december, so we most likely to blame a recent change since 
> then. The build I am running with is from the hash 
> 929f2f85d0379aab5aabeded8a4d3a505606 of master branch but with HDDS-2188 
> reverted from the code.
> The exception I see:
> {code}
> Error while running task ( failure ) : 
> attempt_1579615091731_0060_9_05_29_3:java.lang.RuntimeException: 
> java.lang.RuntimeException: java.io.IOException: java.io.EOFException: End of 
> file reached before reading fully.
> at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:296)
> at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:250)
> at 
> org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374)
> at 
> org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73)
> at 
> org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:61)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876)
> at 
> org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:61)
> at 
> org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:37)
> at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
> at 
> com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:108)
> at 
> com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:41)
> at 
> com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:77)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.RuntimeException: java.io.IOException: 
> java.io.EOFException: End of file reached before reading fully.
> at 
> org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader(TezGroupedSplitsInputFormat.java:206)
> at 
> org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.(TezGroupedSplitsInputFormat.java:145)
> at 
> 

[jira] [Assigned] (HDDS-2936) Hive queries fail at readFully

2020-01-23 Thread Istvan Fajth (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Istvan Fajth reassigned HDDS-2936:
--

Assignee: Istvan Fajth

> Hive queries fail at readFully
> --
>
> Key: HDDS-2936
> URL: https://issues.apache.org/jira/browse/HDDS-2936
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>Affects Versions: 0.5.0
>Reporter: Istvan Fajth
>Assignee: Istvan Fajth
>Priority: Critical
>
> When running Hive queries on a 1TB dataset for TPC-DS tests, we started to 
> see an exception coming out from FSInputStream.readFully.
> This does not happen with a smaller 100GB dataset, so possibly multi block 
> long files are the cause of the trouble, and the issue was not seen with a 
> build from early december, so we most likely to blame a recent change since 
> then. The build I am running with is from the hash 
> 929f2f85d0379aab5aabeded8a4d3a505606 of master branch but with HDDS-2188 
> reverted from the code.
> The exception I see:
> {code}
> Error while running task ( failure ) : 
> attempt_1579615091731_0060_9_05_29_3:java.lang.RuntimeException: 
> java.lang.RuntimeException: java.io.IOException: java.io.EOFException: End of 
> file reached before reading fully.
> at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:296)
> at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:250)
> at 
> org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374)
> at 
> org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73)
> at 
> org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:61)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876)
> at 
> org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:61)
> at 
> org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:37)
> at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
> at 
> com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:108)
> at 
> com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:41)
> at 
> com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:77)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.RuntimeException: java.io.IOException: 
> java.io.EOFException: End of file reached before reading fully.
> at 
> org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader(TezGroupedSplitsInputFormat.java:206)
> at 
> org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.(TezGroupedSplitsInputFormat.java:145)
> at 
> org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat.getRecordReader(TezGroupedSplitsInputFormat.java:111)
> at 
> org.apache.tez.mapreduce.lib.MRReaderMapred.setupOldRecordReader(MRReaderMapred.java:157)
> at 
> org.apache.tez.mapreduce.lib.MRReaderMapred.setSplit(MRReaderMapred.java:83)
> at 
> org.apache.tez.mapreduce.input.MRInput.initFromEventInternal(MRInput.java:703)
> at 
> org.apache.tez.mapreduce.input.MRInput.initFromEvent(MRInput.java:662)
> at 
> org.apache.tez.mapreduce.input.MRInputLegacy.checkAndAwaitRecordReaderInitialization(MRInputLegacy.java:150)
> at 
> org.apache.tez.mapreduce.input.MRInputLegacy.init(MRInputLegacy.java:114)
> at 
> org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.getMRInput(MapRecordProcessor.java:532)
> at 
> org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.init(MapRecordProcessor.java:178)
> at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:266)
> ... 16 more
> Caused by: java.io.IOException: java.io.EOFException: End of file reached 
> before reading fully.
> at 
> org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderCreationException(HiveIOExceptionHandlerChain.java:97)
> at 
> org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderCreationException(HiveIOExceptionHandlerUtil.java:57)
> at 
> 

[jira] [Created] (HDDS-2936) Hive queries fail at readFully

2020-01-23 Thread Istvan Fajth (Jira)
Istvan Fajth created HDDS-2936:
--

 Summary: Hive queries fail at readFully
 Key: HDDS-2936
 URL: https://issues.apache.org/jira/browse/HDDS-2936
 Project: Hadoop Distributed Data Store
  Issue Type: Bug
Affects Versions: 0.5.0
Reporter: Istvan Fajth


When running Hive queries on a 1TB dataset for TPC-DS tests, we started to see 
an exception coming out of FSInputStream.readFully.
This does not happen with a smaller 100GB dataset, so possibly files spanning 
multiple blocks are the cause of the trouble. The issue was not seen with a 
build from early December, so a recent change since then is most likely to 
blame. The build I am running with is from the hash 
929f2f85d0379aab5aabeded8a4d3a505606 of the master branch, but with HDDS-2188 
reverted from the code.

The exception I see:
{code}
Error while running task ( failure ) : 
attempt_1579615091731_0060_9_05_29_3:java.lang.RuntimeException: 
java.lang.RuntimeException: java.io.IOException: java.io.EOFException: End of 
file reached before reading fully.
at 
org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:296)
at 
org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:250)
at 
org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374)
at 
org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73)
at 
org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:61)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876)
at 
org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:61)
at 
org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:37)
at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
at 
com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:108)
at 
com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:41)
at 
com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:77)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: java.io.IOException: 
java.io.EOFException: End of file reached before reading fully.
at 
org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader(TezGroupedSplitsInputFormat.java:206)
at 
org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.(TezGroupedSplitsInputFormat.java:145)
at 
org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat.getRecordReader(TezGroupedSplitsInputFormat.java:111)
at 
org.apache.tez.mapreduce.lib.MRReaderMapred.setupOldRecordReader(MRReaderMapred.java:157)
at 
org.apache.tez.mapreduce.lib.MRReaderMapred.setSplit(MRReaderMapred.java:83)
at 
org.apache.tez.mapreduce.input.MRInput.initFromEventInternal(MRInput.java:703)
at 
org.apache.tez.mapreduce.input.MRInput.initFromEvent(MRInput.java:662)
at 
org.apache.tez.mapreduce.input.MRInputLegacy.checkAndAwaitRecordReaderInitialization(MRInputLegacy.java:150)
at 
org.apache.tez.mapreduce.input.MRInputLegacy.init(MRInputLegacy.java:114)
at 
org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.getMRInput(MapRecordProcessor.java:532)
at 
org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.init(MapRecordProcessor.java:178)
at 
org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:266)
... 16 more
Caused by: java.io.IOException: java.io.EOFException: End of file reached 
before reading fully.
at 
org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderCreationException(HiveIOExceptionHandlerChain.java:97)
at 
org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderCreationException(HiveIOExceptionHandlerUtil.java:57)
at 
org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:422)
at 
org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader(TezGroupedSplitsInputFormat.java:203)
... 27 more
Caused by: java.io.EOFException: End of file reached before reading fully.
at 

[jira] [Commented] (HDDS-2914) Certain Hive queries started to fail on generating splits

2020-01-20 Thread Istvan Fajth (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17019562#comment-17019562
 ] 

Istvan Fajth commented on HDDS-2914:


I have updated the cluster to a fresh build from master, but with the 
implementation committed for HDDS-2188 reverted. This way the same queries that 
were failing seem to be working again. It is still under testing at the moment, 
but it appears that HDDS-2188 introduces an issue that affects particular 
queries only. If I find anything else I will update it here.

> Certain Hive queries started to fail on generating splits
> -
>
> Key: HDDS-2914
> URL: https://issues.apache.org/jira/browse/HDDS-2914
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>Reporter: Istvan Fajth
>Assignee: Aravindan Vijayan
>Priority: Critical
>
> After updating a cluster where I am running TPCDS queries, some queries 
> started to fail.
> The update happened from an early dec state to the jan 10 state of master.
> Most likely the addition of HDDS-2188 is related to the problem, but it is 
> still under investigation.
> The exception I see in the queries:
> {code}
> [ERROR] [Dispatcher thread {Central}] |impl.VertexImpl|: Vertex Input: 
> inventory initializer failed, vertex=vertex_ [Map 1]
> org.apache.tez.dag.app.dag.impl.AMUserCodeException: 
> java.lang.RuntimeException: ORC split generation failed with exception: 
> java.io.IOException: File 
> o3fs://hive.warehouse.fqdn:9862/warehouse/tablespace/managed/hive/100/inventory/delta_001_001_/bucket_0
>  should have had overlap on block starting at 0
>   at 
> org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallback.onFailure(RootInputInitializerManager.java:328)
>   at com.google.common.util.concurrent.Futures$6.run(Futures.java:1764)
>   at 
> com.google.common.util.concurrent.MoreExecutors$DirectExecutor.execute(MoreExecutors.java:456)
>   at 
> com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:817)
>   at 
> com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:753)
>   at 
> com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:634)
>   at 
> com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:110)
>   at 
> com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:41)
>   at 
> com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:77)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.RuntimeException: ORC split generation failed with 
> exception: java.io.IOException: File 
> o3fs://hive.warehouse.fqdn:9862/warehouse/tablespace/managed/hive/100/inventory/delta_001_001_/bucket_0
>  should have had overlap on block starting at 0
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1915)
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:2002)
>   at 
> org.apache.hadoop.hive.ql.io.HiveInputFormat.addSplitsForGroup(HiveInputFormat.java:532)
>   at 
> org.apache.hadoop.hive.ql.io.HiveInputFormat.getSplits(HiveInputFormat.java:789)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:243)
>   at 
> org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:278)
>   at 
> org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:269)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876)
>   at 
> org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:269)
>   at 
> org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:253)
>   at 
> com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:108)
>   ... 5 more
> Caused by: java.util.concurrent.ExecutionException: java.io.IOException: File 
> 

[jira] [Assigned] (HDDS-2914) Certain Hive queries started to fail on generating splits

2020-01-20 Thread Istvan Fajth (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Istvan Fajth reassigned HDDS-2914:
--

Assignee: Aravindan Vijayan  (was: Istvan Fajth)

> Certain Hive queries started to fail on generating splits
> -
>
> Key: HDDS-2914
> URL: https://issues.apache.org/jira/browse/HDDS-2914
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>Reporter: Istvan Fajth
>Assignee: Aravindan Vijayan
>Priority: Critical
>
> After updating a cluster where I am running TPCDS queries, some queries 
> started to fail.
> The update happened from an early dec state to the jan 10 state of master.
> Most likely the addition of HDDS-2188 is related to the problem, but it is 
> still under investigation.
> The exception I see in the queries:
> {code}
> [ERROR] [Dispatcher thread {Central}] |impl.VertexImpl|: Vertex Input: 
> inventory initializer failed, vertex=vertex_ [Map 1]
> org.apache.tez.dag.app.dag.impl.AMUserCodeException: 
> java.lang.RuntimeException: ORC split generation failed with exception: 
> java.io.IOException: File 
> o3fs://hive.warehouse.fqdn:9862/warehouse/tablespace/managed/hive/100/inventory/delta_001_001_/bucket_0
>  should have had overlap on block starting at 0
>   at 
> org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallback.onFailure(RootInputInitializerManager.java:328)
>   at com.google.common.util.concurrent.Futures$6.run(Futures.java:1764)
>   at 
> com.google.common.util.concurrent.MoreExecutors$DirectExecutor.execute(MoreExecutors.java:456)
>   at 
> com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:817)
>   at 
> com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:753)
>   at 
> com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:634)
>   at 
> com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:110)
>   at 
> com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:41)
>   at 
> com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:77)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.RuntimeException: ORC split generation failed with 
> exception: java.io.IOException: File 
> o3fs://hive.warehouse.fqdn:9862/warehouse/tablespace/managed/hive/100/inventory/delta_001_001_/bucket_0
>  should have had overlap on block starting at 0
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1915)
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:2002)
>   at 
> org.apache.hadoop.hive.ql.io.HiveInputFormat.addSplitsForGroup(HiveInputFormat.java:532)
>   at 
> org.apache.hadoop.hive.ql.io.HiveInputFormat.getSplits(HiveInputFormat.java:789)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:243)
>   at 
> org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:278)
>   at 
> org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:269)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876)
>   at 
> org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:269)
>   at 
> org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:253)
>   at 
> com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:108)
>   ... 5 more
> Caused by: java.util.concurrent.ExecutionException: java.io.IOException: File 
> o3fs://hive.warehouse.fqdn:9862/warehouse/tablespace/managed/hive/100/inventory/delta_001_001_/bucket_0
>  should have had overlap on block starting at 0
>   at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>   at java.util.concurrent.FutureTask.get(FutureTask.java:192)
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1909)
>   ... 17 more
> Caused by: 

[jira] [Assigned] (HDDS-2914) Certain Hive queries started to fail on generating splits

2020-01-20 Thread Istvan Fajth (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Istvan Fajth reassigned HDDS-2914:
--

Assignee: Istvan Fajth  (was: Aravindan Vijayan)

> Certain Hive queries started to fail on generating splits
> -
>
> Key: HDDS-2914
> URL: https://issues.apache.org/jira/browse/HDDS-2914
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>Reporter: Istvan Fajth
>Assignee: Istvan Fajth
>Priority: Critical
>
> After updating a cluster where I am running TPCDS queries, some queries 
> started to fail.
> The update happened from an early dec state to the jan 10 state of master.
> Most likely the addition of HDDS-2188 is related to the problem, but it is 
> still under investigation.
> The exception I see in the queries:
> {code}
> [ERROR] [Dispatcher thread {Central}] |impl.VertexImpl|: Vertex Input: 
> inventory initializer failed, vertex=vertex_ [Map 1]
> org.apache.tez.dag.app.dag.impl.AMUserCodeException: 
> java.lang.RuntimeException: ORC split generation failed with exception: 
> java.io.IOException: File 
> o3fs://hive.warehouse.fqdn:9862/warehouse/tablespace/managed/hive/100/inventory/delta_001_001_/bucket_0
>  should have had overlap on block starting at 0
>   at 
> org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallback.onFailure(RootInputInitializerManager.java:328)
>   at com.google.common.util.concurrent.Futures$6.run(Futures.java:1764)
>   at 
> com.google.common.util.concurrent.MoreExecutors$DirectExecutor.execute(MoreExecutors.java:456)
>   at 
> com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:817)
>   at 
> com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:753)
>   at 
> com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:634)
>   at 
> com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:110)
>   at 
> com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:41)
>   at 
> com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:77)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.RuntimeException: ORC split generation failed with 
> exception: java.io.IOException: File 
> o3fs://hive.warehouse.fqdn:9862/warehouse/tablespace/managed/hive/100/inventory/delta_001_001_/bucket_0
>  should have had overlap on block starting at 0
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1915)
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:2002)
>   at 
> org.apache.hadoop.hive.ql.io.HiveInputFormat.addSplitsForGroup(HiveInputFormat.java:532)
>   at 
> org.apache.hadoop.hive.ql.io.HiveInputFormat.getSplits(HiveInputFormat.java:789)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:243)
>   at 
> org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:278)
>   at 
> org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:269)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876)
>   at 
> org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:269)
>   at 
> org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:253)
>   at 
> com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:108)
>   ... 5 more
> Caused by: java.util.concurrent.ExecutionException: java.io.IOException: File 
> o3fs://hive.warehouse.fqdn:9862/warehouse/tablespace/managed/hive/100/inventory/delta_001_001_/bucket_0
>  should have had overlap on block starting at 0
>   at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>   at java.util.concurrent.FutureTask.get(FutureTask.java:192)
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1909)
>   ... 17 more
> Caused by: 

[jira] [Created] (HDDS-2914) Certain Hive queries started to fail on generating splits

2020-01-20 Thread Istvan Fajth (Jira)
Istvan Fajth created HDDS-2914:
--

 Summary: Certain Hive queries started to fail on generating splits
 Key: HDDS-2914
 URL: https://issues.apache.org/jira/browse/HDDS-2914
 Project: Hadoop Distributed Data Store
  Issue Type: Bug
Reporter: Istvan Fajth


After updating a cluster where I am running TPC-DS queries, some queries 
started to fail.
The update moved master from an early December state to the January 10 state.

Most likely the addition of HDDS-2188 is related to the problem, but this is 
still under investigation.
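
The "should have had overlap on block starting at 0" message in the trace below 
comes from Hive's ORC split generation, which expects the block locations 
reported by the FileSystem to cover the split offsets. A minimal sketch of that 
kind of check, with hypothetical names (not the actual Hive code):

{code:java}
// Hypothetical names, illustration only: find the reported block that covers a
// given file offset; if getFileBlockLocations() returned ranges that do not
// cover the offset (here even offset 0), fail like the trace below does.
import java.io.IOException;
import org.apache.hadoop.fs.BlockLocation;

final class BlockOverlapCheck {
  static BlockLocation findCoveringBlock(BlockLocation[] locations, long offset,
      String path) throws IOException {
    for (BlockLocation loc : locations) {
      long start = loc.getOffset();
      long end = start + loc.getLength();
      if (offset >= start && offset < end) {
        return loc;
      }
    }
    throw new IOException("File " + path
        + " should have had overlap on block starting at " + offset);
  }
}
{code}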

The exception I see in the queries:
{code}
[ERROR] [Dispatcher thread {Central}] |impl.VertexImpl|: Vertex Input: 
inventory initializer failed, vertex=vertex_ [Map 1]
org.apache.tez.dag.app.dag.impl.AMUserCodeException: 
java.lang.RuntimeException: ORC split generation failed with exception: 
java.io.IOException: File 
o3fs://hive.warehouse.fqdn:9862/warehouse/tablespace/managed/hive/100/inventory/delta_001_001_/bucket_0
 should have had overlap on block starting at 0
at 
org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallback.onFailure(RootInputInitializerManager.java:328)
at com.google.common.util.concurrent.Futures$6.run(Futures.java:1764)
at 
com.google.common.util.concurrent.MoreExecutors$DirectExecutor.execute(MoreExecutors.java:456)
at 
com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:817)
at 
com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:753)
at 
com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:634)
at 
com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:110)
at 
com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:41)
at 
com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:77)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: ORC split generation failed with 
exception: java.io.IOException: File 
o3fs://hive.warehouse.fqdn:9862/warehouse/tablespace/managed/hive/100/inventory/delta_001_001_/bucket_0
 should have had overlap on block starting at 0
at 
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1915)
at 
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:2002)
at 
org.apache.hadoop.hive.ql.io.HiveInputFormat.addSplitsForGroup(HiveInputFormat.java:532)
at 
org.apache.hadoop.hive.ql.io.HiveInputFormat.getSplits(HiveInputFormat.java:789)
at 
org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:243)
at 
org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:278)
at 
org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:269)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876)
at 
org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:269)
at 
org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:253)
at 
com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:108)
... 5 more
Caused by: java.util.concurrent.ExecutionException: java.io.IOException: File 
o3fs://hive.warehouse.fqdn:9862/warehouse/tablespace/managed/hive/100/inventory/delta_001_001_/bucket_0
 should have had overlap on block starting at 0
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:192)
at 
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1909)
... 17 more
Caused by: java.io.IOException: File 
o3fs://hive.warehouse.fqdn:9862/warehouse/tablespace/managed/hive/100/inventory/delta_001_001_/bucket_0
 should have had overlap on block starting at 0
at 
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.createSplit(OrcInputFormat.java:1506)
at 

[jira] [Assigned] (HDDS-2777) Add bytes read statistics to Ozone FileSystem implementation

2019-12-19 Thread Istvan Fajth (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Istvan Fajth reassigned HDDS-2777:
--

Assignee: Istvan Fajth

> Add bytes read statistics to Ozone FileSystem implementation
> 
>
> Key: HDDS-2777
> URL: https://issues.apache.org/jira/browse/HDDS-2777
> Project: Hadoop Distributed Data Store
>  Issue Type: Improvement
>  Components: Ozone Client
>Reporter: Istvan Fajth
>Assignee: Istvan Fajth
>Priority: Major
>
> In Hive, or in MR jobs the FileSystem counters are reported based on the 
> statistics inside the FileSystem implementation, at the moment we do not have 
> any read bytes statistics reported, while we have the number of bytes written 
> and the read and write operation count.
> This JIRA is to add the number of bytes read statistics and record it in the 
> FileSystem implementation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Created] (HDDS-2777) Add bytes read statistics to Ozone FileSystem implementation

2019-12-19 Thread Istvan Fajth (Jira)
Istvan Fajth created HDDS-2777:
--

 Summary: Add bytes read statistics to Ozone FileSystem 
implementation
 Key: HDDS-2777
 URL: https://issues.apache.org/jira/browse/HDDS-2777
 Project: Hadoop Distributed Data Store
  Issue Type: Improvement
  Components: Ozone Client
Reporter: Istvan Fajth


In Hive, and in MR jobs in general, the FileSystem counters are reported based 
on the statistics kept inside the FileSystem implementation. At the moment we 
do not report any bytes-read statistic, while we do have the number of bytes 
written and the read and write operation counts.

This JIRA is to add a bytes-read statistic and record it in the FileSystem 
implementation.
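
A minimal sketch of the kind of change this asks for, with a purely 
illustrative wrapper class (not the actual Ozone patch): every successful read 
bumps the FileSystem.Statistics bytes-read counter, which is what MR and Hive 
pick up for their FileSystem counters.

{code:java}
// Illustrative wrapper, assumed naming: count bytes read into the
// FileSystem.Statistics instance of the owning FileSystem.
import java.io.IOException;
import org.apache.hadoop.fs.FSInputStream;
import org.apache.hadoop.fs.FileSystem.Statistics;

class StatisticsTrackingInputStream extends FSInputStream {
  private final FSInputStream wrapped;
  private final Statistics statistics;

  StatisticsTrackingInputStream(FSInputStream wrapped, Statistics statistics) {
    this.wrapped = wrapped;
    this.statistics = statistics;
  }

  @Override
  public int read() throws IOException {
    int b = wrapped.read();
    if (b != -1 && statistics != null) {
      statistics.incrementBytesRead(1);
    }
    return b;
  }

  @Override
  public int read(byte[] buf, int off, int len) throws IOException {
    int n = wrapped.read(buf, off, len);
    if (n > 0 && statistics != null) {
      statistics.incrementBytesRead(n);
    }
    return n;
  }

  @Override
  public void seek(long pos) throws IOException {
    wrapped.seek(pos);
  }

  @Override
  public long getPos() throws IOException {
    return wrapped.getPos();
  }

  @Override
  public boolean seekToNewSource(long targetPos) throws IOException {
    return wrapped.seekToNewSource(targetPos);
  }
}
{code}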



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Created] (HDDS-2697) SCM log is flooded with block deletion txId mismatch messages

2019-12-09 Thread Istvan Fajth (Jira)
Istvan Fajth created HDDS-2697:
--

 Summary: SCM log is flooded with block deletion txId mismatch 
messages
 Key: HDDS-2697
 URL: https://issues.apache.org/jira/browse/HDDS-2697
 Project: Hadoop Distributed Data Store
  Issue Type: Improvement
  Components: SCM
Reporter: Istvan Fajth


When you run Hive queries on the cluster (and I think this is true for other 
MapReduce workloads as well), interim and temporary data is created and deleted 
quite often.
This leads to a flood of similar messages in the SCM log:
{code}
2019-12-07 05:00:41,112 INFO 
org.apache.hadoop.hdds.scm.block.SCMBlockDeletingService: Block deletion txnID 
mismatch in datanode e590d08a-4a4e-428a-82e8-80f7221f639e for containerID 307. 
Datanode delete txnID: 25145, SCM txnID: 25148
{code}

Either we need to decrease the log level of these messages, or we need to get 
rid of their cause. In a single log file I see over 21k lines containing this 
message out of ~37k lines of log.
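
A minimal sketch of the first option, with hypothetical variable names (not the 
actual SCMBlockDeletingService code): demoting the per-container message to 
DEBUG keeps the detail available on demand without flooding the INFO log.

{code:java}
// Hypothetical names, illustration only: log the per-container txnID mismatch
// at DEBUG instead of INFO so it no longer dominates the SCM log.
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

final class BlockDeletionLogSketch {
  private static final Logger LOG =
      LoggerFactory.getLogger(BlockDeletionLogSketch.class);

  static void logTxnIdMismatch(String datanodeUuid, long containerId,
      long datanodeTxnId, long scmTxnId) {
    LOG.debug("Block deletion txnID mismatch in datanode {} for containerID {}."
        + " Datanode delete txnID: {}, SCM txnID: {}",
        datanodeUuid, containerId, datanodeTxnId, scmTxnId);
  }
}
{code}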



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Created] (HDDS-2695) SCM is not able to start under certain conditions

2019-12-09 Thread Istvan Fajth (Jira)
Istvan Fajth created HDDS-2695:
--

 Summary: SCM is not able to start under certain conditions
 Key: HDDS-2695
 URL: https://issues.apache.org/jira/browse/HDDS-2695
 Project: Hadoop Distributed Data Store
  Issue Type: Bug
  Components: SCM
Reporter: Istvan Fajth


Given
- a cluster where RATIS-677 happened, and DataNodes are already failing to 
start properly due to the issue
When
- I restart the cluster and start to see the exceptions described in RATIS-677
- I stop the 3 DNs that host the failing pipeline
- remove the Ratis metadata for the pipeline
- close the pipeline with scmcli
- restart the 3 DNs
Then
- SCM is unable to come out of safe mode; the log shows the following possible 
reason:
{code}
2019-12-09 01:13:38,437 INFO 
org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager: SCM in safe mode. 
Pipelines with at least one datanode reported count is 0, required at least one 
datanode reported per pipeline count is 4
{code}

If after this I restart the SCM, it fails without logging any exception, and 
the standard error contains the following message as the last one:
{code}
PipelineID= not found
{code}

Also, scmcli did not list the closed pipeline among the active pipelines when 
I checked after closing it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Created] (HDDS-2696) Document recovery from RATIS-677

2019-12-09 Thread Istvan Fajth (Jira)
Istvan Fajth created HDDS-2696:
--

 Summary: Document recovery from RATIS-677
 Key: HDDS-2696
 URL: https://issues.apache.org/jira/browse/HDDS-2696
 Project: Hadoop Distributed Data Store
  Issue Type: Improvement
  Components: Ozone Datanode
Reporter: Istvan Fajth


RATIS-677 is solved in a way that requires a setting to be changed and passed 
to the RaftServer implementation so that it ignores the corruption, and at the 
moment, due to HDDS-2647, we do not have a clear recovery path from a Ratis 
corruption in the pipeline data.

We should document how this can be recovered from. I have an idea which 
involves closing the pipeline in SCM and removing the Ratis metadata for the 
pipeline on the DataNodes, which effectively clears the corrupted pipeline out 
of the system.

There are two problems with finding a recovery path and documenting it:
- I am not sure whether we have strong enough guarantees that the writes 
happened properly when the Ratis metadata can become corrupt, so this needs to 
be investigated.
- At the moment I cannot validate this approach: if I do the steps (stop the 3 
DNs, move out the Ratis data for the pipeline, close the pipeline with scmcli, 
then restart the DNs), the pipeline is not closed properly and SCM fails as 
described in HDDS-2695.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Created] (HDDS-2647) Ozone DataNode does not set raft.server.log.corruption.policy to the RaftServer implementation it uses

2019-11-28 Thread Istvan Fajth (Jira)
Istvan Fajth created HDDS-2647:
--

 Summary: Ozone DataNode does not set 
raft.server.log.corruption.policy to the RaftServer implementation it uses
 Key: HDDS-2647
 URL: https://issues.apache.org/jira/browse/HDDS-2647
 Project: Hadoop Distributed Data Store
  Issue Type: Improvement
  Components: Ozone Datanode
Reporter: Istvan Fajth


The XceiverServerRatis class, which the DataNode uses to create the RaftServer 
implementation it runs, has a method called newRaftProperties() whose job is to 
set up the RaftProperties object for the RaftServer it starts.

This method is pretty hard to keep in sync with all the Ratis properties. While 
chasing an issue that pointed me to RATIS-677, which introduced a new 
configuration property, I was not able to set this new property via the 
DataNode's ozone-site.xml, as it was not forwarded to the Ratis server.

In the long run we would need a better implementation that does not need tuning 
and follow-up for every new Ratis property; at the moment, as a quick fix, we 
can just forward the property (see the sketch below). If the implementor goes 
with the easy way, please create a new JIRA for a better solution after 
finishing this one. Also, if I am wrong and Ratis properties can already be 
defined properly for the DN elsewhere, please let me know.
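
A minimal sketch of the quick fix, with assumed naming (the helper below is 
illustrative, not the actual change): forward the RATIS-677 key from 
ozone-site.xml into the RaftProperties assembled in newRaftProperties(), 
leaving the Ratis default in place when the key is not set on the Ozone side.

{code:java}
// Assumed helper, illustration only: copy the raw property value, if present,
// from the Ozone configuration into the RaftProperties given to the RaftServer.
import org.apache.hadoop.conf.Configuration;
import org.apache.ratis.conf.RaftProperties;

final class RaftLogCorruptionPolicyForwarder {
  private static final String KEY = "raft.server.log.corruption.policy";

  static void forward(Configuration ozoneConf, RaftProperties raftProperties) {
    String policy = ozoneConf.get(KEY);
    if (policy != null) {
      raftProperties.set(KEY, policy);
    }
  }
}
{code}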

As OM also uses Ratis in its HA configuration, this should be checked there as 
well; however, that part is not really important until RATIS-762 is fixed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org