[jira] [Commented] (HDFS-17316) Compatibility Benchmark over HCFS Implementations

2024-02-21 Thread Kai Zheng (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17819450#comment-17819450
 ] 

Kai Zheng commented on HDFS-17316:
--

>>I think it should be a hadoop one for more than hdfs
OK, let me try and do it.

>>one thing with the contract tests is we need the ability to declare when a 
>>store doesn't quite meet expectations...
These sound like good suggestions for improving the existing contract test 
framework; would they be better addressed in separate issues? Ideally, the 
contract tests and these compatibility tests should not overlap.

>>some operations raise different exceptions, permissions may be different.
Take this as an example: if the exceptions and permissions in question are not 
defined in the spec or the interface, the compatibility check tool should 
tolerate such differences, since they are not compatibility issues.

> Compatibility Benchmark over HCFS Implementations
> -
>
> Key: HDFS-17316
> URL: https://issues.apache.org/jira/browse/HDFS-17316
> Project: Hadoop HDFS
> Issue Type: New Feature
> Reporter: Han Liu
> Priority: Major
> Labels: pull-request-available
> Attachments: HDFS Compatibility Benchmark Design.pdf
>
>
> {*}Background:{*} Hadoop-Compatible File System (HCFS) is a core concept in 
> the big data storage ecosystem, providing unified interfaces and generally 
> clear semantics, and has become the de facto standard for industry storage 
> systems to follow and conform to. There have been a series of HCFS 
> implementations in Hadoop, such as S3AFileSystem for Amazon's S3 Object 
> Store, WASB for Microsoft's Azure Blob Storage and the OSS connector for 
> Alibaba Cloud Object Storage, and more from storage service providers on 
> their own.
> {*}Problems:{*} However, as indicated by introduction.md, there is no formal 
> suite for assessing the compatibility of a file system across all such HCFS 
> implementations. Thus, whether the functionality is well implemented and 
> meets the core compatibility expectations mainly relies on the service 
> provider's own report. Meanwhile, Hadoop is also evolving, and new features 
> are continuously being added to the HCFS interfaces for existing 
> implementations to follow and update, in which case Hadoop also needs a tool 
> to quickly assess whether these features are supported by a specific HCFS 
> implementation. Besides, the well-known hadoop command line tool, or hdfs 
> shell, is used to interact directly with an HCFS storage system, where most 
> commands correspond to specific HCFS interfaces and work well. Still, there 
> are cases that are complicated and may not work, such as the expunge command. 
> To check such commands for an HCFS, we also need an approach to identify them.
> {*}Proposal:{*} Accordingly, we propose to define a formal HCFS compatibility 
> benchmark and provide a corresponding tool to do the compatibility assessment 
> for an HCFS storage system. The benchmark and tool should consider both HCFS 
> interfaces and hdfs shell commands. Different scenarios require different 
> kinds of compatibility. To account for this, we could define different suites 
> in the benchmark.
> *Benefits:* We intend the benchmark and tool to be useful for both storage 
> providers and storage users. For end users, it can be used to evaluate the 
> compatibility level and determine if the storage system in question is 
> suitable for the required scenarios. For storage providers, it helps to 
> quickly generate an objective and reliable report about the core functions of 
> the storage service. For instance, if the HCFS scores 100% on a suite named 
> 'tpcds', this demonstrates that all functions needed by a tpcds program are 
> properly implemented. It is also a guide indicating how storage service 
> capabilities can map to HCFS interfaces, such as storage class on S3.
> Any thoughts? Comments and feedback are most welcome. Thanks in advance.
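
The idea of per-scenario suites with a percentage score can be sketched as follows. This is only an illustration under stated assumptions, not the benchmark's actual design: the `score` helper, the case names and the suite names are hypothetical, and the case results are stand-ins rather than measurements of any real store.

```python
from typing import Callable, Dict, List

# A hypothetical compatibility case: returns True if the file system under
# test behaves as expected. Real cases would exercise HCFS interfaces.
Case = Callable[[], bool]


def score(suite: List[str], cases: Dict[str, Case]) -> float:
    """Return the percentage of cases in the suite that pass."""
    if not suite:
        raise ValueError("suite must contain at least one case")
    passed = 0
    for name in suite:
        case = cases[name]  # an unknown case name raises KeyError
        try:
            if case():
                passed += 1
        except Exception:
            # An exception raised by a case counts as a failed case.
            pass
    return 100.0 * passed / len(suite)


def _unsupported() -> bool:
    raise NotImplementedError("operation not supported by this store")


# Stand-in results for an imaginary store (assumed, not measured).
CASES: Dict[str, Case] = {
    "mkdirs": lambda: True,
    "rename": lambda: True,
    "append": lambda: False,
    "setStoragePolicy": _unsupported,
}

# Hypothetical suites: each scenario lists only the cases it needs.
SUITES: Dict[str, List[str]] = {
    "basic": ["mkdirs", "rename"],
    "all": ["mkdirs", "rename", "append", "setStoragePolicy"],
}

if __name__ == "__main__":
    for suite_name, suite in SUITES.items():
        print(f"{suite_name}: {score(suite, CASES):.0f}%")
```

With these stand-in results the imaginary store fully satisfies the "basic" suite while scoring lower on "all", which is the kind of per-scenario answer the proposal describes.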



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17316) Compatibility Benchmark over HCFS Implementations

2024-02-21 Thread Kai Zheng (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17819447#comment-17819447
 ] 

Kai Zheng commented on HDFS-17316:
--

>>I think it should be a hadoop one for more than hdfs
OK, let me try and do it.

>>one thing with the contract tests is we need the ability to declare when a 
>>store doesn't quite meet expectations...
These sound like good suggestions for improving the existing contract test 
framework; would they be better addressed in separate issues? Ideally, the 
contract tests and these compatibility tests should not overlap.

>>some operations raise different exceptions, permissions may be different.




[jira] [Commented] (HDFS-17316) Compatibility Benchmark over HCFS Implementations

2024-02-21 Thread Kai Zheng (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17819444#comment-17819444
 ] 

Kai Zheng commented on HDFS-17316:
--

>>I think it should be a hadoop one for more than hdfs
OK, let me try and do it.

>>one thing with the contract tests is we need the ability to declare when a 
>>store doesn't quite meet expectations...




[jira] [Commented] (HDFS-17316) Compatibility Benchmark over HCFS Implementations

2024-02-21 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17819271#comment-17819271
 ] 

Steve Loughran commented on HDFS-17316:
---

* I think it should be a hadoop one for more than hdfs
* hdfs/webhdfs work well as unit tests for the functionality
* but can/should also target other stores, with s3a and abfs connectors key 
ones for me.

one thing with the contract tests is we need the ability to declare when a 
store doesn't quite meet expectations. s3a fs lets you create files under files 
if you try hard; some operations raise different exceptions, permissions may be 
different. so a design which allows for downgrading is critical
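
One way to picture "downgrading" is a per-store list of declared deviations, so that an unmet expectation is reported as a known deviation rather than a failure. The sketch below is a minimal illustration under that assumption; the `Outcome` values, the `DECLARED_DEVIATIONS` table and the expectation names are hypothetical and do not come from the contract tests or from the proposed tool.

```python
from enum import Enum
from typing import Dict, Set


class Outcome(Enum):
    PASS = "pass"
    KNOWN_DEVIATION = "known-deviation"  # unmet, but declared up front
    FAIL = "fail"


# Hypothetical per-store declarations of expectations the store is known
# not to meet. Store and expectation names are illustrative only.
DECLARED_DEVIATIONS: Dict[str, Set[str]] = {
    "s3a": {"no-file-under-file"},
}


def classify(store: str, expectation: str, met: bool) -> Outcome:
    """Classify one check, downgrading declared deviations."""
    if met:
        return Outcome.PASS
    if expectation in DECLARED_DEVIATIONS.get(store, set()):
        return Outcome.KNOWN_DEVIATION
    return Outcome.FAIL


if __name__ == "__main__":
    print(classify("s3a", "no-file-under-file", met=False))
    print(classify("hdfs", "no-file-under-file", met=False))
```

Keeping the declarations in data rather than in the checks themselves means a store can relax one expectation without the check being skipped for every other store.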




[jira] [Commented] (HDFS-17316) Compatibility Benchmark over HCFS Implementations

2024-02-21 Thread Kai Zheng (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17819149#comment-17819149
 ] 

Kai Zheng commented on HDFS-17316:
--

This work looks great, thanks [~han.liu]! Maybe this Jira should be converted to 
a HADOOP one. [~ste...@apache.org], what do you think? Long time no see. :)




[jira] [Commented] (HDFS-17316) Compatibility Benchmark over HCFS Implementations

2024-02-07 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17815466#comment-17815466
 ] 

ASF GitHub Bot commented on HDFS-17316:
---

hadoop-yetus commented on PR #6535:
URL: https://github.com/apache/hadoop/pull/6535#issuecomment-1933237018

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|--------:|:--------:|:-------:|
   | +0 :ok: |  reexec  |   0m 20s |  |  Docker mode activated.  |
   |||| _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  |
   | +0 :ok: |  markdownlint  |   0m  0s |  |  markdownlint was not available.  |
   | +0 :ok: |  xmllint  |   0m  0s |  |  xmllint was not available.  |
   | +0 :ok: |  shelldocs  |   0m  0s |  |  Shelldocs was not available.  |
   | +1 :green_heart: |  @author  |   0m  1s |  |  The patch does not contain any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to include 13 new or modified test files.  |
   |||| _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  31m 59s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   8m  7s |  |  trunk passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  compile  |   7m 23s |  |  trunk passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  checkstyle  |   2m  0s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |  12m  9s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   4m 33s |  |  trunk passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javadoc  |   4m 50s |  |  trunk passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |  16m 29s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  34m 51s |  |  branch has no errors when building and testing our client artifacts.  |
   |||| _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 21s |  |  Maven dependency ordering for patch  |
   | +1 :green_heart: |  mvninstall  |  17m  0s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   7m 47s |  |  the patch passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javac  |   7m 47s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   7m 16s |  |  the patch passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  javac  |   7m 16s |  |  the patch passed  |
   | -1 :x: |  blanks  |   0m  0s | [/blanks-eol.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6535/1/artifact/out/blanks-eol.txt) |  The patch has 1 line(s) that end in blanks. Use git apply --whitespace=fix <>. Refer https://git-scm.com/docs/git-apply  |
   | +1 :green_heart: |  checkstyle  |   1m 56s |  |  the patch passed  |
   | +1 :green_heart: |  mvnsite  |   6m 38s |  |  the patch passed  |
   | -1 :x: |  shellcheck  |   0m  0s | [/results-shellcheck.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6535/1/artifact/out/results-shellcheck.txt) |  The patch generated 3 new + 0 unchanged - 0 fixed = 3 total (was 0)  |
   | +1 :green_heart: |  javadoc  |   4m 22s |  |  the patch passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javadoc  |   4m 57s |  |  the patch passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | -1 :x: |  spotbugs  |  16m 31s | [/new-spotbugs-root.html](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6535/1/artifact/out/new-spotbugs-root.html) |  root generated 19 new + 0 unchanged - 0 fixed = 19 total (was 0)  |
   | +1 :green_heart: |  shadedclient  |  19m 16s |  |  patch has no errors when building and testing our client artifacts.  |
   |||| _ Other Tests _ |
   | -1 :x: |  unit  | 628m 32s | [/patch-unit-root.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6535/1/artifact/out/patch-unit-root.txt) |  root in the patch failed.  |
   | -1 :x: |  asflicense  |   0m 48s | [/results-asflicense.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6535/1/artifact/out/results-asflicense.txt) |  The patch generated 18 ASF License warnings.  |
   |  |   | 826m 36s |  |  |
   
   
   | Reason | Tests |
   |-------:|:------|
   | SpotBugs | module:root |
   |  |  Random object created and used only once in org.apache.hadoop.compat.AbstractHdfsCompatCase.getUniquePath(Path)  At AbstractHdfsCompatCase.java:[line 60] |
   |  |  org.apache.hadoop.compat.HdfsCompatEnvironment.getStoragePolicyNames() 

[jira] [Commented] (HDFS-17316) Compatibility Benchmark over HCFS Implementations

2024-02-07 Thread Han Liu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17815288#comment-17815288
 ] 

Han Liu commented on HDFS-17316:


{quote}that's fine; what I would like is be able to identify a profile to use 
and have all the settings picked up there. for the s3a test suites i can choose 
which store to run against, but the other options (storage class, fips, aws 
roles, ...) have to be configured too. Switching from one store to another 
requires me to comment out the ones not available. If we could be profile 
driven from the start, i'd do a run -Dprofile=google vs -Dprofile=aws-s3express 
and have the relevant profiles picked up with my configs for each of them
{quote}
Good comments from [~ste...@apache.org]. We could add a new '-profile' option 
that corresponds to a set of options, although how those options are passed to 
the benchmark process would need careful design. Alternatively, profiles could 
be implemented outside the benchmark, i.e. in a layer above it, I think.
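
The profile idea discussed above, where one name expands to a set of options, can be sketched roughly as below. This is only a sketch under assumptions: the profile names are taken from the quoted comment, but the option keys, their values and the `resolve_options` helper are hypothetical and are not options of the benchmark or of any connector.

```python
from typing import Dict, Optional

# Hypothetical profiles: each name expands to a set of options.
# The option keys and values are placeholders, not real settings.
PROFILES: Dict[str, Dict[str, str]] = {
    "aws-s3express": {
        "example.uri": "s3a://example-bucket/",
        "example.storage.class": "express",
    },
    "google": {
        "example.uri": "gs://example-bucket/",
    },
}


def resolve_options(profile: Optional[str],
                    overrides: Dict[str, str]) -> Dict[str, str]:
    """Expand a profile into options; explicit overrides win."""
    options: Dict[str, str] = {}
    if profile is not None:
        if profile not in PROFILES:
            raise KeyError(f"unknown profile: {profile}")
        options.update(PROFILES[profile])
    options.update(overrides)
    return options


if __name__ == "__main__":
    print(resolve_options("google", {}))
    print(resolve_options("aws-s3express", {"example.storage.class": "standard"}))
```

Because the expansion is a pure function from a profile name to options, it could live either inside the benchmark or in a wrapper layer above it, which are the two placements mentioned in the comment.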




[jira] [Commented] (HDFS-17316) Compatibility Benchmark over HCFS Implementations

2024-02-07 Thread Han Liu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17815265#comment-17815265
 ] 

Han Liu commented on HDFS-17316:


A new PR has been created for this issue:

[https://github.com/apache/hadoop/pull/6535]

Any comments are welcome!




[jira] [Commented] (HDFS-17316) Compatibility Benchmark over HCFS Implementations

2024-02-07 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17815262#comment-17815262
 ] 

ASF GitHub Bot commented on HDFS-17316:
---

HanFreedom opened a new pull request, #6535:
URL: https://github.com/apache/hadoop/pull/6535

   ### Description of PR
   
   A new hadoop-compat-bench module which is directly related to HDFS-17316.
   
   ### How was this patch tested?
   
   Unit test
   
   ### For code changes:
   
   - [ ] Does the title or this PR starts with the corresponding JIRA issue id 
(e.g. 'HADOOP-17799. Your PR title ...')?  YES
   - [ ] Object storage: have the integration tests been executed and the 
endpoint declared according to the connector-specific documentation?  Not 
applicable
   - [ ] If adding new dependencies to the code, are these dependencies 
licensed in a way that is compatible for inclusion under [ASF 
2.0](http://www.apache.org/legal/resolved.html#category-a)?  No new dependency
   - [ ] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, 
`NOTICE-binary` files?  Not applicable
   
   










[jira] [Commented] (HDFS-17316) Compatibility Benchmark over HCFS Implementations

2024-02-05 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17814392#comment-17814392
 ] 

Steve Loughran commented on HDFS-17316:
---

bq.  However, only one FS instance can be supported per run. This is because 
the benchmark should be simple and fast.

That's fine; what I would like is to be able to identify a profile to use and 
have all the settings picked up from there. For the s3a test suites I can choose 
which store to run against, but the other options (storage class, fips, aws 
roles, ...) have to be configured too. Switching from one store to another 
requires me to comment out the ones not available. If we could be profile driven 
from the start, I'd do a run with -Dprofile=google vs -Dprofile=aws-s3express 
and have the relevant profiles picked up with my configs for each of them.
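
A minimal sketch of what profile-driven settings could look like, in Python for brevity. Everything here is an assumption for illustration: the `PROFILES` table, the setting keys, and the bucket URIs are hypothetical and are not part of the proposed tool or of the s3a test configuration.

```python
# Hypothetical profile table; in practice this might be loaded from a file.
# Keys, URIs and values are illustrative placeholders only.
PROFILES = {
    "default": {"fs.uri": "file:///tmp/compat", "storage.class": None, "fips": False},
    "google": {"fs.uri": "gs://example-bucket/compat"},
    "aws-s3express": {"fs.uri": "s3a://example-express-bucket/compat",
                      "storage.class": "example-class"},
}


def resolve_profile(name, profiles=PROFILES):
    """Return the default settings overlaid with the named profile's settings."""
    if name not in profiles:
        raise KeyError(f"unknown profile: {name}")
    settings = dict(profiles["default"])  # copy so the defaults stay untouched
    settings.update(profiles[name])       # profile values win over defaults
    return settings
```

With something like this, switching stores would only mean changing the profile name; options a store does not define fall back to the defaults instead of being commented out by hand.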







[jira] [Commented] (HDFS-17316) Compatibility Benchmark over HCFS Implementations

2024-02-03 Thread Han Liu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17814042#comment-17814042
 ] 

Han Liu commented on HDFS-17316:


Thanks for the comments from [~ste...@apache.org].
{quote}If you look at the hadoop test classes
 * we always need an explicit timeout
 * on abfs/wasb/s3a tests we want to support parallel test execution, where you 
can run multiple JUnit threads in parallel… each path and even temporary 
directories must be uniquely allocated to the given thread, which we do by 
passing a thread ID down.{quote}
Great suggestions. I'm preparing the initial code, which has not yet taken them 
into account. I'll add them to the doc and make a plan for the update.
{quote} * In HADOOP-18508 I'm trying to support having different (local/remote) 
hadoop source trees running tests against the same S3 store. Goal: I can run a 
hadoop public branch from one directory while debugging a different branch 
locally. Something like that is needed here, given that it seems intended to 
run against live HDFS clusters. Note that also brings authentication into the 
mix, e.g. the option to use kinit to log in before running the tests. This is 
not exclusive to HDFS either.{quote}
Exactly. All files/directories are created under the uri, so executions will 
not affect each other as long as the uri is uniquely generated.

For running tests with different hadoop branches against the same S3 cloud 
service, my understanding is that this is achieved naturally as long as the uri 
is uniquely generated each time.
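
As a rough illustration of the "uniquely generated uri" idea combined with the per-thread allocation mentioned in the quote, here is a short Python sketch. The function name `unique_test_uri` and the `compat-<run>/thread-<id>` layout are assumptions made for this example, not the tool's actual scheme.

```python
import threading
import uuid


def unique_test_uri(base_uri, run_id=None):
    """Build a per-run, per-thread working directory under base_uri.

    run_id defaults to a random UUID so two runs (for example from
    different source trees) never share a directory; the thread id keeps
    parallel test threads within one run apart.
    """
    run_id = run_id or uuid.uuid4().hex
    thread_id = threading.get_ident()
    return f"{base_uri.rstrip('/')}/compat-{run_id}/thread-{thread_id}"
```

Two calls without an explicit `run_id` yield different directories, so concurrent runs against the same store would not collide under this scheme.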
{quote}It will be good -if not initially supported then at least designed as a 
future option- to allow me to provide a list of stores to run the tests 
against. This is because for S3A testing I now have to qualify with: amazon 
s3, amazon s3 express, google gcs and at least one other implementation of the 
API. All of which have slightly different capabilities -the test process is 
going to need to somehow be driven so that for the different implementations it 
knows which features to test/results to expect. The current hadoop-aws/contract 
test design is not up to this.
{quote}
Definitely. I personally run the community-supported S3A FS implementation. I am 
glad that there would be many more targets so that we could comprehensively 
evaluate the benchmark. However, only one FS instance can be supported per run. 
This is because the benchmark should be simple and fast.
{quote}Microsoft are pushing hard at windows support. For the shell operations 
it might be very good if, rather than bash/sh/zsh, python and pyunit were 
the test runner, which could then invoke Windows commands as well as shell 
scripts. Pyunit test reports can be aggregated and displayed in Jenkins, which 
is another nice feature of them.
{quote}
Windows support is important but has not been taken into account yet. It is in 
the future plan.

Some initial code will be finished soon; then I'll post a link.


[jira] [Commented] (HDFS-17316) Compatibility Benchmark over HCFS Implementations

2024-01-29 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17811840#comment-17811840
 ] 

Steve Loughran commented on HDFS-17316:
---

Good read. 

If you look at the hadoop test classes 
* we always need an explicit timeout
* on abfs/wasb/s3a tests we want to support parallel test execution, where you 
can run multiple JUnit threads in parallel… each path and even temporary 
directories must be uniquely allocated to the given thread, which we do by 
passing a thread ID down.
* In HADOOP-18508 I'm trying to support having different (local/remote) hadoop 
source trees running tests against the same S3 store. Goal: I can run a hadoop 
public branch from one directory while debugging a different branch locally. 
Something like that is needed here, given that it seems intended to run against 
live HDFS clusters. Note that also brings authentication into the mix, e.g. the 
option to use kinit to log in before running the tests. This is not exclusive 
to HDFS either.

It will be good -if not initially supported then at least designed as a future 
option- to allow me to provide a list of stores to run the tests against. This 
is because for S3A testing I now have to qualify with: amazon s3, amazon s3 
express, google gcs and at least one other implementation of the API. All of 
which have slightly different capabilities -the test process is going to need to 
somehow be driven so that for the different implementations it knows which 
features to test/results to expect. The current hadoop-aws/contract test design 
is not up to this. 

Microsoft are pushing hard at windows support. For the shell operations it 
might be very good if, rather than bash/sh/zsh, python and pyunit were 
the test runner, which could then invoke Windows commands as well as shell 
scripts. Pyunit test reports can be aggregated and displayed in Jenkins, which 
is another nice feature of them.
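
To make the python/pyunit suggestion concrete, here is a minimal sketch of a shell-command case driven by Python's `unittest` with an explicit timeout. The `run_command` helper and the `ShellCompatTest` class are hypothetical names; the command is a portable placeholder, where a real case might invoke something like `hadoop fs -ls <uri>`.

```python
import subprocess
import sys
import unittest


def run_command(argv, timeout_seconds=30):
    """Run a command with an explicit timeout; return (exit_code, stdout).

    subprocess.run raises subprocess.TimeoutExpired if the command hangs,
    so a stuck shell operation fails the test instead of blocking the run.
    """
    result = subprocess.run(argv, capture_output=True, text=True,
                            timeout=timeout_seconds)
    return result.returncode, result.stdout.strip()


class ShellCompatTest(unittest.TestCase):
    # Placeholder command that works on Windows and Unix alike.
    COMMAND = [sys.executable, "-c", "print('ok')"]

    def test_command_succeeds(self):
        code, out = run_command(self.COMMAND)
        self.assertEqual(code, 0)
        self.assertEqual(out, "ok")
```

Because the command is passed as an argument list rather than a shell string, the same test body can run on Windows and on Unix shells; it can be executed with `python -m unittest`.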




[jira] [Commented] (HDFS-17316) Compatibility Benchmark over HCFS Implementations

2024-01-16 Thread Han Liu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17807603#comment-17807603
 ] 

Han Liu commented on HDFS-17316:


Thank you for the detailed comments, [~ste...@apache.org], and apologies 
for the late reply. Recently I have been reviewing my code so that it fits the 
Hadoop code format.
{quote}1. filesystem contract tests are designed to do this from junit; If your 
FS implementation doesn't subclass and run these, you need to start there.
{quote}
Contract tests play an essential role in evaluating storage service 
abilities: they closely examine core FS functions such as create, open, 
delete, etc. There is an overlap between contract tests and the benchmark 
discussed here. The main difference is that contract tests mainly focus on the 
quality of the most important subset of FS APIs, with a series of cases 
designed for each API.

The goal of the benchmark proposed here is to provide a general way to check 
basic compatibility of FS public APIs, treating all interfaces equally and 
covering all of them, including ACL, XAttr, StoragePolicy, Snapshot, Symlink, 
etc. For a new FS implementation, it should be possible to run the benchmark 
quickly as long as the implementation jar file is supplied. The benchmark 
would also introduce the concept of a 'suite', corresponding to a subset of 
APIs, aiming to check compatibility for specific scenarios such as 'tpcds'.
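
The 'suite' concept can be pictured as a named selection over API groups. The following Python sketch is only an illustration under assumptions: the grouping in `ALL_CASES` and, in particular, which groups the 'tpcds' suite contains are invented here and are not taken from the design doc.

```python
# Hypothetical case registry: API group -> case names (illustrative only).
ALL_CASES = {
    "basic": ["create", "open", "delete", "rename"],
    "acl": ["setAcl", "getAclStatus"],
    "xattr": ["setXAttr", "getXAttr"],
    "snapshot": ["createSnapshot", "deleteSnapshot"],
}

# Hypothetical suites: each names the API groups a scenario needs.
# The content of 'tpcds' is an assumption made for this example.
SUITES = {
    "all": list(ALL_CASES),
    "tpcds": ["basic"],
}


def cases_for_suite(suite):
    """Expand a suite name into the flat, ordered list of cases to run."""
    return [case for group in SUITES[suite] for case in ALL_CASES[group]]
```

Under this model, adding a scenario means adding one entry to the suite table rather than writing new cases.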
{quote}2. filesystem API specification is intended to specify the API and 
document where problems surface. maintenance there always welcome -and as the 
contract tests are derived from it, enhancements in those tests to follow
{quote}
It is important that the API specification stays maintained. As the Hadoop 
ecosystem develops, the API core functions might evolve and require new 
contract cases. MultipartUploaderTest is an example. I am glad to keep an eye 
on it, and contribute more cases when needed.
{quote}3. there's also terasort to validate commit protocols
{quote}
I agree that TeraSort can be used as part of the compatibility benchmark. There 
can be an individual suite for the validity of the MapReduce file output 
committer.
{quote}4. + distcp contract tests for its semantics
{quote}
The validity of DistCp can also be an individual suite, where the test case is 
a DistCp job from a MiniDFSCluster to the target storage service.
{quote}5. dfsio does a lot, but needs maintenance -it only targets the 
clusterfs, when really you should be able to point at cloud storage from your 
own computer. extending that to take a specific target fs would be good.
{quote}
I agree that DFSIO should be extended to general targets. This should be done 
in Hadoop as a separate task, so that the benchmark tool can use it. Good idea!
{quote}6. output must go into the classic ant junit xml format so jenkins can 
present it.
{quote}
Good suggestion. The benchmark is designed as a tool for quickly evaluating 
the compatibility score of an FS implementation. It might be inappropriate to 
treat it as a unit test system. All cases must be simple, and after a quick run 
a report is automatically generated showing an overall score and a list of 'not 
compatible APIs'. The framework contains both Java cases and pjdfstest-style 
shell scripts. Thus, the benchmark framework is more flexible and does not need 
a junit report.
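
A minimal sketch of the kind of report described above: an overall score plus the list of APIs that are not compatible. The function names, the input shape, and the output layout are all assumptions for illustration; the actual tool's report format is not specified in this thread.

```python
def build_report(results):
    """Summarise case results.

    results maps an API/case name to True (compatible) or False.
    Returns the percentage of compatible cases and the sorted failures.
    """
    total = len(results)
    failed = sorted(name for name, ok in results.items() if not ok)
    score = 100.0 * (total - len(failed)) / total if total else 0.0
    return {"score": score, "not_compatible": failed}


def format_report(report):
    """Render the report as plain text, one failed API per line."""
    lines = [f"Compatibility score: {report['score']:.1f}%"]
    lines += [f"  not compatible: {name}" for name in report["not_compatible"]]
    return "\n".join(lines)
```

An empty result set is scored as 0 rather than 100 here, on the assumption that a run which executed nothing should not read as fully compatible.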
{quote}We can create a new hadoop git repo for this. Do you have existing code 
and any detailed specification/docs? This also allows you to add dependencies 
on other things, e.g. spark.
{quote}
Yes, I already have some initial code and will submit a PR later for easier 
reference. I'm also preparing a design doc with more details and will share 
the link here when it is ready. The benchmark we discussed does not need 
extra dependencies on spark or hive. On the contrary, the design may limit the 
dependencies to Hadoop itself. Thus, a small submodule of the Hadoop repo might 
be OK, perhaps a hadoop-compat-bench module under hadoop-tools.

Further discussion and code review are welcome!


[jira] [Commented] (HDFS-17316) Compatibility Benchmark over HCFS Implementations

2024-01-02 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17801917#comment-17801917
 ] 

Steve Loughran commented on HDFS-17316:
---

I'd propose decoupling this from the core hadoop/ source tree so it can be 
built against 3.3 and 

bq. there is no formal suite to do compatibility assessment of a file system 
for all such HCFS implementations. Thus, whether the functionality is well 
accomplished and meets the core compatible expectations mainly relies on 
service provider's own report. 

# filesystem contract tests are designed to do this from junit; if your FS 
implementation doesn't subclass and run these, you need to start there.
# filesystem API specification is intended to specify the API and document 
where problems surface. maintenance there always welcome -and as the contract 
tests are derived from it, enhancements in those tests to follow
# there's also terasort to validate commit protocols
# + distcp contract tests for its semantics
# dfsio does a lot, but needs maintenance -it only targets the clusterfs, when 
really you should be able to point at cloud storage from your own computer. 
extending that to take a specific target fs would be good.
# output must go into the classic ant junit xml format so jenkins can present it.
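
For the last point, here is a small sketch of emitting results in the JUnit-style XML layout that Jenkins reads (a `testsuite` element containing `testcase` elements, with a `failure` child for failed cases). It is written in Python with only the standard library; the function name and the input tuple shape are assumptions for this example.

```python
import xml.etree.ElementTree as ET


def to_junit_xml(suite_name, results):
    """Serialise results as a JUnit-style <testsuite> document.

    results: list of (case_name, seconds, failure_message_or_None).
    """
    failures = sum(1 for _, _, msg in results if msg is not None)
    suite = ET.Element("testsuite", name=suite_name, tests=str(len(results)),
                       failures=str(failures), errors="0")
    for case_name, seconds, msg in results:
        case = ET.SubElement(suite, "testcase", classname=suite_name,
                             name=case_name, time=f"{seconds:.3f}")
        if msg is not None:
            # A <failure> child marks the case as failed in the report.
            ET.SubElement(case, "failure", message=msg)
    return ET.tostring(suite, encoding="unicode")
```

Writing the returned string to a `TEST-*.xml` file is the usual way to let a Jenkins JUnit report step pick it up.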

We can create a new hadoop git repo for this. Do you have existing code and any 
detailed specification/docs? This also allows you to add dependencies on other 
things, e.g. spark.






