Howdy, we did something similar in a Maven repository manager codebase (Nx1/Nx2, but the same holds in Nx3), as we had the exact same requirements.
See this class, IMO it does exactly what you want:
https://github.com/sonatype/nexus-public/blob/main/components/nexus-mime/src/main/java/org/sonatype/nexus/mime/internal/DefaultMimeSupport.java#L138

It can detect several MIME types for the same input (it "unravels" aliases and the type hierarchy), either by content or by filename.

It was also important to override some Tika defaults (for example, in the Maven universe the ".rar" extension usually means a resource-adapter JAR, not the RAR compression format). We achieved that by augmenting Tika with extra rules, the same way the "built-ins" are defined (the mechanism is user-extensible):
https://github.com/sonatype/nexus-public/blob/main/components/nexus-mime/src/main/resources/builtin-mimetypes.properties
and
https://github.com/sonatype/nexus-public/blob/main/components/nexus-mime/src/main/resources/org/apache/tika/mime/custom-mimetypes.xml

As for your first question, Nx3 does "content validation" like this (it uses the class above):
https://github.com/sonatype/nexus-public/blob/main/components/nexus-repository-services/src/main/java/org/sonatype/nexus/repository/mime/DefaultContentValidator.java

HTH
T

On Tue, Sep 27, 2022 at 9:24 AM Peter Conrad <[email protected]> wrote:

> Hello,
>
> I'm working on a server application where clients can upload pieces of
> data together with the data's MIME type, and I would like to verify
> that the given data is valid in terms of the given type (for a broad
> definition of "valid").
>
> I have tried using Tika.detect in various ways, but the results were
> not satisfying so far, the general problem being that multiple MIME
> types might be valid for some given data whereas Tika will return only
> one match.
>
> For example, a piece of HTML source that is valid as "text/html" would
> also be valid as "text/plain", and a piece of text with charset US-ASCII
> would also be valid with charset UTF-8.
>
> While I've had *some* success with giving the client-provided type as
> metadata to Tika.detect(), at least in the text/* case, there are other
> cases where not only multiple subtypes may apply but also multiple
> supertypes (e.g. the string "P2 2 2 1 0 1 1 0" is valid text/plain but
> also valid image/x-portable-graymap; here Tika always returns the
> image type, never the text type). Also, using the client-provided
> MIME type sometimes leads to false results, e.g. the byte sequence
> (1,2,3,4,5) would be accepted as image/gif, which it clearly isn't.
>
> * Is there a way of using Tika to answer the question "is <data> a valid
>   instance of <type>?"
> * Is there a way to ask Tika "give me all possible <type>s for <data>"
>   instead of just "give me the best match"?
>
> Thanks for your suggestions,
>
> Peter
> --
> Cyrano UG (haftungsbeschränkt)
> Alicestr. 102
> 63263 Neu-Isenburg
> Germany
>
> Tel.: +49 6102 821206
>
> Managing Director: Peter Conrad
>
> AG Offenbach
> HRB Nr. 47931
>
> VAT ID: DE296491819
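
PS: to make the "unravel aliases and hierarchy" idea from the top of this mail a bit more concrete, here is a minimal, stdlib-only sketch. It is not the Nexus code linked above. The class name `MimeTypeUnraveler`, its methods, and the two lookup tables are made up for illustration, and the table contents are hand-written sample data. In a Tika-based version the same information would come from `MediaTypeRegistry.getSupertype()` and `MediaTypeRegistry.getAliases()`.

```java
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper: expands one detected type into every type it also counts as. */
final class MimeTypeUnraveler {

    // Sample data only (assumption): child type -> direct supertype.
    private static final Map<String, String> SUPERTYPES = Map.of(
            "text/html", "text/plain",
            "application/xml", "text/plain",
            "text/plain", "application/octet-stream",
            "image/gif", "application/octet-stream");

    // Sample data only (assumption): canonical type -> alternative names.
    private static final Map<String, List<String>> ALIASES = Map.of(
            "application/xml", List.of("text/xml"));

    /** Returns the detected type, its aliases, and all supertypes, most specific first. */
    static Set<String> unravel(String detected) {
        Set<String> result = new LinkedHashSet<>();
        Set<String> visited = new HashSet<>(); // guards against cycles in the table
        for (String t = detected; t != null && visited.add(t); t = SUPERTYPES.get(t)) {
            result.add(t);
            result.addAll(ALIASES.getOrDefault(t, List.of()));
        }
        return result;
    }

    /** True if the client-declared type is the detected type, an alias, or a supertype of it. */
    static boolean accepts(String declared, String detected) {
        return unravel(detected).contains(declared);
    }
}
```

With this, content detected as "text/html" is also accepted when the client declares "text/plain", while bytes detected only as "application/octet-stream" are not accepted as "image/gif". Note that the walk only goes upward from one detected type, so content that matches two unrelated types (the graymap example) would still need more than one detection result as input.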
