On Mon, 25 Jul 2022, Oscar Rieken Jr via user wrote:
> I am currently trying to validate our Tika setup and was looking for a
> set of example data I could use

If you want a small number of files covering lots of different types, the
test files in the Tika source tree will work. The main set is in:
tika-parsers/src/test/resources/test-documents/
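
To get a feel for which file types that directory covers, something like
the rough sketch below can tally the files by extension. It is only an
illustration: it assumes a local checkout of the Tika source tree with the
path above relative to the current directory, and the helper name
count_extensions is made up for this example. Extensions are only a rough
guide, since Tika itself detects types from content.

```python
from collections import Counter
from pathlib import Path


def count_extensions(root):
    """Return a Counter mapping lowercased file extension to file count."""
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            # Files without an extension are grouped under "(none)".
            counts[path.suffix.lower() or "(none)"] += 1
    return counts


if __name__ == "__main__":
    # Assumed location: run from the root of a Tika source checkout.
    root = "tika-parsers/src/test/resources/test-documents"
    for ext, n in count_extensions(root).most_common(20):
        print(f"{n:6d}  {ext}")
```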

If you want a very large number of files, then the Tika Corpora collection
is a good source. We have a few different collections, including files
from Common Crawl, govdocs and bug trackers. If you can let us know what
sort of file types you need and how many, we can suggest the most suitable
collection.

Nick