Re: Datasets for testing large number of attachments

2022-07-26 Thread Oscar Rieken Jr via user
Subject: Re: Datasets for testing large number of attachments
What Nick said... cc_large is a sample of some of the larger documents from commoncrawl3_refetched. If you want to give your pipeline a workout, I also recomme

Re: Datasets for testing large number of attachments

2022-07-26 Thread Oscar Rieken Jr via user
Awesome, thanks, I'll give this a shot!
From: Nicholas DiPiazza
Date: Tuesday, July 26, 2022 at 3:13 PM
To: user@tika.apache.org, talli...@apache.org
Cc: Oscar Rieken Jr, corpora-...@tika.apache.org
Subject: Re: Datasets for testing large number of attachments
Script I used

Re: Datasets for testing large number of attachments

2022-07-26 Thread Nicholas DiPiazza
cutable 1455
>> video/mpeg 1366
>> application/pkcs7-signature 1359
>> application/x-ms-asx 1266
>> image/vnd.zbrush.pcx 1247
>> image/vnd.dwg 1243
>> application/fits 1217
>> application/xslfo+xml 1206
>> application/x-sharedlib

Re: Datasets for testing large number of attachments

2022-07-26 Thread Tim Allison
nd.dwg 1243
> application/fits 1217
> application/xslfo+xml 1206
> application/x-sharedlib 1185
> audio/prs.sid 1173
> text/x-vcalendar 1156
>
> On Tue, Jul 26, 2022 at 2:12 PM Oscar Rieken Jr <oscar.rieke...@cofense.com> wrote:
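The figures quoted above read as a per-MIME-type tally of a corpus, sorted by count. The script that produced them is cut off in this listing, so what follows is only a minimal sketch of that kind of tally, assuming the detected MIME types are already available as plain strings. The names `tally_mime_types` and `format_tally` and the sample values are hypothetical and do not come from the thread.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def tally_mime_types(mime_types: Iterable[str]) -> List[Tuple[str, int]]:
    """Count each MIME type, most frequent first (ties sorted alphabetically).

    Hypothetical helper: the input is assumed to be the output of some
    earlier detection step, one MIME type string per file.
    """
    # Normalise case and whitespace, and skip blank entries.
    counts = Counter(m.strip().lower() for m in mime_types if m and m.strip())
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def format_tally(tally: List[Tuple[str, int]]) -> str:
    """Render the tally as aligned 'mime-type  count' lines."""
    width = max((len(mime) for mime, _ in tally), default=0)
    return "\n".join(f"{mime.ljust(width)}  {count}" for mime, count in tally)


if __name__ == "__main__":
    # Made-up sample input, not data from the thread.
    sample = ["video/mpeg", "image/vnd.dwg", "video/mpeg", "application/fits"]
    print(format_tally(tally_mime_types(sample)))
```

Sorting on `(-count, name)` keeps the output stable between runs, which makes it easier to diff tallies from two corpora.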

Re: Datasets for testing large number of attachments

2022-07-26 Thread Tim Allison
9:19 AM
> *To:* user@tika.apache.org
> *Cc:* Oscar Rieken Jr, corpora-...@tika.apache.org
> *Subject:* Re: Datasets for testing large number of attachments
>
> What Nick said...
>
> cc_large is a sample of some of the larger doc

Re: Datasets for testing large number of attachments

2022-07-26 Thread Oscar Rieken Jr via user
: Datasets for testing large number of attachments
What Nick said... cc_large is a sample of some of the larger documents from commoncrawl3_refetched. If you want to give your pipeline a workout, I also recommend using the MockParser that is available in the tika-core tests jar

Re: Datasets for testing large number of attachments

2022-07-26 Thread Tim Allison
What Nick said... cc_large is a sample of some of the larger documents from commoncrawl3_refetched. If you want to give your pipeline a workout, I also recommend using the MockParser that is available in the tika-core tests jar. That allows you to instrument an OOM and timeouts and system exits

Re: Datasets for testing large number of attachments

2022-07-26 Thread Nick Burch
On Mon, 25 Jul 2022, Oscar Rieken Jr via user wrote:
> I am currently trying to validate our Tika setup and was looking for a set of example data I could use
If you want a small number of files of lots of different types, the test files in the Tika source tree will work. Main set are in

Datasets for testing large number of attachments

2022-07-25 Thread Oscar Rieken Jr via user
I am currently trying to validate our Tika setup and was looking for a set of example data I could use. I found this dir -> Index of /base/docs/cc_large (apache.org). Would I just download that data set, or is there another place with multiple