Review Request 31758: TIKA-1330: tika batch code

2015-03-04 Thread Tim Allison
g/r/31758/diff/ Testing --- Code has been in development as part of another fielded project for the last two years. Numerous unit tests...could always use more Thanks, Tim Allison

Re: Review Request 31758: TIKA-1330: tika batch code

2015-03-05 Thread Tim Allison
me! :) I wrote that one. Fixed oap->oa :) - Tim --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/31758/#review75289 ------- On M

Re: Review Request 31758: TIKA-1330: tika batch code

2015-03-09 Thread Tim Allison
org/r/31758/#review75632 --- On March 5, 2015, 3:07 a.m., Tim Allison wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/31758/ > --

Re: Review Request 31758: TIKA-1330: tika batch code

2015-03-09 Thread Tim Allison
211 Diff: https://reviews.apache.org/r/31758/diff/ Testing --- Code has been in development as part of another fielded project for the last two years. Numerous unit tests...could always use more Thanks, Tim Allison

Re: Review Request 32291: ISATab parsers (preliminary version)

2015-03-23 Thread Tim Allison
have to do this? metadata.add(values[0], values[i].replaceAll("(^\")|(\"$)","")); +1 to Chris's recommendation to create short dummy test files that are clean with respect to ASL 2.0. - Tim Allison On March 23, 2015, 5:04 p.m., Giuseppe Totaro wrote: > >
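
For readers following the review comment above, here is a minimal sketch of what the quoted `replaceAll` expression does on its own: it removes one double quote at the start of the value and one at the end, and leaves interior quotes alone. The class and method names are hypothetical and are not taken from the ISATab patch; only the regular expression comes from the message.

```java
public class QuoteStripper {

    // Removes a single leading and a single trailing double quote, if present.
    // The pattern is the one quoted in the review comment; everything else
    // here is illustrative.
    static String stripOuterQuotes(String value) {
        return value.replaceAll("(^\")|(\"$)", "");
    }

    public static void main(String[] args) {
        System.out.println(stripOuterQuotes("\"Homo sapiens\""));   // outer quotes removed
        System.out.println(stripOuterQuotes("no quotes"));          // unchanged
        System.out.println(stripOuterQuotes("say \"hi\" now"));     // interior quotes kept
    }
}
```

Whether the stripping is needed at all depends on how the tab-separated values were split upstream, which is the question the review comment raises.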

3.0.0-BETA2 release?

2024-05-07 Thread Tim Allison
All, I'd like to go for another 3.x beta release and then move fairly quickly to a 3.0.0 release. I was hoping that https://issues.apache.org/jira/browse/TIKA-4221 would be wrapped up soon. It hasn't been, but I can add the workaround we did in 2.x. What do you think? Any blockers? Be

multi-arch support for tika-docker!

2024-05-21 Thread Tim Allison
All, Many thanks to the many community members who helped figure this out and get it out the door! As of tika-docker 2.9.2.1, we now have multi-arch support (and on noble!). Let us know if there are any surprises. Thank you, again! Cheers, Tim Ref: https://hub.docker.com/r

Re: Automatically applying checkstyle fixes

2024-06-22 Thread Tim Allison
https://issues.apache.org/jira/browse/TIKA-4251 Anything that works and doesn't allow wildcard imports I'm good with. Have you had luck with OpenRewrite? On Wed, Jun 19, 2024 at 12:55 PM Nicholas DiPiazza < nicholas.dipia...@gmail.com> wrote: > Hey Tim and Team: > > I remember someone stating at

Re: Automatically applying checkstyle fixes

2024-06-24 Thread Tim Allison
< nicholas.dipia...@gmail.com> wrote: > I just started using it for a big project and it is awesome > > On Sat, Jun 22, 2024, 6:11 AM Tim Allison wrote: > > > https://issues.apache.org/jira/browse/TIKA-4251 > > > > Anything that works and doesn't allow wild

Re: how do i build a new beta version?

2024-06-26 Thread Tim Allison
Like a 3.0.0-BETA2 release? On Wed, Jun 26, 2024 at 12:06 PM Nicholas DiPiazza < nicholas.dipia...@gmail.com> wrote: > At some point I would like to build a 3.0.0 beta version. > > How can I go about this? > > -Nicholas >

Re: how do i build a new beta version?

2024-06-27 Thread Tim Allison
cripts. how do i go > > about getting that created any idea? > > > > On Wed, Jun 26, 2024 at 2:41 PM Tim Allison > > wrote:If we > > > >> Like a 3.0.0-BETA2 release? > >> > >> On Wed, Jun 26, 2024 at 12:06 PM Nicholas DiPiazza

3.0.0-BETA2 next week?

2024-07-03 Thread Tim Allison
All, I think it is time to go for a 3.0.0-BETA2. What do you think about cutting that release this Friday or maybe next week? Best, Tim

Re: Release of Beta2?

2024-07-09 Thread Tim Allison
Doh. Sorry. Starting now... On Tue, Jul 9, 2024 at 12:47 PM Nicholas DiPiazza < nicholas.dipia...@gmail.com> wrote: > Hi all, > > Just seeing if we were planning to build Beta2 today? I'd like to tag along > and see how it's done if ya'll don't mind! > > -Nicholas >

Re: Release of Beta2?

2024-07-09 Thread Tim Allison
Let's aim for tomorrow after review of TIKA-4275? Any other fellow devs want to join? On Tue, Jul 9, 2024 at 4:46 PM Tim Allison wrote: > Doh. Sorry. Starting now... > > On Tue, Jul 9, 2024 at 12:47 PM Nicholas DiPiazza < > nicholas.dipia...@gmail.com> wrote: > >

Re: maven-deploy pulling extraneous dependency's metadata?!

2024-07-10 Thread Tim Allison
Sorry, should have been dev@tika earlier, not private@tika. I rolled back the deploy-plugin to 3.1.1, which was successful for our last deployment of 2.9.2. That worked then. It does not work now with this new tika-grpc module. On Wed, Jul 10, 2024 at 3:42 PM Tim Allison wrote: > Apache Ma

Re: maven-deploy pulling extraneous dependency's metadata?!

2024-07-10 Thread Tim Allison
the same stuff INSTALLed as well, see line 32! > Looking more... > > > On Wed, Jul 10, 2024 at 9:44 PM Tim Allison wrote: > > > Apache Maven 3.9.7 (8b094c9513efc1b9ce2d952b3b9c8eaedaf8cbf0) > > Maven home: /apache/apache-maven-3.9.7 > > Java version: 11.0.23, vendor

[VOTE] Release Apache Tika 3.0.0-BETA2 Candidate #1

2024-07-12 Thread Tim Allison
A candidate for the Tika 3.0.0-BETA2 release is available at: https://dist.apache.org/repos/dist/dev/tika/3.0.0-BETA2 The release candidate is a zip archive of the sources in: https://github.com/apache/tika/tree/3.0.0-BETA2-rc1/ The SHA-512 checksum of the archive is 8a4142f61110f196c550146637994
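
Vote mails like the one above publish a SHA-512 checksum for the source archive. As a rough sketch of how a reviewer might check a downloaded file against that value, assuming the archive is small enough to read into memory, something like the following would work. The class name, method name, and command-line arguments are hypothetical; the expected checksum must be copied from the vote mail itself.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Sha512Check {

    // Returns the SHA-512 digest of the given bytes as lowercase hex.
    static String sha512Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-512");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(data)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-512 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) throws Exception {
        // args[0]: path to the downloaded archive (hypothetical usage)
        // args[1]: checksum string copied from the vote mail
        String actual = sha512Hex(Files.readAllBytes(Paths.get(args[0])));
        System.out.println(actual.equalsIgnoreCase(args[1])
                ? "checksum matches" : "checksum MISMATCH");
    }
}
```

A checksum match only shows the download is intact; the signature check against the project's KEYS file is a separate step.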

Re: [VOTE] Release Apache Tika 3.0.0-BETA2 Candidate #1

2024-07-12 Thread Tim Allison
> dependencies > (I've added these so we support these other projects by testing them), > and decide about the ffmpeg issue and the hdf5 issue. > > Tilman > > On 12.07.2024 18:08, Tim Allison wrote: > > A candidate for the Tika 3.0.0-BETA2 release is available at: >

[RESULT][VOTE] Release Apache Tika 3.0.0-BETA2 Candidate #1

2024-07-15 Thread Tim Allison
The vote has passed with 3 PMC +1s, 2 non-binding +1s and no -1s. +1s (binding) Tim Allison Nicholas DiPiazza Tilman Hausherr +1s (non-binding) Kiran Bachu Gary Gregory I'll release the artifacts shortly and update the website. Thank you, all! Best, Tim On Fri, Jul 12, 2024

Re: [RESULT][VOTE] Release Apache Tika 3.0.0-BETA2 Candidate #1

2024-07-15 Thread Tim Allison
I released the artifacts and built the docker images. I'll work on the site and announcement tomorrow. On Mon, Jul 15, 2024 at 1:50 PM Tim Allison wrote: > > The vote has passed with 3 PMC +1s, 2 non-binding +1s and no -1s. > > +1s (binding) > Tim Allison > Nicholas DiP

[ANNOUNCE] Apache Tika 3.0.0-BETA2 released

2024-07-16 Thread Tim Allison
0.0. -- Tim Allison, on behalf of the Apache Tika community

3.0.0 release?

2024-08-21 Thread Tim Allison
All, There are a couple of items documented on https://issues.apache.org/jira/browse/TIKA-4280 that we wanted to take care of before the 3.0.0 release. I can run a comparison btwn 2.x and 3.x on our regression corpus, and I can try to deal with javadocs. Any recs on how to wrap up the othe

Wiki page for advanced document processing

2024-08-22 Thread Tim Allison
All, With the explosion of vision models and methods for creating embeddings from images (and PDFs!), I thought it might be useful to start a wiki page that captures some of the techniques currently in use. There is such dynamism in the document intelligence/document engineering space that w

Re: Using pf4j for tika pipes

2024-08-26 Thread Tim Allison
1pm ET today? On Sat, Aug 24, 2024 at 1:10 PM Nicholas DiPiazza < nicholas.dipia...@gmail.com> wrote: > Dear Tika Devs: > > Tika pipes in production had a blocker problem for my peoples in that the > extensible Fetcher objects we have loaded into the Tika Server and Tika > Grpc Server would have

Re: Using pf4j for tika pipes

2024-08-26 Thread Tim Allison
e > > On Mon, Aug 26, 2024 at 8:43 AM Nicholas DiPiazza < > nicholas.dipia...@gmail.com> wrote: > > > I just got assigned a candidate to interview at that time. > > > > Could you move it to 2pm EST? > > > > On Mon, Aug 26, 2024 at 8:27 AM Tim Allison wro

[CVE-2016-4434] Apache Tika XML External Entity vulnerability

2016-05-26 Thread Tim Allison
CVE-2016-4434: Apache Tika XML External Entity vulnerability Severity: Important Vendor: The Apache Software Foundation Versions Affected: Apache Tika 0.10 to 1.12 Description: Apache Tika parses XML within numerous file formats. In some instances[1], the initialization of the XML parser o

[CVE-2016-4434] Apache Tika XML External Entity vulnerability

2016-05-26 Thread Tim Allison
CVE-2016-4434: Apache Tika XML External Entity vulnerability Severity: Important Vendor: The Apache Software Foundation Versions Affected: Apache Tika 0.10 to 1.12 Description: Apache Tika parses XML within numerous file formats. In some instances[1], the initialization of the XML parser or

revoking signing key

2018-12-04 Thread Tim Allison
All, I had to revoke my signing key: EF0CF38A. I have a couple of leads, but if you know of anyone in the Washington, DC region who might be interested in signing my new key (944FFD51), let me know. Best, Tim

Re: Resource Sharing Tika Corpus with Any23

2018-12-10 Thread Tim Allison
Sorry for my delay, send me the usernames and email addresses privately and I'll grant access. We're coming up on a release cycle. On Fri, Nov 30, 2018 at 8:14 PM Lewis John McGibbney wrote: > > Hi Tim, > Thanks for the reply... answer inline > > On 2018/11/30 19:22:23,

Re: 1.20?

2018-12-10 Thread Tim Allison
Any blockers on 1.20? I'm going to kick off the regression tests shortly. On Fri, Nov 30, 2018 at 7:39 PM wrote: > > Hi, > On Wed, 21 Nov 2018 at 13:00, Tim Allison wrote: > > > Dave, > > Should I try to get the Docker plugin working again? > > > > Tha

Re: 1.20?

2018-12-13 Thread Tim Allison
ort, I think we're good to go. Will roll rc1 later today or (more likely) tomorrow unless there are objections. On Mon, Dec 10, 2018 at 9:37 PM Tim Allison wrote: > > Any blockers on 1.20? I'm going to kick off the regression tests shortly. > On Fri, Nov 30, 2018 at 7:39

Re: revoking signing key

2018-12-13 Thread Tim Allison
Tue, 4 Dec 2018, Tim Allison wrote: > > I had to revoke my signing key: EF0CF38A. I have a couple of leads, but > > if you know of anyone in the Washington, DC region who might be > > interested in signing my new key (944FFD51), let me know. > > Send a message to party@ and

Re: 1.20?

2018-12-13 Thread Tim Allison
s > > Em qui, 13 de dez de 2018 às 13:02, Tim Allison > escreveu: > > > Reports are here: > > > > http://162.242.228.174/reports/tika_1_20-pre-rc1.zip > > > > I'm going to revert the mp4 parser, and commit the few dependency > > upgrades I

Re: 1.20?

2018-12-13 Thread Tim Allison
Let me actually take a look before answering. Sorry! On Thu, Dec 13, 2018 at 5:30 PM Tim Allison wrote: > Thank you for reading the reports!!! > > The files are very likely broken. I can take a look. The change was > probably because of an "upgrade" to junrar.

Re: 1.20?

2018-12-14 Thread Tim Allison
e great! Onward! Cheers, Tim On Thu, Dec 13, 2018 at 5:34 PM Tim Allison wrote: > > Let me actually take a look before answering. Sorry! > > On Thu, Dec 13, 2018 at 5:30 PM Tim Allison wrote: >> >> Thank you for reading the reports!!! >> >> The file

[VOTE] Release Apache Tika 1.20 Candidate #1

2018-12-17 Thread Tim Allison
A candidate for the Tika 1.20 release is available at: https://dist.apache.org/repos/dist/dev/tika/ The release candidate is a zip archive of the sources in: https://github.com/apache/tika/tree/1.20-rc1/ The SHA-512 checksum of the archive is add29bebe0486f01bd57fe5bec5405df1af9f319a87c74295

Re: 1.20?

2018-12-18 Thread Tim Allison
Reports on mp4s, junrar, msaccess and a random subset of the regression corpus are available here: http://162.242.228.174/reports/reports_tika_1_20-rc1_subset.tgz On Thu, Dec 13, 2018 at 5:34 PM Tim Allison wrote: > > Let me actually take a look before answering. Sorry! > > On Thu,

[RESULT][VOTE] Release Apache Tika 1.20 Candidate #1

2018-12-22 Thread Tim Allison
The vote has passed: +1 from Oleg Tikhonov Ken Krugler Tim Allison no -1 Cheers, Tim On Sat, Dec 22, 2018 at 6:57 AM Oleg Tikhonov wrote: > > *stuff > > On Sat, Dec 22, 2018, 11:01 Oleg Tikhonov > > All basic staff passed. > > +1. > > Oleg > >

[ANNOUNCE] Apache Tika 1.20 released

2018-12-22 Thread Tim Allison
mirror site, please remember to verify the downloads using signatures found: https://www.apache.org/dist/tika/KEYS For more information on Apache Tika, visit the project home page: https://tika.apache.org/ -- Tim Allison, on behalf of the Apache Tika community

[CVE-2018-17197] Apache Tika Denial of Service -- Infinite Loop in Tika's SQLite3Parser

2018-12-22 Thread Tim Allison
ika's SQLite3Parser in versions 1.8-1.19.1 of Apache Tika. Mitigation: Apache Tika users should upgrade to 1.20 or later. Credit: This issue was discovered by Tim Allison on the Apache Tika Team.

Re: [VOTE] Release Apache Tika 1.20 Candidate #1

2018-12-22 Thread Tim Allison
> >> Hi Tim, > >> > >> Thanks for rolling the release. > >> > >> Built & validated on Mac OS X 10.12 > >> > >> Updated flink-crawler, all tests pass. > >> > >> So here’s my +1 > >> > >> — Ken

Re: Preferred logging implementation

2019-01-07 Thread Tim Allison
Not at all bothersome...thank you for the ping! Would POI actually ship with an implementation (e.g. log4j2) or only with slf4j? Our app now uses slf4j with a bridge to log4j. At some point -- Tika 2.0(?) -- we might upgrade to log4j2, but I'd think we'd want to keep slf4j. On Mon, Jan 7, 201

Re: revoking signing key

2019-01-09 Thread Tim Allison
All, I’d like to send a belated note of thanks to Kevin McGrail and Dave Fisher for signing my new key. That enabled the release of 1.20. Thank you!!! Cheers, Tim On Thu, Dec 13, 2018 at 10:10 AM Tim Allison wrote: > When I update our keys file, should I leave in my revo

[csv] csv format detector/sniffer?

2019-02-25 Thread Tim Allison
Commons-CSV team, We recently integrated Commons-CSV into Apache Tika. For now, we’re relying strictly on the filename for csv detection, and we’re relying on our AutodetectReader to identify the charset. It would be really useful for us to be able to detect: 1) A csv/tsv file vs a regular .t

Re: [csv] csv format detector/sniffer?

2019-02-25 Thread Tim Allison
ommons IO. > > Path path = Path.get(...); > Charset cs = org.apache.commons.io.CharsetDetector.detect(path); > org.apache.commons.csv.CSVParser.parse(path, charset, csvFormat); > > Thoughts? > > Gary > > > On Mon, Feb 25, 2019 at 10:23 AM Tim Allison wrote: > > > Commons-

Re: Wiki migration

2019-03-21 Thread Tim Allison
+1 let me know what I need to do. On Thu, Mar 21, 2019 at 1:02 PM Konstantin Gribov wrote: > > Hi, folks > > What do you think about starting wiki migration (from moin to confluence)? > > I can try it via selfservice.a.o if you consent but I'm not sure if I have > enough access to do so. Maybe on

Tika 1.21?

2019-04-08 Thread Tim Allison
All, PDFBox will be out in a few days, and POI should be out soon as well. I _think_ I'd like to get in a first draft of "auto" mode for OCR'ing PDFs (TIKA-2749), but other than that, I'd be willing to run a release of 1.21 in the next few weeks. WDYT? Best, Tim

Re: Wiki migration

2019-04-17 Thread Tim Allison
/spacepermissions.action?key=TIKA > > > > P. S. Chris, is chrismattmann your legitimate account there? Will you merge > > it with your LDAP account via INFRA ticket later? > > > > -- > > Best regards, > > Konstantin Gribov. > > > > >

Re: Wiki migration

2019-04-19 Thread Tim Allison
ccount explicitly. > > -- > Best regards, > Konstantin Gribov. > > > On Wed, Apr 17, 2019 at 10:49 PM Tim Allison wrote: > > > Thank you, Konstantin...would someone be able to grant me karma? > > > > The following error(s) occurred: > > > > You do not

Re: Wiki migration

2019-04-22 Thread Tim Allison
in to check when you have a > moment to do it. I removed explicit permissions now 'cause Gavin said that > all Tika committers and PMC are in tika group in cwiki. > > -- > Best regards, > Konstantin Gribov. > > > On Fri, Apr 19, 2019 at 11:22 PM Tim Allison wrote: >

Re: Tika 1.21?

2019-04-22 Thread Tim Allison
to get in...I'm happy to wait, though, till next week to start the regression tests. WDYT? Cheers, Tim On Mon, Apr 8, 2019 at 2:25 PM Oleg Tikhonov wrote: > > Great! > +1. > Thanks, > Oleg > > On Mon, Apr 8, 2019, 21:11 Tim Allison wrote: >

[COMPRESS] zip-based entry names/metadata data set available

2019-04-22 Thread Tim Allison
All, For some recent work on Apache Tika, I used commons-compress to extract entry names and metadata via a streaming read from roughly 500k zip-based files we have in Tika's regression corpus. I was happy to see we have some POI-generated files in there. :) I noticed some areas for improveme

Re: Wiki migration

2019-04-23 Thread Tim Allison
I'm in. Thank you, Konstantin! On Mon, Apr 22, 2019 at 1:22 PM Tim Allison wrote: > > No luck still. I'm able to login w my Apache credentials, but I don't > appear to have permissions to see anything. Should I open a ticket > with infra or comment on INFRA-18108?

Re: [EXTERNAL] Tika script

2019-04-26 Thread Tim Allison
https://wiki.apache.org/tika/Troubleshooting%20Tika#PDF_Text_Problems On Fri, Apr 26, 2019 at 4:00 PM Chris Mattmann wrote: > Hi, > > > > This would be a good question to ask on the dev@tika.a.o list so I’m > CC’ing them. > > > > Cheers, > > Chris > > > > > > From: Djari Imene > Date: Friday,

Re: [EXTERNAL] Tika script

2019-04-30 Thread Tim Allison
Fri, Apr 26, 2019 at 8:05 PM Tim Allison wrote: > https://wiki.apache.org/tika/Troubleshooting%20Tika#PDF_Text_Problems > > > On Fri, Apr 26, 2019 at 4:00 PM Chris Mattmann > wrote: > >> Hi, >> >> >> >> This would be a good question to as

Re: Quarkus integration

2019-05-02 Thread Tim Allison
Hmmm... are you still using: sergey_beryozkin as your user name? I see that you're in the PMC group with that username in JIRA. Should I add sergeyb or do you want to ask infra to merge the two identities...if that's possible? On Thu, May 2, 2019 at 1:08 PM Sergey Beryozkin wrote: > > Hi Tim >

Re: Quarkus integration

2019-05-03 Thread Tim Allison
I can add 'sergeyb' if you'd prefer! On Fri, May 3, 2019 at 5:43 AM Sergey Beryozkin wrote: > > Though I might need to settle on the 'sergeyb' eventually since it is my > apache committer id. > Thanks... > > On Fri, May 3, 2019 at 10:29 AM Sergey Beryozkin > wrote: > > > Oh, I forgot I had a 'se

Extracting AppleGPS Coordinates from an MP4

2019-05-03 Thread Tim Allison
Jcodec devs, I'm experimenting with extracting metadata from mp4s with jcodec[1]. I'm able to find the box with the AppleGPS Coordinates in it, but I can't figure out how to extract the string from the Box$LeafBox with id "©xyz"... any recommendations? Thank you! Best,

Re: Tika 1.21?

2019-05-03 Thread Tim Allison
d op etc) but that's not a blocker ,) > > -- > Best regards, > Konstantin Gribov. > > > On Tue, Apr 23, 2019 at 9:04 AM Oleg Tikhonov wrote: > > > +1 to wait if needed. > > > > On Mon, Apr 22, 2019, 23:23 Tim Allison wrote: > > > > > All, > >

Re: Wiki migration

2019-05-06 Thread Tim Allison
ds, > Konstantin Gribov. > > > On Tue, Apr 23, 2019 at 7:18 PM Tim Allison wrote: > > > I'm in. Thank you, Konstantin! > > > > On Mon, Apr 22, 2019 at 1:22 PM Tim Allison wrote: > > > > > > No luck still. I'm able to login w my Apache crede

Re: Tika 1.21?

2019-05-06 Thread Tim Allison
" .714" "application/octet-stream" "application/octet-stream" "637102" "450923" " .708" "application/vnd.ms-wordml" "application/vnd.ms-wordml" "10319" "7289" " .706" "t

Re: Tika 1.21?

2019-05-07 Thread Tim Allison
Will kick off regression tests again shortly. On Mon, May 6, 2019 at 8:37 PM Tim Allison wrote: > > Houston, we have a problem... The regression parsing took orders of > magnitude longer than normal. It is looking like something is going > seriously wrong (different?) with rfc

DL4JVGG16NetTest failures

2019-05-08 Thread Tim Allison
All, Apologies for the broken builds...I'm not able to reproduce this test failure on my mac or Windows machine. I'm testing the build on our regression vm now. If anyone has any idea why this is failing on our build vms (since we upgraded to dl4j-beta3), please let me know. Thank you.

Re: DL4JVGG16NetTest failures

2019-05-08 Thread Tim Allison
Yay! I am able to reproduce this on our vm... Onward to debugging... On Wed, May 8, 2019 at 9:57 AM Tim Allison wrote: > > All, > Apologies for the broken builds...I'm not able to reproduce this > test failure on my mac or Windows machine. I'm testing the build on

Re: DL4JVGG16NetTest failures

2019-05-08 Thread Tim Allison
Tim Allison wrote: > > Yay! I am able to reproduce this on our vm... Onward to debugging... > > On Wed, May 8, 2019 at 9:57 AM Tim Allison wrote: > > > > All, > > Apologies for the broken builds...I'm not able to reproduce this > > test failure on my

Re: [EXTERNAL] Re: DL4JVGG16NetTest failures

2019-05-08 Thread Tim Allison
heers, Tim On Wed, May 8, 2019 at 11:32 AM Chris Mattmann wrote: > > Thejan, Thamme any ideas? > > > > > > > > From: Tim Allison > Reply-To: "dev@tika.apache.org" > Date: Wednesday, May 8, 2019 at 7:50 AM > To: "dev@tika.apache.org"

Re: [EXTERNAL] Re: DL4JVGG16NetTest failures

2019-05-08 Thread Tim Allison
I think so. Works locally now. Actually works locally. :D On Wed, May 8, 2019 at 11:50 AM Chris Mattmann wrote: > Great work ☺ So it’s fixed? > > > > > > > > From: Tim Allison > Reply-To: "dev@tika.apache.org" > Date: Wednesday, May 8, 2019 at 8:43 A

Re: [EXTERNAL] Re: DL4JVGG16NetTest failures

2019-05-08 Thread Tim Allison
ning4j/deeplearning4j/commit/015fd10bddbc59416595f706af6be1d0d23f573f > > > On Wed, May 8, 2019 at 9:48 PM Chris Mattmann wrote: > > > Yayy Tim > > > > > > > > > > > > > > > > From: Tim Allison > > Reply-To: "dev@tik

Re: Tika 1.21?

2019-05-10 Thread Tim Allison
C. Cheers, Tim On Tue, May 7, 2019 at 5:04 PM Tim Allison wrote: > > Will kick off regression tests again shortly. > > On Mon, May 6, 2019 at 8:37 PM Tim Allison wrote: > > > > Houston, we have a problem... The regression parsing took orders of > > magnitude

[VOTE] Release Apache Tika 1.21 Candidate #1

2019-05-13 Thread Tim Allison
A candidate for the Tika 1.21 release is available at: https://dist.apache.org/repos/dist/dev/tika/ The release candidate is a zip archive of the sources in: https://github.com/apache/tika/tree/1.21-rc1/ The SHA-512 checksum of the archive is: 4bc861f3b9ba37df14726d8acf173185a5414b88774c0b00

Re: [VOTE] Release Apache Tika 1.21 Candidate #1

2019-05-14 Thread Tim Allison
All, I'm happy to close rc1 and respin an rc2 after Oleg's findings (TIKA-2871 and TIKA-2872)...many thanks, Oleg! I'm also happy to proceed with rc1 as is...Let me know your preferences. Cheers, Tim On Mon, May 13, 2019 at 1:32 PM Tim Alliso

Lang id performance degrades towards 100k characters?!

2019-05-14 Thread Tim Allison
All, Joern Kottman is working with us on [1], but I thought I'd do the proper community thing and raise this here as well. On Apache Tika, we're considering switching over to OpenNLP for language detection for tika-eval. We know that dumping a 100k chunk of text into OpenNLP for language detec

Re: Configuring mime type detection for password protected OOMXL

2019-05-14 Thread Tim Allison
2019 at 2:05 PM Tucker B wrote: > > On Tue, 14 May 2019, 13:52 Tim Allison, wrote: >> >> Hi Tucker, >> I know only a little about this area, but I think password protected >> xlsx files (and ooxml generally) are encrypted inside an OLE package >> so you ca

[CANCEL][VOTE] Release Apache Tika 1.21 Candidate #1

2019-05-14 Thread Tim Allison
> *Cc: *"u...@tika.apache.org" > *Subject: *Re: [VOTE] Release Apache Tika 1.21 Candidate #1 > > > > :-) > > I'm good with any option. RC1 seems to be good from my point of view. > > Cheers, > > Oleg > > > > On Tue, May 14, 2019 at 3:56

[VOTE] Release Apache Tika 1.21 Candidate #2

2019-05-14 Thread Tim Allison
A candidate for the Tika 1.21 release is available at: https://dist.apache.org/repos/dist/dev/tika/ The release candidate is a zip archive of the sources in: https://github.com/apache/tika/tree/1.21-rc2/ The SHA-512 checksum of the archive is: 67748553a44b3acb009f0e99ac595c5babfe04d4a75abd2e

Re: [VOTE] Release Apache Tika 1.21 Candidate #2

2019-05-18 Thread Tim Allison
Any fellow devs willing to vote? We have 2 votes so far. I should have time on Monday to run the release if the vote passes. Cheers, Tim On Tue, May 14, 2019 at 10:15 PM Tim Allison wrote: > A candidate for the Tika 1.21 release is available at: > > https://dist.apache.org/repos

[RESULT][VOTE] Release Apache Tika 1.21 Candidate #2

2019-05-18 Thread Tim Allison
The vote has passed with +1 from: Sergey Beryozkin Oleg Tikhonov Tim Allison And no -1. I’ll make the release in the next few days. Thank you, all! Cheers, Tim On Sat, May 18, 2019 at 4:09 PM Sergey Beryozkin wrote: > +1 > > Thanks, Sergey > > On Sat, May 18, 2019 at 11:31

[ANNOUNCE] Apache Tika 1.21 released

2019-05-19 Thread Tim Allison
mirror site, please remember to verify the downloads using signatures found: https://www.apache.org/dist/tika/KEYS For more information on Apache Tika, visit the project home page: https://tika.apache.org/ -- Tim Allison, on behalf of the Apache Tika community

Re: TXTParser in Tika 1.21

2019-05-20 Thread Tim Allison
Y, that was by design. Not intended to surprise. We'll be out w 1.21.1 or 1.22 soon enough if that's a breaking change... :( On Mon, May 20, 2019 at 11:12 AM Sergey Beryozkin wrote: > > I don't really mind though :-) as it looks like both parsers can handle the > text content, the reason I had

Re: [jira] [Commented] (TIKA-2878) Update dependencies for 1.21.1 or 1.22

2019-05-20 Thread Tim Allison
Y. Fixing now. Once I get a clean local build, I'll commit... On Mon, May 20, 2019 at 2:18 PM Oleg Tikhonov wrote: > > Today I've also used a master branch and got the same result. > > > On Mon, May 20, 2019 at 8:59 PM Tim Allison (JIRA) wrote: > > > >

1.22?

2019-06-12 Thread Tim Allison
All, Given our dependency, um, issues, any objections to a 1.22 in a few weeks? Any blockers/must haves? Best, Tim

Re: Release 2.0.16 ?

2019-06-13 Thread Tim Allison
Cheers, Tim On Thu, Jun 13, 2019 at 12:54 AM Andreas Lehmkuehler wrote: > > Am 12.06.19 um 21:08 schrieb Tilman Hausherr: > > Am 12.06.2019 um 03:56 schrieb Tim Allison: > >> Reports are available here for 2.0.16-SNAPSHOT: > >> >

Re: Detection of plain text files

2019-06-25 Thread Tim Allison
Hi Ken, I'm sorry for my delay. I took a short chunk of Japanese and converted it to Shift_JIS. Your memory is largely correct (or we've changed the code base a bit). The TextDetector makes a decision in favor of {{text/plain}} vs {{application/octet}} via TextStatistics (https://github.com/

Tika 1.22?

2019-06-25 Thread Tim Allison
All, The vote for the next version of PDFBox is under way. I think we've had a number of useful upgrades since our last release. Any objections to starting the release process for Tika 1.22 a week or so after we integrate PDFBox? Cheers, Tim

Re: Detection of plain text files

2019-06-26 Thread Tim Allison
) a reasonable number of line ending > chars? > > — Ken > > > On Jun 25, 2019, at 6:56 AM, Tim Allison wrote: > > > > Hi Ken, > > I'm sorry for my delay. I took a short chunk of Japanese and > > converted it to Shift_JIS. > > > > Your

Re: Merge flow

2019-07-10 Thread Tim Allison
Y. Although sometimes I flip the order. :D If it matters or if I’m doing something wrong, let me know! On Wed, Jul 10, 2019 at 4:52 AM Sergey Beryozkin wrote: > Hi Tim > > What is the current process for merging the fixes ? The fix goes to the > master first and then it is cherry-picked into t

1.22?

2019-07-15 Thread Tim Allison
Anyone have anything they want to get into 1.22? If not, I’ll kick off the regression tests shortly. Cheers, Tim

Re: 1.22?

2019-07-18 Thread Tim Allison
Reports are here: http://162.242.228.174/reports/reports_tika_1.22-pre-rc1.zip I need to fix some RTF regressions... On Wed, Jul 17, 2019 at 3:26 AM Ken Krugler wrote: > > +1 > > — Ken > > > On Jul 15, 2019, at 2:37 PM, Tim Allison wrote: > > > > Anyone ha

Re: 1.22?

2019-07-18 Thread Tim Allison
With a commit I'm about to push, we're back on track and slightly better for RTF: http://162.242.228.174/reports/tika_1_22_rtf_reports.tgz On Thu, Jul 18, 2019 at 11:23 AM Tim Allison wrote: > > Reports are here: > > http://162.242.228.174/reports/reports_tika_1.22-pre-rc1

Re: 1.22?

2019-07-18 Thread Tim Allison
thing into 1.22. Cheers, Tim On Thu, Jul 18, 2019 at 2:33 PM Tim Allison wrote: > > With a commit I'm about to push, we're back on track and slightly > better for RTF: > > http://162.242.228.174/reports/tika_1_22_rtf_reports.tgz > > On Thu, Jul 18, 2019 at 11:23 A

Re: 1.22?

2019-07-23 Thread Tim Allison
I ran 1.21 and 1.22 against a random sample of 20k documents. The processing times are comparable. I'm working to push 1.22-rc1 soon. On Thu, Jul 18, 2019 at 3:56 PM Tim Allison wrote: > > All, > > I'm a bit concerned about the amount of time the regression runs took. >

[VOTE] Release Apache Tika 1.22 Candidate #1

2019-07-23 Thread Tim Allison
A candidate for the Tika 1.22 release is available at: https://dist.apache.org/repos/dist/dev/tika/ The release candidate is a zip archive of the sources in: https://github.com/apache/tika/tree/1.22-rc1/ The SHA-512 checksum of the archive is 7f33f4343b7520ddec5b3c5a6d5f8e076e748a76d3

Re: [VOTE] Release Apache Tika 1.22 Candidate #1

2019-07-23 Thread Tim Allison
Wrong release year in CHANGES.txt...ugh... I don't think _that's_ worth a respin... On Tue, Jul 23, 2019 at 1:54 PM Tim Allison wrote: > > A candidate for the Tika 1.22 release is available at: > > https://dist.apache.org/repos/dist/dev/tika/ > > > The releas

[CANCEL][VOTE] Release Apache Tika 1.22 Candidate #1

2019-07-23 Thread Tim Allison
-1 Need to cleanup some items. Will respin shortly. On Tue, Jul 23, 2019 at 3:28 PM Tim Allison wrote: > Wrong release year in CHANGES.txt...ugh... > > I don't think _that's_ worth a respin... > > > On Tue, Jul 23, 2019 at 1:54 PM Tim Allison wrote: > >

[VOTE] Release Apache Tika 1.22 Candidate #2

2019-07-24 Thread Tim Allison
A candidate for the Tika 1.22 release is available at: https://dist.apache.org/repos/dist/dev/tika/ The release candidate is a zip archive of the sources in: https://github.com/apache/tika/tree/1.22-rc2/ The SHA-512 checksum of the archive is 5551fd5fe4c890d34158b434d676aaff56e7b29dff

[CANCEL] [VOTE] Release Apache Tika 1.22 Candidate #2

2019-07-25 Thread Tim Allison
More work to do. Will try to roll #3 this afternoon or tomorrow morning. Sorry for the noise. On Wed, Jul 24, 2019 at 1:19 PM Tim Allison wrote: > > A candidate for the Tika 1.22 release is available at: > > https://dist.apache.org/repos/dist/dev/tika/ > > > The relea

Re: [CANCEL] [VOTE] Release Apache Tika 1.22 Candidate #2

2019-07-25 Thread Tim Allison
Sorry, here's my official -1 on rc#2. On Thu, Jul 25, 2019 at 7:07 AM Tim Allison wrote: > > More work to do. Will try to roll #3 this afternoon or tomorrow morning. > > Sorry for the noise. > > On Wed, Jul 24, 2019 at 1:19 PM Tim Allison wrote: > > > > A can

[VOTE] Release Apache Tika 1.22 Candidate #3

2019-07-26 Thread Tim Allison
A candidate for the Tika 1.22 release is available at: https://dist.apache.org/repos/dist/dev/tika/ The release candidate is a zip archive of the sources in: https://github.com/apache/tika/tree/1.22-rc3/ The SHA-512 checksum of the archive is a86964e06464c87a533dfed705b891bf3f519189be

[VOTE] Release Apache Tika 1.22 Candidate #4

2019-07-29 Thread Tim Allison
A candidate for the Tika 1.22 release is available at: https://dist.apache.org/repos/dist/dev/tika/ The release candidate is a zip archive of the sources in: https://github.com/apache/tika/tree/1.22-rc4/ The SHA-512 checksum of the archive is bbdf2683a63a0e5fbe66f10eb88c29cd14128c3dd8

Re: Windows Build

2019-07-31 Thread Tim Allison
Sorry for my sloth on that. Thank you! On Wed, Jul 31, 2019 at 1:41 AM David Meikle wrote: > Hello, > > I've changed the config of the Windows build in Jenkins to point to point > to the right path for Maven settings, so hopefully that is it back on track > again. > > Cheers, > Dave >
