RE: Tika 1.14?

2016-08-12 Thread Luís Filipe Nassif
I think waiting for PDFBox 2.0.3 would be great. It fixes some regressions. Regards, Luis On Aug 12, 2016 08:24, "Allison, Timothy B." wrote: > >> I know it's been a little bit since we talked about 2.0. We had > discussed holding off while some API changes

Re: Testing an ingest framework that uses Apache Tika

2017-02-16 Thread Luís Filipe Nassif
Excellent, Tim! Thank you for all your great work on Apache Tika! 2017-02-16 11:23 GMT-02:00 Konstantin Gribov : > Tim, > > it's an awesome feature for downstream projects' integration tests. Thanks > for implementing it! > > Thu, Feb 16, 2017 at 16:17, Allison, Timothy B.

Re: [COMPRESS] zip-bomb prevention for Z?

2017-04-13 Thread Luís Filipe Nassif
I have reported a similar issue to them, see COMPRESS-382. Maybe those issues should be handled on the Compress side, if I understood the API contract correctly. Luis On Apr 13, 2017 3:36 PM, "Allison, Timothy B." wrote: On TIKA-1631 [1], users have observed that a

Change Scope of Jai-ImageIO-Core dependency

2017-04-21 Thread Luís Filipe Nassif
Hi devs, Looks like jai-imageio-core from GitHub ( https://github.com/jai-imageio/jai-imageio-core), on which we depend with test scope, is Apache compatible. Note that it is a fork of the original Jai project referenced by PDFBox. The GitHub fork has extracted jpeg2000 and other problematic code

Re: 1.15?

2017-04-19 Thread Luís Filipe Nassif
+1 from me, there are so many fixes and improvements! Best, Luis On Apr 18, 2017 03:13, "Oleg Tikhonov" wrote: > +1 for the release. > > On Mon, Apr 17, 2017 at 8:39 PM, David Meikle wrote: > > > +1 from me too. > > > > Cheers, > > Dave > > > > On 13

Re: Tess4j API for TIKA OCR parser

2017-03-07 Thread Luís Filipe Nassif
Hi Thejan, Before the first version of TesseractOcrParser was committed I tried to use Tess4j; that was 4 years ago. Unfortunately, at that time I ran into some problems like permanent hangs with tesseract/Tess4j and, even worse, JVM crashes because of bugs in native code (pointers to crazy

Re: Improving Tika OCR

2017-04-17 Thread Luís Filipe Nassif
Hi Kranthi, That is an interesting comparison! But I think Tesseract 4.0 is still alpha? And do you know the VGG software license? Best, Luis On Apr 17, 2017 8:46 AM, "Kranthi Kiran G V" < kkran...@student.nitw.ac.in> wrote: Hello Tim Allison, I am currently working on improving

Re: [VOTE] Release Apache Tika 1.16 Candidate #1

2017-07-10 Thread Luís Filipe Nassif
Allison, Timothy B. [mailto:talli...@mitre.org] > Sent: Monday, July 10, 2017 10:26 AM > To: lfcnas...@gmail.com > Cc: dev@tika.apache.org > Subject: RE: [VOTE] Release Apache Tika 1.16 Candidate #1 > > Y. I need to fix that unit test. Thank you! > > https://issues.apache.org/jira/bro

Re: [VOTE] Release Apache Tika 1.16 Candidate #1

2017-07-10 Thread Luís Filipe Nassif
I got the following failure on Windows 7, jdk1.8.0_131, in OOXMLParserTest.testXLSBVarious:1537. Any ideas? Failed tests: OOXMLParserTest.testXLSBVarious:1537->TikaTest.assertContains:102 13.1211231321 not found in: http://www.w3.org/1999/xhtml;> mySheet1 String

Re: [VOTE] Release Apache Tika 1.16 Candidate #1

2017-07-10 Thread Luís Filipe Nassif
OK, that is a Locale issue, working around... 2017-07-10 10:24 GMT-03:00 Luís Filipe Nassif <lfcnas...@gmail.com>: > I got the following failure on Windows 7, jdk1.8.0_131, in > OOXMLParserTest.testXLSBVarious:1537. > Any ideas? > > Failed tests: > OOXMLParserT
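A locale can break a test like this in several ways. The sketch below shows only one common way, where the decimal separator changes with the locale, so a value formatted under a comma-decimal locale no longer contains the dot-separated string an assertion looks for. Whether this is what happened in testXLSBVarious is an assumption; the class name `LocaleFormatDemo` and the choice of pt-BR are illustrative and do not come from the thread.

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.util.Locale;

// Hypothetical demo class, not part of Tika.
final class LocaleFormatDemo {

    // Formats a double with up to 10 fraction digits using the
    // decimal separator of the given locale.
    static String format(double value, Locale locale) {
        DecimalFormat df =
                new DecimalFormat("0.##########", DecimalFormatSymbols.getInstance(locale));
        return df.format(value);
    }

    public static void main(String[] args) {
        double value = 13.1211231321;
        // Locale-neutral formatting keeps the '.' separator.
        System.out.println(format(value, Locale.ROOT));
        // A Brazilian Portuguese locale uses ',' instead.
        System.out.println(format(value, Locale.forLanguageTag("pt-BR")));
    }
}
```

A test that asserts on formatted numbers is usually more portable when the code under test, or the test itself, pins an explicit locale such as `Locale.ROOT` rather than relying on the platform default.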

Re: Tika 1.15.1?

2017-06-29 Thread Luís Filipe Nassif
d if we rename the class in Tika so that we > don't have a conflict over oat.parsers.SentimentParser (TIKA-2368). > > > > Cheers, > > > > Tim > > > > -Original Message- > > From: Tyler Bui-Pal

Re: Tika 1.15.1? -> 1.16

2017-07-05 Thread Luís Filipe Nassif
lsx, .xlsb (TIKA-2362) > > * Extract text from charts in .docx, .pptx, .xlsx and .xlsb > (TIKA-2254). > > * Extract text from diagrams in .docx, .pptx, .xlsx and .xlsb > (TIKA-1945). > > * Enable base32 encoding of digests and enable

Re: Tika 1.16?

2017-06-02 Thread Luís Filipe Nassif
Maybe 1.15.1? On Jun 1, 2017 10:03 AM, "Bob Paulin" wrote: > +1 > > > On 6/1/2017 6:50 AM, Allison, Timothy B. wrote: > > Given the broken OSGi and the org.json issues with 1.15, does it make > sense to aim for 1.16 fairly soon, say 3-4 weeks? > > > > Cheers, > > > >

Re: [ANNOUNCE] Apache Tika 1.15 released

2017-06-02 Thread Luís Filipe Nassif
Late to the party... Great work Tim! Thank you for all your huge work with Tika! On May 30, 2017 3:10 PM, "Tim Allison" wrote: > The Apache Tika project is pleased to announce the release of Apache Tika > 1.15. The release contents have been pushed out to the main

Re: [ANNOUNCE] Welcome Madhav Sharan as Tika Committer and PMC Member

2017-09-04 Thread Luís Filipe Nassif
Very welcome Madhav! Luis On Sep 2, 2017 2:45 AM, "Madhav Sharan" wrote: > Thanks a lot, everyone! So glad to be here. > > About me - I am a software engineer recently graduated from USC. I am > interested in understanding data corpuses and building > applications

Re: Tika 1.17?

2017-12-08 Thread Luís Filipe Nassif
calls to safelyAllocate, or > if the files are just plain corrupt. > > > > After I fix TIKA-2483, I think I’ll be good to roll rc1 for 1.17. > > > > Anything else holding us back? > > > > *From:* Luís Filipe Nassif [mailto:lfcnas...@gmail.com] > *Sent:* Thur

Fwd: Tika 1.17?

2017-12-07 Thread Luís Filipe Nassif
hank you! Do you mind sharing this with the list? *From:* Luís Filipe Nassif [mailto:lfcnas...@gmail.com] *Sent:* Thursday, December 7, 2017 10:26 AM *To:* Allison, Timothy B. <talli...@mitre.org> *Subject:* RE: Tika 1.17? Hi Tim, I don't think it is a blocker, maybe a minor regressi

Re: [VOTE] Release Apache Tika 1.17 Candidate #2

2017-12-11 Thread Luís Filipe Nassif
Built on Windows 10 Pro with jdk 1.8.0_152 x64, all tests passed. So +1 from me. PS: Running regression test on our 1M forensic test corpus... Luis 2017-12-08 22:43 GMT-02:00 Tim Allison : > > > On Friday, December 8, 2017, 7:43:05 PM EST, Tim Allison < >

Re: [VOTE] Release Apache Tika 1.17 Candidate #2

2017-12-12 Thread Luís Filipe Nassif
All seems ok after integrating in our system and testing with our limited regression corpus. Luis 2017-12-11 13:13 GMT-02:00 Luís Filipe Nassif <lfcnas...@gmail.com>: > Built on Windows 10 Pro with jdk 1.8.0_152 x64, all tests passed. So +1 > from me. > > PS: Running regress

Re: Tika 1.17?

2017-12-06 Thread Luís Filipe Nassif
Hi Tim, I've had a brief look at the exceptions folder; it seems we are much better with ppt (4677 fixed exceptions) and pdf (7798), but there are 208 new exceptions with ppt. I did not check the files to see if they are corrupted, but some common tokens were lost. Below the most common new

Re: Guidance to avoid Tika's integration with Solr's ExtractingRequestHandler in production

2018-05-29 Thread Luís Filipe Nassif
users' guide. >> >> On Tue, May 29, 2018 at 3:22 PM, Tim Allison wrote: >> >>> Y, my mods to the ForkParser should make it more robust, and will help >>> with OOMs, permanent hangs and native lib crashing. But those changes are >>> still i

Re: Guidance to avoid Tika's integration with Solr's ExtractingRequestHandler in production

2018-05-29 Thread Luís Filipe Nassif
Hi Ken, Threads will not help with OutOfMemoryErrors or crashes caused by native libs. ForkParser can help, after the refactoring started by Tim to handle some of its limitations. See TIKA-2653 2018-05-29 16:11 GMT-03:00 Ken Krugler : > Thanks for the ref, Tim. > > I’m curious why SolrCell

Re: Guidance to avoid Tika's integration with Solr's ExtractingRequestHandler in production

2018-05-29 Thread Luís Filipe Nassif
Related to this, do we have any guidance to help Java users choose between ForkParser and TikaServer? 2018-05-29 16:18 GMT-03:00 Luís Filipe Nassif : > Hi Ken, > > Threads will not help with OutOfMemoryErrors or crashes caused by native > libs. ForkParser can help, after the refacto

Re: Not-yet-broken breaking changes for Tika 2?

2018-02-05 Thread Luís Filipe Nassif
From a forensic use case it is better just saying we are trying another parser and not resetting the content handler, because the first parser can extract relevant content before the exception. To not spool everything to temp files to re-read the stream, I think we can create an optional

Re: Not-yet-broken breaking changes for Tika 2?

2018-02-07 Thread Luís Filipe Nassif
Mine too, but I know it is important for many use cases. Maybe adding to XHtmlContentHandler some tracking of open tags and a new method to close them? 2018-02-07 12:59 GMT-02:00 Allison, Timothy B. : > Do we worry about properly closing tags on an exception? > > >
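The idea floated here, tracking open tags and offering a method to close them, can be sketched with plain SAX classes. This is a minimal illustration under stated assumptions and is not Tika's XHTMLContentHandler or any existing Tika API: `TagTrackingHandler`, `closeOpenElements` and `demo` are hypothetical names, and only `startElement`, `endElement` and `characters` are forwarded to the wrapped handler.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical wrapper, not part of Tika: remembers which elements are
// still open so they can be closed after a parser fails midway.
final class TagTrackingHandler extends DefaultHandler {
    private final ContentHandler delegate;
    // Stack of {uri, localName, qName} for every element not yet closed.
    private final Deque<String[]> open = new ArrayDeque<>();

    TagTrackingHandler(ContentHandler delegate) {
        this.delegate = delegate;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        delegate.startElement(uri, localName, qName, atts);
        open.push(new String[] {uri, localName, qName});
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        delegate.endElement(uri, localName, qName);
        if (!open.isEmpty()) {
            open.pop();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        delegate.characters(ch, start, length);
    }

    // Emits endElement for every element still open, innermost first.
    void closeOpenElements() throws SAXException {
        while (!open.isEmpty()) {
            String[] e = open.pop();
            delegate.endElement(e[0], e[1], e[2]);
        }
    }

    // Simulates a parse that stops after some text, then closes what is open.
    static String demo() {
        StringBuilder log = new StringBuilder();
        DefaultHandler recorder = new DefaultHandler() {
            @Override
            public void startElement(String u, String l, String q, Attributes a) {
                log.append('<').append(q).append('>');
            }

            @Override
            public void endElement(String u, String l, String q) {
                log.append("</").append(q).append('>');
            }

            @Override
            public void characters(char[] ch, int s, int n) {
                log.append(ch, s, n);
            }
        };
        TagTrackingHandler h = new TagTrackingHandler(recorder);
        Attributes none = new AttributesImpl();
        try {
            h.startElement("", "html", "html", none);
            h.startElement("", "body", "body", none);
            h.startElement("", "p", "p", none);
            h.characters("partial".toCharArray(), 0, 7);
            // A parser exception would surface here, leaving three elements open.
            h.closeOpenElements();
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return log.toString();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

With a wrapper like this, content extracted before the failure stays in the output and the document remains well-formed, which matches the forensic use case raised earlier in the thread.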

Re: Tika 1.18?

2018-03-07 Thread Luís Filipe Nassif
with another existing mime glob. Any workaround for this specific case? If yes, I can open a different ticket. On Mar 2, 2018 18:23, "Nick Burch" <apa...@gagravarr.org> wrote: On Fri, 2 Mar 2018, Luís Filipe Nassif wrote: > If I make no progress on TIKA-1466 until

Re: Tika 1.18?

2018-03-01 Thread Luís Filipe Nassif
I think we should work around TIKA-2591, and I would like to work on TIKA-1466 (what do you think?) and fix TIKA-2568. Cheers, Luis

Re: 1.20?

2018-12-13 Thread Luís Filipe Nassif
Hi Tim, Reading your great reports, I also saw some new exceptions with RAR files in the likely broken folder, but it seems Tika was able to extract some text from them before. Do you know if those files are really broken and why Tika extracted text from them before? Thank you, Luis On Thu, Dec 13

Re: [ANNOUNCE] Welcome Tilman Hausherr as Tika PMC member and committer

2019-10-06 Thread Luís Filipe Nassif
Welcome, Tilman! On Fri, Oct 4, 2019 15:37, Tilman Hausherr wrote: > On 04.10.2019 at 16:19, Tim Allison wrote: > > All, > > > > The Tika PMC has elected to add Tilman Hausherr to our ranks. Tilman, > > please feel free to introduce yourself, and welcome aboard! > > > > Cheers, > > >

Re: How to configure Apache Tika in a kube environment to obtain maximum throughput when parsing a massive number of documents?

2020-11-25 Thread Luís Filipe Nassif
I've made a few improvements to ForkParser performance in an internal fork. Will try to contribute upstream... On Mon, Nov 23, 2020 12:05, Nicholas DiPiazza < nicholas.dipia...@gmail.com> wrote: > I am attempting to Tika parse dozens of millions of office documents. Pdfs, > docs,

Re: How to configure Apache Tika in a kube environment to obtain maximum throughput when parsing a massive number of documents?

2020-11-25 Thread Luís Filipe Nassif
Not what you asked but related :) Luis On Wed, Nov 25, 2020 23:20, Luís Filipe Nassif wrote: > I've made a few improvements to ForkParser performance in an internal > fork. Will try to contribute upstream... > > On Mon, Nov 23, 2020 12:05, Nicholas DiPiazza <

Re: [ANNOUNCE] Welcome Peter Lee as Tika PMC member and committer

2020-11-26 Thread Luís Filipe Nassif
Thank you, Peter, for all your contributions and welcome! On Wed, Nov 25, 2020 at 23:21, Chris Mattmann wrote: > Welcome Peter! > > > > > > > > *From: *Peter Lee > *Reply-To: * > *Date: *Wednesday, November 25, 2020 at 6:08 PM > *To: *"dev@tika.apache.org" , "talli...@apache.org" <

Re: How to configure Apache Tika in a kube environment to obtain maximum throughput when parsing a massive number of documents?

2020-11-26 Thread Luís Filipe Nassif
>> ForkParser multi-thread able processing program that can gracefully handle >> the huge onslaught that is my use case. >> But at this point, I doubt I'll switch from Tika Server anyways because I >> invested some time creating a wrapper around it and it is performing very >

Re: new committer: Nicholas DiPiazza

2021-06-03 Thread Luís Filipe Nassif
Welcome on board, Nicholas. Great work! Best regards, Luis Filipe Nassif On Thu, Jun 3, 2021 at 16:00, Nicholas DiPiazza < nicholas.dipia...@gmail.com> wrote: > Hi Everyone! > > Happy to be one of the committers for Tika! > > My name is Nicholas DiPiazza - I reside in Madison,

Re: [DISCUSS] upgrading log4j to to log4j2 in Tika's 1.x branch

2021-12-15 Thread Luís Filipe Nassif
Great, thank you, Tim! On Wed, Dec 15, 2021 at 16:50, Tim Allison wrote: > I've merged Lewis's edits to the README and added the EOL. Let's do > what both Konstantin and Nick recommend: README, notifications to > user/dev lists x months out and include EOL in all release messages? >

Re: [DISCUSS] upgrading log4j to to log4j2 in Tika's 1.x branch

2021-12-14 Thread Luís Filipe Nassif
> > > > Sounds like 2 +1 to my -0. :D I'll start working on this now. > > > > > > On Mon, Dec 13, 2021 at 2:09 PM Nicholas DiPiazza > > > wrote: > > >> > > >> I prefer upgrade to log4j2 > > >> > > >> On Mon, Dec

Re: [DISCUSS] support for Java 8?

2022-03-25 Thread Luís Filipe Nassif
sensible -> sensitive On Fri, Mar 25, 2022 21:15, Luís Filipe Nassif wrote: > We are moving to java 11 because it's required by Lucene 9, that has some > features we are interested in. > > We use TIKA as a library, using ForkParser to protect against catastrophic

Re: [DISCUSS] support for Java 8?

2022-03-25 Thread Luís Filipe Nassif
We are moving to java 11 because it's required by Lucene 9, that has some features we are interested in. We use TIKA as a library, using ForkParser to protect against catastrophic errors. And we are receiving a lot of illegal reflective access warnings because of some Tika dependencies, although

Re: next releases -- 2.4.1 and 1.28.4

2022-06-12 Thread Luís Filipe Nassif
+1 from me On Tue, Jun 7, 2022 12:09, Tim Allison wrote: > All, > > Any objections to starting the release processes for 1.x and 2.x in the > next few days? Any blockers? Anything we should wait for? > > Thank you, all. > > Cheers, > > Tim >

Re: Re: [VOTE] Release Apache Tika 1.28.2 Candidate #2

2022-05-02 Thread Luís Filipe Nassif
Just got these build failures on Windows 10 JDK 11: [ERROR] Failures: [ERROR] TextAndCSVParserTest.testSubclassingMimeTypesRemain:217 expected:<...-vcalendar; charset=[ISO-8859-1]> but was:<...-vcalendar; charset=[windows-1252]> [ERROR] TXTParserTest.testSubclassingMimeTypesRemain:299

Re: Re: [VOTE] Release Apache Tika 1.28.2 Candidate #2

2022-05-02 Thread Luís Filipe Nassif
tion on line endings on Windows. > > On Mon, May 2, 2022 at 9:40 AM Luís Filipe Nassif > wrote: > > > > Just got these build failures on Windows 10 JDK 11: > > > > [ERROR] Failures: > > [ERROR] TextAndCSVParserTest.testSubclassingMimeTypesRemain:217 >

Re: Re: [VOTE] Release Apache Tika 1.28.2 Candidate #2

2022-05-02 Thread Luís Filipe Nassif
ose test files to \r\n. Can you > > > open the test files in a hex editor and see what the line endings look > > > like? > > > > > > We need to improve documentation on line endings on Windows. > > > > > > On Mon, May 2, 2022

Re: [VOTE] Release Apache Tika 2.4.0 Candidate #1

2022-05-02 Thread Luís Filipe Nassif
Hello, +1. Just basic stuff, built on Windows 10, Liberica JDK 11.0.13 x64. Thank you, Tim! On Sat, Apr 30, 2022 at 05:27, David Meikle wrote: > Hi > > On Fri, 29 Apr 2022 at 00:23, Tim Allison wrote: > >> >> The SHA-512 checksum of the archive is >> >>

Re: checkstyle failures

2023-08-13 Thread Luís Filipe Nassif
Not sure, but maybe we could relax the mandatory rules a bit? This would make contributions from external collaborators easier... Also for committers who do not contribute too often; at least this causes some difficulties for me too... On Sun, Aug 13, 2023 01:03, Tilman Hausherr wrote: >