[jira] [Commented] (TIKA-4009) GeoTopic Parser package changed incorrectly from o.a.t.parser.geo from o.a.t.parser.geo.topic

2023-04-03 Thread Chris Mattmann (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17708146#comment-17708146 ] Chris Mattmann commented on TIKA-4009: -- ugh, one more time, not `geo.topic`, instead `geo/topic

[jira] [Commented] (TIKA-4009) GeoTopic Parser package changed incorrectly from o.a.t.parser.geo from o.a.t.parser.geo.topic

2023-04-03 Thread Chris Mattmann (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17708144#comment-17708144 ] Chris Mattmann commented on TIKA-4009: -- Forgot the config file, fixed in main:   {noformat} (base

[jira] [Resolved] (TIKA-4009) GeoTopic Parser package changed incorrectly from o.a.t.parser.geo from o.a.t.parser.geo.topic

2023-04-03 Thread Chris Mattmann (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Mattmann resolved TIKA-4009. -- Resolution: Fixed Fixed:   {noformat} (base) mattmann@proscuitto:~/git/tika$ git commit -m

[jira] [Commented] (TIKA-4009) GeoTopic Parser package changed incorrectly from o.a.t.parser.geo from o.a.t.parser.geo.topic

2023-04-03 Thread Chris Mattmann (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17708070#comment-17708070 ] Chris Mattmann commented on TIKA-4009: -- OK, I have a patch and commit forthcoming but it's fixed

[jira] [Created] (TIKA-4009) GeoTopic Parser package changed incorrectly from o.a.t.parser.geo from o.a.t.parser.geo.topic

2023-04-03 Thread Chris Mattmann (Jira)
Chris Mattmann created TIKA-4009: Summary: GeoTopic Parser package changed incorrectly from o.a.t.parser.geo from o.a.t.parser.geo.topic Key: TIKA-4009 URL: https://issues.apache.org/jira/browse/TIKA-4009

[jira] [Assigned] (TIKA-4009) GeoTopic Parser package changed incorrectly from o.a.t.parser.geo from o.a.t.parser.geo.topic

2023-04-03 Thread Chris Mattmann (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Mattmann reassigned TIKA-4009: Assignee: Chris Mattmann > GeoTopic Parser package changed incorrectly f

[jira] [Updated] (TIKA-3439) Create new TensorFlow2 backed Tika NLP docker for SentimentAnalysis

2021-06-07 Thread Chris Mattmann (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Mattmann updated TIKA-3439: - Issue Type: New Feature (was: Bug) > Create new TensorFlow2 backed Tika NLP doc

[jira] [Assigned] (TIKA-3439) Create new TensorFlow2 backed Tika NLP docker for SentimentAnalysis

2021-06-07 Thread Chris Mattmann (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Mattmann reassigned TIKA-3439: Assignee: Chris Mattmann > Create new TensorFlow2 backed Tika NLP doc

[jira] [Created] (TIKA-3439) Create new TensorFlow2 backed Tika NLP docker for SentimentAnalysis

2021-06-07 Thread Chris Mattmann (Jira)
Chris Mattmann created TIKA-3439: Summary: Create new TensorFlow2 backed Tika NLP docker for SentimentAnalysis Key: TIKA-3439 URL: https://issues.apache.org/jira/browse/TIKA-3439 Project: Tika

Re: Question on custom tika-python configs for OMB PDF

2021-05-26 Thread Chris Mattmann
Hannah, I am pushing your question upstream to the dev@tika list. I think what you need is for them to look at your config file, which I’ve pasted again below, and see if it looks OK. Then in Tika Python you need to give it this config file before your server starts up or outside of

[jira] [Commented] (TIKA-94) Speech-to-text transcription

2021-05-03 Thread Chris Mattmann (Jira)
[ https://issues.apache.org/jira/browse/TIKA-94?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17338675#comment-17338675 ] Chris Mattmann commented on TIKA-94: [~lewismc] congratulations! What an accomplishment! > Spe

[jira] [Resolved] (TIKA-3329) RTG Translator with many-to-eng translation

2021-05-01 Thread Chris Mattmann (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Mattmann resolved TIKA-3329. -- Resolution: Fixed Merged into main! Thanks [~thammegowda]!   {noformat} (base) mattmann

[jira] [Updated] (TIKA-3329) RTG Translator with many-to-eng translation

2021-05-01 Thread Chris Mattmann (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Mattmann updated TIKA-3329: - Fix Version/s: 2.0.0 > RTG Translator with many-to-eng translat

[jira] [Updated] (TIKA-3329) RTG Translator with many-to-eng translation

2021-05-01 Thread Chris Mattmann (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Mattmann updated TIKA-3329: - Labels: memex (was: ) > RTG Translator with many-to-eng translat

[jira] [Assigned] (TIKA-3329) RTG Translator with many-to-eng translation

2021-05-01 Thread Chris Mattmann (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Mattmann reassigned TIKA-3329: Assignee: Chris Mattmann (was: Thamme Gowda) > RTG Translator with many-to-

Re: Python-tika: issues related to memory consumption

2021-03-15 Thread Chris Mattmann
Hi Manish, I think you should ask this one upstream on the Tika Dev lists. I’ve cc’ed them for you. From: manish mathur Date: Monday, March 15, 2021 at 4:41 AM To: Subject: Re: Python-tika: issues related to memory consumption Hi Chris, I am using python-tika library to

Re: Help in tika-python

2021-01-15 Thread Chris Mattmann
l.com" Subject: Help in tika-python Hello Chris Mattmann, I installed your library and it works perfectly. I wonder if it is possible to find the position (bounding boxes) of the texts and images in ppt files, and to discover which page of the slides the texts come from. Thanks Nilton

FW: [EXTERNAL] Tika - problem with Polish encoding

2020-12-16 Thread Chris Mattmann
Copying the Tika dev list where I think you will find the help you are looking for  From: Mariusz G Date: Wednesday, December 16, 2020 at 7:04 AM To: "Mattmann, Chris A (US 1740)" Subject: [EXTERNAL] Tika - problem with Polish encoding Hello Sir, I'm writing to you because I

Re: [ANNOUNCE] Welcome Peter Lee as Tika PMC member and committer

2020-11-25 Thread Chris Mattmann
Welcome Peter!  From: Peter Lee Reply-To: Date: Wednesday, November 25, 2020 at 6:08 PM To: "dev@tika.apache.org" , "talli...@apache.org" Cc: "u...@tika.apache.org" Subject: Re: [ANNOUNCE] Welcome Peter Lee as Tika PMC member and committer Many thanks to you, Tim. :) Hi,

Re: [EXTERNAL] Tika - Issues extracting Arabic script

2020-11-24 Thread Chris Mattmann
Christian thank you for reaching out. I am copying dev@tika.apache.org as I think your question is best directed there since tika python is downstream of the processing that happens there. Best of luck! Cheers Chris From: Christian Faggionato Date: Tuesday, November 24, 2020 at

Re: [EXTERNAL] I have some questions about tika-python

2020-08-29 Thread Chris Mattmann
Thanks for reaching out Aditya and for using Tika Python. This issue is best solved upstream in dev@tika.apache.org so I am copying that list and making it the reply to. The issue likely lies in the PDFBox algorithm. There are PDFBox folks on this list. They can help you. Hopefully there is a

Re: [EXTERNAL] Tika 2.0 modularization

2020-08-14 Thread Chris Mattmann
Haha  I’m down and supportive! Time’s TIME FOR 2.x  From: Tim Allison Reply-To: "dev@tika.apache.org" , "Allison, Tim (US 174B-Affiliate)" Date: Friday, August 14, 2020 at 6:06 AM To: "" Subject: [EXTERNAL] Tika 2.0 modularization All, I _think_ I might have some time to

[jira] [Commented] (TIKA-3119) General upgrades for 1.25

2020-06-19 Thread Chris Mattmann (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17140963#comment-17140963 ] Chris Mattmann commented on TIKA-3119: -- [~agibsonccc] can you help see above? > General upgra

Re: [EXTERNAL] renaming master?

2020-06-16 Thread Chris Mattmann
How about just development? We use that  on OODT … though we have a master too that  needs to get removed … From: Tim Allison Reply-To: "dev@tika.apache.org" , "Allison, Tim (US 1740-Affiliate)" Date: Tuesday, June 16, 2020 at 10:31 AM To: "" Subject: [EXTERNAL] renaming master?

[jira] [Commented] (TIKA-3093) Enable tika-server to forward parse results to another endpoint

2020-04-24 Thread Chris Mattmann (Jira)
[ https://issues.apache.org/jira/browse/TIKA-3093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17091708#comment-17091708 ] Chris Mattmann commented on TIKA-3093: -- yea we have lots of pipelines with OODT and Tika that does

Re: [EXTERNAL] Re: Issue with > 200% CPU after bulk usage

2020-04-16 Thread Chris Mattmann
Yes, some of us have been developing an Elastic scaling stack for Tika server… That does just that with AWS. Don’t have it ready to push upstream yet. Cheers, Chris From: Eric Pugh Reply-To: "dev@tika.apache.org" Date: Thursday, April 16, 2020 at 7:09 AM To: "dev@tika.apache.org"

[jira] [Commented] (TIKA-2368) Clean up SentimentParser dependencies

2020-04-06 Thread Chris Mattmann (Jira)
[ https://issues.apache.org/jira/browse/TIKA-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17076659#comment-17076659 ] Chris Mattmann commented on TIKA-2368: -- I have a TensorFlow version of Sentiment Analysis based

Re: [EXTERNAL] Re: JDK 12 build issues

2020-03-18 Thread Chris Mattmann
ated. Cheers, Oleg On Wed, Mar 18, 2020 at 4:35 PM Chris Mattmann wrote: So I was able to get past my issues with Tesseract by reinstalling the latest version with Brew. I have a new issue! I’ve tried in JDK12 and JDK13 to build tika-dl, but it keeps failing:

Re: [EXTERNAL] Re: JDK 12 build issues

2020-03-18 Thread Chris Mattmann
Date: Wednesday, March 18, 2020 at 2:35 AM To: "dev@tika.apache.org" Subject: [EXTERNAL] Re: JDK 12 build issues Haven’t tried...we should add java 12-14 to Jenkins. Wait, are we up to 18 yet... Will look into it... On Tue, Mar 17, 2020 at 10:07 PM Chris Mattmann wro

JDK 12 build issues

2020-03-17 Thread Chris Mattmann
Hey Tim et al., Do the tests fail for you with Java 12? [INFO] Running org.apache.tika.parser.pkg.GzipParserTest [INFO] Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.397 s - in org.apache.tika.parser.pkg.GzipParserTest [INFO] Running

Re: [EXTERNAL] question about Tika

2020-02-10 Thread Chris Mattmann
Thanks. Please make sure you address these questions to dev@tika.apache.org. From: Max Franklin Date: Monday, February 10, 2020 at 10:59 AM To: Chris Mattmann Subject: Re: [EXTERNAL] question about Tika Hi Chris, The Tika Server seems to work okay for me

FW: [EXTERNAL] question about Tika

2020-02-10 Thread Chris Mattmann
Max,  does Tika Server work OK for you? Is there a different behavior with Tika Python than simply posting the PDF to Tika server? Try first and then I am redirecting you to the Tika dev list for help. Thanks, Chris From: Max Franklin Date: Monday, February 10, 2020 at 9:37 AM

Re: [EXTERNAL] Regarding unicodeencode Error

2020-01-08 Thread Chris Mattmann
OK can you please post an issue http://issues.apache.org/jira/browse/TIKA and attach your document and specific error? Thanks! From: "Gowda,Sumanth" Date: Wednesday, January 8, 2020 at 9:36 PM To: Chris Mattmann Subject: RE: [EXTERNAL] Regarding unicodeencode Error T

Re: [EXTERNAL] Do we have a community supported approach for deploying Tika Server in production?

2020-01-08 Thread Chris Mattmann
0>> >>> > >>> > And a WIP progress PR is at https://github.com/apache/tika/pull/305 <https://github.com/apache/tika/pull/305> < https://github.com/apache/tika/pull/305 < https://github.com/apache/tika/pull/305>> >>> > &

Re: [EXTERNAL] Do we have a community supported approach for deploying Tika Server in production?

2019-12-04 Thread Chris Mattmann
Thanks for bringing this conversation up Eric. Historically if you look over the last 5 years, I think what you are asking below has sort of already become the de facto truth. Most people are in fact using Tika server, whether they are individual devs, govvies, commercial folk and the like.

Re: [EXTERNAL] Docker image along with 1.23?

2019-11-20 Thread Chris Mattmann
aking the existing Dockerfile that LogicalSpark has published. I don’t know how other projects at ASF handle the image publishing. On Nov 20, 2019, at 7:02 PM, Chris Mattmann wrote: Nick, TBH, I don’t get it. If we ship the “Dockerfile” we are simply shipping text file, code. Under a l

Re: [EXTERNAL] Re: Docker image along with 1.23?

2019-11-20 Thread Chris Mattmann
Nick, TBH, I don’t get it. If we ship the “Dockerfile” we are simply shipping text file, code. Under a license. If we create a “docker image” and then publish it to the ASF hub then I agree with you. My suggestion and my interpretation of Tim’s is to ship a standard “Dockerfile”. Do you

Re: [EXTERNAL] Tika 1.23?

2019-11-20 Thread Chris Mattmann
+1 ship it From: Tim Allison Reply-To: "dev@tika.apache.org" , "Allison, Timothy B (US 1760-Affiliate)" Date: Wednesday, November 20, 2019 at 9:07 AM To: "" Subject: [EXTERNAL] Tika 1.23? All, I've abandoned hope of getting the contenthandler factory configuration stuff into

Re: [EXTERNAL] How to set the page segmentation for TIKA python

2019-11-13 Thread Chris Mattmann
Hi Aswathi, Please check with dev@tika.apache.org. Cheers, Chris From: Aswathi Nambiar Date: Wednesday, November 13, 2019 at 7:39 AM To: "Mattmann, Chris A (US 1760)" Subject: [EXTERNAL] How to set the page segmentation for TIKA python Hi Chris, I am using Apache

Re: [EXTERNAL] Extracting font information from xml

2019-10-15 Thread Chris Mattmann
Hi Jay, yes, I believe so. Tika Python is just a thin client to Tika Server and it provides this functionality. CC’ing dev@tika From: Jay Chuk Date: Tuesday, October 15, 2019 at 3:47 PM To: "Mattmann, Chris A (US 1761)" Subject: [EXTERNAL] Extracting font information from xml Hi

Re: [EXTERNAL] Extracting font information from xml

2019-10-15 Thread Chris Mattmann
When you do a parse, do this: from tika import parser; parsed = parser.from_file('/path/to/file', xmlContent=True); xmlContent = parsed["content"]; print(xmlContent) G’luck! Cheers Chris From: Jay Chuk Date: Tuesday, October 15, 2019 at 3:54 PM To: Chris Mattmann Cc
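
A minimal sketch of what can be done with that XHTML once it comes back. The first function needs the tika package and a Java runtime. The second one assumes the markup wraps each page in a `<div class="page">` element, which is how Tika's PDF output usually looks but is not confirmed anywhere in this thread. The helper names `parse_to_xhtml` and `split_pages` and the `SAMPLE` string are hypothetical.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "{http://www.w3.org/1999/xhtml}"

# Hand-written stand-in for Tika output, used only for illustration.
SAMPLE = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    '<div class="page"><p>First page</p><p>more text</p></div>'
    '<div class="page"><p>Second page</p></div>'
    '</body></html>'
)


def parse_to_xhtml(path):
    """Ask Tika for XHTML instead of plain text (needs tika and Java)."""
    from tika import parser  # imported lazily so split_pages works without Tika
    parsed = parser.from_file(path, xmlContent=True)
    return parsed["content"]


def split_pages(xhtml):
    """Return the text of each <div class="page">, in document order.

    Assumption: pages are marked with class="page"; other formats may differ.
    """
    root = ET.fromstring(xhtml.strip())
    pages = []
    for div in root.iter(XHTML_NS + "div"):
        if div.get("class") == "page":
            # Join text nodes with spaces, then collapse runs of whitespace.
            pages.append(" ".join(" ".join(div.itertext()).split()))
    return pages
```

Calling `split_pages(parse_to_xhtml('/path/to/file'))` would chain the two steps. Font details are only available if they are present in the XHTML that Tika returns.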

Re: [EXTERNAL] Urgent!!! Tika-python

2019-08-19 Thread Chris Mattmann
I was able to compress the files into a single zip file and extract them. This worked, but the extracted text was saved in a single file; I need the text to be saved in individual files so I can use them as input to another program. Please, what is the best method to go about this? Thank

Re: [EXTERNAL] TIKA

2019-08-11 Thread Chris Mattmann
Victor, please send your email to dev@tika.apache.org, which I’ve CC’ed… From: Victor Olaiya Date: Tuesday, August 6, 2019 at 1:37 PM To: "Mattmann, Chris A (US 1761)" Subject: [EXTERNAL] TIKA Hello chris, I am building an information retrieval system and i need apache tika to

Re: [EXTERNAL] Re: Merge flow

2019-07-10 Thread Chris Mattmann
I’ve also got some new stuff I’m getting ready to contribute, in the following ML/Deep Learning areas: Some Basic models using Tensorflow stable 1.13 CIFAR-10 image classifier using a CNN ~86% accuracy – obviously different than Inception-v3/v4 and VGG-16 which we currently have available,

Re: [EXTERNAL] Re: Tika 1.22?

2019-06-25 Thread Chris Mattmann
Looks good… From: Oleg Tikhonov Reply-To: "dev@tika.apache.org" Date: Tuesday, June 25, 2019 at 7:57 AM To: "dev@tika.apache.org" Subject: [EXTERNAL] Re: Tika 1.22? Would be great!!! Cheers, Oleg On Tue, Jun 25, 2019, 17:45 Tim Allison wrote: All, The vote for the

Re: [EXTERNAL] Re: DL4JVGG16NetTest failures

2019-05-08 Thread Chris Mattmann
ling to confirm that my commit/fix is sane, I'd appreciate it. Thank you!!! Cheers, Tim On Wed, May 8, 2019 at 11:32 AM Chris Mattmann wrote: Thejan, Thamme any ideas? From: Tim Allison Reply-

Re: [EXTERNAL] Re: DL4JVGG16NetTest failures

2019-05-08 Thread Chris Mattmann
On Wed, May 8, 2019 at 11:32 AM Chris Mattmann wrote: Thejan, Thamme any ideas? From: Tim Allison Reply-To: "dev@tika.apache.org" Date: Wednesday, May 8, 2019 at 7:50 AM To: "dev@tika.apache.org" Subject: [EXTERNAL] Re: DL4JVGG16NetTes

Re: [EXTERNAL] DL4JVGG16NetTest failures

2019-05-08 Thread Chris Mattmann
I will test this out From: Tim Allison Reply-To: "dev@tika.apache.org" Date: Wednesday, May 8, 2019 at 6:58 AM To: "dev@tika.apache.org" Subject: [EXTERNAL] DL4JVGG16NetTest failures All, Apologies for the broken builds...I'm not able to reproduce this test failure on my mac

Re: [EXTERNAL] Re: DL4JVGG16NetTest failures

2019-05-08 Thread Chris Mattmann
Thejan, Thamme any ideas? From: Tim Allison Reply-To: "dev@tika.apache.org" Date: Wednesday, May 8, 2019 at 7:50 AM To: "dev@tika.apache.org" Subject: [EXTERNAL] Re: DL4JVGG16NetTest failures Any recommendations? java.lang.IllegalStateException: Number of indices (got 2) must

Re: [EXTERNAL] Tika script

2019-04-26 Thread Chris Mattmann
Hi, This would be a good question to ask on the dev@tika.a.o list so I’m CC’ing them. Cheers, Chris From: Djari Imene Date: Friday, April 26, 2019 at 9:45 AM To: "Mattmann, Chris A (1761)" Subject: [EXTERNAL] Tika script Good evening sir I am writing you to request more

Re: [EXTERNAL] Wiki migration

2019-03-21 Thread Chris Mattmann
+1 from me! From: Konstantin Gribov Reply-To: "dev@tika.apache.org" Date: Thursday, March 21, 2019 at 10:02 AM To: "dev@tika.apache.org" Subject: [EXTERNAL] Wiki migration Hi, folks What do you think about starting wiki migration (from moin to confluence)? I can try it via

Re: 1.20?

2018-12-13 Thread Chris Mattmann
Roll forward! Yay! From: Tim Allison Reply-To: "dev@tika.apache.org" Date: Thursday, December 13, 2018 at 7:02 AM To: "dev@tika.apache.org" Subject: Re: 1.20? Reports are here: http://162.242.228.174/reports/tika_1_20-pre-rc1.zip I'm going to revert the mp4 parser, and

Re: 1.20?

2018-11-20 Thread Chris Mattmann
Love it and I can align tika-python with that too ☺ From: Tim Allison Reply-To: "dev@tika.apache.org" Date: Tuesday, November 20, 2018 at 3:04 PM To: "dev@tika.apache.org" Subject: 1.20? All, POI 4.0.1 will be out shortly with some important bug fixes. What would you all

Re: ***UNCHECKED*** Fwd: MODERATE for annou...@apache.org

2018-09-26 Thread Chris Mattmann
+1 from me please update the wiki once you do From: Tim Allison Reply-To: "dev@tika.apache.org" Date: Wednesday, September 26, 2018 at 5:47 AM To: "dev@tika.apache.org" Cc: Craig Russell Subject: Re: ***UNCHECKED*** Fwd: MODERATE for annou...@apache.org All, It is ok to

Re: 1.19.1?

2018-09-25 Thread Chris Mattmann
Sounds great! From: Tim Allison Reply-To: "dev@tika.apache.org" Date: Tuesday, September 25, 2018 at 9:40 AM To: "dev@tika.apache.org" Subject: Re: 1.19.1? Given the mp3 issue and some other items, let's go with 1.19.1 rc1 today or tomorrow? On Mon, Sep 24, 2018 at 3:07 PM Nick

Re: 1.19.1?

2018-09-21 Thread Chris Mattmann
Let’s roll it…. From: Tim Allison Reply-To: "dev@tika.apache.org" Date: Wednesday, September 19, 2018 at 12:14 PM To: "dev@tika.apache.org" Subject: 1.19.1? The mp3 regression is bad. In hindsight, the Tika-eval reports were fairly clear on this but I did some self-hand-waving to

FW: Tika DjVu?

2018-08-01 Thread Chris Mattmann
From: KamilD Date: Tuesday, July 31, 2018 at 11:37 PM To: "dev-ow...@tika.apache.org" Subject: Tika DjVu? Hello, I'm trying to use Tika for DjVu but there is a problem. When using app version 1.14 I get an empty result, but in version 1.18 I get: C:\Users\>java -jar

Re: image recognition...how do the parts play together?

2018-07-06 Thread Chris Mattmann
ach is REST + Docker? The upkeep in tika-dl is nontrivial. On Fri, Jul 6, 2018 at 6:15 PM Chris Mattmann wrote: Tim, Thanks. There are multiple modes of integrating deep learning with Tika: The original mode: uses Thamme’s work on REST exposing Tensorflow and Docker to pr

Re: image recognition...how do the parts play together?

2018-07-06 Thread Chris Mattmann
Tim, Thanks. There are multiple modes of integrating deep learning with Tika: The original mode: uses Thamme’s work on REST exposing Tensorflow and Docker to provide a REST Service to Tika to allow for running Tensorflow DL models. We initially did Inception_v3, and a model by Madhav Sharan

Re: Tika 1.19?

2018-07-06 Thread Chris Mattmann
Once tika-dl works again with Inception v4, I’m good ☺ I’m working on adding some more models to tika-dl and other things but those can come after 1.19. Cheers, Chris From: Tim Allison Reply-To: "dev@tika.apache.org" Date: Friday, July 6, 2018 at 8:40 AM To:

Re: Branch_1x build broke?

2018-05-24 Thread Chris Mattmann
ctly on my Windows and Linux setups. Cheers, Dave On Thu, 24 May 2018, 17:09 Chris Mattmann, <mattm...@apache.org> wrote: Tim, Are you seeing this? Results : Failed tests: PDFParserTest.testEmbeddedDocsWithOCROnly:1250->TikaTest.assertConta

Branch_1x build broke?

2018-05-24 Thread Chris Mattmann
Tim, Are you seeing this? Results : Failed tests: PDFParserTest.testEmbeddedDocsWithOCROnly:1250->TikaTest.assertContains:103 pdf_haystack not found in: http://www.w3.org/1999/xhtml;>

Welcome Thejan Wijesinghe as an Apache Tika PMC and committer!

2018-05-07 Thread Chris Mattmann
Welcome to Thejan Wijesinghe who has joined as a new Tika PMC member and committer! Please say a bit about yourself…thanks! Cheers, Chris

Re: rfc822 updates and 1.18

2018-04-06 Thread Chris Mattmann
Awesomeness From: "Allison, Timothy B." Reply-To: "dev@tika.apache.org" Date: Friday, April 6, 2018 at 11:30 AM To: "dev@tika.apache.org" Subject: rfc822 updates and 1.18 All, I made two updates to our handling of

Re: message/news; charset=windows-1252 -> message/rfc822

2018-03-28 Thread Chris Mattmann
+1 From: Nick Burch Reply-To: "dev@tika.apache.org" Date: Wednesday, March 28, 2018 at 8:01 AM To: "dev@tika.apache.org" Subject: Re: message/news; charset=windows-1252 -> message/rfc822 On Wed, 28 Mar 2018, Allison,

R-Tika API Binding

2018-03-20 Thread Chris Mattmann
Hey Folks, Just found this R-Tika API binding: https://ropensci.github.io/rtika/articles/rtika_introduction.html Very cool! Updated the wiki with it. Cheers, Chris

Re: TIKA-1509 (2.x breaking parser change) - ready for first review!

2018-03-18 Thread Chris Mattmann
Completely agree, awesome job Nick. I will definitely try this week as well. Thank you! Sincerely, Chris On 3/18/18, 2:47 PM, "David Meikle" wrote: Nice one Nick! Will take a look this week. Cheers, Dave On 14 March 2018 at 17:38, Nick Burch

Re: Tika 1.18?

2018-03-07 Thread Chris Mattmann
Sounds good to me thanks Tim. Happy to line it up with PDF Box 2.0.9 On 3/7/18, 1:16 PM, "Allison, Timothy B." wrote: All, I think I've made the updates that I wanted to make sure got in to 1.18. It looks like PDFBox is going to start their release cycle

Re: Tika 1.18?

2018-03-01 Thread Chris Mattmann
Same: makes perfect sense to me and let's do it. I just updated (finally) Tika Python downstream to be based on Tika 1.16; I guess I should get it based on 1.17 soon too: https://github.com/chrismattmann/tika-python/blob/master/tika/__init__.py#L17 Cheers, Chris On 3/1/18, 5:16 AM,

Re: RE : Re: Issue with apache Tika

2018-02-24 Thread Chris Mattmann
No clue - Radhia - perhaps you can enlighten everyone..? On 2/23/18, 6:45 AM, "Allison, Timothy B." <talli...@mitre.org> wrote: Um, no, that's not great. What's wrong with our current version?  -Original Message- From: Chris Mattmann [mailto:mat

Re: RE : Re: Issue with apache Tika

2018-02-22 Thread Chris Mattmann
Great to hear! From: radhia bezzine <bezzinerad...@gmail.com> Date: Thursday, February 22, 2018 at 12:28 PM To: Chris Mattmann <mattm...@apache.org> Subject: Re: RE : Re: Issue with apache Tika Hi Chris ! I fixed the issue ! it was not so complicated ! a proble

Re: Issue with apache Tika

2018-02-22 Thread Chris Mattmann
Try UTF-8 encoding the URLs or the parameters themselves. If you are using Tika-Python, then use the Python encode library… Cheers, Chris From: radhia bezzine Date: Thursday, February 22, 2018 at 6:03 AM To: "Mattmann, Chris A (1761)"
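
A minimal sketch of that advice using only the Python standard library. `urllib.parse.quote` and `urllib.parse.urlencode` percent-encode text as UTF-8 by default. The host, port, path, and parameter name below are made up for illustration, and the thread does not say whether the original problem was in the URL or in a parameter.

```python
from urllib.parse import quote, urlencode


def encode_path(path):
    """Percent-encode a URL path as UTF-8, leaving '/' separators intact."""
    return quote(path, safe="/")


def encode_params(params):
    """Build a query string; non-ASCII values are UTF-8 percent-encoded."""
    return urlencode(params, encoding="utf-8")


# Hypothetical document URL with a non-ASCII file name.
url = "http://localhost:8000" + encode_path("/docs/r\u00e9sum\u00e9.pdf")
query = encode_params({"title": "caf\u00e9"})
```

Encoding the path and the parameters separately keeps the `/`, `?`, and `=` delimiters from being escaped by accident.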

Re: Requesting Tika Wiki Page Edit Access

2018-02-17 Thread Chris Mattmann
Added! https://wiki.apache.org/tika/ContributorsGroup Feel free to edit the page From: Prerana Teligi Harapanahalli Math Date: Thursday, February 15, 2018 at 8:35 PM To: "dev@tika.apache.org" , "Mattmann, Chris A (1761)"

Re: Not-yet-broken breaking changes for Tika 2?

2018-02-07 Thread Chris Mattmann
eate an optional setinputstreamfactory() method in TikaInputStream, so the user can implement an InputStreamFactory interface with a getInputStream method, if he does not want to pay a performance hit with temp files for everything. Luis Em 5 de fev de 2018 4:52 PM, "C

Re: Not-yet-broken breaking changes for Tika 2?

2018-02-05 Thread Chris Mattmann
induce overhead, but as a start, why not? In short just run through the stream 2x ++++++ Chris Mattmann, Ph.D. Associate Chief Technology and Innovation Officer, OCIO Manager, Advanced IT Research and Open

Re: Not-yet-broken breaking changes for Tika 2?

2018-02-05 Thread Chris Mattmann
2 Jan 2018, Nick Burch wrote: > On Thu, 26 Oct 2017, Chris Mattmann wrote: >> On collision, the precedence order defines what key takes precedence and >> _overwrites_ the other. Overwrite is but one option (you could save *all* >> the values it’s a multi-val
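
The two collision policies mentioned in that quote can be sketched as follows. This is plain Python over dictionaries, not Tika's or OODT's actual API; the function names and the source labels (`xmp`, `pdf`) are hypothetical.

```python
def merge_overwrite(sources, precedence):
    """Merge metadata dicts; on a key collision, the source listed
    earlier in `precedence` wins and overwrites the others."""
    merged = {}
    # Apply the lowest-precedence source first so later updates overwrite it.
    for name in reversed(precedence):
        merged.update(sources.get(name, {}))
    return merged


def merge_keep_all(sources, precedence):
    """Alternative policy: keep every distinct value per key,
    ordered from highest to lowest precedence."""
    merged = {}
    for name in precedence:
        for key, value in sources.get(name, {}).items():
            values = merged.setdefault(key, [])
            if value not in values:
                values.append(value)
    return merged


# Hypothetical metadata from two extractors that disagree on the title.
sources = {
    "pdf": {"title": "Report", "author": "A"},
    "xmp": {"title": "Draft"},
}
precedence = ["xmp", "pdf"]
```

With this input, the overwrite policy keeps only the `xmp` title, while the keep-all policy records both titles with the preferred one first.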

Re: relying on a non-Maven central repo?

2018-02-05 Thread Chris Mattmann
to OSSRH and synced On 2/5/18, 9:01 AM, "Chris Mattmann" <mattm...@apache.org> wrote: Hmmm...the problem here is that Sonatype won't let us publish to Central with the below. It's not even an ASF policy thing - it's a Sonatype thing On 2/5/18, 5:55 AM, &qu

Re: relying on a non-Maven central repo?

2018-02-05 Thread Chris Mattmann
Hmmm...the problem here is that Sonatype won't let us publish to Central with the below. It's not even an ASF policy thing - it's a Sonatype thing On 2/5/18, 5:55 AM, "Allison, Timothy B." wrote: Sorry for the duplication, but I wanted to check on this and didn't want

Re: 1.17 rc1 and two repos in nexus?!

2017-12-08 Thread Chris Mattmann
c1 and two repos in nexus?! Do we expect only the src to be in nexus, not the jar artifacts (with sigs and digests) for app, server, eval? -Original Message- From: Chris Mattmann [mailto:mattm...@apache.org] Sent: Friday, December 8, 2017 5:07 PM To: dev

Re: 1.17 rc1 and two repos in nexus?!

2017-12-08 Thread Chris Mattmann
Hey Tim, probably just upload errors on the first one and so it tried again. No worries. Drop and close the first, and just use the 2nd. Cheers, Chris On 12/8/17, 12:05 PM, "Allison, Timothy B." wrote: Not sure what happened, but two repos were created in Nexus:

Re: Tika 1.17?

2017-11-29 Thread Chris Mattmann
eers, > Dave > > > > On 3 November 2017 at 15:19, Mattmann, Chris A (3010) < > chris.a.mattm...@jpl.nasa.gov> wrote: > > > Let’s make it so ( > > > > > +

Re: Not-yet-broken breaking changes for Tika 2?

2017-10-26 Thread Chris Mattmann
On Thu, 26 Oct 2017, Chris Mattmann wrote: > My general approach to conflicting metadata is simply to define > precedence orders. > > For example here is one documented from OODT: > > https://cwiki.apache.org/confluence/display/OODT/Understa

Re: Not-yet-broken breaking changes for Tika 2?

2017-10-26 Thread Chris Mattmann
maybe in tika-config.xml would be a fine start. On 10/26/17, 9:14 AM, "Nick Burch" <apa...@gagravarr.org> wrote: On Thu, 26 Oct 2017, Chris Mattmann wrote: > Why don’t we just store N copies of the stream, and parse it twice? I'm not sure that's the chal

Re: Not-yet-broken breaking changes for Tika 2?

2017-10-26 Thread Chris Mattmann
Why don’t we just store N copies of the stream, and parse it twice? Of course that’s the ugly way, but currently the way I’ve hacked this in all of my projects is simply to call Tika N times OUTSIDE of Tika. Why don’t we just use that as the weakest baseline and work backwards from there? Chris
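
A rough sketch of that baseline, in Python rather than Tika's Java: read the input once, then give each parser its own fresh copy of the bytes. The function name `parse_n_times` and the two toy parsers are hypothetical, and nothing here reflects how Tika's own stream handling works.

```python
import io


def parse_n_times(stream, parsers):
    """Read the stream once into memory, then hand each parser a fresh,
    rewindable copy. Memory cost grows with the size of the input."""
    data = stream.read()
    return [parse(io.BytesIO(data)) for parse in parsers]


def count_bytes(stream):
    """Toy parser: consume the whole stream and report its length."""
    return len(stream.read())


def first_line(stream):
    """Toy parser: read only the first line."""
    return stream.readline().rstrip(b"\n")


results = parse_n_times(io.BytesIO(b"hello\nworld"), [count_bytes, first_line])
```

Each parser consumes its own copy, so one parser reading to the end does not starve the next. For large inputs a temporary file would be a safer buffer than memory.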

Re: [DISCUSS] Enable specific ContentHandler for tika-server

2017-10-24 Thread Chris Mattmann
This makes sense to me, +1 Giuseppe! On 10/24/17, 6:12 PM, "Giuseppe Totaro" wrote: Hi folks, I am developing the proposed solutions within tika-server for enabling specific ContentHandlers. Basically, I am working to provide the ability of giving

Re: Announcing go-tika, a Go package for Tika

2017-10-06 Thread Chris Mattmann
I saw this, Tyler, and it’s awesome. I forked it already, though I’m not a Go programmer. Thank you for increasing the community here. CC’ing Jim Jag, who I know has done some Go programming. Jim, spread the word ;) Cheers, Chris On 10/6/17, 10:12 AM, "Tyler Bui-Palsulich"

Re: [DISCUSS] Enable specific ContentHandler for tika-server

2017-09-28 Thread Chris Mattmann
ssing TikaConfig is needed anyway, having a way to specify a handler there can be handy too... Cheers, Sergey On 28/09/17 22:17, Chris Mattmann wrote: > I am +1 for this. Option #2 sounds like a slick way to handle this for me that would > remain back compat

Re: [DISCUSS] Enable specific ContentHandler for tika-server

2017-09-28 Thread Chris Mattmann
I am +1 for this. Option #2 sounds like a slick way to handle this for me that would remain back compat with tika-python which is of strong interest to me. Cheers, Chris On 9/28/17, 1:35 PM, "Giuseppe Totaro" wrote: Hi folks, if I am not wrong, currently

Re: TikaIO concerns

2017-09-22 Thread Chris Mattmann
[dropping Beam on this] Tim, another thing is that you can finally download the TREC-DD Polar data either from the NSF Arctic Data Center (70GB zip), or from Amazon S3, as described here: http://github.com/chrismattmann/trec-dd-polar/ In case we want to use as part of our regression.

Re: TikaIO concerns

2017-09-21 Thread Chris Mattmann
Hi all, One other thing is that Tika extracts metadata, and language information in which order doesn’t matter (Keys can be out of order). Would this be useful? Cheers, Chris On 9/21/17, 2:10 PM, "Sergey Beryozkin" wrote: Hi Eugene Thank you, very

Re: Integrating Tika with Apache Beam

2017-09-21 Thread Chris Mattmann
te a new > instance of TikaIO pipeline, and point it to the new temp folder where a > new batch of files has been dropped to. > > Thanks, Sergey > On 11/09/17 22:41, Mattmann, Chris A (3010) wrote: >> Amazing work, thank you Sergey!! >> &

Re: Tika 2.0?

2017-09-12 Thread Chris Mattmann
ranch is so I defer to Tim on the risk of going with #1. - Bob On 9/11/2017 5:15 PM, Chris Mattmann wrote: > +1000 > > > > On 9/11/17, 12:03 PM, "Allison, Timothy B." <talli...@mitre.org> wrote: > > Y, wel

Re: Tika 2.0?

2017-09-11 Thread Chris Mattmann
+1000 On 9/11/17, 12:03 PM, "Allison, Timothy B." wrote: Y, well, I didn't say _which_ September... Given my limited availability to work on this in Sept and POI's decision to move to Java 1.8, I propose releasing Tika 1.17 after the release of POI 3.17 and

Re: [ANNOUNCE] Welcome Madhav Sharan as Tika Committer and PMC Member

2017-08-31 Thread Chris Mattmann
Welcome Madhav! Cheers, Chris On 8/31/17, 12:29 PM, "loo...@gmail.com on behalf of Dave Meikle" wrote: Hello Everyone, Please join me in welcoming Madhav Sharan as a PMC Member and Committer to the project!

Re: Query related to Apache Tika dependencies

2017-08-08 Thread Chris Mattmann
From: Deepanshu Bhardwaj Date: Tuesday, August 8, 2017 at 2:53 AM To: "dev-ow...@tika.apache.org" Subject: Query related to Apache Tika dependencies Hi Team, I need one help. I need to know the list of libraries

Re: [VOTE] Release Apache Tika 1.16 Candidate #1

2017-07-08 Thread Chris Mattmann
+1 from me SIGS and CHECKSUMS look good. Thanks Tim! Cheers, Chris LMC-053601:apache-tika-1.16-rc1 mattmann$ for type in "" \-app \-eval \-server; do $HOME/bin/stage_apache_rc tika$type 1.16 https://dist.apache.org/repos/dist/dev/tika/; done % Total% Received % Xferd Average Speed

Re: [tika] branch master updated: TIKA-1988 -- allow for errors downloading models

2017-07-07 Thread Chris Mattmann
thy B." <talli...@mitre.org> wrote: Thank you, Chris! Now, how do I bulk move open 1.16->1.17 on JIRA? -Original Message----- From: Chris Mattmann [mailto:mattm...@apache.org] Sent: Friday, July 7, 2017 11:39 AM To: dev@tika.apache.org

Re: [tika] branch master updated: TIKA-1988 -- allow for errors downloading models

2017-07-07 Thread Chris Mattmann
Sure On 7/7/17, 7:57 AM, "Allison, Timothy B." <talli...@mitre.org> wrote: I'll leave the moving to a new module to you? -Original Message- From: Chris Mattmann [mailto:mattm...@apache.org] Sent: Friday, July 7, 2017 10:32 AM To: dev@tika.

Re: [tika] branch master updated: TIKA-1988 -- allow for errors downloading models

2017-07-07 Thread Chris Mattmann
Great Tim thanks! On 7/7/17, 7:28 AM, "talli...@apache.org" wrote: This is an automated email from the ASF dual-hosted git repository. tallison pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/tika.git The

Re: Tika 1.15.1? -> 1.16

2017-07-07 Thread Chris Mattmann
10) [mailto:chris.a.mattm...@jpl.nasa.gov] > Sent: Monday, July 3, 2017 2:24 PM > To: dev@tika.apache.org > Subject: Re: Tika 1.15.1? -> 1.16 > > Hey Tim, if I don’t get it done by today, push 1.16 and we’ll put Age > Detection in 1.17. > > +
