[jira] [Commented] (TIKA-4009) GeoTopic Parser package changed incorrectly from o.a.t.parser.geo from o.a.t.parser.geo.topic

2023-04-03 Thread Chris Mattmann (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17708146#comment-17708146
 ] 

Chris Mattmann commented on TIKA-4009:
--

ugh, one more time, not `geo.topic`, instead `geo/topic`

 
{noformat}
(base) mattmann@proscuitto:~/git/tika$ !2036
git commit -m "Forgot to fix the path for this config file: TIKA-4009: GeoTopic 
Parser package changed incorrectly from o.a.t.parser.geo from 
o.a.t.parser.geo.topic"
[main a7d34b100] Forgot to fix the path for this config file: TIKA-4009: 
GeoTopic Parser package changed incorrectly from o.a.t.parser.geo from 
o.a.t.parser.geo.topic
 1 file changed, 0 insertions(+), 0 deletions(-)
 rename tika-parsers/tika-parsers-ml/tika-parser-nlp-module/src/main/resources/org/apache/tika/parser/{geo.topic => geo/topic}/GeoTopicConfig.properties (100%)
(base) mattmann@proscuitto:~/git/tika$ git push -u origin main
Warning: the ECDSA host key for 'github.com' differs from the key for the IP 
address '192.30.255.113'
Offending key for IP in /home/mattmann/.ssh/known_hosts:8
Matching host key in /home/mattmann/.ssh/known_hosts:9
Are you sure you want to continue connecting (yes/no)? yes
Enter passphrase for key '/home/mattmann/.ssh/id_rsa': 
Enumerating objects: 24, done.
Counting objects: 100% (24/24), done.
Delta compression using up to 12 threads
Compressing objects: 100% (9/9), done.
Writing objects: 100% (13/13), 1.03 KiB | 1.03 MiB/s, done.
Total 13 (delta 3), reused 0 (delta 0)
remote: Resolving deltas: 100% (3/3), completed with 3 local objects.
remote: 
remote: GitHub found 3 vulnerabilities on apache/tika's default branch (1 
critical, 2 high). To find out more, visit:
remote:      https://github.com/apache/tika/security/dependabot
remote: 
To github.com:apache/tika.git
   0672949c4..a7d34b100  main -> main
Branch 'main' set up to track remote branch 'main' from 'origin'.
(base) mattmann@proscuitto:~/git/tika$ 
 {noformat}
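
Why the second rename was needed: Java resolves classpath resources for a package by replacing the dots in the package name with slashes, so a resource that belongs to `org.apache.tika.parser.geo.topic` has to live in the nested directory `geo/topic`, not in a single directory literally named `geo.topic`. A minimal Python sketch of that mapping, assuming the properties file is looked up relative to the parser's package (the helper name `package_to_resource_dir` is hypothetical and used only for illustration):

```python
def package_to_resource_dir(package: str) -> str:
    """Map a dotted Java package name to its directory on the classpath."""
    return package.replace(".", "/")


# Assumed Maven layout: resources live under src/main/resources.
resource_root = "src/main/resources"
package = "org.apache.tika.parser.geo.topic"

# Directory where a resource for this package is expected to sit.
expected = f"{resource_root}/{package_to_resource_dir(package)}/GeoTopicConfig.properties"
print(expected)
```

The resulting path ends in `parser/geo/topic/GeoTopicConfig.properties`, which matches the target of the rename in the commit above.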

> GeoTopic Parser package changed incorrectly from o.a.t.parser.geo from 
> o.a.t.parser.geo.topic
> -
>
> Key: TIKA-4009
> URL: https://issues.apache.org/jira/browse/TIKA-4009
> Project: Tika
>  Issue Type: Bug
>  Components: geo
>Reporter: Chris Mattmann
>Assignee: Chris Mattmann
>Priority: Major
> Fix For: 2.7.1
>
>
> The package for the GeoTopicParser was incorrectly changed. It should be 
> changed back.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-4009) GeoTopic Parser package changed incorrectly from o.a.t.parser.geo from o.a.t.parser.geo.topic

2023-04-03 Thread Chris Mattmann (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17708144#comment-17708144
 ] 

Chris Mattmann commented on TIKA-4009:
--

Forgot the config file; fixed in main:

 
{noformat}
(base) mattmann@proscuitto:~/git/tika$ git status
On branch main
Your branch is up to date with 'origin/main'.

Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
        renamed:    tika-parsers/tika-parsers-ml/tika-parser-nlp-module/src/main/resources/org/apache/tika/parser/geo/GeoTopicConfig.properties -> tika-parsers/tika-parsers-ml/tika-parser-nlp-module/src/main/resources/org/apache/tika/parser/geo.topic/GeoTopicConfig.properties

(base) mattmann@proscuitto:~/git/tika$ history | grep commit
 1071  git commit -m "Finished through chapter 6."
 1076  git commit -m "finished through chapter 6."
 1254  git commit -m "Fix test case files #385"
 1258  git commit -m "Fixes #385"
 1308  git commit -m "Prepare for Tika 1.24.2 release"
 1393  git commit -m "Upgrade to 1.25 and prep for release."
 1454  git commit -m "Bump to version 2.6.0 for current development."
 1569  git commit -m "Update MEMEX links."
 2035  history | grep commit
(base) mattmann@proscuitto:~/git/tika$ git commit -m "Forgot to fix the path 
for this config file: TIKA-4009: GeoTopic Parser package changed incorrectly 
from o.a.t.parser.geo from o.a.t.parser.geo.topic"
[main 0672949c4] Forgot to fix the path for this config file: TIKA-4009: 
GeoTopic Parser package changed incorrectly from o.a.t.parser.geo from 
o.a.t.parser.geo.topic
 1 file changed, 0 insertions(+), 0 deletions(-)
 rename tika-parsers/tika-parsers-ml/tika-parser-nlp-module/src/main/resources/org/apache/tika/parser/{geo => geo.topic}/GeoTopicConfig.properties (100%)
(base) mattmann@proscuitto:~/git/tika$ git push -u origin master
error: src refspec master does not match any
error: failed to push some refs to 'g...@github.com:apache/tika.git'
(base) mattmann@proscuitto:~/git/tika$ git push -u origin main
Warning: the ECDSA host key for 'github.com' differs from the key for the IP 
address '192.30.255.113'
Offending key for IP in /home/mattmann/.ssh/known_hosts:8
Matching host key in /home/mattmann/.ssh/known_hosts:9
Are you sure you want to continue connecting (yes/no)? yes
Enter passphrase for key '/home/mattmann/.ssh/id_rsa': 
Enumerating objects: 23, done.
Counting objects: 100% (23/23), done.
Delta compression using up to 12 threads
Compressing objects: 100% (9/9), done.
Writing objects: 100% (12/12), 923 bytes | 230.00 KiB/s, done.
Total 12 (delta 4), reused 0 (delta 0)
remote: Resolving deltas: 100% (4/4), completed with 4 local objects.
remote: 
remote: GitHub found 3 vulnerabilities on apache/tika's default branch (1 
critical, 2 high). To find out more, visit:
remote:      https://github.com/apache/tika/security/dependabot
remote: 
To github.com:apache/tika.git
   9313b0c31..0672949c4  main -> main
Branch 'main' set up to track remote branch 'main' from 'origin'.
(base) mattmann@proscuitto:~/git/tika$ 
 {noformat}



[jira] [Resolved] (TIKA-4009) GeoTopic Parser package changed incorrectly from o.a.t.parser.geo from o.a.t.parser.geo.topic

2023-04-03 Thread Chris Mattmann (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Mattmann resolved TIKA-4009.
--
Resolution: Fixed

Fixed:

 
{noformat}
(base) mattmann@proscuitto:~/git/tika$ git commit -m "Fix for TIKA-4009: 
GeoTopic Parser package changed incorrectly from o.a.t.parser.geo from 
o.a.t.parser.geo.topic"
[main 9313b0c31] Fix for TIKA-4009: GeoTopic Parser package changed incorrectly 
from o.a.t.parser.geo from o.a.t.parser.geo.topic
 9 files changed, 15 insertions(+), 15 deletions(-)
 rename tika-parsers/tika-parsers-ml/tika-parser-nlp-module/src/main/java/org/apache/tika/parser/geo/{ => topic}/GeoParser.java (97%)
 rename tika-parsers/tika-parsers-ml/tika-parser-nlp-module/src/main/java/org/apache/tika/parser/geo/{ => topic}/GeoParserConfig.java (98%)
 rename tika-parsers/tika-parsers-ml/tika-parser-nlp-module/src/main/java/org/apache/tika/parser/geo/{ => topic}/GeoTag.java (95%)
 rename tika-parsers/tika-parsers-ml/tika-parser-nlp-module/src/main/java/org/apache/tika/parser/geo/{ => topic}/NameEntityExtractor.java (98%)
 rename tika-parsers/tika-parsers-ml/tika-parser-nlp-module/src/main/java/org/apache/tika/parser/geo/{ => topic}/gazetteer/GeoGazetteerClient.java (97%)
 rename tika-parsers/tika-parsers-ml/tika-parser-nlp-module/src/main/java/org/apache/tika/parser/geo/{ => topic}/gazetteer/Location.java (97%)
 rename tika-parsers/tika-parsers-ml/tika-parser-nlp-module/src/test/java/org/apache/tika/parser/geo/{ => topic}/GeoParserTest.java (97%)
(base) mattmann@proscuitto:~/git/tika$ git push -u origin main
Warning: the ECDSA host key for 'github.com' differs from the key for the IP 
address '192.30.255.113'
Offending key for IP in /home/mattmann/.ssh/known_hosts:8
Matching host key in /home/mattmann/.ssh/known_hosts:9
Are you sure you want to continue connecting (yes/no)? yes
Enter passphrase for key '/home/mattmann/.ssh/id_rsa': 
Enumerating objects: 64, done.
Counting objects: 100% (64/64), done.
Delta compression using up to 12 threads
Compressing objects: 100% (26/26), done.
Writing objects: 100% (40/40), 11.29 KiB | 3.76 MiB/s, done.
Total 40 (delta 8), reused 0 (delta 0)
remote: Resolving deltas: 100% (8/8), completed with 6 local objects.
remote: 
remote: GitHub found 3 vulnerabilities on apache/tika's default branch (1 
critical, 2 high). To find out more, visit:
remote:      https://github.com/apache/tika/security/dependabot
remote: 
To github.com:apache/tika.git
   fde9a3b89..9313b0c31  main -> main
Branch 'main' set up to track remote branch 'main' from 'origin'.
(base) mattmann@proscuitto:~/git/tika$ 
 {noformat}



[jira] [Commented] (TIKA-4009) GeoTopic Parser package changed incorrectly from o.a.t.parser.geo from o.a.t.parser.geo.topic

2023-04-03 Thread Chris Mattmann (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17708070#comment-17708070
 ] 

Chris Mattmann commented on TIKA-4009:
--

OK, I have a patch and commit forthcoming, but it's fixed and works:

 
{noformat}
[INFO] Apache Tika fuzzing  SUCCESS [  1.076 s]
[INFO] Apache Tika examples ... SUCCESS [  8.119 s]
[INFO] Apache Tika Java-7 Components .. SUCCESS [  1.921 s]
[INFO] Apache Tika Detectors .. SUCCESS [  0.031 s]
[INFO] Apache Tika Siegfried wrapper .. SUCCESS [  1.802 s]
[INFO] Apache Tika  SUCCESS [  0.036 s]
[INFO] 
[INFO] BUILD SUCCESS
[INFO] 
[INFO] Total time:  11:55 min
[INFO] Finished at: 2023-04-03T10:27:58-07:00
[INFO] 
(base) mattmann@proscuitto:~/git/tika$ 
 {noformat}



[jira] [Created] (TIKA-4009) GeoTopic Parser package changed incorrectly from o.a.t.parser.geo from o.a.t.parser.geo.topic

2023-04-03 Thread Chris Mattmann (Jira)
Chris Mattmann created TIKA-4009:


 Summary: GeoTopic Parser package changed incorrectly from 
o.a.t.parser.geo from o.a.t.parser.geo.topic
 Key: TIKA-4009
 URL: https://issues.apache.org/jira/browse/TIKA-4009
 Project: Tika
  Issue Type: Bug
  Components: geo
Reporter: Chris Mattmann
 Fix For: 2.7.1


The package for the GeoTopicParser was incorrectly changed. It should be 
changed back.





[jira] [Assigned] (TIKA-4009) GeoTopic Parser package changed incorrectly from o.a.t.parser.geo from o.a.t.parser.geo.topic

2023-04-03 Thread Chris Mattmann (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Mattmann reassigned TIKA-4009:


Assignee: Chris Mattmann



[jira] [Updated] (TIKA-3439) Create new TensorFlow2 backed Tika NLP docker for SentimentAnalysis

2021-06-07 Thread Chris Mattmann (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Mattmann updated TIKA-3439:
-
Issue Type: New Feature  (was: Bug)

> Create new TensorFlow2 backed Tika NLP docker for SentimentAnalysis
> ---
>
> Key: TIKA-3439
> URL: https://issues.apache.org/jira/browse/TIKA-3439
> Project: Tika
>  Issue Type: New Feature
>    Reporter: Chris Mattmann
>Priority: Major
>
> Add new TensorFlow2 based docker that will contribute:
>  * Updated Sentiment Analysis parser and model
>  * Age Detector and classifier 
>  
> There is more coming from my book, [Machine Learning with TensorFlow 
> 2ed|https://www.manning.com/books/machine-learning-with-tensorflow-second-edition]
>  but this is the start.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (TIKA-3439) Create new TensorFlow2 backed Tika NLP docker for SentimentAnalysis

2021-06-07 Thread Chris Mattmann (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Mattmann reassigned TIKA-3439:


Assignee: Chris Mattmann

> Create new TensorFlow2 backed Tika NLP docker for SentimentAnalysis
> ---
>
> Key: TIKA-3439
> URL: https://issues.apache.org/jira/browse/TIKA-3439
> Project: Tika
>  Issue Type: New Feature
>    Reporter: Chris Mattmann
>    Assignee: Chris Mattmann
>Priority: Major
>
> Add new TensorFlow2 based docker that will contribute:
>  * Updated Sentiment Analysis parser and model
>  * Age Detector and classifier 
>  
> There is more coming from my book, [Machine Learning with TensorFlow 
> 2ed|https://www.manning.com/books/machine-learning-with-tensorflow-second-edition]
>  but this is the start.





[jira] [Created] (TIKA-3439) Create new TensorFlow2 backed Tika NLP docker for SentimentAnalysis

2021-06-07 Thread Chris Mattmann (Jira)
Chris Mattmann created TIKA-3439:


 Summary: Create new TensorFlow2 backed Tika NLP docker for 
SentimentAnalysis
 Key: TIKA-3439
 URL: https://issues.apache.org/jira/browse/TIKA-3439
 Project: Tika
  Issue Type: Bug
Reporter: Chris Mattmann


Add new TensorFlow2 based docker that will contribute:
 * Updated Sentiment Analysis parser and model
 * Age Detector and classifier 

 

There is more coming from my book, [Machine Learning with TensorFlow 
2ed|https://www.manning.com/books/machine-learning-with-tensorflow-second-edition]
 but this is the start.





Re: Question on custom tika-python configs for OMB PDF

2021-05-26 Thread Chris Mattmann
Hannah, I am pushing your question upstream to the dev@tika list. I think what 
you need is for them to look at your config file, which I’ve pasted again below, 
and see if it looks OK. Then in Tika Python you need to give it this config file 
before your server starts up, or, outside of Python, just start your server with 
this config file, and Tika Python will pick it up:

 













[Config XML pasted here; its markup did not survive in the archive, and only two `application/pdf` values remain.]









 

 

Cheers,

Chris

 

 

From: Hannah Eli 
Date: Wednesday, May 26, 2021 at 1:47 PM
To: "Mattmann, Chris A (US 1740)" 
Subject: [EXTERNAL] Question on custom tika-python configs for OMB PDF

 

Hi Chris,  

 

Hope you're well. I'm trying to use tika to parse the table of contents for the 
Office of Management and Budget's A-11 Circular PDF (I know it's short enough 
to parse manually, but we're building a repeatable extract). When I do so, the 
text is parsed out of order. I was trying to fix this by creating a custom 
config file with the sortbyPosition property (see attached), but I'm not an XML 
guru and don't believe it's working properly. I've also tried changing the 
Windows environment variables to point to this file. 

 

Any guidance would be much appreciated. 

 

Thank you!

Hannah

 

-- 

Hannah Eli 
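
For readers with the same question, here is a minimal sketch of what a Tika 1.x style config that turns on `sortByPosition` for the PDF parser might look like. It is not Hannah's attached file, whose markup is lost; the element layout follows the commonly documented `tika-config.xml` form and is an assumption, as is the helper name `build_pdf_config`. The sketch only builds and prints the XML with the Python standard library; it does not talk to a Tika server.

```python
import xml.etree.ElementTree as ET


def build_pdf_config(sort_by_position: bool = True) -> str:
    """Build a tika-config.xml string that sets sortByPosition on PDFParser.

    Assumed layout (Tika 1.x style): exclude PDFParser from the default
    parser, then declare it again with an explicit parameter.
    """
    properties = ET.Element("properties")
    parsers = ET.SubElement(properties, "parsers")

    # Default parser handles everything except PDFs.
    default = ET.SubElement(
        parsers, "parser", {"class": "org.apache.tika.parser.DefaultParser"}
    )
    ET.SubElement(
        default, "parser-exclude", {"class": "org.apache.tika.parser.pdf.PDFParser"}
    )

    # PDF parser declared separately so it can carry its own parameters.
    pdf = ET.SubElement(
        parsers, "parser", {"class": "org.apache.tika.parser.pdf.PDFParser"}
    )
    params = ET.SubElement(pdf, "params")
    param = ET.SubElement(params, "param", {"name": "sortByPosition", "type": "bool"})
    param.text = "true" if sort_by_position else "false"

    return ET.tostring(properties, encoding="unicode")


config_xml = build_pdf_config()
print(config_xml)
```

As Chris notes above, the file has to reach the Tika server when it starts, since Tika Python only talks to that server; how the path is handed over (a server start-up option or a tika-python argument) depends on the versions in use and is best confirmed on the dev list.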



[jira] [Commented] (TIKA-94) Speech-to-text transcription

2021-05-03 Thread Chris Mattmann (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-94?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17338675#comment-17338675
 ] 

Chris Mattmann commented on TIKA-94:


[~lewismc] congratulations! What an accomplishment!

> Speech-to-text transcription
> 
>
> Key: TIKA-94
> URL: https://issues.apache.org/jira/browse/TIKA-94
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Reporter: Jukka Zitting
>Assignee: Lewis John McGibbney
>Priority: Minor
>  Labels: new-parser
> Fix For: 1.27
>
>
> Like OCR for image files (TIKA-93), we could try using speech recognition to 
> extract text content (where available) from audio (and video!) files.
> The CMU Sphinx engine (http://cmusphinx.sourceforge.net/) looks promising and 
> comes with a friendly license.





[jira] [Resolved] (TIKA-3329) RTG Translator with many-to-eng translation

2021-05-01 Thread Chris Mattmann (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Mattmann resolved TIKA-3329.
--
Resolution: Fixed

Merged into main! Thanks [~thammegowda]!

 
{noformat}
(base) mattmann@proscuitto:~/git/tika$ git push -u origin main
Enter passphrase for key '/home/mattmann/.ssh/id_rsa': 
Enumerating objects: 126, done.
Counting objects: 100% (122/122), done.
Delta compression using up to 12 threads
Compressing objects: 100% (44/44), done.
Writing objects: 100% (88/88), 12.11 KiB | 1.21 MiB/s, done.
Total 88 (delta 20), reused 0 (delta 0)
remote: Resolving deltas: 100% (20/20), completed with 9 local objects.
To github.com:apache/tika.git
   7ea14beb1..d0e3230d4  main -> main
Branch 'main' set up to track remote branch 'main' from 'origin'.

 {noformat}

> RTG Translator with many-to-eng translation
> ---
>
> Key: TIKA-3329
> URL: https://issues.apache.org/jira/browse/TIKA-3329
> Project: Tika
>  Issue Type: Improvement
>  Components: translation
>Reporter: Thamme Gowda
>    Assignee: Chris Mattmann
>Priority: Major
>  Labels: memex
> Fix For: 2.0.0
>
>
> The existing translation services in tika-translate are either 
> commercial/paid engines (e.g. Google, Microsoft, etc.) or not state of the 
> art (such as Joshua, Moses, etc.). 
> Reader Translator Generator (RTG) is a neural machine translation toolkit 
> [https://isi-nlp.github.io/rtg/]
>  and has an implementation of the Transformer NMT model (current state of the 
> art). 
> It also has a massively multilingual pretrained NMT model (many-to-English 
> translation direction) 
> [https://hub.docker.com/repository/docker/tgowda/rtg-model] 
> in which about 500 source languages are represented, with at least ~300 source 
> languages having good enough quality (for comparison, Google Translate has 
> ~106 languages, and Microsoft has about 80 languages). 
> This issue is for integrating RTG Translator into tika-translate
>  





[jira] [Updated] (TIKA-3329) RTG Translator with many-to-eng translation

2021-05-01 Thread Chris Mattmann (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Mattmann updated TIKA-3329:
-
Fix Version/s: 2.0.0

> RTG Translator with many-to-eng translation
> ---
>
> Key: TIKA-3329
> URL: https://issues.apache.org/jira/browse/TIKA-3329
> Project: Tika
>  Issue Type: Improvement
>  Components: translation
>Reporter: Thamme Gowda
>    Assignee: Chris Mattmann
>Priority: Major
> Fix For: 2.0.0
>
>


[jira] [Updated] (TIKA-3329) RTG Translator with many-to-eng translation

2021-05-01 Thread Chris Mattmann (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Mattmann updated TIKA-3329:
-
Labels: memex  (was: )

> RTG Translator with many-to-eng translation
> ---
>
> Key: TIKA-3329
> URL: https://issues.apache.org/jira/browse/TIKA-3329
> Project: Tika
>  Issue Type: Improvement
>  Components: translation
>Reporter: Thamme Gowda
>    Assignee: Chris Mattmann
>Priority: Major
>  Labels: memex
> Fix For: 2.0.0
>
>


[jira] [Assigned] (TIKA-3329) RTG Translator with many-to-eng translation

2021-05-01 Thread Chris Mattmann (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Mattmann reassigned TIKA-3329:


Assignee: Chris Mattmann  (was: Thamme Gowda)

> RTG Translator with many-to-eng translation
> ---
>
> Key: TIKA-3329
> URL: https://issues.apache.org/jira/browse/TIKA-3329
> Project: Tika
>  Issue Type: Improvement
>  Components: translation
>Reporter: Thamme Gowda
>    Assignee: Chris Mattmann
>Priority: Major
>


Re: Python-tika: issues related to memory consumption

2021-03-15 Thread Chris Mattmann
Hi Manish, I think you should ask this one upstream on the Tika Dev lists. I’ve 
cc’ed them for you.

 

 

 

 

From: manish mathur 
Date: Monday, March 15, 2021 at 4:41 AM
To: 
Subject: Re: Python-tika: issues related to memory consumption

 

Hi Chris,

 

I am using the python-tika library to extract content from PDFs, but a lot of 
junk comes through because of tables, graphs, etc. Is there any way to ignore 
these while parsing a PDF to get the content?

 

Thanks in advance

 

Thanks 

Manish Mathur

 

 

 

 

On Mon, Feb 1, 2021 at 4:18 PM manish mathur  wrote:

Hi Chris,

 

I am using the python-tika library for reading PDF URLs, but memory 
consumption gradually increases a great deal. Is there any way to release the 
memory after reading one PDF URL? Please let me know.

 

Thanks in advance

 

Thanks 

Manish Mathur

 

 



Re: Help in tika-python

2021-01-15 Thread Chris Mattmann
Nilton thank you.

 

I’m copying dev@tika.apache.org as they can help you configure the Tika server
(which Tika Python relies on) that should deal with your use case.

 

Thanks,

Chris

 

 

 

From: Nilton Monteiro 
Date: Friday, January 15, 2021 at 10:12 AM
To: "chris.mattm...@gmail.com" 
Subject: Help in tika-python

 

Hello Chris Mattmann, 

I installed your library, and it works perfectly. I wonder if it is possible to find 
the position (bounding boxes) of the texts and images in PPT files.

And to discover which slide the texts come from.

Thanks

Nilton 

 



FW: [EXTERNAL] Tika - problem with Polish encoding

2020-12-16 Thread Chris Mattmann
Copying the Tika dev list where I think you will find the help you are looking 
for 

 

 

 

From: Mariusz G 
Date: Wednesday, December 16, 2020 at 7:04 AM
To: "Mattmann, Chris A (US 1740)" 
Subject: [EXTERNAL] Tika - problem with Polish encoding

 

Hello Sir, 

I'm writing to you because I have tried everything, without success.

When I use Tika with Polish PDF documents, Polish language is not encoded 
properly.

 

This is my code:

 

from tika import parser
raw = parser.from_file("/Users/mgrub/Downloads/NLP/PCC_Rokita_2019.pdf")
raw = str(raw)
safe_text = raw.encode('UTF-8', errors='ignore')
safe_text = str(safe_text).replace("\n", "").replace("\\", "")
print('--- safe text ---' )
print( safe_text )

 

I've tried several different encoding standards (ISO-8859, ISO-8859-2, 
Windows-1250, CP852) but with no success.

If you can help me I will be grateful, because I don't know who can help better 
than you.

 

Regards,

Mariusz Grubba
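
One likely cause, offered as an assumption rather than a confirmed diagnosis: in tika-python, `parser.from_file` returns a dict with `"metadata"` and `"content"` keys, and the extracted text is already a Python `str`. Calling `str()` on the whole dict, encoding it to bytes, and then calling `str()` on those bytes produces the bytes repr (`b"..."`), in which every non-ASCII character becomes a `\xNN` escape; stripping the backslashes then leaves fragments such as `xc5xbc` in place of `ż`. The sketch below uses a made-up stand-in dict instead of a live Tika server, and the function names `mangled_text` and `clean_text` are hypothetical.

```python
# Stand-in for the dict that tika-python's parser.from_file() returns;
# the sample text is invented for illustration.
parsed = {
    "metadata": {"Content-Type": "application/pdf"},
    "content": "\n\nZażółć gęślą jaźń\n",
}


def mangled_text(parsed: dict) -> str:
    """Reproduce the steps from the message above."""
    raw = str(parsed)                              # repr of the whole dict
    safe = raw.encode("UTF-8", errors="ignore")    # now a bytes object
    # str() on bytes yields "b'...'" with non-ASCII bytes as \xNN escapes.
    return str(safe).replace("\n", "").replace("\\", "")


def clean_text(parsed: dict) -> str:
    """Use the already-decoded text directly and only normalise whitespace."""
    content = parsed.get("content") or ""          # content can be None
    return " ".join(content.split())


broken = mangled_text(parsed)   # Polish letters replaced by escape fragments
fixed = clean_text(parsed)      # Polish letters preserved
```

With a running Tika server the same idea would be `clean_text(parser.from_file(path))`. If the letters are still wrong at that point, the problem lies in how the PDF encodes its fonts, which is a question for the dev list.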



Re: [ANNOUNCE] Welcome Peter Lee as Tika PMC member and committer

2020-11-25 Thread Chris Mattmann
Welcome Peter! 

 

 

 

From: Peter Lee 
Reply-To: 
Date: Wednesday, November 25, 2020 at 6:08 PM
To: "dev@tika.apache.org" , "talli...@apache.org" 

Cc: "u...@tika.apache.org" 
Subject: Re: [ANNOUNCE] Welcome Peter Lee as Tika PMC member and committer

 

Many thanks to you, Tim. :)

 

Hi, all

 

I'm Peter Lee and I was an Apache Commons committer. I'm familiar with many 
archivers and compressors. Feel free to ask me if you have some problems in 
compression.

 

I'm honored to be part of Tika. Tika is great and it helped me a lot. Besides, 
Tika is a great community and it has helped a lot of users. I hope I can help 
Tika a little bit.

 

Once again, thank you all for making such a great community!

 

cheers,

Lee

On 11 25 2020, at 9:27, Tim Allison  wrote:

All,

 

The Tika PMC has elected to add Peter Lee to our ranks.

 

Lee,

Please introduce yourself, and welcome aboard!

 

Cheers,

 

Tim



Re: [EXTERNAL] Tika - Issues extracting Arabic script

2020-11-24 Thread Chris Mattmann
Christian thank you for reaching out. I am copying dev@tika.apache.org as 
I think your question is best directed there since tika python is downstream 
of the processing that happens there.

 

Best of luck!

 

Cheers

Chris

 

 

From: Christian Faggionato 
Date: Tuesday, November 24, 2020 at 10:10 AM
To: "Mattmann, Chris A (US 1740)" 
Subject: [EXTERNAL] Tika - Issues extracting Arabic script

 

Dear Chris, 

I am Christian Faggionato, research fellow at the School of Oriental and 
African Studies, University of London. At the moment I’m working on building a 
corpus of Uyghur texts and some of the content is coming from pdf files. I 
wrote a short python script to scrape text from pdf using tika-python. The 
script is Arabic, and the output looks good but there is one major problem: 
there are many missing spaces between words and I really do not know how to 
address this issue. Do you have any suggestions in these regards? 

I am attaching a pdf file and the script I wrote in case you would like to 
check it. Thanks in advance for your help, 

Best

Christian.

-- 

Phd, Post-Doctoral Fellow

Department of Religions and Philosophies

Room 339

SOAS University of London
Thornhaugh Street

London, WC1H 0XG

c...@soas.ac.uk

 



Re: [EXTERNAL] I have some questions about tika-python

2020-08-29 Thread Chris Mattmann
Thanks for reaching out Aditya and for using Tika Python. This issue is 
best solved upstream in dev@tika.apache.org so I am copying that list
and making it the reply to.

 

The issue likely lies in the PDFBox algorithm. There are PDFBox folks on
this list. They can help you. Hopefully there is a simple config setting
to help out.

 

Cheers,

Chris

 

 

From: Aditya Sardesai 
Date: Thursday, August 27, 2020 at 11:44 PM
To: "Mattmann, Chris A (US 1740)" 
Subject: [EXTERNAL] I have some questions about tika-python

 

Greetings Chris,

 

We had a requirement for our project which required parsing PDF files and 
extracting the text for some verification. I tried a number of other python 
packages but they all had issues recognizing text consistently across the file.

 

The most common issue we faced was text not being dumped in the correct sequence. 
That was until we found Tika. We are very impressed by its recognition of text 
sequencing. It is exactly what we want.

 

However, we're facing an issue with vertically aligned text. There are two 
examples of vertically aligned text which I can show. In one instance the text 
is parsed correctly but not in the other.

 

Ex1.

  

In this the word values is read as,

V

al

 

ue

s

 

Ex2.

In this, the date is parsed correctly as,

2020-07-16 00:30

 

Can you please help us understand if there are some specifics about the tika 
algorithm, we should be aware of? Any suggestions on how we can better use the 
tool?

Please let me know if I need to connect with any other contributor for this.

 

Looking forward to your valuable comments.

 

 

Regards,

__

 

Aditya Sardesai

Lead Quality Engineer



aditya_sarde...@persistent.com

Connect with me on: LinkedIn



See Beyond, Rise Above

 



Re: [EXTERNAL] Tika 2.0 modularization

2020-08-14 Thread Chris Mattmann
Haha  I’m down and supportive!

 

Time’s TIME FOR 2.x 

 

 

 

From: Tim Allison 
Reply-To: "dev@tika.apache.org" , "Allison, Tim (US 
174B-Affiliate)" 
Date: Friday, August 14, 2020 at 6:06 AM
To: "" 
Subject: [EXTERNAL] Tika 2.0 modularization

 

All,

  I _think_ I might have some time to start working on integrating Bob's

work on the current main branch.  I'll have to ignore most of the incoming

issues for a bit...unlike the last 4 years...this time I mean it. :)

  Let me know if there are any objections to heading down this path now.

 

   Cheers,

 

  Tim

 



[jira] [Commented] (TIKA-3119) General upgrades for 1.25

2020-06-19 Thread Chris Mattmann (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17140963#comment-17140963
 ] 

Chris Mattmann commented on TIKA-3119:
--

[~agibsonccc] can you help see above?

> General upgrades for 1.25
> -
>
> Key: TIKA-3119
> URL: https://issues.apache.org/jira/browse/TIKA-3119
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
> Attachments: Screen Shot 2020-06-19 at 8.56.05 PM.png, Screenshot 
> from 2020-06-19 14-50-21.png
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [EXTERNAL] renaming master?

2020-06-16 Thread Chris Mattmann
How about just development?

 

We use that on OODT … though we have a master too that needs to get
removed …

 

 

 

From: Tim Allison 
Reply-To: "dev@tika.apache.org" , "Allison, Tim (US 
1740-Affiliate)" 
Date: Tuesday, June 16, 2020 at 10:31 AM
To: "" 
Subject: [EXTERNAL] renaming master?

 

All,

 

  As you may have seen, there's a movement to rename the "master" branch to

"main" or "trunk" (at least in the U.S.)[1][2].  Github is doing this, and

I personally think this makes sense.

 

  Are there any objections if we change "master"?  If we do change it, is

there a preference for "main", "trunk" or something else?

 

  My personal preference would be for trunk, but I'm open.

 

 Best,

 

 Tim

 

[1]

https://www.zdnet.com/article/github-to-replace-master-with-alternative-term-to-avoid-slavery-references/

[2] https://www.bbc.com/news/technology-53050955

 



[jira] [Commented] (TIKA-3093) Enable tika-server to forward parse results to another endpoint

2020-04-24 Thread Chris Mattmann (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17091708#comment-17091708
 ] 

Chris Mattmann commented on TIKA-3093:
--

yea we have lots of pipelines with OODT and Tika that do this already 
(http://github.com/apache/drat/ is a classic example of this)...

> Enable tika-server to forward parse results to another endpoint
> ---
>
> Key: TIKA-3093
> URL: https://issues.apache.org/jira/browse/TIKA-3093
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>
> bq. I see the "send the results to a remote network service" thing as 
> probably being separate from the Content Handler.
> The above is from [~nick] on TIKA-2972.
> It would be useful to allow users to forward the results of parsing to 
> another endpoint.  For example, a user could specify a Solr 
> URL/update/json/docs handler or an elastic //_doc/<_id>
> We may want to allow users to do custom mapping before redirecting to another 
> URL, whitelisting/blacklisting of metadata keys, etc.
> I'd propose using /rmeta as the basis for this.
> cc [~ehatcher] and [~dadoonet].
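
To make the proposal above a little more concrete, here is a minimal sketch of the "filter metadata keys, then forward" step. It is not the tika-server implementation. The function names, the sample record, and the Solr core name in the URL are all made up for illustration; the only thing taken from the issue is the idea of starting from /rmeta-style JSON (a list of metadata dicts) and posting it to something like Solr's /update/json/docs handler.

```python
import json
from urllib import request


def filter_metadata(docs, allow=None, deny=()):
    """Keep only wanted metadata keys in each /rmeta-style record.

    docs  -- list of dicts, one per (embedded) document
    allow -- optional collection of keys to keep; None keeps everything
    deny  -- collection of keys to drop (checked in addition to allow)
    """
    filtered = []
    for doc in docs:
        kept = {
            key: value
            for key, value in doc.items()
            if (allow is None or key in allow) and key not in deny
        }
        filtered.append(kept)
    return filtered


def build_forward_request(docs, target_url):
    """Build (but do not send) a JSON POST carrying the filtered records."""
    body = json.dumps(docs).encode("utf-8")
    return request.Request(
        target_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical /rmeta output for a single document.
sample = [
    {
        "Content-Type": "application/pdf",
        "X-TIKA:content": "hello",
        "pdf:producer": "x",
    }
]
slim = filter_metadata(sample, deny={"pdf:producer"})
# Placeholder URL: "mycore" is an invented core name, not a tested endpoint.
req = build_forward_request(
    slim, "http://localhost:8983/solr/mycore/update/json/docs"
)
```

Sending would be a matter of passing req to urllib.request.urlopen; that part is left out so the sketch can be read and run without a Solr instance.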





Re: [EXTERNAL] Re: Issue with > 200% CPU after bulk usage

2020-04-16 Thread Chris Mattmann
Yes, some of us have been developing an Elastic scaling stack for Tika server…

 

That does just that with AWS. Don’t have it ready to push upstream yet.


Cheers,

Chris

 

 

From: Eric Pugh 
Reply-To: "dev@tika.apache.org" 
Date: Thursday, April 16, 2020 at 7:09 AM
To: "dev@tika.apache.org" 
Subject: [EXTERNAL] Re: Issue with > 200% CPU after bulk usage

 

Does anyone have a good example of combining Tika with some sort of pool of 
Docker containers?   I think a lot of folks treat their Tika server like a pet, 
not like a cow.  
https://cloudscaling.com/blog/cloud-computing/the-history-of-pets-vs-cattle/ 


 

I wonder if we could ship some “recipes” that describe how to deploy a pool of 
Tikas. For example: if a Tika runs at over 200% CPU for 1 hour, kill it and start the next.
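
As a rough sketch of the "over 200% for 1 hour, kill it" rule, the decision itself can be a small pure function. Everything here is an assumption made for illustration: the CpuSample type, the function name, and the idea that some external monitor feeds in timestamped CPU readings. How the container is actually killed and replaced is left to whatever orchestrates the pool.

```python
from dataclasses import dataclass


@dataclass
class CpuSample:
    timestamp: float    # seconds since some epoch
    cpu_percent: float  # e.g. 215.0 for 215% on a multi-core box


def should_recycle(samples, threshold=200.0, window_seconds=3600.0):
    """Return True if CPU stayed above threshold for the whole trailing window.

    samples must be ordered by timestamp, oldest first.
    """
    if not samples:
        return False
    newest = samples[-1].timestamp
    if newest - samples[0].timestamp < window_seconds:
        # Not enough history yet to cover the window.
        return False
    recent = [s for s in samples if newest - s.timestamp <= window_seconds]
    return all(s.cpu_percent > threshold for s in recent)


# One reading every 10 minutes for an hour, all at 250% CPU.
hot = [CpuSample(float(t), 250.0) for t in range(0, 3601, 600)]
# Same readings, but with one dip to 50% in the middle.
mixed = hot[:3] + [CpuSample(1800.0, 50.0)] + hot[4:]
```

A single dip below the threshold resets the decision, which keeps a server that is merely busy in bursts from being recycled.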

 

 

 

On Apr 16, 2020, at 9:40 AM, Nick Burch  wrote:

On Wed, 15 Apr 2020, hans.mei...@avident-it.se wrote:

I have encountered an issue with Tika running locally on a box that the Java 
runtime goes up to over 200% CPU, after running a bulk load of documents over a 
couple of days, it is more than 3 million documents.

Can you do a thread dump to show what the JVM is doing?

https://access.redhat.com/solutions/18178

Nick

 

___

Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 | 
http://www.opensourceconnections.com  | 
My Free/Busy   

Co-Author: Apache Solr Enterprise Search Server, 3rd Ed 

   

This e-mail and all contents, including attachments, is considered to be 
Company Confidential unless explicitly stated otherwise, regardless of whether 
attachments are marked as such.

 

 



[jira] [Commented] (TIKA-2368) Clean up SentimentParser dependencies

2020-04-06 Thread Chris Mattmann (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17076659#comment-17076659
 ] 

Chris Mattmann commented on TIKA-2368:
--

I have a TensorFlow version of Sentiment Analysis based on the Netflix data. I 
will be contributing it soon; hopefully that will help render this moot...

> Clean up SentimentParser dependencies
> -
>
> Key: TIKA-2368
> URL: https://issues.apache.org/jira/browse/TIKA-2368
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Blocker
>
> Is there any way to avoid reliance on edu.usc.ir's sentiment-analysis-parser? 
>  I ask because:
> {noformat}
> [WARNING] sentiment-analysis-parser-0.1.jar, tika-parsers-1.15-SNAPSHOT.jar 
> define 1 overlapping classes: 
> [WARNING]   - org.apache.tika.parser.sentiment.analysis.SentimentParser
> [WARNING] tika-core-1.15-SNAPSHOT.jar, tika-translate-1.15-SNAPSHOT.jar 
> define 4 overlapping classes: 
> [WARNING]   - org.apache.tika.language.translate.DefaultTranslator$1
> [WARNING]   - org.apache.tika.language.translate.EmptyTranslator
> [WARNING]   - org.apache.tika.language.translate.DefaultTranslator
> [WARNING]   - org.apache.tika.language.translate.Translator
> {noformat}
> We should be ok keeping things as they are and excluding SentimentParser and 
> tika-translate, but can we easily move the code that's still in edu.usc.ir's 
> package into Tika?





Re: [EXTERNAL] Re: JDK 12 build issues

2020-03-18 Thread Chris Mattmann
Thanks Oleg. I was using OpenJDK 12 and 13, but I fixed it!

 

I needed to delete the $HOME/.tika-dl folder. All good now!
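
For anyone hitting the same thing, a small sketch of that cleanup step follows. The only fact taken from this thread is the folder location ($HOME/.tika-dl); the guess that it holds a cached download that had gone stale is an assumption, and the function name is invented. The demo runs against a throwaway directory rather than a real home directory.

```python
import shutil
import tempfile
from pathlib import Path


def clear_tika_dl_cache(home=None):
    """Delete <home>/.tika-dl if it exists; return True if it was removed."""
    base = Path(home) if home is not None else Path.home()
    cache_dir = base / ".tika-dl"
    if cache_dir.is_dir():
        shutil.rmtree(cache_dir)
        return True
    return False


# Demonstrate against a temporary "home" so nothing real is touched.
with tempfile.TemporaryDirectory() as fake_home:
    (Path(fake_home) / ".tika-dl").mkdir()
    first = clear_tika_dl_cache(fake_home)   # directory existed
    second = clear_tika_dl_cache(fake_home)  # already gone
```

Calling clear_tika_dl_cache() with no argument would act on the real home directory, after which the next tika-dl build has to fetch whatever it needs again.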

 

[WARNING] Invalid POM for commons-net:commons-net:jar:3.1, transitive dependencies 
(if any) will not be available, enable debug logging for more details

[WARNING] Invalid POM for net.ericaro:neoitertools:jar:1.0.0, transitive 
dependencies (if any) will not be available, enable debug logging for more 
details

[INFO] 

[INFO] --- maven-remote-resources-plugin:1.5:process (default) @ tika-dl ---

[WARNING] Invalid project model for artifact [commons-net:commons-net:3.1]. It 
will be ignored by the remote resources Mojo.

[WARNING] Invalid project model for artifact [neoitertools:net.ericaro:1.0.0]. 
It will be ignored by the remote resources Mojo.

[INFO] 

[INFO] --- maven-resources-plugin:2.7:resources (default-resources) @ tika-dl 
---

[INFO] Using 'UTF-8' encoding to copy filtered resources.

[INFO] skip non existing resourceDirectory 
/Users/mattmann/src/tika/tika-dl/src/main/resources

[INFO] Copying 3 resources

[INFO] 

[INFO] --- maven-compiler-plugin:3.8.0:compile (default-compile) @ tika-dl ---

[INFO] Changes detected - recompiling the module!

[INFO] Compiling 2 source files to 
/Users/mattmann/src/tika/tika-dl/target/classes

[INFO] 

[INFO] --- maven-resources-plugin:2.7:testResources (default-testResources) @ 
tika-dl ---

[INFO] Using 'UTF-8' encoding to copy filtered resources.

[INFO] Copying 4 resources

[INFO] Copying 3 resources

[INFO] 

[INFO] --- maven-compiler-plugin:3.8.0:testCompile (default-testCompile) @ 
tika-dl ---

[INFO] Changes detected - recompiling the module!

[INFO] Compiling 2 source files to 
/Users/mattmann/src/tika/tika-dl/target/test-classes

[INFO] 

[INFO] --- maven-surefire-plugin:3.0.0-M4:test (default-test) @ tika-dl ---

[INFO] 

[INFO] ---

[INFO]  T E S T S

[INFO] ---

[INFO] Running org.apache.tika.dl.imagerec.DL4JVGG16NetTest

log4j:WARN No appenders could be found for logger 
(org.nd4j.linalg.factory.Nd4jBackend).

log4j:WARN Please initialize the log4j system properly.

log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more 
info.

[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 272.202 
s - in org.apache.tika.dl.imagerec.DL4JVGG16NetTest

[INFO] Running org.apache.tika.dl.imagerec.DL4JInceptionV3NetTest

[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 44.616 s 
- in org.apache.tika.dl.imagerec.DL4JInceptionV3NetTest

[INFO] 

[INFO] Results:

[INFO] 

[INFO] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0

[INFO] 

[INFO] 

[INFO] BUILD SUCCESS

[INFO] 

[INFO] Total time:  05:27 min

[INFO] Finished at: 2020-03-18T07:51:56-07:00

[INFO] 

pomodoro:tika-dl mattmann$ 

 

 

 

 

From: Oleg Tikhonov 
Reply-To: "dev@tika.apache.org" 
Date: Wednesday, March 18, 2020 at 7:53 AM
To: "dev@tika.apache.org" 
Subject: Re: [EXTERNAL] Re: JDK 12 build issues

 

Hi Chris,

I'm currently trying to build an env with java 12/13 ... in order to try

your setup.

What java version are you using? open jdk or oracle?

Once upon a time there was a bug in OpenJDK

https://bugs.openjdk.java.net/browse/JDK-8131146

But it seems to be ok in recent releases.

 

Keep you updated.

Cheers,

Oleg

 

 

On Wed, Mar 18, 2020 at 4:35 PM Chris Mattmann  wrote:

 

So I was able to get past my issues with Tesseract by reinstalling the

latest version with Brew.

 

 

 

I have a new issue!

 

I’ve tried in JDK12 and JDK13 to build tika-dl, but it keeps failing:

 

 

 

[INFO]

 

[INFO] --- maven-compiler-plugin:3.8.0:testCompile (default-testCompile) @

tika-dl ---

 

[INFO] Changes detected - recompiling the module!

 

[INFO] Compiling 2 source files to

/Users/mattmann/src/tika/tika-dl/target/test-classes

 

[INFO]

 

[INFO] --- maven-surefire-plugin:3.0.0-M4:test (default-test) @ tika-dl ---

 

[INFO]

 

[INFO] ---

 

[INFO]  T E S T S

 

[INFO] ---

 

[INFO] Running org.apache.tika.dl.imagerec.DL4JVGG16NetTest

 

log4j:WARN No appenders could be found for logger

(org.nd4j.linalg.factory.Nd4jBackend).

 

log4j:WARN Please initialize the log4j system properly.

 

log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for

more info.

 

[ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed:

3.38 s <<< FAILURE! - in org.apache.tika.dl.imagerec.DL4JVGG16NetTest

 

[ERROR] org.apache.tika.dl.imagerec.DL4JVGG16NetTest.recognise  Time

elapsed: 3.29 s  <<< ERROR!

 

org.apa

Re: [EXTERNAL] Re: JDK 12 build issues

2020-03-18 Thread Chris Mattmann
So I was able to get past my issues with Tesseract by reinstalling the latest 
version with Brew.

 

I have a new issue!

I’ve tried in JDK12 and JDK13 to build tika-dl, but it keeps failing:

 

[INFO] 

[INFO] --- maven-compiler-plugin:3.8.0:testCompile (default-testCompile) @ 
tika-dl ---

[INFO] Changes detected - recompiling the module!

[INFO] Compiling 2 source files to 
/Users/mattmann/src/tika/tika-dl/target/test-classes

[INFO] 

[INFO] --- maven-surefire-plugin:3.0.0-M4:test (default-test) @ tika-dl ---

[INFO] 

[INFO] ---

[INFO]  T E S T S

[INFO] ---

[INFO] Running org.apache.tika.dl.imagerec.DL4JVGG16NetTest

log4j:WARN No appenders could be found for logger 
(org.nd4j.linalg.factory.Nd4jBackend).

log4j:WARN Please initialize the log4j system properly.

log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more 
info.

[ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 3.38 s 
<<< FAILURE! - in org.apache.tika.dl.imagerec.DL4JVGG16NetTest

[ERROR] org.apache.tika.dl.imagerec.DL4JVGG16NetTest.recognise  Time elapsed: 
3.29 s  <<< ERROR!

org.apache.tika.exception.TikaConfigException: java.io.UTFDataFormatException: 
malformed input around byte 11

   at 
org.apache.tika.dl.imagerec.DL4JVGG16NetTest.recognise(DL4JVGG16NetTest.java:36)

Caused by: java.lang.RuntimeException: java.io.UTFDataFormatException: 
malformed input around byte 11

   at 
org.apache.tika.dl.imagerec.DL4JVGG16NetTest.recognise(DL4JVGG16NetTest.java:36)

Caused by: java.io.UTFDataFormatException: malformed input around byte 11

   at 
org.apache.tika.dl.imagerec.DL4JVGG16NetTest.recognise(DL4JVGG16NetTest.java:36)

 

[INFO] Running org.apache.tika.dl.imagerec.DL4JInceptionV3NetTest

[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 5.392 s 
- in org.apache.tika.dl.imagerec.DL4JInceptionV3NetTest

[INFO] 

[INFO] Results:

[INFO] 

[ERROR] Errors: 

[ERROR]   DL4JVGG16NetTest.recognise:36 » TikaConfig 
java.io.UTFDataFormatException: mal...

[INFO] 

[ERROR] Tests run: 2, Failures: 0, Errors: 1, Skipped: 0

[INFO] 

[INFO] 

[INFO] BUILD FAILURE

[INFO] 

[INFO] Total time:  25.628 s

[INFO] Finished at: 2020-03-18T07:34:08-07:00

[INFO] 

[ERROR] Failed to execute goal 
org.apache.maven.plugins:maven-surefire-plugin:3.0.0-M4:test (default-test) on 
project tika-dl: There are test failures.

[ERROR] 

[ERROR] Please refer to 
/Users/mattmann/src/tika/tika-dl/target/surefire-reports for the individual 
test results.

[ERROR] Please refer to dump files (if any exist) [date].dump, 
[date]-jvmRun[N].dump and [date].dumpstream.

[ERROR] -> [Help 1]

[ERROR] 

[ERROR] To see the full stack trace of the errors, re-run Maven with the -e 
switch.

[ERROR] Re-run Maven using the -X switch to enable full debug logging.

[ERROR] 

[ERROR] For more information about the errors and possible solutions, please 
read the following articles:

[ERROR] [Help 1] 
http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException

pomodoro:tika-dl mattmann$ 

 

Thamme, do you have any ideas what is going on here?


Cheers,

Chris

 

 

 

 

From: Tim Allison 
Reply-To: "dev@tika.apache.org" , "Allison, Timothy B (US 
1760-Affiliate)" 
Date: Wednesday, March 18, 2020 at 2:35 AM
To: "dev@tika.apache.org" 
Subject: [EXTERNAL] Re: JDK 12 build issues

 

Haven’t tried...we should add java 12-14 to Jenkins.

 

Wait, are we up to 18 yet...

 

Will look into it...

 

On Tue, Mar 17, 2020 at 10:07 PM Chris Mattmann  wrote:

 

Hey Tim et al.,

 

 

 

Do the tests fail for you with Java 12?

 

 

 

[INFO] Running org.apache.tika.parser.pkg.GzipParserTest

 

[INFO] Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed:

0.397 s - in org.apache.tika.parser.pkg.GzipParserTest

 

[INFO] Running org.apache.tika.TestXMLEntityExpansion

 

[WARNING] Tests run: 3, Failures: 0, Errors: 0, Skipped: 1, Time elapsed:

0.085 s - in org.apache.tika.TestXMLEntityExpansion

 

[INFO] Running org.apache.tika.mime.MimeTypeTest

 

[INFO] Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed:

0.001 s - in org.apache.tika.mime.MimeTypeTest

 

[INFO] Running org.apache.tika.mime.MimeTypesTest

 

[INFO] Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed:

0.001 s - in org.apache.tika.mime.MimeTypesTest

 

[INFO] Running org.apache.tika.mime.TestMimeTypes

 

[INFO] Tests run: 80, Failures: 0, Errors: 0, Skipped: 0, Time elapsed:

8.997 s - in org.apache.tika.mime.TestMimeTypes

 

[INFO] Running org.apache.tika.TestCorruptedFiles

 

[WARNING] Tests run: 1, Failure

JDK 12 build issues

2020-03-17 Thread Chris Mattmann
Hey Tim et al.,

 

Do the tests fail for you with Java 12? 

 

[INFO] Running org.apache.tika.parser.pkg.GzipParserTest

[INFO] Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.397 s 
- in org.apache.tika.parser.pkg.GzipParserTest

[INFO] Running org.apache.tika.TestXMLEntityExpansion

[WARNING] Tests run: 3, Failures: 0, Errors: 0, Skipped: 1, Time elapsed: 0.085 
s - in org.apache.tika.TestXMLEntityExpansion

[INFO] Running org.apache.tika.mime.MimeTypeTest

[INFO] Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.001 s 
- in org.apache.tika.mime.MimeTypeTest

[INFO] Running org.apache.tika.mime.MimeTypesTest

[INFO] Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.001 s 
- in org.apache.tika.mime.MimeTypesTest

[INFO] Running org.apache.tika.mime.TestMimeTypes

[INFO] Tests run: 80, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 8.997 s 
- in org.apache.tika.mime.TestMimeTypes

[INFO] Running org.apache.tika.TestCorruptedFiles

[WARNING] Tests run: 1, Failures: 0, Errors: 0, Skipped: 1, Time elapsed: 0.001 
s - in org.apache.tika.TestCorruptedFiles

[INFO] 

[INFO] Results:

[INFO] 

[ERROR] Failures: 

[ERROR]   
TesseractOCRParserTest.confirmMultiPageTiffHandling:290->TikaTest.assertContains:110
 Page 2 not found in:

http://www.w3.org/1999/xhtml;>

Multipage

TIFF

Example

Page 1





[ERROR]   
TesseractOCRParserTest.testOCROutputsHOCR:146->TikaTest.assertContains:110 
Happy not found in:

http://www.w3.org/1999/xhtml;>

Presentation1







http://www.w3.org/1999/xhtml;>

  Happy

  New

  Year

  2003!

[INFO] 

[ERROR] Tests run: 1188, Failures: 2, Errors: 0, Skipped: 48

[INFO] 

[INFO] 

[INFO] Reactor Summary for Apache Tika 2.0.0-SNAPSHOT:

[INFO] 

[INFO] Apache Tika parent . SUCCESS [  8.822 s]

[INFO] Apache Tika core ... SUCCESS [ 39.589 s]

[INFO] Apache Tika parsers  FAILURE [09:04 min]

[INFO] Apache Tika OSGi bundle  SKIPPED

[INFO] Apache Tika XMP  SKIPPED

[INFO] Apache Tika serialization .. SKIPPED

[INFO] Apache Tika batch .. SKIPPED

[INFO] Apache Tika language detection . SKIPPED

[INFO] Apache Tika application  SKIPPED

[INFO] Apache Tika translate .. SKIPPED

[INFO] Apache Tika server . SKIPPED

[INFO] Apache Tika eval ... SKIPPED

[INFO] Apache Tika examples ... SKIPPED

[INFO] Apache Tika Java-7 Components .. SKIPPED

[INFO] Apache Tika Deep Learning (powered by DL4J)  SKIPPED

[INFO] Apache Tika Natural Language Processing  SKIPPED

[INFO] Apache Tika  SKIPPED

[INFO] 

[INFO] BUILD FAILURE

[INFO] 

[INFO] Total time:  09:57 min

[INFO] Finished at: 2020-03-17T18:31:10-07:00

[INFO] 

[ERROR] Failed to execute goal 
org.apache.maven.plugins:maven-surefire-plugin:3.0.0-M4:test (default-test) on 
project tika-parsers: There are test failures.

[ERROR] 

[ERROR] Please refer to 
/Users/mattmann/src/tika/tika-parsers/target/surefire-reports for the 
individual test results.

[ERROR] Please refer to dump files (if any exist) [date].dump, 
[date]-jvmRun[N].dump and [date].dumpstream.

[ERROR] -> [Help 1]

[ERROR] 

[ERROR] To see the full stack trace of the errors, re-run Maven with the -e 
switch.

[ERROR] Re-run Maven using the -X switch to enable full debug logging.

[ERROR] 

[ERROR] For more information about the errors and possible solutions, please 
read the following articles:

[ERROR] [Help 1] 
http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException

[ERROR] 

[ERROR] After correcting the problems, you can resume the build with the command

[ERROR]   mvn  -rf :tika-parsers

pomodoro:tika mattmann$ java -version

openjdk version "12.0.1" 2019-04-16

OpenJDK Runtime Environment (build 12.0.1+12)

OpenJDK 64-Bit Server VM (build 12.0.1+12, mixed mode, sharing)

pomodoro:tika mattmann$ 

 

Any ideas?

 

Cheers,

Chris

 



Re: [EXTERNAL] question about Tika

2020-02-10 Thread Chris Mattmann
Thanks.  Please make sure dev@tika.apache.org is where you are addressing  
these questions to.

 

 

 

From: Max Franklin 
Date: Monday, February 10, 2020 at 10:59 AM
To: Chris Mattmann 
Subject: Re: [EXTERNAL] question about Tika

 

Hi Chris,  

 

The Tika Server seems to work okay for me, so it’s just an issue with Tika 
Python. 

 

Thanks, 

Max

On Feb 10, 2020, at 1:18 PM, Chris Mattmann  wrote:

 

Max,  does Tika Server work OK for you? Is there a different behavior with Tika
Python than simply posting the PDF to Tika server? Try first and then I am 
redirecting
you to the Tika dev list for help.

 

Thanks,

Chris

 

 

 

 

From: Max Franklin 
Date: Monday, February 10, 2020 at 9:37 AM
To: "Mattmann, Chris A (US 1760)" 
Subject: [EXTERNAL] question about Tika

 

Hello,

 

I'm sorry for the inconvenience, but I've been using Tika as part of a

Python code to extract text from PDFs and convert it into a TXT file.

The code worked fine for several months, but recently it stopped

recognizing the content and started returning a NoneType. Since then, it has

fluctuated seemingly randomly from working to not working. Attached

are pictures of my code and the error. Do you have any idea why this

is behaving erratically?

 

Thank you so much for your help!

 

Best,

Max

 



 



FW: [EXTERNAL] question about Tika

2020-02-10 Thread Chris Mattmann
Max,  does Tika Server work OK for you? Is there a different behavior with Tika
Python than simply posting the PDF to Tika server? Try first and then I am 
redirecting
you to the Tika dev list for help.
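
A minimal way to run that comparison without Tika Python in the loop is to send the PDF straight to the server over HTTP. The sketch below assumes a Tika Server already running on its default port 9998 and uses the /tika endpoint with an Accept: text/plain header; the function names and the placeholder bytes are made up for the example.

```python
from urllib import request


def build_tika_request(pdf_bytes, server="http://localhost:9998"):
    """Build a PUT to the server's /tika endpoint asking for plain text back."""
    return request.Request(
        server.rstrip("/") + "/tika",
        data=pdf_bytes,
        headers={"Accept": "text/plain", "Content-Type": "application/pdf"},
        method="PUT",
    )


def extract_text(pdf_path, server="http://localhost:9998"):
    """Send the file to a running Tika Server and return the extracted text."""
    with open(pdf_path, "rb") as handle:
        req = build_tika_request(handle.read(), server)
    with request.urlopen(req) as response:
        return response.read().decode("utf-8")


# Building the request needs no server; only extract_text() does.
req = build_tika_request(b"%PDF-1.4 placeholder")
```

If extract_text() returns text for a file where the Python wrapper returns nothing, the problem is more likely on the wrapper side than in the server.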

 

Thanks,

Chris

 

 

 

 

From: Max Franklin 
Date: Monday, February 10, 2020 at 9:37 AM
To: "Mattmann, Chris A (US 1760)" 
Subject: [EXTERNAL] question about Tika

 

Hello,

 

I'm sorry for the inconvenience, but I've been using Tika as part of a

Python code to extract text from PDFs and convert it into a TXT file.

The code worked fine for several months, but recently it stopped

recognizing the content and started returning a NoneType. Since then, it has

fluctuated seemingly randomly from working to not working. Attached

are pictures of my code and the error. Do you have any idea why this

is behaving erratically?
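
Independent of the root cause, a script like the one described can at least guard against a missing result before writing the TXT file. The sketch below assumes the parse result is a dict with a "content" key that may be None (the shape Tika Python is generally understood to return); the helper name is invented, and this does not explain why the content goes missing.

```python
def content_or_empty(parsed):
    """Return extracted text from a parser-style result dict, never None."""
    if not parsed:
        return ""
    content = parsed.get("content")
    return content.strip() if content else ""


# Hypothetical results: one successful parse, one where nothing came back.
good = {"metadata": {}, "content": "  hi\n"}
empty = {"metadata": {}, "content": None}
```

With this in place the caller can check for an empty string and log or retry, instead of failing later with an error about a NoneType.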

 

Thank you so much for your help!

 

Best,

Max

 



Re: [EXTERNAL] Regarding unicodeencode Error

2020-01-08 Thread Chris Mattmann
OK, can you please post an issue at http://issues.apache.org/jira/browse/TIKA and 
attach your
document and specific error? Thanks!

 

 

 

From: "Gowda,Sumanth" 
Date: Wednesday, January 8, 2020 at 9:36 PM
To: Chris Mattmann 
Subject: RE: [EXTERNAL] Regarding unicodeencode Error

 

Tika python

 

 

From: Chris Mattmann  
Sent: Thursday, January 9, 2020 8:47 AM
To: Gowda,Sumanth 
Cc: dev@tika.apache.org
Subject: Re: [EXTERNAL] Regarding unicodeencode Error

 

Hi Sumanth,

 

Are you using Tika Python? Or plain Tika in Java?

 

Can you file a ticket and share the PDF?

 

Cheers,

Chris

 

 

 

 

From: "Gowda,Sumanth" 
Date: Wednesday, January 8, 2020 at 12:58 AM
To: "Mattmann, Chris A (US 1760)" 
Subject: [EXTERNAL] Regarding unicodeencode Error

 

Hi Chris,

 

I was trying to read a PDF using the Tika parser and I am getting a 
UnicodeEncodeError for \u2013. Any idea how I can resolve this?
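
For context, U+2013 is an en dash, and this error typically shows up when the extracted text is written or printed with an encoding that has no en dash, such as ASCII. Whether that is what happened here is an assumption, since the failing code was not shared. The sketch reproduces the error and shows the usual workaround of writing with an explicit UTF-8 encoding.

```python
import tempfile
from pathlib import Path

text = "pages 10\u201312"  # contains an en dash, U+2013

try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False  # the failure mode described above

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "out.txt"
    # An explicit UTF-8 encoding avoids depending on the platform default.
    out_path.write_text(text, encoding="utf-8")
    round_trip = out_path.read_text(encoding="utf-8")
```

The same encoding="utf-8" argument works with the built-in open() when writing extracted text to a file.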

 

Thanks,

Sumanth Gowda

 

Sent from Mail for Windows 10

 

  

CONFIDENTIALITY NOTICE This message and any included attachments are from 
Cerner Corporation and are intended only for the addressee. The information 
contained in this message is confidential and may constitute inside or 
non-public information under international, federal, or state securities laws. 
Unauthorized forwarding, printing, copying, distribution, or use of such 
information is strictly prohibited and may be unlawful. If you are not the 
addressee, please promptly delete this message and notify the sender of the 
delivery error by e-mail or you may call Cerner's corporate offices in Kansas 
City, Missouri, U.S.A at (+1) (816)221-1024.



Re: [EXTERNAL] Do we have a community supported approach for deploying Tika Server in production?

2020-01-08 Thread Chris Mattmann
+1

 

Note there is also a USC tika dockers repo where I put the data science stuff 
too:

 

http://github.com/USCDataScience/tika-dockers

 

I’ll continue to push DL and ML Tika stuff there.

Cheers,

Chris

 

 

 

 

From: Dave Meikle 
Reply-To: "dev@tika.apache.org" 
Date: Wednesday, January 8, 2020 at 2:18 PM
To: "" 
Subject: Re: [EXTERNAL] Do we have a community supported approach for deploying 
Tika Server in production?

 

Hi Eric,

 

Will take a look. On a related note, I've created a new repos:

https://github.com/apache/tika-docker

 

Thinking based on looking at the PRs and Issues on LogicalSpark

docker-tikaserver, I'll create an updated docker file using what you've

added here and look to publish builds to docker hub from that.

 

What do you think?

 

Cheers,

Dave

 

 

 

On Wed, 8 Jan 2020 at 03:16, Eric Pugh 

wrote:

 

Hi all, I’ve gone ahead and added the -spawnChild property as a default

when running Tika Server as a service.   I’d love some eyes on the PR, and

if this looks good, get it committed.

 

Feedback welcome!

 

Eric

 

 

 

> On Dec 17, 2019, at 12:53 PM, Eric Pugh 

wrote:

> 

> Cool.

> 

> It’s the auto run that I really need, and the other part that I don’t

think I’ve tackled properly is the managing of logs…

> 

> I’m going to check with my project to see if they support Snap packages.

> 

> Eric

> 

> 

>> On Dec 16, 2019, at 5:10 PM, Tom Barber > wrote:

>> 

>> Just saw this fly by and FYI on Linux systems that support Snap

packages (Ubuntu/Debian/Arch/Fedora etc) you can `snap install tika-server`

doesn’t yet auto-run I don’t believe but you can just run `tika-server.run`

and adding an init script wouldn’t take 5 minutes.

>> 

>> Tom

>> 

>> On 16 December 2019 at 18:42:55, Eric Pugh (ep...@opensourceconnections.com) wrote:

>> 

>>> Hi folks!

>>> 

>>> I’ve got a mostly completed PR for having install scripts for Tika

Server, and I’m hoping a committer will take a look at the PR, and give

feedback (and ideally commit in time for 1.24!)

>>> 

>>> A couple of things:

>>> 

>>> 1) This was completely influenced by

https://lucene.apache.org/solr/guide/8_3/taking-solr-to-production.html#service-installation-script,

in fact I started with the Solr scripts.

>>> 

>>> 2) I’ve deleted all the Solr specific aspects (I think), however there

may still be more to delete.

>>> 

>>> 3) This requires a change to how we release Tika: previously we shipped

tika-app.jar, tika-eval.jar, and tika-server.jar, and now, I think, we

want to add the tika-server-bin.tgz and tika-server-bin.zip binary

distributions.

>>> 

>>> I’m happy to start writing accompanying “how to deploy Tika Server”

docs if this PR looks good! Or, please give input and I’ll make the updates.

>>> 

>>> Eric

>>> 

>>> 

>>> > On Dec 12, 2019, at 2:39 PM, Eric Pugh <ep...@opensourceconnections.com> wrote:

>>> >

>>> > I’ve created this JIRA to track this work:

https://issues.apache.org/jira/browse/TIKA-3010

>>> >

>>> > And a WIP PR is at https://github.com/apache/tika/pull/305

>>> >

>>> > My thought is to put something together that mimics how we deploy

Solr, and see how that works. I have a need for an install process that a

general IT person can follow, who isn’t a Tika expert or a Docker user.

>>> >

>>> >

>>> >

>>> >

>>> >> On Dec 4, 2019, at 12:28 PM, Chris Mattmann <mattm...@apache.org> wrote:

>>> >>

>>> >> Thanks for bringing this conversation up Eric.

>>> >>

>>> >>

>>> >>

>>> >> Historically if you look over the last 5 years, I think what you

are asking below has sort of already become the de facto

>>> >> tru

Re: [EXTERNAL] Do we have a community supported approach for deploying Tika Server in production?

2019-12-04 Thread Chris Mattmann
Thanks for bringing this conversation up Eric.

 

Historically if you look over the last 5 years, I think what you are asking 
below has sort of already become the de facto
truth. Most people are in fact using Tika server, whether they are individual 
devs, govvies, commercial folk and the like. 

Big, small and medium projects. Evidenced by the expansion of Tika APIs into 
pretty much every PL I know of and actively use 
today.

 

Given that, we probably should update the main website docs to make this more 
prominent. The tika server docs on the
wiki are pretty darn good. But they don’t get prime real estate. Would be 
wonderful if someone wants to update the 
website to make it more prominent.

 

The downstream Tika Python lib that I maintain has tons of activity, is used by 
350+ projects, and relies solely
on Tika-Server. My recommendation to the Solr folks (having created SOLR-7633) from 
the 2014 DARPA MEMEX days was to 
move towards a Tika Server based SolrCell dep, and that’s the right way to go IMO.

 

Chris

 

 

 

 

 

From: Eric Pugh 
Reply-To: "dev@tika.apache.org" 
Date: Wednesday, December 4, 2019 at 12:24 PM
To: "tika-...@apache.org" 
Subject: [EXTERNAL] Do we have a community supported approach for deploying 
Tika Server in production?

 

Hi all - Hoping this is a reasonable Tika-dev versus Tika-user question!

 

Over in Solr land there has been renewed discussion about streamlining what 
Solr is   

 

In regards to rich content extraction and the Tika project, it seems like the 
two ideas that continue to preserve the existing behavior are:

 

1) To convert the ExtractingRequestHandler into a Package (Plugin) for Solr.   
This slims down the standard Solr download, and *might* make it easier to 
update the version of Tika + dependent jars used?

 

2) The second approach is to instead require Tika-Server to be running 
(https://issues.apache.org/jira/browse/SOLR-7633) and just have Solr delegate 
the call to Tika-Server.

 

 

I was thinking about why I like option 1 better than 2, and I think it boils 
down to how mature the IT organization I am working with is.  Some IT 
organizations have large dev-ops teams, and are working at major scale, and 
managing a fleet of Tika-Server on Kubernetes with Load Balancer dynamically 
scaling up and down is simple and second nature!  However, many organizations 
aren’t like that.

 

So I guess what I’m asking is do we have a reasonable supported approach for 
deploying Tika Server for non-tika savvy organizations?   I’m thinking about 
Solr, and specifically the fact that Solr has a well defined set of Service 
Installation scripts.   When I follow the directions in 
https://lucene.apache.org/solr/guide/8_3/taking-solr-to-production.html#taking-solr-to-production
 I can feel confident that when the server is rebooted, then Solr will come 
back up!   Plus there is log rotation and all the rest.

 

In contrast, when I look at the Tika website, specifically the 
https://tika.apache.org/1.22/gettingstarted.html page, the message is to run 
Tika as a command line application, or embedded in your application.   

 

I’m wondering if Tika-Server needs to be made more prominent, and treated as 
the “primary method of interacting with Tika”?   Do we need as a community to 
focus more on Tika-Server?   In our getting started documentation, in our usage 
documentation, and in our examples?

 

Do we need to create the equivalent of the Service Installation scripts for 
Tika-Server?   

 

Wanted to stoke the discussion!

 

Eric

 

___

Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 | 
http://www.opensourceconnections.com  | 
My Free/Busy   

Co-Author: Apache Solr Enterprise Search Server, 3rd Ed 

   

This e-mail and all contents, including attachments, is considered to be 
Company Confidential unless explicitly stated otherwise, regardless of whether 
attachments are marked as such.

 

 



Re: [EXTERNAL] Docker image along with 1.23?

2019-11-20 Thread Chris Mattmann
Yeah producing the actual image is tricky and my recommendation is for Tika to 
stay out of the business of that. Leave it to LogicalSpark or others to do 
this. It’s 
tricky with licenses and I doubt ASF will ever develop an optimal solution to 
this 
due to the nature of its core mission as Nick stated.

 

 

 

 

From: Eric Pugh 
Reply-To: "dev@tika.apache.org" 
Date: Wednesday, November 20, 2019 at 6:02 PM
To: "dev@tika.apache.org" 
Cc: "Allison, Timothy B (US 1760-Affiliate)" 
Subject: Re: [EXTERNAL] Docker image along with 1.23?

 

I was thinking more of producing the actual image, so that others don’t have to 
go through the pain of compiling an image.   Having the Dockerfile made 
available as well does give a nice recipe for modifying the “official” image.   
I recently tested Tesseract 3 with the latest Tika, and I did it by tweaking 
the existing Dockerfile that LogicalSpark has published.

 

I don’t know how other projects at ASF handle the image publishing.

 

 

 

 

On Nov 20, 2019, at 7:02 PM, Chris Mattmann  wrote:

Nick, TBH, I don’t get it. If we ship the “Dockerfile” we are simply shipping 
text file, 

code. Under a license. If we create a “docker image” and then publish it to the 
ASF 

hub then I agree with you.

My suggestion and my interpretation of Tim’s is to ship a standard 
“Dockerfile”. Do you

agree with this? It should be air covered (as former VP, Legal, at least it 
would have been

with me). 

Cheers,

Chris

From: Nick Burch 

Reply-To: "dev@tika.apache.org" 

Date: Wednesday, November 20, 2019 at 3:57 PM

To: "Allison, Timothy B (US 1760-Affiliate)" 

Cc: "" 

Subject: [EXTERNAL] Re: Docker image along with 1.23?

On Wed, 20 Nov 2019, Tim Allison wrote:

Eric Pugh recently asked on another channel if we had any plans to

release an official docker image for 1.23.

Depending on what we put in the container, we do need to be a little 

careful. There's "platform dependencies" under non-compatible licenses 

that we can optionally use if people have installed them, which we 

ourselves can't directly ship under ASF rules. (Tesseract is fine as 

that's Apache Licenses, Java itself is trickier, see the Netbeans 

discussions on legal-discuss@ and LEGAL jira)

Shipping an official docker container with the Tika Server on seems to me 

to be a helpful step for users, but we just need to make sure we're 

following ASF policies. (The Apache Software Foundation mission is to 

"provide software for the public good", but source code is the main focus 

for the mission, binaries are trickier!)

Nick

 


 

 



Re: [EXTERNAL] Re: Docker image along with 1.23?

2019-11-20 Thread Chris Mattmann
Nick, TBH, I don’t get it. If we ship the “Dockerfile” we are simply shipping a 
text file, i.e. code, under a license. If we create a “docker image” and then 
publish it to the ASF hub then I agree with you.

 

My suggestion and my interpretation of Tim’s is to ship a standard 
“Dockerfile”. Do you
agree with this? It should be air covered (as former VP, Legal, at least it 
would have been
with me). 

 

Cheers,

Chris

 

 

 

 

From: Nick Burch 
Reply-To: "dev@tika.apache.org" 
Date: Wednesday, November 20, 2019 at 3:57 PM
To: "Allison, Timothy B (US 1760-Affiliate)" 
Cc: "" 
Subject: [EXTERNAL] Re: Docker image along with 1.23?

 

On Wed, 20 Nov 2019, Tim Allison wrote:

Eric Pugh recently asked on another channel if we had any plans to

release an official docker image for 1.23.

 

Depending on what we put in the container, we do need to be a little 

careful. There's "platform dependencies" under non-compatible licenses 

that we can optionally use if people have installed them, which we 

ourselves can't directly ship under ASF rules. (Tesseract is fine as 

that's Apache Licenses, Java itself is trickier, see the Netbeans 

discussions on legal-discuss@ and LEGAL jira)

 

Shipping an official docker container with the Tika Server on seems to me 

to be a helpful step for users, but we just need to make sure we're 

following ASF policies. (The Apache Software Foundation mission is to 

"provide software for the public good", but source code is the main focus 

for the mission, binaries are trickier!)

 

Nick

 



Re: [EXTERNAL] Tika 1.23?

2019-11-20 Thread Chris Mattmann
+1 ship it

 

 

 

From: Tim Allison 
Reply-To: "dev@tika.apache.org" , "Allison, Timothy B (US 
1760-Affiliate)" 
Date: Wednesday, November 20, 2019 at 9:07 AM
To: "" 
Subject: [EXTERNAL] Tika 1.23?

 

All,

  I've abandoned hope of getting the contenthandler factory configuration

stuff into 1.23.  We've added some new mime types, upgraded POI and made a

number of other useful changes.

  WDYT about kicking off regression tests shortly?  Any blockers?

 

  Best,

 

Tim

 



Re: [EXTERNAL] How to set the page segmentation for TIKA python

2019-11-13 Thread Chris Mattmann
Hi Aswathi,

 

Please check with dev@tika.apache.org.

 

Cheers,

Chris

 

 

 

 

From: Aswathi Nambiar 
Date: Wednesday, November 13, 2019 at 7:39 AM
To: "Mattmann, Chris A (US 1760)" 
Subject: [EXTERNAL] How to set the page segmentation for TIKA python

 

Hi Chris, 

 

I am using Apache Tika OCR from Python. Using parser.from_file, I am trying to 
extract text from an image. According to the documentation, the default psm 
seems to be 1. How do I change the psm to 6 using Python? 

I couldn’t find any documentation for this. I can find it for Java at the 
link below. 

https://cwiki.apache.org/confluence/display/tika/TikaOCR
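
One possible approach, offered as an untested sketch rather than documented tika-python behaviour: Tika Server accepts per-request OCR settings as HTTP headers prefixed with `X-Tika-OCR`, and tika-python's `parser.from_file` takes a `headers` argument. The header name `X-Tika-OCRPageSegMode` is an assumption derived from the `pageSegMode` field of `TesseractOCRConfig`, so check it against your Tika Server version. The helper names `ocr_headers` and `ocr_text` are hypothetical.

```python
def ocr_headers(psm):
    """Build request headers asking Tika Server for a given Tesseract PSM."""
    # Assumed header name: the "X-Tika-OCR" prefix plus the config field name.
    return {"X-Tika-OCRPageSegMode": str(int(psm))}


def ocr_text(path, psm=6):
    """Parse an image through tika-python with the requested PSM."""
    # Imported lazily: needs the tika package, Java, and a reachable Tika Server.
    from tika import parser

    parsed = parser.from_file(path, headers=ocr_headers(psm))
    return parsed.get("content")
```

Calling `ocr_text('/path/to/image.png', psm=6)` would then send the setting with that one request only, leaving the server defaults untouched.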

 

Regards,

Aswathi Nambiar

 


This e-mail, including accompanying communications and attachments, is strictly 
confidential and only for the intended recipient. Any retention, use or 
disclosure not expressly authorised by IHSMarkit is prohibited. This email is 
subject to all waivers and other terms at the following link: 
https://ihsmarkit.com/Legal/EmailDisclaimer.html

Please visit www.ihsmarkit.com/about/contact-us.html for contact information on 
our offices worldwide.




Re: [EXTERNAL] Extracting font information from xml

2019-10-15 Thread Chris Mattmann
Hi Jay, yes, I believe so. Tika Python is just a thin client to Tika Server and 
it
provides this functionality. CC’ing dev@tika

 

 

 

From: Jay Chuk 
Date: Tuesday, October 15, 2019 at 3:47 PM
To: "Mattmann, Chris A (US 1761)" 
Subject: [EXTERNAL] Extracting font information from xml

 

Hi Chris, 

 

Thanks for providing the Python package, Tika, for extracting text from 
PDFs.

 

I'd like to confirm whether it is possible, when converting PDF to XML, to get 
the font style for the text, e.g. the font type and whether the text is bold. 

I need this information to identify section headers and titles in the 
documents.

 

Please let me know if it is possible or if there is another way to go about 
this.

 

Thank you

Jay



Re: [EXTERNAL] Extracting font information from xml

2019-10-15 Thread Chris Mattmann
When you do a parse, do this:

 

from tika import parser

parsed = parser.from_file('/path/to/file', xmlContent=True)

xmlContent = parsed['content']

print(xmlContent)
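
Once the XHTML string is in hand, it can be walked with the standard library. The following is a minimal sketch, assuming the output is well-formed XHTML with one `<div class="page">` per PDF page; the `sample` string is made up for illustration, and whether any font or style information appears in the real output depends on the parser and its configuration.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "{http://www.w3.org/1999/xhtml}"


def paragraphs_by_page(xml_content):
    """Return a list of pages, each a list of paragraph strings."""
    root = ET.fromstring(xml_content)
    pages = []
    for div in root.iter(XHTML_NS + "div"):
        if div.get("class") == "page":
            pages.append(
                ["".join(p.itertext()).strip() for p in div.iter(XHTML_NS + "p")]
            )
    return pages


# Hypothetical stand-in for parsed['content'].
sample = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    '<div class="page"><p>1. Introduction</p><p>Body text.</p></div>'
    '<div class="page"><p>2. Methods</p></div>'
    "</body></html>"
)
print(paragraphs_by_page(sample))
```

Short paragraphs at the top of a page are then candidates for section headers, which may be enough when no font data is available.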

 

G’luck!

 

Cheers
Chris

 

 

 

 

From: Jay Chuk 
Date: Tuesday, October 15, 2019 at 3:54 PM
To: Chris Mattmann 
Cc: "dev@tika.apache.org" 
Subject: Re: [EXTERNAL] Extracting font information from xml

 

Thanks for the quick reply Chris. 

Is there a code snippet in Python for it, please?

 

Regards,

Jay 

 

On Tue, Oct 15, 2019 at 6:52 PM Chris Mattmann  wrote:

Hi Jay, yes, I believe so. Tika Python is just a thin client to Tika Server and 
it
provides this functionality. CC’ing dev@tika

 

 

 

From: Jay Chuk 
Date: Tuesday, October 15, 2019 at 3:47 PM
To: "Mattmann, Chris A (US 1761)" 
Subject: [EXTERNAL] Extracting font information from xml

 

Hi Chris, 

 

Thanks for providing the Python package, Tika, for extracting text from 
PDFs.

 

I'd like to confirm whether it is possible, when converting PDF to XML, to get 
the font style for the text, e.g. the font type and whether the text is bold. 

I need this information to identify section headers and titles in the 
documents.

 

Please let me know if it is possible or if there is another way to go about 
this.

 

Thank you

Jay



Re: [EXTERNAL] Urgent!!! Tika-python

2019-08-19 Thread Chris Mattmann
Hi,

 

Why not just do an os.walk or os.listdir in python, and then for each file, 
call Tika, e.g., 

 

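import os

import json

from tika import parser

 

src = '/some/path'

dst = '/some/other/path'

# os.listdir returns bare names, so join them back onto the directory

fs = [os.path.join(src, f) for f in os.listdir(src)]

fs = [f for f in fs if os.path.isfile(f) and f.lower().endswith(('.pdf', '.doc'))]

 

for f in fs:

    parsed = parser.from_file(f)

    # save each parsed result to its own JSON file

    out = os.path.join(dst, os.path.basename(f) + '.json')

    with open(out, 'w') as fh:

        json.dump(parsed, fh)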
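
For nested directories, the `os.walk` variant mentioned above could look like the sketch below. The function names and the default extension list are hypothetical choices for illustration; the Tika call itself is left out so the file discovery can be checked on its own.

```python
import os


def find_documents(root, extensions=(".pdf", ".doc", ".docx")):
    """Yield full paths of files under root whose names end with an extension."""
    suffixes = tuple(extensions)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if name.lower().endswith(suffixes):
                yield os.path.join(dirpath, name)


def output_path(src, out_dir):
    """Map one input file to its own JSON output file."""
    return os.path.join(out_dir, os.path.basename(src) + ".json")
```

Each path from `find_documents` can be passed to `parser.from_file` and written to `output_path(path, out_dir)`. Files with the same name in different subdirectories would overwrite each other, so a real script may want to include the relative directory in the output name.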

 

Cheers,

Chris

 

 

 

From: Victor Olaiya 
Date: Monday, August 19, 2019 at 8:28 AM
To: "Mattmann, Chris A (US 1761)" 
Subject: [EXTERNAL] Urgent!!! Tika-python

 

Hello, 

I sent a mail to the mailing list with no response, so I decided to mail you 
again.

I have been trying to extract text from all PDF, DOC, etc. files in a 
directory, and that has been impossible as Tika-python only allows parsing of 
files, not directories.

I was able to compress the files into a single zip file and extract that. This 
worked, but the extracted text was saved in a single file. I need the text to 
be saved in individual files so I can use them as input to another program.

 

What is the best method to go about this, please?

Thank you Chris Mattmann,

I await your reply.



Re: [EXTERNAL] TIKA

2019-08-11 Thread Chris Mattmann
Victor, please send your email to dev@tika.apache.org, which I’ve CC’ed…

 

 

 

From: Victor Olaiya 
Date: Tuesday, August 6, 2019 at 1:37 PM
To: "Mattmann, Chris A (US 1761)" 
Subject: [EXTERNAL] TIKA

 

Hello Chris,

I am building an information retrieval system and I need Apache Tika to 
auto-detect different file types, mainly docx, doc, pdf, html and xls, 
and extract text data from multiple files in a particular directory. How 
do I achieve this?

Thanks



Re: [EXTERNAL] Re: Merge flow

2019-07-10 Thread Chris Mattmann
I’ve also got some new stuff I’m getting ready to contribute, in the following 
ML/Deep Learning areas:

 
* Some basic models using TensorFlow stable 1.13.
* CIFAR-10 image classifier using a CNN, ~86% accuracy. Obviously different 
  than Inception-v3/v4 and VGG-16 which we currently have available, but 
  just another option and a good standard benchmark for image labels.
* Softmax LogReg Netflix Movie Review Sentiment Analyzer using TF. I’ll 
  create a Docker for this like the ImageRecog and VideoRecog dockers. 
  This can and should replace the existing OpenNLP one we have. Will be 
  much easier to use and less code.
* Audio Speaker Identification using TF and K-means: simple Python/TensorFlow 
  based Speaker Identification and Labeling in an audio file.
* Face Identification model using CNN: face/no face in an image. Will Dockerize.
 

Should be able to get that contributed in the next 2 months.

 

Cheers,

Chris

 

 

 

From: Sergey Beryozkin 
Reply-To: "dev@tika.apache.org" 
Date: Wednesday, July 10, 2019 at 7:05 AM
To: "" 
Subject: [EXTERNAL] Re: Merge flow

 

Thanks, I was just curious, hope to start contributing a bit more...

 

Sergey

 

On Wed, Jul 10, 2019 at 1:39 PM Tim Allison  wrote:

 

Y.  Although sometimes I flip the order. :D

 

If it matters or if I’m doing something wrong, let me know!

 

On Wed, Jul 10, 2019 at 4:52 AM Sergey Beryozkin 

wrote:

 

> Hi Tim

> 

> What is the current process for merging the fixes ? The fix goes to the

> master first and then it is cherry-picked into the branch_1x ?

> 

> Cheers, Sergey

> 

 

 



Re: [EXTERNAL] Re: Tika 1.22?

2019-06-25 Thread Chris Mattmann
Looks good…

 

 

 

From: Oleg Tikhonov 
Reply-To: "dev@tika.apache.org" 
Date: Tuesday, June 25, 2019 at 7:57 AM
To: "dev@tika.apache.org" 
Subject: [EXTERNAL] Re: Tika 1.22?

 

Would be great!!!

Cheers,

Oleg

 

On Tue, Jun 25, 2019, 17:45 Tim Allison  wrote:

 

All,

   The vote for the next version of PDFBox is under way.  I think we've

had a number of useful upgrades since our last release.  Any

objections to starting the release process for Tika 1.22 a week or so

after we integrate PDFBox?

 

  Cheers,

 

   Tim

 

 



Re: [EXTERNAL] Re: DL4JVGG16NetTest failures

2019-05-08 Thread Chris Mattmann
Yayy Tim

 

 

 

From: Tim Allison 
Reply-To: "dev@tika.apache.org" 
Date: Wednesday, May 8, 2019 at 8:53 AM
To: "dev@tika.apache.org" 
Subject: Re: [EXTERNAL] Re: DL4JVGG16NetTest failures

 

I think so. Works locally now. Actually works locally. :D

 

On Wed, May 8, 2019 at 11:50 AM Chris Mattmann  wrote:

 

Great work ☺ So it’s fixed?

From: Tim Allison 
Reply-To: "dev@tika.apache.org" 
Date: Wednesday, May 8, 2019 at 8:43 AM
To: "dev@tika.apache.org" 
Cc: Thamme Gowda N , Thejan Wijesinghe <thejan.k.wijesin...@gmail.com>
Subject: Re: [EXTERNAL] Re: DL4JVGG16NetTest failures

As is my habit, I should have kicked the tires a bit more before
asking for help.

I _think_ I fixed it.  I was able to reproduce it locally (darned
proxy!!!).  If anyone w knowledge of dl4j would be willing to confirm
that my commit/fix is sane, I'd appreciate it.  Thank you!!!

Cheers,

   Tim

On Wed, May 8, 2019 at 11:32 AM Chris Mattmann wrote:

Thejan, Thamme any ideas?

From: Tim Allison 
Reply-To: "dev@tika.apache.org" 
Date: Wednesday, May 8, 2019 at 7:50 AM
To: "dev@tika.apache.org" 
Subject: [EXTERNAL] Re: DL4JVGG16NetTest failures

Any recommendations?

java.lang.IllegalStateException: Number of indices (got 2) must be same as array rank (1) - indices [0, 0]
  at org.nd4j.base.Preconditions.throwStateEx(Preconditions.java:641)
  at org.nd4j.base.Preconditions.checkState(Preconditions.java:376)
  at org.nd4j.linalg.api.shape.Shape.getOffset(Shape.java:884)
  at org.nd4j.linalg.api.shape.Shape.getDouble(Shape.java:680)
  at org.nd4j.linalg.api.ndarray.BaseNDArray.getDouble(BaseNDArray.java:1906)
  at org.nd4j.linalg.api.ndarray.BaseNDArray.getInt(BaseNDArray.java:1851)
  at org.apache.tika.dl.imagerec.DL4JVGG16Net.predict(DL4JVGG16Net.java:147)
  at org.apache.tika.dl.imagerec.DL4JVGG16Net.recognise(DL4JVGG16Net.java:128)
  at org.apache.tika.parser.recognition.ObjectRecognitionParser.parse(ObjectRecognitionParser.java:118)

On Wed, May 8, 2019 at 10:11 AM Tim Allison  wrote:

Yay!  I am able to reproduce this on our vm... Onward to debugging...

On Wed, May 8, 2019 at 9:57 AM Tim Allison  wrote:

> All,
>   Apologies for the broken builds...I'm not able to reproduce this
> test failure on my mac or Windows machine.  I'm testing the build on
> our regression vm now.
>   If anyone has any idea why this is failing on our build vms (since
> we upgraded to dl4j-beta3), please let me know.
>   Thank you.
>
>  Best,
>
>  Tim
>
> https://builds.apache.org/job/tika-branch-1x/189/org.apache.tika$tika-dl/testReport/junit/org.apache.tika.dl.imagerec/DL4JVGG16NetTest/recognise/


Re: [EXTERNAL] Re: DL4JVGG16NetTest failures

2019-05-08 Thread Chris Mattmann
Great work ☺ So it’s fixed?

 

 

 

From: Tim Allison 
Reply-To: "dev@tika.apache.org" 
Date: Wednesday, May 8, 2019 at 8:43 AM
To: "dev@tika.apache.org" 
Cc: Thamme Gowda N , Thejan Wijesinghe 

Subject: Re: [EXTERNAL] Re: DL4JVGG16NetTest failures

 

As is my habit, I should have kicked the tires a bit more before

asking for help.

 

I _think_ I fixed it.  I was able to reproduce it locally (darned

proxy!!!).  If anyone w knowledge of dl4j would be willing to confirm

that my commit/fix is sane, I'd appreciate it.  Thank you!!!

 

Cheers,

 

  Tim

 

On Wed, May 8, 2019 at 11:32 AM Chris Mattmann  wrote:

 

Thejan, Thamme any ideas?

 

 

 

 

 

 

 

From: Tim Allison 

Reply-To: "dev@tika.apache.org" 

Date: Wednesday, May 8, 2019 at 7:50 AM

To: "dev@tika.apache.org" 

Subject: [EXTERNAL] Re: DL4JVGG16NetTest failures

 

 

 

Any recommendations?

 

 

 

java.lang.IllegalStateException: Number of indices (got 2) must be same as array rank (1) - indices [0, 0]
 at org.nd4j.base.Preconditions.throwStateEx(Preconditions.java:641)
 at org.nd4j.base.Preconditions.checkState(Preconditions.java:376)
 at org.nd4j.linalg.api.shape.Shape.getOffset(Shape.java:884)
 at org.nd4j.linalg.api.shape.Shape.getDouble(Shape.java:680)
 at org.nd4j.linalg.api.ndarray.BaseNDArray.getDouble(BaseNDArray.java:1906)
 at org.nd4j.linalg.api.ndarray.BaseNDArray.getInt(BaseNDArray.java:1851)
 at org.apache.tika.dl.imagerec.DL4JVGG16Net.predict(DL4JVGG16Net.java:147)
 at org.apache.tika.dl.imagerec.DL4JVGG16Net.recognise(DL4JVGG16Net.java:128)
 at org.apache.tika.parser.recognition.ObjectRecognitionParser.parse(ObjectRecognitionParser.java:118)

 

 

 

On Wed, May 8, 2019 at 10:11 AM Tim Allison  wrote:

 

 

 

Yay!  I am able to reproduce this on our vm... Onward to debugging...

 

 

 

On Wed, May 8, 2019 at 9:57 AM Tim Allison  wrote:

 

> 

 

> All,

 

>   Apologies for the broken builds...I'm not able to reproduce this

 

> test failure on my mac or Windows machine.  I'm testing the build on

 

> our regression vm now.

 

>   If anyone has any idea why this is failing on our build vms (since

 

> we upgraded to dl4j-beta3), please let me know.

 

>   Thank you.

 

> 

 

>  Best,

 

> 

 

>  Tim

 

> 

 

> 

 

> https://builds.apache.org/job/tika-branch-1x/189/org.apache.tika$tika-dl/testReport/junit/org.apache.tika.dl.imagerec/DL4JVGG16NetTest/recognise/

 

 

 

 



Re: [EXTERNAL] DL4JVGG16NetTest failures

2019-05-08 Thread Chris Mattmann
I will test this out

 

 

 

From: Tim Allison 
Reply-To: "dev@tika.apache.org" 
Date: Wednesday, May 8, 2019 at 6:58 AM
To: "dev@tika.apache.org" 
Subject: [EXTERNAL] DL4JVGG16NetTest failures

 

All,

  Apologies for the broken builds...I'm not able to reproduce this

test failure on my mac or Windows machine.  I'm testing the build on

our regression vm now.

  If anyone has any idea why this is failing on our build vms (since

we upgraded to dl4j-beta3), please let me know.

  Thank you.

 

 Best,

 

 Tim

 

 

https://builds.apache.org/job/tika-branch-1x/189/org.apache.tika$tika-dl/testReport/junit/org.apache.tika.dl.imagerec/DL4JVGG16NetTest/recognise/

 



Re: [EXTERNAL] Re: DL4JVGG16NetTest failures

2019-05-08 Thread Chris Mattmann
Thejan, Thamme any ideas? 

 

 

 

From: Tim Allison 
Reply-To: "dev@tika.apache.org" 
Date: Wednesday, May 8, 2019 at 7:50 AM
To: "dev@tika.apache.org" 
Subject: [EXTERNAL] Re: DL4JVGG16NetTest failures

 

Any recommendations?

 

java.lang.IllegalStateException: Number of indices (got 2) must be same as array rank (1) - indices [0, 0]
at org.nd4j.base.Preconditions.throwStateEx(Preconditions.java:641)
at org.nd4j.base.Preconditions.checkState(Preconditions.java:376)
at org.nd4j.linalg.api.shape.Shape.getOffset(Shape.java:884)
at org.nd4j.linalg.api.shape.Shape.getDouble(Shape.java:680)
at org.nd4j.linalg.api.ndarray.BaseNDArray.getDouble(BaseNDArray.java:1906)
at org.nd4j.linalg.api.ndarray.BaseNDArray.getInt(BaseNDArray.java:1851)
at org.apache.tika.dl.imagerec.DL4JVGG16Net.predict(DL4JVGG16Net.java:147)
at org.apache.tika.dl.imagerec.DL4JVGG16Net.recognise(DL4JVGG16Net.java:128)
at org.apache.tika.parser.recognition.ObjectRecognitionParser.parse(ObjectRecognitionParser.java:118)

 

On Wed, May 8, 2019 at 10:11 AM Tim Allison  wrote:

 

Yay!  I am able to reproduce this on our vm... Onward to debugging...

 

On Wed, May 8, 2019 at 9:57 AM Tim Allison  wrote:

> 

> All,

>   Apologies for the broken builds...I'm not able to reproduce this

> test failure on my mac or Windows machine.  I'm testing the build on

> our regression vm now.

>   If anyone has any idea why this is failing on our build vms (since

> we upgraded to dl4j-beta3), please let me know.

>   Thank you.

> 

>  Best,

> 

>  Tim

> 

> 

> https://builds.apache.org/job/tika-branch-1x/189/org.apache.tika$tika-dl/testReport/junit/org.apache.tika.dl.imagerec/DL4JVGG16NetTest/recognise/

 



Re: [EXTERNAL] Tika script

2019-04-26 Thread Chris Mattmann
Hi,

 

This would be a good question to ask on the dev@tika.a.o list so I’m CC’ing 
them.

 

Cheers,

Chris

 

 

From: Djari Imene 
Date: Friday, April 26, 2019 at 9:45 AM
To: "Mattmann, Chris A (1761)" 
Subject: [EXTERNAL] Tika script

 

Good evening sir, I am writing to request more information about how I can 
parse an Arabic PDF to XML. I tried to convert it to text using your script, 
but it gives the wrong characters. I would be very thankful if you could help 
me fix this problem. 



Re: [EXTERNAL] Wiki migration

2019-03-21 Thread Chris Mattmann
+1 from me!

 

 

 

From: Konstantin Gribov 
Reply-To: "dev@tika.apache.org" 
Date: Thursday, March 21, 2019 at 10:02 AM
To: "dev@tika.apache.org" 
Subject: [EXTERNAL] Wiki migration

 

Hi, folks

 

What do you think about starting wiki migration (from moin to confluence)?

 

I can try it via selfservice.a.o if you consent but I'm not sure if I have

enough access to do so. Maybe only Tim as PMC Chair can.

 

-- 

Best regards,

Konstantin Gribov.

 



Re: 1.20?

2018-12-13 Thread Chris Mattmann
Roll forward! Yay!

 

 

 

From: Tim Allison 
Reply-To: "dev@tika.apache.org" 
Date: Thursday, December 13, 2018 at 7:02 AM
To: "dev@tika.apache.org" 
Subject: Re: 1.20?

 

Reports are here:

 

http://162.242.228.174/reports/tika_1_20-pre-rc1.zip

 

I'm going to revert the mp4 parser, and commit the few dependency

upgrades I ran.

 

The _major_ difference in content for ppt is explained by the

duplication of header/footer info.  To confirm this, note that the

values for "num_unique_tokens_a" and "num_unique_tokens_b" are

identical for nearly all ppt->ppt, but there are far more tokens in

"num_tokens_a" vs "num_tokens_b".
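
The reasoning above can be illustrated with a toy example: if text is only duplicated, the total token count rises while the unique token count stays the same. This is a sketch only. It uses whitespace splitting as a stand-in for tika-eval's real tokenizer, and the slide and footer strings are invented.

```python
def token_stats(text):
    """Count total and unique whitespace-separated tokens, case-insensitively."""
    tokens = text.lower().split()
    return {"num_tokens": len(tokens), "num_unique_tokens": len(set(tokens))}


footer = "acme confidential"
slides = ["quarterly results", "next steps"]

# Extract "a" repeats the footer twice per slide; extract "b" has it once.
extract_a = " ".join(s + " " + footer + " " + footer for s in slides)
extract_b = " ".join(s + " " + footer for s in slides)

print(token_stats(extract_a))
print(token_stats(extract_b))
```

Both extracts share the same vocabulary, so a large gap in total tokens with matching unique counts points to repetition rather than to new or lost content.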

 

I also see that we're losing content in x-java and x-groovy, etc., but

that's because we're now suppressing the style markup that our parser

was (incorrectly, IMHO) inserting -- check the values in

"top_10_unique_token_diffs_a", e.g.: rgb: 15 | color: 14 | font: 9 |

0,0,0: 4 | background: 4 | 147,147,147: 3 | 247,247,247: 3 | bold: 3 |

weight: 3 | family: 2

 

In short, I think we're good to go.  Will roll rc1 later today or

(more likely) tomorrow unless there are objections.

On Mon, Dec 10, 2018 at 9:37 PM Tim Allison  wrote:

 

Any blockers on 1.20?  I'm going to kick off the regression tests shortly.

On Fri, Nov 30, 2018 at 7:39 PM  wrote:

> 

> Hi,

> On Wed, 21 Nov 2018 at 13:00, Tim Allison  wrote:

> 

> > Dave,

> >   Should I try to get the Docker plugin working again?

> >

> 

> That would be great. I think I may have gone down the wrong path building

> an image at package time, as there doesn't seem to be an easy way to

> publish it as an Apache labelled org on Dockerhub unless it builds from

> source.

> 

> I have some time over the weekend, so could update to where I got to and

> see what you think.

> 

> Cheers,

> Dave

 



Re: 1.20?

2018-11-20 Thread Chris Mattmann
Love it and I can align tika-python with that too ☺

 

 

 

From: Tim Allison 
Reply-To: "dev@tika.apache.org" 
Date: Tuesday, November 20, 2018 at 3:04 PM
To: "dev@tika.apache.org" 
Subject: 1.20?

 

All,

   POI 4.0.1 will be out shortly with some important bug fixes.  What would

you all think of targeting 1st/2nd week of December for 1.20?

 

 Cheers,

 Tim

 



Re: ***UNCHECKED*** Fwd: MODERATE for annou...@apache.org

2018-09-26 Thread Chris Mattmann
+1 from me please update the wiki once you do

 

 

 

From: Tim Allison 
Reply-To: "dev@tika.apache.org" 
Date: Wednesday, September 26, 2018 at 5:47 AM
To: "dev@tika.apache.org" 
Cc: Craig Russell 
Subject: Re: ***UNCHECKED*** Fwd: MODERATE for annou...@apache.org

 

All,

 

It is ok to include the sha512 checksums in text on the page but you also need 
an https link to the checksum.

 

It feels from the above like the checksums on the page are ok, but

what really matters are the checksums via the https links.  If this is

the case, would anyone object to getting rid of the checksums on our

webpage and just using the https links?  This would save a tedious

manual step of updating the website w each release.
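
For users on the receiving end, checking a download against the linked checksum file is straightforward. The sketch below assumes the `.sha512` file starts with the hex digest, optionally followed by a file name; other layouts exist, so treat the parsing as an assumption. The function names are hypothetical.

```python
import hashlib


def sha512_of(path, chunk_size=1 << 20):
    """Return the SHA-512 hex digest of a file, read in chunks."""
    digest = hashlib.sha512()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(actual_hex, checksum_text):
    """Compare a digest with the first field of a checksum file's contents."""
    fields = checksum_text.split()
    return bool(fields) and fields[0].lower() == actual_hex.lower()
```

A caller would fetch the checksum text over https from the Apache distribution site and pass it in with `sha512_of('/path/to/download')`.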

 

Thank you.

 

Cheers,

 

  Tim

On Wed, Sep 19, 2018 at 1:25 PM Craig Russell  wrote:

 

Hi Tim,

 

Download page looks good now. Thanks for taking care of this so expeditiously.

 

Regards,

 

Craig

 

On Sep 19, 2018, at 8:35 AM, Tim Allison  wrote:

 

Accidentally dropped Craig in the last email.  Doh!

 

Craig,

 

I just fixed our downloads page...I think.  Let me know if we need to

do anything else...or if I botched anything in the announcement email.

 

Thank you, again.

On Wed, Sep 19, 2018 at 11:10 AM Tim Allison  wrote:

 

 

Thank you, Craig.

 

To confirm, I got the info right in the announcement email...what we

need to fix is our downloads page.  I can do that now.

 

Thank you, again.

 

Cheers,

 

  Tim

On Wed, Sep 19, 2018 at 10:50 AM Private LIst Moderation

 wrote:

 

 

Hi Tika devs,

 

I've moderated this announcement due to the urgency of the release.

 

For future releases, please change the downloads page:

 

It is ok to include the sha512 checksums in text on the page but you also need 
an https link to the checksum.

 

The link to the KEYS should link to the KEYS file in your distribution 
directory. The people.apache.org site should not be used.

 

Regards,

 

Craig

 

Announcements of Apache project releases must contain a link to the relevant

download page, which might be hosted on an Apache site or a third party site

such as github.com . [1]

 

The download page must provide public download links where current official

source releases and accompanying cryptographic files may be obtained. [2]

 

Links to the download artifacts must support downloads from mirrors. Links to

metadata (SHA, ASC) must be from https://www.apache.org/dist/ 
/

** MD5 is no longer considered useful and should not be used. SHA is required. 
**

Links to KEYS must be from https://www.apache.org/dist/ 
/ not release

specific.

 

Announcements that contain a link to the dyn/closer page alone will be

rejected by the moderators.

 

Announcements that contain a link to a web page that does not include a link

to a mirror to the artifact plus links to the signature and at least one sha

checksum will be rejected.

 

Announcements that link to dist.apache.org  will not 
be accepted.

Likewise ones which link to SVN or Git code repos.

 

[1] http://www.apache.org/legal/release-policy.html#release-announcements 
[2] 
https://www.apache.org/dev/release-distribution#download-links 


 

 

Begin forwarded message:

 

From: announce-reject-1537286294.48587.jgagknhepoajmbbkh...@apache.org

Subject: MODERATE for annou...@apache.org

Date: September 18, 2018 at 8:58:14 AM PDT

To: Recipient list not shown: ;

Cc: 
announce-allow-tc.1537286294.efngohokkjgkacicfpnk-tallison=apache@apache.org

Reply-To: announce-accept-1537286294.48587.jgagknhepoajmbbkh...@apache.org

 

 

To approve:

  announce-accept-1537286294.48587.jgagknhepoajmbbkh...@apache.org

To reject:

  announce-reject-1537286294.48587.jgagknhepoajmbbkh...@apache.org

To give a reason to reject:

%%% Start comment

%%% End comment

 

 

From: Tim Allison 

Subject: [ANNOUNCE] Apache Tika 1.19 released

Date: September 18, 2018 at 8:58:02 AM PDT

To: dev@tika.apache.org, u...@tika.apache.org, annou...@apache.org

 

 

The Apache Tika project is pleased to announce the release of Apache

Tika 1.19. The release contents have been pushed out to the main

Apache release site and to the Maven Central sync, so the releases

should be available as soon as the mirrors get the syncs.

 

Apache Tika is a toolkit for detecting and extracting metadata and

structured text content from various documents using existing parser

libraries.

 

Apache Tika 1.19 contains a number of improvements and bug fixes.

Details can be found in the changes file:

http://www.apache.org/dist/tika/CHANGES-1.19.txt

 

Apache Tika is available on the download page:

http://tika.apache.org/download.html

 

Apache Tika is also available in binary form or for use using Maven 

Re: 1.19.1?

2018-09-25 Thread Chris Mattmann
Sounds great!

 

 

 

From: Tim Allison 
Reply-To: "dev@tika.apache.org" 
Date: Tuesday, September 25, 2018 at 9:40 AM
To: "dev@tika.apache.org" 
Subject: Re: 1.19.1?

 

Given the mp3 issue and some other items, let's go with 1.19.1 rc1

today or tomorrow?

On Mon, Sep 24, 2018 at 3:07 PM Nick Burch  wrote:

 

On Mon, 24 Sep 2018, Tim Allison wrote:

> Aside from the problem with users and non-standard XML parsers, were

> there any other show-stoppers in POI 4.0.0?  Is there a reason to wait

> for POI 4.0.1?

 

I think, in terms of Tika affecting bugs, it was the xml parser stuff, and

commons compress missing from the pom.

 

Nick

 



Re: 1.19.1?

2018-09-21 Thread Chris Mattmann
Let’s roll it….

 

 

 

From: Tim Allison 
Reply-To: "dev@tika.apache.org" 
Date: Wednesday, September 19, 2018 at 12:14 PM
To: "dev@tika.apache.org" 
Subject: 1.19.1?

 

The mp3 regression is bad. In hindsight, the Tika-eval reports were fairly

clear on this but I did some self-hand-waving to excuse away the

numbers...I shouldn’t have.

 

I want to add some new reports to tika-eval so that this never happens

again.

 

How long should we wait for 1.19.1 or 1.20?

 

Best,

 

Tim

 

On Wed, Sep 19, 2018 at 2:29 PM Hudson (JIRA)  wrote:

 

 

 [

https://issues.apache.org/jira/browse/TIKA-2730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16621008#comment-16621008

]

 

Hudson commented on TIKA-2730:

--

 

SUCCESS: Integrated in Jenkins build tika-branch-1x #94 (See [

https://builds.apache.org/job/tika-branch-1x/94/])

TIKA-2730 -- allow last frame to be truncated w/o throwing an EOF

(tallison: [

https://github.com/apache/tika/commit/80cfd6d4a4270f8f3697c6dc083b3dedfc36c86a

])

* (edit)

tika-parsers/src/main/java/org/apache/tika/parser/mp3/MpegStream.java

* (edit)

tika-parsers/src/test/java/org/apache/tika/parser/mp3/Mp3ParserTest.java

* (add)

tika-parsers/src/test/resources/test-documents/testMP3i18n_truncated.mp3

* (edit)

tika-parsers/src/main/java/org/apache/tika/parser/mp3/Mp3Parser.java

 

 

> parseToString fails for a simple mp3

> 

> 

> Key: TIKA-2730

> URL: https://issues.apache.org/jira/browse/TIKA-2730

> Project: Tika

>  Issue Type: Bug

>Affects Versions: 1.19

>Reporter: Boris Petrov

>Assignee: Tim Allison

>Priority: Major

> Fix For: 2.0.0, 1.20

> 

> Attachments: demo.mp3

> 

> 

> This is a regression from 1.18. I've attached the mp3 that fails. The
> exception I get is:
> {noformat}
> org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.mp3.Mp3Parser@cefe6c6
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:286)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
> at org.apache.tika.Tika.parseToString(Tika.java:527)
> at com.company.TextExtractor.getText(TextExtractor.java:39)
> Caused by:
> java.io.EOFException: EOF: tried to skip 361 but could only skip 247
> at org.apache.tika.parser.mp3.MpegStream.skipFrame(MpegStream.java:166)
> at org.apache.tika.parser.mp3.Mp3Parser.getAllTagHandlers(Mp3Parser.java:204)
> at org.apache.tika.parser.mp3.Mp3Parser.parse(Mp3Parser.java:71)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> ... 5 more
> {noformat}

 

 

 

--

This message was sent by Atlassian JIRA

(v7.6.3#76005)

 

 



FW: Tika DjVu?

2018-08-01 Thread Chris Mattmann
 

 

 

 

From: KamilD 
Date: Tuesday, July 31, 2018 at 11:37 PM
To: "dev-ow...@tika.apache.org" 
Subject: Tika DjVu?

 

Hello,

I'm trying to use Tika for DjVu, but there is a problem.

When using app version 1.14 I get an empty result, but in version 1.18 I get:

C:\Users\>java -jar D:\djvu\tika-app-1.18.jar -t -eUTF-8 D:\djvu\Test_rs20846.djvu

jul 24, 2018 4:09:17 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.

jul 24, 2018 4:09:18 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: org.xerial's sqlite-jdbc is not loaded.
Please provide the jar on your classpath to parse sqlite files.
See tika-parsers/pom.xml for the correct version.

For PDF, everything is OK.

Does Tika support DjVu?

 

Regards

Michal



Re: image recognition...how do the parts play together?

2018-07-06 Thread Chris Mattmann
Yes, there is a big reason: you don’t need an external server running to use 
it with tika-dl. And of course you can statically analyze the code (with the 
other solution you have to mix languages to do that), etc.

 

So yes, we should keep them both…

 

 

 

 

From: Tim Allison 
Reply-To: "dev@tika.apache.org" 
Date: Friday, July 6, 2018 at 4:30 PM
To: "dev@tika.apache.org" 
Subject: Re: image recognition...how do the parts play together?

 

This is very helpful. Thank you! Is there any use in having the tika-dl

module if our more modern approach is REST + Docker? The upkeep in tika-dl

is nontrivial.

 


 



Re: image recognition...how do the parts play together?

2018-07-06 Thread Chris Mattmann
Tim,

 

Thanks. There are multiple modes of integrating deep learning with Tika:

 
1. The original mode uses Thamme’s work on exposing Tensorflow over REST, with
   Docker, to provide a REST service to Tika for running Tensorflow DL models.
   We initially did Inception_v3, and a model by Madhav Sharan that combines
   OpenCV with Inception v3 (and a new Docker image that installs OpenCV, which
   is a pain), for image and video object recognition, respectively. See
   https://github.com/apache/tika/pull/208 and
   https://github.com/apache/tika/pull/168 and also the wiki.
2. Later, Thamme, Avtar Singh, and KranthiGV added DL4J support
   (https://github.com/apache/tika/pull/165), including Inceptionv3 and VGG16
   (https://github.com/apache/tika/pull/182). This houses the model in the USC
   Data Science repo and uses it as an example of how to store and load models
   from Keras/Python into DL4J:
   https://github.com/USCDataScience/dl4j-kerasimport-examples/tree/master/dl4j-import-example/data
3. Then Thejan added Text Captioning, a new Docker image, and a trained model:
   https://github.com/apache/tika/pull/180
4. Then Raunaq from UPenn added Inception v4 support via the Docker/Tensorflow
   way: https://github.com/apache/tika/pull/162
5. All this Docker work caused Thejan and others to think we needed to refactor
   the Dockers. We did that here: https://github.com/apache/tika/pull/208, to
   make them cleaner and to depend on
   http://github.com/USCDataScience/tika-dockers/ and on the
   http://github.com/USCDataScience/img2text models for image captioning. Now
   video and image recognition and image captioning all have the same base
   Docker, with sub-Dockers derived from it.
 

That’s where we’re at today. Make sense? ☺ Thejan and others want to add more 
DL4J supported models
and we can always use Tensorflow/Docker as well as a way of doing it.

 

Cheers,

Chris

 

 

 

 

From: Tim Allison 
Reply-To: "dev@tika.apache.org" 
Date: Friday, July 6, 2018 at 2:39 PM
To: "dev@tika.apache.org" 
Subject: image recognition...how do the parts play together?

 

On Twitter, Chris, Thamme, Thejan, and I are working with some

deeplearning4j devs to help us upgrade to deeplearning4j 1.0.0-BETA

(TIKA-2672).

 

I initially requested help from Thejan (and Thamme :D) for this because we

were getting an initialization exception after the upgrade in tika-dl's

DL4JInceptionV3Net.

 

According to our wiki[2], we upgraded to InceptionV4 in Tika-2306 by adding

the TensorFlowRESTRecogniser...does this mean we can get rid of

DL4JInceptionV3Net?  Or, what are we actually asking the dl4j folks to help

with?

 

How do these recognizers play together?

 

Thank you.

 

Cheers,

 

 Tim

 

[1] e.g.  https://twitter.com/chrismattmann/status/1015340483923439617

[2] https://wiki.apache.org/tika/TikaAndVision

 



Re: Tika 1.19?

2018-07-06 Thread Chris Mattmann
Once tika-dl works again with Inception v4, I’m good ☺

 

I’m working on adding some more models to tika-dl and other things
but those can come after 1.19.

 

Cheers,

Chris

 

 

 

From: Tim Allison 
Reply-To: "dev@tika.apache.org" 
Date: Friday, July 6, 2018 at 8:40 AM
To: "dev@tika.apache.org" 
Subject: Tika 1.19?

 

All,

 

  We've made quite a few improvements, what would you think of starting the

release process in a couple of weeks...say, July 23ish?

 

  I'd like to complete the dl4j upgrade and update some of our dependencies

so that we can at least build with Java 11.

 

  Any blockers or other things people want to get in?

 

   Cheers,

 

Tim

 



Re: Branch_1x build broke?

2018-05-24 Thread Chris Mattmann
Thanks Dave, yes I have tesseract enabled and this is on my Mac Book.

Thanks for looking into it Dave…

 

Cheers,

Chris

 

 

 

From: "loo...@gmail.com" <loo...@gmail.com>
Reply-To: "dev@tika.apache.org" <dev@tika.apache.org>
Date: Thursday, May 24, 2018 at 11:34 AM
To: "dev@tika.apache.org" <dev@tika.apache.org>
Subject: Re: Branch_1x build broke?

 

Hey Chris,

 

This is happening to me with Tesseract enabled but only on my MacBook.

 

Are you running this on OSX?

 

Been trying to get some time to dig into it as it works perfectly on my

Windows and Linux setups.

 

Cheers,

Dave

 

 

 


 



Branch_1x build broke?

2018-05-24 Thread Chris Mattmann
Tim,

 

Are you seeing this?

 

Results :

 

Failed tests: 

  PDFParserTest.testEmbeddedDocsWithOCROnly:1250->TikaTest.assertContains:103 
pdf_haystack not found in:

<html xmlns="http://www.w3.org/1999/xhtml">

Outer_haystack
Outer_haystack
Outer_haystack
Outer_haystack
Outer_haystack

attached.pdf

dehayslack dehaystack dehayslack dehaystack dehaystack dehaystack pd'

Haystack
Needle
Haystack

 

Tests run: 1009, Failures: 1, Errors: 0, Skipped: 30

 

[INFO]
[INFO] Reactor Summary:
[INFO]
[INFO] Apache Tika parent ............................... SUCCESS [  1.565 s]
[INFO] Apache Tika core ................................. SUCCESS [ 32.977 s]
[INFO] Apache Tika parsers .............................. FAILURE [05:52 min]
[INFO] Apache Tika XMP .................................. SKIPPED
[INFO] Apache Tika serialization ........................ SKIPPED
[INFO] Apache Tika batch ................................ SKIPPED
[INFO] Apache Tika language detection ................... SKIPPED
[INFO] Apache Tika application .......................... SKIPPED
[INFO] Apache Tika OSGi bundle .......................... SKIPPED
[INFO] Apache Tika translate ............................ SKIPPED
[INFO] Apache Tika server ............................... SKIPPED
[INFO] Apache Tika examples ............................. SKIPPED
[INFO] Apache Tika Java-7 Components .................... SKIPPED
[INFO] Apache Tika eval ................................. SKIPPED
[INFO] Apache Tika Deep Learning (powered by DL4J) ..... SKIPPED
[INFO] Apache Tika Natural Language Processing .......... SKIPPED
[INFO] Apache Tika ...................................... SKIPPED
[INFO]
[INFO] BUILD FAILURE
[INFO]
[INFO] Total time: 06:27 min
[INFO] Finished at: 2018-05-24T09:04:59-07:00
[INFO] Final Memory: 72M/1029M
[INFO]
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.18.1:test (default-test) on project tika-parsers: There are test failures.
[ERROR]
[ERROR] Please refer to /Users/mattmann/tmp/tika2.0.0/tika-parsers/target/surefire-reports for the individual test results.
[ERROR] -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR]   mvn <goals> -rf :tika-parsers

 

Keeps failing for me.

nonas:tika2.0.0 mattmann$ java -version

java version "1.8.0_144"

Java(TM) SE Runtime Environment (build 1.8.0_144-b01)

Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)

nonas:tika2.0.0 mattmann$ 

 

Any ideas?

 

Cheers,

Chris

 



Welcome Thejan Wijesinghe as an Apache Tika PMC and committer!

2018-05-07 Thread Chris Mattmann
Welcome to Thejan Wijesinghe who has joined as a new Tika PMC member and 
committer!

 

Please say a bit about yourself…thanks!

 

Cheers,

Chris

 

 

 



Re: rfc822 updates and 1.18

2018-04-06 Thread Chris Mattmann
Awesomeness

 

 

 

From: "Allison, Timothy B." 
Reply-To: "dev@tika.apache.org" 
Date: Friday, April 6, 2018 at 11:30 AM
To: "dev@tika.apache.org" 
Subject: rfc822 updates and 1.18

 

All,

I made two updates to our handling of rfc822 files and reran the eval against 
what Tika 1.18-SNAPSHOT thinks are rfc822 files.  The reports are available 
here:

 

http://162.242.228.174/reports/tika_1_18-SNAPSHOT_rfc822_concat_reports.tbz

 

I _think_ we're good to go...  I'll roll the RC1 on Monday unless there are 
objections.

 

 Best,

 

  Tim

 

 



Re: message/news; charset=windows-1252 -> message/rfc822

2018-03-28 Thread Chris Mattmann
+1

 

 

From: Nick Burch 
Reply-To: "dev@tika.apache.org" 
Date: Wednesday, March 28, 2018 at 8:01 AM
To: "dev@tika.apache.org" 
Subject: Re: message/news; charset=windows-1252 -> message/rfc822

 

On Wed, 28 Mar 2018, Allison, Timothy B. wrote:

> With the new mime patterns, we've gotten quite a few changes of 
> message/news being identified as message/rfc822.  An example is:
>
> http://162.242.228.174/docs/commoncrawl2/DA/DALFSFPD6FX4GGZ6EEJQA6RABA7OXIF5

That looks like a regression to me; it's really news.

> We should correct this, right?  Any recommendations?

I think it's the Message-ID header it's matching on. I'd suggest we bump 
the news magics up from 50 (same as rfc822) to 60, so the news ones take 
precedence.
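
To illustrate why bumping the priority would change the outcome, here is a minimal sketch in Python. It is not Tika's detector code: the function name `pick_type`, the (type, priority) pairs, and the tie-breaking rule (first listed wins) are all assumptions made for the example.

```python
# Hypothetical priority-based magic selection; not Tika's actual detector code.
def pick_type(matches):
    """Return the MIME type of the highest-priority match.

    `matches` is a list of (mime_type, priority) pairs. On a tie, max()
    keeps the first pair, so list order decides (an assumption of this sketch).
    """
    return max(matches, key=lambda match: match[1])[0]


# Both magics at priority 50: the tie goes to whichever is listed first.
print(pick_type([("message/rfc822", 50), ("message/news", 50)]))
# News bumped to 60: news now wins regardless of order.
print(pick_type([("message/rfc822", 50), ("message/news", 60)]))
```

With equal priorities the result depends on an arbitrary ordering; giving the news magics a higher number removes that dependence.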

 

Nick

 



R-Tika API Binding

2018-03-20 Thread Chris Mattmann
Hey Folks,

 

Just found this R-Tika API binding:

 

https://ropensci.github.io/rtika/articles/rtika_introduction.html

 

Very cool! Updated the wiki with it.


Cheers,

Chris

 

 

 



Re: TIKA-1509 (2.x breaking parser change) - ready for first review!

2018-03-18 Thread Chris Mattmann
Completely agree, awesome job Nick.

I will definitely try this week as well.

Thank you!

Sincerely,
Chris



On 3/18/18, 2:47 PM, "David Meikle"  wrote:

Nice one Nick!  Will take a look this week.

Cheers,
Dave

On 14 March 2018 at 17:38, Nick Burch  wrote:

> Hi All
>
> As promised, I've finally had a go to try and implement my ideas for
> TIKA-1509 / https://wiki.apache.org/tika/CompositeParserDiscussion /
> breaking 2.x parser change
>
> My work so far is in this github branch, and is ready for review!
> https://github.com/apache/tika/tree/multiple-parsers
>
>
> It seems to work fine for the Fallback case, and for the Supplemental
> case. You can set a policy that controls how clashing metadata is handled,
> currently "first one to set a key wins", "last one to set a key wins",
> "ignore previous parsers", and "keep old and new unique values"
>
> I've also done a proof of concept for "pick best" case, to try running the
> text parser with a specified set of different charsets, capture the text
> from each, "pick the best" (hard coded 1st...) then run for real with that
> one.
>
>
> Key TODOs - Support InputStreamFactory, properly work out what mimetypes
> to claim to support, Tika Config XML friendly helper for the metadata
> clash policy, review ContentHandlerFactory signature and tweak if needed.
>
> Proposed breaking 2.x change - add second parse method that takes
> ContentHandlerFactory instead of ContentHandler, with most parsers getting
> that just grabbing a single one and using that as before
>
>
> Before I do any more though... Thoughts? Comments? Ideas? Changes? Should
> I stop? Carry on? Modify it? Other?
>
> Nick
>
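
The four metadata clash policies named in the quoted message can be sketched roughly as follows. This is an illustrative Python model, not the code in the multiple-parsers branch: the names `ClashPolicy` and `merge` are hypothetical, metadata is modelled as a dict of key to list of values, and "ignore previous parsers" is interpreted here as discarding everything earlier parsers set.

```python
from enum import Enum


class ClashPolicy(Enum):
    """Hypothetical names for the four policies described in the thread."""
    FIRST_WINS = "first one to set a key wins"
    LAST_WINS = "last one to set a key wins"
    DISCARD_ALL = "ignore previous parsers"
    KEEP_ALL = "keep old and new unique values"


def merge(old, new, policy):
    """Merge metadata from a later parser (`new`) into earlier metadata (`old`)."""
    if policy is ClashPolicy.DISCARD_ALL:
        # Assumed meaning: only the latest parser's metadata survives.
        return {key: list(values) for key, values in new.items()}
    merged = {key: list(values) for key, values in old.items()}
    for key, values in new.items():
        if key not in merged:
            merged[key] = list(values)          # no clash: just add the key
        elif policy is ClashPolicy.LAST_WINS:
            merged[key] = list(values)          # later parser overwrites
        elif policy is ClashPolicy.KEEP_ALL:
            merged[key] += [v for v in values if v not in merged[key]]
        # FIRST_WINS: keep the existing values untouched
    return merged


old = {"title": ["A"], "author": ["X"]}
new = {"title": ["B"], "pages": ["3"]}
for policy in ClashPolicy:
    print(policy.name, merge(old, new, policy))
```

Only clashing keys are affected by the first, second, and fourth policies; keys set by a single parser pass through unchanged.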





Re: Tika 1.18?

2018-03-07 Thread Chris Mattmann
Sounds good to me thanks Tim. Happy to line it up with PDF Box 2.0.9


On 3/7/18, 1:16 PM, "Allison, Timothy B."  wrote:

All,

  I think I've made the updates that I wanted to make sure got in to 1.18.  
It looks like PDFBox is going to start their release cycle shortly.  Should we 
wait for PDFBox 2.0.9?

  That may add a week or two to our release, although, frankly, it might 
not.  We can start running the regression tests March 9(ish) and see if 
anything dire appears...

  Cheers,

  Tim






Re: Tika 1.18?

2018-03-01 Thread Chris Mattmann
Same: makes perfect sense to me, and let's do it. (I just, finally, updated Tika 
Python downstream to be based on Tika 1.16; I guess I should get it based on 
1.17 soon too.)

https://github.com/chrismattmann/tika-python/blob/master/tika/__init__.py#L17

Cheers,
Chris

On 3/1/18, 5:16 AM, "Nick Burch"  wrote:

On Thu, 1 Mar 2018, Allison, Timothy B. wrote:
> There have been some important bug fixes, a few new capabilities, and 
> the upgrading of dependencies because of CVEs.  There are a bunch of 
> mime tickets from Andreas Meier that I’d like to get into 1.18.  Is 
> there anything else that is critical?

I've had a busy few weeks, so haven't yet had a chance to try out my 
proposed multi-parser stuff for 2.x. I'll hopefully take a look next week, 
assuming even the fastest review cycle and everyone loving it, I can't see 
us being ready to all sign-off on those "2.x breaking changes" until 
probably April.

Given that, doing an interim 1.x release soon makes sense to me!

Nick




Re: RE : Re: Issue with apache Tika

2018-02-24 Thread Chris Mattmann
No clue - Radhia - perhaps you can enlighten everyone..?


On 2/23/18, 6:45 AM, "Allison, Timothy B." <talli...@mitre.org> wrote:

Um, no, that's not great.  What's wrong with our current version? 


 






Re: RE : Re: Issue with apache Tika

2018-02-22 Thread Chris Mattmann
Great to hear!

 

 

From: radhia bezzine <bezzinerad...@gmail.com>
Date: Thursday, February 22, 2018 at 12:28 PM
To: Chris Mattmann <mattm...@apache.org>
Subject: Re: RE : Re: Issue with apache Tika

 

Hi Chris!

 

I fixed the issue! It was not so complicated: it was a version problem. The 
most recent version doesn't work for me, but version 1.15 works fine.

 

Thank you very much.

 

Good Night !

 

On Thu, Feb 22, 2018 at 6:42 PM, bezzineradhia <bezzinerad...@gmail.com> wrote:

Hello!

 

Thanks, I'll try it tomorrow! I'll let you know!

 

Regards !

Radhia

 

 

 

Sent from my Samsung Galaxy smartphone.

-------- Original message --------

From: Chris Mattmann <mattm...@apache.org> 

Date: 22/02/2018 18:31 (GMT+01:00) 

To: radhia bezzine <bezzinerad...@gmail.com> 

Cc: dev@tika.apache.org 

Subject: Re: Issue with apache Tika 

 

Try UTF-8 encoding the URLs or the parameters themselves. If you are using 
Tika-Python, then use the Python
encode library…

 

Cheers,

Chris

 

 

 

From: radhia bezzine <bezzinerad...@gmail.com>
Date: Thursday, February 22, 2018 at 6:03 AM
To: "Mattmann, Chris A (1761)" <chris.a.mattm...@jpl.nasa.gov>
Subject: Issue with apache Tika

 

Hello Dear ! 

 

I hope you are doing well.

 

I am writing to you because I have an issue running Apache Tika on Python.

I'm trying to parse content and metadata from many URLs (existing on the internet);

however, Tika sometimes returns an error like "invalid argument".

I troubleshot the problem and realized that some URLs include forbidden 
characters, which is why Apache Tika reports "invalid argument".

I really don't know how to deal with this problem. I tried other tools, but I 
think Tika matches my needs.

 

Thank you very much for your time.

 

Best regards! 

 

Radhia

 



Re: Issue with apache Tika

2018-02-22 Thread Chris Mattmann
Try UTF-8 encoding the URLs or the parameters themselves. If you are using 
Tika-Python, then use the Python
encode library…
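
As a minimal sketch of the suggestion above, assuming Python 3 and only the standard library: the helper name `encode_url` and the example URL are hypothetical and are not part of Tika-Python. It percent-encodes the path and query as UTF-8 and leaves existing `%` escapes alone, so an already-encoded URL is not encoded twice.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def encode_url(url):
    """Percent-encode the path and query of a URL as UTF-8 (hypothetical helper)."""
    parts = urlsplit(url)
    # Keep "/" in the path and "=&" in the query; "%" stays so existing
    # escapes such as %20 are not double-encoded.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%")
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))


print(encode_url("http://example.com/some path/café.pdf?q=a b"))
```

The encoded string can then be handed to whatever call fetches or parses the URL.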

 

Cheers,

Chris

 

 

 

From: radhia bezzine 
Date: Thursday, February 22, 2018 at 6:03 AM
To: "Mattmann, Chris A (1761)" 
Subject: Issue with apache Tika

 

Hello Dear ! 

 

I hope you are doing well.

 

I am writing to you because I have an issue running Apache Tika on Python.

I'm trying to parse content and metadata from many URLs (existing on the internet);

however, Tika sometimes returns an error like "invalid argument".

I troubleshot the problem and realized that some URLs include forbidden 
characters, which is why Apache Tika reports "invalid argument".

I really don't know how to deal with this problem. I tried other tools, but I 
think Tika matches my needs.

 

Thank you very much for your time.

 

Best regards! 

 

Radhia



Re: Requesting Tika Wiki Page Edit Access

2018-02-17 Thread Chris Mattmann
Added! https://wiki.apache.org/tika/ContributorsGroup 

 

Feel free to edit the page

 

From: Prerana Teligi Harapanahalli Math 
Date: Thursday, February 15, 2018 at 8:35 PM
To: "dev@tika.apache.org" , "Mattmann, Chris A (1761)" 

Subject: Requesting Tika Wiki Page Edit Access

 

preranathm



Re: Not-yet-broken breaking changes for Tika 2?

2018-02-07 Thread Chris Mattmann
IMO, if parser p1 throws an exception and we move on to p2 before p1 has 
finished creating its SAX events, we can emit a special tag indicating the 
exception (e.g., a tag wrapping the text "Message here") before moving to p2 
in the chain...
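
A rough sketch of that idea, in Python rather than Tika's Java and with every name hypothetical: each parser pushes events into a shared list as it goes, and when one fails, a marker describing the exception is appended before the next parser in the chain runs. The marker format is an assumption; the thread does not fix one.

```python
def parse_with_fallback(parsers, data):
    """Run parsers in order; keep partial output and mark each failure."""
    events = []
    for name, parse in parsers:
        try:
            parse(data, events.append)  # the parser emits events as it goes
            break                       # success: stop at the first parser that finishes
        except Exception as exc:
            # Content already emitted stays; record why we are moving on.
            events.append(f"[exception in {name}: {exc}]")
    return events


def docx_parser(data, emit):
    emit("first paragraph")
    raise ValueError("kaboom")


def zip_parser(data, emit):
    emit("entry: word/document.xml")


print(parse_with_fallback([("docx", docx_parser), ("zip", zip_parser)], b""))
```

Nothing emitted before the failure is discarded, which matches the forensic use case raised elsewhere in the thread.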



On 2/7/18, 7:00 AM, "Allison, Timothy B." <talli...@mitre.org> wrote:

Do we worry about properly closing tags on an exception?

kaboom

-Original Message-
From: [mailto:lfcnas...@gmail.com] 
Sent: Monday, February 5, 2018 5:34 PM
To: dev@tika.apache.org
Subject: Re: Not-yet-broken breaking changes for Tika 2?

From a forensic use case it is better just saying we are trying another 
parser and not resetting the content handler, because the first parser can 
extract relevant content before the exception.

To not spool everything to temp files to re-read the stream, I think we can 
create an optional setinputstreamfactory() method in TikaInputStream, so the 
user can implement an InputStreamFactory interface with a getInputStream 
method, if he does not want to pay a performance hit with temp files for 
everything.
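
The stream-factory idea can be sketched like this. It is a Python illustration under stated assumptions, not the proposed TikaInputStream API: `parse_with_fresh_streams` and the two toy parsers are hypothetical, and `open_stream` stands in for the user-supplied factory that returns a new stream on each call.

```python
import io


def parse_with_fresh_streams(parsers, open_stream):
    """Try each parser in turn, handing each one a newly opened stream."""
    for parse in parsers:
        with open_stream() as stream:
            try:
                return parse(stream)
            except Exception:
                continue  # the next parser gets its own fresh stream
    return None


def strict_parser(stream):
    stream.read(4)  # consumes part of the stream before failing
    raise ValueError("unsupported")


def lenient_parser(stream):
    return stream.read().decode("utf-8")


payload = b"hello tika"
result = parse_with_fresh_streams(
    [strict_parser, lenient_parser], lambda: io.BytesIO(payload)
)
print(result)
```

Because the second parser receives a new stream rather than the partly consumed one, nothing has to be spooled to a temp file to rewind.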

Luis


Re: Not-yet-broken breaking changes for Tika 2?

2018-02-05 Thread Chris Mattmann
I think we should just say, OK now we're trying  a different parser



On 2/5/18, 9:51 AM, "Allison, Timothy B." <talli...@mitre.org> wrote:

To my mind, the real challenge is what to do with content that should be 
ignored...

If the strategy is back-off-on-exception (try the DOCX parser, but if 
there's an exception, use the Zip parser), what do we do with the sax elements 
that have already been written?  Do we need a new handler type that has a 
reset() method?

Or do we just say, hey, now we're trying a different parser...


-Original Message-
From: Mattmann, Chris A (1761) [mailto:chris.a.mattm...@jpl.nasa.gov] 
Sent: Monday, February 5, 2018 12:29 PM
To: dev@tika.apache.org
Subject: Re: Not-yet-broken breaking changes for Tika 2?

Our solution is just to run the parser 2x. Yes, I get that it will induce 
overhead, but as a start, why not?
In short, just run through the stream 2x.

++++++
Chris Mattmann, Ph.D.
Associate Chief Technology and Innovation Officer, OCIO
Manager, Advanced IT Research and Open Source Projects Office (1761)
Manager, NSF and Open Source Programs and Applications Office (8212)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 180-503E, Mailstop: 180-502
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Director, Information Retrieval and Data Science Group (IRDS)
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
WWW: http://irds.usc.edu/
++
 
 
On 2/5/18, 9:25 AM, "Nick Burch" <apa...@gagravarr.org> wrote:
    
    On Mon, 5 Feb 2018, Chris Mattmann wrote:
> Let's have a go at implementing it! You know my thoughts (make it like
> OODT ;) )

I'm still keen to hear how we can do the text content like OODT!

I have tried to copy the OODT model for the proposed metadata case 
though 
:)

Nick

> On 2/5/18, 8:37 AM, "Nick Burch" <apa...@gagravarr.org> wrote:
>
>Ping - anyone got any thoughts on the proposed metadata parser 
stuff, and
>any ideas on the content part?
>
>On Tue, 2 Jan 2018, Nick Burch wrote:
>> On Thu, 26 Oct 2017, Chris Mattmann wrote:
>>> On collision, the precedence order defines what key takes 
precedence and
>>> _overwrites_ the other. Overwrite is but one option (you could 
save *all*
>>> the values it’s a multi-valued key structure so…)
>>
>> OK, I think that's fine. I've had a go at updating the wiki for 
the metadata
>> case:
>> 
https://wiki.apache.org/tika/CompositeParserDiscussion#Supplementary.2FAdditive
>> And example Tika Config settings for it
>> https://wiki.apache.org/tika/CompositeParserDiscussion#line-20
>> If people are happy with how that sounds/looks, I can have a 
stab at
>> implementing it, as I *think* it's quite easy
>>
>>
>> However... that still leaves the Context (XHTML SAX events) case 
to solve!
>>
>> Anyone have any ideas on how we can append to or cancel/reset 
the Content
>> Handler series of SAX events when we move onto a second+ parser 
for a file?
>>
>> Thanks
>> Nick
>>
>>> On 10/26/17, 9:43 AM, "Nick Burch" <apa...@gagravarr.org> wrote:
>>>
>>>On Thu, 26 Oct 2017, Chris Mattmann wrote:
>>>> My general approach to conflicting metadata is simply to 
define
>>>> precedence orders.
>>>>
>>>> For example here is one documented from OODT:
>>>>
>>>>
>>> 
https://cwiki.apache.org/confluence/display/OODT/Understanding+CAS-PGE+Metadata+Precendence
>>>>
>>>> We can do similar things with Tika, e.g.,
>>>>
>>>> [CoreMetadata.PROPERTIES]
>>>> [ImageParser.METADATA]
>>>> [TikaOCR.METADATA]
>>>

Re: Not-yet-broken breaking changes for Tika 2?

2018-02-05 Thread Chris Mattmann
Let's have a go at implementing it! You know my thoughts (make it like OODT ;) )



On 2/5/18, 8:37 AM, "Nick Burch" <apa...@gagravarr.org> wrote:

Ping - anyone got any thoughts on the proposed metadata parser stuff, and 
any ideas on the content part?

On Tue, 2 Jan 2018, Nick Burch wrote:
> On Thu, 26 Oct 2017, Chris Mattmann wrote:
>> On collision, the precedence order defines what key takes precedence and 
>> _overwrites_ the other. Overwrite is but one option (you could save 
*all* 
>> the values it’s a multi-valued key structure so…)
>
> OK, I think that's fine. I've had a go at updating the wiki for the 
metadata 
> case:
> 
https://wiki.apache.org/tika/CompositeParserDiscussion#Supplementary.2FAdditive
> And example Tika Config settings for it
> https://wiki.apache.org/tika/CompositeParserDiscussion#line-20
> If people are happy with how that sounds/looks, I can have a stab at 
> implementing it, as I *think* it's quite easy
>
>
> However... that still leaves the Content (XHTML SAX events) case to solve!
>
> Anyone have any ideas on how we can append to or cancel/reset the Content 
> Handler series of SAX events when we move onto a second+ parser for a 
file?
>
> Thanks
> Nick
>
>> On 10/26/17, 9:43 AM, "Nick Burch" <apa...@gagravarr.org> wrote:
>>
>>On Thu, 26 Oct 2017, Chris Mattmann wrote:
>>> My general approach to conflicting metadata is simply to define
>>> precedence orders.
>>>
>>> For example here is one documented from OODT:
>>>
>>> 
>> 
https://cwiki.apache.org/confluence/display/OODT/Understanding+CAS-PGE+Metadata+Precendence
>>>
>>> We can do similar things with Tika, e.g.,
>>>
>>> [CoreMetadata.PROPERTIES]
>>> [ImageParser.METADATA]
>>> [TikaOCR.METADATA]
>>
>>What happens if two different parsers both output the same bit of 
>> metadata
>>though? eg Tim's example of one giving dc:creator of Tim and the 
second
>>giving dc:creator of Chris?
>> 
>>
>>Secondly, what about the XHTML sax events stream? I think that's 
>> probably
>>the harder case...
>>
>>Nick




Re: relying on a non-Maven central repo?

2018-02-05 Thread Chris Mattmann
I think we can't merge this b/c it references an external repository:

https://blog.sonatype.com/2010/03/why-external-repos-are-being-phased-out-of-central/
https://blog.sonatype.com/2009/02/why-putting-repositories-in-your-poms-is-a-bad-idea/

Before it can be merged it needs to be uploaded to OSSRH and synced
 


On 2/5/18, 9:01 AM, "Chris Mattmann" <mattm...@apache.org> wrote:

Hmmm...the problem here is that Sonatype won't let us publish to Central 
with
the below. It's not even an ASF policy thing - it's a Sonatype thing


On 2/5/18, 5:55 AM, "Allison, Timothy B." <talli...@mitre.org> wrote:

Sorry for the duplication, but I wanted to check on this and didn't 
want it to get lost in a github comment.


>Fellow devs on Apache Tika, are we ok with relying on a non-Maven 
central repo?

-Original Message-
From: ASF GitHub Bot (JIRA) [mailto:j...@apache.org] 
Sent: Monday, February 5, 2018 8:47 AM
To: dev@tika.apache.org
Subject: [jira] [Commented] (TIKA-2565) Upgrade edu.ucar dependencies 
to 4.6.11


[ 
https://issues.apache.org/jira/browse/TIKA-2565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16352395#comment-16352395
 ] 

ASF GitHub Bot commented on TIKA-2565:
--

tballison commented on a change in pull request #218: TIKA-2565 Upgrade 
edu.ucar dependencies to 4.6.11
URL: https://github.com/apache/tika/pull/218#discussion_r165975019
 
 

 ##
 File path: tika-parsers/pom.xml
 ##
 @@ -44,14 +44,22 @@
 0.8
 2.0.8
 1.8.13
-4.5.5
+4.6.11
 0.8
 
 1.54
 1.3
 4.5.4
   
 
+  
+
+unidata-all
+Unidata All
+
+ https://artifacts.unidata.ucar.edu/repository/unidata-all/
 
 Review comment:
   Fellow devs on Apache Tika, are we ok with relying on a non-Maven 
central repo?


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the URL above 
to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Upgrade edu.ucar dependencies to 4.6.11
> ---
>
> Key: TIKA-2565
> URL: https://issues.apache.org/jira/browse/TIKA-2565
> Project: Tika
>  Issue Type: Wish
>  Components: parser
>Affects Versions: 1.17
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 2.0
>
>
> An [existing PR|https://github.com/apache/tika/pull/212/files] 
suggests to upgrade the netcdf4-java dependency, however it does not address 
the issue.
> This PR will add the correct Maven repository configuration and then 
make the upgrade(s).
> https://www.unidata.ucar.edu/software/thredds/current/netcdf-java/refe
> rence/BuildDependencies.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)









Re: relying on a non-Maven central repo?

2018-02-05 Thread Chris Mattmann
Hmmm...the problem here is that Sonatype won't let us publish to Central with
the below. It's not even an ASF policy thing - it's a Sonatype thing


On 2/5/18, 5:55 AM, "Allison, Timothy B."  wrote:

Sorry for the duplication, but I wanted to check on this and didn't want it 
to get lost in a github comment.


>Fellow devs on Apache Tika, are we ok with relying on a non-Maven central 
repo?

-Original Message-
From: ASF GitHub Bot (JIRA) [mailto:j...@apache.org] 
Sent: Monday, February 5, 2018 8:47 AM
To: dev@tika.apache.org
Subject: [jira] [Commented] (TIKA-2565) Upgrade edu.ucar dependencies to 
4.6.11


[ 
https://issues.apache.org/jira/browse/TIKA-2565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16352395#comment-16352395
 ] 

ASF GitHub Bot commented on TIKA-2565:
--

tballison commented on a change in pull request #218: TIKA-2565 Upgrade 
edu.ucar dependencies to 4.6.11
URL: https://github.com/apache/tika/pull/218#discussion_r165975019
 
 

 ##
 File path: tika-parsers/pom.xml
 ##
 @@ -44,14 +44,22 @@
 0.8
 2.0.8
 1.8.13
-4.5.5
+4.6.11
 0.8
 
 1.54
 1.3
 4.5.4
   
 
+  
+
+unidata-all
+Unidata All
+
+ https://artifacts.unidata.ucar.edu/repository/unidata-all/
 
 Review comment:
   Fellow devs on Apache Tika, are we ok with relying on a non-Maven 
central repo?


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the URL above to go 
to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Upgrade edu.ucar dependencies to 4.6.11
> ---
>
> Key: TIKA-2565
> URL: https://issues.apache.org/jira/browse/TIKA-2565
> Project: Tika
>  Issue Type: Wish
>  Components: parser
>Affects Versions: 1.17
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 2.0
>
>
> An [existing PR|https://github.com/apache/tika/pull/212/files] suggests 
to upgrade the netcdf4-java dependency, however it does not address the issue.
> This PR will add the correct Maven repository configuration and then make 
the upgrade(s).
> https://www.unidata.ucar.edu/software/thredds/current/netcdf-java/refe
> rence/BuildDependencies.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)






Re: 1.17 rc1 and two repos in nexus?!

2017-12-08 Thread Chris Mattmann
RC #2….



On 12/8/17, 2:30 PM, "Allison, Timothy B." <talli...@mitre.org> wrote:

Wait, no that's totally hosed, there's not even a source zip file in: 
https://repository.apache.org/content/repositories/orgapachetika-1027


https://repository.apache.org/content/repositories/orgapachetika-1027/org/apache/tika/tika/1.17/

Do I need to respin w rc2?  Or is there a way to push to nexus again?


-Original Message-
From: Allison, Timothy B. 
Sent: Friday, December 8, 2017 5:18 PM
To: dev@tika.apache.org
Subject: RE: 1.17 rc1 and two repos in nexus?!

Do we expect only the src to be in nexus, not the jar artifacts (with sigs 
and digests) for app, server, eval?

-Original Message-
From: Chris Mattmann [mailto:mattm...@apache.org] 
Sent: Friday, December 8, 2017 5:07 PM
To: dev@tika.apache.org
Subject: Re: 1.17 rc1 and two repos in nexus?!

Hey Tim, probably just upload errors on the first one and so it tried 
again. No worries. Drop and close the first, and just use the 2nd.

Cheers,
Chris




On 12/8/17, 12:05 PM, "Allison, Timothy B." <talli...@mitre.org> wrote:

Not sure what happened, but two repos were created in Nexus:

https://repository.apache.org/content/repositories/orgapachetika-1026/
https://repository.apache.org/content/repositories/orgapachetika-1027/

The first one (1026) failed with checksum problems, and I dropped it.
I closed the second one (1027), and that one is now live.

I can't see anything strange in the terminal output.

Any ideas...is this normal?

Cheers,

   Tim










Re: 1.17 rc1 and two repos in nexus?!

2017-12-08 Thread Chris Mattmann
Hey Tim, probably just upload errors on the first one and so it tried again. No 
worries. Drop and close
the first, and just use the 2nd.

Cheers,
Chris




On 12/8/17, 12:05 PM, "Allison, Timothy B."  wrote:

Not sure what happened, but two repos were created in Nexus:

https://repository.apache.org/content/repositories/orgapachetika-1026/
https://repository.apache.org/content/repositories/orgapachetika-1027/

The first one (1026) failed with checksum problems, and I dropped it.
I closed the second one (1027), and that one is now live.

I can't see anything strange in the terminal output.

Any ideas...is this normal?

Cheers,

   Tim







Re: Tika 1.17?

2017-11-29 Thread Chris Mattmann
Thanks so much for fixing this. It worked during MEMEX and then I think has 
since fallen out
of date and perhaps I committed Zarana’s code wrong or something. Will be great 
to get this
working!



On 11/29/17, 9:54 AM, "David Meikle" <loo...@gmail.com> wrote:

I am thinking TIKA-2385. I've got a resized image that I can commit tonight
that should close this one off.

Cheers,
Dave


On 29 Nov 2017 14:42, "Allison, Timothy B." <talli...@mitre.org> wrote:

Many thanks to Bob for help on TIKA-2502!

Anything else we want to put into 1.17 before I run the regression tests?

-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Monday, November 13, 2017 1:42 PM
To: dev@tika.apache.org
Subject: RE: Tika 1.17?

Y.  You're right.  Thank you!

 I think I've been avoiding that because there were some regressions in
metadata-extractor last I looked at this.  Let's hope those are gone in
2.10.1.

-Original Message-
From: Tyler Bui-Palsulich [mailto:tpalsul...@apache.org]
Sent: Sunday, November 12, 2017 2:54 PM
To: dev@tika.apache.org
Subject: RE: Tika 1.17?

TIKA-2486 might be worth blocking on since there is a CVE.

Tyler

On Nov 6, 2017 5:26 AM, "Allison, Timothy B." <talli...@mitre.org> wrote:

> Y.  I'm happy enough  to wait a few more days.  I wasn't able to kick
> off the regression tests last week.  Should I wait for the new parsers
> to run the regression tests?
>
> -Original Message-
> From: David Meikle [mailto:loo...@gmail.com]
> Sent: Friday, November 3, 2017 7:42 PM
> To: dev@tika.apache.org
> Subject: Re: Tika 1.17?
>
> Sounds good. I have a couple of new parsers I would like to slot in
> but not had a chance the last few months. Will go for it over the
> weekend, if that works for you Tim.
>
> Cheers,
> Dave
>
>
>
> On 3 November 2017 at 15:19, Mattmann, Chris A (3010) <
> chris.a.mattm...@jpl.nasa.gov> wrote:
>
> > Let’s make it so.
> >
> > 
> ++
> > Chris Mattmann, Ph.D.
> > Principal Data Scientist, Engineering Administrative Office (3010)
> > Manager, NSF & Open Source Projects Formulation and Development
> > Offices
> > (8212)
> > NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> > Office: 180-503E, Mailstop: 180-503
> > Email: chris.a.mattm...@nasa.gov
> > WWW:  http://sunset.usc.edu/~mattmann/
> > 
> ++
> > Director, Information Retrieval and Data Science Group (IRDS)
> > Adjunct Associate Professor, Computer Science Department University
> > of Southern California, Los Angeles, CA 90089 USA
> > WWW: http://irds.usc.edu/
> > 
> ++
> >
> >
> >
> > On 11/3/17, 7:35 AM, "Allison, Timothy B." <talli...@mitre.org> wrote:
> >
> > All,
> >
> > PDFBox 2.0.8 is now integrated.  I want to fix TIKA-2490 before
> > we release 1.17.  Are there other issues that are blockers or you'd
> > like to fix before 1.17 (TIKA-2471, maybe?)?
> >
> > I plan to run initial large scale regression tests shortly for
> > rfc822 and mbox because of TIKA-2478.  I'll run the full regression
> > tests before cutting the RC, but I want to focus on those for now.
Other requests?
> >
> > Cheers,
> >
> > Tim
> >
> >
> >
>





Re: Not-yet-broken breaking changes for Tika 2?

2017-10-26 Thread Chris Mattmann
On collision, the precedence order defines what key takes precedence and 
_overwrites_ the
other. Overwrite is but one option (you could save *all* the values it’s a 
multi-valued key structure
so…)
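
A rough sketch of the two collision options mentioned above (overwrite versus keeping all values in a multi-valued structure). This uses plain JDK maps rather than Tika's Metadata class, and the names CollisionPolicySketch and OnCollision are made up for illustration. It assumes the sources are handed over from lowest to highest precedence.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class CollisionPolicySketch {
    /** Hypothetical policy names, not an existing Tika setting. */
    enum OnCollision { OVERWRITE, KEEP_ALL }

    /**
     * Merges metadata maps given in ascending precedence
     * (the last map in the list has the highest precedence).
     */
    static Map<String, List<String>> merge(List<Map<String, String>> lowestToHighest,
                                           OnCollision policy) {
        Map<String, List<String>> merged = new LinkedHashMap<>();
        for (Map<String, String> source : lowestToHighest) {
            for (Map.Entry<String, String> e : source.entrySet()) {
                List<String> values =
                        merged.computeIfAbsent(e.getKey(), k -> new ArrayList<>());
                if (policy == OnCollision.OVERWRITE) {
                    values.clear(); // the higher-precedence value replaces earlier ones
                }
                values.add(e.getValue());
            }
        }
        return merged;
    }
}
```

With dc:creator=Tim from one parser and dc:creator=Chris from a higher-precedence one, OVERWRITE leaves only Chris, while KEEP_ALL leaves both values in order.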

Cheers,
Chris




On 10/26/17, 9:43 AM, "Nick Burch" <apa...@gagravarr.org> wrote:

On Thu, 26 Oct 2017, Chris Mattmann wrote:
> My general approach to conflicting metadata is simply to define 
> precedence orders.
>
> For example here is one documented from OODT:
>
> 
https://cwiki.apache.org/confluence/display/OODT/Understanding+CAS-PGE+Metadata+Precendence
>
> We can do similar things with Tika, e.g.,
>
> [CoreMetadata.PROPERTIES]
> [ImageParser.METADATA]
> [TikaOCR.METADATA]

What happens if two different parsers both output the same bit of metadata 
though? eg Tim's example of one giving dc:creator of Tim and the second 
giving dc:creator of Chris?


Secondly, what about the XHTML sax events stream? I think that's probably 
the harder case...

Nick





Re: Not-yet-broken breaking changes for Tika 2?

2017-10-26 Thread Chris Mattmann
Thanks Nick.

My general approach to conflicting metadata is simply to define precedence 
orders.

For example here is one documented from OODT:

https://cwiki.apache.org/confluence/display/OODT/Understanding+CAS-PGE+Metadata+Precendence
 

We can do similar things with Tika, e.g.,

[CoreMetadata.PROPERTIES]
[ImageParser.METADATA]
[TikaOCR.METADATA]
…

And then start with the top, and then overlay heading downwards. Make sense?

Cheers,
Chris

P.S. The metadata key/value merging principles could be configurable, but a 
default base one of
overlay according to some configured precedence order maybe in tika-config.xml 
would be a fine
start.
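
To make the overlay idea concrete, here is a minimal sketch using plain JDK maps instead of Tika's Metadata class. Assumptions are flagged: PrecedenceOverlaySketch is a hypothetical name, the source labels are just the strings from the list above, and the sketch assumes the first entry in the configured list has the highest precedence (the opposite reading is equally easy to implement).

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class PrecedenceOverlaySketch {
    /**
     * Overlays per-source metadata following a configured precedence list.
     * ASSUMPTION: the first source in the list wins on a key collision.
     */
    static Map<String, String> overlay(List<String> precedence,
                                       Map<String, Map<String, String>> bySource) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String source : precedence) {
            // Sources named in the configuration but absent at runtime are skipped.
            Map<String, String> metadata = bySource.getOrDefault(source, Map.of());
            for (Map.Entry<String, String> e : metadata.entrySet()) {
                // Keep the value from the higher-precedence source if one exists.
                result.putIfAbsent(e.getKey(), e.getValue());
            }
        }
        return result;
    }
}
```

The precedence list is the only thing that would need to live in configuration; the merge itself stays the same whatever order is chosen.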




On 10/26/17, 9:14 AM, "Nick Burch" <apa...@gagravarr.org> wrote:

On Thu, 26 Oct 2017, Chris Mattmann wrote:
> Why don’t we just store N copies of the stream, and parse it twice?

I'm not sure that's the challenge though? Using TikaInputStream we can 
buffer to a temp file if needed to re-read the input
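
The re-read part can be sketched with the JDK alone: spool the stream to a temp file once, then hand each consumer a fresh stream over that file. This is only an illustration of the idea, not TikaInputStream's implementation; RereadSketch, runEach and readAll are hypothetical names.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

class RereadSketch {
    /** Spools the input to a temp file and runs every consumer on its own fresh stream. */
    static <T> List<T> runEach(InputStream input, List<Function<InputStream, T>> consumers) {
        Path spool = null;
        try {
            spool = Files.createTempFile("reread-sketch-", ".bin");
            Files.copy(input, spool, StandardCopyOption.REPLACE_EXISTING);
            List<T> results = new ArrayList<>();
            for (Function<InputStream, T> consumer : consumers) {
                try (InputStream fresh = Files.newInputStream(spool)) {
                    results.add(consumer.apply(fresh));
                }
            }
            return results;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            if (spool != null) {
                try {
                    Files.deleteIfExists(spool);
                } catch (IOException ignored) {
                    // best-effort cleanup of the temp file
                }
            }
        }
    }

    /** Example consumer: reads the whole stream as UTF-8 text. */
    static String readAll(InputStream in) {
        try {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

This covers the input side only; it says nothing about how to combine the results, which is the question raised next.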

> Of course that’s the ugly way, but currently the way I’ve hacked this in 
> all of my projects is simply to call Tika N times OUTSIDE of Tika. Why 
> don’t we just use that as the weakest baseline and work backwards from 
> there?

I think our main challenge right now is on the output end. How do you deal 
with multiple different Metadata results that might clash after running 
Tika several times? How do you deal with multiple (some potentially empty, 
some overlapping) XHTML outputs from multiple parses? Can we copy those 
approaches?

Thanks
Nick




Re: Not-yet-broken breaking changes for Tika 2?

2017-10-26 Thread Chris Mattmann
Why don’t we just store N copies of the stream, and parse it twice?

Of course that’s the ugly way, but currently the way I’ve hacked this in all of
my projects is simply to call Tika N times OUTSIDE of Tika. Why don’t we just 
use
that as the weakest baseline and work backwards from there?

Chris




On 10/26/17, 3:56 AM, "Nick Burch"  wrote:

Hi All

Based on the plan on the wiki, we still have a 
major breaking change or two planned for Tika 2 that we haven't yet 
"broken". (There's also removing some deprecated stuff etc)


As I understand it, the biggest breaking TODO change is around having 
multiple parsers available + active for a given format. This could be to 
support fallback parsers, eg "try this fancy new parser, but if it fails, 
retry with this simpler one" or "try this xml parser, if that fails just 
try strings". A related but different case is to cleanly support multiple 
parsers covering different aspects, eg OCR an image plus extract metadata, 
or NER on the contents of a scientific PDF + text + metadata + NER of the 
OCR of embedded images in the PDF.

Currently, we can't cleanly do the former, and the latter is (badly) 
handled via one parser (eg OCR or NER) having an embedded hard-coded 
reference to another (eg Image or PDF).


We've got some details on the proposed plans and ideas on the wiki:
https://wiki.apache.org/tika/CompositeParserDiscussion

The biggest stumbling block, as I see it, is how to let multiple parsers 
interact with the SAX content handler. For the fallback case, that's how 
to say "sorry, ignore all that XML we already sent, we're starting again 
with this XML now". For the multiple parser case, it's how we could have 
the image parser "finish" the (empty) XHTML but then have the OCR one send 
some text, or have the NER parser get at the XHTML text of the PDF + OCR 
of embedded images to enhance with the entities.


What do we think for this? Can we come up with a solution to let this go 
forward? Is there a pattern from elsewhere we can follow?

Or do we need to cancel this for 2.x, ponder it for another 1-2 years, and 
do this stuff in Tika 3 instead?

Nick





Re: [DISCUSS] Enable specific ContentHandler for tika-server

2017-10-24 Thread Chris Mattmann
This makes sense to me, +1 Giuseppe!



On 10/24/17, 6:12 PM, "Giuseppe Totaro"  wrote:

Hi folks,

I am developing the proposed solutions within tika-server for enabling
specific ContentHandlers. Basically, I am working to provide the ability of
giving the name of the ContentHandler to be used by either command-line or
HTTP header.
In order to complete my work, I would like to get your feedback about the
following aspects:

   1. To create and use the given ContentHandler, should I modify each
   method within the TikaResource class (as well as the other classes
   within org.apache.tika.server.resource) where the parse method is
   performed by wrapping the ContentHandler currently used? Alternatively, I
   could create a new method (therefore a new REST API) specifically focused
   on creating a ContentHandler from the list provided by the user. Of
   course, I am totally open to other solutions.

   2. As ContentHandlers often provide different types of constructors, we
   would need a mechanism to determine via reflection the constructor and
   the parameters to be used. I think we could get the ContentHandler class
   by using the static method Class.forName(String className) [0] with the
   fully-qualified name of the given class, and then use the method
   getConstructor(Class<?>... parameterTypes) [1] to determine the
   constructor to be used and instantiate the ContentHandler.

   3. If you agree with the above, I think that we can allow users to
   provide the parameters according to RFC 822 [3] so that they can give the
   name of the ContentHandler to be used and the parameters as a
   semicolon-separated list of entries:

= X-Content-Handler:  *[, ]
=  *[; ]
=  = 

   Consistently, I would enable the same syntax when using the command-line
   option:

   java -jar tika-server-X.jar -contentHandler *[,]
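
The reflection step in point 2 could look roughly like the sketch below. It is written under assumptions: HandlerFactorySketch is a hypothetical name, all constructor parameters are taken to be Strings (many real handlers wrap another ContentHandler, so a real implementation would need a richer mapping), and a name without a dot is resolved against a caller-supplied default package. The asSubclass check is there so that only classes of the expected type can be instantiated from a user-supplied name.

```java
import java.lang.reflect.Constructor;
import java.util.Arrays;

class HandlerFactorySketch {
    /** Names without a dot go into defaultPackage; dotted names are treated as FQCNs. */
    static String resolveClassName(String name, String defaultPackage) {
        return name.contains(".") ? name : defaultPackage + "." + name;
    }

    /**
     * Loads className, checks that it is a subtype of expected, and calls the
     * public constructor taking args.length String parameters.
     */
    static <T> T instantiate(Class<T> expected, String className, String... args) {
        try {
            Class<? extends T> type = Class.forName(className).asSubclass(expected);
            Class<?>[] parameterTypes = new Class<?>[args.length];
            Arrays.fill(parameterTypes, String.class);
            Constructor<? extends T> constructor = type.getConstructor(parameterTypes);
            return constructor.newInstance((Object[]) args);
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Cannot create " + className, e);
        }
    }
}
```

For example, instantiate(org.xml.sax.ContentHandler.class, "org.xml.sax.helpers.DefaultHandler") uses the no-argument constructor, while passing a class that is not a ContentHandler fails with a ClassCastException before anything is constructed.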

I look forward to having your feedback.

Thanks a lot,
Giuseppe

[0]

https://docs.oracle.com/javase/8/docs/api/java/lang/Class.html#forName-java.lang.String-
[1]

https://docs.oracle.com/javase/8/docs/api/java/lang/Class.html#getConstructor-java.lang.Class...-
[3] https://www.w3.org/Protocols/rfc822/

On Fri, Oct 6, 2017 at 3:06 PM, Sergey Beryozkin 
wrote:

> Konstantin, by the way, if you are interested in having a good discussion
> to do with using the serialized lambdas then you will be welcome to 
comment
> on the relevant text in the Tika Concerns Beam thread, though may be Beam
> knows how to take care of the issues you raised...
>
> Thanks, Sergey
>
> On 06/10/17 18:27, Sergey Beryozkin wrote:
>
>> On 06/10/17 18:08, Konstantin Gribov wrote:
>>
>>> My +1 to this idea.
>>>
>>> IMHO, second option is more flexible. I also like Nick's suggestion 
about
>>> using default package for handlers and interpret dot-separated string as
>>> fqcn. Solr does similar thing and it's very convenient to use (but they
>>> use
>>> prefix `solr.` for their classes in predefined package and any other is
>>> interpreted as fqcn).
>>>
>>> I'll add that you could allow user to pass several comma-separated
>>> handlers
>>> to allow build content-handler stack if user wants to.
>>>
>>> I would disagree with Sergey about serialized lambdas for 2 reasons:
>>> - it's useful only for java-clients;
>>> - it could bring very nasty bugs leading to RCE class vulnerabilities, 
so
>>> it's very controversial from security PoV.
>>>
>> Sure. I was not actually suggesting to use them in Tika natively, I only
>> referred to it as the alternative mentioned in the context of the Beam
>> integration work
>>
>> Sergey
>>
>>>
>>> On Thu, Sep 28, 2017 at 11:35 PM Giuseppe Totaro 
>>> wrote:
>>>
>>> Hi folks,

 if I am not wrong, currently you cannot configure a specific
 ContentHandler
 while using tika-server. I mean that you can configure your own parser
 [0]
 but you cannot control which ContentHandler the parser leverages to
 extract
 text and metadata (e.g., you cannot use PhoneExtractingContentHandler,
 StandardsExtractingContentHandler, etc).
 If it is correct, it would be nice to enable the use of specific
 ContentHandlers within tika-server and I would like to discuss how to
 solve
 this issue generally.

 I propose two solutions:

 1. augment the TikaConfig class so that a specific ContentHandler
 can be
 used in tika-config.xml;
 2. determine the ContentHandler to use for parsing through HTTP
 

Re: Announcing go-tika, a Go package for Tika

2017-10-06 Thread Chris Mattmann
I saw this, Tyler, and it’s awesome. I forked it already, though I’m not a Go 
programmer. Thank you for increasing the community here!

CC’ing Jim Jag who I know has done some Go programming, Jim spread the word ;)

Cheers,
Chris




On 10/6/17, 10:12 AM, "Tyler Bui-Palsulich"  
wrote:

(Bumping this since it looks like the first message didn't go through.)

Tyler

On Mon, Oct 2, 2017 at 1:27 PM, Tyler Bui-Palsulich 
wrote:

> Hi Everyone,
>
> I am happy to announce go-tika, a Go package which makes it easy to use
> Tika from Go! See https://github.com/google/go-tika and the corresponding
> GoDoc . I marked a
> release as 0.1.16, indicating the latest version supported by the package.
>
> I added a link to the wiki (https://wiki.apache.org/tika/
> API%20Bindings%20for%20Tika).
>
> Here is the relevant bit for how to parse a file:
>
> f, err := os.Open("path/to/file")
> if err != nil {
> log.Fatal(err)
> }
> defer f.Close()
>
> client := tika.NewClient(nil, s.URL())
> body, err := client.Parse(context.Background(), f)
>
> Hopefully this is useful for someone! Feel free to file an issue if you
> have a question or spot a bug.
>
> Thanks to Chris et al. for inspiration with tika-python.
>
> Tyler
>





Re: [DISCUSS] Enable specific ContentHandler for tika-server

2017-09-28 Thread Chris Mattmann
Hmm, cool.

Can we support both? If I don’t have to modify/ship a Tika config (which is a 
runtime
configuration) and I can, on a per call invocation, change the ContentHandler, 
it would
be MUCH easier in downstream libraries like Tika Python that rely on the REST 
server.
These are documented here:

https://wiki.apache.org/tika/API%20Bindings%20for%20Tika

Cheers,
Chris




On 9/28/17, 2:26 PM, "Sergey Beryozkin" <sberyoz...@gmail.com> wrote:

Hi

Option #1 is also good - a question how to pass a ContentHandler to a 
Beam function was open, and given that passing TikaConfig is needed 
anyway, having a way to specify a handler there can be handy too...

Cheers, Sergey
On 28/09/17 22:17, Chris Mattmann wrote:
> I am +1 for this. Option #2 sounds like a slick way to handle this for me 
that would
> remain back compat with tika-python which is of strong interest to me.
> 
> Cheers,
> Chris
> 
> 
> 
> 
> On 9/28/17, 1:35 PM, "Giuseppe Totaro" <totarope...@gmail.com> wrote:
> 
>  Hi folks,
>  
>  if I am not wrong, currently you cannot configure a specific 
ContentHandler
>  while using tika-server. I mean that you can configure your own 
parser [0]
>  but you cannot control which ContentHandler the parser leverages to 
extract
>  text and metadata (e.g., you cannot use 
PhoneExtractingContentHandler,
>  StandardsExtractingContentHandler, etc).
>  If it is correct, it would be nice to enable the use of specific
>  ContentHandlers within tika-server and I would like to discuss how 
to solve
>  this issue generally.
>  
>  I propose two solutions:
>  
> 1. augment the TikaConfig class so that a specific ContentHandler 
can be
> used in tika-config.xml;
> 2. determine the ContentHandler to use for parsing through HTTP 
headers,
> for example:
> curl -T filename.pdf http://localhost:9998/meta --header
> "X-Content-Handler: PhoneExtractingContentHandler"
> This should affect also the TikaResource.java class.
>  
>  I look forward to having your feedback. I strongly believe that 
every user
>  who wants to use Tika as a service through tika-server and needs to 
extract
>  content and metadata like phone numbers, standard references, etc 
would be
>  very happy.
>  
>  Thanks a lot,
>  Giuseppe
>  
> 
> 





Re: [DISCUSS] Enable specific ContentHandler for tika-server

2017-09-28 Thread Chris Mattmann
I am +1 for this. Option #2 sounds like a slick way to handle this for me that 
would
remain back compat with tika-python which is of strong interest to me.

Cheers,
Chris




On 9/28/17, 1:35 PM, "Giuseppe Totaro"  wrote:

Hi folks,

if I am not wrong, currently you cannot configure a specific ContentHandler
while using tika-server. I mean that you can configure your own parser [0]
but you cannot control which ContentHandler the parser leverages to extract
text and metadata (e.g., you cannot use PhoneExtractingContentHandler,
StandardsExtractingContentHandler, etc).
If it is correct, it would be nice to enable the use of specific
ContentHandlers within tika-server and I would like to discuss how to solve
this issue generally.

I propose two solutions:

   1. augment the TikaConfig class so that a specific ContentHandler can be
   used in tika-config.xml;
   2. determine the ContentHandler to use for parsing through HTTP headers,
   for example:
   curl -T filename.pdf http://localhost:9998/meta --header
   "X-Content-Handler: PhoneExtractingContentHandler"
   This should affect also the TikaResource.java class.
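
For downstream clients, the curl call in option 2 would translate to something like the following Java 11 sketch. Note that X-Content-Handler is only the header proposed in this thread, not something tika-server is known to support, and ContentHandlerHeaderSketch is a hypothetical name. The request is only built here, not sent.

```java
import java.net.URI;
import java.net.http.HttpRequest;

class ContentHandlerHeaderSketch {
    /** Builds a PUT to <serverUrl>/meta carrying the proposed X-Content-Handler header. */
    static HttpRequest buildRequest(String serverUrl, byte[] document, String handlerName) {
        return HttpRequest.newBuilder(URI.create(serverUrl + "/meta"))
                .header("X-Content-Handler", handlerName) // proposed header, an assumption
                .PUT(HttpRequest.BodyPublishers.ofByteArray(document))
                .build();
    }
}
```

Sending it would be a matter of HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()) against a running server.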

I look forward to having your feedback. I strongly believe that every user
who wants to use Tika as a service through tika-server and needs to extract
content and metadata like phone numbers, standard references, etc would be
very happy.

Thanks a lot,
Giuseppe





Re: TikaIO concerns

2017-09-22 Thread Chris Mattmann
[dropping Beam on this]

Tim, another thing is that you can finally download the TREC-DD Polar data 
either
from the NSF Arctic Data Center (70GB zip), or from Amazon S3, as described 
here:

http://github.com/chrismattmann/trec-dd-polar/ 

In case we want to use as part of our regression.

Cheers,
Chris




On 9/22/17, 10:43 AM, "Allison, Timothy B."  wrote:

>>1) We've gathered a TB of data from CommonCrawl and we run regression 
tests against this TB (thank you, Rackspace for hosting our vm!) to try to 
identify these problems.

And if anyone with connections at a big company doing open source + cloud 
would be interested in floating us some storage and cycles,  we'd be happy to 
move off our single vm to increase coverage and improve the speed for our 
large-scale regression tests.  

:D

But seriously, thank you for this discussion and collaboration!

Cheers,

 Tim






Re: TikaIO concerns

2017-09-21 Thread Chris Mattmann
Hi all,

One other thing is that Tika extracts metadata and language information, for 
which order doesn’t matter (keys can be out of order).

Would this be useful?

Cheers,
Chris




On 9/21/17, 2:10 PM, "Sergey Beryozkin"  wrote:

Hi Eugene

Thank you, very helpful, let me read it few times before I get what 
exactly I need to clarify :-), two questions so far:

On 21/09/17 21:40, Eugene Kirpichov wrote:
> Thanks all for the discussion. It seems we have consensus that both
> within-document order and association with the original filename are
> necessary, but currently absent from TikaIO.
> 
> *Association with original file:*
> Sergey - Beam does not *automatically* provide a way to associate an
> element with the file it originated from: automatically tracking data
> provenance is a known very hard research problem on which many papers have
> been written, and obvious solutions are very easy to break. See related
> discussion at
> 
https://lists.apache.org/thread.html/32aab699db3901d9f0191ac7dbc0091b31cb8be85eee6349deaee671@%3Cuser.beam.apache.org%3E
>   .
> 
> If you want the elements of your PCollection to contain additional
> information, you need the elements themselves to contain this information:
> the elements are self-contained and have no metadata associated with them
> (beyond the timestamp and windows, universal to the whole Beam model).
> 
> *Order within a file:*
> The only way to have any kind of order within a PCollection is to have the
> elements of the PCollection contain something ordered, e.g. have a
> PCollection, where each List is for one file [I'm assuming
> Tika, at a low level, works on a per-file basis?]. However, since TikaIO
> can be applied to very large files, this could produce very large elements,
> which is a bad idea. Because of this, I don't think the result of applying
> Tika to a single file can be encoded as a PCollection element.
> 
> Given both of these, I think that it's not possible to create a
> *general-purpose* TikaIO transform that will be better than manual
> invocation of Tika as a DoFn on the result of FileIO.readMatches().
> 
> However, looking at the examples at
> https://tika.apache.org/1.16/examples.html - almost all of the examples
> involve extracting a single String from each document. This use case, with
> the assumption that individual documents are small enough, can certainly be
> simplified and TikaIO could be a facade for doing just this.
> 
> E.g. TikaIO could:
> - take as input a PCollection
> - return a PCollection<KV<String, ParseResult>>, where ParseResult
> is a class with properties { String content, Metadata metadata }

And what is the 'String' in KV, given that TikaIO.ParseResult
represents the content + (Tika) Metadata of the file, such as the author
name, etc.? Is it the file name?
> - be configured by: a Parser (it implements Serializable so can be
> specified at pipeline construction time) and a ContentHandler whose
> toString() will go into "content". ContentHandler does not implement
> Serializable, so you cannot specify it at construction time - however, you
> can let the user specify either its class (if it's a simple handler like a
> BodyContentHandler) or specify a lambda for creating the handler
> (SerializableFunction), and potentially you can have
> a simpler facade for Tika.parseAsString() - e.g. call it
> TikaIO.parseAllAsStrings().
> 
> Example usage would look like:
> 
>   PCollection<KV<String, ParseResult>> parseResults =
>       p.apply(FileIO.match().filepattern(...))
>        .apply(FileIO.readMatches())
>        .apply(TikaIO.parseAllAsStrings())
>
> or:
>
>        .apply(TikaIO.parseAll()
>            .withParser(new AutoDetectParser())
>            .withContentHandler(() -> new BodyContentHandler(
>                new ToXMLContentHandler())))
> 
> You could also have shorthands for letting the user avoid using FileIO
> directly in simple cases, for example:
>  p.apply(TikaIO.parseAsStrings().from(filepattern))
> 
> This would of course be implemented as a ParDo or even MapElements, and
> you'll be able to share the code between parseAll and regular parse.
> 
OK. What about the current source on master: should it be marked
Experimental until I manage to write something new with the above ideas
in mind? Or is there enough time before 2.2.0 gets released?

Thanks, Sergey
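
As a rough illustration of the output shape proposed in the quoted message (each element carrying its own file name next to the parsed content and metadata), here is a minimal plain-Java sketch. It uses neither Beam nor Tika: `ParseResult`, `keyed` and `filesContaining` are hypothetical names, and a `Map.Entry` merely stands in for Beam's `KV`. The only point it tries to make is that once the file name travels inside the element, the order in which elements arrive no longer matters for a lookup such as "which files contain this word".

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ParseResultSketch {

    /** Hypothetical stand-in for the proposed ParseResult { content, metadata }. */
    public static final class ParseResult {
        public final String content;
        public final Map<String, String> metadata;

        public ParseResult(String content, Map<String, String> metadata) {
            this.content = content;
            // Defensive copy so the element stays self-contained and immutable.
            this.metadata = Collections.unmodifiableMap(new LinkedHashMap<>(metadata));
        }
    }

    /** Pairs a file name with its parse result; Map.Entry plays the role of KV here. */
    public static Map.Entry<String, ParseResult> keyed(
            String fileName, String content, Map<String, String> metadata) {
        return new SimpleImmutableEntry<>(fileName, new ParseResult(content, metadata));
    }

    /** Returns the sorted names of files whose content contains the given word. */
    public static List<String> filesContaining(
            List<Map.Entry<String, ParseResult>> elements, String word) {
        List<String> names = new ArrayList<>();
        for (Map.Entry<String, ParseResult> element : elements) {
            if (element.getValue().content.contains(word)) {
                names.add(element.getKey());
            }
        }
        Collections.sort(names);
        return names;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, ParseResult>> elements = new ArrayList<>();
        elements.add(keyed("a.pdf", "apache tika parses documents",
                Collections.singletonMap("Author", "Alice")));
        elements.add(keyed("b.pdf", "beam runs pipelines",
                Collections.singletonMap("Author", "Bob")));
        elements.add(keyed("c.pdf", "tika extracts metadata",
                Collections.singletonMap("Author", "Carol")));

        // Simulate a runner handing the elements over in arbitrary order.
        Collections.shuffle(elements);

        System.out.println(filesContaining(elements, "tika")); // prints [a.pdf, c.pdf]
    }
}
```

Within-document order is a separate matter: in this sketch it is preserved only because the whole document text sits in a single `content` string, which is exactly the "documents are small enough" assumption made in the quoted proposal.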
> On Thu, Sep 21, 2017 at 7:38 AM Sergey Beryozkin 
> wrote:
> 
>> Hi Tim
>> On 21/09/17 14:33, Allison, Timothy B. wrote:
>>> Thank you, Sergey.
>>>
>>> My knowledge of Apache Beam is limited -- I 

Re: Integrating Tika with Apache Beam

2017-09-21 Thread Chris Mattmann
Thanks Sergey, feel free to CC me directly at mattm...@apache.org on the Beam 
thread.
My own 2c is that Tika’s “metadata” extraction can be in any order; our
tika-dl module and the new feature extraction from multimedia files using
TensorFlow and DL4J are perfect examples where the order of extraction
doesn’t matter…



On 9/21/17, 2:52 AM, "Sergey Beryozkin" <sberyoz...@gmail.com> wrote:

Hi Guys

TikaIO is getting some serious attention now on the Beam dev list, and
unfortunately it is not all about it being a great addition to Beam.

The team is wondering what one can do with TikaIO versus someone just writing
a custom Beam function.

TikaIO, like any other bounded text reader, produces its data in order, but
the Beam runtime can deliver that data to the pipeline completely unordered.

I gave one example where we used the Tika output to save it all to 
Lucene (with the file name associated) and then search for the files 
which contain a certain word.

Tim, Chris, others, if you have some interesting examples to share where 
it did not matter in which order Tika-produced data were made eventually 
available, then please let me know, or reply directly to a Beam dev 
thread titled "TikaIO concerns".

Note, if Beam devs decide they don't want it then one option can be to 
create a tika-integrations/beam module and experiment there - I'm not 
saying it will need to be done but it's something that may be worth 
considering

Sergey
On 15/09/17 12:02, Sergey Beryozkin wrote:
> Hi Chris
> 
> thanks,
> 
> at the moment TikaIO (originally named TikaReader, as it can only read,
> but we renamed it to follow the convention) is a bounded reader, so you
> can, say, ask it to read
> 
> /files/*.pdf
> 
> and it will read all the N files there, and will end the run.
> 
> I'm not sure yet what the best strategy is for making it an unbounded
> reader that can continuously poll or be notified of new files
> becoming available... There are some ideas about scheduling bounded
> Beam pipelines, but I haven't looked yet...
> 
> In the short term, the simplest solution would be to create a new
> instance of the TikaIO pipeline and point it at the temp folder where a
> new batch of files has been dropped.
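
To make the "bounded" behaviour described above concrete, here is a small stdlib-only sketch, not TikaIO itself: it matches a glob such as `*.pdf` against a folder once, returns the N matching names, and ends. The class and method names (`BoundedBatch`, `listBatch`, `runDemo`) and the temp-folder layout are assumptions made for illustration; a real pipeline would hand each matched file to Tika instead of just listing it.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class BoundedBatch {

    /** Lists the files in dir that match the glob once, then returns (a bounded read). */
    public static List<String> listBatch(Path dir, String glob) {
        List<String> names = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, glob)) {
            for (Path file : stream) {
                names.add(file.getFileName().toString());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Directory iteration order is unspecified, so sort for a stable result.
        Collections.sort(names);
        return names;
    }

    /** Drops a small batch into a fresh temp folder and reads it back. */
    public static List<String> runDemo() {
        try {
            Path batchDir = Files.createTempDirectory("tika-batch");
            Files.createFile(batchDir.resolve("a.pdf"));
            Files.createFile(batchDir.resolve("b.pdf"));
            Files.createFile(batchDir.resolve("notes.txt"));
            return listBatch(batchDir, "*.pdf");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(runDemo()); // prints [a.pdf, b.pdf]
    }
}
```

Files that arrive after `listBatch` returns are simply not seen, which is why the short-term workaround is to start a new run per batch folder.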
> 
> Thanks, Sergey
> On 11/09/17 22:41, Mattmann, Chris A (3010) wrote:
>> Amazing work, thank you Sergey!!
>>
>> ++
>>
>> Chris Mattmann, Ph.D.
>> Principal Data Scientist, Engineering Administrative Office (3010)
>> Manager, NSF & Open Source Projects Formulation and Development 
>> Offices (8212)
>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>> Office: 180-503E, Mailstop: 180-503
>> Email: chris.a.mattm...@nasa.gov
>> WWW:  http://sunset.usc.edu/~mattmann/
>> ++
>>
>> Director, Information Retrieval and Data Science Group (IRDS)
>> Adjunct Associate Professor, Computer Science Department
>> University of Southern California, Los Angeles, CA 90089 USA
>> WWW: http://irds.usc.edu/
>> ++
>>
>>
>> On 9/11/17, 7:33 AM, "Allison, Timothy B." <talli...@mitre.org> wrote:
>>
>>  What great news!  Thank you, Sergey!!!
>>  -Original Message-
>>  From: Sergey Beryozkin [mailto:sberyoz...@gmail.com]
>>  Sent: Monday, September 11, 2017 9:18 AM
>>  To: Allison, Timothy B. <talli...@mitre.org>; dev@tika.apache.org
>>  Subject: Re: Integrating Tika with Apache Beam
>>  Hi Tim, All
>>  It took some time, but the Beam TikaIO component is finally in its
>> 2.2.0-SNAPSHOT master,
>>  https://github.com/apache/beam/tree/master/sdks/java/io/tika
>>  I've created a basic project which can help with running it quickly:
>>  https://github.com/sberyozkin/beamTikaExample
>>  One can just build it and run as suggested in Readme.md, simply 
>> have some PDF files for example, and point to one or all of them.
>>  By default, Beam will output the data to /tmp/tika.
>>  main() can be updated to support more options; they can be
>> collected from the command line either with TikaOptions:
>>  
 

Re: Tika 2.0?

2017-09-12 Thread Chris Mattmann
B it is, proceed (



On 9/12/17, 5:10 AM, "Allison, Timothy B." <talli...@mitre.org> wrote:

I'd strongly advocate for 2.  I _think_ the hard work was laying out the 
general structure and adding the ProxyParser workaround.  Copying and 
pasting/reworking into that structure will be: 

A) far less dangerous than 1 
And
B) we'll have a cleaner history.

On A), I know that we didn't add some major components including: 
configurability of parsers, completely cleaned up logging, numerous bug fixes 
and even entire modules (tika-dl).

On B), there were a few times where I "caught a parser up" in 2.0 not by 
individual commits based on the original history but based on a copy/paste from 
the contemporaneous master.  This obliterated the history of some commits on 
the 2.0 branch and would force us to look back at master.

-Original Message-
From: Bob Paulin [mailto:b...@bobpaulin.com] 
Sent: Monday, September 11, 2017 9:48 PM
To: dev@tika.apache.org
Subject: Re: Tika 2.0?

Just so it's clear, are we going to:

1) Rename the 2.0 branch over to master

or

2) Re-apply the changes on master. 

I recall Chris' preference was 1, which would be quicker.  However, there are
very likely missed patches.  2 will be more time-consuming, but it would be more
likely to include all the most recent code.  I'm open to either.  I'm not sure how
far out of date the 2.0 branch is, so I defer to Tim on the risk of going with #1.


- Bob


On 9/11/2017 5:15 PM, Chris Mattmann wrote:
> +1000
>
>
>
> On 9/11/17, 12:03 PM, "Allison, Timothy B." <talli...@mitre.org> wrote:
>
> Y, well, I didn't say _which_ September...
> 
> Given my limited availability to work on this in Sept and POI's
> decision to move to Java 1.8, I propose releasing Tika 1.17 after the release
> of POI 3.17 and PDFBox 2.0.8.  This would be the last version of Tika at the
> Java 1.7 level, and then we bump the Java requirement to 1.8, switch master to
> the 2.0 layout and create a 1.x maintenance branch (with Java 1.8) for quick
> critical bug fixes/security vulnerabilities until we release 2.0.
> 
> What do you all think?
> 
>  
> -Original Message-
> From: Allison, Timothy B. [mailto:talli...@mitre.org] 
> Sent: Monday, August 28, 2017 9:33 AM
> To: dev@tika.apache.org
> Subject: Tika 2.0?
> 
> All,
> 
>   We're getting some increasing deltas btwn the 2.0 and trunk
> branches.  Many of these are my fault; I gave up making updates to 2.0 around
> April/May, I think.
> 
>   What would people think of punting on some of the desired goals of
> 2.0 (e.g. chaining parsers, more structured but still simple metadata) and
> releasing 2.0 soonish...say 2.0-BETA end of September?
> 
>   We've been able to make some major improvements to Tika without
> breaking backwards compatibility.  We _might_ be able to do that with the
> outstanding issues for 2.0 when someone has time.
> 
>   We could also do the upgrade to jdk 8 with Tika 2.0.
> 
>   If this sounds reasonable, I propose creating a 1.x branch from
> trunk for 1.x maintenance and then reworking trunk to the 2.x structure that
> Bob Paulin so elegantly worked out.  I figure we can either copy/paste from
> trunk to the current 2.x (and _hope_ we get all the updates) or use Bob's 2.0
> as a model for restructuring trunk.  At this point, I'd prefer the second
> option.  The key here is to switch "trunk" to 2.0 so that we all have the
> mindset that 2.0 is what we're focused on.
> 
>   The main benefit of this proposal is that we'd have a more modular
> Tika soon.
> 
>What do you think?
> 
>  Best,
> 
>Tim
> 
>
>
>







Re: Tika 2.0?

2017-09-11 Thread Chris Mattmann
+1000



On 9/11/17, 12:03 PM, "Allison, Timothy B."  wrote:

Y, well, I didn't say _which_ September...

Given my limited availability to work on this in Sept and POI's decision to 
move to Java 1.8, I propose releasing Tika 1.17 after the release of POI 3.17 
and PDFBox 2.0.8.  This would be the last version of Tika at the Java 1.7 
level, and then we bump the Java requirement to 1.8, switch master to the 2.0 
layout and create a 1.x maintenance branch (with Java 1.8) for quick critical 
bug fixes/security vulnerabilities until we release 2.0.

What do you all think?

 
-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org] 
Sent: Monday, August 28, 2017 9:33 AM
To: dev@tika.apache.org
Subject: Tika 2.0?

All,

  We're getting some increasing deltas btwn the 2.0 and trunk branches.  
Many of these are my fault; I gave up making updates to 2.0 around April/May, I 
think.

  What would people think of punting on some of the desired goals of 2.0 
(e.g. chaining parsers, more structured but still simple metadata) and 
releasing 2.0 soonish...say 2.0-BETA end of September?

  We've been able to make some major improvements to Tika without breaking 
backwards compatibility.  We _might_ be able to do that with the outstanding 
issues for 2.0 when someone has time.

  We could also do the upgrade to jdk 8 with Tika 2.0.

  If this sounds reasonable, I propose creating a 1.x branch from trunk for 
1.x maintenance and then reworking trunk to the 2.x structure that Bob Paulin 
so elegantly worked out.  I figure we can either copy/paste from trunk to the 
current 2.x (and _hope_ we get all the updates) or use Bob's 2.0 as a model for 
restructuring trunk.  At this point, I'd prefer the second option.  The key 
here is to switch "trunk" to 2.0 so that we all have the mindset that 2.0 is 
what we're focused on.

   The main benefit of this proposal is that we'd have a more modular Tika 
soon.

   What do you think?

 Best,

   Tim





Re: [ANNOUNCE] Welcome Madhav Sharan as Tika Committer and PMC Member

2017-08-31 Thread Chris Mattmann
Welcome Madhav!

Cheers,
Chris




On 8/31/17, 12:29 PM, "loo...@gmail.com on behalf of Dave Meikle" 
 wrote:

Hello Everyone,

Please join me in welcoming Madhav Sharan as a PMC Member and Committer to
the project!

Welcome to the team, Madhav. Feel free to say a bit about yourself and
how you got involved in Tika.

Cheers,
Dave





Re: Query related to Apache Tika dependencies

2017-08-08 Thread Chris Mattmann
 

 

 

 

From: Deepanshu Bhardwaj 
Date: Tuesday, August 8, 2017 at 2:53 AM
To: "dev-ow...@tika.apache.org" 
Subject: Query related to Apache Tika dependencies

 

Hi Team,

 

I need some help. I need the list of libraries (jar files) that are used in
the Apache Tika app 1.14 jar, as this will play an important role in my
project.

 

Could you please share the list of libraries, or a link where I can find
it? It would be very helpful for me.

 

Thanks in advance.

 

 

Best Regards,
Deepanshu Bhardwaj | Sr. Software Engineer | Data Resolve Technologies.
Mobile: +91 99918 63211  | Office:  +91 120 4129871

Corporate Office: G-30, 3rd  Floor, Sector-3, Noida, INDIA, PIN 201301 

Registered Office: 2/F, Elegance, Mathura Road, Jasola, New Delhi, INDIA, PIN 
110025

Website:  www.dataresolve.com
_
(This e-mail message is intended for the above named recipient(s) only. It may 
contain confidential information that is privileged. If you are not the 
intended recipient, you are hereby notified that any dissemination, 
distribution or copying of this e-mail and any attachment(s) is strictly 
prohibited. If you have received this e-mail by error, please immediately 
notify the sender by replying to this e-mail and deleting the message including 
any attachment(s) from your system. Thank you in advance for your cooperation 
and assistance.)



Re: [VOTE] Release Apache Tika 1.16 Candidate #1

2017-07-08 Thread Chris Mattmann
+1 from me. SIGS and CHECKSUMS look good.

Thanks Tim!

Cheers,
Chris

LMC-053601:apache-tika-1.16-rc1 mattmann$ for type in "" \-app \-eval \-server; 
do $HOME/bin/stage_apache_rc tika$type 1.16 
https://dist.apache.org/repos/dist/dev/tika/; done
  % Total% Received % Xferd  Average Speed   TimeTime Time  Current
 Dload  Upload   Total   SpentLeft  Speed
100 53.5M  100 53.5M0 0  3992k  0  0:00:13  0:00:13 --:--:-- 5122k
  % Total% Received % Xferd  Average Speed   TimeTime Time  Current
 Dload  Upload   Total   SpentLeft  Speed
100   836  100   8360 0   1092  0 --:--:-- --:--:-- --:--:--  1092
  % Total% Received % Xferd  Average Speed   TimeTime Time  Current
 Dload  Upload   Total   SpentLeft  Speed
10034  100340 0 96  0 --:--:-- --:--:-- --:--:--96
  % Total% Received % Xferd  Average Speed   TimeTime Time  Current
 Dload  Upload   Total   SpentLeft  Speed
100 41.6M  100 41.6M0 0  6578k  0  0:00:06  0:00:06 --:--:-- 8297k
  % Total% Received % Xferd  Average Speed   TimeTime Time  Current
 Dload  Upload   Total   SpentLeft  Speed
100   836  100   8360 0   1012  0 --:--:-- --:--:-- --:--:--  1012
  % Total% Received % Xferd  Average Speed   TimeTime Time  Current
 Dload  Upload   Total   SpentLeft  Speed
10034  100340 0 46  0 --:--:-- --:--:-- --:--:--46
  % Total% Received % Xferd  Average Speed   TimeTime Time  Current
 Dload  Upload   Total   SpentLeft  Speed
100 56.4M  100 56.4M0 0  3950k  0  0:00:14  0:00:14 --:--:-- 4742k
  % Total% Received % Xferd  Average Speed   TimeTime Time  Current
 Dload  Upload   Total   SpentLeft  Speed
100   836  100   8360 0   1470  0 --:--:-- --:--:-- --:--:--  1469
  % Total% Received % Xferd  Average Speed   TimeTime Time  Current
 Dload  Upload   Total   SpentLeft  Speed
10034  100340 0 65  0 --:--:-- --:--:-- --:--:--65
LMC-053601:apache-tika-1.16-rc1 mattmann$ $HOME/bin/stage_apache_rc tika 
1.16-src https://dist.apache.org/repos/dist/dev/tika/
  % Total% Received % Xferd  Average Speed   TimeTime Time  Current
 Dload  Upload   Total   SpentLeft  Speed
100 84.2M  100 84.2M0 0  6563k  0  0:00:13  0:00:13 --:--:-- 5261k
  % Total% Received % Xferd  Average Speed   TimeTime Time  Current
 Dload  Upload   Total   SpentLeft  Speed
100   836  100   8360 0   2129  0 --:--:-- --:--:-- --:--:--  2127
  % Total% Received % Xferd  Average Speed   TimeTime Time  Current
 Dload  Upload   Total   SpentLeft  Speed
10034  100340 0 47  0 --:--:-- --:--:-- --:--:--47
LMC-053601:apache-tika-1.16-rc1 mattmann$ ls
tika-1.16-src.zip       tika-app-1.16.jar       tika-eval-1.16.jar      tika-server-1.16.jar
tika-1.16-src.zip.asc   tika-app-1.16.jar.asc   tika-eval-1.16.jar.asc  tika-server-1.16.jar.asc
tika-1.16-src.zip.md5   tika-app-1.16.jar.md5   tika-eval-1.16.jar.md5  tika-server-1.16.jar.md5
LMC-053601:apache-tika-1.16-rc1 mattmann$ $HOME/bin/verify_gpg_sigs
Verifying Signature for file tika-1.16-src.zip.asc
gpg: assuming signed data in `tika-1.16-src.zip'
gpg: Signature made Fri Jul  7 19:27:42 2017 PDT using RSA key ID EF0CF38A
gpg: Good signature from "Tim Allison (ASF signing key) "
gpg: WARNING: This key is not certified with a trusted signature!
gpg:  There is no indication that the signature belongs to the owner.
Primary key fingerprint: 833C 1CC4 926C 1DDE 29BB  8731 E403 2DC4 EF0C F38A
Verifying Signature for file tika-app-1.16.jar.asc
gpg: assuming signed data in `tika-app-1.16.jar'
gpg: Signature made Fri Jul  7 19:13:16 2017 PDT using RSA key ID EF0CF38A
gpg: Good signature from "Tim Allison (ASF signing key) "
gpg: WARNING: This key is not certified with a trusted signature!
gpg:  There is no indication that the signature belongs to the owner.
Primary key fingerprint: 833C 1CC4 926C 1DDE 29BB  8731 E403 2DC4 EF0C F38A
Verifying Signature for file tika-eval-1.16.jar.asc
gpg: assuming signed data in `tika-eval-1.16.jar'
gpg: Signature made Fri Jul  7 19:20:17 2017 PDT using RSA key ID EF0CF38A
gpg: Good signature from "Tim Allison (ASF signing key) "
gpg: WARNING: This key is not certified with a trusted signature!
gpg:  There is no 
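
The log above relies on two local helper scripts (`stage_apache_rc` and `verify_gpg_sigs`) whose contents are not shown, so the following is only a sketch of the checksum half of such a check, under the assumption that each `.md5` file starts with the hex digest of its artifact. `Md5Check` and its methods are hypothetical names; the GPG signature step is not covered here.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Md5Check {

    /** Returns the lowercase hex MD5 digest of the given bytes. */
    public static String md5Hex(byte[] data) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(data);
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    /**
     * Compares the data against the contents of a .md5 file. Only the first
     * whitespace-delimited token is used, so a trailing file name is ignored.
     */
    public static boolean matches(byte[] data, String md5FileContents) {
        String expected = md5FileContents.trim().split("\\s+")[0];
        return md5Hex(data).equalsIgnoreCase(expected);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: Md5Check <artifact> <artifact.md5>");
            return;
        }
        // Reads the whole artifact into memory, which is fine for a sketch.
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        String recorded = new String(
                Files.readAllBytes(Paths.get(args[1])), StandardCharsets.US_ASCII);
        System.out.println(matches(data, recorded) ? "checksum OK" : "checksum MISMATCH");
    }
}
```

It could be run as, for example, `java Md5Check tika-app-1.16.jar tika-app-1.16.jar.md5`, using the file names from the directory listing above.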

Re: [tika] branch master updated: TIKA-1988 -- allow for errors downloading models

2017-07-07 Thread Chris Mattmann
Hey Tim,

I usually do a search in JIRA, then I go to the upper right of the screen and 
select
“Bulk Change” from there. Then I Edit the fix version and push off those in my 
search scheduled for  but with resolution 

Hope that helps!

Cheers,
Chris




On 7/7/17, 11:31 AM, "Allison, Timothy B." <talli...@mitre.org> wrote:

Thank you, Chris!

Now, how do I bulk move open 1.16->1.17 on JIRA?  

-Original Message-----
From: Chris Mattmann [mailto:mattm...@apache.org] 
Sent: Friday, July 7, 2017 11:39 AM
To: dev@tika.apache.org
Subject: Re: [tika] branch master updated: TIKA-1988 -- allow for errors 
downloading models

Sure



On 7/7/17, 7:57 AM, "Allison, Timothy B." <talli...@mitre.org> wrote:

I'll leave the moving to a new module to you?

-Original Message-
From: Chris Mattmann [mailto:mattm...@apache.org] 
Sent: Friday, July 7, 2017 10:32 AM
To: dev@tika.apache.org; comm...@tika.apache.org
Subject: Re: [tika] branch master updated: TIKA-1988 -- allow for 
errors downloading models

Great Tim thanks!




On 7/7/17, 7:28 AM, "talli...@apache.org" <talli...@apache.org> wrote:

This is an automated email from the ASF dual-hosted git repository.

tallison pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/tika.git


The following commit(s) were added to refs/heads/master by this 
push:
 new 632f52d  TIKA-1988 -- allow for errors downloading models
632f52d is described below

commit 632f52db4713977aa93504517e57b8afe86e6e91
Author: tballison <talli...@mitre.org>
AuthorDate: Fri Jul 7 10:28:48 2017 -0400

TIKA-1988 -- allow for errors downloading models
---
 .../tika/parser/recognition/AgeRecogniserConfig.java   | 18 
--
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git 
a/tika-parsers/src/main/java/org/apache/tika/parser/recognition/AgeRecogniserConfig.java
 
b/tika-parsers/src/main/java/org/apache/tika/parser/recognition/AgeRecogniserConfig.java
index 84c1f3e..92427f4 100644
--- 
a/tika-parsers/src/main/java/org/apache/tika/parser/recognition/AgeRecogniserConfig.java
+++ 
b/tika-parsers/src/main/java/org/apache/tika/parser/recognition/AgeRecogniserConfig.java
@@ -17,7 +17,9 @@
 
 package org.apache.tika.parser.recognition;
 
+import java.net.URL;
 import java.util.Map;
+
 import org.apache.tika.config.Param;
 
 
@@ -30,8 +32,20 @@ public class AgeRecogniserConfig {
private String pathClassifyRegression = null;
 
public AgeRecogniserConfig(Map<String, Param> params) {
-   
setPathClassifyModel(AgeRecogniserConfig.class.getResource(params.get("age.path.classify").getValue().toString()).getFile());
-   
setPathClassifyRegression(AgeRecogniserConfig.class.getResource(params.get("age.path.regression").getValue().toString()).getFile());
+
+   URL classifyUrl = AgeRecogniserConfig.class.getResource(
+   
params.get("age.path.classify").getValue().toString());
+
+   if (classifyUrl != null) {
+   setPathClassifyModel(classifyUrl.getFile());
+   }
+
+   URL regressionUrl = 
AgeRecogniserConfig.class.getResource(
+   
params.get("age.path.regression").getValue().toString());
+
+   if (regressionUrl != null) {
+   
setPathClassifyRegression(regressionUrl.getFile());
+   }
}
 
public String getPathClassifyModel() {

-- 
To stop receiving notification emails like this one, please contact
['"comm...@tika.apache.org" <comm...@tika.apache.org>'].











Re: [tika] branch master updated: TIKA-1988 -- allow for errors downloading models

2017-07-07 Thread Chris Mattmann
Sure



On 7/7/17, 7:57 AM, "Allison, Timothy B." <talli...@mitre.org> wrote:

I'll leave the moving to a new module to you?

-Original Message-
From: Chris Mattmann [mailto:mattm...@apache.org] 
Sent: Friday, July 7, 2017 10:32 AM
To: dev@tika.apache.org; comm...@tika.apache.org
Subject: Re: [tika] branch master updated: TIKA-1988 -- allow for errors 
downloading models

Great Tim thanks!




On 7/7/17, 7:28 AM, "talli...@apache.org" <talli...@apache.org> wrote:

This is an automated email from the ASF dual-hosted git repository.

tallison pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/tika.git


The following commit(s) were added to refs/heads/master by this push:
 new 632f52d  TIKA-1988 -- allow for errors downloading models
632f52d is described below

commit 632f52db4713977aa93504517e57b8afe86e6e91
Author: tballison <talli...@mitre.org>
AuthorDate: Fri Jul 7 10:28:48 2017 -0400

TIKA-1988 -- allow for errors downloading models
---
 .../tika/parser/recognition/AgeRecogniserConfig.java   | 18 
--
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git 
a/tika-parsers/src/main/java/org/apache/tika/parser/recognition/AgeRecogniserConfig.java
 
b/tika-parsers/src/main/java/org/apache/tika/parser/recognition/AgeRecogniserConfig.java
index 84c1f3e..92427f4 100644
--- 
a/tika-parsers/src/main/java/org/apache/tika/parser/recognition/AgeRecogniserConfig.java
+++ 
b/tika-parsers/src/main/java/org/apache/tika/parser/recognition/AgeRecogniserConfig.java
@@ -17,7 +17,9 @@
 
 package org.apache.tika.parser.recognition;
 
+import java.net.URL;
 import java.util.Map;
+
 import org.apache.tika.config.Param;
 
 
@@ -30,8 +32,20 @@ public class AgeRecogniserConfig {
private String pathClassifyRegression = null;
 
public AgeRecogniserConfig(Map<String, Param> params) {
-   
setPathClassifyModel(AgeRecogniserConfig.class.getResource(params.get("age.path.classify").getValue().toString()).getFile());
-   
setPathClassifyRegression(AgeRecogniserConfig.class.getResource(params.get("age.path.regression").getValue().toString()).getFile());
+
+   URL classifyUrl = AgeRecogniserConfig.class.getResource(
+   
params.get("age.path.classify").getValue().toString());
+
+   if (classifyUrl != null) {
+   setPathClassifyModel(classifyUrl.getFile());
+   }
+
+   URL regressionUrl = 
AgeRecogniserConfig.class.getResource(
+   
params.get("age.path.regression").getValue().toString());
+
+   if (regressionUrl != null) {
+   
setPathClassifyRegression(regressionUrl.getFile());
+   }
}
 
public String getPathClassifyModel() {

-- 
To stop receiving notification emails like this one, please contact
['"comm...@tika.apache.org" <comm...@tika.apache.org>'].








Re: [tika] branch master updated: TIKA-1988 -- allow for errors downloading models

2017-07-07 Thread Chris Mattmann
Great Tim thanks!




On 7/7/17, 7:28 AM, "talli...@apache.org"  wrote:

This is an automated email from the ASF dual-hosted git repository.

tallison pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/tika.git


The following commit(s) were added to refs/heads/master by this push:
 new 632f52d  TIKA-1988 -- allow for errors downloading models
632f52d is described below

commit 632f52db4713977aa93504517e57b8afe86e6e91
Author: tballison 
AuthorDate: Fri Jul 7 10:28:48 2017 -0400

TIKA-1988 -- allow for errors downloading models
---
 .../tika/parser/recognition/AgeRecogniserConfig.java   | 18 
--
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/recognition/AgeRecogniserConfig.java b/tika-parsers/src/main/java/org/apache/tika/parser/recognition/AgeRecogniserConfig.java
index 84c1f3e..92427f4 100644
--- a/tika-parsers/src/main/java/org/apache/tika/parser/recognition/AgeRecogniserConfig.java
+++ b/tika-parsers/src/main/java/org/apache/tika/parser/recognition/AgeRecogniserConfig.java
@@ -17,7 +17,9 @@
 
 package org.apache.tika.parser.recognition;
 
+import java.net.URL;
 import java.util.Map;
+
 import org.apache.tika.config.Param;
 
 
@@ -30,8 +32,20 @@ public class AgeRecogniserConfig {
     private String pathClassifyRegression = null;
 
     public AgeRecogniserConfig(Map<String, Param> params) {
-        setPathClassifyModel(AgeRecogniserConfig.class.getResource(params.get("age.path.classify").getValue().toString()).getFile());
-        setPathClassifyRegression(AgeRecogniserConfig.class.getResource(params.get("age.path.regression").getValue().toString()).getFile());
+
+        URL classifyUrl = AgeRecogniserConfig.class.getResource(
+                params.get("age.path.classify").getValue().toString());
+
+        if (classifyUrl != null) {
+            setPathClassifyModel(classifyUrl.getFile());
+        }
+
+        URL regressionUrl = AgeRecogniserConfig.class.getResource(
+                params.get("age.path.regression").getValue().toString());
+
+        if (regressionUrl != null) {
+            setPathClassifyRegression(regressionUrl.getFile());
+        }
     }
 
     public String getPathClassifyModel() {

-- 
To stop receiving notification emails like this one, please contact
['"comm...@tika.apache.org" '].
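
The commit above guards against `Class.getResource` returning null when a model file is not on the classpath. A minimal standalone sketch of the same pattern follows; `ResourceLookup`, `resolveOrNull` and the `/models/not-downloaded.bin` path are hypothetical and are not part of Tika.

```java
import java.net.URL;

public class ResourceLookup {

    /** Returns the resource's file path, or null when it is not on the classpath. */
    public static String resolveOrNull(Class<?> anchor, String resourcePath) {
        URL url = anchor.getResource(resourcePath);
        // getResource returns null for a missing resource, so calling
        // url.getFile() without this check would throw a NullPointerException.
        return (url != null) ? url.getFile() : null;
    }

    public static void main(String[] args) {
        String path = resolveOrNull(ResourceLookup.class, "/models/not-downloaded.bin");
        System.out.println(path == null ? "model missing, continuing without it" : path);
    }
}
```

As in the commit, the caller is left with a null path when the model is absent, so any later use of that path still has to cope with null.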





Re: Tika 1.15.1? -> 1.16

2017-07-07 Thread Chris Mattmann
OK Tim / all, TIKA-1988 is done! Age resolution is in.

Enjoy and proceed with the release, please +1.

Cheers,
Chris




On 7/5/17, 8:37 PM, "Luís Filipe Nassif" <lfcnas...@gmail.com> wrote:

Hi Tim,

From a quick look, Nick's fix on TIKA-2419 seems conservative to me,
restricted to corrupted XML, so I think there is no need to rerun the
regression tests.

So +1 from me, ++1 with age detection :)

2017-07-05 22:35 GMT-03:00 Allison, Timothy B. <talli...@mitre.org>:

> All,
>   I'm waiting to get some resolution on TIKA-2399.  The regression tests
> came back with nothing surprising.  I fixed the npe that they uncovered in
> the new ppt macro extraction code.
>   Will I need to rerun with the updates to mime detection that Nick just
> made?  Or are we good enough to go once we figure out what we can do w
> TIKA-2399?
>
>   Onward.
>
>Cheers,
>  Tim
>
> -Original Message-
> From: Allison, Timothy B. [mailto:talli...@mitre.org]
> Sent: Monday, July 3, 2017 2:35 PM
> To: dev@tika.apache.org
> Subject: RE: Tika 1.15.1? -> 1.16
>
> Sounds good. I'll kick off regression tests now, with a goal of creating
> 1.16-rc1 on Wednesday 14:00 UTC?
>
> -Original Message-
> From: Mattmann, Chris A (3010) [mailto:chris.a.mattm...@jpl.nasa.gov]
> Sent: Monday, July 3, 2017 2:24 PM
> To: dev@tika.apache.org
> Subject: Re: Tika 1.15.1? -> 1.16
>
> Hey Tim, if I don’t get it done by today, push 1.16 and we’ll put Age
> Detection in 1.17.
>
> ++
> Chris Mattmann, Ph.D.
> Principal Data Scientist, Engineering Administrative Office (3010)
> Manager, NSF & Open Source Projects Formulation and Development Offices
> (8212) NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 180-503E, Mailstop: 180-503
> Email: chris.a.mattm...@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++
> Director, Information Retrieval and Data Science Group (IRDS) Adjunct
> Associate Professor, Computer Science Department University of Southern
> California, Los Angeles, CA 90089 USA
> WWW: http://irds.usc.edu/
> ++
>
>
> On 7/3/17, 7:17 AM, "Allison, Timothy B." <talli...@mitre.org> wrote:
>
> All,
>   I think we're now solidly at 1.16.  Anyone still strongly in favor
> of 1.15.1?
>
> Chris,
>   Will age detection be ready soon, or should we push that to 1.17?
>
> -Original Message-
> From: Allison, Timothy B. [mailto:talli...@mitre.org]
> Sent: Friday, June 30, 2017 7:01 AM
> To: dev@tika.apache.org; lfcnas...@gmail.com
> Subject: RE: Tika 1.15.1? -> 1.16
>
> Y, I was thinking that I may have already pushed us over this
> threshold with the * below.  1.16 it is then?
>
> Chris, let us know when the age detection is good to go or if 1.17 is
> a better target.
>
>
>   * Allow extraction of scripts as embedded "MACRO". Users
> must turn this on via TikaConfig (TIKA-2391).
>
>   * Allow users to turn off extraction of headers and footers
> from .doc, .docx, .xls, .xlsx, .xlsb (TIKA-2362)
>
>   * Extract text from charts in .docx, .pptx, .xlsx and .xlsb
> (TIKA-2254).
>
>   * Extract text from diagrams in .docx, .pptx, .xlsx and .xlsb
> (TIKA-1945).
>
>   * Enable base32 encoding of digests and enable BouncyCastle
> implementations
> of digest algorithms (TIKA-2386).
>
> -Original Message-
> From: Luís Filipe Nassif [mailto:lfcnas...@gmail.com]
> Sent: Thursday, June 29, 2017 4:12 PM
> To: dev@tika.apache.org
> Subject: Re: Tika 1.15.1?
>
> Agreed.
>
> Luis
>
>
> 2017-06-29 15:45 GMT-03:00 Bob Paulin <b...@bobpaulin.com>:
>
> > If we're adding features does it make sense just to bump to 1.16
> > rather than 1.15.1?  Traditionally point releases would be bug fixes
> only [1].
> >
> >
> > - Bob
> >
> > [1] http://semver.org/
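
The rule being discussed (new features call for a minor bump, bug-fix-only releases for a patch bump) can be sketched as follows. This is only an illustration of the semver convention linked above, with hypothetical names; Tika itself writes "1.16" rather than the three-part "1.16.0" that this sketch returns.

```java
public class NextVersion {

    /** Picks the next version number following the semver convention. */
    public static String next(int major, int minor, int patch,
                              boolean breakingChanges, boolean newFeatures) {
        if (breakingChanges) {
            return (major + 1) + ".0.0";              // incompatible API changes
        }
        if (newFeatures) {
            return major + "." + (minor + 1) + ".0";  // backwards-compatible features
        }
        return major + "." + minor + "." + (patch + 1); // bug fixes only
    }

    public static void main(String[] args) {
        // Features were added on top of 1.15, so the next release is a minor bump.
        System.out.println(next(1, 15, 0, false, true));  // prints 1.16.0
        // Had the release contained only bug fixes, it would be a patch bump.
        System.out.println(next(1, 15, 0, false, false)); // prints 1.15.1
    }
}
```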