[jira] [Commented] (TIKA-4181) Grpc + Tika Pipes - pipe iterator and emitter

2024-03-29 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17832336#comment-17832336 ] ASF GitHub Bot commented on TIKA-4181: -- bartek commented on code in PR #1702: URL:

[jira] [Commented] (TIKA-4181) Grpc + Tika Pipes - pipe iterator and emitter

2024-03-29 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17832337#comment-17832337 ] ASF GitHub Bot commented on TIKA-4181: -- bartek commented on code in PR #1702: URL:

Re: [PR] TIKA-4181 - Tika Pipes Grpc Server [tika]

2024-03-29 Thread via GitHub
bartek commented on code in PR #1702: URL: https://github.com/apache/tika/pull/1702#discussion_r1544981545 ## tika-pipes/tika-grpc/src/main/proto/tika.proto: ## Review Comment: For your consideration @nddipiazza, I ran `buf lint` on this protobuf (as I am syncing it to a

[jira] [Commented] (TIKA-4231) Parsing Arabic PDF is returning bad data

2024-03-29 Thread Aamir (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17832293#comment-17832293 ] Aamir commented on TIKA-4231: - No, this doesn't look better. Actually, I would say that it looks worse than

[jira] [Commented] (TIKA-4231) Parsing Arabic PDF is returning bad data

2024-03-29 Thread Tilman Hausherr (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17832291#comment-17832291 ] Tilman Hausherr commented on TIKA-4231: --- I have attached an extraction with pdfbox 2.0.31:

[jira] [Updated] (TIKA-4231) Parsing Arabic PDF is returning bad data

2024-03-29 Thread Tilman Hausherr (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-4231: Attachment: arabic-pdfbox.txt

[jira] [Updated] (TIKA-4231) Parsing Arabic PDF is returning bad data

2024-03-29 Thread Aamir (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aamir updated TIKA-4231: Affects Version/s: 2.9.1

[jira] [Updated] (TIKA-4231) Parsing Arabic PDF is returning bad data

2024-03-29 Thread Aamir (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aamir updated TIKA-4231: Description: Attached is a PDF with Arabic text in it. When parsed using Tika version 2.6.0 or 2.9.1, it produces

[jira] [Commented] (TIKA-4231) Parsing Arabic PDF is returning bad data

2024-03-29 Thread Aamir (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17832289#comment-17832289 ] Aamir commented on TIKA-4231: - The problem persists with 2.9.1 I am updating the versions in this ticket as

[jira] [Commented] (TIKA-4231) Parsing Arabic PDF is returning bad data

2024-03-29 Thread Tilman Hausherr (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17832284#comment-17832284 ] Tilman Hausherr commented on TIKA-4231: --- This doesn't change my argument. The latest version is

[jira] [Updated] (TIKA-4231) Parsing Arabic PDF is returning bad data

2024-03-29 Thread Aamir (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aamir updated TIKA-4231: Description: Attached is a PDF with Arabic text in it. When parsed using Tika version 2.6.0, it produces gibberish

[jira] [Commented] (TIKA-4231) Parsing Arabic PDF is returning bad data

2024-03-29 Thread Aamir (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17832260#comment-17832260 ] Aamir commented on TIKA-4231: - Sorry, I meant tika-parsers-standard-package 2.6.0 > Parsing Arabic PDF is

[jira] [Commented] (TIKA-4231) Parsing Arabic PDF is returning bad data

2024-03-29 Thread Tilman Hausherr (Jira)
[ https://issues.apache.org/jira/browse/TIKA-4231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17832258#comment-17832258 ] Tilman Hausherr commented on TIKA-4231: --- The current tika version is 2.9.1, soon to be 2.9.2. There

[jira] [Created] (TIKA-4231) Parsing Arabic PDF is returning bad data

2024-03-29 Thread Aamir (Jira)
Aamir created TIKA-4231: --- Summary: Parsing Arabic PDF is returning bad data Key: TIKA-4231 URL: https://issues.apache.org/jira/browse/TIKA-4231 Project: Tika Issue Type: Bug Affects Versions:

[PR] Tika 4181 grpc [tika]

2024-03-29 Thread via GitHub
nddipiazza opened a new pull request, #1702: URL: https://github.com/apache/tika/pull/1702 Add an Apache Tika GRPC Server -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

Re: JUnit4 dependency with Grpc

2024-03-29 Thread Nicholas DiPiazza
Never mind, I found a way to make it work with JUnit 5 after some googling. On Fri, Mar 29, 2024 at 3:01 AM Nicholas DiPiazza <nicholas.dipia...@gmail.com> wrote: > Is there some easy way I can relax the JUnit4 ban for the gRPC service?

JUnit4 dependency with Grpc

2024-03-29 Thread Nicholas DiPiazza
Is there some easy way I can relax the JUnit4 ban for the gRPC service?

[jira] [Commented] (TIKA-2696) Support output of Tesseract OSD output for psm mode 0

2024-03-29 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/TIKA-2696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17832055#comment-17832055 ] ASF GitHub Bot commented on TIKA-2696: -- THausherr commented on PR #246: URL:

Re: [PR] TIKA-2696 Add support for OSD output, contributed by @4U6U57 [tika]

2024-03-29 Thread via GitHub
THausherr commented on PR #246: URL: https://github.com/apache/tika/pull/246#issuecomment-2026763252 This is a closed issue from years ago; please ask this on the users mailing list (don't forget to subscribe) or on stackoverflow.com.

Re: [PR] Bump commons-io:commons-io from 2.15.1 to 2.16.0 [tika]

2024-03-29 Thread via GitHub
THausherr merged PR #1701: URL: https://github.com/apache/tika/pull/1701

[jira] [Commented] (TIKA-2696) Support output of Tesseract OSD output for psm mode 0

2024-03-29 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/TIKA-2696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17832050#comment-17832050 ] ASF GitHub Bot commented on TIKA-2696: -- Tarik37 commented on PR #246: URL:

Re: [PR] TIKA-2696 Add support for OSD output, contributed by @4U6U57 [tika]

2024-03-29 Thread via GitHub
Tarik37 commented on PR #246: URL: https://github.com/apache/tika/pull/246#issuecomment-2026729362 Hello, I am currently using the Tika 2.9.1 server version and need the output of the OSD in my metadata, particularly the value of the script (Latin, Cyrillic, etc.). So my questions are the