Re: Tika documentation?

2022-09-01 Thread Mark Kerzner SHMsoft, Inc.
Hi, A few questions, please. You can point me to the answers I should read - if they exist. 1. I am 'kerzner' on the cwiki. How do I edit, for example, this page, https://tika.apache.org/2.4.1/examples.html? 2. Should I fork all branches when I fork this, https://github.com/apache/tika? Which is

Re: Tika documentation?

2022-09-01 Thread Mark Kerzner SHMsoft, Inc.
Nick, username is kerzner. Thank you, Mark On Thu, Sep 1, 2022 at 10:54 AM Nick Burch wrote: > On Thu, 1 Sep 2022, Mark Kerz

Re: Tika documentation?

2022-09-01 Thread Nick Burch
On Thu, 1 Sep 2022, Mark Kerzner SHMsoft, Inc. wrote: > Yes, please. If I make some changes, I will start with small ones. I will > also verify them with you. Great, thanks in advance for your contributions! Can you please head to https://cwiki.apache.org/confluence/display/tika/ , click Sign Up

Re: Tika documentation?

2022-09-01 Thread Tim Allison
I regret that when I'm stuck on configuration or surprises, I typically look to the unit tests. :( We need all the help we can get in updating and filling out the documentation. Thank you! Cheers, Tim On Thu, Sep 1, 2022 at 11:42 AM Tim Allison wrote: > +1 > > On Thu, Sep 1, 2022 at

Re: Tika documentation?

2022-09-01 Thread Tim Allison
+1 On Thu, Sep 1, 2022 at 11:22 AM Mark Kerzner SHMsoft, Inc. < mark.kerz...@shmsoft.com> wrote: > Tim, > > Yes, please. If I make some changes, I will start with small ones. I will > also verify them with you. > > Thank you, > Mark > > > Mark Kerzner, SHMsoft , > Book a call

Re: Tika documentation?

2022-09-01 Thread Mark Kerzner SHMsoft, Inc.
Tim, Yes, please. If I make some changes, I will start with small ones. I will also verify them with you. Thank you, Mark O

Re: Tika documentation?

2022-09-01 Thread Tim Allison
I don't disagree. Let us know if you'd like write permissions on the wiki. On Thu, Sep 1, 2022 at 10:01 AM Mark Kerzner SHMsoft, Inc. < mark.kerz...@shmsoft.com> wrote: > Hi, > > I am reviewing Tika documentation, and I am finding it out of date. The > latest books are 5-7 years old, and the wik

Tika documentation?

2022-09-01 Thread Mark Kerzner SHMsoft, Inc.
Hi, I am reviewing the Tika documentation, and I am finding it out of date. The latest books are 5-7 years old, and the wiki, for example, has outdated examples. For instance, ParsingExample is mentioned here, https://tika.apache.org/2.4.1/examples.html, but it has been taken out of 2.4.1. So may I su

Re: .TesseractOCRParser does not extract text although Tesseract does

2022-09-01 Thread Tim Allison
Ugh. I think you just ran into: https://issues.apache.org/jira/browse/TIKA-3812 This will be fixed in the next release, hopefully out next week. The problem is that gdal is taking precedence over the ImageParser, and the gdal parser doesn't know about OCR. On Thu, Sep 1, 2022 at 7:43 AM David P
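Until the fixed release is available, one possible workaround is to keep the GDAL parser out of the default parser set so that it can no longer claim image files. The thread itself does not confirm this approach; the fragment below is a minimal sketch of a tika-config.xml that assumes the standard `parser-exclude` mechanism of `DefaultParser` and uses the GDAL class name that appears in the metadata dump in this thread.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: excludes the GDAL parser from DefaultParser so that
     image files fall through to the image/OCR parsers instead. -->
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.gdal.GDALParser"/>
    </parser>
  </parsers>
</properties>
```

Such a file would typically be loaded with `new TikaConfig(path)` and passed to `new AutoDetectParser(config)`. Whether this restores OCR output in a given setup still depends on Tesseract being installed and on which parser modules are on the classpath.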

Re: .TesseractOCRParser does not extract text although Tesseract does

2022-09-01 Thread David Pilato
Here is the content of the metadata object:
X-TIKA:Parsed-By=org.apache.tika.parser.DefaultParser
X-TIKA:Parsed-By=org.apache.tika.parser.gdal.GDALParser
X-TIKA:Parsed-By-Full-Set=org.apache.tika.parser.DefaultParser
X-TIKA:Parsed-By-Full-Set=org.apache.tika.parser.gdal.GDALParser
Content-Type=im
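A dump like this one can be read mechanically to see which parser actually handled the file. The following is a small, standard-library-only sketch of that check; the class `ParsedByCheck` and its methods are hypothetical helpers written for illustration and are not part of Tika. It works on the `key=value` lines exactly as posted, not on a live `Metadata` object.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not a Tika class): inspects a "key=value" metadata dump.
final class ParsedByCheck {

    // Key under which the dump above lists the parsers that handled the file.
    static final String PARSED_BY = "X-TIKA:Parsed-By";

    // Groups "key=value" lines into key -> values, keeping repeated keys.
    static Map<String, List<String>> group(List<String> lines) {
        Map<String, List<String>> grouped = new LinkedHashMap<>();
        for (String line : lines) {
            int eq = line.indexOf('=');
            if (eq < 0) {
                continue; // skip lines that are not key=value
            }
            grouped.computeIfAbsent(line.substring(0, eq), k -> new ArrayList<>())
                   .add(line.substring(eq + 1));
        }
        return grouped;
    }

    // True if any recorded parser class name ends with the given simple name.
    static boolean wasParsedBy(Map<String, List<String>> metadata, String simpleName) {
        for (String cls : metadata.getOrDefault(PARSED_BY, List.of())) {
            if (cls.endsWith("." + simpleName)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // The complete lines from the dump quoted above.
        List<String> dump = List.of(
            "X-TIKA:Parsed-By=org.apache.tika.parser.DefaultParser",
            "X-TIKA:Parsed-By=org.apache.tika.parser.gdal.GDALParser",
            "X-TIKA:Parsed-By-Full-Set=org.apache.tika.parser.DefaultParser",
            "X-TIKA:Parsed-By-Full-Set=org.apache.tika.parser.gdal.GDALParser");
        Map<String, List<String>> metadata = group(dump);
        System.out.println("GDALParser: " + wasParsedBy(metadata, "GDALParser"));
        System.out.println("TesseractOCRParser: " + wasParsedBy(metadata, "TesseractOCRParser"));
    }
}
```

For the lines shown, the check reports that GDALParser handled the file and TesseractOCRParser did not, which matches the symptom described in this thread: the image never reached the OCR parser.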

Re: .TesseractOCRParser does not extract text although Tesseract does

2022-09-01 Thread Tim Allison
And what is recorded in the X-TIKA:Parsed-By value in the metadata object? On Thu, Sep 1, 2022 at 5:36 AM Tim Allison wrote: > What are your dependencies? Which parsers are in AutoDetectParser? > > On Thu, Sep 1, 2022 at 4:38 AM David Pilato wrote: > >> Hey team >> >> >> I'm wondering what's wr

Re: .TesseractOCRParser does not extract text although Tesseract does

2022-09-01 Thread Tim Allison
What are your dependencies? Which parsers are in AutoDetectParser? On Thu, Sep 1, 2022 at 4:38 AM David Pilato wrote: > Hey team > > > I'm wondering what's wrong with my config. > I'm running this very basic piece of code: > > @Test > public void testTika() throws TikaException, IOException, SAX

.TesseractOCRParser does not extract text although Tesseract does

2022-09-01 Thread David Pilato
Hey team

I'm wondering what's wrong with my config.
I'm running this very basic piece of code:

    @Test
    public void testTika() throws TikaException, IOException, SAXException {
        BodyContentHandler handler = new BodyContentHandler(new WriteOutContentHandler(1000));
        new AutoDetectParser().parse(