RE: Extracting Text from embedded images in PDF docs

2017-05-23 Thread Allison, Timothy B.
Subject: Re: Extracting Text from embedded images in PDF docs Hi Tim Sure, once I have an initial PR ready I'll send an update explaining what I did as a start, and we can discuss it further

Re: Extracting Text from embedded images in PDF docs

2017-05-19 Thread Sergey Beryozkin
. :) -Original Message- From: Sergey Beryozkin [mailto:sberyoz...@gmail.com] Sent: Friday, May 19, 2017 12:40 PM To: user@tika.apache.org Subject: Re: Extracting Text from embedded images in PDF docs Hi Tim On 19/05/17 17:31, Allison, Timothy B. wrote: The autoscaling feature of Beam and the job

RE: Extracting Text from embedded images in PDF docs

2017-05-19 Thread Allison, Timothy B.
This is fantastic news! Let me know if I can help...I know _nothing_ about Beam, tho. :) -Original Message- From: Sergey Beryozkin [mailto:sberyoz...@gmail.com] Sent: Friday, May 19, 2017 12:40 PM To: user@tika.apache.org Subject: Re: Extracting Text from embedded images in PDF docs

Re: Extracting Text from embedded images in PDF docs

2017-05-19 Thread Sergey Beryozkin
-2328 It will take me a few more weeks to create a PR. Thanks, Sergey -Original Message- From: Sergey Beryozkin [mailto:sberyoz...@gmail.com] Sent: Friday, May 19, 2017 12:27 PM To: user@tika.apache.org Subject: Re: Extracting Text from embedded images in PDF docs Hi Chris I'm getting

RE: Extracting Text from embedded images in PDF docs

2017-05-19 Thread Allison, Timothy B.
age- From: Sergey Beryozkin [mailto:sberyoz...@gmail.com] Sent: Friday, May 19, 2017 12:27 PM To: user@tika.apache.org Subject: Re: Extracting Text from embedded images in PDF docs Hi Chris I'm getting nervous now, what will happen to me if it does not work out in the end :-). Though, it actually d

Re: Extracting Text from embedded images in PDF docs

2017-05-19 Thread Chris Mattmann
On 5/19/17, 9:27 AM, "Sergey Beryozkin" wrote: Hi Chris I'm getting nervous now, what will happen to me if it does not work out in the end :-). Though, it actually does work, for me at least :-) Cheers, Sergey On 19/05/17 17:23, Mattmann,

RE: Extracting Text from embedded images in PDF docs

2017-05-19 Thread Allison, Timothy B.
> Well, I'm trying to integrate Tika with Apache Beam,
Awesome! I saw two fantastic Beam talks at ApacheCon (two days ago?). I won't tell anyone. ;)

Re: Extracting Text from embedded images in PDF docs

2017-05-19 Thread Sergey Beryozkin
Hi Chris, I'm getting nervous now: what will happen to me if it does not work out in the end :-). Though, it actually does work, for me at least :-) Cheers, Sergey On 19/05/17 17:23, Mattmann, Chris A (3010) wrote: Thanks Sergey, what an awesome surprise, you are the best!

Re: Extracting Text from embedded images in PDF docs

2017-05-19 Thread Sergey Beryozkin
Hi Tim On 19/05/17 16:47, Allison, Timothy B. wrote: Yes I was asking about it as I thought it was confusing it did not work - I saw you following up on this possible issue in the other email... Y, I agree. That _should_ work. I'm doing some work with Tika now so it was of an immediate

RE: Extracting Text from embedded images in PDF docs

2017-05-19 Thread Allison, Timothy B.
>Yes I was asking about it as I thought it was confusing it did not work - I saw you following up on this possible issue in the other email...
Y, I agree. That _should_ work.
>I'm doing some work with Tika now so it was of an immediate interest to me...
Yay! What are you working on?
>Sure. By

Re: Extracting Text from embedded images in PDF docs

2017-05-19 Thread Sergey Beryozkin
On 19/05/17 16:25, Allison, Timothy B. wrote: and when is "extractInlineImages" actually effective ? Not sure I understand the question exactly? If the question is "why didn't extractInlineImages work on a specific document"? That's probably a bug or could be user error in the

RE: Extracting Text from embedded images in PDF docs

2017-05-19 Thread Allison, Timothy B.
>>and when is "extractInlineImages" actually effective?
Not sure I understand the question exactly. If the question is "why didn't extractInlineImages work on a specific document?", that's probably a bug or could be user error in the configuration...either way, please follow up and help us

RE: Extracting Text from embedded images in PDF docs

2017-05-19 Thread Allison, Timothy B.
at the very beginning of integrating OCR with PDFs. We’d like to add a strategy that applies OCR on a given page if, say, < 10 words are extracted from the text…WDYT? From: David Pilato [mailto:da...@pilato.fr] Sent: Friday, May 19, 2017 5:55 AM To: user@tika.apache.org Subject: Re: Extracting Text f
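The per-page fallback floated in this message (run OCR only when plain text extraction yields fewer than about 10 words) can be illustrated with a minimal sketch. This is not Tika's implementation; the class name, method names, and the threshold constant are all hypothetical, and a word is assumed to be any whitespace-separated token.

```java
import java.util.List;

public class OcrFallbackSketch {
    // Hypothetical threshold, taken from the "say, < 10 words" figure in the message.
    static final int MIN_WORDS = 10;

    // Counts whitespace-separated tokens in the text extracted from one page.
    static int countWords(String pageText) {
        if (pageText == null) {
            return 0;
        }
        String trimmed = pageText.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    // True when the page yielded too little text, so OCR would be worth trying.
    static boolean needsOcr(String pageText) {
        return countWords(pageText) < MIN_WORDS;
    }

    public static void main(String[] args) {
        // Stand-ins for the text extracted from each page of a PDF.
        List<String> pages = List.of(
                "This page has a regular text layer with clearly more than ten words in it.",
                "Scan 0042",
                "");
        for (int i = 0; i < pages.size(); i++) {
            System.out.println("page " + (i + 1) + ": needsOcr=" + needsOcr(pages.get(i)));
        }
    }
}
```

A real strategy would also have to decide whether the OCR output replaces or supplements the extracted text on such pages; the sketch only covers the decision itself.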

Re: Extracting Text from embedded images in PDF docs

2017-05-19 Thread Sergey Beryozkin
Extracting Text from embedded images in PDF docs Got it working. In case someone else hits the same issue, here is my config file... Well... That was obvious :D <properties> <parsers> <parser class="org.apache.tika.parser.DefaultParser"/> <parser cl

RE: Extracting Text from embedded images in PDF docs

2017-05-19 Thread Allison, Timothy B.
the documentation so that you don’t waste an hour? From: David Pilato [mailto:da...@pilato.fr] Sent: Friday, May 19, 2017 5:55 AM To: user@tika.apache.org Subject: Re: Extracting Text from embedded images in PDF docs Got it working. In case someone else hits the same issue, here is my config file... Well

Re: Extracting Text from embedded images in PDF docs

2017-05-19 Thread Chris Mattmann
: Friday, May 19, 2017 at 2:55 AM To: "user@tika.apache.org" <user@tika.apache.org> Subject: Re: Extracting Text from embedded images in PDF docs Got it working. In case someone else hits the same issue, here is my confi

Re: Extracting Text from embedded images in PDF docs

2017-05-19 Thread David Pilato
Got it working. In case someone else hits the same issue, here is my config file... Well... That was obvious :D ocr_and_text David > On 19 May 2017 at 10:59, David Pilato wrote: > > So I saw

Re: Extracting Text from embedded images in PDF docs

2017-05-19 Thread David Pilato
So I saw in debug mode that indeed config.getExtractInlineImages() is false so I'm going to check my config. :D David > On 18 May 2017 at 22:18, David Pilato wrote: > > Hey guys > > > First post here ;) > > I'm trying to play with OCR with Tika. I installed Tesseract