This is fantastic news! Let me know if I can help... I know _nothing_ about Beam, tho. :)

-----Original Message-----
From: Sergey Beryozkin [mailto:[email protected]]
Sent: Friday, May 19, 2017 12:40 PM
To: [email protected]
Subject: Re: Extracting Text from embedded images in PDF docs

Hi Tim

On 19/05/17 17:31, Allison, Timothy B. wrote:
> The autoscaling feature of Beam and the job stealing (not their term) look to
> be fantastic for Tika jobs.
>
>> Though, it actually does work, for me at least :-)
> Have you tried the MockParser? That's where the fun really begins. Simulate
> an oom or permanent hang!

Thanks for the hint. The initial issue that will need to be handled is how to
adapt the SAX stream of events to the Beam Pipeline API, so for the moment I'm
using an internal ExecutorService and Queue to adapt.

I've created
https://issues.apache.org/jira/browse/BEAM-2328

It will take me a few more weeks to create a PR.

Thanks, Sergey

>
> -----Original Message-----
> From: Sergey Beryozkin [mailto:[email protected]]
> Sent: Friday, May 19, 2017 12:27 PM
> To: [email protected]
> Subject: Re: Extracting Text from embedded images in PDF docs
>
> Hi Chris
>
> I'm getting nervous now, what will happen to me if it does not work
> out in the end :-). Though, it actually does work, for me at least :-)
>
> Cheers, Sergey
> On 19/05/17 17:23, Mattmann, Chris A (3010) wrote:
>> Thanks Sergey, what an awesome surprise, you are the best!
>>
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Chris Mattmann, Ph.D.
>> Principal Data Scientist, Engineering Administrative Office (3010)
>> Manager, NSF & Open Source Projects Formulation and Development Offices (8212)
>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>> Office: 180-503E, Mailstop: 180-503
>> Email: [email protected]
>> WWW: http://sunset.usc.edu/~mattmann/
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Director, Information Retrieval and Data Science Group (IRDS)
>> Adjunct Associate Professor, Computer Science Department
>> University of Southern California, Los Angeles, CA 90089 USA
>> WWW: http://irds.usc.edu/
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>
>>
>> On 5/19/17, 9:11 AM, "Sergey Beryozkin" <[email protected]> wrote:
>>
>> Hi Tim
>> On 19/05/17 16:47, Allison, Timothy B. wrote:
>> >
>> >> Yes I was asking about it as I thought it was confusing it did not work
>> >> - I saw you following up on this possible issue in the other email...
>> > Y, I agree. That _should_ work.
>> >
>> >> I'm doing some work with Tika now so it was of an immediate interest to me...
>> > Yay! What are you working on?
>> >
>> Was supposed to be a secret for a few weeks but I'll let you know, but do
>> not tell anyone please :-). Well, I'm trying to integrate Tika with
>> Apache Beam, hoping to get something ready in a couple of weeks. If it
>> doesn't make it into the Beam source then I'll create a standalone demo;
>> will share the link either way...
>> >> Sure. By the way I was not complaining...
>> > I didn't take it that way at all! I apologize if anything I wrote came across that way.
>> >
>> Np, my apologies instead :-), I thought maybe I asked it in a way which
>> sounded like a 'why does it just not work' question, which would indeed
>> be strange to hear from a Tika committer (nearly veteran I should say
>> :-)).
>>
>> Thanks, Sergey
>> > Cheers,
>> >
>> > Tim
>> >
>>
