[jira] [Commented] (TIKA-2400) Standardizing current Object Recognition REST parsers

2017-09-21 Thread ASF GitHub Bot (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16175914#comment-16175914 ] ASF GitHub Bot commented on TIKA-2400: -- smadha commented on a change in pull request #208: Fix for

[jira] [Commented] (TIKA-2400) Standardizing current Object Recognition REST parsers

2017-09-21 Thread ASF GitHub Bot (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16175913#comment-16175913 ] ASF GitHub Bot commented on TIKA-2400: -- smadha commented on a change in pull request #208: Fix for

[jira] [Commented] (TIKA-2400) Standardizing current Object Recognition REST parsers

2017-09-21 Thread ASF GitHub Bot (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16175911#comment-16175911 ] ASF GitHub Bot commented on TIKA-2400: -- smadha commented on a change in pull request #208: Fix for

[jira] [Commented] (TIKA-2466) Remove JAXB usage

2017-09-21 Thread Hudson (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16175537#comment-16175537 ] Hudson commented on TIKA-2466: -- SUCCESS: Integrated in Jenkins build Tika-trunk #1370 (See

RE: TikaIO concerns

2017-09-21 Thread Allison, Timothy B.
Like Sergey, it’ll take me some time to understand your recommendations. Thank you! On one small point: >return a PCollection>, where ParseResult is a >class with properties { String content, Metadata metadata } For this option, I’d strongly encourage using the

Re: TikaIO concerns

2017-09-21 Thread Chris Mattmann
Hi all, One other thing is that Tika extracts metadata and language information, for which order doesn’t matter (keys can be out of order). Would this be useful? Cheers, Chris On 9/21/17, 2:10 PM, "Sergey Beryozkin" wrote: Hi Eugene Thank you, very

Re: TikaIO concerns

2017-09-21 Thread Sergey Beryozkin
Hi Eugene, thank you, very helpful, let me read it a few times before I get what exactly I need to clarify :-), two questions so far: On 21/09/17 21:40, Eugene Kirpichov wrote: Thanks all for the discussion. It seems we have consensus that both within-document order and association with the

[jira] [Resolved] (TIKA-2466) Remove JAXB usage

2017-09-21 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-2466. --- Resolution: Fixed Fix Version/s: 1.17 Many thanks [~rombert] for your patches! We'll probably

Re: TikaIO concerns

2017-09-21 Thread Sergey Beryozkin
Hi all, Please also welcome Chris to this thread, Chris, thanks for joining in :-), FYI, the main concern that was raised is that it was not obvious when to use TikaIO in the current form, given that Beam+TikaIO will have a totally unordered sequence of data (originally extracted by Tika in

Re: Integrating Tika with Apache Beam

2017-09-21 Thread Sergey Beryozkin
Hi Chris, thanks, On 21/09/17 18:54, Chris Mattmann wrote: Thanks Sergey, feel free to CC me directly at mattm...@apache.org on the Beam thread. My own 2c is that Tika’s “metadata” extraction can be any order, and with our tika-dl module and the new feature extraction from multimedia files

Re: Integrating Tika with Apache Beam

2017-09-21 Thread Chris Mattmann
Thanks Sergey, feel free to CC me directly at mattm...@apache.org on the Beam thread. My own 2c is that Tika’s “metadata” extraction can be any order, and with our tika-dl module and the new feature extraction from multimedia files using Tensorflow and DL4j these are perfect examples where the

Re: TikaIO concerns

2017-09-21 Thread Sergey Beryozkin
Hi Tim On 21/09/17 14:33, Allison, Timothy B. wrote: Thank you, Sergey. My knowledge of Apache Beam is limited -- I saw Davor and Jean-Baptiste's talk at ApacheCon in Miami, and I was and am totally impressed, but I haven't had a chance to work with it yet. From my perspective, if I

RE: TikaIO concerns

2017-09-21 Thread Allison, Timothy B.
Thank you, Sergey. My knowledge of Apache Beam is limited -- I saw Davor and Jean-Baptiste's talk at ApacheCon in Miami, and I was and am totally impressed, but I haven't had a chance to work with it yet. From my perspective, if I understand this thread (and I may not!), getting unordered

Re: Integrating Tika with Apache Beam

2017-09-21 Thread Sergey Beryozkin
Hi Tim, thanks, will link you to the thread shortly. In general, I'd say TikaIO has probably generated more interest than some of the other Beam IOs, which is a good sign :-) The questions at the moment: 1) what interesting things can be done with the unordered Tika-produced data 2) would it

RE: Integrating Tika with Apache Beam

2017-09-21 Thread Allison, Timothy B.
Hi Sergey, I just subscribed to Beam's dev list. Can you forward me your latest email so that I can respond to the thread? Or can you ping me via their list? Thank you! -----Original Message----- From: Sergey Beryozkin [mailto:sberyoz...@gmail.com] Sent: Thursday, September 21, 2017 5:53

Re: Integrating Tika with Apache Beam

2017-09-21 Thread Sergey Beryozkin
Hi Guys, TikaIO is getting some serious attention now on the Beam dev list, and unfortunately it is not all about it being a great addition to Beam. The team is wondering what one can do with TikaIO vs someone just writing a custom Beam function. TikaIO, like any other Bounded text reader, will