[jira] [Commented] (TIKA-2470) Another Illegal reflective Access -- more cleanup for Java 9

2017-09-22 Thread Hudson (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16177225#comment-16177225 ] Hudson commented on TIKA-2470: -- SUCCESS: Integrated in Jenkins build Tika-trunk #1371 (See

Re: TikaIO concerns

2017-09-22 Thread Sergey Beryozkin
Hi Tim, All On 22/09/17 18:17, Allison, Timothy B. wrote: Y, I think you have it right. Tika library has a big problem with crashes and freezes I wouldn't want to overstate it. Crashes and freezes are exceedingly rare, but when you are processing millions/billions of files in the wild [1],

[jira] [Resolved] (TIKA-2470) Another Illegal reflective Access -- more cleanup for Java 9

2017-09-22 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-2470. --- Resolution: Fixed Fix Version/s: 1.17 > Another Illegal reflective Access -- more cleanup for

[jira] [Created] (TIKA-2470) Another Illegal reflective Access -- more cleanup for Java 9

2017-09-22 Thread Tim Allison (JIRA)
Tim Allison created TIKA-2470: - Summary: Another Illegal reflective Access -- more cleanup for Java 9 Key: TIKA-2470 URL: https://issues.apache.org/jira/browse/TIKA-2470 Project: Tika Issue

[jira] [Updated] (TIKA-2470) Another Illegal reflective Access -- more cleanup for Java 9

2017-09-22 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-2470: -- Description: WARNING: Illegal reflective access by org.apache.tika.utils.XMLReaderUtils

[jira] [Commented] (TIKA-2470) Another Illegal reflective Access -- more cleanup for Java 9

2017-09-22 Thread Konstantin Gribov (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16177131#comment-16177131 ] Konstantin Gribov commented on TIKA-2470: - [~talli...@apache.org], speaking of 1.x -- perfectly ok,

Re: TikaIO concerns

2017-09-22 Thread Sergey Beryozkin
Hi, On 22/09/17 22:02, Eugene Kirpichov wrote: Sure - with hundreds of different file formats and the abundance of weird / malformed / malicious files in the wild, it's quite expected that sometimes the library will crash. Some kinds of issues are easier to address than others. We can catch

[jira] [Comment Edited] (TIKA-2470) Another Illegal reflective Access -- more cleanup for Java 9

2017-09-22 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16177119#comment-16177119 ] Tim Allison edited comment on TIKA-2470 at 9/22/17 9:03 PM: [~grossws] I used

[jira] [Commented] (TIKA-2470) Another Illegal reflective Access -- more cleanup for Java 9

2017-09-22 Thread Tim Allison (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16177119#comment-16177119 ] Tim Allison commented on TIKA-2470: --- [~grossws] I used the JUL Logger in tika-core is this the right

[jira] [Commented] (TIKA-2466) Remove JAXB usage

2017-09-22 Thread Robert Munteanu (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16175978#comment-16175978 ] Robert Munteanu commented on TIKA-2466: --- Thanks for applying! Looking forward to the next Tika

[jira] [Commented] (TIKA-2400) Standardizing current Object Recognition REST parsers

2017-09-22 Thread ASF GitHub Bot (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16176010#comment-16176010 ] ASF GitHub Bot commented on TIKA-2400: -- ThejanW commented on a change in pull request #208: Fix for

RE: TikaIO concerns

2017-09-22 Thread Allison, Timothy B.
@Eugene: What's the best way to have Beam help us with these issues, or do these come for free with the Beam framework?
1) a process-level timeout (because you can't actually kill a thread in Java)
2) a process-level restart on OOM
3) avoid trying to reprocess a badly behaving document
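For readers skimming the thread, the three requirements above can be illustrated with a minimal sketch. It is not Tika or Beam code: the class name `ParseSupervisor`, its `Outcome` values, and the idea of handing it a ready-made command line are all hypothetical, chosen only for illustration. It assumes each document is parsed in its own child process (Java 9+ for `ProcessBuilder.Redirect.DISCARD`), so that a hang can be ended by killing the process rather than a thread, and it remembers documents that misbehaved so they are not retried.

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.TimeUnit;

/** Hypothetical supervisor: one child process per document, with a skip list for bad documents. */
final class ParseSupervisor {
    enum Outcome { COMPLETED, FAILED, TIMED_OUT, SKIPPED }

    // Documents that already timed out or crashed a child process (requirement 3).
    private final Set<String> quarantined = new HashSet<>();

    boolean isQuarantined(String documentId) {
        return quarantined.contains(documentId);
    }

    void quarantine(String documentId) {
        quarantined.add(documentId);
    }

    /** Runs the given (assumed) parser command in a child process and enforces a timeout. */
    Outcome run(String documentId, List<String> command, long timeoutMillis) {
        if (isQuarantined(documentId)) {
            return Outcome.SKIPPED;
        }
        try {
            Process child = new ProcessBuilder(command)
                    .redirectErrorStream(true)
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD) // avoid blocking on a full pipe
                    .start();
            // Requirement 1: process-level timeout; the whole process is killed, not a thread.
            if (!child.waitFor(timeoutMillis, TimeUnit.MILLISECONDS)) {
                child.destroyForcibly();
                child.waitFor();
                quarantine(documentId);
                return Outcome.TIMED_OUT;
            }
            // Requirement 2: a child that died (for example after an OutOfMemoryError)
            // typically shows up as a non-zero exit code; the next document simply
            // gets a fresh process.
            if (child.exitValue() != 0) {
                quarantine(documentId);
                return Outcome.FAILED;
            }
            return Outcome.COMPLETED;
        } catch (IOException e) {
            quarantine(documentId);
            return Outcome.FAILED;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return Outcome.FAILED;
        }
    }
}
```

A caller would build the command itself (for instance, a `java` invocation of whatever parser entry point it uses) and call `run` once per document. Whether Beam provides any of this out of the box is exactly the open question in the message above.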

Re: TikaIO concerns

2017-09-22 Thread Sergey Beryozkin
Hi, On 22/09/17 00:42, Eugene Kirpichov wrote: Hi, @Sergey: - I already marked TikaIO @Experimental, so we can make changes. OK, thanks - Yes, the String in KV is the filename. I guess we could alternatively put it into ParseResult - don't have a strong opinion. Sure. If

RE: TikaIO concerns

2017-09-22 Thread Allison, Timothy B.
@Timothy: can you tell more about this RecursiveParserWrapper? Is this something that the user can configure by specifying the Parser on TikaIO if they so wish? Not at the moment, we’d have to do some coding on our end or within Beam. The format is a list of maps/dicts for each file. Each

Re: TikaIO concerns

2017-09-22 Thread Sergey Beryozkin
Hi Tim Sorry for getting into the RecursiveParserWrapper discussion first, I was certain the time zone difference was on my side :-) How will it work now, with new Metadata() passed to the AutoDetect parser, will this Metadata have a Metadata value per every attachment, possibly keyed by a

RE: TikaIO concerns

2017-09-22 Thread Allison, Timothy B.
>> How will it work now, with new Metadata() passed to the AutoDetect parser, will this Metadata have a Metadata value per every attachment, possibly keyed by a name?
An example of how to call the RecursiveParserWrapper:

[jira] [Commented] (TIKA-2400) Standardizing current Object Recognition REST parsers

2017-09-22 Thread ASF GitHub Bot (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16175997#comment-16175997 ] ASF GitHub Bot commented on TIKA-2400: -- ThejanW commented on a change in pull request #208: Fix for

[jira] [Commented] (TIKA-2400) Standardizing current Object Recognition REST parsers

2017-09-22 Thread ASF GitHub Bot (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16175996#comment-16175996 ] ASF GitHub Bot commented on TIKA-2400: -- ThejanW commented on a change in pull request #208: Fix for

Re: TikaIO concerns

2017-09-22 Thread Ben Chambers
Regarding specifically elements that are failing -- I believe some other IO has used the concept of a "Dead Letter" side-output, where documents that failed to process are side-output so the user can handle them appropriately. On Fri, Sep 22, 2017 at 9:47 AM Eugene Kirpichov
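As a rough illustration of the "Dead Letter" idea, here is a minimal sketch in plain Java. It deliberately avoids the Beam API: `DeadLetterSketch`, `Result`, and `processAll` are hypothetical names, and the two lists only stand in for a main output and a side output. In an actual Beam pipeline the same split would presumably be expressed with tagged outputs on a multi-output `ParDo`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Hypothetical dead-letter pattern: failures are collected separately instead of aborting the run. */
final class DeadLetterSketch {

    /** Holds the main output and the dead-letter output side by side. */
    static final class Result<I, O> {
        final List<O> parsed = new ArrayList<>();
        final List<I> deadLetters = new ArrayList<>();
    }

    /** Applies the parser to every input; inputs whose parse throws go to the dead-letter list. */
    static <I, O> Result<I, O> processAll(List<I> inputs, Function<I, O> parser) {
        Result<I, O> result = new Result<>();
        for (I input : inputs) {
            try {
                result.parsed.add(parser.apply(input));
            } catch (RuntimeException e) {
                // Keep the failing input so the user can inspect or reprocess it later.
                result.deadLetters.add(input);
            }
        }
        return result;
    }
}
```

This only covers failures that surface as exceptions. Hangs and out-of-memory conditions, which are also discussed in this thread, would need process-level handling instead.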

RE: TikaIO concerns

2017-09-22 Thread Allison, Timothy B.
Reuven, Thank you! This suggests to me that it is a good idea to integrate Tika with Beam so that people don't have to 1) (re)discover the need to make their wrappers robust and then 2) have to reinvent these wheels for robustness. For kicks, see William Palmer's post on his toe-stubbing

RE: TikaIO concerns

2017-09-22 Thread Allison, Timothy B.
Great. Thank you! -Original Message- From: Chris Mattmann [mailto:mattm...@apache.org] Sent: Friday, September 22, 2017 1:46 PM To: dev@tika.apache.org Subject: Re: TikaIO concerns [dropping Beam on this] Tim, another thing is that you can finally download the TREC-DD Polar data

RE: TikaIO concerns

2017-09-22 Thread Allison, Timothy B.
Y, I think you have it right.
> Tika library has a big problem with crashes and freezes
I wouldn't want to overstate it. Crashes and freezes are exceedingly rare, but when you are processing millions/billions of files in the wild [1], they will happen. We fix the problems or try to get our

RE: TikaIO concerns

2017-09-22 Thread Allison, Timothy B.
Do tell... Interesting. Any pointers? -Original Message- From: Ben Chambers [mailto:bchamb...@google.com.INVALID] Sent: Friday, September 22, 2017 12:50 PM To: d...@beam.apache.org Cc: dev@tika.apache.org Subject: Re: TikaIO concerns Regarding specifically elements that are failing --

Re: TikaIO concerns

2017-09-22 Thread Ben Chambers
BigQueryIO allows a side-output for elements that failed to be inserted when using the Streaming BigQuery sink: https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/StreamingWriteTables.java#L92 This follows the pattern

RE: TikaIO concerns

2017-09-22 Thread Allison, Timothy B.
Nice! Thank you! -Original Message- From: Ben Chambers [mailto:bchamb...@apache.org] Sent: Friday, September 22, 2017 1:24 PM To: d...@beam.apache.org Cc: dev@tika.apache.org Subject: Re: TikaIO concerns BigQueryIO allows a side-output for elements that failed to be inserted when

RE: TikaIO concerns

2017-09-22 Thread Allison, Timothy B.
>> 1) We've gathered a TB of data from CommonCrawl and we run regression tests against this TB (thank you, Rackspace, for hosting our VM!) to try to identify these problems.
And if anyone with connections at a big company doing open source + cloud would be interested in floating us some

Re: TikaIO concerns

2017-09-22 Thread Chris Mattmann
[dropping Beam on this] Tim, another thing is that you can finally download the TREC-DD Polar data either from the NSF Arctic Data Center (70GB zip), or from Amazon S3, as described here: http://github.com/chrismattmann/trec-dd-polar/ In case we want to use as part of our regression.