[
https://issues.apache.org/jira/browse/TIKA-2470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16177225#comment-16177225
]
Hudson commented on TIKA-2470:
--
SUCCESS: Integrated in Jenkins build Tika-trunk #1371 (See
Hi Tim, All
On 22/09/17 18:17, Allison, Timothy B. wrote:
Y, I think you have it right.
> Tika library has a big problem with crashes and freezes
I wouldn't want to overstate it. Crashes and freezes are exceedingly rare, but
when you are processing millions/billions of files in the wild [1],
[
https://issues.apache.org/jira/browse/TIKA-2470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tim Allison resolved TIKA-2470.
---
Resolution: Fixed
Fix Version/s: 1.17
> Another Illegal reflective Access -- more cleanup for
Tim Allison created TIKA-2470:
-
Summary: Another Illegal reflective Access -- more cleanup for
Java 9
Key: TIKA-2470
URL: https://issues.apache.org/jira/browse/TIKA-2470
Project: Tika
Issue
[
https://issues.apache.org/jira/browse/TIKA-2470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tim Allison updated TIKA-2470:
--
Description: WARNING: Illegal reflective access by
org.apache.tika.utils.XMLReaderUtils
[
https://issues.apache.org/jira/browse/TIKA-2470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16177131#comment-16177131
]
Konstantin Gribov commented on TIKA-2470:
-
[~talli...@apache.org], speaking of 1.x -- perfectly ok,
Hi,
On 22/09/17 22:02, Eugene Kirpichov wrote:
Sure - with hundreds of different file formats and the abundance of weird /
malformed / malicious files in the wild, it's quite expected that sometimes
the library will crash.
Some kinds of issues are easier to address than others. We can catch
[
https://issues.apache.org/jira/browse/TIKA-2470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16177119#comment-16177119
]
Tim Allison edited comment on TIKA-2470 at 9/22/17 9:03 PM:
[~grossws] I used
[
https://issues.apache.org/jira/browse/TIKA-2470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16177119#comment-16177119
]
Tim Allison commented on TIKA-2470:
---
[~grossws] I used the JUL Logger in tika-core is this the right
[
https://issues.apache.org/jira/browse/TIKA-2466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16175978#comment-16175978
]
Robert Munteanu commented on TIKA-2466:
---
Thanks for applying! Looking forward to the next Tika
[
https://issues.apache.org/jira/browse/TIKA-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16176010#comment-16176010
]
ASF GitHub Bot commented on TIKA-2400:
--
ThejanW commented on a change in pull request #208: Fix for
@Eugene: What's the best way to have Beam help us with these issues, or do
these come for free with the Beam framework?
1) a process-level timeout (because you can't actually kill a thread in Java)
2) a process-level restart on OOM
3) avoid trying to reprocess a badly behaving document
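
A minimal sketch of how the three points above could fit together, assuming each document is parsed in a child process started from a command line. The class name ForkedParseRunner, its methods, and the in-memory "poisoned" set are hypothetical names for illustration only; this is not Tika's or Beam's API, and it needs Java 9 or later for Redirect.DISCARD.

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: runs one parse per child process so a hang or an
// OutOfMemoryError takes down only the child, never the calling JVM.
public class ForkedParseRunner {
    // Documents that already hung or crashed a child; they are never retried.
    private final Set<String> poisoned = new HashSet<>();
    private final long timeoutMillis;

    public ForkedParseRunner(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** Returns true only if the child exited with status 0 within the timeout. */
    public boolean run(String docId, List<String> command) {
        if (poisoned.contains(docId)) {
            return false; // (3) skip a document that misbehaved before
        }
        Process child = null;
        try {
            child = new ProcessBuilder(command)
                    .redirectErrorStream(true)
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                    .start();
            // (1) process-level timeout: waitFor returns false if time runs out
            boolean finished = child.waitFor(timeoutMillis, TimeUnit.MILLISECONDS);
            // (2) a crashed child (for example after an OOM) exits non-zero;
            // the next call starts a fresh process, which acts as the restart
            if (finished && child.exitValue() == 0) {
                return true;
            }
        } catch (IOException e) {
            // The child could not be started; treated as a failure below.
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false; // the caller was interrupted; do not blame the document
        } finally {
            if (child != null && child.isAlive()) {
                child.destroyForcibly(); // a process can be killed; a thread cannot
            }
        }
        poisoned.add(docId);
        return false;
    }

    public boolean isPoisoned(String docId) {
        return poisoned.contains(docId);
    }
}
```

In a real pipeline the poisoned set would have to live somewhere durable, since the point of (3) is to survive a restart of the parent as well.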
Hi,
On 22/09/17 00:42, Eugene Kirpichov wrote:
Hi,
@Sergey:
- I already marked TikaIO @Experimental, so we can make changes.
OK, thanks
- Yes, the String in KV is the filename. I guess we
could alternatively put it into ParseResult - don't have a strong opinion.
Sure. If
@Timothy: can you tell more about this RecursiveParserWrapper? Is this
something that the user can configure by specifying the Parser on TikaIO if
they so wish?
Not at the moment, we’d have to do some coding on our end or within Beam. The
format is a list of maps/dicts for each file. Each
Hi Tim
Sorry for getting into the RecursiveParserWrapper discussion first, I
was certain the time zone difference was on my side :-)
How will it work now, with new Metadata() passed to the AutoDetect
parser, will this Metadata have a Metadata value per every attachment,
possibly keyed by a
>> How will it work now, with new Metadata() passed to the AutoDetect parser,
>> will this Metadata have a Metadata value per every attachment, possibly
>> keyed by a name ?
An example of how to call the RecursiveParserWrapper:
[
https://issues.apache.org/jira/browse/TIKA-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16175997#comment-16175997
]
ASF GitHub Bot commented on TIKA-2400:
--
ThejanW commented on a change in pull request #208: Fix for
[
https://issues.apache.org/jira/browse/TIKA-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16175996#comment-16175996
]
ASF GitHub Bot commented on TIKA-2400:
--
ThejanW commented on a change in pull request #208: Fix for
Regarding specifically elements that are failing -- I believe some other IO
has used the concept of a "Dead Letter" side-output, where documents that
failed to process are side-output so the user can handle them appropriately.
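
The "Dead Letter" idea can be sketched in plain Java, as an illustration of the pattern rather than of Beam's API. The class DeadLetterSketch and its members are hypothetical; in a Beam pipeline the two maps would presumably correspond to a main output and an additional tagged output.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical illustration: route failures to a separate collection
// instead of letting one bad input abort the whole batch.
public class DeadLetterSketch {

    public static final class Result<I, O> {
        // Inputs that parsed cleanly, with their outputs.
        public final Map<I, O> parsed = new LinkedHashMap<>();
        // Inputs that failed, with the exception that explains why.
        public final Map<I, Exception> deadLetters = new LinkedHashMap<>();
    }

    public static <I, O> Result<I, O> process(List<I> inputs, Function<I, O> parser) {
        Result<I, O> result = new Result<>();
        for (I input : inputs) {
            try {
                result.parsed.put(input, parser.apply(input));
            } catch (RuntimeException e) {
                // The "dead letter" side output: keep the input and the cause
                // so the user can inspect, retry, or discard it later.
                result.deadLetters.put(input, e);
            }
        }
        return result;
    }
}
```

This only covers failures that surface as exceptions; a parse that hangs or exhausts memory still needs process-level isolation.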
On Fri, Sep 22, 2017 at 9:47 AM Eugene Kirpichov
Reuven,
Thank you! This suggests to me that it is a good idea to integrate Tika with
Beam so that people don't have to 1) (re)discover the need to make their
wrappers robust and then 2) have to reinvent these wheels for robustness.
For kicks, see William Palmer's post on his toe-stubbing
Great. Thank you!
-----Original Message-----
From: Chris Mattmann [mailto:mattm...@apache.org]
Sent: Friday, September 22, 2017 1:46 PM
To: dev@tika.apache.org
Subject: Re: TikaIO concerns
[dropping Beam on this]
Tim, another thing is that you can finally download the TREC-DD Polar data
Y, I think you have it right.
> Tika library has a big problem with crashes and freezes
I wouldn't want to overstate it. Crashes and freezes are exceedingly rare, but
when you are processing millions/billions of files in the wild [1], they will
happen. We fix the problems or try to get our
Do tell...
Interesting. Any pointers?
-----Original Message-----
From: Ben Chambers [mailto:bchamb...@google.com.INVALID]
Sent: Friday, September 22, 2017 12:50 PM
To: d...@beam.apache.org
Cc: dev@tika.apache.org
Subject: Re: TikaIO concerns
Regarding specifically elements that are failing --
BigQueryIO allows a side-output for elements that failed to be inserted
when using the Streaming BigQuery sink:
https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/StreamingWriteTables.java#L92
This follows the pattern
Nice! Thank you!
-----Original Message-----
From: Ben Chambers [mailto:bchamb...@apache.org]
Sent: Friday, September 22, 2017 1:24 PM
To: d...@beam.apache.org
Cc: dev@tika.apache.org
Subject: Re: TikaIO concerns
BigQueryIO allows a side-output for elements that failed to be inserted when
>>1) We've gathered a TB of data from CommonCrawl and we run regression tests
>>against this TB (thank you, Rackspace for hosting our vm!) to try to identify
>>these problems.
And if anyone with connections at a big company doing open source + cloud would
be interested in floating us some
[dropping Beam on this]
Tim, another thing is that you can finally download the TREC-DD Polar data
either
from the NSF Arctic Data Center (70GB zip), or from Amazon S3, as described
here:
http://github.com/chrismattmann/trec-dd-polar/
In case we want to use as part of our regression.