Re: TikaIO concerns
Hi,

On 22/09/17 22:02, Eugene Kirpichov wrote:
> Sure - with hundreds of different file formats and the abundance of weird /
> malformed / malicious files in the wild, it's quite expected that sometimes
> the library will crash. Some kinds of issues are easier to address than
> others. We can catch exceptions and return a ParseResult representing a
> failure to parse this document. Addressing freezes and native JVM process
> crashes is much harder and probably not necessary in the first version.
>
> Sergey - I think, the moment you introduce ParseResult into the code, other
> changes I suggested will follow "by construction":
> - There'll be 1 ParseResult per document, containing filename, content and
> metadata, since per discussion above it probably doesn't make sense to
> deliver these in separate PCollection elements.

I was still harboring the hope that maybe using a container bean like
ParseResult (with the other changes you proposed) can somehow let us stream
from Tika into the pipeline. If it is 1 ParseResult per document, then the
pipeline will not see a document until Tika has parsed all of it.

I'm sorry if I may be starting to go in circles, but let me ask this: how can
a Beam user write a Beam function which will ensure the Tika content pieces
are seen in order by the pipeline, without TikaIO? Maybe knowing that will
help in coming up with an idea of how to generalize with the help of TikaIO.

> - Since you're returning a single value per document, there's no reason to
> use a BoundedReader.
> - Likewise, there's no reason to use asynchronicity, because you're not
> delivering the result incrementally.
>
> I'd suggest starting the refactoring by removing the asynchronous codepath,
> then converting from BoundedReader to ParDo or MapElements, then converting
> from String to ParseResult.
This is a good plan, thanks. I guess at least for small documents it should
work well (unless I've misunderstood the ParseResult idea).

Thanks, Sergey
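Eugene's suggested refactoring centers on a ParseResult container. The sketch below is a minimal, library-free illustration of what such a class could look like; the field and factory names are assumptions, not TikaIO's actual API — the thread only fixes that it bundles filename, content, and metadata, and (per Eugene's earlier note) can represent a failed parse as data.

```java
import java.util.Collections;
import java.util.Map;

// Minimal sketch of the ParseResult container discussed above: one value
// per document, bundling filename, extracted content, and metadata, with a
// failure variant so unparseable documents become data instead of crashes.
// Field and factory names are assumptions, not TikaIO's actual API.
class ParseResult {
    final String filename;
    final String content;               // null when parsing failed
    final Map<String, String> metadata;
    final String error;                 // null when parsing succeeded

    private ParseResult(String filename, String content,
                        Map<String, String> metadata, String error) {
        this.filename = filename;
        this.content = content;
        this.metadata = metadata;
        this.error = error;
    }

    static ParseResult success(String filename, String content,
                               Map<String, String> metadata) {
        return new ParseResult(filename, content, metadata, null);
    }

    // A crash/exception in the parser is reported as a result, so the
    // pipeline can side-output it rather than fail the bundle.
    static ParseResult failure(String filename, String error) {
        return new ParseResult(filename, null, Collections.emptyMap(), error);
    }

    boolean isSuccess() { return error == null; }
}
```

In a TikaIO-style ParDo this would be emitted as KV<String, ParseResult> keyed by filename, matching Eugene's note that the String in the KV is the filename.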
Re: TikaIO concerns
Hi Tim, All

On 22/09/17 18:17, Allison, Timothy B. wrote:
> Y, I think you have it right.
>
> > Tika library has a big problem with crashes and freezes
>
> I wouldn't want to overstate it. Crashes and freezes are exceedingly rare,
> but when you are processing millions/billions of files in the wild [1],
> they will happen. We fix the problems or try to get our dependencies to
> fix the problems when we can,

I only would like to add to this that IMHO it would be more correct to say
it's not the Tika library's 'fault' that the crashes might occur. Tika does
its best to pick up the latest libraries that help it parse the files, but
indeed there will always be some file out there that uses some incomplete
format-specific tag etc. which may cause the specific parser to spin - but
Tika will include the updated parser library asap. And with Beam's help, the
crashes that can kill Tika jobs completely will probably become history...

Cheers, Sergey

> but given our past history, I have no reason to believe that these
> problems won't happen again.
>
> Thank you, again!
>
> Best, Tim
>
> [1] Stuff on the internet or ... some of our users are forensics examiners
> dealing with broken/corrupted files
>
> P.S./FTR
> 1) We've gathered a TB of data from CommonCrawl and we run regression
> tests against this TB (thank you, Rackspace, for hosting our vm!) to try
> to identify these problems.
> 2) We've started a fuzzing effort to try to identify problems.
> 3) We added "tika-batch" for robust single box fileshare/fileshare
> processing for our low volume users.
> 4) We're trying to get the message out. Thank you for working with us!!!
RE: TikaIO concerns
Great. Thank you!

-----Original Message-----
From: Chris Mattmann [mailto:mattm...@apache.org]
Sent: Friday, September 22, 2017 1:46 PM
To: dev@tika.apache.org
Subject: Re: TikaIO concerns
Re: TikaIO concerns
[dropping Beam on this]

Tim, another thing is that you can finally download the TREC-DD Polar data
either from the NSF Arctic Data Center (70GB zip), or from Amazon S3, as
described here: http://github.com/chrismattmann/trec-dd-polar/ In case we
want to use it as part of our regression.

Cheers, Chris

On 9/22/17, 10:43 AM, "Allison, Timothy B." <talli...@mitre.org> wrote:

>> 1) We've gathered a TB of data from CommonCrawl and we run regression
>> tests against this TB (thank you, Rackspace, for hosting our vm!) to try
>> to identify these problems.
RE: TikaIO concerns
>>1) We've gathered a TB of data from CommonCrawl and we run regression tests >>against this TB (thank you, Rackspace for hosting our vm!) to try to identify >>these problems. And if anyone with connections at a big company doing open source + cloud would be interested in floating us some storage and cycles, we'd be happy to move off our single vm to increase coverage and improve the speed for our large-scale regression tests. :D But seriously, thank you for this discussion and collaboration! Cheers, Tim
RE: TikaIO concerns
Nice! Thank you!

-----Original Message-----
From: Ben Chambers [mailto:bchamb...@apache.org]
Sent: Friday, September 22, 2017 1:24 PM
To: d...@beam.apache.org
Cc: dev@tika.apache.org
Subject: Re: TikaIO concerns
Re: TikaIO concerns
BigQueryIO allows a side-output for elements that failed to be inserted when
using the Streaming BigQuery sink:
https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/StreamingWriteTables.java#L92

This follows the pattern of a DoFn with multiple outputs, as described here:
https://cloud.google.com/blog/big-data/2016/01/handling-invalid-inputs-in-dataflow

So, the DoFn that runs the Tika code could be configured in terms of how
different failures should be handled, with the option of just outputting
them to a different PCollection that is then processed in some other way.

On Fri, Sep 22, 2017 at 10:18 AM Allison, Timothy B. <talli...@mitre.org>
wrote:

> Do tell...
>
> Interesting. Any pointers?
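Ben's multi-output suggestion can be modeled without Beam at all. The sketch below is a plain-Java stand-in (illustrative names only — in Beam this would be a DoFn with multiple TupleTags) showing unparseable documents being routed to a "dead letter" output instead of failing the whole job:

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of the "dead letter" multi-output pattern described
// above: documents that parse successfully go to a main output, documents
// that throw go to a side output for later inspection. Names are
// illustrative; in Beam this would be a DoFn with two TupleTags.
class DeadLetterDemo {

    static class Routed {
        final List<String> parsed = new ArrayList<>();
        final List<String> deadLetter = new ArrayList<>();
    }

    static Routed route(List<String> documents) {
        Routed out = new Routed();
        for (String doc : documents) {
            try {
                out.parsed.add(parse(doc));
            } catch (RuntimeException e) {
                // Record the failure instead of failing the whole bundle.
                out.deadLetter.add(doc + ": " + e.getMessage());
            }
        }
        return out;
    }

    // Stand-in for a Tika parse: rejects "documents" marked as corrupt.
    static String parse(String doc) {
        if (doc.contains("corrupt")) {
            throw new RuntimeException("unparseable");
        }
        return doc.toUpperCase();
    }
}
```

The dead-letter PCollection could then be written somewhere for forensic inspection, which matches the use case of users dealing with broken/corrupted files.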
RE: TikaIO concerns
Do tell...

Interesting. Any pointers?

-----Original Message-----
From: Ben Chambers [mailto:bchamb...@google.com.INVALID]
Sent: Friday, September 22, 2017 12:50 PM
To: d...@beam.apache.org
Cc: dev@tika.apache.org
Subject: Re: TikaIO concerns

Regarding specifically elements that are failing -- I believe some other IO
has used the concept of a "Dead Letter" side-output, where documents that
failed to process are side-output so the user can handle them appropriately.
RE: TikaIO concerns
Y, I think you have it right.

> Tika library has a big problem with crashes and freezes

I wouldn't want to overstate it. Crashes and freezes are exceedingly rare,
but when you are processing millions/billions of files in the wild [1], they
will happen. We fix the problems or try to get our dependencies to fix the
problems when we can, but given our past history, I have no reason to
believe that these problems won't happen again.

Thank you, again!

Best, Tim

[1] Stuff on the internet or ... some of our users are forensics examiners
dealing with broken/corrupted files

P.S./FTR
1) We've gathered a TB of data from CommonCrawl and we run regression tests
against this TB (thank you, Rackspace, for hosting our vm!) to try to
identify these problems.
2) We've started a fuzzing effort to try to identify problems.
3) We added "tika-batch" for robust single box fileshare/fileshare
processing for our low volume users.
4) We're trying to get the message out. Thank you for working with us!!!

-----Original Message-----
From: Eugene Kirpichov [mailto:kirpic...@google.com.INVALID]
Sent: Friday, September 22, 2017 12:48 PM
To: d...@beam.apache.org
Cc: dev@tika.apache.org
Subject: Re: TikaIO concerns

Hi Tim,

From what you're saying it sounds like the Tika library has a big problem
with crashes and freezes, and applying it at scale (e.g. in the context of
Beam) requires explicitly addressing this problem, e.g. accepting the fact
that in many realistic applications some documents will just need to be
skipped because they are unprocessable? This would be the first example of a
Beam IO that has this concern, so I'd like to confirm that my understanding
is correct.

On Fri, Sep 22, 2017 at 9:34 AM Allison, Timothy B. <talli...@mitre.org>
wrote:

> Reuven,
>
> Thank you!
Re: TikaIO concerns
Regarding specifically elements that are failing -- I believe some other IO
has used the concept of a "Dead Letter" side-output, where documents that
failed to process are side-output so the user can handle them appropriately.
RE: TikaIO concerns
Reuven,

Thank you! This suggests to me that it is a good idea to integrate Tika with
Beam so that people don't have to 1) (re)discover the need to make their
wrappers robust and then 2) have to reinvent these wheels for robustness.

For kicks, see William Palmer's post on his toe-stubbing efforts with Hadoop
[1]. He and other Tika users independently have wound up carrying out
exactly your recommendation for 1) below.

We have a MockParser that you can use to simulate regular exceptions, OOMs
and permanent hangs by asking Tika to parse an xml [2].

> However if processing the document causes the process to crash, then it
> will be retried.

Any ideas on how to get around this?

Thank you again.

Cheers,

Tim

[1] http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-web-content-nanite/
[2] https://github.com/apache/tika/blob/master/tika-parsers/src/test/resources/test-documents/mock/example.xml
RE: TikaIO concerns
>> How will it work now, with new Metadata() passed to the AutoDetect
>> parser, will this Metadata have a Metadata value per every attachment,
>> possibly keyed by a name?

An example of how to call the RecursiveParserWrapper:
https://github.com/apache/tika/blob/master/tika-example/src/main/java/org/apache/tika/example/ParsingExample.java#L138

To serialize the List<Metadata>, use:
https://github.com/apache/tika/blob/master/tika-serialization/src/main/java/org/apache/tika/metadata/serialization/JsonMetadataList.java#L47
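The linked ParsingExample and JsonMetadataList are Tika's real entry points. As a library-free sketch of the data shape they produce — per the description elsewhere in this thread, a list of metadata maps, one per container/embedded file, with one key reserved for content — the following can illustrate the structure (key names here are made up; Tika's actual keys differ):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Library-free sketch of the extract format produced by Tika's
// RecursiveParserWrapper as described in this thread: a list of metadata
// maps, one per file (container first, then each embedded file), with one
// reserved key holding the extracted content. Key names are made up.
class ExtractFormatDemo {

    static final String CONTENT_KEY = "X-Content"; // illustrative reserved key

    static List<Map<String, String>> extract(String containerName,
                                             String containerText,
                                             Map<String, String> embedded) {
        List<Map<String, String>> out = new ArrayList<>();
        Map<String, String> container = new LinkedHashMap<>();
        container.put("Name", containerName);
        container.put(CONTENT_KEY, containerText);
        out.add(container);
        // One additional map per embedded file, so attachment metadata is
        // preserved (unlike the legacy xhtml output).
        embedded.forEach((name, text) -> {
            Map<String, String> m = new LinkedHashMap<>();
            m.put("Name", name);
            m.put(CONTENT_KEY, text);
            out.add(m);
        });
        return out;
    }
}
```

Note the trade-off discussed in the thread: producing this list requires a full parse with everything held in memory before serialization, which is why large files remain a problem.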
Re: TikaIO concerns
Hi Tim,

Sorry for getting into the RecursiveParserWrapper discussion first, I was
certain the time zone difference was on my side :-)

How will it work now, with new Metadata() passed to the AutoDetect parser:
will this Metadata have a Metadata value per every attachment, possibly
keyed by a name?

Thanks, Sergey

On 22/09/17 12:58, Allison, Timothy B. wrote:
> > @Timothy: can you tell more about this RecursiveParserWrapper? Is this
> > something that the user can configure by specifying the Parser on
> > TikaIO if they so wish?
>
> Not at the moment, we'd have to do some coding on our end or within Beam.
>
> The format is a list of maps/dicts, one for each file. Each map contains
> all of the metadata, with one key reserved for the content. If a file has
> no attachments, the list has length 1; otherwise there's a map for each
> embedded file. Unlike our legacy xhtml, this format maintains metadata
> for attachments.
>
> The downside to this extract format is that it requires a full parse of
> the document, and all data has to be held in memory before writing it. On
> the other hand, while Tika tries to be streaming - and that was one of
> the critical early design goals - for some file formats we simply have to
> parse the whole thing before we can have any output. So, yes, large files
> are a problem. :\
>
> Example with purely made-up keys representing a pdf file containing an
> RTF attachment:
>
> [
>   { Name: "container file",
>     Author: "Chris Mattmann",
>     Content: "Four score and seven years ago...",
>     Content-type: "application/pdf",
>     ... },
>   { Name: "embedded file1",
>     Author: "Nick Burch",
>     Content: "When in the course of human events...",
>     Content-type: "application/rtf" }
> ]

From: Eugene Kirpichov [mailto:kirpic...@google.com]
Sent: Thursday, September 21, 2017 7:42 PM
To: Allison, Timothy B. <talli...@mitre.org>; d...@beam.apache.org
Cc: dev@tika.apache.org
Subject: Re: TikaIO concerns

Hi,

@Sergey:
- I already marked TikaIO @Experimental, so we can make changes.
- Yes, the String in KV<String, ParseResult> is the filename.
I guess we could alternatively put it into ParseResult - don't have a strong opinion. @Chris: unorderedness of Metadata would have helped if we extracted each Metadata item into a separate PCollection element, but that's not what we want to do (we want to have an element per document instead). @Timothy: can you tell more about this RecursiveParserWrapper? Is this something that the user can configure by specifying the Parser on TikaIO if they so wish? On Thu, Sep 21, 2017 at 2:23 PM Allison, Timothy B. <talli...@mitre.org<mailto:talli...@mitre.org>> wrote: Like Sergey, it’ll take me some time to understand your recommendations. Thank you! On one small point: return a PCollection<KV<String, TikaIO.ParseResult>>, where ParseResult is a class with properties { String content, Metadata metadata } For this option, I’d strongly encourage using the Json output from the RecursiveParserWrapper that contains metadata and content, and captures metadata even from embedded documents. However, since TikaIO can be applied to very large files, this could produce very large elements, which is a bad idea Large documents are a problem, no doubt about it… From: Eugene Kirpichov [mailto:kirpic...@google.com<mailto:kirpic...@google.com>] Sent: Thursday, September 21, 2017 4:41 PM To: Allison, Timothy B. <talli...@mitre.org<mailto:talli...@mitre.org>>; d...@beam.apache.org<mailto:d...@beam.apache.org> Cc: dev@tika.apache.org<mailto:dev@tika.apache.org> Subject: Re: TikaIO concerns Thanks all for the discussion. It seems we have consensus that both within-document order and association with the original filename are necessary, but currently absent from TikaIO. Association with original file: Sergey - Beam does not automatically provide a way to associate an element with the file it originated from: automatically tracking data provenance is a known very hard research problem on which many papers have been written, and obvious solutions are very easy to break. 
See related discussion at https://lists.apache.org/thread.html/32aab699db3901d9f0191ac7dbc0091b31cb8be85eee6349deaee671@%3Cuser.beam.apache.org%3E . If you want the elements of your PCollection to contain additional information, you need the elements themselves to contain this information: the elements are self-contained and have no metadata associated with them (beyond the timestamp and windows, universal to the whole Beam model).

Order within a file: The only way to have any kind of order within a PCollection is to have the elements of the PCollection contain something ordered, e.g. have a PCollection<List>, where each List is for one file [I'm assuming Tika, at a low level, works on a per-file basis?]. However, since TikaIO can be applied to very large files, this could produce very large elements, which is a bad idea. Because of this, I don't think the result of applying Tika to a single file can be encoded as a PCollection element.
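For concreteness, the list-of-maps extract format Tim describes (one map per document, with the container file first) can be modeled in plain Java. This is only a sketch of the shape of the output: the keys below are the purely made-up ones from his example, not Tika's real metadata key names.

```java
import java.util.List;
import java.util.Map;

public class RecursiveExtractShape {
    public static void main(String[] args) {
        // One map per document: index 0 is the container file,
        // subsequent entries are embedded files/attachments.
        // Keys are illustrative (from the made-up example), not real Tika keys.
        List<Map<String, String>> extract = List.of(
                Map.of("Name", "container file",
                       "Author", "Chris Mattmann",
                       "Content", "Four score and seven years ago...",
                       "Content-type", "application/pdf"),
                Map.of("Name", "embedded file1",
                       "Author", "Nick Burch",
                       "Content", "When in the course of human events...",
                       "Content-type", "application/rtf"));

        // A file with no attachments would produce a list of length 1.
        System.out.println(extract.size());
        System.out.println(extract.get(1).get("Content-type"));
    }
}
```

Note that the whole list has to be materialized before it can be serialized, which is exactly why large files are a problem for this format.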
RE: TikaIO concerns
@Eugene: What's the best way to have Beam help us with these issues, or do these come for free with the Beam framework?

1) a process-level timeout (because you can't actually kill a thread in Java)
2) a process-level restart on OOM
3) avoid trying to reprocess a badly behaving document
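Point 1 is usually handled by running the parse in a child process and enforcing a wall-clock budget from the parent, since a hung child process, unlike a thread, can actually be killed. A minimal stdlib sketch of that pattern (the shell commands stand in for a real Tika invocation, and a POSIX `sh` is assumed):

```java
import java.util.List;
import java.util.concurrent.TimeUnit;

public class ProcessTimeout {
    /**
     * Runs a command in a child process and kills it if it exceeds the timeout.
     * Returns true if the process finished on its own, false if it was killed.
     */
    public static boolean runWithTimeout(List<String> command, long timeoutSeconds)
            throws Exception {
        Process p = new ProcessBuilder(command).start();
        if (p.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
            return true; // finished within the budget
        }
        p.destroyForcibly(); // unlike a thread, a hung child process CAN be killed
        p.waitFor();         // reap it
        return false;
    }

    public static void main(String[] args) throws Exception {
        // A well-behaved "parse" finishes; a hung one is killed after 1 second.
        System.out.println(runWithTimeout(List.of("sh", "-c", "exit 0"), 5));
        System.out.println(runWithTimeout(List.of("sh", "-c", "sleep 30"), 1));
    }
}
```

Points 2 and 3 follow the same idea: an OOM kills only the child JVM, and the parent records which input it was processing so that input is skipped on restart.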
Re: TikaIO concerns
Hi,

On 22/09/17 00:42, Eugene Kirpichov wrote:

> Hi, @Sergey:
> - I already marked TikaIO @Experimental, so we can make changes.

OK, thanks.

> - Yes, the String in KV<String, ParseResult> is the filename. I guess we could alternatively put it into ParseResult - don't have a strong opinion.

Sure. If you don't mind, the first thing I'd like to try, hopefully early next week, is to introduce ParseResult into the existing code. I know it won't 'fix' the issues related to the ordering, but starting with a complete re-write would be a steep curve for me, so I'd experiment first with the idea (which I like very much) of wrapping several related pieces (content fragment, metadata, and the doc id/file name) into ParseResult.

By the way, reporting the Tika file (output) metadata with every ParseResult instance will work much better than I first thought. I assumed it wouldn't, because Tika does not call back when it populates the file metadata - it only does that for the actual content - but it does update the Metadata instance passed to it while it keeps parsing and finding new metadata, so the metadata pieces will become available to the pipeline as soon as they are found. Though Tika (1.17?) may need to ensure its Metadata is backed by a concurrent map for this approach to work, not sure yet...

> @Chris: unorderedness of Metadata would have helped if we extracted each Metadata item into a separate PCollection element, but that's not what we want to do (we want to have an element per document instead).
> @Timothy: can you tell more about this RecursiveParserWrapper? Is this something that the user can configure by specifying the Parser on TikaIO if they so wish?
As a general note, the Metadata passed to the top-level parser acts as a metadata sink for the file (and its embedded attachments) but also as a 'helper' to the parser. Right now TikaIO uses it to pass a media type hint if available (to help the auto-detect parser select the correct parser faster), and also a parser which will be used to parse the embedded attachments (I did it after Tim hinted about it earlier on...). Not sure if RecursiveParserWrapper can act as a top-level parser or needs to be passed as a metadata property to AutoDetectParser, Tim will know :-)

Thanks, Sergey

On Thu, Sep 21, 2017 at 2:23 PM Allison, Timothy B. <talli...@mitre.org> wrote:

Like Sergey, it’ll take me some time to understand your recommendations. Thank you! On one small point:

> return a PCollection<KV<String, TikaIO.ParseResult>>, where ParseResult is a class with properties { String content, Metadata metadata }

For this option, I’d strongly encourage using the Json output from the RecursiveParserWrapper that contains metadata and content, and captures metadata even from embedded documents.

> However, since TikaIO can be applied to very large files, this could produce very large elements, which is a bad idea

Large documents are a problem, no doubt about it…

*From:* Eugene Kirpichov [mailto:kirpic...@google.com]
*Sent:* Thursday, September 21, 2017 4:41 PM
*To:* Allison, Timothy B. <talli...@mitre.org>; d...@beam.apache.org
*Cc:* dev@tika.apache.org
*Subject:* Re: TikaIO concerns

Thanks all for the discussion. It seems we have consensus that both within-document order and association with the original filename are necessary, but currently absent from TikaIO.

*Association with original file:* Sergey - Beam does not *automatically* provide a way to associate an element with the file it originated from: automatically tracking data provenance is a known very hard research problem on which many papers have been written, and obvious solutions are very easy to break.
See related discussion at https://lists.apache.org/thread.html/32aab699db3901d9f0191ac7dbc0091b31cb8be85eee6349deaee671@%3Cuser.beam.apache.org%3E . If you want the elements of your PCollection to contain additional information, you need the elements themselves to contain this information: the elements are self-contained and have no metadata associated with them (beyond the timestamp and windows, universal to the whole Beam model).

*Order within a file:* The only way to have any kind of order within a PCollection is to have the elements of the PCollection contain something ordered, e.g. have a PCollection<List>, where each List is for one file [I'm assuming Tika, at a low level, works on a per-file basis?]. However, since TikaIO can be applied to very large files, this could produce very large elements, which is a bad idea. Because of this, I don't think the result of applying Tika to a single file can be encoded as a PCollection element.

Given both of these, I think that it's not possible to create a *general-purpose* TikaIO transform that will be better than manual invocation of Tika as a DoFn on the result of FileIO.readMatches(). However, looking at the examples at https://tika.apache.org/1.16/examples.html - almost all of the examples involve extracting a single String from each document.
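A rough sketch of what the ParseResult value class discussed in this thread might look like, with the filename folded into the element itself rather than carried in a KV. The field names here are my assumptions, Tika's Metadata is simplified to a plain map so the sketch stays dependency-free, and a real Beam version would also need a Coder:

```java
import java.util.Collections;
import java.util.Map;

/** Sketch of a per-document parse result: filename + content + metadata in one element. */
public final class ParseResult {
    private final String fileName;   // doc id, so downstream stages can link text to its file
    private final String content;    // possibly just a fragment, if emitted incrementally
    private final Map<String, String> metadata; // stand-in for org.apache.tika.metadata.Metadata

    public ParseResult(String fileName, String content, Map<String, String> metadata) {
        this.fileName = fileName;
        this.content = content;
        this.metadata = Collections.unmodifiableMap(metadata);
    }

    public String getFileName() { return fileName; }
    public String getContent()  { return content; }
    public Map<String, String> getMetadata() { return metadata; }

    public static void main(String[] args) {
        ParseResult r = new ParseResult("report.pdf", "Four score...",
                Map.of("Content-Type", "application/pdf"));
        System.out.println(r.getFileName() + ": " + r.getMetadata().get("Content-Type"));
    }
}
```

Making the element self-contained like this is exactly the workaround for Beam elements having no metadata of their own beyond timestamp and windows.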
RE: TikaIO concerns
@Timothy: can you tell more about this RecursiveParserWrapper? Is this something that the user can configure by specifying the Parser on TikaIO if they so wish? Not at the moment, we’d have to do some coding on our end or within Beam. The format is a list of maps/dicts for each file. Each map contains all of the metadata, with one key reserved for the content. If a file is a file with no attachments, the list has length 1; otherwise there’s a map for each embedded file. Unlike our legacy xhtml, this format maintains metadata for attachments. The downside to this extract format is that it requires a full parse of the document and all data to be held in-memory before writing it. On the other hand, while Tika tries to be streaming, and that was one of the critical early design goals, for some file formats, we simply have to parse the whole thing before we can have any output. So, y, large files are a problem. :\ Example with purely made-up keys representing a pdf file containing an RTF attachment [ { Name : “container file”, Author: “Chris Mattmann”, Content: “Four score and seven years ago…”, Content-type: “application/pdf” … }, { Name : “embedded file1” Author: “Nick Burch”, Content: “When in the course of human events…”, Content-type: “application/rtf” } ] From: Eugene Kirpichov [mailto:kirpic...@google.com] Sent: Thursday, September 21, 2017 7:42 PM To: Allison, Timothy B. <talli...@mitre.org>; d...@beam.apache.org Cc: dev@tika.apache.org Subject: Re: TikaIO concerns Hi, @Sergey: - I already marked TikaIO @Experimental, so we can make changes. - Yes, the String in KV<String, ParseResult> is the filename. I guess we could alternatively put it into ParseResult - don't have a strong opinion. @Chris: unorderedness of Metadata would have helped if we extracted each Metadata item into a separate PCollection element, but that's not what we want to do (we want to have an element per document instead). @Timothy: can you tell more about this RecursiveParserWrapper? 
Is this something that the user can configure by specifying the Parser on TikaIO if they so wish? On Thu, Sep 21, 2017 at 2:23 PM Allison, Timothy B. <talli...@mitre.org<mailto:talli...@mitre.org>> wrote: Like Sergey, it’ll take me some time to understand your recommendations. Thank you! On one small point: >return a PCollection<KV<String, TikaIO.ParseResult>>, where ParseResult is a >class with properties { String content, Metadata metadata } For this option, I’d strongly encourage using the Json output from the RecursiveParserWrapper that contains metadata and content, and captures metadata even from embedded documents. > However, since TikaIO can be applied to very large files, this could produce > very large elements, which is a bad idea Large documents are a problem, no doubt about it… From: Eugene Kirpichov [mailto:kirpic...@google.com<mailto:kirpic...@google.com>] Sent: Thursday, September 21, 2017 4:41 PM To: Allison, Timothy B. <talli...@mitre.org<mailto:talli...@mitre.org>>; d...@beam.apache.org<mailto:d...@beam.apache.org> Cc: dev@tika.apache.org<mailto:dev@tika.apache.org> Subject: Re: TikaIO concerns Thanks all for the discussion. It seems we have consensus that both within-document order and association with the original filename are necessary, but currently absent from TikaIO. Association with original file: Sergey - Beam does not automatically provide a way to associate an element with the file it originated from: automatically tracking data provenance is a known very hard research problem on which many papers have been written, and obvious solutions are very easy to break. See related discussion at https://lists.apache.org/thread.html/32aab699db3901d9f0191ac7dbc0091b31cb8be85eee6349deaee671@%3Cuser.beam.apache.org%3E . 
If you want the elements of your PCollection to contain additional information, you need the elements themselves to contain this information: the elements are self-contained and have no metadata associated with them (beyond the timestamp and windows, universal to the whole Beam model).

Order within a file: The only way to have any kind of order within a PCollection is to have the elements of the PCollection contain something ordered, e.g. have a PCollection<List>, where each List is for one file [I'm assuming Tika, at a low level, works on a per-file basis?]. However, since TikaIO can be applied to very large files, this could produce very large elements, which is a bad idea. Because of this, I don't think the result of applying Tika to a single file can be encoded as a PCollection element.

Given both of these, I think that it's not possible to create a general-purpose TikaIO transform that will be better than manual invocation of Tika as a DoFn on the result of FileIO.readMatches(). However, looking at the examples at https://tika.apache.org/1.16/examples.html - almost all of the examples involve extracting a single String from each document.
RE: TikaIO concerns
Like Sergey, it’ll take me some time to understand your recommendations. Thank you! On one small point: >return a PCollection<KV<String, TikaIO.ParseResult>>, where ParseResult is a >class with properties { String content, Metadata metadata } For this option, I’d strongly encourage using the Json output from the RecursiveParserWrapper that contains metadata and content, and captures metadata even from embedded documents. > However, since TikaIO can be applied to very large files, this could produce > very large elements, which is a bad idea Large documents are a problem, no doubt about it… From: Eugene Kirpichov [mailto:kirpic...@google.com] Sent: Thursday, September 21, 2017 4:41 PM To: Allison, Timothy B. <talli...@mitre.org>; d...@beam.apache.org Cc: dev@tika.apache.org Subject: Re: TikaIO concerns Thanks all for the discussion. It seems we have consensus that both within-document order and association with the original filename are necessary, but currently absent from TikaIO. Association with original file: Sergey - Beam does not automatically provide a way to associate an element with the file it originated from: automatically tracking data provenance is a known very hard research problem on which many papers have been written, and obvious solutions are very easy to break. See related discussion at https://lists.apache.org/thread.html/32aab699db3901d9f0191ac7dbc0091b31cb8be85eee6349deaee671@%3Cuser.beam.apache.org%3E . If you want the elements of your PCollection to contain additional information, you need the elements themselves to contain this information: the elements are self-contained and have no metadata associated with them (beyond the timestamp and windows, universal to the whole Beam model). Order within a file: The only way to have any kind of order within a PCollection is to have the elements of the PCollection contain something ordered, e.g. 
have a PCollection<List>, where each List is for one file [I'm assuming Tika, at a low level, works on a per-file basis?]. However, since TikaIO can be applied to very large files, this could produce very large elements, which is a bad idea. Because of this, I don't think the result of applying Tika to a single file can be encoded as a PCollection element.

Given both of these, I think that it's not possible to create a general-purpose TikaIO transform that will be better than manual invocation of Tika as a DoFn on the result of FileIO.readMatches(). However, looking at the examples at https://tika.apache.org/1.16/examples.html - almost all of the examples involve extracting a single String from each document. This use case, with the assumption that individual documents are small enough, can certainly be simplified, and TikaIO could be a facade for doing just this. E.g. TikaIO could:

- take as input a PCollection
- return a PCollection<KV<String, TikaIO.ParseResult>>, where ParseResult is a class with properties { String content, Metadata metadata }
- be configured by: a Parser (it implements Serializable so can be specified at pipeline construction time) and a ContentHandler whose toString() will go into "content". ContentHandler does not implement Serializable, so you can not specify it at construction time - however, you can let the user specify either its class (if it's a simple handler like a BodyContentHandler) or specify a lambda for creating the handler (SerializableFunction<Void, ContentHandler>), and potentially you can have a simpler facade for Tika.parseAsString() - e.g. call it TikaIO.parseAllAsStrings().

Example usage would look like:

PCollection<KV<String, ParseResult>> parseResults = p
    .apply(FileIO.match().filepattern(...))
    .apply(FileIO.readMatches())
    .apply(TikaIO.parseAllAsStrings())

or:

    .apply(TikaIO.parseAll()
        .withParser(new AutoDetectParser())
        .withContentHandler(() -> new BodyContentHandler(new ToXMLContentHandler())))

You could also have shorthands for letting the user avoid using FileIO directly in simple cases, for example:

p.apply(TikaIO.parseAsStrings().from(filepattern))

This would of course be implemented as a ParDo or even MapElements, and you'll be able to share the code between parseAll and regular parse.

On Thu, Sep 21, 2017 at 7:38 AM Sergey Beryozkin <sberyoz...@gmail.com<mailto:sberyoz...@gmail.com>> wrote:

Hi Tim

On 21/09/17 14:33, Allison, Timothy B. wrote:
> Thank you, Sergey.
>
> My knowledge of Apache Beam is limited -- I saw Davor and Jean-Baptiste's talk at ApacheCon in Miami, and I was and am totally impressed, but I haven't had a chance to work with it yet.
>
> From my perspective, if I understand this thread (and I may not!), getting unordered text from _a given file_ is a non-starter for most applications. The implementation needs to guarantee order per file, and the user has to be able to link the "extract" back to a unique identifier for the document.
Re: TikaIO concerns
released ? Thanks, Sergey > On Thu, Sep 21, 2017 at 7:38 AM Sergey Beryozkin <sberyoz...@gmail.com> > wrote: > >> Hi Tim >> On 21/09/17 14:33, Allison, Timothy B. wrote: >>> Thank you, Sergey. >>> >>> My knowledge of Apache Beam is limited -- I saw Davor and >> Jean-Baptiste's talk at ApacheCon in Miami, and I was and am totally >> impressed, but I haven't had a chance to work with it yet. >>> >>> From my perspective, if I understand this thread (and I may not!), >> getting unordered text from _a given file_ is a non-starter for most >> applications. The implementation needs to guarantee order per file, and >> the user has to be able to link the "extract" back to a unique identifier >> for the document. If the current implementation doesn't do those things, >> we need to change it, IMHO. >>> >> Right now Tika-related reader does not associate a given text fragment >> with the file name, so a function looking at some text and trying to >> find where it came from won't be able to do so. >> >> So I asked how to do it in Beam, how to attach some context to the given >> piece of data. I hope it can be done and if not - then perhaps some >> improvement can be applied. >> >> Re the unordered text - yes - this is what we currently have with Beam + >> TikaIO :-). >> >> The use-case I referred to earlier in this thread (upload PDFs - save >> the possibly unordered text to Lucene with the file name 'attached', let >> users search for the files containing some words - phrases, this works >> OK given that I can see PDF parser for ex reporting the lines) can be >> supported OK with the current TikaIO (provided we find a way to 'attach' >> a file name to the flow). >> >> I see though supporting the total ordering can be a big deal in other >> cases. Eugene, can you please explain how it can be done, is it >> achievable in principle, without the users having to do some custom >> coding ? 
>> >>> To the question of -- why is this in Beam at all; why don't we let users >> call it if they want it?... >>> >>> No matter how much we do to Tika, it will behave badly sometimes -- >> permanent hangs requiring kill -9 and OOMs to name a few. I imagine folks >> using Beam -- folks likely with large batches of unruly/noisy documents -- >> are more likely to run into these problems than your average >> couple-of-thousand-docs-from-our-own-company user. So, if there are things >> we can do in Beam to prevent developers around the world from having to >> reinvent the wheel for defenses against these problems, then I'd be >> enormously grateful if we could put Tika into Beam. That means: >>> >>> 1) a process-level timeout (because you can't actually kill a thread in >> Java) >>> 2) a process-level restart on OOM >>> 3) avoid trying to reprocess a badly behaving document >>> >>> If Beam automatically handles those problems, then I'd say, y, let users >> write their own code. If there is so much as a single configuration knob >> (and it sounds like Beam is against complex configuration...yay!) to get >> that working in Beam, then I'd say, please integrate Tika into Beam. From >> a safety perspective, it is critical to keep the extraction process >> entirely separate (jvm, vm, m, rack, data center!) from the >> transformation+loading steps. IMHO, very few devs realize this because >> Tika works well lots of the time...which is why it is critical for us to >> make it easy for people to get it right all of the time. >>> >>> Even in my desktop (gah, y, desktop!) search app, I run Tika in batch >> mode first in one jvm, and then I kick off another process to do >> transform/loading into Lucene/Solr from the .json files that Tika generates >> for each input file. If I were to scale up, I'd want to maintain this >> complete separation of steps. >>> >>> Apologies if I've derailed the conversation or misunderstood this thread. 
>>>
>> Major thanks for your input :-)
>>
>> Cheers, Sergey
>>
>>> Cheers,
>>>
>>> Tim
>>>
>>> -Original Message-
>>> From: Sergey Beryozkin [mailto:sberyoz...@gmail.com]
Re: TikaIO concerns
o I asked how to do it in Beam, how to attach some context to the given piece of data. I hope it can be done and if not - then perhaps some improvement can be applied. Re the unordered text - yes - this is what we currently have with Beam + TikaIO :-). The use-case I referred to earlier in this thread (upload PDFs - save the possibly unordered text to Lucene with the file name 'attached', let users search for the files containing some words - phrases, this works OK given that I can see PDF parser for ex reporting the lines) can be supported OK with the current TikaIO (provided we find a way to 'attach' a file name to the flow). I see though supporting the total ordering can be a big deal in other cases. Eugene, can you please explain how it can be done, is it achievable in principle, without the users having to do some custom coding ? To the question of -- why is this in Beam at all; why don't we let users call it if they want it?... No matter how much we do to Tika, it will behave badly sometimes -- permanent hangs requiring kill -9 and OOMs to name a few. I imagine folks using Beam -- folks likely with large batches of unruly/noisy documents -- are more likely to run into these problems than your average couple-of-thousand-docs-from-our-own-company user. So, if there are things we can do in Beam to prevent developers around the world from having to reinvent the wheel for defenses against these problems, then I'd be enormously grateful if we could put Tika into Beam. That means: 1) a process-level timeout (because you can't actually kill a thread in Java) 2) a process-level restart on OOM 3) avoid trying to reprocess a badly behaving document If Beam automatically handles those problems, then I'd say, y, let users write their own code. If there is so much as a single configuration knob (and it sounds like Beam is against complex configuration...yay!) to get that working in Beam, then I'd say, please integrate Tika into Beam. 
From a safety perspective, it is critical to keep the extraction process entirely separate (jvm, vm, m, rack, data center!) from the transformation+loading steps. IMHO, very few devs realize this because Tika works well lots of the time...which is why it is critical for us to make it easy for people to get it right all of the time. Even in my desktop (gah, y, desktop!) search app, I run Tika in batch mode first in one jvm, and then I kick off another process to do transform/loading into Lucene/Solr from the .json files that Tika generates for each input file. If I were to scale up, I'd want to maintain this complete separation of steps. Apologies if I've derailed the conversation or misunderstood this thread. Major thanks for your input :-) Cheers, Sergey Cheers, Tim -Original Message- From: Sergey Beryozkin [mailto:sberyoz...@gmail.com] Sent: Thursday, September 21, 2017 9:07 AM To: d...@beam.apache.org Cc: Allison, Timothy B. <talli...@mitre.org> Subject: Re: TikaIO concerns Hi All Please welcome Tim, one of Apache Tika leads and practitioners. Tim, thanks for joining in :-). If you have some great Apache Tika stories to share (preferably involving the cases where it did not really matter the ordering in which Tika-produced data were dealt with by the consumers) then please do so :-). At the moment, even though Tika ContentHandler will emit the ordered data, the Beam runtime will have no guarantees that the downstream pipeline components will see the data coming in the right order. (FYI, I understand from the earlier comments that the total ordering is also achievable but would require the extra API support) Other comments would be welcome too Thanks, Sergey On 21/09/17 10:55, Sergey Beryozkin wrote: I noticed that the PDF and ODT parsers actually split by lines, not individual words and nearly 100% sure I saw Tika reporting individual lines when it was parsing the text files. The 'min text length' feature can help with reporting several lines at a time, etc... 
I'm working with this PDF all the time: https://rwc.iacr.org/2017/Slides/nguyen.quan.pdf - try it too if you get a chance. (and I can imagine not all PDFs/etc representing the 'story' but can be for ex a log-like content too) That said, I don't know how a parser for the format N will behave, it depends on the individual parsers. IMHO it's an equal candidate alongside Text-based bounded IOs...

I'd like to know though how to make a file name available to the pipeline which is working with the current text fragment?

Going to try and do some measurements and compare the sync vs async parsing modes... Asked the Tika team to support with some more examples...

Cheers, Sergey

On 20/09/17 22:17, Sergey Beryozkin wrote: Hi, thanks for the explanations,

On 20/09/17 16:41, Eugene Kirpichov wrote: Hi! TextIO returns an unordered soup of lines contained in all files you ask it to read. People usually use TextIO for reading files where 1 line corresponds to 1 independent data element, e.g. a log entry, or a row of a CSV file - so discarding order is ok.
Re: TikaIO concerns
Cheers, Sergey

Cheers, Tim

-Original Message-
From: Sergey Beryozkin [mailto:sberyoz...@gmail.com]
Sent: Thursday, September 21, 2017 9:07 AM
To: d...@beam.apache.org
Cc: Allison, Timothy B. <talli...@mitre.org>
Subject: Re: TikaIO concerns

Hi All

Please welcome Tim, one of Apache Tika leads and practitioners. Tim, thanks for joining in :-). If you have some great Apache Tika stories to share (preferably involving the cases where it did not really matter the ordering in which Tika-produced data were dealt with by the consumers) then please do so :-).

At the moment, even though Tika ContentHandler will emit the ordered data, the Beam runtime will have no guarantees that the downstream pipeline components will see the data coming in the right order. (FYI, I understand from the earlier comments that the total ordering is also achievable but would require the extra API support)

Other comments would be welcome too

Thanks, Sergey

On 21/09/17 10:55, Sergey Beryozkin wrote: I noticed that the PDF and ODT parsers actually split by lines, not individual words and nearly 100% sure I saw Tika reporting individual lines when it was parsing the text files. The 'min text length' feature can help with reporting several lines at a time, etc...

I'm working with this PDF all the time: https://rwc.iacr.org/2017/Slides/nguyen.quan.pdf - try it too if you get a chance. (and I can imagine not all PDFs/etc representing the 'story' but can be for ex a log-like content too) That said, I don't know how a parser for the format N will behave, it depends on the individual parsers. IMHO it's an equal candidate alongside Text-based bounded IOs... I'd like to know though how to make a file name available to the pipeline which is working with the current text fragment? Going to try and do some measurements and compare the sync vs async parsing modes... Asked the Tika team to support with some more examples...
Cheers, Sergey On 20/09/17 22:17, Sergey Beryozkin wrote: Hi, thanks for the explanations, On 20/09/17 16:41, Eugene Kirpichov wrote: Hi! TextIO returns an unordered soup of lines contained in all files you ask it to read. People usually use TextIO for reading files where 1 line corresponds to 1 independent data element, e.g. a log entry, or a row of a CSV file - so discarding order is ok. Just a side note, I'd probably want that be ordered, though I guess it depends... However, there is a number of cases where TextIO is a poor fit: - Cases where discarding order is not ok - e.g. if you're doing natural language processing and the text files contain actual prose, where you need to process a file as a whole. TextIO can't do that. - Cases where you need to remember which file each element came from, e.g. if you're creating a search index for the files: TextIO can't do this either. Both of these issues have been raised in the past against TextIO; however it seems that the overwhelming majority of users of TextIO use it for logs or CSV files or alike, so solving these issues has not been a priority. Currently they are solved in a general form via FileIO.read() which gives you access to reading a full file yourself - people who want more flexibility will be able to use standard Java text-parsing utilities on a ReadableFile, without involving TextIO. Same applies for XmlIO: it is specifically designed for the narrow use case where the files contain independent data entries, so returning an unordered soup of them, with no association to the original file, is the user's intention. XmlIO will not work for processing more complex XML files that are not simply a sequence of entries with the same tag, and it also does not remember the original filename. OK... However, if my understanding of Tika use cases is correct, it is mainly used for extracting content from complex file formats - for example, extracting text and images from PDF files or Word documents. 
I believe this is the main difference between it and TextIO - people usually use Tika for complex use cases where the "unordered soup of stuff" abstraction is not useful. My suspicion about this is confirmed by the fact that the crux of the Tika API is ContentHandler (http://docs.oracle.com/javase/6/docs/api/org/xml/sax/ContentHandler.html?is-external=true), whose documentation says "The order of events in this interface is very important, and mirrors the order of information in the document itself."

All that says is that a (Tika) ContentHandler will be a true SAX ContentHandler...

Let me give a few examples of what I think is possible with the raw Tika API, but I think is not currently possible with TikaIO - please correct me where I'm wrong, because I'm not particularly familiar with Tika and am judging just based on what I read about it.

- User has 100,000 Word documents and wants to convert each of them to text files for future natural language processing.
- User has 100,000 PDF files with financial st
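The ordering guarantee quoted above is the standard SAX ContentHandler contract, which the JDK's own XML parser follows as well. A small stdlib demonstration that character events arrive strictly in document order (the XML here is illustrative - Tika's parsers emit the same kind of event stream from PDFs, Word documents, etc.):

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SaxOrderDemo {
    public static void main(String[] args) throws Exception {
        StringBuilder sb = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                sb.append(ch, start, length); // events fire in document order
            }
        };
        String xml = "<doc><p>first</p><p>second</p></doc>";
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        // Text accumulates in the order it appears in the document.
        System.out.println(sb);
    }
}
```

The point of the thread is that this per-file ordering is produced by the handler but then lost once the fragments are scattered into an unordered PCollection.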
Re: TikaIO concerns
Hi Tim

On 21/09/17 14:33, Allison, Timothy B. wrote:
> Thank you, Sergey.
>
> My knowledge of Apache Beam is limited -- I saw Davor and Jean-Baptiste's
> talk at ApacheCon in Miami, and I was and am totally impressed, but I
> haven't had a chance to work with it yet.
>
> From my perspective, if I understand this thread (and I may not!), getting
> unordered text from _a given file_ is a non-starter for most applications.
> The implementation needs to guarantee order per file, and the user has to
> be able to link the "extract" back to a unique identifier for the
> document. If the current implementation doesn't do those things, we need
> to change it, IMHO.

Right now the Tika-related reader does not associate a given text fragment
with the file name, so a function looking at some text and trying to find
where it came from won't be able to do so. So I asked how to do it in Beam,
how to attach some context to a given piece of data. I hope it can be done,
and if not, then perhaps some improvement can be applied.

Re the unordered text - yes, this is what we currently have with Beam +
TikaIO :-). The use case I referred to earlier in this thread (upload PDFs,
save the possibly unordered text to Lucene with the file name 'attached',
and let users search for the files containing some words or phrases) can be
supported with the current TikaIO, provided we find a way to 'attach' a
file name to the flow; this works OK given that I can see the PDF parser,
for example, reporting the lines. I can see though that supporting total
ordering can be a big deal in other cases. Eugene, can you please explain
how it can be done? Is it achievable in principle, without the users having
to do some custom coding?

> To the question of -- why is this in Beam at all; why don't we let users
> call it if they want it?...
>
> No matter how much we do to Tika, it will behave badly sometimes --
> permanent hangs requiring kill -9 and OOMs to name a few. I imagine folks
> using Beam -- folks likely with large batches of unruly/noisy documents --
> are more likely to run into these problems than your average
> couple-of-thousand-docs-from-our-own-company user.
>
> So, if there are things we can do in Beam to prevent developers around
> the world from having to reinvent the wheel for defenses against these
> problems, then I'd be enormously grateful if we could put Tika into Beam.
> That means:
>
> 1) a process-level timeout (because you can't actually kill a thread in
>    Java)
> 2) a process-level restart on OOM
> 3) avoid trying to reprocess a badly behaving document
>
> If Beam automatically handles those problems, then I'd say, y, let users
> write their own code. If there is so much as a single configuration knob
> (and it sounds like Beam is against complex configuration...yay!) to get
> that working in Beam, then I'd say, please integrate Tika into Beam.
>
> From a safety perspective, it is critical to keep the extraction process
> entirely separate (jvm, vm, m, rack, data center!) from the
> transformation+loading steps. IMHO, very few devs realize this because
> Tika works well lots of the time...which is why it is critical for us to
> make it easy for people to get it right all of the time.
>
> Even in my desktop (gah, y, desktop!) search app, I run Tika in batch
> mode first in one jvm, and then I kick off another process to do
> transform/loading into Lucene/Solr from the .json files that Tika
> generates for each input file. If I were to scale up, I'd want to
> maintain this complete separation of steps.
>
> Apologies if I've derailed the conversation or misunderstood this thread.

Major thanks for your input :-)

Cheers, Sergey

> Cheers, Tim

-----Original Message-----
From: Sergey Beryozkin [mailto:sberyoz...@gmail.com]
Sent: Thursday, September 21, 2017 9:07 AM
To: d...@beam.apache.org
Cc: Allison, Timothy B. <talli...@mitre.org>
Subject: Re: TikaIO concerns

Hi All

Please welcome Tim, one of the Apache Tika leads and practitioners.

Tim, thanks for joining in :-). If you have some great Apache Tika stories
to share (preferably involving cases where the ordering in which the
Tika-produced data were dealt with by the consumers did not really matter)
then please do so :-).

At the moment, even though the Tika ContentHandler will emit the ordered
data, the Beam runtime will have no guarantees that the downstream pipeline
components will see the data coming in the right order. (FYI, I understand
from the earlier comments that total ordering is also achievable but would
require extra API support.)

Other comments would be welcome too.

Thanks, Sergey

On 21/09/17 10:55, Sergey Beryozkin wrote:
> I noticed that the PDF and ODT parsers actually split by lines, not
> individual words, and I am nearly 100% sure I saw Tika reporting
> individual lines when it was parsing the text files. The 'min text
> length' feature can help with reporting several lines at a time, etc...
>
> I'm working with this PDF all the time:
> https://rwc.iacr.org/2017/Slides/nguy
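The "attach a file name to the flow" idea Sergey asks about above could be sketched in plain Java as follows. This is illustrative only, not the actual TikaIO or Beam API: the `FragmentsByFile` class and its `group` method are hypothetical names, and the in-memory map stands in for what emitting `KV<filename, text>` elements and grouping by key would give in a real pipeline.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FragmentsByFile {
    // Pair each extracted text fragment with its source file name and group
    // by file -- the shape a downstream consumer (e.g. a Lucene indexer)
    // needs in order to link fragments back to their source document.
    public static Map<String, List<String>> group(List<String[]> fragments) {
        Map<String, List<String>> byFile = new LinkedHashMap<>();
        for (String[] f : fragments) {
            // f[0] is the file name, f[1] is the fragment text
            byFile.computeIfAbsent(f[0], k -> new ArrayList<>()).add(f[1]);
        }
        return byFile;
    }
}
```

Note that grouping preserves per-file fragment order only if the input order is preserved; in a distributed pipeline that per-file ordering is exactly the guarantee under discussion in this thread.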
RE: TikaIO concerns
Thank you, Sergey.

My knowledge of Apache Beam is limited -- I saw Davor and Jean-Baptiste's
talk at ApacheCon in Miami, and I was and am totally impressed, but I
haven't had a chance to work with it yet.

From my perspective, if I understand this thread (and I may not!), getting
unordered text from _a given file_ is a non-starter for most applications.
The implementation needs to guarantee order per file, and the user has to
be able to link the "extract" back to a unique identifier for the document.
If the current implementation doesn't do those things, we need to change
it, IMHO.

To the question of -- why is this in Beam at all; why don't we let users
call it if they want it?...

No matter how much we do to Tika, it will behave badly sometimes --
permanent hangs requiring kill -9 and OOMs to name a few. I imagine folks
using Beam -- folks likely with large batches of unruly/noisy documents --
are more likely to run into these problems than your average
couple-of-thousand-docs-from-our-own-company user.

So, if there are things we can do in Beam to prevent developers around the
world from having to reinvent the wheel for defenses against these
problems, then I'd be enormously grateful if we could put Tika into Beam.
That means:

1) a process-level timeout (because you can't actually kill a thread in
   Java)
2) a process-level restart on OOM
3) avoid trying to reprocess a badly behaving document

If Beam automatically handles those problems, then I'd say, y, let users
write their own code. If there is so much as a single configuration knob
(and it sounds like Beam is against complex configuration...yay!) to get
that working in Beam, then I'd say, please integrate Tika into Beam.

From a safety perspective, it is critical to keep the extraction process
entirely separate (jvm, vm, m, rack, data center!) from the
transformation+loading steps.
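The process-level isolation Tim lists above could be sketched as follows. This is a minimal illustration, not Beam or TikaIO code: `ParseWithTimeout` and `parseInChildProcess` are hypothetical names, and the command string is a placeholder for launching a separate JVM that runs the actual parse.

```java
import java.util.concurrent.TimeUnit;

public class ParseWithTimeout {
    // Run one document parse in a child process so that a permanent hang
    // or an OOM kills only the child, never the main job (sketch only).
    public static boolean parseInChildProcess(String command, long timeoutSeconds)
            throws Exception {
        Process child = new ProcessBuilder(command.split(" "))
                .redirectErrorStream(true)
                .start();
        if (!child.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
            // Hard timeout: forcibly kill the child; the parent survives.
            child.destroyForcibly().waitFor();
            return false; // record the document as failed, do not retry it
        }
        // OOM or crash in the child shows up as a nonzero exit code here.
        return child.exitValue() == 0;
    }
}
```

For example, `parseInChildProcess("sleep 5", 1)` (where `sleep 5` stands in for a hanging parser) returns false after one second instead of wedging the whole job; a real wrapper would also record the failed document so it is never reprocessed.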
IMHO, very few devs realize this because Tika works well lots of the
time...which is why it is critical for us to make it easy for people to get
it right all of the time.

Even in my desktop (gah, y, desktop!) search app, I run Tika in batch mode
first in one jvm, and then I kick off another process to do
transform/loading into Lucene/Solr from the .json files that Tika generates
for each input file. If I were to scale up, I'd want to maintain this
complete separation of steps.

Apologies if I've derailed the conversation or misunderstood this thread.

Cheers, Tim

-----Original Message-----
From: Sergey Beryozkin [mailto:sberyoz...@gmail.com]
Sent: Thursday, September 21, 2017 9:07 AM
To: d...@beam.apache.org
Cc: Allison, Timothy B. <talli...@mitre.org>
Subject: Re: TikaIO concerns

Hi All

Please welcome Tim, one of the Apache Tika leads and practitioners.

Tim, thanks for joining in :-). If you have some great Apache Tika stories
to share (preferably involving cases where the ordering in which the
Tika-produced data were dealt with by the consumers did not really matter)
then please do so :-).

At the moment, even though the Tika ContentHandler will emit the ordered
data, the Beam runtime will have no guarantees that the downstream pipeline
components will see the data coming in the right order. (FYI, I understand
from the earlier comments that total ordering is also achievable but would
require extra API support.)

Other comments would be welcome too.

Thanks, Sergey

On 21/09/17 10:55, Sergey Beryozkin wrote:
> I noticed that the PDF and ODT parsers actually split by lines, not
> individual words, and I am nearly 100% sure I saw Tika reporting
> individual lines when it was parsing the text files. The 'min text
> length' feature can help with reporting several lines at a time, etc...
>
> I'm working with this PDF all the time:
> https://rwc.iacr.org/2017/Slides/nguyen.quan.pdf
>
> try it too if you get a chance.
>
> (and I can imagine not all PDFs/etc representing the 'story' but can
> be for ex a log-like content too)
>
> That said, I don't know how a parser for the format N will behave; it
> depends on the individual parsers.
>
> IMHO it's an equal candidate alongside Text-based bounded IOs...
>
> I'd like to know though how to make a file name available to the
> pipeline which is working with the current text fragment?
>
> Going to try and do some measurements and compare the sync vs async
> parsing modes...
>
> Asked the Tika team to support with some more examples...
>
> Cheers, Sergey
>
> On 20/09/17 22:17, Sergey Beryozkin wrote:
>> Hi,
>>
>> thanks for the explanations,
>>
>> On 20/09/17 16:41, Eugene Kirpichov wrote:
>>> Hi!
>>>
>>> TextIO returns an unordered soup of lines contained in all files you
>>> ask it to read. People usually use TextIO for reading files where 1
>>> line corresponds to 1 independent