Re: TikaIO concerns

2017-09-23 Thread Sergey Beryozkin
it's not the Tika library's 'fault' that the crashes might occur.
Tika does its best to pick up the latest libraries that help it parse
files, but indeed there will always be some file out there that uses an
incomplete format-specific tag etc. which may cause the specific
parser to spin - but Tika will include the updated parser library asap.

And with Beam's help, the crashes that can kill Tika jobs completely
will probably become history...

Cheers, Sergey

but given our past history, I have no reason to believe that these
problems won't happen again.


Thank you, again!

Best,

   Tim

[1] Stuff on the internet or ... some of our users are forensics
examiners dealing with broken/corrupted files


P.S./FTR  
1) We've gathered a TB of data from CommonCrawl and we run regression
tests against this TB (thank you, Rackspace for hosting our vm!) to try
to identify these problems.
2) We've started a fuzzing effort to try to identify problems.
3) We added "tika-batch" for robust single box fileshare/fileshare
processing for our low volume users
4) We're trying to get the message out.  Thank you for working with us!!!


-Original Message-
From: Eugene Kirpichov [mailto:kirpic...@google.com.INVALID]
Sent: Friday, September 22, 2017 12:48 PM
To: d...@beam.apache.org
Cc: dev@tika.apache.org
Subject: Re: TikaIO concerns

Hi Tim,
   From what you're saying it sounds like the Tika library has a big
problem with crashes and freezes, and applying it at scale (eg. in the
context of Beam) requires explicitly addressing this problem, eg.
accepting the fact that in many realistic applications some documents
will just need to be skipped because they are unprocessable? This would
be the first example of a Beam IO that has this concern, so I'd like to
confirm that my understanding is correct.


On Fri, Sep 22, 2017 at 9:34 AM Allison, Timothy B. <talli...@mitre.org>
wrote:


Reuven,

Thank you!  This suggests to me that it is a good idea to integrate
Tika with Beam so that people don't have to 1) (re)discover the need
to make their wrappers robust and then 2) have to reinvent these
wheels for robustness.

For kicks, see William Palmer's post on his toe-stubbing efforts with
Hadoop [1].  He and other Tika users independently have wound up
carrying out exactly your recommendation for 1) below.

We have a MockParser that you can get to simulate regular exceptions,
OOMs and permanent hangs by asking Tika to parse a mock xml [2].


> However if processing the document causes the process to crash, then it
> will be retried.
Any ideas on how to get around this?

Thank you again.

Cheers,

  Tim

[1]
http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-web-content-nanite/
[2]
https://github.com/apache/tika/blob/master/tika-parsers/src/test/resources/test-documents/mock/example.xml


Re: TikaIO concerns

2017-09-22 Thread Sergey Beryozkin

Hi,
On 22/09/17 22:02, Eugene Kirpichov wrote:

Sure - with hundreds of different file formats and the abundance of weird /
malformed / malicious files in the wild, it's quite expected that sometimes
the library will crash.

Some kinds of issues are easier to address than others. We can catch
exceptions and return a ParseResult representing a failure to parse this
document. Addressing freezes and native JVM process crashes is much harder
and probably not necessary in the first version.

Sergey - I think, the moment you introduce ParseResult into the code, other
changes I suggested will follow "by construction":
- There'll be 1 ParseResult per document, containing filename, content and
metadata, since per discussion above it probably doesn't make sense to
deliver these in separate PCollection elements


I was still harboring the hope that maybe using a container bean like 
ParseResult (with the other changes you proposed) can somehow let us 
stream from Tika into the pipeline.


If it is 1 ParseResult per document, then it means that until Tika has 
parsed the whole document, the pipeline will not see it.


I'm sorry if I may be starting to go in circles. But let me ask this: 
how can a Beam user write a Beam function which will ensure the Tika 
content pieces are seen in order by the pipeline, without TikaIO?


Maybe knowing that will help us come up with an idea of how to 
generalize this with the help of TikaIO?



- Since you're returning a single value per document, there's no reason to
use a BoundedReader
- Likewise, there's no reason to use asynchronicity because you're not
delivering the result incrementally

I'd suggest to start the refactoring by removing the asynchronous codepath,
then converting from BoundedReader to ParDo or MapElements, then converting
from String to ParseResult.
This is a good plan, thanks. I guess at least for small documents it 
should work well (unless I've misunderstood the ParseResult idea).
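
For the record, here is a rough sketch of what the refactored ParDo might 
look like. ParseResult here is the proposed container bean (not an existing 
TikaIO class), and its success()/failure() factory methods are illustrative 
assumptions:

  import java.io.InputStream;
  import java.nio.channels.Channels;
  import org.apache.beam.sdk.io.FileIO;
  import org.apache.beam.sdk.transforms.DoFn;
  import org.apache.beam.sdk.values.KV;
  import org.apache.tika.metadata.Metadata;
  import org.apache.tika.parser.AutoDetectParser;
  import org.apache.tika.sax.BodyContentHandler;

  static class ParseToResultFn extends DoFn<FileIO.ReadableFile, KV<String, ParseResult>> {
    @ProcessElement
    public void processElement(ProcessContext c) {
      FileIO.ReadableFile file = c.element();
      String filename = file.getMetadata().resourceId().toString();
      Metadata tikaMetadata = new Metadata();
      BodyContentHandler handler = new BodyContentHandler(-1); // -1: no write limit
      try (InputStream is = Channels.newInputStream(file.open())) {
        new AutoDetectParser().parse(is, handler, tikaMetadata);
        c.output(KV.of(filename, ParseResult.success(handler.toString(), tikaMetadata)));
      } catch (Exception e) {
        // Report a failure element instead of crashing the bundle.
        c.output(KV.of(filename, ParseResult.failure(e)));
      }
    }
  }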


Thanks, Sergey


On Fri, Sep 22, 2017 at 12:10 PM Sergey Beryozkin <sberyoz...@gmail.com>
wrote:


Hi Tim, All
On 22/09/17 18:17, Allison, Timothy B. wrote:

Y, I think you have it right.


Tika library has a big problem with crashes and freezes


I wouldn't want to overstate it.  Crashes and freezes are exceedingly

rare, but when you are processing millions/billions of files in the wild
[1], they will happen.  We fix the problems or try to get our dependencies
to fix the problems when we can,

I would only like to add that IMHO it would be more correct to
state that it's not the Tika library's 'fault' that the crashes might occur.
Tika does its best to pick up the latest libraries that help it parse
files, but indeed there will always be some file out there that uses an
incomplete format-specific tag etc. which may cause the specific
parser to spin - but Tika will include the updated parser library asap.

And with Beam's help, the crashes that can kill Tika jobs completely
will probably become history...

Cheers, Sergey

but given our past history, I have no reason to believe that these
problems won't happen again.


Thank you, again!

Best,

  Tim

[1] Stuff on the internet or ... some of our users are forensics
examiners dealing with broken/corrupted files


P.S./FTR  
1) We've gathered a TB of data from CommonCrawl and we run regression
tests against this TB (thank you, Rackspace for hosting our vm!) to try to
identify these problems.
2) We've started a fuzzing effort to try to identify problems.
3) We added "tika-batch" for robust single box fileshare/fileshare
processing for our low volume users
4) We're trying to get the message out.  Thank you for working with us!!!

-Original Message-
From: Eugene Kirpichov [mailto:kirpic...@google.com.INVALID]
Sent: Friday, September 22, 2017 12:48 PM
To: d...@beam.apache.org
Cc: dev@tika.apache.org
Subject: Re: TikaIO concerns

Hi Tim,
  From what you're saying it sounds like the Tika library has a big
problem with crashes and freezes, and applying it at scale (eg. in the
context of Beam) requires explicitly addressing this problem, eg. accepting
the fact that in many realistic applications some documents will just need
to be skipped because they are unprocessable? This would be the first example
of a Beam IO that has this concern, so I'd like to confirm that my
understanding is correct.


On Fri, Sep 22, 2017 at 9:34 AM Allison, Timothy B. <talli...@mitre.org>
wrote:


Reuven,

Thank you!  This suggests to me that it is a good idea to integrate
Tika with Beam so that people don't have to 1) (re)discover the need
to make their wrappers robust and then 2) have to reinvent these
wheels for robustness.

For kicks, see William Palmer's post on his toe-stubbing efforts with
Hadoop [1].  He and other Tika users independently have wound up
carrying out exactly your recommendation for 1) below.

We have a MockParser that you can get to simulate regular exceptions,
OOMs and permanent hangs by asking Tika to parse a mock xml [2].

Re: TikaIO concerns

2017-09-22 Thread Sergey Beryozkin

Hi Tim, All
On 22/09/17 18:17, Allison, Timothy B. wrote:

Y, I think you have it right.


Tika library has a big problem with crashes and freezes


I wouldn't want to overstate it.  Crashes and freezes are exceedingly rare, but 
when you are processing millions/billions of files in the wild [1], they will 
happen.  We fix the problems or try to get our dependencies to fix the problems 
when we can,


I would only like to add that IMHO it would be more correct to 
state that it's not the Tika library's 'fault' that the crashes might occur. 
Tika does its best to pick up the latest libraries that help it parse 
files, but indeed there will always be some file out there that uses an 
incomplete format-specific tag etc. which may cause the specific 
parser to spin - but Tika will include the updated parser library asap.


And with Beam's help, the crashes that can kill Tika jobs completely 
will probably become history...


Cheers, Sergey

but given our past history, I have no reason to believe that these problems 
won't happen again.

Thank you, again!

Best,

 Tim

[1] Stuff on the internet or ... some of our users are forensics examiners 
dealing with broken/corrupted files

P.S./FTR  
1) We've gathered a TB of data from CommonCrawl and we run regression tests 
against this TB (thank you, Rackspace for hosting our vm!) to try to identify 
these problems.
2) We've started a fuzzing effort to try to identify problems.
3) We added "tika-batch" for robust single box fileshare/fileshare processing 
for our low volume users
4) We're trying to get the message out.  Thank you for working with us!!!

-Original Message-
From: Eugene Kirpichov [mailto:kirpic...@google.com.INVALID]
Sent: Friday, September 22, 2017 12:48 PM
To: d...@beam.apache.org
Cc: dev@tika.apache.org
Subject: Re: TikaIO concerns

Hi Tim,
 From what you're saying it sounds like the Tika library has a big problem with 
crashes and freezes, and applying it at scale (eg. in the context of Beam) 
requires explicitly addressing this problem, eg. accepting the fact that in 
many realistic applications some documents will just need to be skipped because 
they are unprocessable? This would be the first example of a Beam IO that has 
this concern, so I'd like to confirm that my understanding is correct.

On Fri, Sep 22, 2017 at 9:34 AM Allison, Timothy B. <talli...@mitre.org>
wrote:


Reuven,

Thank you!  This suggests to me that it is a good idea to integrate
Tika with Beam so that people don't have to 1) (re)discover the need
to make their wrappers robust and then 2) have to reinvent these
wheels for robustness.

For kicks, see William Palmer's post on his toe-stubbing efforts with
Hadoop [1].  He and other Tika users independently have wound up
carrying out exactly your recommendation for 1) below.

We have a MockParser that you can get to simulate regular exceptions,
OOMs and permanent hangs by asking Tika to parse a mock xml [2].


> However if processing the document causes the process to crash, then
> it will be retried.
Any ideas on how to get around this?

Thank you again.

Cheers,

Tim

[1]
http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-web-content-nanite/
[2]
https://github.com/apache/tika/blob/master/tika-parsers/src/test/resources/test-documents/mock/example.xml



RE: TikaIO concerns

2017-09-22 Thread Allison, Timothy B.
Great.  Thank you!

-Original Message-
From: Chris Mattmann [mailto:mattm...@apache.org] 
Sent: Friday, September 22, 2017 1:46 PM
To: dev@tika.apache.org
Subject: Re: TikaIO concerns

[dropping Beam on this]

Tim, another thing is that you can finally download the TREC-DD Polar data 
either from the NSF Arctic Data Center (70GB zip), or from Amazon S3, as 
described here:

http://github.com/chrismattmann/trec-dd-polar/ 

In case we want to use it as part of our regression.

Cheers,
Chris




On 9/22/17, 10:43 AM, "Allison, Timothy B." <talli...@mitre.org> wrote:

>>1) We've gathered a TB of data from CommonCrawl and we run regression 
tests against this TB (thank you, Rackspace for hosting our vm!) to try to 
identify these problems.

And if anyone with connections at a big company doing open source + cloud 
would be interested in floating us some storage and cycles,  we'd be happy to 
move off our single vm to increase coverage and improve the speed for our 
large-scale regression tests.  

:D

But seriously, thank you for this discussion and collaboration!

Cheers,

 Tim






Re: TikaIO concerns

2017-09-22 Thread Chris Mattmann
[dropping Beam on this]

Tim, another thing is that you can finally download the TREC-DD Polar data 
either
from the NSF Arctic Data Center (70GB zip), or from Amazon S3, as described 
here:

http://github.com/chrismattmann/trec-dd-polar/ 

In case we want to use it as part of our regression.

Cheers,
Chris




On 9/22/17, 10:43 AM, "Allison, Timothy B."  wrote:

>>1) We've gathered a TB of data from CommonCrawl and we run regression 
tests against this TB (thank you, Rackspace for hosting our vm!) to try to 
identify these problems.

And if anyone with connections at a big company doing open source + cloud 
would be interested in floating us some storage and cycles,  we'd be happy to 
move off our single vm to increase coverage and improve the speed for our 
large-scale regression tests.  

:D

But seriously, thank you for this discussion and collaboration!

Cheers,

 Tim






RE: TikaIO concerns

2017-09-22 Thread Allison, Timothy B.
>>1) We've gathered a TB of data from CommonCrawl and we run regression tests 
>>against this TB (thank you, Rackspace for hosting our vm!) to try to identify 
>>these problems.

And if anyone with connections at a big company doing open source + cloud would 
be interested in floating us some storage and cycles,  we'd be happy to move 
off our single vm to increase coverage and improve the speed for our 
large-scale regression tests.  

:D

But seriously, thank you for this discussion and collaboration!

Cheers,

 Tim



RE: TikaIO concerns

2017-09-22 Thread Allison, Timothy B.
Nice!  Thank you!

-Original Message-
From: Ben Chambers [mailto:bchamb...@apache.org] 
Sent: Friday, September 22, 2017 1:24 PM
To: d...@beam.apache.org
Cc: dev@tika.apache.org
Subject: Re: TikaIO concerns

BigQueryIO allows a side-output for elements that failed to be inserted when 
using the Streaming BigQuery sink:

https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/StreamingWriteTables.java#L92

This follows the pattern of a DoFn with multiple outputs, as described here 
https://cloud.google.com/blog/big-data/2016/01/handling-invalid-inputs-in-dataflow

So, the DoFn that runs the Tika code could be configured in terms of how 
different failures should be handled, with the option of just outputting them 
to a different PCollection that is then processed in some other way.

On Fri, Sep 22, 2017 at 10:18 AM Allison, Timothy B. <talli...@mitre.org>
wrote:

> Do tell...
>
> Interesting.  Any pointers?
>
> -Original Message-
> From: Ben Chambers [mailto:bchamb...@google.com.INVALID]
> Sent: Friday, September 22, 2017 12:50 PM
> To: d...@beam.apache.org
> Cc: dev@tika.apache.org
> Subject: Re: TikaIO concerns
>
> Regarding specifically elements that are failing -- I believe some 
> other IO has used the concept of a "Dead Letter" side-output, where 
> documents that failed to process are side-output so the user can 
> handle them appropriately.
>
> On Fri, Sep 22, 2017 at 9:47 AM Eugene Kirpichov 
> <kirpic...@google.com.invalid> wrote:
>
> > Hi Tim,
> > From what you're saying it sounds like the Tika library has a big 
> > problem with crashes and freezes, and applying it at scale (eg.
> > in the context of Beam) requires explicitly addressing this problem, 
> > eg. accepting the fact that in many realistic applications some 
> > documents will just need to be skipped because they are unprocessable?
> > This would be the first example of a Beam IO that has this concern, so 
> > I'd like to confirm that my understanding is correct.
> >
> > On Fri, Sep 22, 2017 at 9:34 AM Allison, Timothy B.
> > <talli...@mitre.org>
> > wrote:
> >
> > > Reuven,
> > >
> > > Thank you!  This suggests to me that it is a good idea to 
> > > integrate Tika with Beam so that people don't have to 1) 
> > > (re)discover the need to make their wrappers robust and then 2) 
> > > have to reinvent these wheels for robustness.
> > >
> > > For kicks, see William Palmer's post on his toe-stubbing efforts 
> > > with Hadoop [1].  He and other Tika users independently have wound 
> > > up carrying out exactly your recommendation for 1) below.
> > >
> > > We have a MockParser that you can get to simulate regular 
> > > exceptions,
> > OOMs
> > > and permanent hangs by asking Tika to parse a mock xml [2].
> > >
> > > > However if processing the document causes the process to crash, 
> > > > then it
> > > will be retried.
> > > Any ideas on how to get around this?
> > >
> > > Thank you again.
> > >
> > > Cheers,
> > >
> > >Tim
> > >
> > > [1]
> > >
> > http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-web-content-nanite/
> > > [2]
> > >
> > https://github.com/apache/tika/blob/master/tika-parsers/src/test/resources/test-documents/mock/example.xml
> > >
> >
>


Re: TikaIO concerns

2017-09-22 Thread Ben Chambers
BigQueryIO allows a side-output for elements that failed to be inserted
when using the Streaming BigQuery sink:

https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/StreamingWriteTables.java#L92

This follows the pattern of a DoFn with multiple outputs, as described here
https://cloud.google.com/blog/big-data/2016/01/handling-invalid-inputs-in-dataflow

So, the DoFn that runs the Tika code could be configured in terms of how
different failures should be handled, with the option of just outputting
them to a different PCollection that is then processed in some other way.
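
For instance, a minimal sketch of that multi-output pattern around a Tika
parse (here "files" is assumed to be a PCollection<FileIO.ReadableFile>, and
parseWithTika() is an assumed helper wrapping the actual Tika call):

  import org.apache.beam.sdk.io.FileIO;
  import org.apache.beam.sdk.transforms.DoFn;
  import org.apache.beam.sdk.transforms.ParDo;
  import org.apache.beam.sdk.values.PCollection;
  import org.apache.beam.sdk.values.PCollectionTuple;
  import org.apache.beam.sdk.values.TupleTag;
  import org.apache.beam.sdk.values.TupleTagList;

  final TupleTag<String> parsedTag = new TupleTag<String>() {};
  final TupleTag<String> deadLetterTag = new TupleTag<String>() {};

  PCollectionTuple results = files.apply("ParseWithTika",
      ParDo.of(new DoFn<FileIO.ReadableFile, String>() {
        @ProcessElement
        public void processElement(ProcessContext c) {
          try {
            c.output(parseWithTika(c.element()));
          } catch (Exception e) {
            // Route the failing document to the dead-letter output
            // instead of failing the bundle.
            c.output(deadLetterTag, c.element().getMetadata().resourceId().toString());
          }
        }
      }).withOutputTags(parsedTag, TupleTagList.of(deadLetterTag)));

  PCollection<String> parsed = results.get(parsedTag);
  PCollection<String> failed = results.get(deadLetterTag); // handle separately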

On Fri, Sep 22, 2017 at 10:18 AM Allison, Timothy B. <talli...@mitre.org>
wrote:

> Do tell...
>
> Interesting.  Any pointers?
>
> -Original Message-
> From: Ben Chambers [mailto:bchamb...@google.com.INVALID]
> Sent: Friday, September 22, 2017 12:50 PM
> To: d...@beam.apache.org
> Cc: dev@tika.apache.org
> Subject: Re: TikaIO concerns
>
> Regarding specifically elements that are failing -- I believe some other
> IO has used the concept of a "Dead Letter" side-output, where documents
> that failed to process are side-output so the user can handle them
> appropriately.
>
> On Fri, Sep 22, 2017 at 9:47 AM Eugene Kirpichov
> <kirpic...@google.com.invalid> wrote:
>
> > Hi Tim,
> > From what you're saying it sounds like the Tika library has a big
> > problem with crashes and freezes, and applying it at scale (eg.
> > in the context of Beam) requires explicitly addressing this problem,
> > eg. accepting the fact that in many realistic applications some
> > documents will just need to be skipped because they are unprocessable?
> > This would be the first example of a Beam IO that has this concern, so I'd
> > like to confirm that my understanding is correct.
> >
> > On Fri, Sep 22, 2017 at 9:34 AM Allison, Timothy B.
> > <talli...@mitre.org>
> > wrote:
> >
> > > Reuven,
> > >
> > > Thank you!  This suggests to me that it is a good idea to integrate
> > > Tika with Beam so that people don't have to 1) (re)discover the need
> > > to make their wrappers robust and then 2) have to reinvent these
> > > wheels for robustness.
> > >
> > > For kicks, see William Palmer's post on his toe-stubbing efforts
> > > with Hadoop [1].  He and other Tika users independently have wound
> > > up carrying out exactly your recommendation for 1) below.
> > >
> > > We have a MockParser that you can get to simulate regular
> > > exceptions,
> > OOMs
> > > and permanent hangs by asking Tika to parse a mock xml [2].
> > >
> > > > However if processing the document causes the process to crash,
> > > > then it
> > > will be retried.
> > > Any ideas on how to get around this?
> > >
> > > Thank you again.
> > >
> > > Cheers,
> > >
> > >Tim
> > >
> > > [1]
> > >
> > http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-web-content-nanite/
> > > [2]
> > >
> > https://github.com/apache/tika/blob/master/tika-parsers/src/test/resources/test-documents/mock/example.xml
> > >
> >
>


RE: TikaIO concerns

2017-09-22 Thread Allison, Timothy B.
Do tell...

Interesting.  Any pointers?

-Original Message-
From: Ben Chambers [mailto:bchamb...@google.com.INVALID] 
Sent: Friday, September 22, 2017 12:50 PM
To: d...@beam.apache.org
Cc: dev@tika.apache.org
Subject: Re: TikaIO concerns

Regarding specifically elements that are failing -- I believe some other IO has 
used the concept of a "Dead Letter" side-output, where documents that failed 
to process are side-output so the user can handle them appropriately.

On Fri, Sep 22, 2017 at 9:47 AM Eugene Kirpichov <kirpic...@google.com.invalid> 
wrote:

> Hi Tim,
> From what you're saying it sounds like the Tika library has a big 
> problem with crashes and freezes, and applying it at scale (eg. 
> in the context of Beam) requires explicitly addressing this problem, 
> eg. accepting the fact that in many realistic applications some 
> documents will just need to be skipped because they are unprocessable? 
> This would be the first example of a Beam IO that has this concern, so I'd 
> like to confirm that my understanding is correct.
>
> On Fri, Sep 22, 2017 at 9:34 AM Allison, Timothy B. 
> <talli...@mitre.org>
> wrote:
>
> > Reuven,
> >
> > Thank you!  This suggests to me that it is a good idea to integrate 
> > Tika with Beam so that people don't have to 1) (re)discover the need 
> > to make their wrappers robust and then 2) have to reinvent these 
> > wheels for robustness.
> >
> > For kicks, see William Palmer's post on his toe-stubbing efforts 
> > with Hadoop [1].  He and other Tika users independently have wound 
> > up carrying out exactly your recommendation for 1) below.
> >
> > We have a MockParser that you can get to simulate regular 
> > exceptions,
> OOMs
> > and permanent hangs by asking Tika to parse a mock xml [2].
> >
> > > However if processing the document causes the process to crash, 
> > > then it
> > will be retried.
> > Any ideas on how to get around this?
> >
> > Thank you again.
> >
> > Cheers,
> >
> >Tim
> >
> > [1]
> >
> http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-web-content-nanite/
> > [2]
> >
> https://github.com/apache/tika/blob/master/tika-parsers/src/test/resources/test-documents/mock/example.xml
> >
>


RE: TikaIO concerns

2017-09-22 Thread Allison, Timothy B.
Y, I think you have it right.

> Tika library has a big problem with crashes and freezes

I wouldn't want to overstate it.  Crashes and freezes are exceedingly rare, but 
when you are processing millions/billions of files in the wild [1], they will 
happen.  We fix the problems or try to get our dependencies to fix the problems 
when we can, but given our past history, I have no reason to believe that these 
problems won't happen again.

Thank you, again!

Best,

Tim

[1] Stuff on the internet or ... some of our users are forensics examiners 
dealing with broken/corrupted files

P.S./FTR  
1) We've gathered a TB of data from CommonCrawl and we run regression tests 
against this TB (thank you, Rackspace for hosting our vm!) to try to identify 
these problems. 
2) We've started a fuzzing effort to try to identify problems.
3) We added "tika-batch" for robust single box fileshare/fileshare processing 
for our low volume users 
4) We're trying to get the message out.  Thank you for working with us!!!

-Original Message-
From: Eugene Kirpichov [mailto:kirpic...@google.com.INVALID] 
Sent: Friday, September 22, 2017 12:48 PM
To: d...@beam.apache.org
Cc: dev@tika.apache.org
Subject: Re: TikaIO concerns

Hi Tim,
From what you're saying it sounds like the Tika library has a big problem with 
crashes and freezes, and applying it at scale (eg. in the context of Beam) 
requires explicitly addressing this problem, eg. accepting the fact that in 
many realistic applications some documents will just need to be skipped because 
they are unprocessable? This would be the first example of a Beam IO that has this 
concern, so I'd like to confirm that my understanding is correct.

On Fri, Sep 22, 2017 at 9:34 AM Allison, Timothy B. <talli...@mitre.org>
wrote:

> Reuven,
>
> Thank you!  This suggests to me that it is a good idea to integrate 
> Tika with Beam so that people don't have to 1) (re)discover the need 
> to make their wrappers robust and then 2) have to reinvent these 
> wheels for robustness.
>
> For kicks, see William Palmer's post on his toe-stubbing efforts with 
> Hadoop [1].  He and other Tika users independently have wound up 
> carrying out exactly your recommendation for 1) below.
>
> We have a MockParser that you can get to simulate regular exceptions, 
> OOMs and permanent hangs by asking Tika to parse a mock xml [2].
>
> > However if processing the document causes the process to crash, then 
> > it
> will be retried.
> Any ideas on how to get around this?
>
> Thank you again.
>
> Cheers,
>
>Tim
>
> [1]
> http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-web-content-nanite/
> [2]
> https://github.com/apache/tika/blob/master/tika-parsers/src/test/resources/test-documents/mock/example.xml
>


Re: TikaIO concerns

2017-09-22 Thread Ben Chambers
Regarding specifically elements that are failing -- I believe some other IO
has used the concept of a "Dead Letter" side-output, where documents that
failed to process are side-output so the user can handle them appropriately.

On Fri, Sep 22, 2017 at 9:47 AM Eugene Kirpichov
 wrote:

> Hi Tim,
> From what you're saying it sounds like the Tika library has a big problem
> with crashes and freezes, and applying it at scale (eg. in the context
> of Beam) requires explicitly addressing this problem, eg. accepting the
> fact that in many realistic applications some documents will just need to
> be skipped because they are unprocessable? This would be the first example of a
> Beam IO that has this concern, so I'd like to confirm that my understanding
> is correct.
>
> On Fri, Sep 22, 2017 at 9:34 AM Allison, Timothy B. 
> wrote:
>
> > Reuven,
> >
> > Thank you!  This suggests to me that it is a good idea to integrate Tika
> > with Beam so that people don't have to 1) (re)discover the need to make
> > their wrappers robust and then 2) have to reinvent these wheels for
> > robustness.
> >
> > For kicks, see William Palmer's post on his toe-stubbing efforts with
> > Hadoop [1].  He and other Tika users independently have wound up carrying
> > out exactly your recommendation for 1) below.
> >
> > We have a MockParser that you can get to simulate regular exceptions,
> OOMs
> > and permanent hangs by asking Tika to parse a mock xml [2].
> >
> > > However if processing the document causes the process to crash, then it
> > will be retried.
> > Any ideas on how to get around this?
> >
> > Thank you again.
> >
> > Cheers,
> >
> >Tim
> >
> > [1]
> >
> http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-web-content-nanite/
> > [2]
> >
> https://github.com/apache/tika/blob/master/tika-parsers/src/test/resources/test-documents/mock/example.xml
> >
>


RE: TikaIO concerns

2017-09-22 Thread Allison, Timothy B.
Reuven,

Thank you!  This suggests to me that it is a good idea to integrate Tika with 
Beam so that people don't have to 1) (re)discover the need to make their 
wrappers robust and then 2) have to reinvent these wheels for robustness.  

For kicks, see William Palmer's post on his toe-stubbing efforts with Hadoop 
[1].  He and other Tika users independently have wound up carrying out exactly 
your recommendation for 1) below. 

We have a MockParser that you can get to simulate regular exceptions, OOMs and 
permanent hangs by asking Tika to parse a mock xml [2]. 

> However if processing the document causes the process to crash, then it will 
> be retried.
Any ideas on how to get around this?

Thank you again.

Cheers,

   Tim

[1] 
http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-web-content-nanite/
[2] 
https://github.com/apache/tika/blob/master/tika-parsers/src/test/resources/test-documents/mock/example.xml
 


RE: TikaIO concerns

2017-09-22 Thread Allison, Timothy B.
>> How will it work now, with new Metadata() passed to the AutoDetect parser, 
>> will this Metadata have a Metadata value for every attachment, possibly 
>> keyed by name?

An example of how to call the RecursiveParserWrapper:

https://github.com/apache/tika/blob/master/tika-example/src/main/java/org/apache/tika/example/ParsingExample.java#L138

To serialize the List<Metadata>, use:

https://github.com/apache/tika/blob/master/tika-serialization/src/main/java/org/apache/tika/metadata/serialization/JsonMetadataList.java#L47
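
Putting the two together, roughly (a sketch against the Tika 1.x API of the 
time; see the linked examples for the authoritative version):

  import java.io.InputStream;
  import java.io.StringWriter;
  import java.nio.file.Files;
  import java.nio.file.Path;
  import java.util.List;
  import org.apache.tika.metadata.Metadata;
  import org.apache.tika.metadata.serialization.JsonMetadataList;
  import org.apache.tika.parser.AutoDetectParser;
  import org.apache.tika.parser.ParseContext;
  import org.apache.tika.parser.RecursiveParserWrapper;
  import org.apache.tika.sax.BasicContentHandlerFactory;
  import org.xml.sax.helpers.DefaultHandler;

  public static String parseToJson(Path input) throws Exception {
    // Each document (container + embedded) gets its own handler and Metadata.
    RecursiveParserWrapper wrapper = new RecursiveParserWrapper(
        new AutoDetectParser(),
        new BasicContentHandlerFactory(BasicContentHandlerFactory.HANDLER_TYPE.TEXT, -1));
    Metadata metadata = new Metadata();
    try (InputStream is = Files.newInputStream(input)) {
      wrapper.parse(is, new DefaultHandler(), metadata, new ParseContext());
    }
    // One Metadata per document: index 0 is the container file, the rest are
    // the embedded documents; the text lives under the reserved "X-TIKA:content" key.
    List<Metadata> metadataList = wrapper.getMetadata();
    StringWriter writer = new StringWriter();
    JsonMetadataList.toJson(metadataList, writer);
    return writer.toString();
  }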
 




Re: TikaIO concerns

2017-09-22 Thread Sergey Beryozkin

Hi Tim

Sorry for getting into the RecursiveParserWrapper discussion first, I 
was certain the time zone difference was on my side :-)


How will it work now, with new Metadata() passed to the AutoDetect 
parser, will this Metadata have a Metadata value for every attachment, 
possibly keyed by name?


Thanks, Sergey
On 22/09/17 12:58, Allison, Timothy B. wrote:

@Timothy: can you tell more about this RecursiveParserWrapper? Is this 
something that the user can configure by specifying the Parser on TikaIO if 
they so wish?

Not at the moment, we’d have to do some coding on our end or within Beam.  The 
format is a list of maps/dicts for each file.  Each map contains all of the 
metadata, with one key reserved for the content.  If a file has no 
attachments, the list has length 1; otherwise there’s a map for each embedded 
file.  Unlike our legacy xhtml, this format maintains metadata for attachments.

The downside to this extract format is that it requires a full parse of the 
document and all data to be held in-memory before writing it.  On the other 
hand, while Tika tries to be streaming, and that was one of the critical early 
design goals, for some file formats, we simply have to parse the whole thing 
before we can have any output.

So, y, large files are a problem. :\

Example with purely made-up keys representing a pdf file containing an RTF 
attachment:
[
  {
    Name: “container file”,
    Author: “Chris Mattmann”,
    Content: “Four score and seven years ago…”,
    Content-type: “application/pdf”,
    …
  },
  {
    Name: “embedded file1”,
    Author: “Nick Burch”,
    Content: “When in the course of human events…”,
    Content-type: “application/rtf”
  }
]

From: Eugene Kirpichov [mailto:kirpic...@google.com]
Sent: Thursday, September 21, 2017 7:42 PM
To: Allison, Timothy B. <talli...@mitre.org>; d...@beam.apache.org
Cc: dev@tika.apache.org
Subject: Re: TikaIO concerns

Hi,
@Sergey:
- I already marked TikaIO @Experimental, so we can make changes.
- Yes, the String in KV<String, ParseResult> is the filename. I guess we could 
alternatively put it into ParseResult - don't have a strong opinion.

@Chris: unorderedness of Metadata would have helped if we extracted each 
Metadata item into a separate PCollection element, but that's not what we want 
to do (we want to have an element per document instead).

@Timothy: can you tell more about this RecursiveParserWrapper? Is this 
something that the user can configure by specifying the Parser on TikaIO if 
they so wish?

On Thu, Sep 21, 2017 at 2:23 PM Allison, Timothy B. 
<talli...@mitre.org> wrote:
Like Sergey, it’ll take me some time to understand your recommendations.  Thank 
you!

On one small point:

> return a PCollection<KV<String, TikaIO.ParseResult>>, where ParseResult is a 
> class with properties { String content, Metadata metadata }


For this option, I’d strongly encourage using the Json output from the 
RecursiveParserWrapper that contains metadata and content, and captures 
metadata even from embedded documents.


> However, since TikaIO can be applied to very large files, this could produce 
> very large elements, which is a bad idea

Large documents are a problem, no doubt about it…

From: Eugene Kirpichov [mailto:kirpic...@google.com]
Sent: Thursday, September 21, 2017 4:41 PM
To: Allison, Timothy B. <talli...@mitre.org>; d...@beam.apache.org
Cc: dev@tika.apache.org
Subject: Re: TikaIO concerns

Thanks all for the discussion. It seems we have consensus that both 
within-document order and association with the original filename are necessary, 
but currently absent from TikaIO.

Association with original file:
Sergey - Beam does not automatically provide a way to associate an element with 
the file it originated from: automatically tracking data provenance is a known 
very hard research problem on which many papers have been written, and obvious 
solutions are very easy to break. See related discussion at 
https://lists.apache.org/thread.html/32aab699db3901d9f0191ac7dbc0091b31cb8be85eee6349deaee671@%3Cuser.beam.apache.org%3E
 .

If you want the elements of your PCollection to contain additional information, 
you need the elements themselves to contain this information: the elements are 
self-contained and have no metadata associated with them (beyond the timestamp 
and windows, universal to the whole Beam model).

Order within a file:
The only way to have any kind of order within a PCollection is to have the elements of the 
PCollection contain something ordered, e.g. have a 
PCollection<List<String>>, where each List<String> is for one file [I'm assuming 
Tika, at a low level, works on a per-file basis?]. However, since TikaIO can be applied to 
very large files, this could produce very large elements, which is a bad idea. Because of 
this, I don't think the result of applying Tika to a single file can be 
encoded as a PCollection element.

RE: TikaIO concerns

2017-09-22 Thread Allison, Timothy B.
@Eugene: What's the best way to have Beam help us with these issues, or do 
these come for free with the Beam framework? 

1) a process-level timeout (because you can't actually kill a thread in Java)
2) a process-level restart on OOM
3) avoid trying to reprocess a badly behaving document
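
For context, 1) can only really be approximated from Java by forking a child 
JVM and killing it when it misbehaves, along these lines (the wrapper 
jar/class names and markAsBadDocument() are hypothetical placeholders, not 
existing Tika or Beam code):

  import java.util.concurrent.TimeUnit;

  // Fork a child JVM to parse one file; kill it if it hangs.
  Process process = new ProcessBuilder(
          "java", "-Xmx512m", "-cp", "tika-wrapper.jar", "TikaCliWrapper", inputFile)
      .redirectErrorStream(true)
      .start();
  if (!process.waitFor(60, TimeUnit.SECONDS)) {
    process.destroyForcibly();    // 1) process-level timeout: a stuck parse can be killed
    markAsBadDocument(inputFile); // 3) record it so it is not retried
  } else if (process.exitValue() != 0) {
    markAsBadDocument(inputFile); // 2) non-zero exit covers crashes/OOMs in the child
  }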



Re: TikaIO concerns

2017-09-22 Thread Sergey Beryozkin

Hi,
On 22/09/17 00:42, Eugene Kirpichov wrote:

Hi,
@Sergey:
- I already marked TikaIO @Experimental, so we can make changes.

OK, thanks

- Yes, the String in KV<String, ParseResult> is the filename. I guess we
could alternatively put it into ParseResult - don't have a strong opinion.

Sure. If you don't mind, the first thing I'd like to try, hopefully 
early next week, is to introduce ParseResult into the existing code.
I know it won't 'fix' the issues related to the ordering, but starting 
with a complete re-write would be a steep curve for me, so I'd first 
experiment with the idea (which I like very much) of wrapping 
several related pieces (content fragment, metadata, and the doc id/file 
name) into ParseResult.


By the way, reporting the Tika file (output) metadata with every 
ParseResult instance will work much better than I first thought. I 
assumed it wouldn't, because Tika does not issue callbacks when it 
populates the file metadata; it only does so for the actual content. But 
it does update the Metadata instance passed to it while it keeps parsing 
and finding new metadata, so the metadata pieces will be available to 
the pipeline as soon as they are found. Though Tika (1.17 ?) may need to 
ensure its Metadata is backed by a concurrent map for this approach to 
work; not sure yet...




@Chris: unorderedness of Metadata would have helped if we extracted each
Metadata item into a separate PCollection element, but that's not what we
want to do (we want to have an element per document instead).

@Timothy: can you tell more about this RecursiveParserWrapper? Is this
something that the user can configure by specifying the Parser on TikaIO if
they so wish?




As a general note, the Metadata passed to the top-level parser acts as 
a sink for the file (and embedded attachment) metadata, but also as a 
'helper' to the parser; right now TikaIO uses it to pass a media type 
hint if available (to help the auto-detect parser select the correct 
parser faster), and also a parser which will be used to parse the 
embedded attachments (I did it after Tim hinted about it earlier on...).


Not sure if RecursiveParserWrapper can act as a top-level parser or 
needs to be passed as a metadata property to AutoDetectParser; Tim will 
know :-)


Thanks, Sergey


On Thu, Sep 21, 2017 at 2:23 PM Allison, Timothy B. <talli...@mitre.org>
wrote:


Like Sergey, it’ll take me some time to understand your recommendations.
Thank you!



On one small point:


return a PCollection<KV<String, TikaIO.ParseResult>>, where ParseResult

is a class with properties { String content, Metadata metadata }



For this option, I’d strongly encourage using the Json output from the
RecursiveParserWrapper that contains metadata and content, and captures
metadata even from embedded documents.




However, since TikaIO can be applied to very large files, this could

produce very large elements, which is a bad idea

Large documents are a problem, no doubt about it…



*From:* Eugene Kirpichov [mailto:kirpic...@google.com]
*Sent:* Thursday, September 21, 2017 4:41 PM
*To:* Allison, Timothy B. <talli...@mitre.org>; d...@beam.apache.org
*Cc:* dev@tika.apache.org
*Subject:* Re: TikaIO concerns



Thanks all for the discussion. It seems we have consensus that both
within-document order and association with the original filename are
necessary, but currently absent from TikaIO.



*Association with original file:*

Sergey - Beam does not *automatically* provide a way to associate an
element with the file it originated from: automatically tracking data
provenance is a known very hard research problem on which many papers have
been written, and obvious solutions are very easy to break. See related
discussion at
https://lists.apache.org/thread.html/32aab699db3901d9f0191ac7dbc0091b31cb8be85eee6349deaee671@%3Cuser.beam.apache.org%3E
  .



If you want the elements of your PCollection to contain additional
information, you need the elements themselves to contain this information:
the elements are self-contained and have no metadata associated with them
(beyond the timestamp and windows, universal to the whole Beam model).



*Order within a file:*

The only way to have any kind of order within a PCollection is to have the
elements of the PCollection contain something ordered, e.g. have a
PCollection<List<String>>, where each List<String> is for one file [I'm assuming
Tika, at a low level, works on a per-file basis?]. However, since TikaIO
can be applied to very large files, this could produce very large elements,
which is a bad idea. Because of this, I don't think the result of applying
Tika to a single file can be encoded as a PCollection element.



Given both of these, I think that it's not possible to create a
*general-purpose* TikaIO transform that will be better than manual
invocation of Tika as a DoFn on the result of FileIO.readMatches().



However, looking at the examples at
https://tika.apache.org/1.16/examples.html - almost all of the examples
involve extracting a single String from each document.

RE: TikaIO concerns

2017-09-22 Thread Allison, Timothy B.
@Timothy: can you tell more about this RecursiveParserWrapper? Is this 
something that the user can configure by specifying the Parser on TikaIO if 
they so wish?

Not at the moment, we’d have to do some coding on our end or within Beam.  The 
format is a list of maps/dicts for each file.  Each map contains all of the 
metadata, with one key reserved for the content.  If a file has no 
attachments, the list has length 1; otherwise there’s a map for each embedded 
file.  Unlike our legacy xhtml, this format maintains metadata for attachments.

The downside to this extract format is that it requires a full parse of the 
document and all data to be held in-memory before writing it.  On the other 
hand, while Tika tries to be streaming, and that was one of the critical early 
design goals, for some file formats, we simply have to parse the whole thing 
before we can have any output.

So, y, large files are a problem. :\

Example with purely made-up keys representing a pdf file containing an RTF 
attachment:
[
  {
    Name: “container file”,
    Author: “Chris Mattmann”,
    Content: “Four score and seven years ago…”,
    Content-type: “application/pdf”,
    …
  },
  {
    Name: “embedded file1”,
    Author: “Nick Burch”,
    Content: “When in the course of human events…”,
    Content-type: “application/rtf”
  }
]

From: Eugene Kirpichov [mailto:kirpic...@google.com]
Sent: Thursday, September 21, 2017 7:42 PM
To: Allison, Timothy B. <talli...@mitre.org>; d...@beam.apache.org
Cc: dev@tika.apache.org
Subject: Re: TikaIO concerns

Hi,
@Sergey:
- I already marked TikaIO @Experimental, so we can make changes.
- Yes, the String in KV<String, ParseResult> is the filename. I guess we could 
alternatively put it into ParseResult - don't have a strong opinion.

@Chris: unorderedness of Metadata would have helped if we extracted each 
Metadata item into a separate PCollection element, but that's not what we want 
to do (we want to have an element per document instead).

@Timothy: can you tell more about this RecursiveParserWrapper? Is this 
something that the user can configure by specifying the Parser on TikaIO if 
they so wish?

On Thu, Sep 21, 2017 at 2:23 PM Allison, Timothy B. 
<talli...@mitre.org> wrote:
Like Sergey, it’ll take me some time to understand your recommendations.  Thank 
you!

On one small point:
>return a PCollection<KV<String, TikaIO.ParseResult>>, where ParseResult is a 
>class with properties { String content, Metadata metadata }

For this option, I’d strongly encourage using the Json output from the 
RecursiveParserWrapper that contains metadata and content, and captures 
metadata even from embedded documents.

> However, since TikaIO can be applied to very large files, this could produce 
> very large elements, which is a bad idea
Large documents are a problem, no doubt about it…

From: Eugene Kirpichov [mailto:kirpic...@google.com]
Sent: Thursday, September 21, 2017 4:41 PM
To: Allison, Timothy B. <talli...@mitre.org>; d...@beam.apache.org
Cc: dev@tika.apache.org
Subject: Re: TikaIO concerns

Thanks all for the discussion. It seems we have consensus that both 
within-document order and association with the original filename are necessary, 
but currently absent from TikaIO.

Association with original file:
Sergey - Beam does not automatically provide a way to associate an element with 
the file it originated from: automatically tracking data provenance is a known 
very hard research problem on which many papers have been written, and obvious 
solutions are very easy to break. See related discussion at 
https://lists.apache.org/thread.html/32aab699db3901d9f0191ac7dbc0091b31cb8be85eee6349deaee671@%3Cuser.beam.apache.org%3E
 .

If you want the elements of your PCollection to contain additional information, 
you need the elements themselves to contain this information: the elements are 
self-contained and have no metadata associated with them (beyond the timestamp 
and windows, universal to the whole Beam model).

Order within a file:
The only way to have any kind of order within a PCollection is to have the 
elements of the PCollection contain something ordered, e.g. have a 
PCollection<List<String>>, where each List<String> is for one file [I'm assuming 
Tika, at a low level, works on a per-file basis?]. However, since TikaIO can be 
applied to very large files, this could produce very large elements, which is a 
bad idea. Because of this, I don't think the result of applying Tika to a 
single file can be encoded as a PCollection element.

Given both of these, I think that it's not possible to create a general-purpose 
TikaIO transform that will be better than manual invocation of Tika as a DoFn 
on the result of FileIO.readMatches().

However, looking at the examples at https://tika.apache.org/1.16/examples.html

RE: TikaIO concerns

2017-09-21 Thread Allison, Timothy B.
Like Sergey, it’ll take me some time to understand your recommendations.  Thank 
you!

On one small point:
>return a PCollection<KV<String, TikaIO.ParseResult>>, where ParseResult is a 
>class with properties { String content, Metadata metadata }

For this option, I’d strongly encourage using the Json output from the 
RecursiveParserWrapper that contains metadata and content, and captures 
metadata even from embedded documents.

> However, since TikaIO can be applied to very large files, this could produce 
> very large elements, which is a bad idea
Large documents are a problem, no doubt about it…

From: Eugene Kirpichov [mailto:kirpic...@google.com]
Sent: Thursday, September 21, 2017 4:41 PM
To: Allison, Timothy B. <talli...@mitre.org>; d...@beam.apache.org
Cc: dev@tika.apache.org
Subject: Re: TikaIO concerns

Thanks all for the discussion. It seems we have consensus that both 
within-document order and association with the original filename are necessary, 
but currently absent from TikaIO.

Association with original file:
Sergey - Beam does not automatically provide a way to associate an element with 
the file it originated from: automatically tracking data provenance is a known 
very hard research problem on which many papers have been written, and obvious 
solutions are very easy to break. See related discussion at 
https://lists.apache.org/thread.html/32aab699db3901d9f0191ac7dbc0091b31cb8be85eee6349deaee671@%3Cuser.beam.apache.org%3E
 .

If you want the elements of your PCollection to contain additional information, 
you need the elements themselves to contain this information: the elements are 
self-contained and have no metadata associated with them (beyond the timestamp 
and windows, universal to the whole Beam model).

Order within a file:
The only way to have any kind of order within a PCollection is to have the 
elements of the PCollection contain something ordered, e.g. have a 
PCollection<List<String>>, where each List<String> is for one file [I'm assuming 
Tika, at a low level, works on a per-file basis?]. However, since TikaIO can be 
applied to very large files, this could produce very large elements, which is a 
bad idea. Because of this, I don't think the result of applying Tika to a 
single file can be encoded as a PCollection element.

Given both of these, I think that it's not possible to create a general-purpose 
TikaIO transform that will be better than manual invocation of Tika as a DoFn 
on the result of FileIO.readMatches().

However, looking at the examples at https://tika.apache.org/1.16/examples.html 
- almost all of the examples involve extracting a single String from each 
document. This use case, with the assumption that individual documents are 
small enough, can certainly be simplified and TikaIO could be a facade for 
doing just this.

E.g. TikaIO could:
- take as input a PCollection
- return a PCollection<KV<String, TikaIO.ParseResult>>, where ParseResult is a 
class with properties { String content, Metadata metadata }
- be configured by: a Parser (it implements Serializable so can be specified at 
pipeline construction time) and a ContentHandler whose toString() will go into 
"content". ContentHandler does not implement Serializable, so you can not 
specify it at construction time - however, you can let the user specify either 
its class (if it's a simple handler like a BodyContentHandler) or specify a 
lambda for creating the handler (SerializableFunction<Void, ContentHandler>), 
and potentially you can have a simpler facade for Tika.parseAsString() - e.g. 
call it TikaIO.parseAllAsStrings().

Example usage would look like:

  PCollection<KV<String, ParseResult>> parseResults =
      p.apply(FileIO.match().filepattern(...))
       .apply(FileIO.readMatches())
       .apply(TikaIO.parseAllAsStrings())

or:

       .apply(TikaIO.parseAll()
           .withParser(new AutoDetectParser())
           .withContentHandler(() -> new BodyContentHandler(new ToXMLContentHandler())))

You could also have shorthands for letting the user avoid using FileIO directly 
in simple cases, for example:
p.apply(TikaIO.parseAsStrings().from(filepattern))

This would of course be implemented as a ParDo or even MapElements, and you'll 
be able to share the code between parseAll and regular parse.

On Thu, Sep 21, 2017 at 7:38 AM Sergey Beryozkin 
<sberyoz...@gmail.com> wrote:
Hi Tim
On 21/09/17 14:33, Allison, Timothy B. wrote:
> Thank you, Sergey.
>
> My knowledge of Apache Beam is limited -- I saw Davor and Jean-Baptiste's 
> talk at ApacheCon in Miami, and I was and am totally impressed, but I haven't 
> had a chance to work with it yet.
>
>  From my perspective, if I understand this thread (and I may not!), getting 
> unordered text from _a given file_ is a non-starter for most applications.  
> The implementation needs to guarantee order per file, and the user has to be 
> able to link the "extract" back to a unique identifier for the document.

Re: TikaIO concerns

2017-09-21 Thread Chris Mattmann
 released ?

Thanks, Sergey
> On Thu, Sep 21, 2017 at 7:38 AM Sergey Beryozkin <sberyoz...@gmail.com>
> wrote:
> 
>> Hi Tim
>> On 21/09/17 14:33, Allison, Timothy B. wrote:
>>> Thank you, Sergey.
>>>
>>> My knowledge of Apache Beam is limited -- I saw Davor and
>> Jean-Baptiste's talk at ApacheCon in Miami, and I was and am totally
>> impressed, but I haven't had a chance to work with it yet.
>>>
>>>   From my perspective, if I understand this thread (and I may not!),
>> getting unordered text from _a given file_ is a non-starter for most
>> applications.  The implementation needs to guarantee order per file, and
>> the user has to be able to link the "extract" back to a unique identifier
>> for the document.  If the current implementation doesn't do those things,
>> we need to change it, IMHO.
>>>
>> Right now Tika-related reader does not associate a given text fragment
>> with the file name, so a function looking at some text and trying to
>> find where it came from won't be able to do so.
>>
>> So I asked how to do it in Beam, how to attach some context to the given
>> piece of data. I hope it can be done and if not - then perhaps some
>> improvement can be applied.
>>
>> Re the unordered text - yes - this is what we currently have with Beam +
>> TikaIO :-).
>>
>> The use-case I referred to earlier in this thread (upload PDFs - save
>> the possibly unordered text to Lucene with the file name 'attached', let
>> users search for the files containing some words - phrases, this works
>> OK given that I can see PDF parser for ex reporting the lines) can be
>> supported OK with the current TikaIO (provided we find a way to 'attach'
>> a file name to the flow).
>>
>> I see though supporting the total ordering can be a big deal in other
>> cases. Eugene, can you please explain how it can be done, is it
>> achievable in principle, without the users having to do some custom
>> coding ?
>>
>>> To the question of -- why is this in Beam at all; why don't we let users
>> call it if they want it?...
>>>
>>> No matter how much we do to Tika, it will behave badly sometimes --
>> permanent hangs requiring kill -9 and OOMs to name a few.  I imagine 
folks
>> using Beam -- folks likely with large batches of unruly/noisy documents 
--
>> are more likely to run into these problems than your average
>> couple-of-thousand-docs-from-our-own-company user. So, if there are 
things
>> we can do in Beam to prevent developers around the world from having to
>> reinvent the wheel for defenses against these problems, then I'd be
>> enormously grateful if we could put Tika into Beam.  That means:
>>>
>>> 1) a process-level timeout (because you can't actually kill a thread in
>> Java)
>>> 2) a process-level restart on OOM
>>> 3) avoid trying to reprocess a badly behaving document
>>>
>>> If Beam automatically handles those problems, then I'd say, y, let users
>> write their own code.  If there is so much as a single configuration knob
>> (and it sounds like Beam is against complex configuration...yay!) to get
>> that working in Beam, then I'd say, please integrate Tika into Beam.  
From
>> a safety perspective, it is critical to keep the extraction process
>> entirely separate (jvm, vm, m, rack, data center!) from the
>> transformation+loading steps.  IMHO, very few devs realize this because
>> Tika works well lots of the time...which is why it is critical for us to
>> make it easy for people to get it right all of the time.
>>>
>>> Even in my desktop (gah, y, desktop!) search app, I run Tika in batch
    >> mode first in one jvm, and then I kick off another process to do
>> transform/loading into Lucene/Solr from the .json files that Tika 
generates
>> for each input file.  If I were to scale up, I'd want to maintain this
>> complete separation of steps.
>>>
>>> Apologies if I've derailed the conversation or misunderstood this 
thread.
>>>
>> Major thanks for your input :-)
>>
>> Cheers, Sergey
>>
>>> Cheers,
>>>
>>>  Tim
>>>
>>> -Original Message-
>>> From: Sergey Beryozkin [mailto:s

Re: TikaIO concerns

2017-09-21 Thread Sergey Beryozkin
So I asked how to do it in Beam, how to attach some context to the given
piece of data. I hope it can be done and if not - then perhaps some
improvement can be applied.

Re the unordered text - yes - this is what we currently have with Beam +
TikaIO :-).

The use-case I referred to earlier in this thread (upload PDFs - save
the possibly unordered text to Lucene with the file name 'attached', let
users search for the files containing some words - phrases, this works
OK given that I can see PDF parser for ex reporting the lines) can be
supported OK with the current TikaIO (provided we find a way to 'attach'
a file name to the flow).

I see though supporting the total ordering can be a big deal in other
cases. Eugene, can you please explain how it can be done, is it
achievable in principle, without the users having to do some custom
coding ?


To the question of -- why is this in Beam at all; why don't we let users

call it if they want it?...


No matter how much we do to Tika, it will behave badly sometimes --

permanent hangs requiring kill -9 and OOMs to name a few.  I imagine folks
using Beam -- folks likely with large batches of unruly/noisy documents --
are more likely to run into these problems than your average
couple-of-thousand-docs-from-our-own-company user. So, if there are things
we can do in Beam to prevent developers around the world from having to
reinvent the wheel for defenses against these problems, then I'd be
enormously grateful if we could put Tika into Beam.  That means:


1) a process-level timeout (because you can't actually kill a thread in

Java)

2) a process-level restart on OOM
3) avoid trying to reprocess a badly behaving document

If Beam automatically handles those problems, then I'd say, y, let users

write their own code.  If there is so much as a single configuration knob
(and it sounds like Beam is against complex configuration...yay!) to get
that working in Beam, then I'd say, please integrate Tika into Beam.  From
a safety perspective, it is critical to keep the extraction process
entirely separate (jvm, vm, m, rack, data center!) from the
transformation+loading steps.  IMHO, very few devs realize this because
Tika works well lots of the time...which is why it is critical for us to
make it easy for people to get it right all of the time.


Even in my desktop (gah, y, desktop!) search app, I run Tika in batch

mode first in one jvm, and then I kick off another process to do
transform/loading into Lucene/Solr from the .json files that Tika generates
for each input file.  If I were to scale up, I'd want to maintain this
complete separation of steps.


Apologies if I've derailed the conversation or misunderstood this thread.


Major thanks for your input :-)

Cheers, Sergey


Cheers,

 Tim

-Original Message-
From: Sergey Beryozkin [mailto:sberyoz...@gmail.com]
Sent: Thursday, September 21, 2017 9:07 AM
To: d...@beam.apache.org
Cc: Allison, Timothy B. <talli...@mitre.org>
Subject: Re: TikaIO concerns

Hi All

Please welcome Tim, one of Apache Tika leads and practitioners.

Tim, thanks for joining in :-). If you have some great Apache Tika
stories to share (preferably involving cases where the order in which
Tika-produced data was consumed did not really matter), then please do
so :-).

At the moment, even though Tika ContentHandler will emit the ordered
data, the Beam runtime will have no guarantees that the downstream pipeline
components will see the data coming in the right order.


(FYI, I understand from the earlier comments that the total ordering is
also achievable but would require extra API support)


Other comments would be welcome too

Thanks, Sergey

On 21/09/17 10:55, Sergey Beryozkin wrote:

I noticed that the PDF and ODT parsers actually split by lines, not
individual words and nearly 100% sure I saw Tika reporting individual
lines when it was parsing the text files. The 'min text length'
feature can help with reporting several lines at a time, etc...

I'm working with this PDF all the time:
https://rwc.iacr.org/2017/Slides/nguyen.quan.pdf

try it too if you get a chance.

(and I can imagine not all PDFs/etc representing the 'story' but can
be for ex a log-like content too)

That said, I don't know how a parser for the format N will behave, it
depends on the individual parsers.

IMHO it's an equal candidate alongside Text-based bounded IOs...

I'd like to know though how to make a file name available to the
pipeline which is working with the current text fragment ?

Going to try and do some measurements and compare the sync vs async
parsing modes...

Asked the Tika team to support with some more examples...

Cheers, Sergey
On 20/09/17 22:17, Sergey Beryozkin wrote:

Hi,

thanks for the explanations,

On 20/09/17 16:41, Eugene Kirpichov wrote:

Hi!

TextIO returns an unordered soup of lines contained in all files you
ask it to read. People usually use TextIO for reading files where 1
line corres

Re: TikaIO concerns

2017-09-21 Thread Sergey Beryozkin
Cheers, Sergey


On 20/09/17 22:17, Sergey Beryozkin wrote:

Hi,

thanks for the explanations,

On 20/09/17 16:41, Eugene Kirpichov wrote:

Hi!

TextIO returns an unordered soup of lines contained in all files you
ask it to read. People usually use TextIO for reading files where 1
line corresponds to 1 independent data element, e.g. a log entry, or
a row of a CSV file - so discarding order is ok.

Just a side note: I'd probably want that to be ordered, though I guess
it depends...

However, there are a number of cases where TextIO is a poor fit:
- Cases where discarding order is not ok - e.g. if you're doing
natural language processing and the text files contain actual prose,
where you need to process a file as a whole. TextIO can't do that.
- Cases where you need to remember which file each element came from,
e.g. if you're creating a search index for the files: TextIO can't do
this either.

Both of these issues have been raised in the past against TextIO;
however it seems that the overwhelming majority of users of TextIO
use it for logs or CSV files or alike, so solving these issues has
not been a priority.
Currently they are solved in a general form via FileIO.read() which
gives you access to reading a full file yourself - people who want
more flexibility will be able to use standard Java text-parsing
utilities on a ReadableFile, without involving TextIO.
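
For reference, a sketch of that FileIO route, reusing the imports from the
earlier sketch (readFullyAsUTF8String is the ReadableFile convenience
method for reading a whole file as one string):

import java.io.IOException;

// Read each matched file whole, keeping the originating file name,
// e.g. as input for building a search index.
PCollection<KV<String, String>> wholeFiles = pipeline
    .apply(FileIO.match().filepattern("/docs/**"))
    .apply(FileIO.readMatches())
    .apply(ParDo.of(new DoFn<FileIO.ReadableFile, KV<String, String>>() {
      @ProcessElement
      public void processElement(ProcessContext c) throws IOException {
        FileIO.ReadableFile file = c.element();
        c.output(KV.of(file.getMetadata().resourceId().toString(),
                       file.readFullyAsUTF8String()));
      }
    }));

The difference from TextIO is that the element is the whole file, so the
per-file order is trivially preserved.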

Same applies for XmlIO: it is specifically designed for the narrow
use case where the files contain independent data entries, so
returning an unordered soup of them, with no association to the
original file, is the user's intention. XmlIO will not work for
processing more complex XML files that are not simply a sequence of
entries with the same tag, and it also does not remember the
original filename.
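
A sketch of that narrow XmlIO shape (Trade being a hypothetical
JAXB-annotated record class):

import org.apache.beam.sdk.io.xml.XmlIO;

// XmlIO reads a flat sequence of same-tag records; neither the order nor
// the original file name is preserved, by design.
PCollection<Trade> trades = pipeline.apply(
    XmlIO.<Trade>read()
        .from("/data/trades.xml")
        .withRootElement("trades")
        .withRecordElement("trade")
        .withRecordClass(Trade.class));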



OK...


However, if my understanding of Tika use cases is correct, it is
mainly used for extracting content from complex file formats - for
example, extracting text and images from PDF files or Word
documents. I believe this is the main difference between it and
TextIO - people usually use Tika for complex use cases where the
"unordered soup of stuff" abstraction is not useful.

My suspicion about this is confirmed by the fact that the crux of
the Tika API is ContentHandler
(http://docs.oracle.com/javase/6/docs/api/org/xml/sax/ContentHandler.html?is-external=true),
whose documentation says "The order of events in this interface is very
important, and mirrors the order of information in the document itself."

All that says is that a (Tika) ContentHandler will be a true SAX
ContentHandler...
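
For illustration: the event ordering is easy to observe with a small
decorator that records the character events as they arrive (a sketch,
assuming tika-core and tika-parsers on the classpath, inside a method
that declares throws Exception):

import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.apache.tika.sax.ContentHandlerDecorator;
import org.xml.sax.SAXException;

List<String> fragments = new ArrayList<>();
ContentHandlerDecorator inOrder =
    new ContentHandlerDecorator(new BodyContentHandler(-1)) {
      @Override
      public void characters(char[] ch, int start, int length) throws SAXException {
        super.characters(ch, start, length);
        fragments.add(new String(ch, start, length)); // arrives in document order
      }
    };
try (TikaInputStream is = TikaInputStream.get(Paths.get("nguyen.quan.pdf"))) {
  new AutoDetectParser().parse(is, inOrder, new Metadata(), new ParseContext());
}

The open question is only whether the Beam runner preserves that order
downstream.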


Let me give a few examples of what I think is possible with the raw
Tika API, but I think is not currently possible with TikaIO - please
correct me where I'm wrong, because I'm not particularly familiar
with Tika and am judging just based on what I read about it.
- User has 100,000 Word documents and wants to convert each of them
to text files for future natural language processing.
- User has 100,000 PDF files with financial st

Re: TikaIO concerns

2017-09-21 Thread Sergey Beryozkin

Hi Tim
On 21/09/17 14:33, Allison, Timothy B. wrote:

Thank you, Sergey.

My knowledge of Apache Beam is limited -- I saw Davor and Jean-Baptiste's talk 
at ApacheCon in Miami, and I was and am totally impressed, but I haven't had a 
chance to work with it yet.

 From my perspective, if I understand this thread (and I may not!), getting unordered 
text from _a given file_ is a non-starter for most applications.  The implementation 
needs to guarantee order per file, and the user has to be able to link the 
"extract" back to a unique identifier for the document.  If the current 
implementation doesn't do those things, we need to change it, IMHO.

Right now the Tika-related reader does not associate a given text fragment
with the file name, so a function looking at some text and trying to
find where it came from won't be able to do so.
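
A tiny sketch of the kind of association I mean (hypothetical names, not
an existing TikaIO type):

import java.io.Serializable;

// A value type linking an extract back to its source document.
public class FileExtract implements Serializable {
  private final String fileName; // unique identifier of the source file
  private final String content;  // parser output, in document order

  public FileExtract(String fileName, String content) {
    this.fileName = fileName;
    this.content = content;
  }

  public String getFileName() { return fileName; }
  public String getContent()  { return content; }
}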


So I asked how to do it in Beam: how to attach some context to the given
piece of data. I hope it can be done and, if not, then perhaps some
improvement can be applied.


Re the unordered text - yes - this is what we currently have with Beam + 
TikaIO :-).


The use-case I referred to earlier in this thread (upload PDFs, save the
possibly unordered text to Lucene with the file name 'attached', and let
users search for the files containing some words or phrases) can be
supported with the current TikaIO, provided we find a way to 'attach' a
file name to the flow; it works OK given that I can see the PDF parser,
for example, reporting whole lines.


I do see, though, that supporting total ordering can be a big deal in
other cases. Eugene, can you please explain how it can be done: is it
achievable in principle, without the users having to do some custom
coding?





Major thanks for your input :-)

Cheers, Sergey



RE: TikaIO concerns

2017-09-21 Thread Allison, Timothy B.
Thank you, Sergey.

My knowledge of Apache Beam is limited -- I saw Davor and Jean-Baptiste's talk 
at ApacheCon in Miami, and I was and am totally impressed, but I haven't had a 
chance to work with it yet.

From my perspective, if I understand this thread (and I may not!), getting 
unordered text from _a given file_ is a non-starter for most applications.  The 
implementation needs to guarantee order per file, and the user has to be able 
to link the "extract" back to a unique identifier for the document.  If the 
current implementation doesn't do those things, we need to change it, IMHO.

To the question of -- why is this in Beam at all; why don't we let users call 
it if they want it?... 

No matter how much we do to Tika, it will behave badly sometimes -- permanent 
hangs requiring kill -9 and OOMs to name a few.  I imagine folks using Beam -- 
folks likely with large batches of unruly/noisy documents -- are more likely to 
run into these problems than your average 
couple-of-thousand-docs-from-our-own-company user. So, if there are things we 
can do in Beam to prevent developers around the world from having to reinvent 
the wheel for defenses against these problems, then I'd be enormously grateful 
if we could put Tika into Beam.  That means: 

1) a process-level timeout (because you can't actually kill a thread in Java)
2) a process-level restart on OOM
3) avoid trying to reprocess a badly behaving document

If Beam automatically handles those problems, then I'd say, y, let users write 
their own code.  If there is so much as a single configuration knob (and it 
sounds like Beam is against complex configuration...yay!) to get that working 
in Beam, then I'd say, please integrate Tika into Beam.  From a safety 
perspective, it is critical to keep the extraction process entirely separate 
(jvm, vm, m, rack, data center!) from the transformation+loading steps.  IMHO, 
very few devs realize this because Tika works well lots of the time...which is 
why it is critical for us to make it easy for people to get it right all of the 
time.

Even in my desktop (gah, y, desktop!) search app, I run Tika in batch mode 
first in one jvm, and then I kick off another process to do transform/loading 
into Lucene/Solr from the .json files that Tika generates for each input file.  
If I were to scale up, I'd want to maintain this complete separation of steps.
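
To sketch what those three requirements can look like in practice (with a
hypothetical TikaParseMain single-file entry point, hypothetical
markAsSkipped bookkeeping, and inputFile/outputFile paths assumed in
scope; -XX:+ExitOnOutOfMemoryError needs a reasonably recent HotSpot JVM,
and the snippet sits inside a method that declares throws Exception):

import java.util.concurrent.TimeUnit;

// Fork a JVM per document (or per batch), kill it on timeout, and record
// failures so a badly behaving document is never retried.
String classpath = System.getProperty("java.class.path");
ProcessBuilder pb = new ProcessBuilder(
    "java", "-Xmx512m", "-XX:+ExitOnOutOfMemoryError",
    "-cp", classpath, "TikaParseMain", inputFile, outputFile);
Process parser = pb.inheritIO().start();
if (!parser.waitFor(60, TimeUnit.SECONDS)) {
  parser.destroyForcibly(); // (1) process-level timeout: a hung parse can't be stopped in-process
  markAsSkipped(inputFile); // (3) never reprocess the badly behaving document
} else if (parser.exitValue() != 0) {
  markAsSkipped(inputFile); // (2) the parse JVM died (e.g. OOM); move on to the next file
}

Keeping this loop outside the extraction JVM is the point: a hang or OOM
takes down only the forked parser, never the process doing the
transform/load work.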

Apologies if I've derailed the conversation or misunderstood this thread.

Cheers,

   Tim

-Original Message-
From: Sergey Beryozkin [mailto:sberyoz...@gmail.com] 
Sent: Thursday, September 21, 2017 9:07 AM
To: d...@beam.apache.org
Cc: Allison, Timothy B. <talli...@mitre.org>
Subject: Re: TikaIO concerns

Hi All

Please welcome Tim, one of Apache Tika leads and practitioners.

Tim, thanks for joining in :-). If you have some great Apache Tika stories to 
share (preferably involving the cases where it did not really matter the 
ordering in which Tika-produced data were dealt with by the
consumers) then please do so :-).

At the moment, even though Tika ContentHandler will emit the ordered data, the 
Beam runtime will have no guarantees that the downstream pipeline components 
will see the data coming in the right order.

(FYI, I understand from the earlier comments that the total ordering is also 
achievable but would require the extra API support)

Other comments would be welcome too

Thanks, Sergey

On 21/09/17 10:55, Sergey Beryozkin wrote:
> I noticed that the PDF and ODT parsers actually split by lines, not 
> individual words and nearly 100% sure I saw Tika reporting individual 
> lines when it was parsing the text files. The 'min text length' 
> feature can help with reporting several lines at a time, etc...
> 
> I'm working with this PDF all the time:
> https://rwc.iacr.org/2017/Slides/nguyen.quan.pdf
> 
> try it too if you get a chance.
> 
> (and I can imagine not all PDFs/etc representing the 'story' but can 
> be for ex a log-like content too)
> 
> That said, I don't know how a parser for the format N will behave, it 
> depends on the individual parsers.
> 
> IMHO it's an equal candidate alongside Text-based bounded IOs...
> 
> I'd like to know though how to make a file name available to the 
> pipeline which is working with the current text fragment ?
> 
> Going to try and do some measurements and compare the sync vs async 
> parsing modes...
> 
> Asked the Tika team to support with some more examples...
> 
> Cheers, Sergey