Great. I opened TIKA-2514 to track this. Pull requests are welcome! 😊
-----Original Message-----
From: Jim Idle [mailto:[email protected]]
Sent: Wednesday, November 29, 2017 8:58 PM
To: [email protected]
Subject: RE: Very slow parsing of a few PDF files

That would be a more practical alternative. I have time scheduled next week for an in-house solution, but I will first look properly at ForkParser and see if I could make something akin to that in a generic and configurable fashion. If so, I will submit the code.

Jim

> -----Original Message-----
> From: Allison, Timothy B. [mailto:[email protected]]
> Sent: Wednesday, November 29, 2017 23:52
> To: [email protected]
> Subject: RE: Very slow parsing of a few PDF files
>
> > I am going to have to write my own application specific solution
>
> Ugh. I'm sorry. If there's anything shareable, please do share.
>
> > ForkParser tries to serialize every class it thinks will be needed
> > across the connection and a lot of third party classes are not
> > serializable. I think that ForkParser is a good enough idea but I am
> > not sure how practical it is in a real-life application.
>
> You make a very good point. We've had issues serializing our own
> parsers...let alone user-specific addons. I wonder if we could modify
> ForkClient to kick off the forkserver process from a user-specified "bin"
> directory (instead of the current bootstrapped jar), and that bin
> directory could include at least the tika-core.jar,
> tika-fat-parsers.jar and tika-serialization.jar but could also
> include optional dependencies and user-specific dependencies.
>
> Hmmm....
