Re: Need advice: what pdf lib to use?
OK, but even in this case parsing the document would not be a violation, because all we actually need for Lucene is a collection of terms. That has nothing to do with printing or copying pieces of _text_. As long as you provide a method that returns just a Document (I mean a Lucene document), the permissions specified by the author of the PDF document are respected.

Ben Litchfield <[EMAIL PROTECTED]>
25.10.2004 17:59
Please respond to "Lucene Users List"

To: Lucene Users List <[EMAIL PROTECTED]>
cc: (bcc: Iouli Golovatyi/X/GP/Novartis)
Subject: Re: Need advice: what pdf lib to use?

> In order to write software that consumes PDF documents you must agree to a list of conditions. One of those conditions is that permissions specified by the author of the PDF document are respected.
>
> PDFBox complies with this statement, if there is software that does not then they are in violation of copyright law.
>
> That being said, PDFBox is open source so a user could make modifications to the source code, or as a PDF library could change permissions on a document.
>
> Ben
Re: Need advice: what pdf lib to use?
I recently started work on a project that needed to parse many documents, including PDFs, very quickly and on a large scale. PDFBox looks like the best choice except for its obvious speed issue. Eventually I took the time to go into the PDFBox source and rip out the individual string tokens. In doing so I lose quality on some documents: kerning isn't there and special characters aren't available, but you can improve on that yourself with a little more research. In exchange, stripping the raw text from some PDF documents went from 60 seconds down to 5 seconds, a trade I gladly make for the speed improvement.

It's open source; with any IDE you should be able to trace the function calls to find where to rip out the data you need.

-Chris Fraschetti

On Mon, 25 Oct 2004 18:07:52 +0200, sergiu gordea <[EMAIL PROTECTED]> wrote:
> Ben Litchfield wrote:
>
> >In order to write software that consumes PDF documents you must agree to a list of conditions. One of those conditions is that permissions specified by the author of the PDF document are respected.
> >
> >PDFBox complies with this statement, if there is software that does not then they are in violation of copyright law.
>
> I wanted to say something like this in one of my previous emails, when I said that anyone can modify the PDFBox code to remove the restrictions.
>
> >That being said, PDFBox is open source so a user could make modifications to the source code, or as a PDF library could change permissions on a document.
>
> This seems to me to be a business decision.
>
> Iouli, if your boss tells you that PDFBox is useless because it prevents you from getting the text from protected PDFs, then you should tell him: "I can fix it, but it is not legal." You can hack PDFBox, but before doing this you should make sure that the authors let you do it.
>
> All the best,
>
> Sergiu
Re: Need advice: what pdf lib to use?
Ben Litchfield wrote:

> In order to write software that consumes PDF documents you must agree to a list of conditions. One of those conditions is that permissions specified by the author of the PDF document are respected.
>
> PDFBox complies with this statement, if there is software that does not then they are in violation of copyright law.

I wanted to say something like this in one of my previous emails, when I said that anyone can modify the PDFBox code to remove the restrictions.

> That being said, PDFBox is open source so a user could make modifications to the source code, or as a PDF library could change permissions on a document.

This seems to me to be a business decision.

Iouli, if your boss tells you that PDFBox is useless because it prevents you from getting the text from protected PDFs, then you should tell him: "I can fix it, but it is not legal." You can hack PDFBox, but before doing this you should make sure that the authors let you do it.

All the best,

Sergiu

To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
Re: Need advice: what pdf lib to use?
In order to write software that consumes PDF documents you must agree to a list of conditions. One of those conditions is that permissions specified by the author of the PDF document are respected.

PDFBox complies with this statement; if there is software that does not, then it is in violation of copyright law.

That being said, PDFBox is open source, so a user could make modifications to the source code or, since it is a PDF library, could use it to change the permissions on a document.

Ben

On Mon, 25 Oct 2004 [EMAIL PROTECTED] wrote:
> Yes Ben, you are right.
>
> This would be correct functionality from a technical perspective. But look at it my way, through the eyes of an application programmer reporting to the big boss that c. 30% of the documents we deal with could not be indexed because of this stupid limitation. Neither he nor I have any influence on the PDF owners, or any idea what made them create files with document security set.
>
> In short, if you could also implement this "incorrect functionality" that the "closed source" guys did, it would be really great!
>
> As far as sponsoring is concerned, I would be ready to hack it (or at least to try) even for 1/3 of that fortune :)))
>
> J.
Re: Need advice: what pdf lib to use?
Ben,

As far as the dead loop problem is concerned, it looks like I experienced a slightly different problem. I published it under the same tracker entry.

Regards,
J.

> On the specific problem of the "dead loop", I reported an instance of this to Ben a week or so ago and he has fixed it in the latest nightlies. I expect an official release will include this bugfix soon. The file in question was unreadable with any PDF software I have, but someone managed to create it somehow...
>
> http://sourceforge.net/tracker/index.php?func=detail&aid=1037145&group_id=78314&atid=552832
>
> I've found PDFBox to be pretty good. The only time I get problems is with corrupted or egregiously bad PDF files.
>
> B.
Re: Need advice: what pdf lib to use?
As far as
Re: Need advice: what pdf lib to use?
Yes Ben, you are right.

This would be correct functionality from a technical perspective. But look at it my way, through the eyes of an application programmer reporting to the big boss that c. 30% of the documents we deal with could not be indexed because of this stupid limitation. Neither he nor I have any influence on the PDF owners, or any idea what made them create files with document security set.

In short, if you could also implement this "incorrect functionality" that the "closed source" guys did, it would be really great!

As far as sponsoring is concerned, I would be ready to hack it (or at least to try) even for 1/3 of that fortune :)))

J.

Ben Litchfield <[EMAIL PROTECTED]>
25.10.2004 14:02
Please respond to "Lucene Users List"

To: Lucene Users List <[EMAIL PROTECTED]>
cc: (bcc: Iouli Golovatyi/X/GP/Novartis)
Subject: Re: Need advice: what pdf lib to use?

> PDFBox does not 'stumble' when it gives that message, that is correct functionality if that permission is not allowed.
>
> If your company is willing to pay a 'fortune' why not sponsor a change to an open source project for half a fortune.
>
> Ben
> http://www.pdfbox.org
Re: Need advice: what pdf lib to use?
PDFBox does not 'stumble' when it gives that message; that is correct functionality if that permission is not allowed.

If your company is willing to pay a 'fortune', why not sponsor a change to an open source project for half a fortune?

Ben
http://www.pdfbox.org

On Mon, 25 Oct 2004 [EMAIL PROTECTED] wrote:
> PDFBox also stumbles with "class java.io.IOException with message: - You do not have permission to extract text" when the document is copy/print protected.
>
> I have now tested the Snowtide commercial product and it looks like it could process these files as well. Performance was also not so bad. Unfortunately the test result could not be considered 100% conclusive, because the free version processed just the first 8 pages. After all, this product costs a fortune (as long as the company is ready to pay I don't really mind :))
>
> J.
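Ben's point is that the IOException reports a permission restriction rather than a parser failure. An indexing loop can treat it that way: skip the file, record why, and keep going, so the share of restricted documents can be reported afterwards. The following is only a minimal sketch under stated assumptions. `TextExtractor`, `BatchReport`, and `SkippingIndexer` are hypothetical names, and `TextExtractor` stands in for whatever PDF library call is actually used; none of this is PDFBox API.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for the call that extracts text from one PDF. */
interface TextExtractor {
    String extract(String path) throws IOException;
}

/** Collects which files were indexed and which were skipped, with the reason. */
class BatchReport {
    final List<String> indexed = new ArrayList<>();
    final List<String> skipped = new ArrayList<>();

    /** Share of files that could not be extracted, in percent. */
    double skippedPercent() {
        int total = indexed.size() + skipped.size();
        return total == 0 ? 0.0 : 100.0 * skipped.size() / total;
    }
}

class SkippingIndexer {
    /** Tries every file; an IOException marks the file as skipped instead of aborting the batch. */
    static BatchReport run(List<String> paths, TextExtractor extractor) {
        BatchReport report = new BatchReport();
        for (String path : paths) {
            try {
                // A real indexer would pass the returned text on to Lucene here.
                extractor.extract(path);
                report.indexed.add(path);
            } catch (IOException e) {
                report.skipped.add(path + ": " + e.getMessage());
            }
        }
        return report;
    }

    public static void main(String[] args) {
        // Fake extractor for the demo: files named "locked..." count as protected.
        TextExtractor fake = path -> {
            if (path.startsWith("locked")) {
                throw new IOException("You do not have permission to extract text");
            }
            return "text of " + path;
        };
        BatchReport r = run(Arrays.asList("a.pdf", "locked-b.pdf", "c.pdf"), fake);
        System.out.println(r.indexed.size() + " indexed, " + r.skipped.size() + " skipped");
        for (String s : r.skipped) {
            System.out.println("skipped " + s);
        }
    }
}
```

The skipped list keeps the exception message next to the file name, which makes it possible to tell permission refusals apart from other I/O errors when reviewing the batch.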
Re: Need advice: what pdf lib to use?
Hi Iouli,

If you don't think it is illegal, you can hack the PDFBox code to remove the protection ...

Sergiu

[EMAIL PROTECTED] wrote:
> PDFBox also stumbles with "class java.io.IOException with message: - You do not have permission to extract text" when the document is copy/print protected.
>
> I have now tested the Snowtide commercial product and it looks like it could process these files as well. Performance was also not so bad. Unfortunately the test result could not be considered 100% conclusive, because the free version processed just the first 8 pages. After all, this product costs a fortune (as long as the company is ready to pay I don't really mind :))
>
> J.
Re: Need advice: what pdf lib to use?
PDFBox also stumbles with "class java.io.IOException with message: - You do not have permission to extract text" when the document is copy/print protected.

I have now tested the Snowtide commercial product and it looks like it could process these files as well. Performance was also not so bad. Unfortunately the test result could not be considered 100% conclusive, because the free version processed just the first 8 pages. After all, this product costs a fortune (as long as the company is ready to pay I don't really mind :))

J.

Robert Newson <[EMAIL PROTECTED]>
Sent by: news <[EMAIL PROTECTED]>
24.10.2004 17:44
Please respond to "Lucene Users List"

To: [EMAIL PROTECTED]
cc: (bcc: Iouli Golovatyi/X/GP/Novartis)
Subject: Re: Need advice: what pdf lib to use?

> On the specific problem of the "dead loop", I reported an instance of this to Ben a week or so ago and he has fixed it in the latest nightlies. I expect an official release will include this bugfix soon. The file in question was unreadable with any PDF software I have, but someone managed to create it somehow...
>
> http://sourceforge.net/tracker/index.php?func=detail&aid=1037145&group_id=78314&atid=552832
>
> I've found PDFBox to be pretty good. The only time I get problems is with corrupted or egregiously bad PDF files.
>
> B.
Re: Need advice: what pdf lib to use?
Ben,

many thanks for your comprehensive answer. Unfortunately I cannot send the problem PDFs because they are the property of the company and are top secret :)

Regards,
J.

Ben Litchfield <[EMAIL PROTECTED]>
22.10.2004 14:40
Please respond to "Lucene Users List"

To: Lucene Users List <[EMAIL PROTECTED]>
cc: (bcc: Iouli Golovatyi/X/GP/Novartis)
Subject: Re: Need advice: what pdf lib to use?

> Please post any PDFBox issues you notice on the PDFBox sourceforge bug list, if possible attach/email any problem PDFs that you encounter.
>
> There are some efforts underway to improve the speed of PDFBox, you can monitor the progress at
> http://sourceforge.net/tracker/index.php?func=detail&aid=1046300&group_id=78314&atid=552832
>
> As for other suggestions, I know some people have utilized xpdf (open source but non Java) to extract the text.
>
> For other Java solutions
>
> PDFTextStream (commercial) - "Fastest PDF-to-Text Solution for Java"
> http://snowtide.com/home/PDFTextStream/
>
> Etymon PJ (GPL)
> http://www.etymon.com/
>
> Ben
> http://www.pdfbox.org
Re: Need advice: what pdf lib to use?
[EMAIL PROTECTED] wrote:
> Hello all,
>
> I need a piece of advice/experience.
>
> What PDF parser (written in Java) would you recommend?
>
> I have now played with PDFBox-0.6.7a and would not say I was too satisfied with it.
>
> On certain PDFs (not well formatted, but still readable with Acrobat) it ran into a dead loop (this I could fix in code), and on one file it produced an "out of memory" error and killed the JVM :( (this problem I could not identify yet).
>
> On top of that, the performance was not too great either: it took c. 19 h to index 13000 files (c. 3.5Gb).
>
> Regards,
> J.

On the specific problem of the "dead loop", I reported an instance of this to Ben a week or so ago and he has fixed it in the latest nightlies. I expect an official release will include this bugfix soon. The file in question was unreadable with any PDF software I have, but someone managed to create it somehow...

http://sourceforge.net/tracker/index.php?func=detail&aid=1037145&group_id=78314&atid=552832

I've found PDFBox to be pretty good. The only time I get problems is with corrupted or egregiously bad PDF files.

B.
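Until a fixed release is available, one way to keep a single bad file from stalling a whole batch is to run each extraction on a worker thread with a time limit. The sketch below assumes a modern JDK with java.util.concurrent, and `GuardedExtract` and `extractWithTimeout` are hypothetical names; the `Callable` stands in for the actual parser call. It has a real limit: cancelling only interrupts the worker, so a parser loop that never checks for interruption keeps spinning on its daemon thread until the JVM exits. A hard stop would need a separate process.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class GuardedExtract {
    /**
     * Runs the task on a daemon worker thread.
     * Returns the task's result, or null if it fails or does not finish in time.
     */
    static String extractWithTimeout(Callable<String> task, long timeoutMillis) {
        ExecutorService worker = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "pdf-extract");
            t.setDaemon(true); // a stuck worker must not keep the JVM alive
            return t;
        });
        Future<String> future = worker.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupts the worker; cannot force-stop it
            return null;
        } catch (ExecutionException e) {
            return null; // the task itself threw, e.g. an IOException
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            future.cancel(true);
            return null;
        } finally {
            worker.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Fast task: result comes back normally.
        System.out.println(extractWithTimeout(() -> "some text", 1000));
        // Slow task simulated with sleep: gives up after 100 ms and yields null.
        System.out.println(extractWithTimeout(() -> {
            Thread.sleep(5000);
            return "too late";
        }, 100));
    }
}
```

A caller would treat a null result like any other skipped file and log the path for later inspection.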
Re: Need advice: what pdf lib to use?
Please post any PDFBox issues you notice on the PDFBox SourceForge bug list and, if possible, attach/email any problem PDFs that you encounter.

There are some efforts underway to improve the speed of PDFBox; you can monitor the progress at
http://sourceforge.net/tracker/index.php?func=detail&aid=1046300&group_id=78314&atid=552832

As for other suggestions, I know some people have used xpdf (open source but non-Java) to extract the text.

Other Java solutions:

PDFTextStream (commercial) - "Fastest PDF-to-Text Solution for Java"
http://snowtide.com/home/PDFTextStream/

Etymon PJ (GPL)
http://www.etymon.com/

Ben
http://www.pdfbox.org

On Fri, 22 Oct 2004 [EMAIL PROTECTED] wrote:
> Hello all,
>
> I need a piece of advice/experience.
>
> What PDF parser (written in Java) would you recommend?
>
> I have now played with PDFBox-0.6.7a and would not say I was too satisfied with it.
>
> On certain PDFs (not well formatted, but still readable with Acrobat) it ran into a dead loop (this I could fix in code), and on one file it produced an "out of memory" error and killed the JVM :( (this problem I could not identify yet).
>
> On top of that, the performance was not too great either: it took c. 19 h to index 13000 files (c. 3.5Gb).
>
> Regards,
> J.
Need advice: what pdf lib to use?
Hello all,

I need a piece of advice/experience.

What PDF parser (written in Java) would you recommend?

I have now played with PDFBox-0.6.7a and would not say I was too satisfied with it.

On certain PDFs (not well formatted, but still readable with Acrobat) it ran into a dead loop (this I could fix in code), and on one file it produced an "out of memory" error and killed the JVM :( (this problem I could not identify yet).

On top of that, the performance was not too great either: it took c. 19 h to index 13000 files (c. 3.5Gb).

Regards,
J.
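To put the reported figures in per-file terms: 19 hours for 13000 files works out to a little over 5 seconds per file, and, taking the reported 3.5 as binary gigabytes (an assumption, since the post does not say), to roughly 54 KB of PDF input per second. The small sketch below just redoes that arithmetic; `IndexingThroughput` and its method names are made up for the illustration.

```java
class IndexingThroughput {
    /** Average wall-clock seconds spent per file. */
    static double secondsPerFile(double hours, int files) {
        return hours * 3600.0 / files;
    }

    /** Average input rate in KB/s, treating 1 GB as 1024 * 1024 KB. */
    static double kilobytesPerSecond(double gigabytes, double hours) {
        return gigabytes * 1024.0 * 1024.0 / (hours * 3600.0);
    }

    public static void main(String[] args) {
        // Figures from the post: c. 19 h, 13000 files, c. 3.5 GB.
        System.out.printf("%.2f s per file%n", secondsPerFile(19, 13000));
        System.out.printf("%.1f KB/s%n", kilobytesPerSecond(3.5, 19));
    }
}
```

Both inputs are approximate ("c."), so the results are ballpark values and include Lucene indexing time as well as PDF parsing.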