[Fwd: Re: Fix existing POM of poi 3.0-FINAL]
FYI: the POM of 3.0-FINAL has been fixed in the central Maven repository.

-------- Original Message --------
Subject: Re: Fix existing POM of poi 3.0-FINAL
Date: Wed, 30 May 2007 13:22:25 -0700
From: Carlos Sanchez [EMAIL PROTECTED]
To: Joerg Hohwiller [EMAIL PROTECTED]
References: [EMAIL PROTECTED]

fixed now

On 5/30/07, Joerg Hohwiller [EMAIL PROTECTED] wrote:

Hi Carlos,

sorry for mailing you directly, but this is rather urgent. As commented in
http://jira.codehaus.org/browse/MEV-479
the POM of Apache POI in version 3.0-FINAL has been released but is invalid:
http://repo1.maven.org/maven2/org/apache/poi/poi/3.0-FINAL/poi-3.0-FINAL.pom

Too bad that this can happen at all. I am very sorry that it did, and I know the philosophy of not changing anything after it has been released to ibiblio. Still, could you be so kind as to make an exception and remove the logo tag from the organization section? I have also attached the fixed POM to the MEV-479 issue.

If we leave it as it is, that helps nobody. If you fix it soon, only very few local repositories will contain the invalid POM, while the many others that fetch it later will receive the correct one. The problem is that Maven does NOT process the POM as it is now, because it does NOT follow the spec. I somehow did not double-check, and the POI community is not too interested in Maven. 3.0-FINAL is the first release since 2.5.1 from 2004-08-04, and if we have to wait another 3 years before 3.1 comes out and fixes the POM, it would really be a pity ;)

Thank you so much
Jörg

--
I could give you my word as a Spaniard.
No good. I've known too many Spaniards.
        -- The Princess Bride

To unsubscribe, e-mail: [EMAIL PROTECTED]
Mailing List: http://jakarta.apache.org/site/mail2.html#poi
The Apache Jakarta Poi Project: http://jakarta.apache.org/poi/
Re: POI 3.0 RC4
Hi Nick,

Congratulations on the 3.0-FINAL release. After such a long time it's really good to have it out, and it seems to work well.

However, I am very sorry to report that the POM still has a small problem. It did not occur when I checked RC4, but I can now reproduce it. It seems I did not invest enough time in checking, since I am busy with several other projects in my little spare time. I feel a little ashamed, because the problem is a general violation of the POM spec that I should have noticed, especially as it was reported in the bug you cited:
http://issues.apache.org/bugzilla/show_bug.cgi?id=39977

The logo tag is invalid in the organization section. You need to remove it. I have seen that the POM is not included in the binary distribution, so the only problem is here:
http://repo1.maven.org/maven2/org/apache/poi/poi/3.0-FINAL/poi-3.0-FINAL.pom

I have added a comment here:
http://jira.codehaus.org/browse/MEV-479
I hope they will fix this soon. I will do my best to push this topic...

BTW, did you notice this one?
http://jira.codehaus.org/browse/MEV-478

Regards
Jörg

On Mon, 7 May 2007, Joerg Hohwiller wrote:

Thanks for your work. I have tested the mavenized version of POI with Maven 2.0.6. Your POM is gracefully accepted and everything works well.

Great, that's good to know

I personally like the idea of not splitting the POI artifact further into smaller pieces. However, you should face the fact that this has been done for 2.5.1-final:
http://repo1.maven.org/maven2/poi/
This will definitely cause confusion if Maven users want to upgrade from poi 2.5.1-final to 3.0.

Are there really that many people using contrib and scratchpad from 2.5.1 though? Almost everyone who uses contrib and scratchpad will have upgraded to a newer version of poi (e.g. the alphas, which only had the one artifact), so that they can get the new scratchpad functionality. My feeling is that people will either be using 2.5.1 core (but not contrib+scratchpad), so nothing will change for them with 3.0, or they'll already be using the single-artifact 3.0 alphas. Hopefully I've guessed right, and we won't cause problems for many people

Nick
Re: POI 3.0 RC4
Hi All

Hi Nick,

Release candidate 4 of POI 3.0 is now available:
http://people.apache.org/~nick/POI-3.0-RC4/
For maven users, there's also the pom, binary and source jars available for testing with:
http://people.apache.org/~nick/POI-3.0-RC4/maven/
There have only been a few bug fixes since RC3, but lots of work on the maven artifacts.

Thanks for your work. I have tested the mavenized version of POI with Maven 2.0.6. Your POM is gracefully accepted and everything works well.

I personally like the idea of not splitting the POI artifact further into smaller pieces. However, you should face the fact that this has been done for 2.5.1-final:
http://repo1.maven.org/maven2/poi/
This will definitely cause confusion if Maven users want to upgrade from poi 2.5.1-final to 3.0. They will have to remove the additional dependencies on poi-contrib and poi-scratchpad. For compatibility reasons I would recommend keeping the existing scheme, even though this causes the overhead of building and maintaining 3 POMs instead of just one. Anyway, the choice is yours...

As before, please let us know if things are still broken, and ensure any open bug reports have all the information we'll need.

see above...

Nick

Best regards
Jörg
Re: 3.0 release (and maven)
Hi Nick,

On Wed, 25 Apr 2007, Joerg Hohwiller wrote:

But from what I see in your subversion, you still need to replace issueTrackingUrl with issueManagement.

Done

I also went through your ant build and saw that you do not modify the root tag metadata. That tag also needs to be changed to project, which might be the main reason for the bugzilla issue about the POM file.

Done

Since you have 3 artifacts (jar files): poi, poi-contrib, poi-scratchpad, you will need 3 individual pom.xml files.

Closer inspection of the build process shows that's not true. For the main release, we do have 3 different jar files. However, we build a different jar file for pushing out to the maven repository. The maven jar file (and source jar file) contains all of main, contrib and scratchpad, rolled into one. So, we do only need the one pom file after all.

As far as I can see, this has been done in addition to the 3 individual artifacts, as you can see here:
http://repo1.maven.org/maven2/poi/
Whoever did this, I don't know...

Guess that makes things simpler, shame I didn't remember earlier!

Okay, so this will be the way you want to go with 3.0, if I got you right...

Then you would have a version. I would suggest calling the version 3.0. From your suggestion above the version would be 3.0-final. If Maven needs to compare the version with another one to decide which one is newer, it would be better and more straightforward to use 3.0.

The official version is 3.0-final though, not 3.0. It's hardly an unusual naming convention, so I'd be surprised if maven couldn't tell that 3.0-final is newer than 3.0-alpha3. If not, I'm sure they take patches :)

It will surely work without a patch.

The artifacts would be located in the repository like this:
org/apache/poi/poi-contrib/3.0/poi-contrib-3.0.pom
org/apache/poi/poi-contrib/3.0/poi-contrib-3.0.jar
org/apache/poi/poi-contrib/3.0/poi-contrib-3.0-sources.jar

It looks like the apache thing is to put poms in one directory under m1-ibiblio-rsync-repository, and jars in another. I guess the ibiblio sync script does the right thing with that

When I do rc4, I'll also do a maven pom, and jar + source jar. I'll sling them on people.apache.org as usual, so it'd be great if you could check they look fine.

Yep, I will double-check this. Let me know when it's out...

Cheers for all the maven help.

No worries - thanks for taking care of the Maven users.

Not sure you've won too many converts to maven though...

I am not an evangelist. I like it, but feel free to do whatever you like ;)

Nick

Regards
Jörg
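As a rough illustration of the version question discussed in this message (whether 3.0-final sorts after 3.0-alpha3, and why a plain 3.0 may be more straightforward), here is a minimal Java sketch. It is not Maven's actual comparison code. The class name SimpleVersionOrder and both ordering rules (an unqualified version outranks any qualified one, and qualifiers are compared as plain case-insensitive strings) are assumptions made only for this example.

```java
import java.util.Comparator;

/**
 * Simplified, hypothetical version ordering for illustration only.
 * This is NOT Maven's real algorithm; both rules below are assumptions.
 */
final class SimpleVersionOrder implements Comparator<String> {

    @Override
    public int compare(String a, String b) {
        String[] pa = split(a);
        String[] pb = split(b);
        // First compare the dotted numeric part, e.g. "3.0" or "2.5.1".
        int byNumbers = compareNumbers(pa[0], pb[0]);
        if (byNumbers != 0) {
            return byNumbers;
        }
        String qa = pa[1];
        String qb = pb[1];
        if (qa.isEmpty() && qb.isEmpty()) {
            return 0;
        }
        // Assumption 1: a version without qualifier outranks a qualified one.
        if (qa.isEmpty()) {
            return 1;
        }
        if (qb.isEmpty()) {
            return -1;
        }
        // Assumption 2: qualifiers are ordered as plain strings, ignoring case.
        return qa.compareToIgnoreCase(qb);
    }

    /** Splits "3.0-final" into {"3.0", "final"}; the qualifier may be empty. */
    private static String[] split(String version) {
        int dash = version.indexOf('-');
        if (dash < 0) {
            return new String[] { version, "" };
        }
        return new String[] { version.substring(0, dash), version.substring(dash + 1) };
    }

    /** Compares dotted numbers; missing segments count as 0, non-numeric segments throw. */
    private static int compareNumbers(String a, String b) {
        String[] xs = a.split("\\.");
        String[] ys = b.split("\\.");
        int n = Math.max(xs.length, ys.length);
        for (int i = 0; i < n; i++) {
            int x = i < xs.length ? Integer.parseInt(xs[i]) : 0;
            int y = i < ys.length ? Integer.parseInt(ys[i]) : 0;
            if (x != y) {
                return Integer.compare(x, y);
            }
        }
        return 0;
    }
}
```

Under these assumed rules 3.0-final does sort after 3.0-alpha3, but it sorts before 3.0-rc4, because "final" precedes "rc4" alphabetically. A plain 3.0 avoids that kind of surprise, since it outranks every qualified 3.0 variant.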
Re: 3.0 release (and maven)
Hi Nick,

On Tue, 24 Apr 2007, Joerg Hohwiller wrote:

    <issueManagement>
      <system>bugzilla</system>
      <url>http://issues.apache.org/bugzilla/</url>
    </issueManagement>

Thanks, I've updated that

Okay. But from what I see in your subversion, you still need to replace issueTrackingUrl with issueManagement. I also went through your ant build and saw that you do not modify the root tag metadata. That tag also needs to be changed to project, which might be the main reason for the bugzilla issue about the POM file.

BTW: I assume that we are just talking about maven2 here and NOT about maven1.

Since that seems to be the version most people use, I guess so.

Since maven1 does NOT support transitive dependencies, the POMs are not too relevant for its users. Maybe the Maven guys create maven1 artifacts automatically when maven2 artifacts are added - I do not know...

As I said, I'm not a maven user myself, and I normally end up with a headache whenever I try and learn

Since you have 3 artifacts (jar files): poi, poi-contrib, poi-scratchpad, you will need 3 individual pom.xml files. The poi-contrib and poi-scratchpad POMs also need dependencies. Further, you could think about creating an additional pom file to keep the parent metadata (license, scm, issueManagement, etc.) out of the other POMs and avoid redundancy. But since you named the main artifact poi and not e.g. poi-core, this might not fit here.

I don't think it'd be too hard to have the ant build file spit out three poms from the same template. That would also save us the faff of deciding what needs to get split into a parent pom, what doesn't etc.

Yep, I think the best thing would be to have one template per POM.

For contrib and scratchpad, I guess we want to call the output pom something like poi-scratchpad-3.0-final.pom. Is the artifactId then poi-scratchpad? Is the dependency groupId=org.apache.poi, artifactId=poi, and version=(same version) ?

The groupId should be the same for all of your artifacts (JAR files and their POMs). As we discussed earlier, it would make sense to use org.apache.poi. The artifactId is the main part of the name of the JAR file, so you would have the artifact IDs poi, poi-contrib, and poi-scratchpad. Then you would have a version. I would suggest calling the version 3.0. From your suggestion above the version would be 3.0-final. If Maven needs to compare the version with another one to decide which one is newer, it would be better and more straightforward to use 3.0. So for

    <groupId>org.apache.poi</groupId>
    <artifactId>poi-contrib</artifactId>
    <version>3.0</version>

the artifacts would be located in the repository like this:
org/apache/poi/poi-contrib/3.0/poi-contrib-3.0.pom
org/apache/poi/poi-contrib/3.0/poi-contrib-3.0.jar
org/apache/poi/poi-contrib/3.0/poi-contrib-3.0-sources.jar

The general pattern is:
groupId.replace('.', '/')/artifactId/version/artifactId-version*

The *-sources.jar is not required, but if you supply it and someone uses Maven to configure their IDE (e.g. Eclipse, NetBeans, IntelliJ), the sources are attached automatically and can be browsed in the IDE, which is very handy.

If you want I could create the three POM files for you and attach them to the bugzilla issue.

If it's just the scratchpad and contrib poms that need updating (the main one now being fine), could you do a version for one of them? I can then work it into the ant build file.
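The general pattern given above (groupId with dots replaced by slashes, then artifactId, version, and a file name built from artifactId and version) can be sketched in Java as follows. This is only a minimal illustration of that pattern. The class name RepositoryLayout, the method name artifactPath and the classifier/extension parameters are hypothetical names chosen for the example and are not part of Maven or POI.

```java
/**
 * Hypothetical helper that builds a repository-relative path following
 * the pattern groupId.replace('.', '/')/artifactId/version/artifactId-version*.
 */
final class RepositoryLayout {

    private RepositoryLayout() {
        // static helper only
    }

    /**
     * @param classifier optional suffix such as "sources"; may be null or empty
     * @param extension  file extension without the dot, e.g. "pom" or "jar"
     */
    static String artifactPath(String groupId, String artifactId, String version,
                               String classifier, String extension) {
        StringBuilder path = new StringBuilder();
        // Directory part: group (dots become slashes), artifact, version.
        path.append(groupId.replace('.', '/')).append('/')
            .append(artifactId).append('/')
            .append(version).append('/');
        // File part: artifactId-version[-classifier].extension
        path.append(artifactId).append('-').append(version);
        if (classifier != null && !classifier.isEmpty()) {
            path.append('-').append(classifier);
        }
        return path.append('.').append(extension).toString();
    }
}
```

For example, calling artifactPath("org.apache.poi", "poi-contrib", "3.0", "sources", "jar") yields the third path listed above, org/apache/poi/poi-contrib/3.0/poi-contrib-3.0-sources.jar.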
    <project>
      <modelVersion>4.0.0</modelVersion>
      <groupId>org.apache.poi</groupId>
      <artifactId>poi-contrib</artifactId>
      <version>@VERSION@</version>
      <packaging>jar</packaging>
      <name>Jakarta POI Contrib</name>
      <url>http://jakarta.apache.org/poi/</url>
      <description>Jakarta POI Contrib - TODO</description>
      <dependencies>
        <dependency>
          <groupId>org.apache.poi</groupId>
          <artifactId>poi</artifactId>
          <version>@VERSION@</version>
        </dependency>
      </dependencies>
      <licenses>
        <license>
          <name>The Apache Software License, Version 2.0</name>
          <url>http://www.apache.org/licenses/LICENSE-2.0.txt</url>
          <distribution>repo</distribution>
        </license>
      </licenses>
    </project>

Feel free to add some of the extra tags (scm, ...) from the other template if you want to. The dependency says that if someone wants to use poi-contrib, they will also need poi.

If you wanted to create a parent pom, you would need to set packaging to pom there (instead of jar) and use

    <groupId>org.apache.poi</groupId>
    <artifactId>poi-parent</artifactId>
    <version>1.0</version>

Since you would not need to change the parent pom with a new release, its version could be independent of the release version. In the other three poms you would inherit from it by adding

    <parent>
      <groupId>org.apache.poi</groupId>
      <artifactId>poi-parent</artifactId>
      <version>1.0</version>
    </parent>

Then you can remove all the shared tags (license, scm, issueManagement) from the other 3 poms and keep them together in the parent pom only. (A more convenient naming scheme
Re: 3.0 release (and maven)
Hi Nick,

So bug #39977 is not an issue anymore?

The groupId bit isn't. Now we understand what it should be, we can use the java package, and all will be fine. Comment #2 (http://issues.apache.org/bugzilla/show_bug.cgi?id=39977#c2) still needs looking at, by someone who understands what's up with the pom.

If we are talking about this:
http://svn.apache.org/repos/asf/jakarta/poi/trunk/poi.pom
I can tell you that the line

    <issueTrackingUrl>http://issues.apache.org/bugzilla/</issueTrackingUrl>

should be

    <issueManagement>
      <system>bugzilla</system>
      <url>http://issues.apache.org/bugzilla/</url>
    </issueManagement>

instead. I think the former was maven1 and does NOT work for maven2 and modelVersion 4.0.0. BTW: I assume that we are just talking about maven2 here and NOT about maven1. For a complete reference see:
http://maven.apache.org/maven-model/maven.html

Since you have 3 artifacts (jar files): poi, poi-contrib, poi-scratchpad, you will need 3 individual pom.xml files. The poi-contrib and poi-scratchpad POMs also need dependencies. Further, you could think about creating an additional pom file to keep the parent metadata (license, scm, issueManagement, etc.) out of the other POMs and avoid redundancy. But since you named the main artifact poi and not e.g. poi-core, this might not fit here.

If you want I could create the three POM files for you and attach them to the bugzilla issue. Then you can see how to adjust your templates and ant build to produce that outcome. Whenever you think about switching from ant to Maven for the build, just let me know, too ;)

Actually the groupId should be compliant with the package name.

OK, I've fixed it in svn. The next build will use org.apache.poi as the groupId

Okay.

Also, as well as changing the group id, should I put the files under /poi/, or under /org.apache.poi/ ? It looks like most apache projects just use their short name, but a few use org.apache.name . We currently use /poi/.

The shorthand comes from the maven1 times, when nobody cared about it. But it's ugly, and it also means that browsing the repository at top level produces a really long list. There is quite a reasonable load on the server... Anyway, it is the same as for Java packages: if you do NOT use it properly, it might clash with another project and cause big trouble. Anyway, it's up to you to decide...

Well, we've used it for ages without complaint from users or server operators, so I don't see the need to change it right now. I'll try to have a chat with some of the Maven guys at ApacheCon EU in a couple of weeks, and see what they think we should do about the distribution directory name.

It's always a good idea to ask several people before making a decision ;)

If anyone does have any thoughts on the poi pom file (eg values we should add, or current ones to update), do drop an email to poi-dev saying what we should fix and why :)

Well, if you do NOT use Maven for the build and you also do NOT want the overhead of an additional parent POM, then you could think about kicking out the scm and issueManagement sections. The url points to the website, and everything is there. Having the license in the pom as well is a good decision and should NOT be changed.

Nick

Regards
Jörg
Re: 3.0 release (and maven)
Hi Nick,

Are you planning to put the 3.0 release into the Maven repository as soon as it is out?

As part of the release process, I'll copy them onto people.apache.org, as per http://www.apache.org/dev/release-publishing.html#repository-guide . Apparently, the files then appear on the ibiblio mirrors, and all is good.

So bug #39977 is not an issue anymore?

Is there something to discuss that could be done now rather than after the release has been pushed out? E.g. whether the groupId should change from poi to org.apache.poi, which would make sense (lucene did this too).

Given that our java package is org.apache.poi, I don't think that should be an issue to change. However, I don't use maven, so I have no idea what sort of an effect that sort of change would have.

Actually the groupId should be compliant with the package name. Quoting from http://maven.apache.org/guides/getting-started/index.html:

    groupId: This element indicates the unique identifier of the organization or group that created the project. The groupId is one of the key identifiers of a project and is typically based on the fully qualified domain name of your organization. For example org.apache.maven.plugins is the designated groupId for all Maven plug-ins.

Any maven users care to comment? Also, as David said, any help with the pom files appreciated :)

I would like to help. Where is help needed? Issue #39977 or something else? I will review your POM templates.

Also, as well as changing the group id, should I put the files under /poi/, or under /org.apache.poi/ ? It looks like most apache projects just use their short name, but a few use org.apache.name . We currently use /poi/.

The shorthand comes from the maven1 times, when nobody cared about it. But it's ugly, and it also means that browsing the repository at top level produces a really long list. There is quite a reasonable load on the server... Anyway, it is the same as for Java packages: if you do NOT use it properly, it might clash with another project and cause big trouble. Anyway, it's up to you to decide...

Nick

Best Regards
Jörg
3.0 release (and maven)
Hi there,

everybody is waiting for the 3.0 release, and everything in head is better than an almost 3-year-old release. Still, take the time you need to have a release that you are happy with.

I have already migrated my project from 2.5.1-final-20040804 to 3.0-rc3 with good results. I did not see the point of the binary incompatibility in HSSF (was it UnicodeString instead of String as the return type? I can't remember), but that does not matter for me. The WordExtractor is also very useful and will allow me to kick out tm-extractors.

Are you planning to put the 3.0 release into the Maven repository as soon as it is out? I would be very pleased if this could happen quickly. If I can help in any way with this task, just let me know.

Is there something to discuss that could be done now rather than after the release has been pushed out? E.g. whether the groupId should change from poi to org.apache.poi, which would make sense (lucene did this too).

Regards
Jörg
Re: help with POI Co.
David Fisher wrote:

On Jan 9, 2007, at 2:28 PM, Joerg Hohwiller wrote:

Besides, I used the official POI release, which is very old. I did NOT try the HEAD from svn.

Jörg:

Hi David,

You want poi3_alpha3, or something like that. (I have Yegor do all my POI work for me :-) Use the latest. I'm sure Nick is talking about it and definitely *not* the ancient, decrepit, last official release.

Okay, I guessed as much.

A new official release is in the works when the POI guys work out some details with the Jakarta overseers. I *think* the community will vote to release current stuff soon (in the next week / month?)

That would have been my next question. But I have heard things like this from other projects (e.g. Maven plugins), and I waited and waited, and finally a year had passed. Besides, I was a little confused that the latest release is about 2.5 years old. That means something odd has happened to the community in that time. For my personal needs it suits me well to use a nightly build, but for real usage I tend to use an official version. Good luck to all POI activists with the next release...

Regards,
Dave Fisher

Regards
Jörg
Re: help with POI Co.
Nick Burch wrote:

On Tue, 9 Jan 2007, Joerg Hohwiller wrote:

Besides, I used the official POI release, which is very old. I did NOT try the HEAD from svn.

You should probably try with the svn head, you will generally have more luck with HWPF and HSLF from there.

Okay, thanks for the tip.

I did NOT even get to open most of the documents. The constructor caused an exception. Something like illegal file format or magic number or something.

I use hslf for a web spider that tries lots of random documents, and it's ok on almost all of them, so it's odd that you're having such problems

(Normally you want to catch CorruptPowerPointFileException and EncryptedPowerPointFileException, and skip over them, and catch ArrayIndexOutOfBoundsException, and report bugs for those)

If an ArrayIndexOutOfBoundsException is thrown by a method where the user did not supply an index as a parameter, the implementation looks like a hack to me. The same applies to NullPointerExceptions.

These two are caused by powerpoint files containing things that we didn't know they might, and which our test documents don't. If you report bugs for them, and include the problem document, we can try and figure out which of our assumptions on the file format are wrong, and work to fix them.

I already debugged into it. It occurred when an UnknownRecord was created. It is generally not a good idea to assume anything about something you don't even know. In such situations you should always check indices and lengths before accessing or copying arrays. Besides, I have seen printStackTrace() calls, which is generally bad practice for a library. Please use nested exceptions for situations like this. I hope this has already been fixed in the 2.5 years since the release...

My problem is that I extract many parts of the text twice from the file. It seems to me that they are really in there twice, even though that is not visible to the user of the PowerPoint application.

Yup, that's to be expected on quicksaved files. QuickButCruddyTextExtractor will do something similar.

Okay.

Your only option if you want to avoid that is to implement all the PersistPtr stuff, then parse SlideListWithTexts, and DoTheRightThing(tm) with it all. At which point, you've re-implemented most of hslf

That sounds like some useful hints. I will have a look at it and also compare this option with using the latest trunk. Thanks!

Nick

Regards
Jörg
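The suggestion made in this message, to check indices and lengths before copying arrays and to report failures through nested exceptions instead of printStackTrace(), could look roughly like the following Java sketch. It is not taken from POI: the names RecordFormatException, SafeRecordReader and copyPayload are hypothetical and only serve the example.

```java
import java.util.Arrays;

/** Hypothetical checked exception that can carry the original cause (nested exception). */
final class RecordFormatException extends Exception {
    RecordFormatException(String message) {
        super(message);
    }

    RecordFormatException(String message, Throwable cause) {
        super(message, cause);
    }
}

/** Hypothetical helper that validates offsets and lengths before touching the array. */
final class SafeRecordReader {

    private SafeRecordReader() {
        // static helper only
    }

    /**
     * Copies declaredLength bytes starting at offset. The declared length
     * typically comes from the file itself, so it must not be trusted.
     */
    static byte[] copyPayload(byte[] source, int offset, int declaredLength)
            throws RecordFormatException {
        if (source == null) {
            throw new RecordFormatException("No record data available");
        }
        // Written as a subtraction so that offset + declaredLength cannot overflow.
        if (offset < 0 || declaredLength < 0 || offset > source.length - declaredLength) {
            throw new RecordFormatException("Record claims " + declaredLength
                    + " bytes at offset " + offset
                    + " but only " + source.length + " bytes are available");
        }
        try {
            return Arrays.copyOfRange(source, offset, offset + declaredLength);
        } catch (RuntimeException e) {
            // Not expected after the checks above; if it happens anyway,
            // keep the cause attached instead of printing a stack trace.
            throw new RecordFormatException("Unexpected failure while copying record data", e);
        }
    }
}
```

A caller can then decide per document whether to skip the file, log the problem or rethrow, rather than receiving a bare ArrayIndexOutOfBoundsException from deep inside the parser.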
Re: help with POI Co.
Hi Nick,

You shouldn't really have any problems with HSSF. There are lots of examples for hssf, did you follow them?

I had a look at the examples and rewrote my code completely, switching to event listener mode. This works a lot better and consumes less memory. I still get some ArrayIndexOutOfBoundsExceptions when UnknownRecords are created. I promise to try the latest version from trunk, and if I still get those bugs, I will open an issue and maybe supply a patch and an example document. The problem is that the errors occur mainly in documents that contain information not intended for the public - I will see what I can do for you...

BTW: do you agree with what David said about the roadmap for a new POI release?

Nick

Best Regards
Jörg
Re: help with POI Co.
Nick Burch wrote:

On Mon, 8 Jan 2007, Joerg Hohwiller wrote:

For msword I tried HWPF but the result was really bad.

Were you using org.apache.poi.hwpf.extractor.WordExtractor ? It doesn't filter out all the text entries that aren't really text, but any patches to fix that would be appreciated :)

That is what I tried. Well, it threw exceptions for most of the documents. My problem is that I have a huge repository with very old to very new documents. This technically means that you can find all the sins of Office history in the documents I need to read... I read that textmining also supports older versions of Word that are not supported by HWPF. Besides, I used the official POI release, which is very old. I did NOT try the HEAD from svn.

For spidering, it's normally fine to use, since it doesn't normally matter if you get a few bonus words through for some of the special fields.

I have modified the sources so that the constructor can also take a POIFilesystem and not only a File. There are still some bugs. I would fix them, but would I be allowed to create a new release of this stuff and publish it with my project? Or is there a way to submit a patch to textmining.org?

textmining.org belongs to Ryan Ackley, who used to contribute to POI, until he went to work for a company that licenses the file format documentation from Microsoft. You'll need to contact him yourself with any patches.

I will see what I can do...

For powerpoint I tried HSLF, which could not parse most of the documents.

That's odd. I have almost no trouble using org.apache.poi.hslf.extractor.PowerPointExtractor on a wide range of powerpoint documents. What problems did you hit?

I did NOT even get to open most of the documents. The constructor caused an exception. Something like illegal file format or magic number or something.

(Normally you want to catch CorruptPowerPointFileException and EncryptedPowerPointFileException, and skip over them, and catch ArrayIndexOutOfBoundsException, and report bugs for those)

If an ArrayIndexOutOfBoundsException is thrown by a method where the user did not supply an index as a parameter, the implementation looks like a hack to me. The same applies to NullPointerExceptions. I got all of these...

The POIFilesystem and the code that extracts the metadata seem very stable to me. But I have not had good experiences with the rest of POI. Anyhow, I have now written a PPT extractor from scratch that is based only on POIFilesystem and NOT on the HSLF code. The advantage is that it supports a low memory footprint: my class can be configured not to exceed a specific buffer size for allocation, so users do NOT get an OutOfMemoryError if there is an evil file that is too big. Even those evil files are parsed, but only as much data is extracted as the configured buffer size allows.

My problem is that I extract many parts of the text twice from the file. It seems to me that they are really in there twice, even though that is not visible to the user of the PowerPoint application. If someone can help me with that, I would be very grateful for any hint:
http://m-m-m.googlecode.com/svn/trunk/mmm-search/mmm-search-parser/mmm-search-parser-ppt/src/main/java/net/sf/mmm/search/parser/impl/ContentParserPpt.java

For excel I tried HSSF, which throws an exception for every document I read.

You shouldn't really have any problems with HSSF. There are lots of examples for hssf, did you follow them?

I suppose NOT. I will look at them. This is my code:
http://m-m-m.googlecode.com/svn/trunk/mmm-search/mmm-search-parser/mmm-search-parser-xls/src/main/java/net/sf/mmm/search/parser/impl/ContentParserXls.java
After I have checked my own mistakes, I will send you the stack traces of the remaining problems.

Nick

Thanks
Jörg
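The low-memory idea described in this message (extract text only up to a configured buffer size, so that an oversized file is still parsed but cannot exhaust memory) can be illustrated with a small Java sketch. This is not the code from ContentParserPpt. The class BoundedTextBuffer and its methods are hypothetical names, and the limit is counted in characters as a simplifying assumption.

```java
/**
 * Hypothetical text collector that never holds more than maxChars characters.
 * Anything beyond the limit is dropped and the buffer is marked as truncated.
 */
final class BoundedTextBuffer {

    private final StringBuilder text;
    private final int maxChars;
    private boolean truncated;

    BoundedTextBuffer(int maxChars) {
        if (maxChars < 0) {
            throw new IllegalArgumentException("maxChars must not be negative: " + maxChars);
        }
        this.maxChars = maxChars;
        // Start small; the builder grows on demand but never beyond maxChars of content.
        this.text = new StringBuilder(Math.min(maxChars, 1024));
    }

    /**
     * Appends as much of the chunk as still fits.
     *
     * @return true if the whole chunk was stored, false if it had to be cut
     */
    boolean append(CharSequence chunk) {
        int remaining = maxChars - text.length();
        if (chunk.length() <= remaining) {
            text.append(chunk);
            return true;
        }
        // Store only the part that fits and remember that data was dropped.
        text.append(chunk, 0, remaining);
        truncated = true;
        return false;
    }

    boolean isTruncated() {
        return truncated;
    }

    @Override
    public String toString() {
        return text.toString();
    }
}
```

A parser built this way can stop reading further records as soon as append returns false, which keeps both memory use and parsing time bounded for very large files.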