[Fwd: Re: Fix existing POM of poi 3.0-FINAL]

2007-05-31 Thread Joerg Hohwiller

FYI: The POM of 3.0-FINAL has been fixed in the central maven repository.

-------- Original Message --------
Subject: Re: Fix existing POM of poi 3.0-FINAL
Date: Wed, 30 May 2007 13:22:25 -0700
From: Carlos Sanchez [EMAIL PROTECTED]
To: Joerg Hohwiller [EMAIL PROTECTED]
References: [EMAIL PROTECTED]

fixed now

On 5/30/07, Joerg Hohwiller [EMAIL PROTECTED] wrote:
 Hi Carlos,
 
 sorry for mailing directly but it is sort of urgent.
 
 As commented in:
 http://jira.codehaus.org/browse/MEV-479
 
 the POM of apache POI in version 3.0-FINAL has been released but is invalid:
 http://repo1.maven.org/maven2/org/apache/poi/poi/3.0-FINAL/poi-3.0-FINAL.pom
 
 Too bad that this can happen at all.
 
 I am very, very sorry that this has happened, and I know the philosophy
 of not changing anything after it is released to ibiblio. Still, could you be
 so kind as to make an exception and remove the logo tag from the organization
 section? I also attached the fixed POM to the MEV-479 issue.
 
 If we leave it like this it helps nobody, whereas if you fix it soon only
 very few local repositories will contain the invalid POM, while tons of
 further involved ones will receive the correct one. The problem is that maven
 does NOT process the POM as it is now, because it does NOT follow the spec.
 
 I somehow did not double check, and the POI community is not too interested
 in maven. After the 2.5.1 release from 2004-08-04, 3.0-FINAL is the next
 release, and if we have to wait another 3 years before 3.1 comes out and fixes
 the POM it would really be a pity ;)
 
 Thank you so much
   Jörg

--
I could give you my word as a Spaniard.
No good. I've known too many Spaniards.
-- The Princess Bride


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
Mailing List: http://jakarta.apache.org/site/mail2.html#poi
The Apache Jakarta Poi Project:  http://jakarta.apache.org/poi/



Re: POI 3.0 RC4

2007-05-30 Thread Joerg Hohwiller

Hi Nick,

Congratulations on the 3.0-FINAL release.
After such a long time it's really good to have it out, and it seems to work well.

Anyhow, I am very sorry to report that the POM still has a little problem.
Even though it did not occur when I checked RC4, I can now reproduce it.
It seems I did not invest enough time in checking, since I am busy with several
other projects in my little spare time. Anyway, I feel a little ashamed, since
the problem is a general violation of the POM spec that I should have noticed,
especially because it was reported in the bug issue you cited:
http://issues.apache.org/bugzilla/show_bug.cgi?id=39977

The logo tag is invalid in the organization section. You need to remove it.

I have seen that the POM is not included in the binary distribution and
the only problem is here:
http://repo1.maven.org/maven2/org/apache/poi/poi/3.0-FINAL/poi-3.0-FINAL.pom

I have added a comment here:
http://jira.codehaus.org/browse/MEV-479

I hope they are going to fix this soon. I will do my best to push this topic...

BTW did you notice this one:
http://jira.codehaus.org/browse/MEV-478

Regards
  Jörg

 On Mon, 7 May 2007, Joerg Hohwiller wrote:
 Thanks for your work. I have tested the mavenized version of POI with
 maven 2.0.6. Your POM is gracefully accepted and everything works well.
 
 Great, that's good to know
 
 I personally like the idea of not further splitting the POI artifact into
 smaller pieces. Anyhow you should face the fact that this has been done
 for 2.5.1-final:
 http://repo1.maven.org/maven2/poi/

 This will definitely cause confusion if maven users want to upgrade from poi
 2.5.1-final to 3.0.
 
 Are there really that many people using contrib and scratchpad from
 2.5.1 though? Almost everyone who uses contrib and scratchpad will have
 upgraded to a newer version of poi (eg the alphas, which only had the one
 artifact), so that they can get the new scratchpad functionality.
 
 My feeling is that people will either be using 2.5.1 core (but not
 contrib+scratchpad), so nothing will change for them with 3.0, or
 they'll already be using the single artifact 3.0 alphas.
 
 
 Hopefully I've guessed right, and we won't cause problems for many people
 
 Nick
 



Re: POI 3.0 RC4

2007-05-06 Thread Joerg Hohwiller

 Hi All
Hi Nick,
 
 Release candidate 4 of POI 3.0 is now available:
   http://people.apache.org/~nick/POI-3.0-RC4/
 For maven users, there's also the pom, binary and source jars available
 for testing with:
   http://people.apache.org/~nick/POI-3.0-RC4/maven/
 
 There have only been a few bug fixes since RC3, but lots of work on
 the maven artifacts.
Thanks for your work. I have tested the mavenized version of POI with maven
2.0.6. Your POM is gracefully accepted and everything works well.

I personally like the idea of not further splitting the POI artifact into
smaller pieces. Anyhow you should face the fact that this has been done
for 2.5.1-final:
http://repo1.maven.org/maven2/poi/

This will definitely cause confusion if maven users want to upgrade from poi
2.5.1-final to 3.0. They will have to remove the additional dependencies on
poi-contrib and poi-scratchpad. For compatibility reasons I would
recommend keeping the existing scheme, even though this will cause the
overhead of building/maintaining 3 POMs instead of just one. Anyway, the choice
is yours...
 
 As before, please let us know if things are still broken, and ensure any
 open bug reports have all the information we'll need.
see above...
 
 Nick
Best regards
  Jörg



Re: 3.0 release (and maven)

2007-04-30 Thread Joerg Hohwiller

Hi Nick,
 On Wed, 25 Apr 2007, Joerg Hohwiller wrote:
 But from what I see in your subversion, you still need to replace
 <issueTrackingUrl> with <issueManagement>.
 
 Done
 
 I also went through your ant build and have seen that you do not
 modify the root tag <metadata>. That also needs to be changed to
 <project>, which might be the major reason for the Bugzilla issue
 about the POM file.
 
 Done
 
 Since you have 3 artifacts (jar-files): poi, poi-contrib,
 poi-scratchpad you will need 3 individual pom.xml files.
 
 Closer inspection of the build process shows that's not true. For the
 main release, we do have 3 different jar files. However, we build a
 different jar file for pushing out to the maven repository. The maven
 jar file (and source jar file) contains all of main, contrib and
 scratchpad, rolled into one. So, we do only need the one pom file after
 all.
As far as I can see this has been done additionally to the 3 individual
artifacts as you can see here:
http://repo1.maven.org/maven2/poi/

Whoever did this, I don't know...
 
 Guess that makes things simpler, shame I didn't remember earlier!
okay, however this will be the way you want to go with 3.0 if I got you right...
 
 Then you would have a version. I would suggest calling the version
 3.0. From your suggestion above the version would be 3.0-final.
 If maven needs to compare the version with another one to decide which
 one is newer, it would be better and more straightforward to use 3.0.
 
 The official version is 3.0-final though, not 3.0. It's hardly an
 unusual naming convention, so I'd be surprised if maven couldn't tell
 that 3.0-final is newer than 3.0-alpha3. If not, I'm sure they take
 patches :)
It will surely work without patch.
 
 The artifacts would be located in the repository like this:
 org/apache/poi/poi-contrib/3.0/poi-contrib-3.0.pom
 org/apache/poi/poi-contrib/3.0/poi-contrib-3.0.jar
 org/apache/poi/poi-contrib/3.0/poi-contrib-3.0-sources.jar
 
 It looks like the apache thing is to put poms in one directory under
 m1-ibiblio-rsync-repository, and jars in another. I guess the ibiblio
 sync script does the right thing with that
 
 
 When I do rc4, I'll also do a maven pom, and jar + source jar. I'll
 sling them on people.apache.org as usual, so it'd be great if you could
 check they look fine.
Yep, I will double check this. Let me know when it's out...
 
 Cheers for all the maven help. 
No worries - thanks for taking care of the maven users.
 Not sure you've won too many converts to maven though...
I am not an evangelist. I like it, but feel free to do whatever you like ;)
 
 Nick
Regards
  Jörg



Re: 3.0 release (and maven)

2007-04-25 Thread Joerg Hohwiller

Hi Nick,
 On Tue, 24 Apr 2007, Joerg Hohwiller wrote:
   <issueManagement>
     <system>bugzilla</system>
     <url>http://issues.apache.org/bugzilla/</url>
   </issueManagement>
 
 Thanks, I've updated that
okay.
But from what I see in your subversion, you still need to replace
<issueTrackingUrl> with <issueManagement>.
I also went through your ant build and have seen that you do not modify the root
tag <metadata>. That also needs to be changed to <project>, which might be the
major reason for the Bugzilla issue about the POM file.
 
 BTW: I assume that we are just talking about maven2 here and NOT about
 maven1.
 
 Since that seems to be the version most people use, I guess so. 

Since maven1 does NOT support transitive dependencies, the POMs are
not too relevant for its users. Maybe the maven guys create maven1
artifacts automatically when maven2 artifacts are added - I do not know...
 As I said, I'm not a maven user myself, and I normally end up with a headache
 whenever I try and learn

 
 Since you have 3 artifacts (jar-files): poi, poi-contrib,
 poi-scratchpad you will need 3 individual pom.xml files. The
 poi-contrib and poi-scratchpad also need dependencies. Further you
 could think about creating an additional pom file to keep the parent
 metadata (license, scm, issueManagement, etc.) out of the other POMs
 and avoid redundancies. But since you named the main artifact poi
 and not e.g. poi-core this might not fit here.
 
 I don't think it'd be too hard to have the ant build file spit out three
 poms from the same template. That would also save us the faff of
 deciding what needs to get split into a parent pom, what doesn't etc.
Yep, I think the best thing would be to have one template per POM.
 
 For contrib and scratchpad, I guess we want to call the output pom
 something like poi-scratchpad-3.0-final.pom. Is the artifactId then
 poi-scratchpad? Is the dependency groupId=org.apache.poi,
 artifactId=poi, and version=(same version) ?
The groupId should be the same for all of your artifacts (JAR files and
the corresponding POMs). As we discussed earlier it would make sense to use
org.apache.poi.

The artifactId is the major part of the name of the JAR file.
So you would have the artifact IDs
poi, poi-contrib, and poi-scratchpad.

Then you would have a version. I would suggest calling the version 3.0.
From your suggestion above the version would be 3.0-final.
If maven needs to compare the version with another one to decide which one is
newer, it would be better and more straightforward to use 3.0.

So for
<groupId>org.apache.poi</groupId>
<artifactId>poi-contrib</artifactId>
<version>3.0</version>

The artifacts would be located in the repository like this:
org/apache/poi/poi-contrib/3.0/poi-contrib-3.0.pom
org/apache/poi/poi-contrib/3.0/poi-contrib-3.0.jar
org/apache/poi/poi-contrib/3.0/poi-contrib-3.0-sources.jar

The general pattern is:
groupId.replace('.', '/')/artifactId/version/artifactId-version*

The *-sources.jar is not required, but if you supply it and someone
uses maven to configure their IDE (e.g. Eclipse, NetBeans, IntelliJ),
then the sources are attached automatically and can be browsed in
the IDE, which is very handy.
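
As a rough illustration of the layout pattern above, the following Java sketch builds the repository-relative path from the coordinates. The class RepoLayout and its path method are hypothetical names made up for this example; they are not part of maven or POI. The optional classifier argument (for instance "sources") is an assumption derived from the -sources.jar file name listed above.

```java
// Hypothetical helper for illustration only; not a maven or POI API.
final class RepoLayout {

    private RepoLayout() {
    }

    /**
     * Builds the repository-relative path of one artifact file, following
     * groupId.replace('.', '/')/artifactId/version/artifactId-version*.
     *
     * @param classifier optional suffix such as "sources"; may be null
     * @param extension  file extension such as "pom" or "jar"
     */
    static String path(String groupId, String artifactId, String version,
                       String classifier, String extension) {
        StringBuilder sb = new StringBuilder();
        sb.append(groupId.replace('.', '/')).append('/')
          .append(artifactId).append('/')
          .append(version).append('/')
          .append(artifactId).append('-').append(version);
        if (classifier != null && !classifier.isEmpty()) {
            sb.append('-').append(classifier);
        }
        return sb.append('.').append(extension).toString();
    }

    public static void main(String[] args) {
        // Prints the three poi-contrib paths listed in the mail above.
        System.out.println(path("org.apache.poi", "poi-contrib", "3.0", null, "pom"));
        System.out.println(path("org.apache.poi", "poi-contrib", "3.0", null, "jar"));
        System.out.println(path("org.apache.poi", "poi-contrib", "3.0", "sources", "jar"));
    }
}
```

The same call with a groupId of just poi would give the flat top-level directory that the old 2.5.1 artifacts use.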

 
 If you want I could create the three POM files for you and attach them
 to the bugzilla issue.
 
 If it's just the scratchpad and contrib poms that need updating (the
 main one now being fine), could you do a version for one of them? I can
 then work it into the ant build file.

<project>
  <modelVersion>4.0.0</modelVersion>
  <groupId>org.apache.poi</groupId>
  <artifactId>poi-contrib</artifactId>
  <version>@VERSION@</version>
  <packaging>jar</packaging>
  <name>Jakarta POI Contrib</name>
  <url>http://jakarta.apache.org/poi/</url>
  <description>Jakarta POI Contrib - TODO</description>
  <dependencies>
    <dependency>
      <groupId>org.apache.poi</groupId>
      <artifactId>poi</artifactId>
      <version>@VERSION@</version>
    </dependency>
  </dependencies>
  <licenses>
    <license>
      <name>The Apache Software License, Version 2.0</name>
      <url>http://www.apache.org/licenses/LICENSE-2.0.txt</url>
      <distribution>repo</distribution>
    </license>
  </licenses>
</project>

Feel free to add some of the extra tags (scm, ...) from the other template if
you want to. The dependency says that if someone wants to use poi-contrib,
they will also need poi.

If you wanted to create a parent pom then you would need to set packaging to
pom there (instead of jar) and use
  <groupId>org.apache.poi</groupId>
  <artifactId>poi-parent</artifactId>
  <version>1.0</version>
Since you would not need to change the parent pom with a new release, the
version could be independent of the release version.

In the other three poms you would inherit from that by adding
  <parent>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-parent</artifactId>
    <version>1.0</version>
  </parent>

Then you can remove all the tags (license, scm, issueManagement) from the other
3 poms and only keep them together in the parent pom.

(A more convenient naming scheme

Re: 3.0 release (and maven)

2007-04-24 Thread Joerg Hohwiller

Hi Nick,
 So bug #39977 is not an issue anymore?
 
 The groupId bit isn't. Now we understand what it should be, we can use
 the java package, and all will be fine.
 
 Comment #2 (http://issues.apache.org/bugzilla/show_bug.cgi?id=39977#c2)
 still needs looking at, by someone who understands what's up with the pom.
If we are talking about this:
http://svn.apache.org/repos/asf/jakarta/poi/trunk/poi.pom

I can tell you that the line

<issueTrackingUrl>http://issues.apache.org/bugzilla/</issueTrackingUrl>

should be

  <issueManagement>
    <system>bugzilla</system>
    <url>http://issues.apache.org/bugzilla/</url>
  </issueManagement>

instead.
I think the old element was from maven1 and does NOT work with maven2 and modelVersion 4.0.0.
BTW: I assume that we are just talking about maven2 here and NOT about maven1.

For a complete reference see:
http://maven.apache.org/maven-model/maven.html

Since you have 3 artifacts (jar-files): poi, poi-contrib, poi-scratchpad
you will need 3 individual pom.xml files. The poi-contrib and poi-scratchpad
also need dependencies. Further you could think about creating an
additional pom file to keep the parent metadata (license, scm, issueManagement,
etc.) out of the other POMs and avoid redundancies. But since you named the
main artifact poi and not e.g. poi-core this might not fit here.
If you want I could create the three POM files for you and attach them to the
bugzilla issue. Then you can see how to adjust your templates and ant build
to produce such output. Whenever you think about switching from ant to maven
for the build just let me know, too ;)
 
 Actually the groupId should match the package name.
 
 OK, I've fixed it in svn. The next build will use org.apache.poi as the
 groupId
okay.
 
 Also, as well as changing the group id, should I put the files under
 /poi/, or under /org.apache.poi/ ? It looks like most apache projects
 just use their short name, but a few use org.apache.name . We
 currently use /poi/.
 The shorthand comes from the maven1 times when nobody cared about it.
 But it's ugly, and it also means that browsing the repository at top level
 produces a really long list. There is quite a load on the
 server...
 Anyway, it is the same as for java packages. If you do NOT use it properly
 it might clash with another project and cause big trouble.
 Anyway, it's up to you to decide...
 
 Well, we've used it for ages without complaint from users or server
 operators, so I don't see the need to change it right now. I'll try to
 have a chat with some of the Maven guys at apachecon EU in a couple of
 weeks, and see what they think we should do about the distribution
 directory name.
It's always a good idea to ask several people before making a decision ;)
 
 
 If anyone does have any thoughts on the poi pom file (eg values we
 should add, or current ones to update), do drop an email to poi-dev
 saying what we should fix and why :)
Well if you do NOT use maven for the build and you also do NOT want to
have the overhead of an additional parent POM then you could think about
kicking out the scm and issueManagement sections. The url points to the website
and everything is there. Having the license in the pom as well is a good
decision and should NOT be changed.
 
 Nick
Regards
  Jörg



Re: 3.0 release (and maven)

2007-04-19 Thread Joerg Hohwiller

Hi Nick,
 Are you planning to put the 3.0 release into the maven repository as
 soon as it is out?
 
 As part of the release process, I'll copy them onto people.apache.org,
 as per
 http://www.apache.org/dev/release-publishing.html#repository-guide .
 Apparently, the files then appear on the ibiblio mirrors, and all is good.

So bug #39977 is not an issue anymore?

 
 Is there something to discuss that could be done now rather than
 after the release has been pushed out? E.g. whether the groupId should
 change from poi to org.apache.poi, which would make sense (lucene
 did this too).
 
 Given that our java package is org.apache.poi, I don't think that should
 be an issue to change. However, I don't use maven, so I have no idea
 what sort of an effect that sort of change would have.
Actually the groupId should match the package name.

Quote from
http://maven.apache.org/guides/getting-started/index.html

groupId This element indicates the unique identifier of the organization or
group that created the project. The groupId is one of the key identifiers of a
project and is typically based on the fully qualified domain name of your
organization. For example org.apache.maven.plugins is the designated groupId for
all Maven plug-ins.


 
 Any maven users care to comment? Also, as David said, any help with the
 pom files appreciated :)
I would like to help. Where is help needed? Issue #39977 or something else?
I will review your POM templates.
 
 
 Also, as well as changing the group id, should I put the files under
 /poi/, or under /org.apache.poi/ ? It looks like most apache projects
 just use their short name, but a few use org.apache.name . We
 currently use /poi/.
The shorthand comes from the maven1 times when nobody cared about it.
But it's ugly, and it also means that browsing the repository at top level
produces a really long list. There is quite a load on the server...
Anyway, it is the same as for java packages. If you do NOT use it properly
it might clash with another project and cause big trouble.
Anyway, it's up to you to decide...
 
 Nick
Best Regards
  Jörg



3.0 release (and maven)

2007-04-17 Thread Joerg Hohwiller

Hi there,

everybody is waiting for the 3.0 release, and everything in head
is better than an almost 3-year-old release. Anyway, take the time you
need to have a release that you are happy with.

I already migrated my project from 2.5.1-final-20040804 to 3.0-rc3 with good
results. I did not see the point of the binary incompatibility of HSSF (was it
UnicodeString instead of String as return type? I can't remember) but that does
not matter to me.
The WordExtractor is also very useful and will allow me to kick out
tm-extractors.

Are you planning to put the 3.0 release into the maven repository as soon as it
is out? I would be very pleased if this could happen quickly. If I can help
in any way with this task, just let me know.

Is there something to discuss that could be done now rather than after the
release has been pushed out? E.g. whether the groupId should change from poi to
org.apache.poi, which would make sense (lucene did this too).

Regards
  Jörg



Re: help with POI Co.

2007-01-10 Thread Joerg Hohwiller
David Fisher schrieb:
 On Jan 9, 2007, at 2:28 PM, Joerg Hohwiller wrote:
 Besides I used the official POI release which is very old. I did NOT try the
 HEAD from svn.
 
 Jörg:
Hi David,
 
 You want poi3_alpha3, or something like that. (I have Yegor do all my
 POI work for me :-)
 
 Use the latest. I'm sure Nick is talking about it and definitely *not*
 the ancient, decrepit, last official release.
Okay, I guessed it.
 
 A new official release is in the works when the POI guys work out some
 details with the Jakarta overseers. I *think* the community will vote to
 release current stuff soon (in the next week / month?)
That would have been my next question. But I have heard things like this from
other projects (e.g. maven plugins), and I waited and waited until finally a year
had passed. Besides, I was a little confused that the latest release is about 2.5
years old. That means something odd has happened to the community in that time.

For my personal needs it suits me fine to use a nightly build, but
for real usage I tend to use an official version.
Good luck to all POI activists for the next release...
 
 Regards,
 Dave Fisher
Regards
  Jörg




Re: help with POI Co.

2007-01-10 Thread Joerg Hohwiller
Nick Burch schrieb:
 On Tue, 9 Jan 2007, Joerg Hohwiller wrote:
 Besides I used the official POI release which is very old. I did NOT try the
 HEAD from svn.
 
 You should probably try with the svn head, you will generally have more
 luck with HWPF and HSLF from there.
Okay, thanks for the tip.
 
 I did NOT even open most of the documents. The constructor caused an
 exception. Something like illegal fileformat or magic-number or
 something.
 
 I use hslf for a web spider that tries lots of random documents, and
 it's ok on almost all of them, so it's odd that you're having such problems
 
 (Normally you want to catch CorruptPowerPointFileException and
 EncryptedPowerPointFileException, and skip over them, and catch
 ArrayIndexOutOfBoundsException, and report bugs for those)

 If an ArrayIndexOutOfBoundsException is thrown by a method where the
 user did not supply an index as parameter the implementation looks
 like a hack to me. Same applies to NullPointerExceptions.
 
 These two are caused by powerpoint files containing things that we
 didn't know they might, and which our test documents don't. If you
 report bugs for them, and include the problem document, we can try and
 figure out which of our assumptions on the file format are wrong, and
 work to fix them.
I already debugged into it. It occurred when an UnknownRecord was created.
It is generally not a good idea to assume anything about data you don't even know.
In such situations you should always check indices and lengths before
accessing or copying arrays.
Besides, I have seen printStackTrace() calls, which is generally bad practice for a
library. Please use nested exceptions for situations like this.
I hope this has already been fixed in the 2.5 years since the release...
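
To make the two suggestions above concrete, here is a minimal Java sketch that checks offsets and lengths before copying and wraps a low-level failure in a nested exception instead of printing a stack trace. It is not POI code: the class SafeRecordCopy, its methods, and MalformedRecordException are hypothetical names invented for this example, and the idea that a record carries a declared payload length is only an assumption about how such a parser might look.

```java
import java.io.DataInputStream;
import java.io.IOException;

// Hypothetical example; none of these names exist in POI.
final class SafeRecordCopy {

    /** Checked exception that reports bad input and keeps the original cause. */
    static final class MalformedRecordException extends IOException {
        MalformedRecordException(String message) {
            super(message);
        }

        MalformedRecordException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    private SafeRecordCopy() {
    }

    /** Copies length bytes starting at offset, validating both values first. */
    static byte[] copyPayload(byte[] source, int offset, int length)
            throws MalformedRecordException {
        if (source == null) {
            throw new MalformedRecordException("no record data");
        }
        // Written so that the comparison cannot overflow for large values.
        if (offset < 0 || length < 0 || offset > source.length - length) {
            throw new MalformedRecordException("record claims " + length
                    + " bytes at offset " + offset + " but only "
                    + source.length + " bytes are available");
        }
        byte[] payload = new byte[length];
        System.arraycopy(source, offset, payload, 0, length);
        return payload;
    }

    /** Reads a payload from a stream and nests the low-level exception. */
    static byte[] readPayload(DataInputStream in, int length)
            throws MalformedRecordException {
        if (length < 0) {
            throw new MalformedRecordException("negative record length: " + length);
        }
        byte[] payload = new byte[length];
        try {
            in.readFully(payload);
        } catch (IOException e) {
            // Keep the cause so that callers can log or inspect it themselves.
            throw new MalformedRecordException(
                    "record truncated, expected " + length + " bytes", e);
        }
        return payload;
    }
}
```

A caller can then skip a broken document by catching one well-defined exception type and still reach the underlying problem through getCause().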
 
 My problem is that I extract many parts of text twice from the file.
 It seems to me that they are really in there twice even though not
 visible to the powerpoint application user.
 
 Yup, that's to be expected on quicksaved files.
 QuickButCruddyTextExtractor will do something similar.
okay.
 
 Your only option if you want to avoid that is to implement all the
 PersistPtr stuff, then parse SlideListWithTexts, and DoTheRightThing(tm)
 with it all. At which point, you've re-implemented most of hslf
That sounds like some useful hints. I will have a look at it and also compare this
option with using the latest trunk. Thanks!
 
 Nick
Regards
  Jörg




Re: help with POI Co.

2007-01-10 Thread Joerg Hohwiller
Hi Nick,

 You shouldn't really have any problems with HSSF. There are lots of
 examples for hssf, did you follow them?
I had a look at the examples and rewrote my code completely, switching to event
listener mode. This works a lot better and consumes less memory.
I still get some ArrayIndexOutOfBoundsExceptions when UnknownRecords are
created.
I promise to try the latest version from trunk, and if I still get those bugs
I will open an issue and maybe supply a patch and an example document. The
problem is that the errors are mainly in documents that contain information not
intended for the public - I will see what I can do for you...

BTW: do you agree about what David said about the roadmap for a new POI release?
 
 Nick
Best Regards
  Jörg




Re: help with POI Co.

2007-01-09 Thread Joerg Hohwiller
Nick Burch schrieb:
 On Mon, 8 Jan 2007, Joerg Hohwiller wrote:
 For msword I tried HWPF but the result was really bad.
 
 Were you using org.apache.poi.hwpf.extractor.WordExtractor ? It doesn't
 filter out all the text entries that aren't really text, but any
 patches to fix that would be appreciated :)
That is what I tried.
Well, it threw exceptions for most of the documents.
My problem is that I have a huge repository with very old to very new
documents. This technically means that you can find all the sins of office
history in the documents I need to read...
I read that textmining also supports older versions of word that are not
supported by HWPF.
Besides, I used the official POI release, which is very old. I did NOT try the
HEAD from svn.
 
 For spidering, it's normally fine to use, since it doesn't normally
 matter if you get a few bonus words through for some of the special
 fields.
 
 I have modified the sources so that the constructor can also take a
 POIFilesystem and not only a File. There are still some bugs. I would
 fix them but would I be allowed to create a new release of this stuff
 and publish it with my project? Or is there a way how to submit a
 patch to textmining.org?
 
 textmining.org belongs to Ryan Ackley, who used to contribute to POI,
 until he went to work for a company that licenses the file format
 documentation from Microsoft. You'll need to contact him yourself with
 any patches.
I will see what I can do...
 
 For powerpoint I tried HSLF, which could not parse most of the documents.
 
 That's odd. I have almost no trouble using
 org.apache.poi.hslf.extractor.PowerPointExtractor on a wide range of
 powerpoint documents. What problems did you hit?
I did NOT even open most of the documents. The constructor caused an exception.
Something like illegal fileformat or magic-number or something.
 
 (Normally you want to catch CorruptPowerPointFileException and
 EncryptedPowerPointFileException, and skip over them, and catch
 ArrayIndexOutOfBoundsException, and report bugs for those)
If an ArrayIndexOutOfBoundsException is thrown by a method where the user
did not supply an index as parameter, the implementation looks like a hack to me.
The same applies to NullPointerExceptions.
I got all of these...
The POIFilesystem and the stuff to extract the metadata seem to be very stable
to me. But I did not have good experiences with the rest of POI.
Anyhow, I have now written a PPT extractor from scratch that is based only on
POIFilesystem and NOT on the HSLF stuff. The advantage is that I have support
for a low memory footprint: my class can be configured not to exceed a specific
buffer size for allocation, so users do NOT get an OutOfMemoryError if there is an
evil file that is too big. Even those evil files are parsed, but
only as much data is extracted as the configured buffer size allows.
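
The buffer-limit idea can be sketched in a few lines of Java. This is not the code of the extractor linked below; the class BoundedTextReader, the method readAtMost, and the maxBytes parameter are hypothetical names chosen for illustration, and the sketch works on a plain InputStream rather than on a POIFilesystem.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;

// Hypothetical example of capping how much data is extracted from a stream.
final class BoundedTextReader {

    private BoundedTextReader() {
    }

    /**
     * Reads at most maxBytes from the stream and decodes them as text.
     * Anything beyond the limit is simply not read, so an oversized file
     * yields truncated text instead of an OutOfMemoryError.
     */
    static String readAtMost(InputStream in, int maxBytes, Charset charset)
            throws IOException {
        if (maxBytes < 0) {
            throw new IllegalArgumentException("maxBytes must not be negative");
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int remaining = maxBytes;
        while (remaining > 0) {
            int n = in.read(chunk, 0, Math.min(chunk.length, remaining));
            if (n < 0) {
                break; // end of stream reached before the limit
            }
            out.write(chunk, 0, n);
            remaining -= n;
        }
        // Note: with a multi-byte charset the cut may split the last character.
        return new String(out.toByteArray(), charset);
    }
}
```

Memory use is bounded by the configured limit plus one small chunk, regardless of how large the input file is.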

My problem is that I extract many parts of text twice from the file.
It seems to me that they are really in there twice even though not visible
to the powerpoint application user.

If someone can help me with that I would be very grateful for any hint:

http://m-m-m.googlecode.com/svn/trunk/mmm-search/mmm-search-parser/mmm-search-parser-ppt/src/main/java/net/sf/mmm/search/parser/impl/ContentParserPpt.java

 
 For excel I tried HSSF, which throws an exception for every document I
 read.
 
 You shouldn't really have any problems with HSSF. There are lots of
 examples for hssf, did you follow them?
I suppose NOT. I will look at them.

This is my code:
http://m-m-m.googlecode.com/svn/trunk/mmm-search/mmm-search-parser/mmm-search-parser-xls/src/main/java/net/sf/mmm/search/parser/impl/ContentParserXls.java

After I have checked my mistakes I will send you the stacktraces of the remaining
problems.
 
 Nick
Thanks
  Jörg
