Hi,

On Sat, Nov 29, 2008 at 12:02 AM, Chris Hostetter <[EMAIL PROTECTED]> wrote:
> My comments on RC1 are below. i don't feel comfortable voting for it in
> its current state...

Thanks for the review, much appreciated!

I think it's fair to say that with the 0.2 release we're still pretty much in the transition from the Incubator to Lucene (and from a developer-only product to a general end-user product). The main drive (at least from my side) for the 0.2 release was to get whatever we had at the moment released as soon as possible for interested users (release early, release often), and then to focus in 0.3 on getting all the extra stuff like documentation and extra build artifacts in place.

I should also note that Chris Mattmann did call (see http://markmail.org/message/ux3uc72zlwarow5i) for the release to be made clearly either as an Incubator release or as a Lucene release once all the project migration is done. I guess I was the main proponent of pushing for the 0.2 release while the Lucene migration was still incomplete.

> 1) release naming: should probably be apache-tika-0.2-src.jar i seem to
> recall someone somewhere saying that was important for apache releases
> (and it's more consistent with the 0.1 release)

Good point, we probably should do that. Dave, can you take care of this?

> 2) release file format: the 0.1 release seems to have been a tar.gz ...
> was a conscious choice made by the community to switch to distributing as a
> src jar? otherwise you may want to publish both, or stick with tar.gz for
> consistency (the docs on the website refer to the tarball when giving
> examples of downloading and verifying)

I was pretty vocal about switching to the jar format for our source releases; see most notably http://markmail.org/message/mwi4w2odztsxlcgi and http://markmail.org/message/jnthn2q4pghqxjlc. Unless the PMC prefers a tarball, I would rather fix the documentation than change the packaging format.

> 3) incubator refs: as mentioned before, there are a lot of references to
> the incubator that should be switched to point to lucene...
>
> [EMAIL PROTECTED]:~/tmp/tika-release/rc1/tika-0.2$ grep -lir incubator .
> ./pom.xml
> ./src/site/apt/download.apt
> ./src/site/apt/index.apt
> ./README.txt

Fair point, and it goes with my statement above about getting the release out as soon as possible after graduation. In Tika trunk we've now updated all Incubator references, so any new release will have this issue fixed. Given the PMC pushback, perhaps we should just scrap the 0.2 release and go directly to 0.3 based on the current trunk?

> 4) user docs: (I think grant may have already mentioned this) The
> README.txt file talks about building Tika, but there doesn't seem to be
> anything in the release that describes how to use Tika ... has any thought
> been given to including more docs in the release itself? --
> gettingstarted.html perhaps? ... at the very least a paragraph should be
> added to the README referring to the gettingstarted.html page.
>
> Personally, i think including documentation.html and formats.html in the
> release are also important -- they're going to change between releases,
> probably more than the "getting started" type info, and should be
> "versioned" so moving forward people with older versions won't get
> misled by the docs on the site.

The available documentation is already included in the source release in src/site and can be generated with "mvn site". The fact that the documentation isn't complete (e.g. the Getting Started guide didn't yet exist in the 0.2 release candidate) shouldn't IMHO be a blocker for a release (especially for a 0.x one). In any case it's an area where we are clearly getting better during the 0.x release cycle.

The README could mention "mvn site" as the command to generate the official documentation for that release, and we could include a static snapshot of that in http://lucene.apache.org/tika/ for reference. This is something we should look at.

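As a rough sketch of what such a README note could boil down to: the helper name build_release_docs below is made up for illustration and is not part of Tika, and the target/site location is Maven's default site output directory, so it assumes the POM does not override it.

```shell
# Hypothetical helper, not part of Tika: generate the documentation for an
# unpacked source release and print where the HTML entry page ends up.
build_release_docs() {
  src_dir="${1:-.}"
  # A Maven build needs the project descriptor at the root of the tree.
  if [ ! -f "$src_dir/pom.xml" ]; then
    echo "no pom.xml in $src_dir" >&2
    return 1
  fi
  # Run the site build in a subshell so the caller's working directory is kept.
  (cd "$src_dir" && mvn site) || return 1
  # Assumption: the POM keeps Maven's default site output directory.
  echo "$src_dir/target/site/index.html"
}
```

Called as, say, build_release_docs ~/tmp/tika-release/rc1/tika-0.2, it should print the path of the generated index page once the Maven build finishes, or fail early if the directory is not a source tree.
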
> 5) artifacts missing: i tried following along with the gettingstarted.html
> (my first time using maven BTW so i may have messed something up) and ran
> into a snag... "mvn install" downloaded a bunch of dependencies (i think
> they were maven's own dependencies since i'd never used it before), ran
> some tests (these definitely had tika in the name) then downloaded some
> more things, then told me it was installing tika-0.2.jar in my ~/.m2
> directory. When i looked at the next section "Build artifacts" it referred
> to 3 jars in my target directory -- but i only have one...
>
> [EMAIL PROTECTED]:~/tmp/tika-release/rc1/tika-0.2$ find target -name \*jar
> target/tika-0.2.jar
>
> ...is the gettingstarted.html wrong, or did the build not run correctly?

The Getting Started guide is wrong in claiming that the standalone jar should be available in a 0.2 build. I've fixed this in revision 721589. Only the tika-0.2.jar is produced by the 0.2 build.

Currently the guide contains some forward-looking statements about the potentially upcoming 0.3 release, mostly that the "standalone" and "jdk14" artifacts are included in 0.3 (they are available in current trunk and the related Jira issues are targeted for release in 0.3). In general I think it's not a good idea to publish documents with such forward-looking statements, but in this case I think there is a pretty good consensus about the contents of Tika 0.3, and when writing the documentation I opted to publish forward-looking information rather than hold it back and have to revise the document later on.

> 6) RAT: Apache RAT noticed the following files missing license info...
>
> !?????
> /home/hossman/tmp/tika-release/rc1/tika-0.2/src/site/resources/tika.svg
> !?????
> /home/hossman/tmp/tika-release/rc1/tika-0.2/src/site/resources/tikaNoText.svg
> !?????
> /home/hossman/tmp/tika-release/rc1/tika-0.2/src/test/resources/test-documents/testHTML.html
> !?????
> /home/hossman/tmp/tika-release/rc1/tika-0.2/src/test/resources/test-documents/testHTML_utf8.html
> !?????
> /home/hossman/tmp/tika-release/rc1/tika-0.2/src/test/resources/test-documents/testRTF.rtf
> !?????
> /home/hossman/tmp/tika-release/rc1/tika-0.2/src/test/resources/test-documents/testTXT.txt
> !?????
> /home/hossman/tmp/tika-release/rc1/tika-0.2/src/test/resources/test-documents/testXHTML.html
> !?????
> /home/hossman/tmp/tika-release/rc1/tika-0.2/src/test/resources/test-documents/testXML.xml
>
> ...I don't know if i've ever heard an opinion on needing to include the
> ASL header in *.svg files (they are xml, but they are also clearly
> generated by inkscape), but I do remember someone pointing out that test
> data files in formats that are capable of containing comments in them (ie:
> xml, html, etc...) should include the ASL header, such as...
>
> http://svn.apache.org/repos/asf/lucene/solr/trunk/example/exampledocs/hd.xml

I think that having the license header in such test files disrupts the main purpose of the test cases (i.e. you want to check whether the extracted text contains some specific test phrase, not necessarily the Apache license header), so I prefer not to include the license header in those test files. See also http://markmail.org/message/m7jmgl3qncsffygb for related discussion on [EMAIL PROTECTED].

However, if the PMC so wishes, I don't see any big problem in us adding the license headers to these test files. Note that in some future test files this might be troublesome, but for existing tests I don't see problems with this.

> 7) javadocs: maybe this is something that is obvious to maven users, and
> as a non-maven user i just don't know the magic incantation, but i
> couldn't find any generated javadocs in the release (or in the "target"
> directory after running "mvn install") ... since Tika is primarily a
> library people will use in java apps, this seems kind of important.
> If there is a magic maven incantation to build these, let's include the
> instructions somewhere (since the gettingstarted guide suggests that maven
> is necessary to build tika, but not to use it (per the Artifacts and Ant
> sections))

Good point. The README could point out "mvn site" as the way to produce a browseable version of all documentation associated with the release, and as an added service we could (should?) also publish specific per-version documentation on the Tika web site. On the other hand, I don't see documentation as being a valid blocker for any 0.x release.

BR,

Jukka Zitting