Hi Hannes,

Here are the notable changes between 1.0 and 0.9, generated by:

curl "http://www.apache.org/dist/nutch/CHANGES-0.9.txt" > 0.9
curl "http://www.apache.org/dist/nutch/CHANGES-1.0.txt" > 1.0
diff -u 0.9 1.0

+Release 1.0 - 2009-03-23
+
+ 1. NUTCH-474 - Fetcher2 crawlDelay and blocking fix (Dogacan Guney via ab)
+
+ 2. NUTCH-443 - Allow parsers to return multiple Parse objects.
+    (Dogacan Guney et al, via ab)
+
+ 3. NUTCH-393 - Indexer should handle null documents returned by filters.
+    (Eelco Lempsink via ab)
+
+ 4. NUTCH-456 - Parse msexcel plugin speedup (Heiko Dietze via siren)
+
+ 5. NUTCH-446 - RobotRulesParser should ignore Crawl-delay values of other
+    bots in robots.txt (Dogacan Guney via siren)
+
+ 6. NUTCH-482 - Remove redundant plugin lib-log4j (siren)
+
+ 7. NUTCH-483 - Remove redundant commons-logging jar from ontology plugin
+    (siren)
+
+ 8. NUTCH-161 - Change Plain text parser to
+    use parser.character.encoding.default property for fall back encoding
+    (KuroSaka TeruHiko, siren)
+
+ 9. NUTCH-61 - Support for adaptive re-fetch interval and detection of
+    unmodified content. (ab)
+
+10. NUTCH-392 - OutputFormat implementations should pass on Progressable.
+    (cutting via ab)
+
+11. NUTCH-495 - Unnecessary delays in Fetcher2 (dogacan)
+
+12. NUTCH-443 - allow parsers to return multiple Parse object, this will speed
+    up the rss parser (dogacan via mattmann). This update is a fix and semantics
+    change from the original patch for NUTCH-443. The original patch did not tell
+    the Indexer to read crawl_parse too so that it can pickup sub-urls' fetch
+    datums. This patch addresses that issue. Now, if Fetcher gets a null content,
+    instead of pushing an empty content, it filters the null content.
+
+13. NUTCH-485 - Change HtmlParseFilter 's to return ParseResult object instead of
+    Parse object. (Gal Nitzan via dogacan)
+
+14. NUTCH-489 - URLFilter-suffix management of the url path when the url contains
+    some query parameters. (Emmanuel Joke via dogacan)
+
+15. NUTCH-502 - Bug in SegmentReader causes infinite loop.
+    (Ilya Vishnevsky via dogacan)
+
+16. NUTCH-444 Possibly use a different library to parse RSS feed for improved
+    performance and compatibility. This patch introduced a new plugin, feed,
+    that includes an index filter and a parse plugin for feeds that uses ROME.
+    There was discussion to remove parse-rss, in light of the feed plugin,
+    however, this patch does not explicitly remove parse-rss. (dogacan, mattmann)
+
+17. NUTCH-471 - Fix synchronization in NutchBean creation.
+    (Enis Soztutar via dogacan)
+
+18. Upgrade to Lucene 2.2.0 and Hadoop 0.12.3. (ab)
+
+19. NUTCH-468 - Scoring filter should distribute score to all outlinks at
+    once. (dogacan)
+
+20. NUTCH-504 - NUTCH-443 broke parsing during fetching. (dogacan)
+
+21. NUTCH-497 - Extreme Nested Tags causes StackOverflowException in
+    DomContentUtils...Spider Trap. (kubes)
+
+22. NUTCH-434 - Replace usage of ObjectWritable with something based on
+    GenericWritable. (dogacan)
+
+23. NUTCH-499 - Refactor LinkDb and LinkDbMerger to reuse code. (dogacan)
+
+24. NUTCH-498 - Use Combiner in LinkDb to increase speed of linkdb generation.
+    (Espen Amble Kolstad via dogacan)
+
+25. NUTCH-507 - lib-lucene-analyzers jar defintion is wrong in plugin.xml.
+    (Emmanuel Joke via dogacan)
+
+26. NUTCH-503 - Generator exits incorrectly for small fetchlists.
+    (Vishal Shah via dogacan)
+
+27. NUTCH-505 - Outlink urls should be validated. (dogacan)
+
+28. NUTCH-510 - IndexMerger delete working dir. (Enis Soztutar via dogacan)
+
+29. NUTCH-513 - suffix-urlfilter.txt does not have a template. (dogacan)
+
+30. NUTCH-515 - Next fetch time is set incorrectly. (dogacan)
+
+30. NUTCH-506 - Nutch should delegate compression to Hadoop. (dogacan)
+
+31. NUTCH-517 - build encoding should be UTF-8. (Enis Soztutar via dogacan).
+
+32. NUTCH-518 - Fix OpicScoringFilter to respect scoring filter chaining.
+    (Enis Soztutar via dogacan)
+
+33. NUTCH-516 - Next fetch time is not set when it is a
+    CrawlDatum.STATUS_FETCH_GONE. (Emmanuel Joke via dogacan)
+
+34. NUTCH-525 - DeleteDuplicates generates ArrayIndexOutOfBoundsException
+    when trying to rerun dedup on a segment. (Vishal Shah via dogacan)
+
+35. NUTCH-514 - Indexer should only index pages with fetch status SUCCESS.
+    (dogacan) Note: There is a bigger problem, i.e how to deal
+    with redirected pages, and this issue can be considered as a band-aid
+    for the time being. See NUTCH-273 and NUTCH-353 for more details.
+
+36. NUTCH-533 - LinkDbMerger: url normalized is not updated in the key and
+    inlinks list. (Emmanuel Joke via dogacan)
+
+37. NUTCH-535 -ParseData's contentMeta accumulates unnecessary values during
+    parse. (dogacan)
+
+38. NUTCH-522 - Use URLValidator in the Injector. (Emmanuel Joke, dogacan)
+
+39. NUTCH-536 - Reduce number of warnings in nutch core. (dogacan)
+
+40. NUTCH-439 - Top Level Domains Indexing / Scoring. Also adds
+    domain-related utilities. (Enis Soztutar via dogacan)
+
+41. NUTCH-544 - Upgrade Carrot2 clustering plugin to the newest stable
+    release (2.1). (Dawid Weiss via dogacan)
+
+42. NUTCH-545 - Configuration and OnlineClusterer get initialized in every
+    request. (Dawid Weiss via dogacan)
+
+43. NUTCH-532 - CrawlDbMerger: wrong computation of last fetch time.
+    (Emmanuel Joke via dogacan)
+
+44. NUTCH-550 - Parse fails if db.max.outlinks.per.page is -1. (dogacan)
+
+45. NUTCH-546 - file URL are filtered out by the crawler. (dogacan)
+
+46. NUTCH-554 - Generator throws IOException on invalid urls.
+    (Brian Whitman via ab)
+
+47. NUTCH-529 - NodeWalker.skipChildren doesn't work for more than 1 child.
+    (Emmanuel Joke via dogacan)
+
+48. NUTCH-25 - needs 'character encoding' detector.
+    (Doug Cook, dogacan, Marcin Okraszewski, Renaud Richardet via dogacan)
+
+49. NUTCH-508 - ${hadoop.log.dir} and ${hadoop.log.file} are not propagated
+    to the tasktracker. (Mathijs Homminga, Emmanuel Joke via dogacan)
+
+50. NUTCH-562 - Port mime type framework to use Tika mime detection framework.
+    (mattmann)
+
+51. NUTCH-488 - Avoid parsing uneccessary links and get a more relevant outlink
+    list. (Emmanuel Joke, Marcin Okraszewski via kubes)
+
+52. NUTCH-501 - Implement a different caching mechanism for objects cached in
+    configuration. (dogacan)
+
+53. NUTCH-552 - Upgrade Nutch to Hadoop 0.15.x. (kubes)
+
+54. NUTCH-565 - Arc File to Nutch Segments Converter. (kubes)
+
+55. NUTCH-547 - Redirection handling: YahooSlurp's algorithm.
+    (dogacan, kubes via dogacan)
+
+56. NUTCH-548 - Move URLNormalizer from Outlink to ParseOutputFormat.
+    (Emmanuel Joke via dogacan)
+
+57. NUTCH-538 - Delete unused classes under o.a.n.util. (dogacan)
+
+58. NUTCH-494 - FindBugs: CrawlDbReader and DeleteDuplicates. (dogacan)
+
+59. NUTCH-574 - Including inlink anchor text in index can create irrelevant
+    search results. Created index-anchor plugin, removed functionality from
+    index-basic plugin. For backwards compatibility, add index-anchor plugin to
+    nutch-site.xml plugin.includes. (kubes)
+
+60. NUTCH-581 - DistributedSearch does not update search servers added to
+    search-servers.txt on the fly. (Rohan Mehta via kubes)
+
+61. NUTCH-586 - Add option to run compiled classes without job file
+    (enis via ab)
+
+62. NUTCH-559 - NTLM, Basic and Digest Authentication schemes for web/proxy
+    server. (Susam Pal via dogacan)
+
+63. NUTCH-534 - SegmentMerger: add -normalize option (Emmanuel Joke via ab)
+
+64. NUTCH-528 - CrawlDbReader: add some new stats + dump into a CSV format
+    (Emmanuel Joke via ab)
+
+65. NUTCH-597 - NPE in Fetcher2 (Remco Verhoef via ab)
+
+66. NUTCH-584 - urls missing from fetchlist (Ruslan Ermilov, ab)
+
+67. NUTCH-580 - Remove deprecated hadoop api calls (FS) (siren)
+
+68. NUTCH-587 - Upgrade to Hadoop 0.15.3 (kubes)
+
+69. NUTCH-604 - Upgrade to Lucene 2.3.0 (ab)
+
+70. NUTCH-602 - Allow configurable number of handlers for search servers
+    (hartbecke via kubes)
+
+71. NUTCH-607 - Update build.xml to include tika jar when building war (kubes)
+
+72. NUTCH-608 - Upgrade nutch to use released apache-tika-0.1-incubating (mattmann)
+
+73. NUTCH-606 - Refactoring of Generator, run all urls through checks (kubes)
+
+74. NUTCH-605 - Change deprecated configuration methods for Hadoop (kubes)
+
+75. NUTCH-603 - Add more default url normalizations (kubes)
+
+76. NUTCH-611 - Upgrade Nutch to use Hadoop 0.16 (kubes)
+
+77. NUTCH-44 - Too many search results, limits max results returned from a
+    single search. (Emilijan Mirceski and Susam Pal via kubes)
+
+78. NUTCH-567 - Proper (?) handling of URIs in TagSoup. TagSoup library is
+    updated to 1.2 version. (dogacan)
+
+79. NUTCH-613 - Empty summaries and cached pages (kubes via ab)
+
+80. NUTCH-612 - URL filtering was disabled in Generator when invoked
+    from Crawl (Susam Pal via ab)
+
+81. NUTCH-601 - Recrawling on existing crawl directory (Susam Pal via ab)
+
+82. NUTCH-575 - NPE in OpenSearchServlet (John H. Lee via ab)
+
+83. NUTCH-126 - Fetching https does not work with a proxy (Fritz Elfert via ab)
+
+84. NUTCH-615 - Redirected URL-s fetched without setting fetchInterval.
+    Guard against reprUrl being null. (Emmanuel Joke, ab)
+
+85. NUTCH-616 - Reset Fetch Retry counter when fetch is successful (Emmanuel
+    Joke, ab)
+
+86. NUTCH-220 - Upgrade to PDFBox 0.7.3 (ab)
+
+87. NUTCH-223 - Crawl.java uses Integer.MAX_VALUE (Jeff Ritchie via ab)
+
+88. NUTCH-598 - Remove deprecated use of ToolBase. Use generics in Hadoop API.
+    (Emmanuel Joke, dogacan, ab)
+
+89. NUTCH-620 - BasicURLNormalizer should collapse runs of slashes with a
+    single slash. (Mark DeSpain via ab)
+
+90. NUTCH-500 - Add hadoop masters configuration file into conf folder.
+    (Emmanuel Joke via kubes)
+
+91. NUTCH-596 - ParseSegments parse content even if its not
+    CrawlDatum.STATUS_FETCH_SUCCESS (dogacan)
+
+92. NUTCH-618 - Tika error "Media type alias already exists" (mattmann,kubes)
+
+93. NUTCH-634 - Upgrade Nutch to Hadoop 0.17.1 (Michael Gottesman, Lincoln
+    Ritter, ab)
+
+94. NUTCH-641 - IndexSorter inorrectly copies stored fields (ab)
+
+95. NUTCH-645 - Parse-swf unit test failing (ab)
+
+96. NUTCH-642 - Unit tests fail when run in non-local mode (ab)
+
+97. NUTCH-639 - Change LuceneDocumentWrapper visibility from
+    private to _public_ (Guillaume Smet via dogacan)
+
+98. NUTCH-651 - Remove bin/{start|stop}-balancer.sh from svn
+    tracking. (dogacan)
+
+99. NUTCH-375 - Add support for Content-Encoding: deflated
+    (Pascal Beis, ab)
+
+100. NUTCH-633 - ParseSegment no longer allow reparsing.
+     (dogacan)
+
+101. NUTCH-653 - Upgrade to hadoop 0.18. (dogacan)
+
+102. NUTCH-621 - Nutch needs to declare it's crypto usage (mattmann)
+
+103. NUTCH-654 - urlfilter-regex's main does not work.
+     (dogacan)
+
+104. NUTCH-640 - confusing description "set it to Integer.MAX_VALUE".
+     (dogacan)
+
+105. NUTCH-662 - Upgrade Nutch to use Lucene 2.4. (kubes)
+
+106. NUTCH-663 - Upgrade Nutch to use Hadoop 0.19 (kubes)
+
+107. NUTCH-647 - Resolve URLs tool (kubes)
+
+108. NUTCH-665 - Search Load Testing Tool (kubes)
+
+109. NUTCH-667 - Input Format for working with Content in Hadoop Streaming
+     (kubes)
+
+110. NUTCH-635 - LinkAnalysis Tool for Nutch. (kubes)
+
+111. NUTCH-646 - New Indexing Framework for Nutch. (kubes)
+
+112. NUTCH-668 - Domain URL Filter. (kubes)
+
+113. NUTCH-594 - Serve Nutch search results in multiple formats including
+     XML and JSON. (kubes)
+
+114. NUTCH-442 - Integrate Solr/Nutch. (dogacan, original version by siren)
+
+115. NUTCH-652 - AdaptiveFetchSchedule#setFetchSchedule doesn't calculate
+     fetch interval correctly. (dogacan)
+
+116. NUTCH-627 - Minimize host address lookup (Otis Gospodnetic)
+
+117. NUTCH-678 - Hadoop 0.19 requires an update of jets3t.
+     (julien nioche via dogacan)
+
+118. NUTCH-681 - parse-mp3 compilation problem.
+     (Wildan Maulana via dogacan)
+
+119. NUTCH-676 - MapWritable is written inefficiently and confusingly.
+     (dogacan)
+
+120. NUTCH-579 - Feed plugin only indexes one post per feed due to identical
+     digest. (dogacan)
+
+121. NUTCH-571 - parse-mp3 plugin doesn't always index album of mp3.
+     (Joseph Chen, dogacan)
+
+122. NUTCH-682 - SOLR indexer does not set boost on the document.
+     (julien nioche via dogacan)
+
+123. NUTCH-279 - Additions to urlnormalizer-regex (Stefan Neufeind, ab)
+
+124. NUTCH-671 - JSP errors in Nutch searcher webapp (Edwin Chu via ab)
+
+125. NUTCH-643 - ClassCastException in PDF parser (Guillaume Smet, ab)
+
+126. NUTCH-636 - Httpclient plugin https doesn't work on IBM JRE
+     (Curtis d'Entremont, ab)
+
+127. NUTCH-683 - NUTCH-676 broke CrawlDbMerger. (dogacan)
+
+128. NUTCH-631 - MoreIndexingFilter fails with NoSuchElementException
+     (Stefan Will, siren)
+
+129. NUTCH-691 - Update jakarta poi jars to the most relevant version
+     (Dmitry Lihachev via siren)
+
+130. NUTCH-563 - Include custom fields in BasicQueryFilter
+     (Julien Nioche via siren)
+
+131. NUTCH-695 - Incorrect mime type detection by MoreIndexingFilter plugin
+     (Dmitry Lihachev via siren)
+
+132. NUTCH-694 - Distributed Search Server fails (siren)
+
+133. NUTCH-626 - Fetcher2 breaks out the domain with db.ignore.external.links
+     set at cross domain redirects (Remco Verhoef, dogacan via siren)
+
+134. NUTCH-247 - Robot parser to restrict (kubes, siren)
+
+135. NUTCH-698 - CrawlDb is corrupted after a few crawl cycles (dogacan
+     via siren)
+
+136. NUTCH-699 - Add an "official" solr schema for solr integration (dogacan,
+     Dmitry Lihachev via siren)
+
+137. NUTCH-703 - Upgrade to Hadoop 0.19.1 (ab)
+
+138. NUTCH-419 - Unavailable robots.txt kills fetch (Carsten Lehmann,
+     Doug Cook via ab)
+
+139. NUTCH-700 - Neko1.9.11 goes into a loop (Julien Nioche, siren)
+
+140. NUTCH-669 - Consolidate code for Fetcher and Fetcher2 (siren)
+
+141. NUTCH-711 - Indexer failing after upgrade to Hadoop 0.19.1 (ab)
+
+142. NUTCH-684 - Dedup support for Solr. (dogacan)
+
+143. NUTCH-715 - Subcollection plugin doesn't work with default
+     subcollections.xml file (Dmitry Lihachev via siren)
+
+144. NUTCH-722 - Nutch contains JAI jars that we cannot redistribute
+

I would say that some of the biggest differences include improved scoring, a
bunch of associated tools to look inside of the Nutch data, and improved speed
on fetching. In addition, the integration of Nutch and Solr first appeared in
1.0, so if you go backwards you won't be able to integrate with Solr (might
not be that big of an issue).

HTH,
Chris

On 7/16/10 9:10 AM, "Hannes Carl Meyer" <[email protected]> wrote:

> Hi,
>
> I'm currently using Nutch 1.0 to perform intranet crawl and index html and
> pdf contents.
> Unfortunately we are using Java 1.5 in our production env, that means I have
> to move to Nutch 0.9 since 1.1 and 1.0 requiring Java 6.
>
> Are there big differences between those versions which maybe impact my plans
> moving backwards to 0.9?
> (as I said it is performing intranet crawl only, NOT using lucene search
> interface and NOT using distributed mode)
>
> Thanks for your feedback
>
> Hannes
>

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

