Hi Chris, thanks for your summary. I don't like moving to older versions, but it seems to be feasible. Since I'm not using Solr and only wrote a plugin to export crawled pages into XML files, it should be fine!
Regards
Hannes

On Fri, Jul 16, 2010 at 8:33 PM, Mattmann, Chris A (388J) <[email protected]> wrote:

> Hi Hannes,
>
> Here are the noteable changes between 1.0 and 0.9, generated by:
>
> curl "http://www.apache.org/dist/nutch/CHANGES-0.9.txt" > 0.9
> curl "http://www.apache.org/dist/nutch/CHANGES-1.0.txt" > 1.0
> diff -u 0.9 1.0
>
> +Release 1.0 - 2009-03-23
> +
> + 1. NUTCH-474 - Fetcher2 crawlDelay and blocking fix (Dogacan Guney via ab)
> +
> + 2. NUTCH-443 - Allow parsers to return multiple Parse objects.
> +    (Dogacan Guney et al, via ab)
> +
> + 3. NUTCH-393 - Indexer should handle null documents returned by filters.
> +    (Eelco Lempsink via ab)
> +
> + 4. NUTCH-456 - Parse msexcel plugin speedup (Heiko Dietze via siren)
> +
> + 5. NUTCH-446 - RobotRulesParser should ignore Crawl-delay values of other
> +    bots in robots.txt (Dogacan Guney via siren)
> +
> + 6. NUTCH-482 - Remove redundant plugin lib-log4j (siren)
> +
> + 7. NUTCH-483 - Remove redundant commons-logging jar from ontology plugin
> +    (siren)
> +
> + 8. NUTCH-161 - Change Plain text parser to
> +    use parser.character.encoding.default property for fall back encoding
> +    (KuroSaka TeruHiko, siren)
> +
> + 9. NUTCH-61 - Support for adaptive re-fetch interval and detection of
> +    unmodified content. (ab)
> +
> +10. NUTCH-392 - OutputFormat implementations should pass on Progressable.
> +    (cutting via ab)
> +
> +11. NUTCH-495 - Unnecessary delays in Fetcher2 (dogacan)
> +
> +12. NUTCH-443 - allow parsers to return multiple Parse object, this will speed
> +    up the rss parser (dogacan via mattmann). This update is a fix and semantics
> +    change from the original patch for NUTCH-443. The original patch did not tell
> +    the Indexer to read crawl_parse too so that it can pickup sub-urls' fetch
> +    datums. This patch addresses that issue. Now, if Fetcher gets a null content,
> +    instead of pushing an empty content, it filters the null content.
> +
> +13. NUTCH-485 - Change HtmlParseFilter 's to return ParseResult object instead of
> +    Parse object. (Gal Nitzan via dogacan)
> +
> +14. NUTCH-489 - URLFilter-suffix management of the url path when the url contains
> +    some query parameters. (Emmanuel Joke via dogacan)
> +
> +15. NUTCH-502 - Bug in SegmentReader causes infinite loop.
> +    (Ilya Vishnevsky via dogacan)
> +
> +16. NUTCH-444 Possibly use a different library to parse RSS feed for improved
> +    performance and compatibility. This patch introduced a new plugin, feed,
> +    that includes an index filter and a parse plugin for feeds that uses ROME.
> +    There was discussion to remove parse-rss, in light of the feed plugin,
> +    however, this patch does not explicitly remove parse-rss. (dogacan, mattmann)
> +
> +17. NUTCH-471 - Fix synchronization in NutchBean creation.
> +    (Enis Soztutar via dogacan)
> +
> +18. Upgrade to Lucene 2.2.0 and Hadoop 0.12.3. (ab)
> +
> +19. NUTCH-468 - Scoring filter should distribute score to all outlinks at
> +    once. (dogacan)
> +
> +20. NUTCH-504 - NUTCH-443 broke parsing during fetching. (dogacan)
> +
> +21. NUTCH-497 - Extreme Nested Tags causes StackOverflowException in
> +    DomContentUtils...Spider Trap. (kubes)
> +
> +22. NUTCH-434 - Replace usage of ObjectWritable with something based on
> +    GenericWritable. (dogacan)
> +
> +23. NUTCH-499 - Refactor LinkDb and LinkDbMerger to reuse code. (dogacan)
> +
> +24. NUTCH-498 - Use Combiner in LinkDb to increase speed of linkdb generation.
> +    (Espen Amble Kolstad via dogacan)
> +
> +25. NUTCH-507 - lib-lucene-analyzers jar defintion is wrong in plugin.xml.
> +    (Emmanuel Joke via dogacan)
> +
> +26. NUTCH-503 - Generator exits incorrectly for small fetchlists.
> +    (Vishal Shah via dogacan)
> +
> +27. NUTCH-505 - Outlink urls should be validated. (dogacan)
> +
> +28. NUTCH-510 - IndexMerger delete working dir. (Enis Soztutar via dogacan)
> +
> +29. NUTCH-513 - suffix-urlfilter.txt does not have a template. (dogacan)
> +
> +30. NUTCH-515 - Next fetch time is set incorrectly. (dogacan)
> +
> +30. NUTCH-506 - Nutch should delegate compression to Hadoop. (dogacan)
> +
> +31. NUTCH-517 - build encoding should be UTF-8. (Enis Soztutar via dogacan).
> +
> +32. NUTCH-518 - Fix OpicScoringFilter to respect scoring filter chaining.
> +    (Enis Soztutar via dogacan)
> +
> +33. NUTCH-516 - Next fetch time is not set when it is a
> +    CrawlDatum.STATUS_FETCH_GONE. (Emmanuel Joke via dogacan)
> +
> +34. NUTCH-525 - DeleteDuplicates generates ArrayIndexOutOfBoundsException
> +    when trying to rerun dedup on a segment. (Vishal Shah via dogacan)
> +
> +35. NUTCH-514 - Indexer should only index pages with fetch status SUCCESS.
> +    (dogacan) Note: There is a bigger problem, i.e how to deal
> +    with redirected pages, and this issue can be considered as a band-aid
> +    for the time being. See NUTCH-273 and NUTCH-353 for more details.
> +
> +36. NUTCH-533 - LinkDbMerger: url normalized is not updated in the key and
> +    inlinks list. (Emmanuel Joke via dogacan)
> +
> +37. NUTCH-535 -ParseData's contentMeta accumulates unnecessary values during
> +    parse. (dogacan)
> +
> +38. NUTCH-522 - Use URLValidator in the Injector. (Emmanuel Joke, dogacan)
> +
> +39. NUTCH-536 - Reduce number of warnings in nutch core. (dogacan)
> +
> +40. NUTCH-439 - Top Level Domains Indexing / Scoring. Also adds
> +    domain-related utilities. (Enis Soztutar via dogacan)
> +
> +41. NUTCH-544 - Upgrade Carrot2 clustering plugin to the newest stable
> +    release (2.1). (Dawid Weiss via dogacan)
> +
> +42. NUTCH-545 - Configuration and OnlineClusterer get initialized in every
> +    request. (Dawid Weiss via dogacan)
> +
> +43. NUTCH-532 - CrawlDbMerger: wrong computation of last fetch time.
> +    (Emmanuel Joke via dogacan)
> +
> +44. NUTCH-550 - Parse fails if db.max.outlinks.per.page is -1. (dogacan)
> +
> +45. NUTCH-546 - file URL are filtered out by the crawler. (dogacan)
> +
> +46. NUTCH-554 - Generator throws IOException on invalid urls.
> +    (Brian Whitman via ab)
> +
> +47. NUTCH-529 - NodeWalker.skipChildren doesn't work for more than 1 child.
> +    (Emmanuel Joke via dogacan)
> +
> +48. NUTCH-25 - needs 'character encoding' detector.
> +    (Doug Cook, dogacan, Marcin Okraszewski, Renaud Richardet via dogacan)
> +
> +49. NUTCH-508 - ${hadoop.log.dir} and ${hadoop.log.file} are not propagated
> +    to the tasktracker. (Mathijs Homminga, Emmanuel Joke via dogacan)
> +
> +50. NUTCH-562 - Port mime type framework to use Tika mime detection framework.
> +    (mattmann)
> +
> +51. NUTCH-488 - Avoid parsing uneccessary links and get a more relevant outlink
> +    list. (Emmanuel Joke, Marcin Okraszewski via kubes)
> +
> +52. NUTCH-501 - Implement a different caching mechanism for objects cached in
> +    configuration. (dogacan)
> +
> +53. NUTCH-552 - Upgrade Nutch to Hadoop 0.15.x. (kubes)
> +
> +54. NUTCH-565 - Arc File to Nutch Segments Converter. (kubes)
> +
> +55. NUTCH-547 - Redirection handling: YahooSlurp's algorithm.
> +    (dogacan, kubes via dogacan)
> +
> +56. NUTCH-548 - Move URLNormalizer from Outlink to ParseOutputFormat.
> +    (Emmanuel Joke via dogacan)
> +
> +57. NUTCH-538 - Delete unused classes under o.a.n.util. (dogacan)
> +
> +58. NUTCH-494 - FindBugs: CrawlDbReader and DeleteDuplicates. (dogacan)
> +
> +59. NUTCH-574 - Including inlink anchor text in index can create irrelevant
> +    search results. Created index-anchor plugin, removed functionality from
> +    index-basic plugin. For backwards compatibility, add index-anchor plugin to
> +    nutch-site.xml plugin.includes. (kubes)
> +
> +60. NUTCH-581 - DistributedSearch does not update search servers added to
> +    search-servers.txt on the fly. (Rohan Mehta via kubes)
> +
> +61. NUTCH-586 - Add option to run compiled classes without job file
> +    (enis via ab)
> +
> +62. NUTCH-559 - NTLM, Basic and Digest Authentication schemes for web/proxy
> +    server. (Susam Pal via dogacan)
> +
> +63. NUTCH-534 - SegmentMerger: add -normalize option (Emmanuel Joke via ab)
> +
> +64. NUTCH-528 - CrawlDbReader: add some new stats + dump into a CSV format
> +    (Emmanuel Joke via ab)
> +
> +65. NUTCH-597 - NPE in Fetcher2 (Remco Verhoef via ab)
> +
> +66. NUTCH-584 - urls missing from fetchlist (Ruslan Ermilov, ab)
> +
> +67. NUTCH-580 - Remove deprecated hadoop api calls (FS) (siren)
> +
> +68. NUTCH-587 - Upgrade to Hadoop 0.15.3 (kubes)
> +
> +69. NUTCH-604 - Upgrade to Lucene 2.3.0 (ab)
> +
> +70. NUTCH-602 - Allow configurable number of handlers for search servers
> +    (hartbecke via kubes)
> +
> +71. NUTCH-607 - Update build.xml to include tika jar when building war (kubes)
> +
> +72. NUTCH-608 - Upgrade nutch to use released apache-tika-0.1-incubating (mattmann)
> +
> +73. NUTCH-606 - Refactoring of Generator, run all urls through checks (kubes)
> +
> +74. NUTCH-605 - Change deprecated configuration methods for Hadoop (kubes)
> +
> +75. NUTCH-603 - Add more default url normalizations (kubes)
> +
> +76. NUTCH-611 - Upgrade Nutch to use Hadoop 0.16 (kubes)
> +
> +77. NUTCH-44 - Too many search results, limits max results returned from a
> +    single search. (Emilijan Mirceski and Susam Pal via kubes)
> +
> +78. NUTCH-567 - Proper (?) handling of URIs in TagSoup. TagSoup library is
> +    updated to 1.2 version. (dogacan)
> +
> +79. NUTCH-613 - Empty summaries and cached pages (kubes via ab)
> +
> +80. NUTCH-612 - URL filtering was disabled in Generator when invoked
> +    from Crawl (Susam Pal via ab)
> +
> +81. NUTCH-601 - Recrawling on existing crawl directory (Susam Pal via ab)
> +
> +82. NUTCH-575 - NPE in OpenSearchServlet (John H. Lee via ab)
> +
> +83. NUTCH-126 - Fetching https does not work with a proxy (Fritz Elfert via ab)
> +
> +84. NUTCH-615 - Redirected URL-s fetched without setting fetchInterval.
> +    Guard against reprUrl being null. (Emmanuel Joke, ab)
> +
> +85. NUTCH-616 - Reset Fetch Retry counter when fetch is successful (Emmanuel
> +    Joke, ab)
> +
> +86. NUTCH-220 - Upgrade to PDFBox 0.7.3 (ab)
> +
> +87. NUTCH-223 - Crawl.java uses Integer.MAX_VALUE (Jeff Ritchie via ab)
> +
> +88. NUTCH-598 - Remove deprecated use of ToolBase. Use generics in Hadoop API.
> +    (Emmanuel Joke, dogacan, ab)
> +
> +89. NUTCH-620 - BasicURLNormalizer should collapse runs of slashes with a
> +    single slash. (Mark DeSpain via ab)
> +
> +90. NUTCH-500 - Add hadoop masters configuration file into conf folder.
> +    (Emmanuel Joke via kubes)
> +
> +91. NUTCH-596 - ParseSegments parse content even if its not
> +    CrawlDatum.STATUS_FETCH_SUCCESS (dogacan)
> +
> +92. NUTCH-618 - Tika error "Media type alias already exists" (mattmann,kubes)
> +
> +93. NUTCH-634 - Upgrade Nutch to Hadoop 0.17.1 (Michael Gottesman, Lincoln
> +    Ritter, ab)
> +
> +94. NUTCH-641 - IndexSorter inorrectly copies stored fields (ab)
> +
> +95. NUTCH-645 - Parse-swf unit test failing (ab)
> +
> +96. NUTCH-642 - Unit tests fail when run in non-local mode (ab)
> +
> +97. NUTCH-639 - Change LuceneDocumentWrapper visibility from
> +    private to _public_ (Guillaume Smet via dogacan)
> +
> +98. NUTCH-651 - Remove bin/{start|stop}-balancer.sh from svn
> +    tracking. (dogacan)
> +
> +99. NUTCH-375 - Add support for Content-Encoding: deflated
> +    (Pascal Beis, ab)
> +
> +100. NUTCH-633 - ParseSegment no longer allow reparsing.
> +     (dogacan)
> +
> +101. NUTCH-653 - Upgrade to hadoop 0.18. (dogacan)
> +
> +102. NUTCH-621 - Nutch needs to declare it's crypto usage (mattmann)
> +
> +103. NUTCH-654 - urlfilter-regex's main does not work.
> +     (dogacan)
> +
> +104. NUTCH-640 - confusing description "set it to Integer.MAX_VALUE".
> +     (dogacan)
> +
> +105. NUTCH-662 - Upgrade Nutch to use Lucene 2.4. (kubes)
> +
> +106. NUTCH-663 - Upgrade Nutch to use Hadoop 0.19 (kubes)
> +
> +107. NUTCH-647 - Resolve URLs tool (kubes)
> +
> +108. NUTCH-665 - Search Load Testing Tool (kubes)
> +
> +109. NUTCH-667 - Input Format for working with Content in Hadoop Streaming
> +     (kubes)
> +
> +110. NUTCH-635 - LinkAnalysis Tool for Nutch. (kubes)
> +
> +111. NUTCH-646 - New Indexing Framework for Nutch. (kubes)
> +
> +112. NUTCH-668 - Domain URL Filter. (kubes)
> +
> +113. NUTCH-594 - Serve Nutch search results in multiple formats including
> +     XML and JSON. (kubes)
> +
> +114. NUTCH-442 - Integrate Solr/Nutch. (dogacan, original version by siren)
> +
> +115. NUTCH-652 - AdaptiveFetchSchedule#setFetchSchedule doesn't calculate
> +     fetch interval correctly. (dogacan)
> +
> +116. NUTCH-627 - Minimize host address lookup (Otis Gospodnetic)
> +
> +117. NUTCH-678 - Hadoop 0.19 requires an update of jets3t.
> +     (julien nioche via dogacan)
> +
> +118. NUTCH-681 - parse-mp3 compilation problem.
> +     (Wildan Maulana via dogacan)
> +
> +119. NUTCH-676 - MapWritable is written inefficiently and confusingly.
> +     (dogacan)
> +
> +120. NUTCH-579 - Feed plugin only indexes one post per feed due to identical
> +     digest. (dogacan)
> +
> +121. NUTCH-571 - parse-mp3 plugin doesn't always index album of mp3.
> +     (Joseph Chen, dogacan)
> +
> +122. NUTCH-682 - SOLR indexer does not set boost on the document.
> +     (julien nioche via dogacan)
> +
> +123. NUTCH-279 - Additions to urlnormalizer-regex (Stefan Neufeind, ab)
> +
> +124. NUTCH-671 - JSP errors in Nutch searcher webapp (Edwin Chu via ab)
> +
> +125. NUTCH-643 - ClassCastException in PDF parser (Guillaume Smet, ab)
> +
> +126. NUTCH-636 - Httpclient plugin https doesn't work on IBM JRE
> +     (Curtis d'Entremont, ab)
> +
> +127. NUTCH-683 - NUTCH-676 broke CrawlDbMerger. (dogacan)
> +
> +128. NUTCH-631 - MoreIndexingFilter fails with NoSuchElementException
> +     (Stefan Will, siren)
> +
> +129. NUTCH-691 - Update jakarta poi jars to the most relevant version
> +     (Dmitry Lihachev via siren)
> +
> +130. NUTCH-563 - Include custom fields in BasicQueryFilter
> +     (Julien Nioche via siren)
> +
> +131. NUTCH-695 - Incorrect mime type detection by MoreIndexingFilter plugin
> +     (Dmitry Lihachev via siren)
> +
> +132. NUTCH-694 - Distributed Search Server fails (siren)
> +
> +133. NUTCH-626 - Fetcher2 breaks out the domain with db.ignore.external.links
> +     set at cross domain redirects (Remco Verhoef, dogacan via siren)
> +
> +134. NUTCH-247 - Robot parser to restrict (kubes, siren)
> +
> +135. NUTCH-698 - CrawlDb is corrupted after a few crawl cycles (dogacan
> +     via siren)
> +
> +136. NUTCH-699 - Add an "official" solr schema for solr integration (dogacan,
> +     Dmitry Lihachev via siren)
> +
> +137. NUTCH-703 - Upgrade to Hadoop 0.19.1 (ab)
> +
> +138. NUTCH-419 - Unavailable robots.txt kills fetch (Carsten Lehmann,
> +     Doug Cook via ab)
> +
> +139. NUTCH-700 - Neko1.9.11 goes into a loop (Julien Nioche, siren)
> +
> +140. NUTCH-669 - Consolidate code for Fetcher and Fetcher2 (siren)
> +
> +141. NUTCH-711 - Indexer failing after upgrade to Hadoop 0.19.1 (ab)
> +
> +142. NUTCH-684 - Dedup support for Solr. (dogacan)
> +
> +143. NUTCH-715 - Subcollection plugin doesn't work with default
> +     subcollections.xml file (Dmitry Lihachev via siren)
> +
> +144. NUTCH-722 - Nutch contains JAI jars that we cannot redistribute
> +
>
> I would say that some of the biggest differences include improved scoring, a
> bunch of associated tools to look inside of the Nutch data, and improved
> speed on fetching. In addition, the first appearance of the integration of
> Nutch and Solr appeared in 1.0, so if you go backwards you won't be able to
> integrate with Solr (might not be that big of an issue).
>
> HTH,
> Chris
>
>
>
> On 7/16/10 9:10 AM, "Hannes Carl Meyer" <[email protected]> wrote:
>
> > Hi,
> >
> > I'm currently using Nutch 1.0 to perform intranet crawl and index html and
> > pdf contents.
> > Unfortunately we are using Java 1.5 in our production env, that means I have
> > to move to Nutch 0.9 since 1.1 and 1.0 requiring Java 6.
> >
> > Are there big differences between those versions which maybe impact my plans
> > moving backwards to 0.9?
> > (as I said it is performing intranet crawl only, NOT using lucene search
> > interface and NOT using distributed mode)
> >
> > Thanks for your feedback
> >
> > Hannes
> >
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Senior Computer Scientist
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 171-266B, Mailstop: 171-246
> Email: [email protected]
> WWW: http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Assistant Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>

--
https://www.xing.com/profile/HannesCarl_Meyer
http://de.linkedin.com/in/hannescarlmeyer
http://twitter.com/hannescarlmeyer
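
For anyone repeating the comparison, the curl and diff commands quoted in this thread can be extended to list only the issue keys that are new in the later changelog. This is a rough sketch and not something posted in the thread: `new_issues` is a hypothetical helper name, it assumes both changelogs have already been saved to local files (as the curl commands do), and it relies on `grep -o`, which is widely available but not required by POSIX.

```shell
#!/bin/sh
# new_issues is a hypothetical helper (an assumption, not part of Nutch).
# It prints every NUTCH-<number> key found on a line that diff reports as
# added in NEW_FILE relative to OLD_FILE, one key per line, sorted and
# de-duplicated.
new_issues() {
    old_file=$1
    new_file=$2
    # diff exits with status 1 when the files differ, which is the expected
    # case here, so status 1 is treated as success. Added lines start with
    # "+"; the "+++ " line is the unified-diff file header and is dropped.
    { diff -u "$old_file" "$new_file" || [ $? -eq 1 ]; } \
        | grep '^+' \
        | grep -v '^+++ ' \
        | grep -o 'NUTCH-[0-9][0-9]*' \
        | sort -u
}
```

With the file names used in the quoted commands, the call would be `new_issues 0.9 1.0`. Keys that are only mentioned inside another entry's description (for example a "See NUTCH-..." note) are listed as well, so the output is a superset of the entries actually added.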

