Re: Differences between 0.9 / 1.0

Mattmann, Chris A (388J) Fri, 16 Jul 2010 11:35:21 -0700

Hi Hannes,

Here are the notable changes between 1.0 and 0.9, generated by:


curl "http://www.apache.org/dist/nutch/CHANGES-0.9.txt" > 0.9
curl "http://www.apache.org/dist/nutch/CHANGES-1.0.txt" > 1.0
diff -u 0.9 1.0

+Release 1.0 - 2009-03-23
+
+ 1. NUTCH-474 - Fetcher2 crawlDelay and blocking fix (Dogacan Guney via ab)
+
+ 2. NUTCH-443 - Allow parsers to return multiple Parse objects.
+    (Dogacan Guney et al, via ab)
+
+ 3. NUTCH-393 - Indexer should handle null documents returned by filters.
+    (Eelco Lempsink via ab)
+
+ 4. NUTCH-456 - Parse msexcel plugin speedup (Heiko Dietze via siren)
+
+ 5. NUTCH-446 - RobotRulesParser should ignore Crawl-delay values of other
+    bots in robots.txt (Dogacan Guney via siren)
+
+ 6. NUTCH-482 - Remove redundant plugin lib-log4j (siren)
+
+ 7. NUTCH-483 - Remove redundant commons-logging jar from ontology plugin
+    (siren)
+
+ 8. NUTCH-161 - Change Plain text parser to
+    use parser.character.encoding.default property for fall back encoding
+    (KuroSaka TeruHiko, siren)
+
+ 9. NUTCH-61 - Support for adaptive re-fetch interval and detection of
+    unmodified content. (ab)
+
+10. NUTCH-392 - OutputFormat implementations should pass on Progressable.
+    (cutting via ab)
+
+11. NUTCH-495 - Unnecessary delays in Fetcher2 (dogacan)
+
+12. NUTCH-443 - allow parsers to return multiple Parse object, this will speed
+    up the rss parser (dogacan via mattmann). This update is a fix and semantics
+    change from the original patch for NUTCH-443. The original patch did not tell
+    the  Indexer to read crawl_parse too so that it can pickup sub-urls' fetch
+    datums. This patch addresses that issue. Now, if Fetcher gets a null content,
+    instead of pushing an empty content, it filters the null content.
+
+13. NUTCH-485 - Change HtmlParseFilter 's to return ParseResult object instead of
+    Parse object. (Gal Nitzan via dogacan)
+
+14. NUTCH-489 - URLFilter-suffix management of the url path when the url contains
+    some query parameters. (Emmanuel Joke via dogacan)
+
+15. NUTCH-502 - Bug in SegmentReader causes infinite loop.
+    (Ilya Vishnevsky via dogacan)
+
+16. NUTCH-444 Possibly use a different library to parse RSS feed for improved
+    performance and compatibility. This patch introduced a new plugin, feed,
+    that includes an index filter and a parse plugin for feeds that uses ROME.
+    There was discussion to remove parse-rss, in light of the feed plugin,
+    however, this patch does not explicitly remove parse-rss. (dogacan, mattmann)
+
+17. NUTCH-471 - Fix synchronization in NutchBean creation.
+    (Enis Soztutar via dogacan)
+
+18. Upgrade to Lucene 2.2.0 and Hadoop 0.12.3. (ab)
+
+19. NUTCH-468 - Scoring filter should distribute score to all outlinks at
+    once. (dogacan)
+
+20. NUTCH-504 - NUTCH-443 broke parsing during fetching. (dogacan)
+
+21. NUTCH-497 -  Extreme Nested Tags causes StackOverflowException in
+    DomContentUtils...Spider Trap. (kubes)
+
+22. NUTCH-434 - Replace usage of ObjectWritable with something based on
+    GenericWritable. (dogacan)
+
+23. NUTCH-499 - Refactor LinkDb and LinkDbMerger to reuse code. (dogacan)
+
+24. NUTCH-498 - Use Combiner in LinkDb to increase speed of linkdb generation.
+    (Espen Amble Kolstad via dogacan)
+
+25. NUTCH-507 - lib-lucene-analyzers jar defintion is wrong in plugin.xml.
+    (Emmanuel Joke via dogacan)
+
+26. NUTCH-503 - Generator exits incorrectly for small fetchlists.
+    (Vishal Shah via dogacan)
+
+27. NUTCH-505 - Outlink urls should be validated. (dogacan)
+
+28. NUTCH-510 - IndexMerger delete working dir. (Enis Soztutar via dogacan)
+
+29. NUTCH-513 - suffix-urlfilter.txt does not have a template. (dogacan)
+
+30. NUTCH-515 - Next fetch time is set incorrectly. (dogacan)
+
+30. NUTCH-506 - Nutch should delegate compression to Hadoop. (dogacan)
+
+31. NUTCH-517 - build encoding should be UTF-8. (Enis Soztutar via dogacan).
+
+32. NUTCH-518 - Fix OpicScoringFilter to respect scoring filter chaining.
+    (Enis Soztutar via dogacan)
+
+33. NUTCH-516 - Next fetch time is not set when it is a
+    CrawlDatum.STATUS_FETCH_GONE. (Emmanuel Joke via dogacan)
+
+34. NUTCH-525 - DeleteDuplicates generates ArrayIndexOutOfBoundsException
+    when trying to rerun dedup on a segment. (Vishal Shah via dogacan)
+
+35. NUTCH-514 - Indexer should only index pages with fetch status SUCCESS.
+    (dogacan) Note: There is a bigger problem, i.e how to deal
+    with redirected pages, and this issue can be considered as a band-aid
+    for the time being. See NUTCH-273 and NUTCH-353 for more details.
+
+36. NUTCH-533 - LinkDbMerger: url normalized is not updated in the key and
+    inlinks list. (Emmanuel Joke via dogacan)
+
+37. NUTCH-535 -ParseData's contentMeta accumulates unnecessary values during
+    parse. (dogacan)
+
+38. NUTCH-522 - Use URLValidator in the Injector. (Emmanuel Joke, dogacan)
+
+39. NUTCH-536 - Reduce number of warnings in nutch core. (dogacan)
+
+40. NUTCH-439 - Top Level Domains Indexing / Scoring. Also adds
+    domain-related utilities. (Enis Soztutar via dogacan)
+
+41. NUTCH-544 - Upgrade Carrot2 clustering plugin to the newest stable
+    release (2.1). (Dawid Weiss via dogacan)
+
+42. NUTCH-545 - Configuration and OnlineClusterer get initialized in every
+    request. (Dawid Weiss via dogacan)
+
+43. NUTCH-532 - CrawlDbMerger: wrong computation of last fetch time.
+    (Emmanuel Joke via dogacan)
+
+44. NUTCH-550 - Parse fails if db.max.outlinks.per.page is -1. (dogacan)
+
+45. NUTCH-546 - file URL are filtered out by the crawler. (dogacan)
+
+46. NUTCH-554 - Generator throws IOException on invalid urls.
+    (Brian Whitman via ab)
+
+47. NUTCH-529 - NodeWalker.skipChildren doesn't work for more than 1 child.
+    (Emmanuel Joke via dogacan)
+
+48. NUTCH-25 - needs 'character encoding' detector.
+    (Doug Cook, dogacan, Marcin Okraszewski, Renaud Richardet via dogacan)
+
+49. NUTCH-508 - ${hadoop.log.dir} and ${hadoop.log.file} are not propagated
+    to the tasktracker. (Mathijs Homminga, Emmanuel Joke via dogacan)
+
+50. NUTCH-562 - Port mime type framework to use Tika mime detection framework.
+    (mattmann)
+
+51. NUTCH-488 - Avoid parsing uneccessary links and get a more relevant outlink
+    list. (Emmanuel Joke, Marcin Okraszewski via kubes)
+
+52. NUTCH-501 -  Implement a different caching mechanism for objects cached in
+    configuration. (dogacan)
+
+53. NUTCH-552 - Upgrade Nutch to Hadoop 0.15.x. (kubes)
+
+54. NUTCH-565 - Arc File to Nutch Segments Converter. (kubes)
+
+55. NUTCH-547 - Redirection handling: YahooSlurp's algorithm.
+    (dogacan, kubes via dogacan)
+
+56. NUTCH-548 - Move URLNormalizer from Outlink to ParseOutputFormat.
+    (Emmanuel Joke via dogacan)
+
+57. NUTCH-538 - Delete unused classes under o.a.n.util. (dogacan)
+
+58. NUTCH-494 - FindBugs: CrawlDbReader and DeleteDuplicates. (dogacan)
+
+59. NUTCH-574 - Including inlink anchor text in index can create irrelevant
+    search results.  Created index-anchor plugin, removed functionality from
+    index-basic plugin. For backwards compatibility, add index-anchor plugin to
+    nutch-site.xml plugin.includes. (kubes)
+
+60. NUTCH-581 - DistributedSearch does not update search servers added to
+    search-servers.txt on the fly.  (Rohan Mehta via kubes)
+
+61. NUTCH-586 - Add option to run compiled classes without job file
+    (enis via ab)
+
+62. NUTCH-559 - NTLM, Basic and Digest Authentication schemes for web/proxy
+    server. (Susam Pal via dogacan)
+
+63. NUTCH-534 - SegmentMerger: add -normalize option (Emmanuel Joke via ab)
+
+64. NUTCH-528 - CrawlDbReader: add some new stats + dump into a CSV format
+    (Emmanuel Joke via ab)
+
+65. NUTCH-597 - NPE in Fetcher2 (Remco Verhoef via ab)
+
+66. NUTCH-584 - urls missing from fetchlist (Ruslan Ermilov, ab)
+
+67. NUTCH-580 - Remove deprecated hadoop api calls (FS) (siren)
+
+68. NUTCH-587 - Upgrade to Hadoop 0.15.3 (kubes)
+
+69. NUTCH-604 - Upgrade to Lucene 2.3.0 (ab)
+
+70. NUTCH-602 - Allow configurable number of handlers for search servers
+    (hartbecke via kubes)
+
+71. NUTCH-607 - Update build.xml to include tika jar when building war (kubes)
+
+72. NUTCH-608 - Upgrade nutch to use released apache-tika-0.1-incubating (mattmann)
+
+73. NUTCH-606 - Refactoring of Generator, run all urls through checks (kubes)
+
+74. NUTCH-605 - Change deprecated configuration methods for Hadoop (kubes)
+
+75. NUTCH-603 - Add more default url normalizations (kubes)
+
+76. NUTCH-611 - Upgrade Nutch to use Hadoop 0.16 (kubes)
+
+77. NUTCH-44 - Too many search results, limits max results returned from a
+    single search. (Emilijan Mirceski and Susam Pal via kubes)
+
+78. NUTCH-567 - Proper (?) handling of URIs in TagSoup. TagSoup library is
+    updated to 1.2 version. (dogacan)
+
+79. NUTCH-613 - Empty summaries and cached pages (kubes via ab)
+
+80. NUTCH-612 - URL filtering was disabled in Generator when invoked
+    from Crawl (Susam Pal via ab)
+
+81. NUTCH-601 - Recrawling on existing crawl directory (Susam Pal via ab)
+
+82. NUTCH-575 - NPE in OpenSearchServlet (John H. Lee via ab)
+
+83. NUTCH-126 - Fetching https does not work with a proxy (Fritz Elfert via ab)
+
+84. NUTCH-615 - Redirected URL-s fetched without setting fetchInterval.
+    Guard against reprUrl being null. (Emmanuel Joke, ab)
+
+85. NUTCH-616 - Reset Fetch Retry counter when fetch is successful (Emmanuel
+    Joke, ab)
+
+86. NUTCH-220 - Upgrade to PDFBox 0.7.3 (ab)
+
+87. NUTCH-223 - Crawl.java uses Integer.MAX_VALUE (Jeff Ritchie via ab)
+
+88. NUTCH-598 - Remove deprecated use of ToolBase. Use generics in Hadoop API.
+    (Emmanuel Joke, dogacan, ab)
+
+89. NUTCH-620 - BasicURLNormalizer should collapse runs of slashes with a
+    single slash. (Mark DeSpain via ab)
+
+90. NUTCH-500 - Add hadoop masters configuration file into conf folder.
+    (Emmanuel Joke via kubes)
+
+91. NUTCH-596 - ParseSegments parse content even if its not
+    CrawlDatum.STATUS_FETCH_SUCCESS (dogacan)
+
+92. NUTCH-618 - Tika error "Media type alias already exists" (mattmann,kubes)
+
+93. NUTCH-634 - Upgrade Nutch to Hadoop 0.17.1 (Michael Gottesman, Lincoln
+    Ritter, ab)
+
+94. NUTCH-641 - IndexSorter inorrectly copies stored fields (ab)
+
+95. NUTCH-645 - Parse-swf unit test failing (ab)
+
+96. NUTCH-642 - Unit tests fail when run in non-local mode (ab)
+
+97. NUTCH-639 - Change LuceneDocumentWrapper visibility from
+    private to _public_ (Guillaume Smet via dogacan)
+
+98. NUTCH-651 - Remove bin/{start|stop}-balancer.sh from svn
+    tracking. (dogacan)
+
+99. NUTCH-375 - Add support for Content-Encoding: deflated
+    (Pascal Beis, ab)
+
+100. NUTCH-633 - ParseSegment no longer allow reparsing.
+     (dogacan)
+
+101. NUTCH-653 - Upgrade to hadoop 0.18. (dogacan)
+
+102. NUTCH-621 - Nutch needs to declare it's crypto usage (mattmann)
+
+103. NUTCH-654 - urlfilter-regex's main does not work.
+     (dogacan)
+
+104. NUTCH-640 - confusing description "set it to Integer.MAX_VALUE".
+     (dogacan)
+
+105. NUTCH-662 - Upgrade Nutch to use Lucene 2.4. (kubes)
+
+106. NUTCH-663 - Upgrade Nutch to use Hadoop 0.19 (kubes)
+
+107. NUTCH-647 - Resolve URLs tool (kubes)
+
+108. NUTCH-665 - Search Load Testing Tool (kubes)
+
+109. NUTCH-667 - Input Format for working with Content in Hadoop Streaming
+                 (kubes)
+
+110. NUTCH-635 -  LinkAnalysis Tool for Nutch. (kubes)
+
+111. NUTCH-646 -  New Indexing Framework for Nutch. (kubes)
+
+112. NUTCH-668 -  Domain URL Filter. (kubes)
+
+113. NUTCH-594 -  Serve Nutch search results in multiple formats including
+                  XML and JSON. (kubes)
+
+114. NUTCH-442 - Integrate Solr/Nutch. (dogacan, original version by siren)
+
+115. NUTCH-652 - AdaptiveFetchSchedule#setFetchSchedule doesn't calculate
+                 fetch interval correctly. (dogacan)
+
+116. NUTCH-627 - Minimize host address lookup (Otis Gospodnetic)
+
+117. NUTCH-678 - Hadoop 0.19 requires an update of jets3t.
+                 (julien nioche via dogacan)
+
+118. NUTCH-681 - parse-mp3 compilation problem.
+                 (Wildan Maulana via dogacan)
+
+119. NUTCH-676 - MapWritable is written inefficiently and confusingly.
+                 (dogacan)
+
+120. NUTCH-579 - Feed plugin only indexes one post per feed due to identical
+                 digest. (dogacan)
+
+121. NUTCH-571 - parse-mp3 plugin doesn't always index album of mp3.
+                 (Joseph Chen, dogacan)
+
+122. NUTCH-682 - SOLR indexer does not set boost on the document.
+                 (julien nioche via dogacan)
+
+123. NUTCH-279 - Additions to urlnormalizer-regex (Stefan Neufeind, ab)
+
+124. NUTCH-671 - JSP errors in Nutch searcher webapp (Edwin Chu via ab)
+
+125. NUTCH-643 - ClassCastException in PDF parser (Guillaume Smet, ab)
+
+126. NUTCH-636 - Httpclient plugin https doesn't work on IBM JRE
+     (Curtis d'Entremont, ab)
+
+127. NUTCH-683 - NUTCH-676 broke CrawlDbMerger. (dogacan)
+
+128. NUTCH-631 - MoreIndexingFilter fails with NoSuchElementException
+     (Stefan Will, siren)
+
+129. NUTCH-691 - Update jakarta poi jars to the most relevant version
+     (Dmitry Lihachev via siren)
+
+130. NUTCH-563 - Include custom fields in BasicQueryFilter
+     (Julien Nioche via siren)
+
+131. NUTCH-695 - Incorrect mime type detection by MoreIndexingFilter plugin
+     (Dmitry Lihachev via siren)
+
+132. NUTCH-694 - Distributed Search Server fails (siren)
+
+133. NUTCH-626 - Fetcher2 breaks out the domain with db.ignore.external.links
+     set at cross domain redirects (Remco Verhoef, dogacan via siren)
+
+134. NUTCH-247 - Robot parser to restrict (kubes, siren)
+
+135. NUTCH-698 - CrawlDb is corrupted after a few crawl cycles (dogacan
+     via siren)
+
+136. NUTCH-699 - Add an "official" solr schema for solr integration (dogacan,
+     Dmitry Lihachev via siren)
+
+137. NUTCH-703 - Upgrade to Hadoop 0.19.1 (ab)
+
+138. NUTCH-419 - Unavailable robots.txt kills fetch (Carsten Lehmann,
+     Doug Cook via ab)
+
+139. NUTCH-700 - Neko1.9.11 goes into a loop (Julien Nioche, siren)
+
+140. NUTCH-669 - Consolidate code for Fetcher and Fetcher2 (siren)
+
+141. NUTCH-711 - Indexer failing after upgrade to Hadoop 0.19.1 (ab)
+
+142. NUTCH-684 - Dedup support for Solr. (dogacan)
+
+143. NUTCH-715 - Subcollection plugin doesn't work with default
+     subcollections.xml file (Dmitry Lihachev via siren)
+
+144. NUTCH-722 - Nutch contains JAI jars that we cannot redistribute
+

I would say that some of the biggest differences are improved scoring, a
bunch of associated tools for looking inside the Nutch data, and faster
fetching. In addition, the Nutch/Solr integration first appeared in 1.0, so
if you go backwards you won't be able to integrate with Solr (might not be
that big of an issue).
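
If you only care about a subset of the entries above (say, the library
upgrades), something along these lines might help narrow the list down. It is
only a rough sketch: added_entries is a made-up helper name, not anything
shipped with Nutch, and it assumes the two changelogs were saved as 0.9 and
1.0 as in the commands at the top of this mail.

```shell
# Hypothetical helper (not part of Nutch): read a unified diff on stdin and
# print the added, numbered changelog entries that mention a keyword
# (case-insensitive). Only the first line of a multi-line entry is checked.
added_entries() {
    keyword="$1"
    # Keep added lines that start a numbered entry, e.g. "+ 9. NUTCH-61 - ...",
    # drop the leading "+" and padding, then filter on the keyword.
    grep -E '^\+ *[0-9]+\. ' | sed 's/^+ *//' | grep -i -- "$keyword"
}

# Example use (assumes files 0.9 and 1.0 exist in the current directory):
# diff -u 0.9 1.0 | added_entries upgrade
```

Swapping "upgrade" for "Fetcher" or "Solr" gives a quick view of the entries
touching those areas.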

HTH,
Chris



On 7/16/10 9:10 AM, "Hannes Carl Meyer" <[email protected]> wrote:

> Hi,
>
> I'm currently using Nutch 1.0 to perform an intranet crawl and index HTML and
> PDF content.
> Unfortunately we are using Java 1.5 in our production env, which means I have
> to move to Nutch 0.9, since 1.1 and 1.0 require Java 6.
>
> Are there big differences between those versions that might affect my plan
> to move back to 0.9?
> (as I said it is performing an intranet crawl only, NOT using the Lucene search
> interface and NOT using distributed mode)
>
> Thanks for your feedback
>
> Hannes
>


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [email protected]
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
