Re: Differences between 0.9 / 1.0

Hannes Carl Meyer Fri, 16 Jul 2010 12:35:55 -0700

Hi Chris,

Thanks for your summary. I don't like moving to older versions, but it
seems feasible.
Since I'm not using Solr and just wrote a plugin to export crawled pages
into XML files, it should be fine!


Regards

Hannes

On Fri, Jul 16, 2010 at 8:33 PM, Mattmann, Chris A (388J) <
[email protected]> wrote:

> Hi Hannes,
>
> Here are the notable changes between 1.0 and 0.9, generated by:
>
> curl "http://www.apache.org/dist/nutch/CHANGES-0.9.txt" > 0.9
> curl "http://www.apache.org/dist/nutch/CHANGES-1.0.txt" > 1.0
> diff -u 0.9 1.0
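
A minimal sketch of the same comparison, assuming both CHANGES files have
already been saved locally. The helper name changelog_added is hypothetical
(it is not part of Nutch, curl, or diff); it keeps only the lines that the
newer file adds and strips the leading "+" that diff -u puts in front of them.

```shell
#!/bin/sh
# changelog_added OLD NEW
# Hypothetical helper: print the lines that NEW adds relative to OLD,
# i.e. the "+" lines of a unified diff with the leading "+" removed.
changelog_added() {
    # Lines 1-2 of "diff -u" output are the "---" / "+++" file headers,
    # so drop them; then print only added lines, without the "+" marker.
    # (diff exits with status 1 when the files differ; that is expected.)
    diff -u "$1" "$2" | sed -n -e '1,2d' -e 's/^+//p'
}

# Example usage (file names are assumptions, fetch them first with curl):
#   changelog_added CHANGES-0.9.txt CHANGES-1.0.txt
```

Context lines and removed lines are discarded, so the output is the plain
list of new changelog entries rather than a patch.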
>
> +Release 1.0 - 2009-03-23
> +
> + 1. NUTCH-474 - Fetcher2 crawlDelay and blocking fix (Dogacan Guney via ab)
> +
> + 2. NUTCH-443 - Allow parsers to return multiple Parse objects.
> +    (Dogacan Guney et al, via ab)
> +
> + 3. NUTCH-393 - Indexer should handle null documents returned by filters.
> +    (Eelco Lempsink via ab)
> +
> + 4. NUTCH-456 - Parse msexcel plugin speedup (Heiko Dietze via siren)
> +
> + 5. NUTCH-446 - RobotRulesParser should ignore Crawl-delay values of other
> +    bots in robots.txt (Dogacan Guney via siren)
> +
> + 6. NUTCH-482 - Remove redundant plugin lib-log4j (siren)
> +
> + 7. NUTCH-483 - Remove redundant commons-logging jar from ontology plugin
> +    (siren)
> +
> + 8. NUTCH-161 - Change Plain text parser to
> +    use parser.character.encoding.default property for fall back encoding
> +    (KuroSaka TeruHiko, siren)
> +
> + 9. NUTCH-61 - Support for adaptive re-fetch interval and detection of
> +    unmodified content. (ab)
> +
> +10. NUTCH-392 - OutputFormat implementations should pass on Progressable.
> +    (cutting via ab)
> +
> +11. NUTCH-495 - Unnecessary delays in Fetcher2 (dogacan)
> +
> +12. NUTCH-443 - allow parsers to return multiple Parse object, this will speed
> +    up the rss parser (dogacan via mattmann). This update is a fix and semantics
> +    change from the original patch for NUTCH-443. The original patch did not tell
> +    the Indexer to read crawl_parse too so that it can pickup sub-urls' fetch
> +    datums. This patch addresses that issue. Now, if Fetcher gets a null content,
> +    instead of pushing an empty content, it filters the null content.
> +
> +13. NUTCH-485 - Change HtmlParseFilter 's to return ParseResult object instead of
> +    Parse object. (Gal Nitzan via dogacan)
> +
> +14. NUTCH-489 - URLFilter-suffix management of the url path when the url contains
> +    some query parameters. (Emmanuel Joke via dogacan)
> +
> +15. NUTCH-502 - Bug in SegmentReader causes infinite loop.
> +    (Ilya Vishnevsky via dogacan)
> +
> +16. NUTCH-444 Possibly use a different library to parse RSS feed for improved
> +    performance and compatibility. This patch introduced a new plugin, feed,
> +    that includes an index filter and a parse plugin for feeds that uses ROME.
> +    There was discussion to remove parse-rss, in light of the feed plugin,
> +    however, this patch does not explicitly remove parse-rss. (dogacan, mattmann)
> +
> +17. NUTCH-471 - Fix synchronization in NutchBean creation.
> +    (Enis Soztutar via dogacan)
> +
> +18. Upgrade to Lucene 2.2.0 and Hadoop 0.12.3. (ab)
> +
> +19. NUTCH-468 - Scoring filter should distribute score to all outlinks at
> +    once. (dogacan)
> +
> +20. NUTCH-504 - NUTCH-443 broke parsing during fetching. (dogacan)
> +
> +21. NUTCH-497 -  Extreme Nested Tags causes StackOverflowException in
> +    DomContentUtils...Spider Trap. (kubes)
> +
> +22. NUTCH-434 - Replace usage of ObjectWritable with something based on
> +    GenericWritable. (dogacan)
> +
> +23. NUTCH-499 - Refactor LinkDb and LinkDbMerger to reuse code. (dogacan)
> +
> +24. NUTCH-498 - Use Combiner in LinkDb to increase speed of linkdb generation.
> +    (Espen Amble Kolstad via dogacan)
> +
> +25. NUTCH-507 - lib-lucene-analyzers jar definition is wrong in plugin.xml.
> +    (Emmanuel Joke via dogacan)
> +
> +26. NUTCH-503 - Generator exits incorrectly for small fetchlists.
> +    (Vishal Shah via dogacan)
> +
> +27. NUTCH-505 - Outlink urls should be validated. (dogacan)
> +
> +28. NUTCH-510 - IndexMerger delete working dir. (Enis Soztutar via dogacan)
> +
> +29. NUTCH-513 - suffix-urlfilter.txt does not have a template. (dogacan)
> +
> +30. NUTCH-515 - Next fetch time is set incorrectly. (dogacan)
> +
> +30. NUTCH-506 - Nutch should delegate compression to Hadoop. (dogacan)
> +
> +31. NUTCH-517 - build encoding should be UTF-8. (Enis Soztutar via dogacan).
> +
> +32. NUTCH-518 - Fix OpicScoringFilter to respect scoring filter chaining.
> +    (Enis Soztutar via dogacan)
> +
> +33. NUTCH-516 - Next fetch time is not set when it is a
> +    CrawlDatum.STATUS_FETCH_GONE. (Emmanuel Joke via dogacan)
> +
> +34. NUTCH-525 - DeleteDuplicates generates ArrayIndexOutOfBoundsException
> +    when trying to rerun dedup on a segment. (Vishal Shah via dogacan)
> +
> +35. NUTCH-514 - Indexer should only index pages with fetch status SUCCESS.
> +    (dogacan) Note: There is a bigger problem, i.e how to deal
> +    with redirected pages, and this issue can be considered as a band-aid
> +    for the time being. See NUTCH-273 and NUTCH-353 for more details.
> +
> +36. NUTCH-533 - LinkDbMerger: url normalized is not updated in the key and
> +    inlinks list. (Emmanuel Joke via dogacan)
> +
> +37. NUTCH-535 - ParseData's contentMeta accumulates unnecessary values during
> +    parse. (dogacan)
> +
> +38. NUTCH-522 - Use URLValidator in the Injector. (Emmanuel Joke, dogacan)
> +
> +39. NUTCH-536 - Reduce number of warnings in nutch core. (dogacan)
> +
> +40. NUTCH-439 - Top Level Domains Indexing / Scoring. Also adds
> +    domain-related utilities. (Enis Soztutar via dogacan)
> +
> +41. NUTCH-544 - Upgrade Carrot2 clustering plugin to the newest stable
> +    release (2.1). (Dawid Weiss via dogacan)
> +
> +42. NUTCH-545 - Configuration and OnlineClusterer get initialized in every
> +    request. (Dawid Weiss via dogacan)
> +
> +43. NUTCH-532 - CrawlDbMerger: wrong computation of last fetch time.
> +    (Emmanuel Joke via dogacan)
> +
> +44. NUTCH-550 - Parse fails if db.max.outlinks.per.page is -1. (dogacan)
> +
> +45. NUTCH-546 - file URL are filtered out by the crawler. (dogacan)
> +
> +46. NUTCH-554 - Generator throws IOException on invalid urls.
> +    (Brian Whitman via ab)
> +
> +47. NUTCH-529 - NodeWalker.skipChildren doesn't work for more than 1 child.
> +    (Emmanuel Joke via dogacan)
> +
> +48. NUTCH-25 - needs 'character encoding' detector.
> +    (Doug Cook, dogacan, Marcin Okraszewski, Renaud Richardet via dogacan)
> +
> +49. NUTCH-508 - ${hadoop.log.dir} and ${hadoop.log.file} are not propagated
> +    to the tasktracker. (Mathijs Homminga, Emmanuel Joke via dogacan)
> +
> +50. NUTCH-562 - Port mime type framework to use Tika mime detection framework.
> +    (mattmann)
> +
> +51. NUTCH-488 - Avoid parsing unnecessary links and get a more relevant outlink
> +    list. (Emmanuel Joke, Marcin Okraszewski via kubes)
> +
> +52. NUTCH-501 - Implement a different caching mechanism for objects cached in
> +    configuration. (dogacan)
> +
> +53. NUTCH-552 - Upgrade Nutch to Hadoop 0.15.x. (kubes)
> +
> +54. NUTCH-565 - Arc File to Nutch Segments Converter. (kubes)
> +
> +55. NUTCH-547 - Redirection handling: YahooSlurp's algorithm.
> +    (dogacan, kubes via dogacan)
> +
> +56. NUTCH-548 - Move URLNormalizer from Outlink to ParseOutputFormat.
> +    (Emmanuel Joke via dogacan)
> +
> +57. NUTCH-538 - Delete unused classes under o.a.n.util. (dogacan)
> +
> +58. NUTCH-494 - FindBugs: CrawlDbReader and DeleteDuplicates. (dogacan)
> +
> +59. NUTCH-574 - Including inlink anchor text in index can create irrelevant
> +    search results.  Created index-anchor plugin, removed functionality from
> +    index-basic plugin. For backwards compatibility, add index-anchor plugin to
> +    nutch-site.xml plugin.includes. (kubes)
> +
> +60. NUTCH-581 - DistributedSearch does not update search servers added to
> +    search-servers.txt on the fly.  (Rohan Mehta via kubes)
> +
> +61. NUTCH-586 - Add option to run compiled classes without job file
> +    (enis via ab)
> +
> +62. NUTCH-559 - NTLM, Basic and Digest Authentication schemes for web/proxy
> +    server. (Susam Pal via dogacan)
> +
> +63. NUTCH-534 - SegmentMerger: add -normalize option (Emmanuel Joke via ab)
> +
> +64. NUTCH-528 - CrawlDbReader: add some new stats + dump into a CSV format
> +    (Emmanuel Joke via ab)
> +
> +65. NUTCH-597 - NPE in Fetcher2 (Remco Verhoef via ab)
> +
> +66. NUTCH-584 - urls missing from fetchlist (Ruslan Ermilov, ab)
> +
> +67. NUTCH-580 - Remove deprecated hadoop api calls (FS) (siren)
> +
> +68. NUTCH-587 - Upgrade to Hadoop 0.15.3 (kubes)
> +
> +69. NUTCH-604 - Upgrade to Lucene 2.3.0 (ab)
> +
> +70. NUTCH-602 - Allow configurable number of handlers for search servers
> +    (hartbecke via kubes)
> +
> +71. NUTCH-607 - Update build.xml to include tika jar when building war (kubes)
> +
> +72. NUTCH-608 - Upgrade nutch to use released apache-tika-0.1-incubating (mattmann)
> +
> +73. NUTCH-606 - Refactoring of Generator, run all urls through checks (kubes)
> +
> +74. NUTCH-605 - Change deprecated configuration methods for Hadoop (kubes)
> +
> +75. NUTCH-603 - Add more default url normalizations (kubes)
> +
> +76. NUTCH-611 - Upgrade Nutch to use Hadoop 0.16 (kubes)
> +
> +77. NUTCH-44 - Too many search results, limits max results returned from a
> +    single search. (Emilijan Mirceski and Susam Pal via kubes)
> +
> +78. NUTCH-567 - Proper (?) handling of URIs in TagSoup. TagSoup library is
> +    updated to 1.2 version. (dogacan)
> +
> +79. NUTCH-613 - Empty summaries and cached pages (kubes via ab)
> +
> +80. NUTCH-612 - URL filtering was disabled in Generator when invoked
> +    from Crawl (Susam Pal via ab)
> +
> +81. NUTCH-601 - Recrawling on existing crawl directory (Susam Pal via ab)
> +
> +82. NUTCH-575 - NPE in OpenSearchServlet (John H. Lee via ab)
> +
> +83. NUTCH-126 - Fetching https does not work with a proxy (Fritz Elfert via ab)
> +
> +84. NUTCH-615 - Redirected URL-s fetched without setting fetchInterval.
> +    Guard against reprUrl being null. (Emmanuel Joke, ab)
> +
> +85. NUTCH-616 - Reset Fetch Retry counter when fetch is successful (Emmanuel
> +    Joke, ab)
> +
> +86. NUTCH-220 - Upgrade to PDFBox 0.7.3 (ab)
> +
> +87. NUTCH-223 - Crawl.java uses Integer.MAX_VALUE (Jeff Ritchie via ab)
> +
> +88. NUTCH-598 - Remove deprecated use of ToolBase. Use generics in Hadoop API.
> +    (Emmanuel Joke, dogacan, ab)
> +
> +89. NUTCH-620 - BasicURLNormalizer should collapse runs of slashes with a
> +    single slash. (Mark DeSpain via ab)
> +
> +90. NUTCH-500 - Add hadoop masters configuration file into conf folder.
> +    (Emmanuel Joke via kubes)
> +
> +91. NUTCH-596 - ParseSegments parse content even if it's not
> +    CrawlDatum.STATUS_FETCH_SUCCESS (dogacan)
> +
> +92. NUTCH-618 - Tika error "Media type alias already exists" (mattmann, kubes)
> +
> +93. NUTCH-634 - Upgrade Nutch to Hadoop 0.17.1 (Michael Gottesman, Lincoln
> +    Ritter, ab)
> +
> +94. NUTCH-641 - IndexSorter incorrectly copies stored fields (ab)
> +
> +95. NUTCH-645 - Parse-swf unit test failing (ab)
> +
> +96. NUTCH-642 - Unit tests fail when run in non-local mode (ab)
> +
> +97. NUTCH-639 - Change LuceneDocumentWrapper visibility from
> +    private to _public_ (Guillaume Smet via dogacan)
> +
> +98. NUTCH-651 - Remove bin/{start|stop}-balancer.sh from svn
> +    tracking. (dogacan)
> +
> +99. NUTCH-375 - Add support for Content-Encoding: deflated
> +    (Pascal Beis, ab)
> +
> +100. NUTCH-633 - ParseSegment no longer allow reparsing.
> +     (dogacan)
> +
> +101. NUTCH-653 - Upgrade to hadoop 0.18. (dogacan)
> +
> +102. NUTCH-621 - Nutch needs to declare its crypto usage (mattmann)
> +
> +103. NUTCH-654 - urlfilter-regex's main does not work.
> +     (dogacan)
> +
> +104. NUTCH-640 - confusing description "set it to Integer.MAX_VALUE".
> +     (dogacan)
> +
> +105. NUTCH-662 - Upgrade Nutch to use Lucene 2.4. (kubes)
> +
> +106. NUTCH-663 - Upgrade Nutch to use Hadoop 0.19 (kubes)
> +
> +107. NUTCH-647 - Resolve URLs tool (kubes)
> +
> +108. NUTCH-665 - Search Load Testing Tool (kubes)
> +
> +109. NUTCH-667 - Input Format for working with Content in Hadoop Streaming
> +                 (kubes)
> +
> +110. NUTCH-635 -  LinkAnalysis Tool for Nutch. (kubes)
> +
> +111. NUTCH-646 -  New Indexing Framework for Nutch. (kubes)
> +
> +112. NUTCH-668 -  Domain URL Filter. (kubes)
> +
> +113. NUTCH-594 -  Serve Nutch search results in multiple formats including
> +                  XML and JSON. (kubes)
> +
> +114. NUTCH-442 - Integrate Solr/Nutch. (dogacan, original version by siren)
> +
> +115. NUTCH-652 - AdaptiveFetchSchedule#setFetchSchedule doesn't calculate
> +                 fetch interval correctly. (dogacan)
> +
> +116. NUTCH-627 - Minimize host address lookup (Otis Gospodnetic)
> +
> +117. NUTCH-678 - Hadoop 0.19 requires an update of jets3t.
> +                 (julien nioche via dogacan)
> +
> +118. NUTCH-681 - parse-mp3 compilation problem.
> +                 (Wildan Maulana via dogacan)
> +
> +119. NUTCH-676 - MapWritable is written inefficiently and confusingly.
> +                 (dogacan)
> +
> +120. NUTCH-579 - Feed plugin only indexes one post per feed due to identical
> +                 digest. (dogacan)
> +
> +121. NUTCH-571 - parse-mp3 plugin doesn't always index album of mp3.
> +                 (Joseph Chen, dogacan)
> +
> +122. NUTCH-682 - SOLR indexer does not set boost on the document.
> +                 (julien nioche via dogacan)
> +
> +123. NUTCH-279 - Additions to urlnormalizer-regex (Stefan Neufeind, ab)
> +
> +124. NUTCH-671 - JSP errors in Nutch searcher webapp (Edwin Chu via ab)
> +
> +125. NUTCH-643 - ClassCastException in PDF parser (Guillaume Smet, ab)
> +
> +126. NUTCH-636 - Httpclient plugin https doesn't work on IBM JRE
> +     (Curtis d'Entremont, ab)
> +
> +127. NUTCH-683 - NUTCH-676 broke CrawlDbMerger. (dogacan)
> +
> +128. NUTCH-631 - MoreIndexingFilter fails with NoSuchElementException
> +     (Stefan Will, siren)
> +
> +129. NUTCH-691 - Update jakarta poi jars to the most relevant version
> +     (Dmitry Lihachev via siren)
> +
> +130. NUTCH-563 - Include custom fields in BasicQueryFilter
> +     (Julien Nioche via siren)
> +
> +131. NUTCH-695 - Incorrect mime type detection by MoreIndexingFilter plugin
> +     (Dmitry Lihachev via siren)
> +
> +132. NUTCH-694 - Distributed Search Server fails (siren)
> +
> +133. NUTCH-626 - Fetcher2 breaks out the domain with db.ignore.external.links
> +     set at cross domain redirects (Remco Verhoef, dogacan via siren)
> +
> +134. NUTCH-247 - Robot parser to restrict (kubes, siren)
> +
> +135. NUTCH-698 - CrawlDb is corrupted after a few crawl cycles (dogacan
> +     via siren)
> +
> +136. NUTCH-699 - Add an "official" solr schema for solr integration (dogacan,
> +     Dmitry Lihachev via siren)
> +
> +137. NUTCH-703 - Upgrade to Hadoop 0.19.1 (ab)
> +
> +138. NUTCH-419 - Unavailable robots.txt kills fetch (Carsten Lehmann,
> +     Doug Cook via ab)
> +
> +139. NUTCH-700 - Neko1.9.11 goes into a loop (Julien Nioche, siren)
> +
> +140. NUTCH-669 - Consolidate code for Fetcher and Fetcher2 (siren)
> +
> +141. NUTCH-711 - Indexer failing after upgrade to Hadoop 0.19.1 (ab)
> +
> +142. NUTCH-684 - Dedup support for Solr. (dogacan)
> +
> +143. NUTCH-715 - Subcollection plugin doesn't work with default
> +     subcollections.xml file (Dmitry Lihachev via siren)
> +
> +144. NUTCH-722 - Nutch contains JAI jars that we cannot redistribute
> +
>
> I would say that some of the biggest differences include improved scoring, a
> bunch of associated tools to look inside of the Nutch data, and improved
> speed on fetching. In addition, the integration of Nutch and Solr first
> appeared in 1.0, so if you go backwards you won't be able to integrate with
> Solr (might not be that big of an issue).
>
> HTH,
> Chris
>
>
>
> On 7/16/10 9:10 AM, "Hannes Carl Meyer" <[email protected]> wrote:
>
> > Hi,
> >
> > I'm currently using Nutch 1.0 to perform an intranet crawl and index HTML
> > and PDF contents.
> > Unfortunately we are using Java 1.5 in our production env, which means I
> > have to move to Nutch 0.9, since 1.1 and 1.0 require Java 6.
> >
> > Are there big differences between those versions that might impact my
> > plans to move backwards to 0.9?
> > (As I said, it is performing an intranet crawl only, NOT using the Lucene
> > search interface and NOT using distributed mode.)
> >
> > Thanks for your feedback
> >
> > Hannes
> >
>
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Senior Computer Scientist
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 171-266B, Mailstop: 171-246
> Email: [email protected]
> WWW:   http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Assistant Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>
>


-- 

https://www.xing.com/profile/HannesCarl_Meyer
http://de.linkedin.com/in/hannescarlmeyer
http://twitter.com/hannescarlmeyer
