svn commit: r208869 [2/12] - in /lucene/nutch/trunk: conf/ src/plugin/languageidentifier/ src/plugin/languageidentifier/src/java/org/apache/nutch/analysis/lang/ src/plugin/languageidentifier/src/test/org/apache/nutch/analysis/lang/

2005-07-02 Thread ab
6698 -tnin 6691 -ag_ -sæ 6663 -ndr 6626 -sen 6593 -ente 6571 -dig 6562 -ga 6555 -kl 6546 -tu 6517 -bes 6500 -fe 6473 -lag 6461 -red 6435 -lin 6431 -ks 6412 -dre 6409 -ment 6405 -kal_ 6387 -skal 6383 -ved 6325 -ab 6321 -sam 6321 -æl 6310 -par 6284 -v_ 6283 -bet 6246 -est 6224 -ner_ 6218 -ve_ 6218

svn commit: r208869 [12/12] - in /lucene/nutch/trunk: conf/ src/plugin/languageidentifier/ src/plugin/languageidentifier/src/java/org/apache/nutch/analysis/lang/ src/plugin/languageidentifier/src/test/org/apache/nutch/analysis/lang/

2005-07-02 Thread ab
2734 -utan 2733 -aga 2731 -änk 2726 -org 2723 -öj 2723 -ab 2722 -ven_ 2716 -is_ 2711 -dli 2704 -rän 2704 -nkt 2703 -rfö 2698 -dag 2693 -ien 2692 -tti 2689 -bö 2676 -ske 2672 -amt 2669 -and_ 2669 -tvi 2662 -rag 2654 -ckli 2653 -ive 2647 -dd 2646 -rför 2646 -avs 2645 -dern 2645 -beh 2644 -nade

svn commit: r209246 - in /lucene/nutch/trunk/src/plugin/parse-js: plugin.xml src/java/org/apache/nutch/parse/js/JSParseFilter.java

2005-07-05 Thread ab
Author: ab Date: Tue Jul 5 01:56:01 2005 New Revision: 209246 URL: http://svn.apache.org/viewcvs?rev=209246&view=rev Log: Activate this as Parser plugin (it was accidentally omitted). Accept also empty content type, if the extension is right. Modified: lucene/nutch/trunk/src/plugin/parse-js

svn commit: r209669 - /lucene/nutch/trunk/src/plugin/languageidentifier/src/java/org/apache/nutch/analysis/lang/LanguageIndexingFilter.java

2005-07-07 Thread ab
Author: ab Date: Thu Jul 7 15:43:49 2005 New Revision: 209669 URL: http://svn.apache.org/viewcvs?rev=209669&view=rev Log: Forgot to add this one, sorry. Added: lucene/nutch/trunk/src/plugin/languageidentifier/src/java/org/apache/nutch/analysis/lang/LanguageIndexingFilter.java (with props

svn commit: r219097 - in /lucene/nutch/trunk/src/java/org/apache/nutch: fs/NDFSFileSystem.java ndfs/FSDirectory.java ndfs/NDFSFile.java ndfs/NDFSFileInfo.java

2005-07-14 Thread ab
Author: ab Date: Thu Jul 14 13:54:16 2005 New Revision: 219097 URL: http://svn.apache.org/viewcvs?rev=219097&view=rev Log: Fix issues reported in NUTCH-46. Submitted by Piotr Kosiorowski. Modified: lucene/nutch/trunk/src/java/org/apache/nutch/fs/NDFSFileSystem.java lucene/nutch/trunk/src

svn commit: r220034 - in /lucene/nutch/trunk/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient: DummySSLProtocolSocketFactory.java Http.java HttpResponse.java

2005-07-21 Thread ab
Author: ab Date: Thu Jul 21 04:07:10 2005 New Revision: 220034 URL: http://svn.apache.org/viewcvs?rev=220034&view=rev Log: Fixes for NUTCH-66, and some cleanup: * apply the more lenient CookiePolicy. This fixes the problem with poorly formatted cookies being rejected. * remove the while() loop

svn commit: r232303 - in /lucene/nutch/trunk/src/plugin: ./ parse-rss/ parse-rss/lib/ parse-rss/sample/ parse-rss/src/ parse-rss/src/java/ parse-rss/src/java/org/ parse-rss/src/java/org/apache/ parse-

2005-08-12 Thread ab
Author: ab Date: Fri Aug 12 07:23:47 2005 New Revision: 232303 URL: http://svn.apache.org/viewcvs?rev=232303&view=rev Log: RSS Parse plugin. Contributed by Chris Mattmann (issue NUTCH-30). Thank you! Added: lucene/nutch/trunk/src/plugin/parse-rss/ lucene/nutch/trunk/src/plugin/parse-rss

svn commit: r290382 - in /lucene/nutch/trunk/src/plugin/parse-pdf: lib/PDFBox-0.7.0.LICENSE.txt lib/PDFBox-0.7.0.jar lib/PDFBox-0.7.2-log4j.jar lib/PDFBox-LICENSE.txt lib/log4j-1.2.9.LICENSE.txt lib/log4j-LICENSE.txt plugin.xml

2005-09-20 Thread ab
Author: ab Date: Tue Sep 20 00:09:45 2005 New Revision: 290382 URL: http://svn.apache.org/viewcvs?rev=290382&view=rev Log: Updated to PDFBox-0.7.2. Starting with this release PDFBox comes in two versions - one has no logging at all, the other uses log4j (which is the version used here

svn commit: r290383 - in /lucene/nutch/branches/Release-0.7/src/plugin/parse-pdf: lib/PDFBox-0.7.0.LICENSE.txt lib/PDFBox-0.7.0.jar lib/PDFBox-0.7.2-log4j.jar lib/PDFBox-LICENSE.txt lib/log4j-1.2.9.LICENSE.txt lib/log4j-LICENSE.txt plugin.xml

2005-09-20 Thread ab
Author: ab Date: Tue Sep 20 00:16:42 2005 New Revision: 290383 URL: http://svn.apache.org/viewcvs?rev=290383&view=rev Log: Updated to PDFBox-0.7.2. Starting with this release PDFBox comes in two versions - one has no logging at all, the other uses log4j (which is the version used here

svn commit: r357962 - in /lucene/nutch/trunk/src/java/org/apache/nutch/crawl: Fetcher.java Generator.java Injector.java

2005-12-20 Thread ab
Author: ab Date: Tue Dec 20 03:19:21 2005 New Revision: 357962 URL: http://svn.apache.org/viewcvs?rev=357962&view=rev Log: Remove remaining calls to static NutchConf.get() in favor of using per-instance local configuration getConf(). Modified: lucene/nutch/trunk/src/java/org/apache/nutch

svn commit: r358674 - in /lucene/nutch/trunk/src: java/org/apache/nutch/crawl/ java/org/apache/nutch/indexer/ plugin/creativecommons/src/java/org/creativecommons/nutch/ plugin/index-basic/src/java/org

2005-12-22 Thread ab
Author: ab Date: Thu Dec 22 17:16:31 2005 New Revision: 358674 URL: http://svn.apache.org/viewcvs?rev=358674&view=rev Log: Remove traces of the old API FetcherOutput. The old IndexSegment is now marked broken. In the next step old utilities should be removed. Modified: lucene/nutch/trunk/src

svn commit: r359822 - in /lucene/nutch/trunk: bin/ conf/ src/java/org/apache/nutch/crawl/ src/java/org/apache/nutch/fetcher/ src/java/org/apache/nutch/indexer/ src/java/org/apache/nutch/parse/ src/jav

2005-12-29 Thread ab
Author: ab Date: Thu Dec 29 07:28:30 2005 New Revision: 359822 URL: http://svn.apache.org/viewcvs?rev=359822&view=rev Log: A framework for using different page signature implementations. Ordinary MD5 hash of a raw page content is very often unsuitable, when many near-duplicate pages are crawled
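The point about raw MD5 hashes and near-duplicates can be illustrated with a minimal sketch. It is written in Python rather than Nutch's Java, and the function names and the token-based normalization are assumptions made for illustration, not the signature implementations added in this commit.

```python
import hashlib
import re


def raw_md5_signature(content: bytes) -> str:
    """Hash the raw bytes: any byte-level difference changes the signature."""
    return hashlib.md5(content).hexdigest()


def normalized_text_signature(content: bytes) -> str:
    """Hypothetical alternative: hash only the lower-cased alphanumeric tokens,
    so differences in case, whitespace, and punctuation are ignored."""
    text = content.decode("utf-8", errors="replace").lower()
    tokens = re.findall(r"[a-z0-9]+", text)
    return hashlib.md5(" ".join(tokens).encode("utf-8")).hexdigest()


if __name__ == "__main__":
    page_a = b"Breaking news:  Nutch 0.8 released!"
    page_b = b"breaking news: nutch 0.8 released!\n"
    print(raw_md5_signature(page_a) == raw_md5_signature(page_b))
    print(normalized_text_signature(page_a) == normalized_text_signature(page_b))
```

With a pluggable signature, a crawler could choose per deployment how aggressively near-duplicates are collapsed.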

svn commit: r360069 - /lucene/nutch/trunk/src/java/org/apache/nutch/tools/DmozParser.java

2005-12-30 Thread ab
Author: ab Date: Fri Dec 30 03:07:50 2005 New Revision: 360069 URL: http://svn.apache.org/viewcvs?rev=360069&view=rev Log: Fix incorrect package declaration. Modified: lucene/nutch/trunk/src/java/org/apache/nutch/tools/DmozParser.java Modified: lucene/nutch/trunk/src/java/org/apache/nutch

svn commit: r365576 - /lucene/nutch/trunk/src/java/org/apache/nutch/parse/ParseOutputFormat.java

2006-01-03 Thread ab
Author: ab Date: Tue Jan 3 00:35:04 2006 New Revision: 365576 URL: http://svn.apache.org/viewcvs?rev=365576&view=rev Log: Fixed an NPE, in case of a fetch error we don't have a score value from Fetcher. Modified: lucene/nutch/trunk/src/java/org/apache/nutch/parse/ParseOutputFormat.java

svn commit: r365850 - in /lucene/nutch/trunk/src/plugin/protocol-httpclient: ./ lib/ src/java/org/apache/nutch/protocol/httpclient/

2006-01-03 Thread ab
Author: ab Date: Tue Jan 3 23:32:04 2006 New Revision: 365850 URL: http://svn.apache.org/viewcvs?rev=365850&view=rev Log: Update Commons HTTPClient to v. 3.0. Add some default headers to prefer HTML content, and in English. Added: lucene/nutch/trunk/src/plugin/protocol-httpclient/lib

svn commit: r367251 - /lucene/nutch/trunk/src/java/org/apache/nutch/parse/ParseData.java

2006-01-09 Thread ab
Author: ab Date: Mon Jan 9 00:58:58 2006 New Revision: 367251 URL: http://svn.apache.org/viewcvs?rev=367251&view=rev Log: Replace the custom metadata serialization with the one provided by the ContentProperties class. This fixes the breakage if multiple property values per key are in use

svn commit: r368167 - in /lucene/nutch/trunk/src/java/org/apache/nutch: fetcher/Fetcher.java parse/ParseSegment.java

2006-01-11 Thread ab
Author: ab Date: Wed Jan 11 15:24:40 2006 New Revision: 368167 URL: http://svn.apache.org/viewcvs?rev=368167&view=rev Log: Make sure we always have the segment name and score values in ParseData.metadata. Sometimes plugins would fail to copy them through, or a parsing error would produce empty

svn commit: r370854 - /lucene/nutch/trunk/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java

2006-01-20 Thread ab
Author: ab Date: Fri Jan 20 08:00:04 2006 New Revision: 370854 URL: http://svn.apache.org/viewcvs?rev=370854&view=rev Log: Move excessive logging to Level.FINE. Modified: lucene/nutch/trunk/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java Modified: lucene/nutch

svn commit: r374725 - /lucene/nutch/trunk/src/web/jsp/search.jsp

2006-02-03 Thread ab
Author: ab Date: Fri Feb 3 10:54:12 2006 New Revision: 374725 URL: http://svn.apache.org/viewcvs?rev=374725&view=rev Log: Encode the query on links in UTF-8. Modified: lucene/nutch/trunk/src/web/jsp/search.jsp Modified: lucene/nutch/trunk/src/web/jsp/search.jsp URL: http://svn.apache.org
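As a rough illustration of what encoding the query on links in UTF-8 means, here is a small Python sketch (not the JSP code from this commit). The `query` parameter name and the `search_link` helper are assumptions.

```python
from urllib.parse import quote_plus, unquote_plus


def search_link(base: str, query: str) -> str:
    """Build a link whose query text is percent-encoded as UTF-8 bytes.
    The parameter name 'query' is an assumption for this sketch."""
    return f"{base}?query={quote_plus(query, encoding='utf-8')}"


if __name__ == "__main__":
    link = search_link("search.jsp", "zażółć gęś")
    print(link)
    # Decoding with the same charset recovers the original text.
    print(unquote_plus(link.split("=", 1)[1], encoding="utf-8"))
```

If the link were encoded in one charset and decoded in another, non-ASCII queries would come back garbled on the next results page.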

svn commit: r376518 - in /lucene/nutch/trunk/src: java/org/apache/nutch/crawl/ test/org/apache/nutch/crawl/

2006-02-09 Thread ab
Author: ab Date: Thu Feb 9 16:56:57 2006 New Revision: 376518 URL: http://svn.apache.org/viewcvs?rev=376518&view=rev Log: Add metadata to CrawlDatum. Contributed by Stefan Groschupf in NUTCH-192. Added: lucene/nutch/trunk/src/java/org/apache/nutch/crawl/MapWritable.java (with props

svn commit: r380163 - /lucene/nutch/trunk/src/java/org/apache/nutch/crawl/LinkDb.java

2006-02-23 Thread ab
Author: ab Date: Thu Feb 23 09:21:38 2006 New Revision: 380163 URL: http://svn.apache.org/viewcvs?rev=380163&view=rev Log: Modify the cmd-line so that it's possible to perform incremental updates on existing linkDb. This significantly speeds up the invertlinks operation. Modified: lucene

svn commit: r383829 - /lucene/nutch/trunk/src/java/org/apache/nutch/parse/ParserFactory.java

2006-03-07 Thread ab
Author: ab Date: Tue Mar 7 01:26:54 2006 New Revision: 383829 URL: http://svn.apache.org/viewcvs?rev=383829&view=rev Log: Cache instances of ParsePluginList. Modified: lucene/nutch/trunk/src/java/org/apache/nutch/parse/ParserFactory.java Modified: lucene/nutch/trunk/src/java/org/apache

svn commit: r384011 - /lucene/nutch/trunk/src/java/org/apache/nutch/searcher/Query.java

2006-03-07 Thread ab
Author: ab Date: Tue Mar 7 13:04:31 2006 New Revision: 384011 URL: http://svn.apache.org/viewcvs?rev=384011&view=rev Log: No-arg constructors are required for RPC. Allow the RPC Server to set local Configuration. Fix by Marko Bauhardt. Modified: lucene/nutch/trunk/src/java/org/apache/nutch

svn commit: r385531 - /lucene/nutch/trunk/src/java/org/apache/nutch/plugin/PluginManifestParser.java

2006-03-13 Thread ab
Author: ab Date: Mon Mar 13 04:16:54 2006 New Revision: 385531 URL: http://svn.apache.org/viewcvs?rev=385531&view=rev Log: Print out the full path of plugins directory. Allow using plugins located outside classpath, e.g. in shared repos. Submitted by Stefan Groschupf, NUTCH-229. Modified

svn commit: r386875 - /lucene/nutch/trunk/src/java/org/apache/nutch/parse/ParseOutputFormat.java

2006-03-18 Thread ab
Author: ab Date: Sat Mar 18 11:21:11 2006 New Revision: 386875 URL: http://svn.apache.org/viewcvs?rev=386875&view=rev Log: Apply patch in NUTCH-230, which provides additional control over which outlinks are considered for OPIC cash value distribution. Modified: lucene/nutch/trunk/src/java/org

svn commit: r387341 - in /lucene/nutch/trunk/src/java/org/apache/nutch/crawl: Inlinks.java LinkDb.java LinkDbReader.java

2006-03-20 Thread ab
Author: ab Date: Mon Mar 20 15:20:56 2006 New Revision: 387341 URL: http://svn.apache.org/viewcvs?rev=387341&view=rev Log: Don't allow Inlink duplicates (NUTCH-235). Modified: lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Inlinks.java lucene/nutch/trunk/src/java/org/apache/nutch

svn commit: r387578 - in /lucene/nutch/trunk/src: java/org/apache/nutch/clustering/ plugin/clustering-carrot2/src/java/org/apache/nutch/clustering/carrot2/ plugin/clustering-carrot2/src/test/org/apach

2006-03-21 Thread ab
Author: ab Date: Tue Mar 21 08:43:09 2006 New Revision: 387578 URL: http://svn.apache.org/viewcvs?rev=387578&view=rev Log: Cleanup and JUnit test for Carrot2. Contributed by Dawid Weiss (NUTCH-234). Added: lucene/nutch/trunk/src/plugin/clustering-carrot2/src/java/org/apache/nutch/clustering

svn commit: r390275 - /lucene/nutch/trunk/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMBuilder.java

2006-03-30 Thread ab
Author: ab Date: Thu Mar 30 15:07:48 2006 New Revision: 390275 URL: http://svn.apache.org/viewcvs?rev=390275&view=rev Log: Fix a bug where TagSoup would sometimes submit invalid index values. Modified: lucene/nutch/trunk/src/plugin/parse-html/src/java/org/apache/nutch/parse/html

svn commit: r391044 - /lucene/nutch/trunk/src/java/org/apache/nutch/crawl/CrawlDbReducer.java

2006-04-03 Thread ab
Author: ab Date: Mon Apr 3 06:35:34 2006 New Revision: 391044 URL: http://svn.apache.org/viewcvs?rev=391044&view=rev Log: Make sure we use new values for score, metadata, fetch interval and fetch time. Modified: lucene/nutch/trunk/src/java/org/apache/nutch/crawl/CrawlDbReducer.java Modified

svn commit: r391055 - /lucene/nutch/trunk/src/java/org/apache/nutch/crawl/CrawlDbReducer.java

2006-04-03 Thread ab
Author: ab Date: Mon Apr 3 07:36:19 2006 New Revision: 391055 URL: http://svn.apache.org/viewcvs?rev=391055&view=rev Log: Forgot to properly initialize the score. Modified: lucene/nutch/trunk/src/java/org/apache/nutch/crawl/CrawlDbReducer.java Modified: lucene/nutch/trunk/src/java/org

svn commit: r391525 - /lucene/nutch/trunk/src/java/org/apache/nutch/parse/ParseStatus.java

2006-04-04 Thread ab
Author: ab Date: Tue Apr 4 22:39:28 2006 New Revision: 391525 URL: http://svn.apache.org/viewcvs?rev=391525&view=rev Log: Correct javadoc. Modified: lucene/nutch/trunk/src/java/org/apache/nutch/parse/ParseStatus.java Modified: lucene/nutch/trunk/src/java/org/apache/nutch/parse

svn commit: r391577 - /lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Generator.java

2006-04-05 Thread ab
Author: ab Date: Wed Apr 5 03:09:54 2006 New Revision: 391577 URL: http://svn.apache.org/viewcvs?rev=391577&view=rev Log: SelectorInverseMapper needs to implement Mapper, otherwise things break. Noticed by Shawn Gervais. Modified: lucene/nutch/trunk/src/java/org/apache/nutch/crawl

svn commit: r391676 - /lucene/nutch/trunk/src/java/org/apache/nutch/fetcher/Fetcher.java

2006-04-05 Thread ab
Author: ab Date: Wed Apr 5 10:01:02 2006 New Revision: 391676 URL: http://svn.apache.org/viewcvs?rev=391676&view=rev Log: Fix protocol-level redirect code. Patch by Dennis Kubes. Make it clear that this is a protocol-level redirect, as opposed to a content-level redirect. Modified: lucene

svn commit: r392056 - /lucene/nutch/trunk/src/java/org/apache/nutch/crawl/CrawlDbReducer.java

2006-04-06 Thread ab
Author: ab Date: Thu Apr 6 13:01:47 2006 New Revision: 392056 URL: http://svn.apache.org/viewcvs?rev=392056&view=rev Log: Pages with only STATUS_DB_GONE were unaccounted for, which caused an NPE. Modified: lucene/nutch/trunk/src/java/org/apache/nutch/crawl/CrawlDbReducer.java Modified

svn commit: r396708 - /lucene/nutch/trunk/src/java/org/apache/nutch/fetcher/Fetcher.java

2006-04-24 Thread ab
Author: ab Date: Mon Apr 24 15:56:17 2006 New Revision: 396708 URL: http://svn.apache.org/viewcvs?rev=396708&view=rev Log: Fix an NPE, and simplify the logic (NUTCH-254). Modified: lucene/nutch/trunk/src/java/org/apache/nutch/fetcher/Fetcher.java Modified: lucene/nutch/trunk/src/java/org

svn commit: r396955 - in /lucene/nutch/trunk: conf/ src/plugin/ src/plugin/parse-oo/ src/plugin/parse-oo/sample/ src/plugin/parse-oo/src/ src/plugin/parse-oo/src/java/ src/plugin/parse-oo/src/java/org

2006-04-25 Thread ab
Author: ab Date: Tue Apr 25 12:12:48 2006 New Revision: 396955 URL: http://svn.apache.org/viewcvs?rev=396955&view=rev Log: Parser for OpenOffice and OpenDocument formats (an updated version of NUTCH-125). Development of this plugin was supported by Zaheed Haque. Thank you! Added: lucene

svn commit: r397169 - in /lucene/nutch/trunk/src/java/org/apache/nutch/crawl: CrawlDatum.java CrawlDbReducer.java

2006-04-26 Thread ab
Author: ab Date: Wed Apr 26 03:54:53 2006 New Revision: 397169 URL: http://svn.apache.org/viewcvs?rev=397169&view=rev Log: Don't allow CrawlDatum.getMetaData() to return null. Underlying MapWritable is lazily instantiated to minimize the number of created objects. Refactor CrawlDbReducer to use
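The lazy-instantiation idea in this log message can be sketched as follows. This is a Python illustration only: `CrawlDatumSketch` and its method names are hypothetical stand-ins, and a plain dict replaces MapWritable.

```python
class CrawlDatumSketch:
    """Hypothetical stand-in for a record with optional metadata."""

    def __init__(self) -> None:
        # No map is allocated up front; most records may never need one.
        self._meta = None

    def get_meta_data(self) -> dict:
        """Never returns None: the map is created on first access."""
        if self._meta is None:
            self._meta = {}
        return self._meta

    def has_meta_data(self) -> bool:
        """Check for metadata without forcing an allocation."""
        return bool(self._meta)


if __name__ == "__main__":
    datum = CrawlDatumSketch()
    print(datum.has_meta_data())
    datum.get_meta_data()["signature"] = "abc"
    print(datum.get_meta_data())
```

Callers can then iterate over or update the metadata without a null check, while records that never touch it pay nothing for the map.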

svn commit: r398462 - /lucene/nutch/trunk/src/java/org/apache/nutch/crawl/CrawlDbReducer.java

2006-04-30 Thread ab
Author: ab Date: Sun Apr 30 16:33:45 2006 New Revision: 398462 URL: http://svn.apache.org/viewcvs?rev=398462&view=rev Log: Temporary workaround for a situation where we may end up with a lone STATUS_SIGNATURE. The real reason for this error is unknown at this moment, please report if you encounter

svn commit: r399515 - /lucene/nutch/trunk/src/java/org/apache/nutch/segment/SegmentReader.java

2006-05-03 Thread ab
Author: ab Date: Wed May 3 19:42:02 2006 New Revision: 399515 URL: http://svn.apache.org/viewcvs?rev=399515&view=rev Log: Use the FileSystem instead of java.io.File.exists(). Modified: lucene/nutch/trunk/src/java/org/apache/nutch/segment/SegmentReader.java Modified: lucene/nutch/trunk/src

svn commit: r405179 - in /lucene/nutch/trunk/src: java/org/apache/nutch/crawl/MapWritable.java test/org/apache/nutch/crawl/TestMapWritable.java

2006-05-08 Thread ab
Author: ab Date: Mon May 8 14:48:21 2006 New Revision: 405179 URL: http://svn.apache.org/viewcvs?rev=405179&view=rev Log: Fix NUTCH-263. Modified: lucene/nutch/trunk/src/java/org/apache/nutch/crawl/MapWritable.java lucene/nutch/trunk/src/test/org/apache/nutch/crawl/TestMapWritable.java

svn commit: r405181 - in /lucene/nutch/trunk/src/java/org/apache/nutch/crawl: CrawlDbReader.java LinkDb.java

2006-05-08 Thread ab
Author: ab Date: Mon May 8 14:52:09 2006 New Revision: 405181 URL: http://svn.apache.org/viewcvs?rev=405181&view=rev Log: Refactor to make it easier to use these classes programmatically. Modified: lucene/nutch/trunk/src/java/org/apache/nutch/crawl/CrawlDbReader.java lucene/nutch/trunk

svn commit: r405946 - /lucene/nutch/trunk/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMBuilder.java

2006-05-12 Thread ab
Author: ab Date: Fri May 12 16:35:50 2006 New Revision: 405946 URL: http://svn.apache.org/viewcvs?rev=405946&view=rev Log: Fix yet another case where TagSoup supplies invalid parameters. Modified: lucene/nutch/trunk/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMBuilder.java

svn commit: r405967 - in /lucene/nutch/trunk: conf/ src/java/org/apache/nutch/crawl/ src/java/org/apache/nutch/fetcher/ src/java/org/apache/nutch/indexer/ src/java/org/apache/nutch/parse/ src/java/org

2006-05-12 Thread ab
Author: ab Date: Fri May 12 17:52:33 2006 New Revision: 405967 URL: http://svn.apache.org/viewcvs?rev=405967&view=rev Log: Scoring API (NUTCH-240). Development of this functionality was supported by Krugle.net. Thank you! Added: lucene/nutch/trunk/src/java/org/apache/nutch/scoring

svn commit: r406757 - in /lucene/nutch/trunk: conf/nutch-default.xml src/java/org/apache/nutch/crawl/Generator.java src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java

2006-05-15 Thread ab
Author: ab Date: Mon May 15 15:18:34 2006 New Revision: 406757 URL: http://svn.apache.org/viewcvs?rev=406757&view=rev Log: Fix NUTCH-268. Default settings are still different to avoid DOS-ing remote DNS servers during fetchlist generation. Modified: lucene/nutch/trunk/conf/nutch-default.xml

svn commit: r409276 - in /lucene/nutch/trunk/src/java/org/apache/nutch/crawl: CrawlDb.java LinkDb.java

2006-05-24 Thread ab
Author: ab Date: Wed May 24 17:42:40 2006 New Revision: 409276 URL: http://svn.apache.org/viewvc?rev=409276&view=rev Log: Add missing fs.mkdirs() - NUTCH-285. Submitted by Dennis Kubes. Modified: lucene/nutch/trunk/src/java/org/apache/nutch/crawl/CrawlDb.java lucene/nutch/trunk/src/java

svn commit: r410377 - in /lucene/nutch/trunk/src/java/org/apache/nutch: crawl/Generator.java segment/SegmentMerger.java

2006-05-30 Thread ab
Author: ab Date: Tue May 30 14:12:52 2006 New Revision: 410377 URL: http://svn.apache.org/viewvc?rev=410377&view=rev Log: SegmentMerger bug-fixes and improvements: * replace deprecated use of java.io.File with Hadoop's Path. * old segment name from Content.metadata needs to be replaced

svn commit: r417285 - in /lucene/nutch/trunk: conf/nutch-default.xml src/java/org/apache/nutch/searcher/LuceneQueryOptimizer.java

2006-06-26 Thread ab
Author: ab Date: Mon Jun 26 12:38:39 2006 New Revision: 417285 URL: http://svn.apache.org/viewvc?rev=417285&view=rev Log: Add an optional mechanism to time limit long-running queries. This helps to protect search servers from adverse effects of certain resource-intensive queries. Development
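One generic way to time-limit a long-running query is to check a deadline while hits are collected and return whatever was gathered so far. The Python sketch below shows that idea only. The truncated log does not describe how the commit implements the limit, so `collect_with_deadline` and its parameters are assumptions.

```python
import time


def collect_with_deadline(hits, max_seconds, clock=time.monotonic):
    """Collect hits until the iterable is exhausted or the deadline passes.

    Returns (collected, timed_out). The clock is injectable so the
    behaviour can be exercised without real waiting.
    """
    deadline = clock() + max_seconds
    collected = []
    for hit in hits:
        if clock() >= deadline:
            # Stop early and report partial results.
            return collected, True
        collected.append(hit)
    return collected, False


if __name__ == "__main__":
    results, timed_out = collect_with_deadline(range(5), max_seconds=60)
    print(results, timed_out)
```

Returning partial results rather than failing keeps the search server responsive while still giving the caller something to display.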

svn commit: r423291 - /lucene/nutch/trunk/conf/nutch-default.xml

2006-07-18 Thread ab
Author: ab Date: Tue Jul 18 16:27:49 2006 New Revision: 423291 URL: http://svn.apache.org/viewvc?rev=423291&view=rev Log: Add db.max.inlinks with its default value, and document it. Modified: lucene/nutch/trunk/conf/nutch-default.xml Modified: lucene/nutch/trunk/conf/nutch-default.xml URL

svn commit: r423539 - in /lucene/nutch/trunk: conf/nutch-default.xml src/java/org/apache/nutch/parse/ParseOutputFormat.java

2006-07-19 Thread ab
Author: ab Date: Wed Jul 19 10:35:08 2006 New Revision: 423539 URL: http://svn.apache.org/viewvc?rev=423539&view=rev Log: Add ability to limit outlinks to only include initial hosts (NUTCH-173). Modified: lucene/nutch/trunk/conf/nutch-default.xml lucene/nutch/trunk/src/java/org/apache

svn commit: r423630 - in /lucene/nutch/trunk: CHANGES.txt src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/

2006-07-19 Thread ab
Author: ab Date: Wed Jul 19 15:07:48 2006 New Revision: 423630 URL: http://svn.apache.org/viewvc?rev=423630&view=rev Log: Add support for Crawl-delay in robots.txt (NUTCH-293). Modified: lucene/nutch/trunk/CHANGES.txt lucene/nutch/trunk/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api
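For readers unfamiliar with the Crawl-delay directive, the following sketch reads it with Python's standard `urllib.robotparser`. It is not the lib-http code from this commit. The `crawl_delay_seconds` helper, the agent name, and the fallback default are assumptions.

```python
from urllib.robotparser import RobotFileParser


def crawl_delay_seconds(robots_txt: str, agent: str, default: float) -> float:
    """Return the Crawl-delay that applies to `agent`, or `default`
    when robots.txt does not specify one."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(agent)
    return default if delay is None else float(delay)


if __name__ == "__main__":
    robots = "User-agent: *\nCrawl-delay: 5\nDisallow: /private/\n"
    # "examplebot" is a made-up agent name for this sketch.
    print(crawl_delay_seconds(robots, "examplebot", default=1.0))
```

A fetcher would wait at least this many seconds between successive requests to the same host.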

svn commit: r423643 - in /lucene/nutch/trunk/src: java/org/apache/nutch/crawl/ java/org/apache/nutch/scoring/ plugin/scoring-opic/src/java/org/apache/nutch/scoring/opic/

2006-07-19 Thread ab
Author: ab Date: Wed Jul 19 15:39:53 2006 New Revision: 423643 URL: http://svn.apache.org/viewvc?rev=423643&view=rev Log: Fix a deficiency in the scoring API (NUTCH-321). Modified: lucene/nutch/trunk/src/java/org/apache/nutch/crawl/CrawlDbReducer.java lucene/nutch/trunk/src/java/org

svn commit: r423665 - /lucene/nutch/trunk/src/java/org/apache/nutch/fetcher/Fetcher.java

2006-07-19 Thread ab
Author: ab Date: Wed Jul 19 16:54:51 2006 New Revision: 423665 URL: http://svn.apache.org/viewvc?rev=423665&view=rev Log: If a transient exception is thrown, don't mark the page as gone but retry. Modified: lucene/nutch/trunk/src/java/org/apache/nutch/fetcher/Fetcher.java Modified: lucene

svn commit: r424965 - /lucene/nutch/trunk/src/java/org/apache/nutch/crawl/CrawlDbReader.java

2006-07-24 Thread ab
Author: ab Date: Mon Jul 24 01:40:19 2006 New Revision: 424965 URL: http://svn.apache.org/viewvc?rev=424965&view=rev Log: Set job names (NUTCH-329). Modified: lucene/nutch/trunk/src/java/org/apache/nutch/crawl/CrawlDbReader.java Modified: lucene/nutch/trunk/src/java/org/apache/nutch/crawl

svn commit: r425071 - /lucene/nutch/trunk/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java

2006-07-24 Thread ab
Author: ab Date: Mon Jul 24 07:41:18 2006 New Revision: 425071 URL: http://svn.apache.org/viewvc?rev=425071&view=rev Log: Expire all finished addresses. When sites request long crawl delays this quickly ties down all threads, and lock expiration happens rarely and proceeds too slowly to remove all

svn commit: r425354 - /lucene/nutch/trunk/bin/nutch

2006-07-25 Thread ab
Author: ab Date: Tue Jul 25 02:54:58 2006 New Revision: 425354 URL: http://svn.apache.org/viewvc?rev=425354&view=rev Log: Change the name of SegmentReader alias to 'readseg' for consistency with other reading-related commands. Keep the old 'segread' for compatibility, and give a deprecation

svn commit: r432254 - /lucene/nutch/trunk/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java

2006-08-17 Thread ab
Author: ab Date: Thu Aug 17 07:53:54 2006 New Revision: 432254 URL: http://svn.apache.org/viewvc?rev=432254&view=rev Log: Move toLowerCase where it actually matters. Fix some whitespace. Modified: lucene/nutch/trunk/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api

svn commit: r432256 - /lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Generator.java

2006-08-17 Thread ab
Author: ab Date: Thu Aug 17 07:56:35 2006 New Revision: 432256 URL: http://svn.apache.org/viewvc?rev=432256&view=rev Log: Apply patch in NUTCH-348 - Generator used the lowest score instead of the highest. Contributed by Chris Schneider and Stefan Groschupf. Modified: lucene/nutch/trunk/src
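The class of bug described here, picking the lowest-scoring entries when the highest were intended, comes down to sort direction in a top-N selection. The Python sketch below shows the intended behaviour in isolation. `select_top` and the sample data are hypothetical, and this is not the Generator code.

```python
import heapq


def select_top(scored_urls, limit):
    """Return up to `limit` (url, score) pairs, highest score first."""
    return heapq.nlargest(limit, scored_urls, key=lambda item: item[1])


if __name__ == "__main__":
    candidates = [("a", 0.1), ("b", 2.5), ("c", 1.0)]
    print(select_top(candidates, 2))
```

Swapping `nlargest` for `nsmallest`, or inverting a comparator, would still return a valid-looking list, which is why this kind of error can go unnoticed.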

svn commit: r432287 - /lucene/nutch/branches/branch-0.8/src/java/org/apache/nutch/crawl/Generator.java

2006-08-17 Thread ab
Author: ab Date: Thu Aug 17 09:35:35 2006 New Revision: 432287 URL: http://svn.apache.org/viewvc?rev=432287&view=rev Log: Apply patch in NUTCH-348 - Generator used the lowest score instead of the highest. Contributed by Chris Schneider and Stefan Groschupf. Modified: lucene/nutch/branches

svn commit: r432290 - /lucene/nutch/branches/branch-0.8/CHANGES.txt

2006-08-17 Thread ab
Author: ab Date: Thu Aug 17 09:38:21 2006 New Revision: 432290 URL: http://svn.apache.org/viewvc?rev=432290&view=rev Log: Update CHANGES. Modified: lucene/nutch/branches/branch-0.8/CHANGES.txt Modified: lucene/nutch/branches/branch-0.8/CHANGES.txt URL: http://svn.apache.org/viewvc/lucene

svn commit: r432293 - /lucene/nutch/trunk/CHANGES.txt

2006-08-17 Thread ab
Author: ab Date: Thu Aug 17 09:41:12 2006 New Revision: 432293 URL: http://svn.apache.org/viewvc?rev=432293&view=rev Log: Update CHANGES. Modified: lucene/nutch/trunk/CHANGES.txt Modified: lucene/nutch/trunk/CHANGES.txt URL: http://svn.apache.org/viewvc/lucene/nutch/trunk/CHANGES.txt?rev

svn commit: r432674 - /lucene/nutch/trunk/src/java/org/apache/nutch/indexer/IndexMerger.java

2006-08-18 Thread ab
Author: ab Date: Fri Aug 18 11:48:29 2006 New Revision: 432674 URL: http://svn.apache.org/viewvc?rev=432674&view=rev Log: NUTCH-341 - if -workingdir is specified, always create a unique subdir. Also, use unique directory names to allow multiple IndexMergers to run simultaneously. Modified
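The unique-subdirectory idea can be sketched with Python's `tempfile.mkdtemp`, which creates a directory with a name no concurrent caller will receive. The `make_merge_dir` helper and the `indexmerger-` prefix are assumptions for illustration. The commit's own naming scheme is not shown in this excerpt.

```python
import os
import tempfile


def make_merge_dir(working_dir: str) -> str:
    """Create and return a fresh, uniquely named subdirectory of working_dir."""
    os.makedirs(working_dir, exist_ok=True)
    # mkdtemp creates the directory itself, so two concurrent callers
    # cannot end up sharing the same path.
    return tempfile.mkdtemp(prefix="indexmerger-", dir=working_dir)


if __name__ == "__main__":
    base = tempfile.mkdtemp()
    print(make_merge_dir(base))
    print(make_merge_dir(base))
```

Because each run works in its own subdirectory, cleaning up after one merge cannot delete the intermediate files of another.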

svn commit: r432675 - /lucene/nutch/branches/branch-0.8/src/java/org/apache/nutch/indexer/IndexMerger.java

2006-08-18 Thread ab
Author: ab Date: Fri Aug 18 11:50:00 2006 New Revision: 432675 URL: http://svn.apache.org/viewvc?rev=432675&view=rev Log: NUTCH-341 - if -workingdir is specified, always create a unique subdir. Also, use unique directory names to allow multiple IndexMergers to run simultaneously. Modified

svn commit: r447359 - /lucene/nutch/branches/branch-0.8/src/java/org/apache/nutch/searcher/LuceneQueryOptimizer.java

2006-09-18 Thread ab
Author: ab Date: Mon Sep 18 03:43:07 2006 New Revision: 447359 URL: http://svn.apache.org/viewvc?view=rev&rev=447359 Log: Fix an NPE when using searcher.max.hits, but NOT using time limit. Modified: lucene/nutch/branches/branch-0.8/src/java/org/apache/nutch/searcher/LuceneQueryOptimizer.java

svn commit: r449088 [2/2] - in /lucene/nutch/trunk: conf/ src/java/org/apache/nutch/crawl/ src/java/org/apache/nutch/fetcher/ src/java/org/apache/nutch/net/ src/java/org/apache/nutch/parse/ src/plugin

2006-09-22 Thread ab
Added: lucene/nutch/trunk/src/plugin/urlnormalizer-regex/src/test/org/apache/nutch/net/urlnormalizer/regex/TestRegexURLNormalizer.java URL:

svn commit: r449294 - in /lucene/nutch/branches/branch-0.8: ./ src/java/org/apache/nutch/fetcher/ src/java/org/apache/nutch/protocol/ src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/

2006-09-23 Thread ab
Author: ab Date: Sat Sep 23 12:45:48 2006 New Revision: 449294 URL: http://svn.apache.org/viewvc?view=rev&rev=449294 Log: NUTCH-350: Urls blocked by http.max.delays incorrectly marked as GONE. Added: lucene/nutch/branches/branch-0.8/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http

svn commit: r449738 - /lucene/nutch/trunk/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java

2006-09-25 Thread ab
Author: ab Date: Mon Sep 25 09:58:49 2006 New Revision: 449738 URL: http://svn.apache.org/viewvc?view=rev&rev=449738 Log: Don't create dummy Content (throws NPE), just pass null. Reported by Richard Braman. Modified: lucene/nutch/trunk/src/plugin/lib-http/src/java/org/apache/nutch/protocol

svn commit: r449742 - /lucene/nutch/branches/branch-0.8/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java

2006-09-25 Thread ab
Author: ab Date: Mon Sep 25 10:05:22 2006 New Revision: 449742 URL: http://svn.apache.org/viewvc?view=rev&rev=449742 Log: Don't create dummy Content (throws NPE), just pass null. Reported by Richard Braman. Modified: lucene/nutch/branches/branch-0.8/src/plugin/lib-http/src/java/org/apache

svn commit: r449765 - /lucene/nutch/branches/branch-0.8/src/java/org/apache/nutch/parse/OutlinkExtractor.java

2006-09-25 Thread ab
Author: ab Date: Mon Sep 25 11:14:31 2006 New Revision: 449765 URL: http://svn.apache.org/viewvc?view=rev&rev=449765 Log: Catch exception on invalid urls, and continue collecting valid ones. Modified: lucene/nutch/branches/branch-0.8/src/java/org/apache/nutch/parse/OutlinkExtractor.java

svn commit: r450799 - in /lucene/nutch/trunk: conf/nutch-default.xml src/java/org/apache/nutch/crawl/CrawlDb.java src/java/org/apache/nutch/crawl/CrawlDbReducer.java

2006-09-28 Thread ab
Author: ab Date: Thu Sep 28 03:48:25 2006 New Revision: 450799 URL: http://svn.apache.org/viewvc?view=rev&rev=450799 Log: Bring back the '-noAdditions' option. This is useful for running constrained crawls, where the complete list of URLs is known in advance. Modified: lucene/nutch/trunk/conf

svn commit: r454297 - /lucene/nutch/trunk/src/plugin/lib-parsems/src/java/org/apache/nutch/parse/ms/MSBaseParser.java

2006-10-09 Thread ab
Author: ab Date: Mon Oct 9 00:13:46 2006 New Revision: 454297 URL: http://svn.apache.org/viewvc?view=rev&rev=454297 Log: Fix NPE when document properties are null. Reported by Trym Asserson. Modified: lucene/nutch/trunk/src/plugin/lib-parsems/src/java/org/apache/nutch/parse/ms

svn commit: r454298 - /lucene/nutch/branches/branch-0.8/src/plugin/lib-parsems/src/java/org/apache/nutch/parse/ms/MSBaseParser.java

2006-10-09 Thread ab
Author: ab Date: Mon Oct 9 00:22:00 2006 New Revision: 454298 URL: http://svn.apache.org/viewvc?view=rev&rev=454298 Log: Fix NPE when document properties are null. Reported by Trym Asserson. Modified: lucene/nutch/branches/branch-0.8/src/plugin/lib-parsems/src/java/org/apache/nutch/parse/ms

svn commit: r468673 - /lucene/nutch/branches/branch-0.8/build.xml

2006-10-28 Thread ab
Author: ab Date: Sat Oct 28 03:32:44 2006 New Revision: 468673 URL: http://svn.apache.org/viewvc?view=rev&rev=468673 Log: Fix NUTCH-394. Modified: lucene/nutch/branches/branch-0.8/build.xml Modified: lucene/nutch/branches/branch-0.8/build.xml URL: http://svn.apache.org/viewvc/lucene/nutch

svn commit: r469662 - /lucene/nutch/trunk/CHANGES.txt

2006-10-31 Thread ab
Author: ab Date: Tue Oct 31 13:36:01 2006 New Revision: 469662 URL: http://svn.apache.org/viewvc?view=rev&rev=469662 Log: Update. Modified: lucene/nutch/trunk/CHANGES.txt Modified: lucene/nutch/trunk/CHANGES.txt URL: http://svn.apache.org/viewvc/lucene/nutch/trunk/CHANGES.txt?view=diff&rev

svn commit: r469667 - in /lucene/nutch/branches/branch-0.8: CHANGES.txt src/java/org/apache/nutch/crawl/Generator.java

2006-10-31 Thread ab
Author: ab Date: Tue Oct 31 13:46:26 2006 New Revision: 469667 URL: http://svn.apache.org/viewvc?view=rev&rev=469667 Log: NUTCH-361, NUTCH-136 - When jobtracker is 'local' generate only one partition. Modified: lucene/nutch/branches/branch-0.8/CHANGES.txt lucene/nutch/branches/branch-0.8

svn commit: r474756 - /lucene/nutch/trunk/src/java/org/apache/nutch/segment/SegmentReader.java

2006-11-14 Thread ab
Author: ab Date: Tue Nov 14 04:11:30 2006 New Revision: 474756 URL: http://svn.apache.org/viewvc?view=revrev=474756 Log: NUTCH-401: use hadoop.tmp.dir instead of hardcoded /tmp. Modified: lucene/nutch/trunk/src/java/org/apache/nutch/segment/SegmentReader.java Modified: lucene/nutch/trunk

svn commit: r474763 - /lucene/nutch/branches/branch-0.8/src/java/org/apache/nutch/segment/SegmentReader.java

2006-11-14 Thread ab
Author: ab Date: Tue Nov 14 04:24:48 2006 New Revision: 474763 URL: http://svn.apache.org/viewvc?view=revrev=474763 Log: NUTCH-401: use mapred.temp.dir instead of hardcoded /tmp. Modified: lucene/nutch/branches/branch-0.8/src/java/org/apache/nutch/segment/SegmentReader.java Modified

svn commit: r474934 - /lucene/nutch/trunk/src/java/org/apache/nutch/metadata/MetaWrapper.java

2006-11-14 Thread ab
Author: ab Date: Tue Nov 14 11:38:06 2006 New Revision: 474934 URL: http://svn.apache.org/viewvc?view=revrev=474934 Log: Add an ObjectWritable decorator. Added: lucene/nutch/trunk/src/java/org/apache/nutch/metadata/MetaWrapper.java (with props) Added: lucene/nutch/trunk/src/java/org

svn commit: r480188 - in /lucene/nutch/trunk/src: java/org/apache/nutch/fetcher/ java/org/apache/nutch/indexer/ java/org/apache/nutch/metadata/ java/org/apache/nutch/parse/ java/org/apache/nutch/segme

2006-11-28 Thread ab
Author: ab Date: Tue Nov 28 12:14:58 2006 New Revision: 480188 URL: http://svn.apache.org/viewvc?view=revrev=480188 Log: Move some constants to Nutch.java, so that Metadata could use them properly. Modified: lucene/nutch/trunk/src/java/org/apache/nutch/fetcher/Fetcher.java lucene/nutch

svn commit: r480207 - in /lucene/nutch/trunk/src: java/org/apache/nutch/metadata/ java/org/apache/nutch/protocol/ plugin/protocol-http/src/java/org/apache/nutch/protocol/http/ plugin/protocol-httpclie

2006-11-28 Thread ab
Author: ab Date: Tue Nov 28 13:02:10 2006 New Revision: 480207 URL: http://svn.apache.org/viewvc?view=revrev=480207 Log: Use SpellCheckedMetadata only when necessary, i.e. only when collecting metadata from unreliable sources such as HTTP headers. * Metadata: fix a bug where SpellCheckedMetadata

svn commit: r482674 - in /lucene/nutch/trunk/src: java/org/apache/nutch/fetcher/ java/org/apache/nutch/protocol/ plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/ plugin/protocol-file/src/j

2006-12-05 Thread ab
Author: ab Date: Tue Dec 5 06:34:13 2006 New Revision: 482674 URL: http://svn.apache.org/viewvc?view=revrev=482674 Log: Refactor robots.txt checking so that it's protocol independent. Make blocking and robots checking optional inside lib-http. This is needed for alternative Fetcher

svn commit: r483420 - in /lucene/nutch/trunk: lib/hadoop-0.7.1.jar lib/hadoop-0.9.1.jar src/java/org/apache/nutch/crawl/CrawlDb.java src/java/org/apache/nutch/parse/ParseOutputFormat.java src/test/org

2006-12-07 Thread ab
Author: ab Date: Thu Dec 7 03:21:08 2006 New Revision: 483420 URL: http://svn.apache.org/viewvc?view=revrev=483420 Log: Upgrade to Hadoop 0.9.1 . Added: lucene/nutch/trunk/lib/hadoop-0.9.1.jar (with props) Removed: lucene/nutch/trunk/lib/hadoop-0.7.1.jar Modified: lucene/nutch

svn commit: r485587 - /lucene/nutch/trunk/src/java/org/apache/nutch/crawl/CrawlDbReader.java

2006-12-11 Thread ab
Author: ab Date: Mon Dec 11 02:04:59 2006 New Revision: 485587 URL: http://svn.apache.org/viewvc?view=revrev=485587 Log: Remove misplaced cast, which sometimes led to an overflow. Close readers when done - when using local FS this would prevent us from deleting temporary dirs. Modified

svn commit: r487143 - /lucene/nutch/trunk/src/java/org/apache/nutch/crawl/LinkDb.java

2006-12-14 Thread ab
Author: ab Date: Thu Dec 14 00:53:08 2006 New Revision: 487143 URL: http://svn.apache.org/viewvc?view=revrev=487143 Log: Check if paths exist before deleting them. Reported by Renaud Richardet. Modified: lucene/nutch/trunk/src/java/org/apache/nutch/crawl/LinkDb.java Modified: lucene/nutch

svn commit: r487145 - in /lucene/nutch/branches/branch-0.8/src/java/org/apache/nutch/crawl: CrawlDb.java LinkDb.java

2006-12-14 Thread ab
Author: ab Date: Thu Dec 14 01:06:56 2006 New Revision: 487145 URL: http://svn.apache.org/viewvc?view=revrev=487145 Log: Check if paths exist before deleting them. Reported by Renaud Richardet. Modified: lucene/nutch/branches/branch-0.8/src/java/org/apache/nutch/crawl/CrawlDb.java

svn commit: r493085 - in /lucene/nutch/trunk: CHANGES.txt src/plugin/parse-js/src/java/org/apache/nutch/parse/js/JSParseFilter.java

2007-01-05 Thread ab
Author: ab Date: Fri Jan 5 08:58:29 2007 New Revision: 493085 URL: http://svn.apache.org/viewvc?view=revrev=493085 Log: Fix NUTCH-425 and NUTCH-426. Modified: lucene/nutch/trunk/CHANGES.txt lucene/nutch/trunk/src/plugin/parse-js/src/java/org/apache/nutch/parse/js/JSParseFilter.java

svn commit: r495214 - in /lucene/nutch/trunk: CHANGES.txt src/java/org/apache/nutch/crawl/Crawl.java src/java/org/apache/nutch/indexer/Indexer.java

2007-01-11 Thread ab
Author: ab Date: Thu Jan 11 05:25:43 2007 New Revision: 495214 URL: http://svn.apache.org/viewvc?view=revrev=495214 Log: When indexing redirected pages, drop intermediate pages and only index the final page. Avoid NPEs in Crawl tool, when no URLs are generated or fetched. Modified: lucene

svn commit: r495397 - in /lucene/nutch/trunk: CHANGES.txt src/java/org/apache/nutch/indexer/DeleteDuplicates.java src/test/org/apache/nutch/indexer/TestDeleteDuplicates.java

2007-01-11 Thread ab
Author: ab Date: Thu Jan 11 14:00:51 2007 New Revision: 495397 URL: http://svn.apache.org/viewvc?view=revrev=495397 Log: Fix NUTCH-420 - DeleteDuplicates depended on the order of IndexDoc processing. Modified: lucene/nutch/trunk/CHANGES.txt lucene/nutch/trunk/src/java/org/apache/nutch

svn commit: r496535 - /lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Generator.java

2007-01-15 Thread ab
Author: ab Date: Mon Jan 15 15:07:15 2007 New Revision: 496535 URL: http://svn.apache.org/viewvc?view=revrev=496535 Log: Pick the right entry, as indicated by the same generate time. Modified: lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Generator.java Modified: lucene/nutch/trunk/src

svn commit: r497141 - in /lucene/nutch/trunk: CHANGES.txt bin/nutch src/java/org/apache/nutch/tools/FreeGenerator.java

2007-01-17 Thread ab
Author: ab Date: Wed Jan 17 11:55:07 2007 New Revision: 497141 URL: http://svn.apache.org/viewvc?view=revrev=497141 Log: NUTCH-68 - ported to use map-reduce. Added: lucene/nutch/trunk/src/java/org/apache/nutch/tools/FreeGenerator.java (with props) Modified: lucene/nutch/trunk

svn commit: r497172 - in /lucene/nutch/trunk: bin/nutch src/java/org/apache/nutch/fetcher/Fetcher.java src/java/org/apache/nutch/fetcher/Fetcher2.java

2007-01-17 Thread ab
Author: ab Date: Wed Jan 17 13:06:50 2007 New Revision: 497172 URL: http://svn.apache.org/viewvc?view=revrev=497172 Log: Revert accidental change to bin/nutch. Fix Fetcher.java to correctly split input. Add Fetcher2 - a queue-based fetcher implementation. Added: lucene/nutch/trunk/src/java

svn commit: r499944 - /lucene/nutch/trunk/CHANGES.txt

2007-01-25 Thread ab
Author: ab Date: Thu Jan 25 12:15:34 2007 New Revision: 499944 URL: http://svn.apache.org/viewvc?view=revrev=499944 Log: Mention the addition of Fetcher2. Modified: lucene/nutch/trunk/CHANGES.txt Modified: lucene/nutch/trunk/CHANGES.txt URL: http://svn.apache.org/viewvc/lucene/nutch/trunk

svn commit: r507504 - in /lucene/nutch/trunk/src/java/org/apache/nutch/parse: Outlink.java ParseSegment.java

2007-02-14 Thread ab
Author: ab Date: Wed Feb 14 04:15:05 2007 New Revision: 507504 URL: http://svn.apache.org/viewvc?view=revrev=507504 Log: Outlink: when null anchor is supplied replace it with an empty string. ParseSegment: store segment name in parts that we produce here. Content is only read, not stored as one

svn commit: r515698 - in /lucene/nutch/trunk: CHANGES.txt bin/nutch

2007-03-07 Thread ab
Author: ab Date: Wed Mar 7 11:02:56 2007 New Revision: 515698 URL: http://svn.apache.org/viewvc?view=revrev=515698 Log: NUTCH-432 - JAVA_PLATFORM with spaces breaks bin/nutch. Also, apply the patch proposed in HADOOP-1080 to fix CLASSPATH problems under Cygwin. Modified: lucene/nutch/trunk

svn commit: r515791 - in /lucene/nutch/trunk: ./ lib/ lib/native/Linux-i386-32/ src/java/org/apache/nutch/crawl/ src/java/org/apache/nutch/fetcher/ src/java/org/apache/nutch/indexer/ src/java/org/apac

2007-03-07 Thread ab
Author: ab Date: Wed Mar 7 13:59:07 2007 New Revision: 515791 URL: http://svn.apache.org/viewvc?view=revrev=515791 Log: Upgrade to Hadoop 0.11.2 and Lucene 2.1.0 releases. Added: lucene/nutch/trunk/lib/hadoop-0.11.2-core.jar (with props) lucene/nutch/trunk/lib/lucene-core-2.1.0.jar

svn commit: r516387 - /lucene/nutch/trunk/src/java/org/apache/nutch/fetcher/Fetcher2.java

2007-03-09 Thread ab
Author: ab Date: Fri Mar 9 04:27:18 2007 New Revision: 516387 URL: http://svn.apache.org/viewvc?view=revrev=516387 Log: Add the number of active threads to the status report. Modified: lucene/nutch/trunk/src/java/org/apache/nutch/fetcher/Fetcher2.java Modified: lucene/nutch/trunk/src/java

svn commit: r517382 - in /lucene/nutch/trunk/contrib/web2/plugins: web-caching-oscache/ web-caching-oscache/src/conf/ web-clustering/ web-keymatch/ web-more/ web-more/src/conf/ web-query-propose-ontol

2007-03-12 Thread ab
Author: ab Date: Mon Mar 12 13:35:37 2007 New Revision: 517382 URL: http://svn.apache.org/viewvc?view=revrev=517382 Log: Fix inconsistent end-of-line style. Discovered this when trying to import to a separate subversion repo. Modified: lucene/nutch/trunk/contrib/web2/plugins/web-caching

svn commit: r520154 - in /lucene/nutch/trunk: ./ lib/ lib/native/Linux-i386-32/

2007-03-19 Thread ab
Author: ab Date: Mon Mar 19 16:02:56 2007 New Revision: 520154 URL: http://svn.apache.org/viewvc?view=revrev=520154 Log: Update to Hadoop 0.12.1. Added: lucene/nutch/trunk/lib/hadoop-0.12.1-core.jar (with props) lucene/nutch/trunk/lib/jets3t-0.5.0.jar (with props) Removed: lucene

svn commit: r521182 - in /lucene/nutch/trunk: CHANGES.txt src/java/org/apache/nutch/crawl/Injector.java

2007-03-22 Thread ab
Author: ab Date: Thu Mar 22 03:08:00 2007 New Revision: 521182 URL: http://svn.apache.org/viewvc?view=revrev=521182 Log: NUTCH-246 - incorrect segment size being generated due to time synchronization issue. Modified: lucene/nutch/trunk/CHANGES.txt lucene/nutch/trunk/src/java/org/apache

svn commit: r521933 - in /lucene/nutch/trunk: ./ lib/ lib/native/Linux-i386-32/ src/test/org/apache/nutch/indexer/

2007-03-23 Thread ab
Author: ab Date: Fri Mar 23 15:59:01 2007 New Revision: 521933 URL: http://svn.apache.org/viewvc?view=revrev=521933 Log: Upgrade to Hadoop 0.12.2 release. Fix whitespace issues in platform name in bin/hadoop under Cygwin. Replace deprecated method call. Added: lucene/nutch/trunk/lib/hadoop

svn commit: r526455 - /lucene/nutch/trunk/src/java/org/apache/nutch/crawl/MapWritable.java

2007-04-07 Thread ab
Author: ab Date: Sat Apr 7 09:44:02 2007 New Revision: 526455 URL: http://svn.apache.org/viewvc?view=revrev=526455 Log: Empty MapWritable would throw an NPE when building a keySet. Modified: lucene/nutch/trunk/src/java/org/apache/nutch/crawl/MapWritable.java Modified: lucene/nutch/trunk
