svn commit: r208869 [2/12] - in /lucene/nutch/trunk: conf/ src/plugin/languageidentifier/ src/plugin/languageidentifier/src/java/org/apache/nutch/analysis/lang/ src/plugin/languageidentifier/src/test/org/apache/nutch/analysis/lang/

2005-07-02 Thread ab
6698 -tnin 6691 -ag_ -sæ 6663 -ndr 6626 -sen 6593 -ente 6571 -dig 6562 -ga 6555 -kl 6546 -tu 6517 -bes 6500 -fe 6473 -lag 6461 -red 6435 -lin 6431 -ks 6412 -dre 6409 -ment 6405 -kal_ 6387 -skal 6383 -ved 6325 -ab 6321 -sam 6321 -æl 6310 -par 6284 -v_ 6283 -bet 6246 -est 6224 -ner_ 6218 -ve_ 6218

svn commit: r208869 [12/12] - in /lucene/nutch/trunk: conf/ src/plugin/languageidentifier/ src/plugin/languageidentifier/src/java/org/apache/nutch/analysis/lang/ src/plugin/languageidentifier/src/test/org/apache/nutch/analysis/lang/

2005-07-02 Thread ab
2734 -utan 2733 -aga 2731 -änk 2726 -org 2723 -öj 2723 -ab 2722 -ven_ 2716 -is_ 2711 -dli 2704 -rän 2704 -nkt 2703 -rfö 2698 -dag 2693 -ien 2692 -tti 2689 -bö 2676 -ske 2672 -amt 2669 -and_ 2669 -tvi 2662 -rag 2654 -ckli 2653 -ive 2647 -dd 2646 -rför 2646 -avs 2645 -dern 2645 -beh 2644 -nade

svn commit: r209246 - in /lucene/nutch/trunk/src/plugin/parse-js: plugin.xml src/java/org/apache/nutch/parse/js/JSParseFilter.java

2005-07-05 Thread ab
Author: ab Date: Tue Jul 5 01:56:01 2005 New Revision: 209246 URL: http://svn.apache.org/viewcvs?rev=209246&view=rev Log: Activate this as a Parser plugin (it was accidentally omitted). Also accept an empty content type if the extension is right. Modified: lucene/nutch/trunk/src/plugin/parse-js

svn commit: r209669 - /lucene/nutch/trunk/src/plugin/languageidentifier/src/java/org/apache/nutch/analysis/lang/LanguageIndexingFilter.java

2005-07-07 Thread ab
Author: ab Date: Thu Jul 7 15:43:49 2005 New Revision: 209669 URL: http://svn.apache.org/viewcvs?rev=209669&view=rev Log: Forgot to add this one, sorry. Added: lucene/nutch/trunk/src/plugin/languageidentifier/src/java/org/apache/nutch/analysis/lang/LanguageIndexingFilter.java (with props

svn commit: r219097 - in /lucene/nutch/trunk/src/java/org/apache/nutch: fs/NDFSFileSystem.java ndfs/FSDirectory.java ndfs/NDFSFile.java ndfs/NDFSFileInfo.java

2005-07-14 Thread ab
Author: ab Date: Thu Jul 14 13:54:16 2005 New Revision: 219097 URL: http://svn.apache.org/viewcvs?rev=219097&view=rev Log: Fix issues reported in NUTCH-46. Submitted by Piotr Kosiorowski. Modified: lucene/nutch/trunk/src/java/org/apache/nutch/fs/NDFSFileSystem.java lucene/nutch/trunk/src

svn commit: r220034 - in /lucene/nutch/trunk/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient: DummySSLProtocolSocketFactory.java Http.java HttpResponse.java

2005-07-21 Thread ab
Author: ab Date: Thu Jul 21 04:07:10 2005 New Revision: 220034 URL: http://svn.apache.org/viewcvs?rev=220034&view=rev Log: Fixes for NUTCH-66, and some cleanup: * apply the more lenient CookiePolicy. This fixes the problem with poorly formatted cookies being rejected. * remove the while() loop

svn commit: r232303 - in /lucene/nutch/trunk/src/plugin: ./ parse-rss/ parse-rss/lib/ parse-rss/sample/ parse-rss/src/ parse-rss/src/java/ parse-rss/src/java/org/ parse-rss/src/java/org/apache/ parse-

2005-08-12 Thread ab
Author: ab Date: Fri Aug 12 07:23:47 2005 New Revision: 232303 URL: http://svn.apache.org/viewcvs?rev=232303&view=rev Log: RSS Parse plugin. Contributed by Chris Mattmann (issue NUTCH-30). Thank you! Added: lucene/nutch/trunk/src/plugin/parse-rss/ lucene/nutch/trunk/src/plugin/parse-rss

svn commit: r290382 - in /lucene/nutch/trunk/src/plugin/parse-pdf: lib/PDFBox-0.7.0.LICENSE.txt lib/PDFBox-0.7.0.jar lib/PDFBox-0.7.2-log4j.jar lib/PDFBox-LICENSE.txt lib/log4j-1.2.9.LICENSE.txt lib/log4j-LICENSE.txt plugin.xml

2005-09-20 Thread ab
Author: ab Date: Tue Sep 20 00:09:45 2005 New Revision: 290382 URL: http://svn.apache.org/viewcvs?rev=290382&view=rev Log: Updated to PDFBox-0.7.2. Starting with this release PDFBox comes in two versions - one has no logging at all, the other uses log4j (which is the version used here

svn commit: r290383 - in /lucene/nutch/branches/Release-0.7/src/plugin/parse-pdf: lib/PDFBox-0.7.0.LICENSE.txt lib/PDFBox-0.7.0.jar lib/PDFBox-0.7.2-log4j.jar lib/PDFBox-LICENSE.txt lib/log4j-1.2.9.LICENSE.txt lib/log4j-LICENSE.txt plugin.xml

2005-09-20 Thread ab
Author: ab Date: Tue Sep 20 00:16:42 2005 New Revision: 290383 URL: http://svn.apache.org/viewcvs?rev=290383&view=rev Log: Updated to PDFBox-0.7.2. Starting with this release PDFBox comes in two versions - one has no logging at all, the other uses log4j (which is the version used here

svn commit: r348417 - in /lucene/nutch/branches/mapred: bin/nutch src/java/org/apache/nutch/fs/NDFSShell.java src/java/org/apache/nutch/fs/TestClient.java

2005-11-23 Thread ab
Author: ab Date: Wed Nov 23 04:09:07 2005 New Revision: 348417 URL: http://svn.apache.org/viewcvs?rev=348417&view=rev Log: Rename the TestClient to NDFSShell. Modify the bin/nutch script accordingly. Added: lucene/nutch/branches/mapred/src/java/org/apache/nutch/fs/NDFSShell.java

svn commit: r348432 - /lucene/nutch/branches/mapred/bin/nutch

2005-11-23 Thread ab
Author: ab Date: Wed Nov 23 05:22:05 2005 New Revision: 348432 URL: http://svn.apache.org/viewcvs?rev=348432&view=rev Log: Add readdb command - human interface to CrawlDbReader. Modified: lucene/nutch/branches/mapred/bin/nutch Modified: lucene/nutch/branches/mapred/bin/nutch URL: http

svn commit: r348503 - /lucene/nutch/branches/mapred/src/java/org/apache/nutch/crawl/CrawlDbReader.java

2005-11-23 Thread ab
Author: ab Date: Wed Nov 23 09:44:41 2005 New Revision: 348503 URL: http://svn.apache.org/viewcvs?rev=348503&view=rev Log: Remove dependency on Java 1.5. Spotted by Sami Siren. Modified: lucene/nutch/branches/mapred/src/java/org/apache/nutch/crawl/CrawlDbReader.java Modified: lucene/nutch

svn commit: r357962 - in /lucene/nutch/trunk/src/java/org/apache/nutch/crawl: Fetcher.java Generator.java Injector.java

2005-12-20 Thread ab
Author: ab Date: Tue Dec 20 03:19:21 2005 New Revision: 357962 URL: http://svn.apache.org/viewcvs?rev=357962&view=rev Log: Remove remaining calls to static NutchConf.get() in favor of using per-instance local configuration getConf(). Modified: lucene/nutch/trunk/src/java/org/apache/nutch

svn commit: r358674 - in /lucene/nutch/trunk/src: java/org/apache/nutch/crawl/ java/org/apache/nutch/indexer/ plugin/creativecommons/src/java/org/creativecommons/nutch/ plugin/index-basic/src/java/org

2005-12-22 Thread ab
Author: ab Date: Thu Dec 22 17:16:31 2005 New Revision: 358674 URL: http://svn.apache.org/viewcvs?rev=358674&view=rev Log: Remove traces of the old API FetcherOutput. The old IndexSegment is now marked broken. In the next step old utilities should be removed. Modified: lucene/nutch/trunk/src

svn commit: r359668 [2/2] - in /lucene/nutch/trunk: bin/ src/java/org/apache/nutch/crawl/ src/java/org/apache/nutch/db/ src/java/org/apache/nutch/fetcher/ src/java/org/apache/nutch/indexer/ src/java/o

2005-12-28 Thread ab
Added: lucene/nutch/trunk/src/java/org/apache/nutch/parse/ParseOutputFormat.java URL: http://svn.apache.org/viewcvs/lucene/nutch/trunk/src/java/org/apache/nutch/parse/ParseOutputFormat.java?rev=359668&view=auto == ---

svn commit: r359822 - in /lucene/nutch/trunk: bin/ conf/ src/java/org/apache/nutch/crawl/ src/java/org/apache/nutch/fetcher/ src/java/org/apache/nutch/indexer/ src/java/org/apache/nutch/parse/ src/jav

2005-12-29 Thread ab
Author: ab Date: Thu Dec 29 07:28:30 2005 New Revision: 359822 URL: http://svn.apache.org/viewcvs?rev=359822&view=rev Log: A framework for using different page signature implementations. Ordinary MD5 hash of a raw page content is very often unsuitable, when many near-duplicate pages are crawled
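
The motivation in the log entry above can be illustrated with a small sketch. This is not the signature framework added in r359822: the names SignatureSketch, rawMd5 and normalizedMd5 are hypothetical, and the normalization step (lowercasing and collapsing whitespace) is only an assumed stand-in for whatever a real near-duplicate signature would do.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Locale;

// Hypothetical illustration only; not the API introduced by the commit.
final class SignatureSketch {

    // MD5 over the raw bytes: any byte-level difference changes the result.
    static String rawMd5(String content) {
        return hex(md5(content.getBytes(StandardCharsets.UTF_8)));
    }

    // MD5 over a normalized form (assumed: lowercase, whitespace collapsed),
    // so pages that differ only in case or spacing share one signature.
    static String normalizedMd5(String content) {
        String normalized = content.toLowerCase(Locale.ROOT)
                .replaceAll("\\s+", " ")
                .trim();
        return hex(md5(normalized.getBytes(StandardCharsets.UTF_8)));
    }

    private static byte[] md5(byte[] data) {
        try {
            return MessageDigest.getInstance("MD5").digest(data);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Renders the digest as lowercase hexadecimal, two digits per byte.
    private static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }
}
```

With this sketch, two pages that differ only in spacing or letter case get different raw hashes but the same normalized signature, which is the kind of behaviour a pluggable signature would make selectable.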

svn commit: r360017 - /lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Crawl.java

2005-12-29 Thread ab
Author: ab Date: Thu Dec 29 23:55:52 2005 New Revision: 360017 URL: http://svn.apache.org/viewcvs?rev=360017&view=rev Log: Fix this - the Fetcher API has changed. Modified: lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Crawl.java Modified: lucene/nutch/trunk/src/java/org/apache/nutch

svn commit: r360069 - /lucene/nutch/trunk/src/java/org/apache/nutch/tools/DmozParser.java

2005-12-30 Thread ab
Author: ab Date: Fri Dec 30 03:07:50 2005 New Revision: 360069 URL: http://svn.apache.org/viewcvs?rev=360069&view=rev Log: Fix incorrect package declaration. Modified: lucene/nutch/trunk/src/java/org/apache/nutch/tools/DmozParser.java Modified: lucene/nutch/trunk/src/java/org/apache/nutch

svn commit: r365345 - /lucene/nutch/trunk/build.xml

2006-01-02 Thread ab
Author: ab Date: Mon Jan 2 05:26:19 2006 New Revision: 365345 URL: http://svn.apache.org/viewcvs?rev=365345&view=rev Log: Not needed anymore. Modified: lucene/nutch/trunk/build.xml Modified: lucene/nutch/trunk/build.xml URL: http://svn.apache.org/viewcvs/lucene/nutch/trunk/build.xml?rev

svn commit: r365576 - /lucene/nutch/trunk/src/java/org/apache/nutch/parse/ParseOutputFormat.java

2006-01-03 Thread ab
Author: ab Date: Tue Jan 3 00:35:04 2006 New Revision: 365576 URL: http://svn.apache.org/viewcvs?rev=365576&view=rev Log: Fixed an NPE, in case of a fetch error we don't have a score value from Fetcher. Modified: lucene/nutch/trunk/src/java/org/apache/nutch/parse/ParseOutputFormat.java

svn commit: r365850 - in /lucene/nutch/trunk/src/plugin/protocol-httpclient: ./ lib/ src/java/org/apache/nutch/protocol/httpclient/

2006-01-03 Thread ab
Author: ab Date: Tue Jan 3 23:32:04 2006 New Revision: 365850 URL: http://svn.apache.org/viewcvs?rev=365850&view=rev Log: Update Commons HTTPClient to v. 3.0. Add some default headers to prefer HTML content, and in English. Added: lucene/nutch/trunk/src/plugin/protocol-httpclient/lib

svn commit: r367251 - /lucene/nutch/trunk/src/java/org/apache/nutch/parse/ParseData.java

2006-01-09 Thread ab
Author: ab Date: Mon Jan 9 00:58:58 2006 New Revision: 367251 URL: http://svn.apache.org/viewcvs?rev=367251&view=rev Log: Replace the custom metadata serialization with the one provided by the ContentProperties class. This fixes the breakage if multiple property values per key are in use

svn commit: r368167 - in /lucene/nutch/trunk/src/java/org/apache/nutch: fetcher/Fetcher.java parse/ParseSegment.java

2006-01-11 Thread ab
Author: ab Date: Wed Jan 11 15:24:40 2006 New Revision: 368167 URL: http://svn.apache.org/viewcvs?rev=368167&view=rev Log: Make sure we always have the segment name and score values in ParseData.metadata. Sometimes plugins would fail to copy them through, or a parsing error would produce empty

svn commit: r370854 - /lucene/nutch/trunk/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java

2006-01-20 Thread ab
Author: ab Date: Fri Jan 20 08:00:04 2006 New Revision: 370854 URL: http://svn.apache.org/viewcvs?rev=370854&view=rev Log: Move excessive logging to Level.FINE. Modified: lucene/nutch/trunk/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java Modified: lucene/nutch

svn commit: r372664 - /lucene/nutch/trunk/src/java/org/apache/nutch/parse/ParseUtil.java

2006-01-26 Thread ab
Author: ab Date: Thu Jan 26 15:48:14 2006 New Revision: 372664 URL: http://svn.apache.org/viewcvs?rev=372664&view=rev Log: Fix for NUTCH-190 (ParseUtil drops reason for failed parse). Also, hide the message about successful parse behind LOG.fine() - no message is a good message; if users want

svn commit: r374724 - in /lucene/nutch/trunk: conf/ src/plugin/ src/plugin/parse-swf/ src/plugin/parse-swf/lib/ src/plugin/parse-swf/sample/ src/plugin/parse-swf/src/ src/plugin/parse-swf/src/java/ sr

2006-02-03 Thread ab
Author: ab Date: Fri Feb 3 10:49:07 2006 New Revision: 374724 URL: http://svn.apache.org/viewcvs?rev=374724&view=rev Log: Add a parse plugin for SWF (Macromedia Flash) files. Add a mapping in parse-plugins.xml. Add also an entry in mime-types.xml corresponding to an alternative, compressed SWF

svn commit: r374725 - /lucene/nutch/trunk/src/web/jsp/search.jsp

2006-02-03 Thread ab
Author: ab Date: Fri Feb 3 10:54:12 2006 New Revision: 374725 URL: http://svn.apache.org/viewcvs?rev=374725&view=rev Log: Encode the query on links in UTF-8. Modified: lucene/nutch/trunk/src/web/jsp/search.jsp Modified: lucene/nutch/trunk/src/web/jsp/search.jsp URL: http://svn.apache.org
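
As a rough illustration of what encoding a query in UTF-8 for a link means, here is a minimal sketch using the standard java.net.URLEncoder. The class name QueryLinkSketch, the method searchLink and the "search.jsp?query=" prefix are assumptions made for the example; the actual change to search.jsp is not shown in this listing.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper; not the code from the commit.
final class QueryLinkSketch {

    // Builds a link whose query parameter is form-encoded using UTF-8,
    // so non-ASCII characters become %XX escapes of their UTF-8 bytes.
    static String searchLink(String query) {
        try {
            // The link prefix is an assumed example.
            return "search.jsp?query=" + URLEncoder.encode(query, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }
}
```

URLEncoder produces application/x-www-form-urlencoded output, so spaces become "+" and a character such as "é" becomes "%C3%A9".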

svn commit: r376518 - in /lucene/nutch/trunk/src: java/org/apache/nutch/crawl/ test/org/apache/nutch/crawl/

2006-02-09 Thread ab
Author: ab Date: Thu Feb 9 16:56:57 2006 New Revision: 376518 URL: http://svn.apache.org/viewcvs?rev=376518&view=rev Log: Add metadata to CrawlDatum. Contributed by Stefan Groschupf in NUTCH-192. Added: lucene/nutch/trunk/src/java/org/apache/nutch/crawl/MapWritable.java (with props

svn commit: r380163 - /lucene/nutch/trunk/src/java/org/apache/nutch/crawl/LinkDb.java

2006-02-23 Thread ab
Author: ab Date: Thu Feb 23 09:21:38 2006 New Revision: 380163 URL: http://svn.apache.org/viewcvs?rev=380163&view=rev Log: Modify the cmd-line so that it's possible to perform incremental updates on existing linkDb. This significantly speeds up the invertlinks operation. Modified: lucene

svn commit: r383829 - /lucene/nutch/trunk/src/java/org/apache/nutch/parse/ParserFactory.java

2006-03-07 Thread ab
Author: ab Date: Tue Mar 7 01:26:54 2006 New Revision: 383829 URL: http://svn.apache.org/viewcvs?rev=383829&view=rev Log: Cache instances of ParsePluginList. Modified: lucene/nutch/trunk/src/java/org/apache/nutch/parse/ParserFactory.java Modified: lucene/nutch/trunk/src/java/org/apache

svn commit: r384011 - /lucene/nutch/trunk/src/java/org/apache/nutch/searcher/Query.java

2006-03-07 Thread ab
Author: ab Date: Tue Mar 7 13:04:31 2006 New Revision: 384011 URL: http://svn.apache.org/viewcvs?rev=384011&view=rev Log: No-arg constructors are required for RPC. Allow the RPC Server to set local Configuration. Fix by Marko Bauhardt. Modified: lucene/nutch/trunk/src/java/org/apache/nutch

svn commit: r384012 - /lucene/nutch/trunk/src/java/org/apache/nutch/servlet/Cached.java

2006-03-07 Thread ab
Author: ab Date: Tue Mar 7 13:09:31 2006 New Revision: 384012 URL: http://svn.apache.org/viewcvs?rev=384012&view=rev Log: Servlets are initialized by a servlet container through no-args init(). Modified: lucene/nutch/trunk/src/java/org/apache/nutch/servlet/Cached.java Modified: lucene/nutch

svn commit: r384219 - /lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Generator.java

2006-03-08 Thread ab
Author: ab Date: Wed Mar 8 06:10:12 2006 New Revision: 384219 URL: http://svn.apache.org/viewcvs?rev=384219&view=rev Log: Don't generate URLs that don't pass URLFilters. Modified: lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Generator.java Modified: lucene/nutch/trunk/src/java/org

svn commit: r385531 - /lucene/nutch/trunk/src/java/org/apache/nutch/plugin/PluginManifestParser.java

2006-03-13 Thread ab
Author: ab Date: Mon Mar 13 04:16:54 2006 New Revision: 385531 URL: http://svn.apache.org/viewcvs?rev=385531&view=rev Log: Print out the full path of plugins directory. Allow using plugins located outside classpath, e.g. in shared repos. Submitted by Stefan Groschupf, NUTCH-229. Modified

svn commit: r386875 - /lucene/nutch/trunk/src/java/org/apache/nutch/parse/ParseOutputFormat.java

2006-03-18 Thread ab
Author: ab Date: Sat Mar 18 11:21:11 2006 New Revision: 386875 URL: http://svn.apache.org/viewcvs?rev=386875&view=rev Log: Apply patch in NUTCH-230, which provides additional control over which outlinks are considered for OPIC cash value distribution. Modified: lucene/nutch/trunk/src/java/org

svn commit: r387341 - in /lucene/nutch/trunk/src/java/org/apache/nutch/crawl: Inlinks.java LinkDb.java LinkDbReader.java

2006-03-20 Thread ab
Author: ab Date: Mon Mar 20 15:20:56 2006 New Revision: 387341 URL: http://svn.apache.org/viewcvs?rev=387341&view=rev Log: Don't allow Inlink duplicates (NUTCH-235). Modified: lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Inlinks.java lucene/nutch/trunk/src/java/org/apache/nutch

svn commit: r387578 - in /lucene/nutch/trunk/src: java/org/apache/nutch/clustering/ plugin/clustering-carrot2/src/java/org/apache/nutch/clustering/carrot2/ plugin/clustering-carrot2/src/test/org/apach

2006-03-21 Thread ab
Author: ab Date: Tue Mar 21 08:43:09 2006 New Revision: 387578 URL: http://svn.apache.org/viewcvs?rev=387578&view=rev Log: Cleanup and JUnit test for Carrot2. Contributed by Dawid Weiss (NUTCH-234). Added: lucene/nutch/trunk/src/plugin/clustering-carrot2/src/java/org/apache/nutch/clustering

svn commit: r390275 - /lucene/nutch/trunk/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMBuilder.java

2006-03-30 Thread ab
Author: ab Date: Thu Mar 30 15:07:48 2006 New Revision: 390275 URL: http://svn.apache.org/viewcvs?rev=390275&view=rev Log: Fix a bug where TagSoup would sometimes submit invalid index values. Modified: lucene/nutch/trunk/src/plugin/parse-html/src/java/org/apache/nutch/parse/html

svn commit: r391044 - /lucene/nutch/trunk/src/java/org/apache/nutch/crawl/CrawlDbReducer.java

2006-04-03 Thread ab
Author: ab Date: Mon Apr 3 06:35:34 2006 New Revision: 391044 URL: http://svn.apache.org/viewcvs?rev=391044&view=rev Log: Make sure we use new values for score, metadata, fetch interval and fetch time. Modified: lucene/nutch/trunk/src/java/org/apache/nutch/crawl/CrawlDbReducer.java Modified

svn commit: r391055 - /lucene/nutch/trunk/src/java/org/apache/nutch/crawl/CrawlDbReducer.java

2006-04-03 Thread ab
Author: ab Date: Mon Apr 3 07:36:19 2006 New Revision: 391055 URL: http://svn.apache.org/viewcvs?rev=391055&view=rev Log: Forgot to properly initialize the score. Modified: lucene/nutch/trunk/src/java/org/apache/nutch/crawl/CrawlDbReducer.java Modified: lucene/nutch/trunk/src/java/org

svn commit: r391271 - /lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Generator.java

2006-04-04 Thread ab
Author: ab Date: Tue Apr 4 04:03:09 2006 New Revision: 391271 URL: http://svn.apache.org/viewcvs?rev=391271&view=rev Log: Use a separate float value for sorting and selecting topN records. This decouples the selection process from values in CrawlDatum and its compareTo. See NUTCH-240 for more
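
The idea of selecting topN records by a float kept separate from the record itself can be sketched as follows. This is an illustration under assumptions, not the Generator code: TopNSketch, Entry and selectTopN are hypothetical names, and a bounded min-heap is just one plausible way to pick the highest-valued entries.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration; not the Generator implementation.
final class TopNSketch {

    // Pairs a URL with a sort value that lives outside the record,
    // so selection does not depend on the record's own compareTo.
    static final class Entry {
        final String url;
        final float sortValue;

        Entry(String url, float sortValue) {
            this.url = url;
            this.sortValue = sortValue;
        }
    }

    // Returns the URLs of the topN entries with the highest sort values,
    // highest first.
    static List<String> selectTopN(List<Entry> entries, int topN) {
        Comparator<Entry> byValue =
                Comparator.comparingDouble((Entry e) -> e.sortValue);
        // Min-heap holding at most topN entries: the lowest kept value
        // sits at the head and is evicted when a better entry arrives.
        PriorityQueue<Entry> heap = new PriorityQueue<>(byValue);
        for (Entry e : entries) {
            heap.offer(e);
            if (heap.size() > topN) {
                heap.poll();
            }
        }
        List<Entry> kept = new ArrayList<>(heap);
        kept.sort(byValue.reversed());
        List<String> urls = new ArrayList<>();
        for (Entry e : kept) {
            urls.add(e.url);
        }
        return urls;
    }
}
```

Because the heap evicts its smallest element, the survivors are the highest-valued entries; evicting the largest instead would select the lowest scores.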

svn commit: r391525 - /lucene/nutch/trunk/src/java/org/apache/nutch/parse/ParseStatus.java

2006-04-04 Thread ab
Author: ab Date: Tue Apr 4 22:39:28 2006 New Revision: 391525 URL: http://svn.apache.org/viewcvs?rev=391525&view=rev Log: Correct javadoc. Modified: lucene/nutch/trunk/src/java/org/apache/nutch/parse/ParseStatus.java Modified: lucene/nutch/trunk/src/java/org/apache/nutch/parse

svn commit: r391577 - /lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Generator.java

2006-04-05 Thread ab
Author: ab Date: Wed Apr 5 03:09:54 2006 New Revision: 391577 URL: http://svn.apache.org/viewcvs?rev=391577&view=rev Log: SelectorInverseMapper needs to implement Mapper, otherwise things break. Noticed by Shawn Gervais. Modified: lucene/nutch/trunk/src/java/org/apache/nutch/crawl

svn commit: r391676 - /lucene/nutch/trunk/src/java/org/apache/nutch/fetcher/Fetcher.java

2006-04-05 Thread ab
Author: ab Date: Wed Apr 5 10:01:02 2006 New Revision: 391676 URL: http://svn.apache.org/viewcvs?rev=391676&view=rev Log: Fix protocol-level redirect code. Patch by Dennis Kubes. Make it clear that this is a protocol-level redirect, as opposed to a content-level redirect. Modified: lucene

svn commit: r392056 - /lucene/nutch/trunk/src/java/org/apache/nutch/crawl/CrawlDbReducer.java

2006-04-06 Thread ab
Author: ab Date: Thu Apr 6 13:01:47 2006 New Revision: 392056 URL: http://svn.apache.org/viewcvs?rev=392056&view=rev Log: Pages with only STATUS_DB_GONE were unaccounted for, which caused an NPE. Modified: lucene/nutch/trunk/src/java/org/apache/nutch/crawl/CrawlDbReducer.java Modified

svn commit: r393330 - /lucene/nutch/trunk/src/java/org/apache/nutch/segment/SegmentReader.java

2006-04-11 Thread ab
Author: ab Date: Tue Apr 11 16:36:44 2006 New Revision: 393330 URL: http://svn.apache.org/viewcvs?rev=393330&view=rev Log: Improved SegmentReader: * fix breakage - couldn't write to already existing subdirectory. Now output directory is specified in arguments. * add functionality to retrieve

svn commit: r393921 - /lucene/nutch/trunk/src/java/org/apache/nutch/crawl/CrawlDbReducer.java

2006-04-13 Thread ab
Author: ab Date: Thu Apr 13 13:32:02 2006 New Revision: 393921 URL: http://svn.apache.org/viewcvs?rev=393921&view=rev Log: Fix an NPE. Reported by Marko Bauhardt. Modified: lucene/nutch/trunk/src/java/org/apache/nutch/crawl/CrawlDbReducer.java Modified: lucene/nutch/trunk/src/java/org/apache

svn commit: r396708 - /lucene/nutch/trunk/src/java/org/apache/nutch/fetcher/Fetcher.java

2006-04-24 Thread ab
Author: ab Date: Mon Apr 24 15:56:17 2006 New Revision: 396708 URL: http://svn.apache.org/viewcvs?rev=396708&view=rev Log: Fix an NPE, and simplify the logic (NUTCH-254). Modified: lucene/nutch/trunk/src/java/org/apache/nutch/fetcher/Fetcher.java Modified: lucene/nutch/trunk/src/java/org

svn commit: r396955 - in /lucene/nutch/trunk: conf/ src/plugin/ src/plugin/parse-oo/ src/plugin/parse-oo/sample/ src/plugin/parse-oo/src/ src/plugin/parse-oo/src/java/ src/plugin/parse-oo/src/java/org

2006-04-25 Thread ab
Author: ab Date: Tue Apr 25 12:12:48 2006 New Revision: 396955 URL: http://svn.apache.org/viewcvs?rev=396955&view=rev Log: Parser for OpenOffice and OpenDocument formats (an updated version of NUTCH-125). Development of this plugin was supported by Zaheed Haque. Thank you! Added: lucene

svn commit: r397169 - in /lucene/nutch/trunk/src/java/org/apache/nutch/crawl: CrawlDatum.java CrawlDbReducer.java

2006-04-26 Thread ab
Author: ab Date: Wed Apr 26 03:54:53 2006 New Revision: 397169 URL: http://svn.apache.org/viewcvs?rev=397169&view=rev Log: Don't allow CrawlDatum.getMetaData() to return null. Underlying MapWritable is lazily instantiated to minimize the number of created objects. Refactor CrawlDbReducer to use
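
The pattern described above, a getter that never returns null while the backing map is only created on demand, can be sketched like this. It is a minimal illustration, not the CrawlDatum code: LazyMetaSketch and hasMetaData are hypothetical names, and java.util.HashMap stands in for MapWritable.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration; HashMap stands in for MapWritable.
final class LazyMetaSketch {

    // Stays null until first requested, so records that never carry
    // metadata do not allocate a map at all.
    private Map<String, String> metaData;

    // Never returns null: the map is created on first access.
    Map<String, String> getMetaData() {
        if (metaData == null) {
            metaData = new HashMap<>();
        }
        return metaData;
    }

    // Lets callers test for metadata without forcing an allocation.
    boolean hasMetaData() {
        return metaData != null && !metaData.isEmpty();
    }
}
```

Note that this sketch is not thread-safe; it assumes each instance is used by one thread at a time.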

svn commit: r398462 - /lucene/nutch/trunk/src/java/org/apache/nutch/crawl/CrawlDbReducer.java

2006-04-30 Thread ab
Author: ab Date: Sun Apr 30 16:33:45 2006 New Revision: 398462 URL: http://svn.apache.org/viewcvs?rev=398462&view=rev Log: Temporary workaround for a situation where we may end up with a lone STATUS_SIGNATURE. The real reason for this error is unknown at this moment, please report if you encounter

svn commit: r399515 - /lucene/nutch/trunk/src/java/org/apache/nutch/segment/SegmentReader.java

2006-05-03 Thread ab
Author: ab Date: Wed May 3 19:42:02 2006 New Revision: 399515 URL: http://svn.apache.org/viewcvs?rev=399515&view=rev Log: Use the FileSystem instead of java.io.File.exists(). Modified: lucene/nutch/trunk/src/java/org/apache/nutch/segment/SegmentReader.java Modified: lucene/nutch/trunk/src

svn commit: r405179 - in /lucene/nutch/trunk/src: java/org/apache/nutch/crawl/MapWritable.java test/org/apache/nutch/crawl/TestMapWritable.java

2006-05-08 Thread ab
Author: ab Date: Mon May 8 14:48:21 2006 New Revision: 405179 URL: http://svn.apache.org/viewcvs?rev=405179&view=rev Log: Fix NUTCH-263. Modified: lucene/nutch/trunk/src/java/org/apache/nutch/crawl/MapWritable.java lucene/nutch/trunk/src/test/org/apache/nutch/crawl/TestMapWritable.java

svn commit: r405181 - in /lucene/nutch/trunk/src/java/org/apache/nutch/crawl: CrawlDbReader.java LinkDb.java

2006-05-08 Thread ab
Author: ab Date: Mon May 8 14:52:09 2006 New Revision: 405181 URL: http://svn.apache.org/viewcvs?rev=405181&view=rev Log: Refactor to make it easier to use these classes programmatically. Modified: lucene/nutch/trunk/src/java/org/apache/nutch/crawl/CrawlDbReader.java lucene/nutch/trunk

svn commit: r405946 - /lucene/nutch/trunk/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMBuilder.java

2006-05-12 Thread ab
Author: ab Date: Fri May 12 16:35:50 2006 New Revision: 405946 URL: http://svn.apache.org/viewcvs?rev=405946&view=rev Log: Fix yet another case where TagSoup supplies invalid parameters. Modified: lucene/nutch/trunk/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMBuilder.java

svn commit: r405967 - in /lucene/nutch/trunk: conf/ src/java/org/apache/nutch/crawl/ src/java/org/apache/nutch/fetcher/ src/java/org/apache/nutch/indexer/ src/java/org/apache/nutch/parse/ src/java/org

2006-05-12 Thread ab
Author: ab Date: Fri May 12 17:52:33 2006 New Revision: 405967 URL: http://svn.apache.org/viewcvs?rev=405967&view=rev Log: Scoring API (NUTCH-240). Development of this functionality was supported by Krugle.net. Thank you! Added: lucene/nutch/trunk/src/java/org/apache/nutch/scoring

svn commit: r406757 - in /lucene/nutch/trunk: conf/nutch-default.xml src/java/org/apache/nutch/crawl/Generator.java src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java

2006-05-15 Thread ab
Author: ab Date: Mon May 15 15:18:34 2006 New Revision: 406757 URL: http://svn.apache.org/viewcvs?rev=406757&view=rev Log: Fix NUTCH-268. Default settings are still different to avoid DOS-ing remote DNS servers during fetchlist generation. Modified: lucene/nutch/trunk/conf/nutch-default.xml

svn commit: r407567 - in /lucene/nutch/trunk: conf/ src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/ src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/ src/plugin/protoc

2006-05-18 Thread ab
Author: ab Date: Thu May 18 08:26:06 2006 New Revision: 407567 URL: http://svn.apache.org/viewvc?rev=407567&view=rev Log: Refactor HTTP plugins so that both support gzip encoding. Add appropriate headers in protocol-httpclient so that it prefers this encoding. Add an option to use HTTP 1.1

svn commit: r409275 - in /lucene/nutch/trunk: conf/ src/plugin/parse-html/src/java/org/apache/nutch/parse/html/ src/plugin/parse-html/src/test/org/apache/nutch/parse/html/

2006-05-24 Thread ab
Author: ab Date: Wed May 24 17:38:16 2006 New Revision: 409275 URL: http://svn.apache.org/viewvc?rev=409275&view=rev Log: Fix for incorrect behavior (collecting action URLs from forms). This is now optional, and turned off by default. Update JUnit test to cover this option. Modified: lucene

svn commit: r409276 - in /lucene/nutch/trunk/src/java/org/apache/nutch/crawl: CrawlDb.java LinkDb.java

2006-05-24 Thread ab
Author: ab Date: Wed May 24 17:42:40 2006 New Revision: 409276 URL: http://svn.apache.org/viewvc?rev=409276&view=rev Log: Add missing fs.mkdirs() - NUTCH-285. Submitted by Dennis Kubes. Modified: lucene/nutch/trunk/src/java/org/apache/nutch/crawl/CrawlDb.java lucene/nutch/trunk/src/java

svn commit: r410377 - in /lucene/nutch/trunk/src/java/org/apache/nutch: crawl/Generator.java segment/SegmentMerger.java

2006-05-30 Thread ab
Author: ab Date: Tue May 30 14:12:52 2006 New Revision: 410377 URL: http://svn.apache.org/viewvc?rev=410377&view=rev Log: SegmentMerger bug-fixes and improvements: * replace deprecated use of java.io.File with Hadoop's Path. * old segment name from Content.metadata needs to be replaced

svn commit: r417285 - in /lucene/nutch/trunk: conf/nutch-default.xml src/java/org/apache/nutch/searcher/LuceneQueryOptimizer.java

2006-06-26 Thread ab
Author: ab Date: Mon Jun 26 12:38:39 2006 New Revision: 417285 URL: http://svn.apache.org/viewvc?rev=417285&view=rev Log: Add an optional mechanism to time limit long-running queries. This helps to protect search servers from adverse effects of certain resource-intensive queries. Development

svn commit: r423291 - /lucene/nutch/trunk/conf/nutch-default.xml

2006-07-18 Thread ab
Author: ab Date: Tue Jul 18 16:27:49 2006 New Revision: 423291 URL: http://svn.apache.org/viewvc?rev=423291&view=rev Log: Add db.max.inlinks with its default value, and document it. Modified: lucene/nutch/trunk/conf/nutch-default.xml Modified: lucene/nutch/trunk/conf/nutch-default.xml URL

svn commit: r423539 - in /lucene/nutch/trunk: conf/nutch-default.xml src/java/org/apache/nutch/parse/ParseOutputFormat.java

2006-07-19 Thread ab
Author: ab Date: Wed Jul 19 10:35:08 2006 New Revision: 423539 URL: http://svn.apache.org/viewvc?rev=423539&view=rev Log: Add ability to limit outlinks to only include initial hosts (NUTCH-173). Modified: lucene/nutch/trunk/conf/nutch-default.xml lucene/nutch/trunk/src/java/org/apache

svn commit: r423630 - in /lucene/nutch/trunk: CHANGES.txt src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/

2006-07-19 Thread ab
Author: ab Date: Wed Jul 19 15:07:48 2006 New Revision: 423630 URL: http://svn.apache.org/viewvc?rev=423630&view=rev Log: Add support for Crawl-delay in robots.txt (NUTCH-293). Modified: lucene/nutch/trunk/CHANGES.txt lucene/nutch/trunk/src/plugin/lib-http/src/java/org/apache/nutch
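
For readers unfamiliar with the directive, the sketch below shows one simple way a Crawl-delay line could be read from robots.txt. It is not the parser from the lib-http plugin: CrawlDelaySketch and crawlDelayMillis are hypothetical names, the value is assumed to be in seconds (the common interpretation of this non-standard directive), and User-agent groups are deliberately ignored to keep the example short.

```java
// Hypothetical illustration; not the parser from the commit.
final class CrawlDelaySketch {

    // Returns the first valid Crawl-delay value in milliseconds,
    // or -1 if the file contains none.
    // Simplification: User-agent groups are ignored.
    static long crawlDelayMillis(String robotsTxt) {
        for (String line : robotsTxt.split("\\r?\\n")) {
            // Drop trailing comments.
            int hash = line.indexOf('#');
            if (hash >= 0) {
                line = line.substring(0, hash);
            }
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue;
            }
            String field = line.substring(0, colon).trim();
            if (!field.equalsIgnoreCase("Crawl-delay")) {
                continue;
            }
            try {
                double seconds =
                        Double.parseDouble(line.substring(colon + 1).trim());
                // Accept only finite, non-negative delays.
                if (seconds >= 0 && !Double.isInfinite(seconds)) {
                    return Math.round(seconds * 1000.0);
                }
            } catch (NumberFormatException e) {
                // Malformed value: keep scanning the remaining lines.
            }
        }
        return -1L;
    }
}
```

A fetcher would typically use the returned value in place of its default per-host delay when the result is not -1.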

svn commit: r423643 - in /lucene/nutch/trunk/src: java/org/apache/nutch/crawl/ java/org/apache/nutch/scoring/ plugin/scoring-opic/src/java/org/apache/nutch/scoring/opic/

2006-07-19 Thread ab
Author: ab Date: Wed Jul 19 15:39:53 2006 New Revision: 423643 URL: http://svn.apache.org/viewvc?rev=423643&view=rev Log: Fix a deficiency in the scoring API (NUTCH-321). Modified: lucene/nutch/trunk/src/java/org/apache/nutch/crawl/CrawlDbReducer.java lucene/nutch/trunk/src/java/org

svn commit: r423665 - /lucene/nutch/trunk/src/java/org/apache/nutch/fetcher/Fetcher.java

2006-07-19 Thread ab
Author: ab Date: Wed Jul 19 16:54:51 2006 New Revision: 423665 URL: http://svn.apache.org/viewvc?rev=423665&view=rev Log: If a transient exception is thrown, don't mark the page as gone but retry. Modified: lucene/nutch/trunk/src/java/org/apache/nutch/fetcher/Fetcher.java Modified: lucene

svn commit: r424965 - /lucene/nutch/trunk/src/java/org/apache/nutch/crawl/CrawlDbReader.java

2006-07-24 Thread ab
Author: ab Date: Mon Jul 24 01:40:19 2006 New Revision: 424965 URL: http://svn.apache.org/viewvc?rev=424965&view=rev Log: Set job names (NUTCH-329). Modified: lucene/nutch/trunk/src/java/org/apache/nutch/crawl/CrawlDbReader.java Modified: lucene/nutch/trunk/src/java/org/apache/nutch/crawl

svn commit: r425071 - /lucene/nutch/trunk/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java

2006-07-24 Thread ab
Author: ab Date: Mon Jul 24 07:41:18 2006 New Revision: 425071 URL: http://svn.apache.org/viewvc?rev=425071&view=rev Log: Expire all finished addresses. When sites request long crawl delays this quickly ties down all threads, and lock expiration happens rarely and proceeds too slowly to remove all

svn commit: r425354 - /lucene/nutch/trunk/bin/nutch

2006-07-25 Thread ab
Author: ab Date: Tue Jul 25 02:54:58 2006 New Revision: 425354 URL: http://svn.apache.org/viewvc?rev=425354&view=rev Log: Change the name of SegmentReader alias to 'readseg' for consistency with other reading-related commands. Keep the old 'segread' for compatibility, and give a deprecation

svn commit: r432254 - /lucene/nutch/trunk/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java

2006-08-17 Thread ab
Author: ab Date: Thu Aug 17 07:53:54 2006 New Revision: 432254 URL: http://svn.apache.org/viewvc?rev=432254&view=rev Log: Move toLowerCase where it actually matters. Fix some whitespace. Modified: lucene/nutch/trunk/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api

svn commit: r432256 - /lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Generator.java

2006-08-17 Thread ab
Author: ab Date: Thu Aug 17 07:56:35 2006 New Revision: 432256 URL: http://svn.apache.org/viewvc?rev=432256&view=rev Log: Apply patch in NUTCH-348 - Generator used the lowest score instead of the highest. Contributed by Chris Schneider and Stefan Groschupf. Modified: lucene/nutch/trunk/src

svn commit: r432287 - /lucene/nutch/branches/branch-0.8/src/java/org/apache/nutch/crawl/Generator.java

2006-08-17 Thread ab
Author: ab Date: Thu Aug 17 09:35:35 2006 New Revision: 432287 URL: http://svn.apache.org/viewvc?rev=432287&view=rev Log: Apply patch in NUTCH-348 - Generator used the lowest score instead of the highest. Contributed by Chris Schneider and Stefan Groschupf. Modified: lucene/nutch/branches

svn commit: r432290 - /lucene/nutch/branches/branch-0.8/CHANGES.txt

2006-08-17 Thread ab
Author: ab Date: Thu Aug 17 09:38:21 2006 New Revision: 432290 URL: http://svn.apache.org/viewvc?rev=432290&view=rev Log: Update CHANGES. Modified: lucene/nutch/branches/branch-0.8/CHANGES.txt Modified: lucene/nutch/branches/branch-0.8/CHANGES.txt URL: http://svn.apache.org/viewvc/lucene

svn commit: r432293 - /lucene/nutch/trunk/CHANGES.txt

2006-08-17 Thread ab
Author: ab Date: Thu Aug 17 09:41:12 2006 New Revision: 432293 URL: http://svn.apache.org/viewvc?rev=432293&view=rev Log: Update CHANGES. Modified: lucene/nutch/trunk/CHANGES.txt Modified: lucene/nutch/trunk/CHANGES.txt URL: http://svn.apache.org/viewvc/lucene/nutch/trunk/CHANGES.txt?rev

svn commit: r432674 - /lucene/nutch/trunk/src/java/org/apache/nutch/indexer/IndexMerger.java

2006-08-18 Thread ab
Author: ab Date: Fri Aug 18 11:48:29 2006 New Revision: 432674 URL: http://svn.apache.org/viewvc?rev=432674&view=rev Log: NUTCH-341 - if -workingdir is specified, always create a unique subdir. Also, use unique directory names to allow multiple IndexMergers to run simultaneously. Modified

svn commit: r432675 - /lucene/nutch/branches/branch-0.8/src/java/org/apache/nutch/indexer/IndexMerger.java

2006-08-18 Thread ab
Author: ab Date: Fri Aug 18 11:50:00 2006 New Revision: 432675 URL: http://svn.apache.org/viewvc?rev=432675&view=rev Log: NUTCH-341 - if -workingdir is specified, always create a unique subdir. Also, use unique directory names to allow multiple IndexMergers to run simultaneously. Modified

svn commit: r432896 - /lucene/nutch/trunk/src/java/org/apache/nutch/crawl/MapWritable.java

2006-08-19 Thread ab
Author: ab Date: Sat Aug 19 16:21:16 2006 New Revision: 432896 URL: http://svn.apache.org/viewvc?rev=432896&view=rev Log: NUTCH-354 - properly reset a field when reusing an instance. Modified: lucene/nutch/trunk/src/java/org/apache/nutch/crawl/MapWritable.java Modified: lucene/nutch/trunk

svn commit: r447359 - /lucene/nutch/branches/branch-0.8/src/java/org/apache/nutch/searcher/LuceneQueryOptimizer.java

2006-09-18 Thread ab
Author: ab Date: Mon Sep 18 03:43:07 2006 New Revision: 447359 URL: http://svn.apache.org/viewvc?view=rev&rev=447359 Log: Fix an NPE when using searcher.max.hits, but NOT using time limit. Modified: lucene/nutch/branches/branch-0.8/src/java/org/apache/nutch/searcher/LuceneQueryOptimizer.java

svn commit: r447363 - /lucene/nutch/trunk/src/java/org/apache/nutch/searcher/LuceneQueryOptimizer.java

2006-09-18 Thread ab
Author: ab Date: Mon Sep 18 03:44:01 2006 New Revision: 447363 URL: http://svn.apache.org/viewvc?view=rev&rev=447363 Log: Fix an NPE when using searcher.max.hits, but NOT using time limit. Modified: lucene/nutch/trunk/src/java/org/apache/nutch/searcher/LuceneQueryOptimizer.java Modified

svn commit: r449088 [2/2] - in /lucene/nutch/trunk: conf/ src/java/org/apache/nutch/crawl/ src/java/org/apache/nutch/fetcher/ src/java/org/apache/nutch/net/ src/java/org/apache/nutch/parse/ src/plugin

2006-09-22 Thread ab
Added: lucene/nutch/trunk/src/plugin/urlnormalizer-regex/src/test/org/apache/nutch/net/urlnormalizer/regex/TestRegexURLNormalizer.java URL:

svn commit: r449100 - in /lucene/nutch/branches/branch-0.8: CHANGES.txt src/java/org/apache/nutch/parse/ParseOutputFormat.java

2006-09-22 Thread ab
Author: ab Date: Fri Sep 22 14:43:01 2006 New Revision: 449100 URL: http://svn.apache.org/viewvc?view=rev&rev=449100 Log: NUTCH-332: fix the problem of doubling scores caused by links pointing to the current page (e.g. anchors). Modified: lucene/nutch/branches/branch-0.8/CHANGES.txt

svn commit: r449274 - in /lucene/nutch/trunk/src: java/org/apache/nutch/crawl/ java/org/apache/nutch/scoring/ plugin/scoring-opic/src/java/org/apache/nutch/scoring/opic/

2006-09-23 Thread ab
Author: ab Date: Sat Sep 23 10:11:58 2006 New Revision: 449274 URL: http://svn.apache.org/viewvc?view=rev&rev=449274 Log: NUTCH-336: differentiate between newly discovered pages (known value through inlink contributions) and newly injected pages (arbitrarily defined initial value). Modified

svn commit: r449278 - /lucene/nutch/trunk/src/java/org/apache/nutch/crawl/CrawlDbReducer.java

2006-09-23 Thread ab
Author: ab Date: Sat Sep 23 10:25:33 2006 New Revision: 449278 URL: http://svn.apache.org/viewvc?view=rev&rev=449278 Log: In case we fail to filter the score, bring it to 0.0f (and remove unused variable). Modified: lucene/nutch/trunk/src/java/org/apache/nutch/crawl/CrawlDbReducer.java

svn commit: r449279 - in /lucene/nutch/branches/branch-0.8: ./ src/java/org/apache/nutch/crawl/ src/java/org/apache/nutch/scoring/ src/plugin/scoring-opic/src/java/org/apache/nutch/scoring/opic/

2006-09-23 Thread ab
Author: ab Date: Sat Sep 23 10:30:45 2006 New Revision: 449279 URL: http://svn.apache.org/viewvc?view=rev&rev=449279 Log: NUTCH-336: differentiate between newly discovered pages (known value through inlink contributions) and newly injected pages (arbitrarily defined initial value). Modified

svn commit: r449287 - in /lucene/nutch/trunk/src/java/org/apache/nutch: crawl/Crawl.java fetcher/Fetcher.java

2006-09-23 Thread ab
Author: ab Date: Sat Sep 23 11:50:56 2006 New Revision: 449287 URL: http://svn.apache.org/viewvc?view=rev&rev=449287 Log: NUTCH-337: obey fetcher.parse property if -noParsing is not specified. Modified: lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Crawl.java lucene/nutch/trunk/src

svn commit: r449288 - /lucene/nutch/trunk/src/test/org/apache/nutch/fetcher/TestFetcher.java

2006-09-23 Thread ab
Author: ab Date: Sat Sep 23 11:55:25 2006 New Revision: 449288 URL: http://svn.apache.org/viewvc?view=rev&rev=449288 Log: Fix JUnit test after internal API change. Modified: lucene/nutch/trunk/src/test/org/apache/nutch/fetcher/TestFetcher.java Modified: lucene/nutch/trunk/src/test/org/apache

svn commit: r449294 - in /lucene/nutch/branches/branch-0.8: ./ src/java/org/apache/nutch/fetcher/ src/java/org/apache/nutch/protocol/ src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/

2006-09-23 Thread ab
Author: ab Date: Sat Sep 23 12:45:48 2006 New Revision: 449294 URL: http://svn.apache.org/viewvc?view=rev&rev=449294 Log: NUTCH-350: Urls blocked by http.max.delays incorrectly marked as GONE. Added: lucene/nutch/branches/branch-0.8/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http

svn commit: r449738 - /lucene/nutch/trunk/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java

2006-09-25 Thread ab
Author: ab Date: Mon Sep 25 09:58:49 2006 New Revision: 449738 URL: http://svn.apache.org/viewvc?view=rev&rev=449738 Log: Don't create dummy Content (throws NPE), just pass null. Reported by Richard Braman. Modified: lucene/nutch/trunk/src/plugin/lib-http/src/java/org/apache/nutch/protocol

svn commit: r449742 - /lucene/nutch/branches/branch-0.8/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java

2006-09-25 Thread ab
Author: ab Date: Mon Sep 25 10:05:22 2006 New Revision: 449742 URL: http://svn.apache.org/viewvc?view=rev&rev=449742 Log: Don't create dummy Content (throws NPE), just pass null. Reported by Richard Braman. Modified: lucene/nutch/branches/branch-0.8/src/plugin/lib-http/src/java/org/apache

svn commit: r449752 - /lucene/nutch/trunk/src/java/org/apache/nutch/parse/OutlinkExtractor.java

2006-09-25 Thread ab
Author: ab Date: Mon Sep 25 10:36:36 2006 New Revision: 449752 URL: http://svn.apache.org/viewvc?view=rev&rev=449752 Log: Catch exception on invalid urls, and continue collecting valid ones. Modified: lucene/nutch/trunk/src/java/org/apache/nutch/parse/OutlinkExtractor.java Modified: lucene

svn commit: r449765 - /lucene/nutch/branches/branch-0.8/src/java/org/apache/nutch/parse/OutlinkExtractor.java

2006-09-25 Thread ab
Author: ab Date: Mon Sep 25 11:14:31 2006 New Revision: 449765 URL: http://svn.apache.org/viewvc?view=rev&rev=449765 Log: Catch exception on invalid urls, and continue collecting valid ones. Modified: lucene/nutch/branches/branch-0.8/src/java/org/apache/nutch/parse/OutlinkExtractor.java

svn commit: r450799 - in /lucene/nutch/trunk: conf/nutch-default.xml src/java/org/apache/nutch/crawl/CrawlDb.java src/java/org/apache/nutch/crawl/CrawlDbReducer.java

2006-09-28 Thread ab
Author: ab Date: Thu Sep 28 03:48:25 2006 New Revision: 450799 URL: http://svn.apache.org/viewvc?view=rev&rev=450799 Log: Bring back the '-noAdditions' option. This is useful for running constrained crawls, where the complete list of URLs is known in advance. Modified: lucene/nutch/trunk/conf

svn commit: r454297 - /lucene/nutch/trunk/src/plugin/lib-parsems/src/java/org/apache/nutch/parse/ms/MSBaseParser.java

2006-10-09 Thread ab
Author: ab Date: Mon Oct 9 00:13:46 2006 New Revision: 454297 URL: http://svn.apache.org/viewvc?view=rev&rev=454297 Log: Fix NPE when document properties are null. Reported by Trym Asserson. Modified: lucene/nutch/trunk/src/plugin/lib-parsems/src/java/org/apache/nutch/parse/ms

svn commit: r454298 - /lucene/nutch/branches/branch-0.8/src/plugin/lib-parsems/src/java/org/apache/nutch/parse/ms/MSBaseParser.java

2006-10-09 Thread ab
Author: ab Date: Mon Oct 9 00:22:00 2006 New Revision: 454298 URL: http://svn.apache.org/viewvc?view=rev&rev=454298 Log: Fix NPE when document properties are null. Reported by Trym Asserson. Modified: lucene/nutch/branches/branch-0.8/src/plugin/lib-parsems/src/java/org/apache/nutch/parse/ms

svn commit: r468673 - /lucene/nutch/branches/branch-0.8/build.xml

2006-10-28 Thread ab
Author: ab Date: Sat Oct 28 03:32:44 2006 New Revision: 468673 URL: http://svn.apache.org/viewvc?view=rev&rev=468673 Log: Fix NUTCH-394. Modified: lucene/nutch/branches/branch-0.8/build.xml Modified: lucene/nutch/branches/branch-0.8/build.xml URL: http://svn.apache.org/viewvc/lucene/nutch

svn commit: r469662 - /lucene/nutch/trunk/CHANGES.txt

2006-10-31 Thread ab
Author: ab Date: Tue Oct 31 13:36:01 2006 New Revision: 469662 URL: http://svn.apache.org/viewvc?view=rev&rev=469662 Log: Update. Modified: lucene/nutch/trunk/CHANGES.txt Modified: lucene/nutch/trunk/CHANGES.txt URL: http://svn.apache.org/viewvc/lucene/nutch/trunk/CHANGES.txt?view=diff&rev

svn commit: r469667 - in /lucene/nutch/branches/branch-0.8: CHANGES.txt src/java/org/apache/nutch/crawl/Generator.java

2006-10-31 Thread ab
Author: ab Date: Tue Oct 31 13:46:26 2006 New Revision: 469667 URL: http://svn.apache.org/viewvc?view=rev&rev=469667 Log: NUTCH-361, NUTCH-136 - When jobtracker is 'local' generate only one partition. Modified: lucene/nutch/branches/branch-0.8/CHANGES.txt lucene/nutch/branches/branch-0.8

svn commit: r474756 - /lucene/nutch/trunk/src/java/org/apache/nutch/segment/SegmentReader.java

2006-11-14 Thread ab
Author: ab Date: Tue Nov 14 04:11:30 2006 New Revision: 474756 URL: http://svn.apache.org/viewvc?view=rev&rev=474756 Log: NUTCH-401: use hadoop.tmp.dir instead of hardcoded /tmp. Modified: lucene/nutch/trunk/src/java/org/apache/nutch/segment/SegmentReader.java Modified: lucene/nutch/trunk

svn commit: r474763 - /lucene/nutch/branches/branch-0.8/src/java/org/apache/nutch/segment/SegmentReader.java

2006-11-14 Thread ab
Author: ab Date: Tue Nov 14 04:24:48 2006 New Revision: 474763 URL: http://svn.apache.org/viewvc?view=rev&rev=474763 Log: NUTCH-401: use mapred.temp.dir instead of hardcoded /tmp. Modified: lucene/nutch/branches/branch-0.8/src/java/org/apache/nutch/segment/SegmentReader.java Modified
