svn commit: r547901 - in /lucene/nutch/trunk: CHANGES.txt src/java/org/apache/nutch/fetcher/Fetcher2.java

2007-06-17 Thread dogacan
Author: dogacan Date: Sat Jun 16 03:33:24 2007 New Revision: 547901 URL: http://svn.apache.org/viewvc?view=rev&rev=547901 Log: NUTCH-495 - Unnecessary delays in Fetcher2. Modified: lucene/nutch/trunk/CHANGES.txt lucene/nutch/trunk/src/java/org/apache/nutch/fetcher/Fetcher2.java Modified

svn commit: r547610 - in /lucene/nutch/trunk: site/credits.html site/credits.pdf src/site/src/documentation/content/xdocs/credits.xml

2007-06-17 Thread dogacan
Author: dogacan Date: Fri Jun 15 03:51:23 2007 New Revision: 547610 URL: http://svn.apache.org/viewvc?view=rev&rev=547610 Log: Added myself (Dogacan Güney) to the list of committers. Modified: lucene/nutch/trunk/site/credits.html lucene/nutch/trunk/site/credits.pdf lucene/nutch/trunk

svn commit: r548103 - in /lucene/nutch/trunk: ./ src/java/org/apache/nutch/parse/ src/plugin/creativecommons/src/java/org/creativecommons/nutch/ src/plugin/languageidentifier/src/java/org/apache/nutch

2007-06-18 Thread dogacan
Author: dogacan Date: Sun Jun 17 13:27:17 2007 New Revision: 548103 URL: http://svn.apache.org/viewvc?view=rev&rev=548103 Log: NUTCH-485 - Change HtmlParseFilter 's to return ParseResult object instead of Parse object. Modified: lucene/nutch/trunk/CHANGES.txt lucene/nutch/trunk/src/java

svn commit: r548429 - in /lucene/nutch/trunk: ./ conf/ src/plugin/urlfilter-suffix/src/java/org/apache/nutch/urlfilter/suffix/ src/plugin/urlfilter-suffix/src/test/org/apache/nutch/urlfilter/suffix/

2007-06-18 Thread dogacan
Author: dogacan Date: Mon Jun 18 11:13:15 2007 New Revision: 548429 URL: http://svn.apache.org/viewvc?view=rev&rev=548429 Log: NUTCH-489 - URLFilter-suffix management of the url path when the url contains some query parameters. Modified: lucene/nutch/trunk/CHANGES.txt lucene/nutch/trunk

svn commit: r548666 - in /lucene/nutch/trunk: CHANGES.txt src/java/org/apache/nutch/segment/SegmentReader.java

2007-06-19 Thread dogacan
Author: dogacan Date: Tue Jun 19 02:21:21 2007 New Revision: 548666 URL: http://svn.apache.org/viewvc?view=rev&rev=548666 Log: NUTCH-502 - Bug in SegmentReader causes infinite loop. Modified: lucene/nutch/trunk/CHANGES.txt lucene/nutch/trunk/src/java/org/apache/nutch/segment

svn commit: r549186 - in /lucene/nutch/trunk/src/plugin: feed/build.xml feed/lib/jdom.jar feed/plugin.xml lib-xml/lib/jdom.jar

2007-06-20 Thread dogacan
Author: dogacan Date: Wed Jun 20 11:43:47 2007 New Revision: 549186 URL: http://svn.apache.org/viewvc?view=rev&rev=549186 Log: Updated jdom.jar in lib-xml to a newer version. Removed jdom.jar from feed plugin since lib-xml already provides it. Removed: lucene/nutch/trunk/src/plugin/feed/lib

svn commit: r549507 - in /lucene/nutch/trunk: CHANGES.txt src/java/org/apache/nutch/searcher/NutchBean.java src/web/web.xml

2007-06-21 Thread dogacan
Author: dogacan Date: Thu Jun 21 08:15:32 2007 New Revision: 549507 URL: http://svn.apache.org/viewvc?view=rev&rev=549507 Log: NUTCH-471 - Fix synchronization in NutchBean creation. Modified: lucene/nutch/trunk/CHANGES.txt lucene/nutch/trunk/src/java/org/apache/nutch/searcher

svn commit: r550188 - in /lucene/nutch/trunk: ./ src/java/org/apache/nutch/parse/ src/java/org/apache/nutch/scoring/ src/plugin/scoring-opic/src/java/org/apache/nutch/scoring/opic/

2007-06-24 Thread dogacan
Author: dogacan Date: Sun Jun 24 02:28:41 2007 New Revision: 550188 URL: http://svn.apache.org/viewvc?view=rev&rev=550188 Log: NUTCH-468 - Scoring filter should distribute score to all outlinks at once. Modified: lucene/nutch/trunk/CHANGES.txt lucene/nutch/trunk/src/java/org/apache/nutch

svn commit: r550196 - in /lucene/nutch/trunk: ./ src/java/org/apache/nutch/fetcher/ src/java/org/apache/nutch/indexer/ src/test/ src/test/org/apache/nutch/fetcher/ src/testresources/fetch-test-site/

2007-06-24 Thread dogacan
Author: dogacan Date: Sun Jun 24 03:04:30 2007 New Revision: 550196 URL: http://svn.apache.org/viewvc?view=rev&rev=550196 Log: NUTCH-504 - Parsing during fetching is broken. Added: lucene/nutch/trunk/src/testresources/fetch-test-site/exception.html Modified: lucene/nutch/trunk/CHANGES.txt

svn commit: r551081 - in /lucene/nutch/trunk: ./ src/java/org/apache/nutch/crawl/ src/java/org/apache/nutch/fetcher/ src/java/org/apache/nutch/indexer/ src/java/org/apache/nutch/metadata/ src/java/org

2007-06-27 Thread dogacan
Author: dogacan Date: Wed Jun 27 00:05:52 2007 New Revision: 551081 URL: http://svn.apache.org/viewvc?view=rev&rev=551081 Log: NUTCH-474 - Replace usage of ObjectWritable with something based on GenericWritable. Added: lucene/nutch/trunk/src/java/org/apache/nutch/crawl/NutchWritable.java

svn commit: r551147 - in /lucene/nutch/trunk: CHANGES.txt src/java/org/apache/nutch/crawl/LinkDb.java

2007-06-27 Thread dogacan
Author: dogacan Date: Wed Jun 27 05:46:05 2007 New Revision: 551147 URL: http://svn.apache.org/viewvc?view=rev&rev=551147 Log: NUTCH-498 - Use Combiner in LinkDb to increase speed of linkdb generation. Contributed by Espen Amble Kolstad. Modified: lucene/nutch/trunk/CHANGES.txt lucene

svn commit: r554530 - in /lucene/nutch/trunk: CHANGES.txt src/plugin/lib-lucene-analyzers/plugin.xml

2007-07-09 Thread dogacan
Author: dogacan Date: Sun Jul 8 23:15:53 2007 New Revision: 554530 URL: http://svn.apache.org/viewvc?view=rev&rev=554530 Log: NUTCH-507 - lib-lucene-analyzers jar defintion is wrong in plugin.xml. Contributed by Emmanuel Joke. Modified: lucene/nutch/trunk/CHANGES.txt lucene/nutch/trunk

svn commit: r554539 - in /lucene/nutch/trunk: CHANGES.txt src/java/org/apache/nutch/crawl/Generator.java

2007-07-09 Thread dogacan
Author: dogacan Date: Sun Jul 8 23:44:18 2007 New Revision: 554539 URL: http://svn.apache.org/viewvc?view=rev&rev=554539 Log: NUTCH-503 - Generator exits incorrectly for small fetchlists. Modified: lucene/nutch/trunk/CHANGES.txt lucene/nutch/trunk/src/java/org/apache/nutch/crawl

svn commit: r555237 - in /lucene/nutch/trunk: ./ src/java/org/apache/nutch/fetcher/ src/java/org/apache/nutch/net/ src/java/org/apache/nutch/parse/ src/plugin/lib-parsems/src/java/org/apache/nutch/par

2007-07-11 Thread dogacan
Author: dogacan Date: Wed Jul 11 03:54:37 2007 New Revision: 555237 URL: http://svn.apache.org/viewvc?view=rev&rev=555237 Log: NUTCH-505 - Outlink urls should be validated. Added: lucene/nutch/trunk/src/java/org/apache/nutch/net/UrlValidator.java Modified: lucene/nutch/trunk/CHANGES.txt

svn commit: r555307 - in /lucene/nutch/trunk: CHANGES.txt src/java/org/apache/nutch/indexer/IndexMerger.java

2007-07-11 Thread dogacan
Author: dogacan Date: Wed Jul 11 08:30:29 2007 New Revision: 555307 URL: http://svn.apache.org/viewvc?view=rev&rev=555307 Log: NUTCH-510 - IndexMerger delete working dir. Contributed by Enis. Modified: lucene/nutch/trunk/CHANGES.txt lucene/nutch/trunk/src/java/org/apache/nutch/indexer

svn commit: r556072 - in /lucene/nutch/trunk: CHANGES.txt conf/suffix-urlfilter.txt conf/suffix-urlfilter.txt.template

2007-07-13 Thread dogacan
Author: dogacan Date: Fri Jul 13 10:20:44 2007 New Revision: 556072 URL: http://svn.apache.org/viewvc?view=rev&rev=556072 Log: NUTCH-513 - suffix-urlfilter.txt does not have a template. Added: lucene/nutch/trunk/conf/suffix-urlfilter.txt.template - copied unchanged from r556068, lucene

svn commit: r556824 - in /lucene/nutch/trunk: CHANGES.txt src/java/org/apache/nutch/crawl/CrawlDatum.java src/java/org/apache/nutch/crawl/Injector.java src/java/org/apache/nutch/parse/ParseOutputForma

2007-07-17 Thread dogacan
Author: dogacan Date: Mon Jul 16 23:19:06 2007 New Revision: 556824 URL: http://svn.apache.org/viewvc?view=rev&rev=556824 Log: NUTCH-515 - Next fetch time is set incorrectly. Modified: lucene/nutch/trunk/CHANGES.txt lucene/nutch/trunk/src/java/org/apache/nutch/crawl/CrawlDatum.java

svn commit: r556946 - in /lucene/nutch/trunk: ./ src/java/org/apache/nutch/fetcher/ src/java/org/apache/nutch/parse/ src/java/org/apache/nutch/protocol/ src/java/org/apache/nutch/segment/

2007-07-17 Thread dogacan
Author: dogacan Date: Tue Jul 17 08:16:40 2007 New Revision: 556946 URL: http://svn.apache.org/viewvc?view=rev&rev=556946 Log: NUTCH-506 - Delegate compression to Hadoop. Modified: lucene/nutch/trunk/CHANGES.txt lucene/nutch/trunk/src/java/org/apache/nutch/fetcher

svn commit: r557342 - in /lucene/nutch/trunk: CHANGES.txt default.properties

2007-07-18 Thread dogacan
Author: dogacan Date: Wed Jul 18 10:59:59 2007 New Revision: 557342 URL: http://svn.apache.org/viewvc?view=rev&rev=557342 Log: NUTCH-517 - build encoding should be UTF-8. Contributed by Enis. Modified: lucene/nutch/trunk/CHANGES.txt lucene/nutch/trunk/default.properties Modified: lucene

svn commit: r557344 - in /lucene/nutch/trunk: CHANGES.txt src/plugin/scoring-opic/src/java/org/apache/nutch/scoring/opic/OPICScoringFilter.java

2007-07-18 Thread dogacan
Author: dogacan Date: Wed Jul 18 11:04:26 2007 New Revision: 557344 URL: http://svn.apache.org/viewvc?view=rev&rev=557344 Log: NUTCH-518 - Fix OpicScoringFilter to respect scoring filter chaining. Contributed by Enis. Modified: lucene/nutch/trunk/CHANGES.txt lucene/nutch/trunk/src

svn commit: r559742 - in /lucene/nutch/trunk: CHANGES.txt src/java/org/apache/nutch/crawl/AbstractFetchSchedule.java src/java/org/apache/nutch/crawl/DefaultFetchSchedule.java

2007-07-26 Thread dogacan
Author: dogacan Date: Thu Jul 26 01:10:38 2007 New Revision: 559742 URL: http://svn.apache.org/viewvc?view=rev&rev=559742 Log: NUTCH-516 - Next fetch time is not set when it is a CrawlDatum.STATUS_FETCH_GONE. Contributed by Emmanuel Joke. Modified: lucene/nutch/trunk/CHANGES.txt lucene

svn commit: r559754 - in /lucene/nutch/trunk: CHANGES.txt src/java/org/apache/nutch/indexer/DeleteDuplicates.java src/test/org/apache/nutch/indexer/TestDeleteDuplicates.java

2007-07-26 Thread dogacan
Author: dogacan Date: Thu Jul 26 01:44:33 2007 New Revision: 559754 URL: http://svn.apache.org/viewvc?view=rev&rev=559754 Log: NUTCH-525 - DeleteDuplicates generates ArrayIndexOutOfBoundsException when trying to rerun dedup on a segment. Contributed by Vishal Shah. Modified: lucene/nutch

svn commit: r561092 - in /lucene/nutch/trunk: CHANGES.txt src/java/org/apache/nutch/indexer/Indexer.java

2007-07-30 Thread dogacan
Author: dogacan Date: Mon Jul 30 12:02:27 2007 New Revision: 561092 URL: http://svn.apache.org/viewvc?view=rev&rev=561092 Log: NUTCH-514 - Indexer should only index pages with fetch status SUCCESS. Modified: lucene/nutch/trunk/CHANGES.txt lucene/nutch/trunk/src/java/org/apache/nutch

svn commit: r561306 - in /lucene/nutch/trunk: CHANGES.txt src/java/org/apache/nutch/crawl/LinkDbFilter.java

2007-07-31 Thread dogacan
Author: dogacan Date: Tue Jul 31 05:07:30 2007 New Revision: 561306 URL: http://svn.apache.org/viewvc?view=rev&rev=561306 Log: NUTCH-533 - LinkDbMerger: url normalized is not updated in the key and inlinks list. Contributed by Emmanuel Joke. Modified: lucene/nutch/trunk/CHANGES.txt

svn commit: r561816 - /lucene/nutch/trunk/src/plugin/summary-lucene/plugin.xml

2007-08-01 Thread dogacan
Author: dogacan Date: Wed Aug 1 07:50:51 2007 New Revision: 561816 URL: http://svn.apache.org/viewvc?view=rev&rev=561816 Log: Plugin summary-lucene's plugin.xml contained a link to non-existant lucene-highlighter jar. Updated plugin.xml to point to new jar. Modified: lucene/nutch/trunk/src

svn commit: r563777 - in /lucene/nutch/trunk: CHANGES.txt src/java/org/apache/nutch/crawl/CrawlDatum.java src/java/org/apache/nutch/metadata/Metadata.java src/java/org/apache/nutch/parse/ParseData.jav

2007-08-08 Thread dogacan
Author: dogacan Date: Wed Aug 8 00:33:23 2007 New Revision: 563777 URL: http://svn.apache.org/viewvc?view=rev&rev=563777 Log: NUTCH-535 - ParseData's contentMeta accumulates unnecessary values during parse. Modified: lucene/nutch/trunk/CHANGES.txt lucene/nutch/trunk/src/java/org/apache

svn commit: r563807 - in /lucene/nutch/trunk: CHANGES.txt src/java/org/apache/nutch/crawl/Injector.java src/java/org/apache/nutch/net/UrlValidator.java src/test/org/apache/nutch/crawl/TestInjector.jav

2007-08-08 Thread dogacan
Author: dogacan Date: Wed Aug 8 03:57:11 2007 New Revision: 563807 URL: http://svn.apache.org/viewvc?view=rev&rev=563807 Log: NUTCH-522 - Use URLValidator in the Injector. Modified: lucene/nutch/trunk/CHANGES.txt lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Injector.java lucene

svn commit: r563894 [2/2] - in /lucene/nutch/trunk: ./ src/java/org/apache/nutch/analysis/ src/java/org/apache/nutch/crawl/ src/java/org/apache/nutch/fetcher/ src/java/org/apache/nutch/html/ src/java/

2007-08-08 Thread dogacan
Modified: lucene/nutch/trunk/src/java/org/apache/nutch/tools/DmozParser.java URL: http://svn.apache.org/viewvc/lucene/nutch/trunk/src/java/org/apache/nutch/tools/DmozParser.java?view=diff&rev=563894&r1=563893&r2=563894 ==

svn commit: r568053 [3/3] - in /lucene/nutch/trunk: ./ conf/ src/java/org/apache/nutch/util/ src/java/org/apache/nutch/util/domain/ src/plugin/ src/plugin/tld/ src/plugin/tld/src/ src/plugin/tld/src/j

2007-08-21 Thread dogacan
Added: lucene/nutch/trunk/conf/domain-suffixes.xsd URL: http://svn.apache.org/viewvc/lucene/nutch/trunk/conf/domain-suffixes.xsd?rev=568053&view=auto == --- lucene/nutch/trunk/conf/domain-suffixes.xsd (added) +++

svn commit: r568053 [1/3] - in /lucene/nutch/trunk: ./ conf/ src/java/org/apache/nutch/util/ src/java/org/apache/nutch/util/domain/ src/plugin/ src/plugin/tld/ src/plugin/tld/src/ src/plugin/tld/src/j

2007-08-21 Thread dogacan
Author: dogacan Date: Tue Aug 21 03:50:07 2007 New Revision: 568053 URL: http://svn.apache.org/viewvc?rev=568053&view=rev Log: NUTCH-439 - Top Level Domains Indexing / Scoring. Contributed by Enis. Added: lucene/nutch/trunk/conf/domain-suffixes.xml lucene/nutch/trunk/conf/domain

svn commit: r570334 - /lucene/nutch/trunk/src/web/jsp/search.jsp

2007-08-28 Thread dogacan
Author: dogacan Date: Mon Aug 27 23:44:16 2007 New Revision: 570334 URL: http://svn.apache.org/viewvc?rev=570334&view=rev Log: NUTCH-545 - Configuration and OnlineClusterer get initialized in every request. Part 2. I have committed an older version of search.jsp by mistake in last commit

svn commit: r574344 - in /lucene/nutch/trunk: CHANGES.txt src/java/org/apache/nutch/parse/ParseOutputFormat.java

2007-09-10 Thread dogacan
Author: dogacan Date: Mon Sep 10 12:40:20 2007 New Revision: 574344 URL: http://svn.apache.org/viewvc?rev=574344&view=rev Log: NUTCH-550 - Parse fails if db.max.outlinks.per.page is -1. Modified: lucene/nutch/trunk/CHANGES.txt lucene/nutch/trunk/src/java/org/apache/nutch/parse

svn commit: r574346 - in /lucene/nutch/trunk: ./ src/java/org/apache/nutch/crawl/ src/java/org/apache/nutch/net/ src/java/org/apache/nutch/parse/ src/plugin/ src/plugin/urlfilter-validator/ src/plugin

2007-09-10 Thread dogacan
Author: dogacan Date: Mon Sep 10 12:45:22 2007 New Revision: 574346 URL: http://svn.apache.org/viewvc?rev=574346&view=rev Log: NUTCH-546 - file URL are filtered out by the crawler. Added: lucene/nutch/trunk/src/plugin/urlfilter-validator/ lucene/nutch/trunk/src/plugin/urlfilter-validator

svn commit: r574545 - /lucene/nutch/trunk/src/plugin/urlfilter-validator/src/java/org/apache/nutch/urlfilter/validator/UrlValidator.java

2007-09-11 Thread dogacan
Author: dogacan Date: Tue Sep 11 03:50:15 2007 New Revision: 574545 URL: http://svn.apache.org/viewvc?rev=574545&view=rev Log: Java 5 Compatibility fix for NUTCH-546. Modified: lucene/nutch/trunk/src/plugin/urlfilter-validator/src/java/org/apache/nutch/urlfilter/validator/UrlValidator.java

svn commit: r578703 - in /lucene/nutch/trunk: CHANGES.txt src/java/org/apache/nutch/util/NodeWalker.java src/test/org/apache/nutch/util/TestNodeWalker.java

2007-09-24 Thread dogacan
Author: dogacan Date: Mon Sep 24 01:27:34 2007 New Revision: 578703 URL: http://svn.apache.org/viewvc?rev=578703&view=rev Log: NUTCH-529 - NodeWalker.skipChildren doesn't work for more than 1 child. Contributed by Emmanuel Joke. Added: lucene/nutch/trunk/src/test/org/apache/nutch/util

svn commit: r579656 - in /lucene/nutch/trunk: ./ conf/ lib/ src/java/org/apache/nutch/util/ src/plugin/feed/src/java/org/apache/nutch/parse/feed/ src/plugin/ontology/ src/plugin/ontology/lib/ src/plug

2007-09-26 Thread dogacan
Author: dogacan Date: Wed Sep 26 07:02:48 2007 New Revision: 579656 URL: http://svn.apache.org/viewvc?rev=579656&view=rev Log: NUTCH-25 - needs 'character encoding' detector. Mostly contributed by Doug Cook. Some parts are contributed by Marcin Okraszewski and Renaud Richardet. Also fixes NUTCH

svn commit: r579922 - /lucene/nutch/trunk/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java

2007-09-27 Thread dogacan
Author: dogacan Date: Wed Sep 26 23:49:26 2007 New Revision: 579922 URL: http://svn.apache.org/viewvc?rev=579922&view=rev Log: Java 5 compatibility fix for NUTCH-25. Contributed by Ned Rockson. Modified: lucene/nutch/trunk/src/plugin/parse-html/src/java/org/apache/nutch/parse/html

svn commit: r580572 - /lucene/nutch/trunk/src/test/org/apache/nutch/util/TestEncodingDetector.java

2007-09-29 Thread dogacan
Author: dogacan Date: Sat Sep 29 04:02:01 2007 New Revision: 580572 URL: http://svn.apache.org/viewvc?rev=580572&view=rev Log: Yet another java5 compatibility fix for NUTCH-25. Updates unit test. Modified: lucene/nutch/trunk/src/test/org/apache/nutch/util/TestEncodingDetector.java Modified

svn commit: r582775 - in /lucene/nutch/trunk: CHANGES.txt conf/log4j.properties

2007-10-08 Thread dogacan
Author: dogacan Date: Mon Oct 8 03:58:11 2007 New Revision: 582775 URL: http://svn.apache.org/viewvc?rev=582775&view=rev Log: NUTCH-508 - ${hadoop.log.dir} and ${hadoop.log.file} are not propagated to the tasktracker. Contributed by Mathijs Homminga and Emmanuel Joke. Modified: lucene/nutch

svn commit: r589654 - in /lucene/nutch/trunk: ./ src/java/org/apache/nutch/analysis/ src/java/org/apache/nutch/crawl/ src/java/org/apache/nutch/indexer/ src/java/org/apache/nutch/net/ src/java/org/apa

2007-10-29 Thread dogacan
Author: dogacan Date: Mon Oct 29 07:57:19 2007 New Revision: 589654 URL: http://svn.apache.org/viewvc?rev=589654&view=rev Log: NUTCH-501 - Implement a different caching mechanism for objects cached in configuration. Added: lucene/nutch/trunk/src/java/org/apache/nutch/util/ObjectCache.java

svn commit: r593151 - in /lucene/nutch/trunk: ./ src/java/org/apache/nutch/fetcher/ src/java/org/apache/nutch/indexer/ src/java/org/apache/nutch/metadata/ src/java/org/apache/nutch/parse/ src/java/org

2007-11-08 Thread dogacan
Author: dogacan Date: Thu Nov 8 05:18:05 2007 New Revision: 593151 URL: http://svn.apache.org/viewvc?rev=593151&view=rev Log: NUTCH-547 - Redirection handling: YahooSlurp's algorithm. Modified: lucene/nutch/trunk/CHANGES.txt lucene/nutch/trunk/src/java/org/apache/nutch/fetcher

svn commit: r593200 - in /lucene/nutch/trunk/src/plugin: parse-oo/src/java/org/apache/nutch/parse/oo/ parse-rss/src/java/org/apache/nutch/parse/rss/ parse-swf/src/java/org/apache/nutch/parse/swf/ pars

2007-11-08 Thread dogacan
Author: dogacan Date: Thu Nov 8 07:32:11 2007 New Revision: 593200 URL: http://svn.apache.org/viewvc?rev=593200&view=rev Log: NUTCH-548 - Last commit failed to upgrade some of the plugins. This commit removes all instances of Outlink(..,..,Configuration) calls. Modified: lucene/nutch/trunk

svn commit: r593186 - in /lucene/nutch/trunk: ./ src/java/org/apache/nutch/parse/ src/plugin/parse-html/src/java/org/apache/nutch/parse/html/ src/plugin/parse-html/src/test/org/apache/nutch/parse/html

2007-11-08 Thread dogacan
Author: dogacan Date: Thu Nov 8 07:08:47 2007 New Revision: 593186 URL: http://svn.apache.org/viewvc?rev=593186&view=rev Log: NUTCH-548 - Move URLNormalizer from Outlink to ParseOutputFormat. Contributed by Emmanuel Joke. Modified: lucene/nutch/trunk/CHANGES.txt lucene/nutch/trunk/src

svn commit: r593263 - in /lucene/nutch/trunk: CHANGES.txt src/java/org/apache/nutch/crawl/CrawlDbReader.java src/java/org/apache/nutch/indexer/DeleteDuplicates.java

2007-11-08 Thread dogacan
Author: dogacan Date: Thu Nov 8 11:13:37 2007 New Revision: 593263 URL: http://svn.apache.org/viewvc?rev=593263&view=rev Log: NUTCH-494 - FindBugs: CrawlDbReader and DeleteDuplicates. Modified: lucene/nutch/trunk/CHANGES.txt lucene/nutch/trunk/src/java/org/apache/nutch/crawl

svn commit: r593261 - in /lucene/nutch/trunk: CHANGES.txt src/java/org/apache/nutch/util/FibonacciHeap.java src/java/org/apache/nutch/util/ThreadPool.java

2007-11-08 Thread dogacan
Author: dogacan Date: Thu Nov 8 11:09:06 2007 New Revision: 593261 URL: http://svn.apache.org/viewvc?rev=593261&view=rev Log: NUTCH-538 - Delete unused classes under o.a.n.util. Removed: lucene/nutch/trunk/src/java/org/apache/nutch/util/FibonacciHeap.java lucene/nutch/trunk/src/java/org

svn commit: r608972 - in /lucene/nutch/trunk: ./ conf/ src/plugin/ src/plugin/protocol-httpclient/ src/plugin/protocol-httpclient/jsp/ src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol

2008-01-04 Thread dogacan
Author: dogacan Date: Fri Jan 4 11:48:32 2008 New Revision: 608972 URL: http://svn.apache.org/viewvc?rev=608972&view=rev Log: NUTCH-559 - NTLM, Basic and Digest Authentication schemes for web/proxy. Contributed by Susam Pal. Added: lucene/nutch/trunk/conf/httpclient-auth.xml.template

svn commit: r630779 - in /lucene/nutch/trunk: CHANGES.txt src/plugin/parse-html/lib/tagsoup-1.0rc3.jar src/plugin/parse-html/lib/tagsoup-1.2.jar src/plugin/parse-html/lib/tagsoup.LICENSE.txt src/plugi

2008-02-25 Thread dogacan
Author: dogacan Date: Mon Feb 25 01:38:12 2008 New Revision: 630779 URL: http://svn.apache.org/viewvc?rev=630779&view=rev Log: NUTCH-567 - Proper (?) handling of URIs in TagSoup. Added: lucene/nutch/trunk/src/plugin/parse-html/lib/tagsoup-1.2.jar (with props) Removed: lucene/nutch/trunk

svn commit: r697395 - in /lucene/nutch/trunk: CHANGES.txt src/java/org/apache/nutch/indexer/Indexer.java

2008-09-20 Thread dogacan
Author: dogacan Date: Sat Sep 20 10:05:03 2008 New Revision: 697395 URL: http://svn.apache.org/viewvc?rev=697395&view=rev Log: NUTCH-639 - Change LuceneDocumentWrapper visibility from private to protected Modified: lucene/nutch/trunk/CHANGES.txt lucene/nutch/trunk/src/java/org/apache

svn commit: r697781 - in /lucene/nutch/trunk: CHANGES.txt bin/start-balancer.sh bin/stop-balancer.sh

2008-09-22 Thread dogacan
Author: dogacan Date: Mon Sep 22 04:08:09 2008 New Revision: 697781 URL: http://svn.apache.org/viewvc?rev=697781&view=rev Log: NUTCH-651 - Remove bin/{start|stop}-balancer.sh from svn tracking Removed: lucene/nutch/trunk/bin/start-balancer.sh lucene/nutch/trunk/bin/stop-balancer.sh

svn commit: r697896 - in /lucene/nutch/trunk: CHANGES.txt src/java/org/apache/nutch/fetcher/Fetcher.java src/java/org/apache/nutch/fetcher/Fetcher2.java

2008-09-22 Thread dogacan
Author: dogacan Date: Mon Sep 22 09:43:33 2008 New Revision: 697896 URL: http://svn.apache.org/viewvc?rev=697896&view=rev Log: NUTCH-633 - ParseSegment no longer allow reparsing. Modified: lucene/nutch/trunk/CHANGES.txt lucene/nutch/trunk/src/java/org/apache/nutch/fetcher/Fetcher.java

svn commit: r698469 - /lucene/nutch/trunk/bin/

2008-09-24 Thread dogacan
Author: dogacan Date: Wed Sep 24 01:50:15 2008 New Revision: 698469 URL: http://svn.apache.org/viewvc?rev=698469&view=rev Log: NUTCH-651 second part. Also add bin/{start|stop}-balancer.sh to svn ignore. Modified: lucene/nutch/trunk/bin/ (props changed) Propchange: lucene/nutch/trunk/bin

svn commit: r698471 - in /lucene/nutch/trunk: ./ lib/ lib/native/Linux-amd64-64/ lib/native/Linux-i386-32/ src/java/org/apache/nutch/segment/

2008-09-24 Thread dogacan
Author: dogacan Date: Wed Sep 24 01:52:19 2008 New Revision: 698471 URL: http://svn.apache.org/viewvc?rev=698471&view=rev Log: NUTCH-653 - Upgrade to hadoop 0.18 Added: lucene/nutch/trunk/lib/hadoop-0.18.1-core.jar (with props) lucene/nutch/trunk/lib/jets3t-0.6.0.jar (with props

svn commit: r701045 - in /lucene/nutch/trunk: CHANGES.txt src/plugin/urlfilter-regex/src/java/org/apache/nutch/urlfilter/regex/RegexURLFilter.java

2008-10-02 Thread dogacan
Author: dogacan Date: Thu Oct 2 02:05:22 2008 New Revision: 701045 URL: http://svn.apache.org/viewvc?rev=701045&view=rev Log: NUTCH-654 - urlfilter-regex's main does not work Modified: lucene/nutch/trunk/CHANGES.txt lucene/nutch/trunk/src/plugin/urlfilter-regex/src/java/org/apache/nutch

svn commit: r733712 - /lucene/nutch/trunk/src/test/domain-urlfilter.txt

2009-01-12 Thread dogacan
Author: dogacan Date: Mon Jan 12 04:50:02 2009 New Revision: 733712 URL: http://svn.apache.org/viewvc?rev=733712&view=rev Log: Added a test domain-urlfilter conf file so that it doesn't filter everything Added: lucene/nutch/trunk/src/test/domain-urlfilter.txt Added: lucene/nutch/trunk/src

svn commit: r733744 - /lucene/nutch/trunk/src/plugin/build.xml

2009-01-12 Thread dogacan
Author: dogacan Date: Mon Jan 12 05:30:28 2009 New Revision: 733744 URL: http://svn.apache.org/viewvc?rev=733744&view=rev Log: Unrelated change went in accidentally in NUTCH-442. Reverting to old version. Modified: lucene/nutch/trunk/src/plugin/build.xml Modified: lucene/nutch/trunk/src

svn commit: r733747 - in /lucene/nutch/trunk: CHANGES.txt src/java/org/apache/nutch/crawl/AdaptiveFetchSchedule.java

2009-01-12 Thread dogacan
Author: dogacan Date: Mon Jan 12 05:37:23 2009 New Revision: 733747 URL: http://svn.apache.org/viewvc?rev=733747&view=rev Log: NUTCH-652 - AdaptiveFetchSchedule#setFetchSchedule doesn't calculate fetch interval correctly Modified: lucene/nutch/trunk/CHANGES.txt lucene/nutch/trunk/src

svn commit: r733848 - in /lucene/nutch/trunk/src: java/org/apache/nutch/indexer/solr/SolrIndexer.java test/org/apache/nutch/searcher/TestDistributedSearch.java

2009-01-12 Thread dogacan
Author: dogacan Date: Mon Jan 12 09:33:16 2009 New Revision: 733848 URL: http://svn.apache.org/viewvc?rev=733848&view=rev Log: Two more NUTCH-442 changes: * Delete TestDistributedSearch for now * Set reduceSpeculativeExecution false for SolrIndexer Removed: lucene/nutch/trunk/src/test/org

svn commit: r735748 - in /lucene/nutch/trunk: CHANGES.txt lib/jets3t-0.6.0.jar lib/jets3t-0.6.1.jar

2009-01-19 Thread dogacan
Author: dogacan Date: Mon Jan 19 09:09:47 2009 New Revision: 735748 URL: http://svn.apache.org/viewvc?rev=735748&view=rev Log: NUTCH-678 - Hadoop 0.19 requires an update of jets3t (julien nioche) Added: lucene/nutch/trunk/lib/jets3t-0.6.1.jar (with props) Removed: lucene/nutch/trunk/lib

svn commit: r736307 - in /lucene/nutch/trunk: ./ src/plugin/parse-mp3/src/java/org/apache/nutch/parse/mp3/ src/plugin/parse-mp3/src/test/org/apache/nutch/parse/mp3/

2009-01-21 Thread dogacan
Author: dogacan Date: Wed Jan 21 05:09:48 2009 New Revision: 736307 URL: http://svn.apache.org/viewvc?rev=736307&view=rev Log: NUTCH-681 - parse-mp3 compilation problem. Patch by Wildan Maulana. Modified: lucene/nutch/trunk/CHANGES.txt lucene/nutch/trunk/src/plugin/parse-mp3/src/java/org

svn commit: r736385 - in /lucene/nutch/trunk: ./ src/java/org/apache/nutch/crawl/ src/java/org/apache/nutch/tools/compat/ src/test/org/apache/nutch/crawl/

2009-01-21 Thread dogacan
Author: dogacan Date: Wed Jan 21 11:26:27 2009 New Revision: 736385 URL: http://svn.apache.org/viewvc?rev=736385&view=rev Log: NUTCH-676 - MapWritable is written inefficiently and confusingly. Removed: lucene/nutch/trunk/src/test/org/apache/nutch/crawl/TestMapWritable.java Modified

svn commit: r736388 - in /lucene/nutch/trunk: CHANGES.txt src/java/org/apache/nutch/crawl/MD5Signature.java

2009-01-21 Thread dogacan
Author: dogacan Date: Wed Jan 21 11:41:55 2009 New Revision: 736388 URL: http://svn.apache.org/viewvc?rev=736388&view=rev Log: NUTCH-579 - Feed plugin only indexes one post per feed due to identical digest Modified: lucene/nutch/trunk/CHANGES.txt lucene/nutch/trunk/src/java/org/apache

svn commit: r737325 - in /lucene/nutch/trunk: lib/ src/plugin/lib-nekohtml/ src/plugin/lib-nekohtml/lib/ src/plugin/summary-lucene/ src/plugin/summary-lucene/lib/

2009-01-24 Thread dogacan
Author: dogacan Date: Sat Jan 24 10:28:37 2009 New Revision: 737325 URL: http://svn.apache.org/viewvc?rev=737325&view=rev Log: NUTCH-680 - Update external jars to latest versions Updates: nekohtml lucene-highlighter icu4j jakarta-oro Added: lucene/nutch/trunk/lib/icu4j-4_0_1.LICENSE.txt

svn commit: r738049 - /lucene/nutch/trunk/lib/pmd-ext/

2009-01-27 Thread dogacan
Author: dogacan Date: Tue Jan 27 10:21:58 2009 New Revision: 738049 URL: http://svn.apache.org/viewvc?rev=738049&view=rev Log: NUTCH-680 - Remove pmd-ext jars for now Removed: lucene/nutch/trunk/lib/pmd-ext/

svn commit: r738455 - in /lucene/nutch/trunk: CHANGES.txt src/plugin/parse-mp3/src/java/org/apache/nutch/parse/mp3/MetadataCollector.java

2009-01-28 Thread dogacan
Author: dogacan Date: Wed Jan 28 11:33:20 2009 New Revision: 738455 URL: http://svn.apache.org/viewvc?rev=738455&view=rev Log: NUTCH-571 - parse-mp3 plugin doesn't always index album of mp3. Patch by Joseph Chen. Modified: lucene/nutch/trunk/CHANGES.txt lucene/nutch/trunk/src/plugin

svn commit: r743277 - in /lucene/nutch/trunk: CHANGES.txt src/java/org/apache/nutch/crawl/CrawlDbMerger.java

2009-02-11 Thread dogacan
Author: dogacan Date: Wed Feb 11 09:12:15 2009 New Revision: 743277 URL: http://svn.apache.org/viewvc?rev=743277&view=rev Log: NUTCH-683 - NUTCH-676 broke CrawlDbMerger Modified: lucene/nutch/trunk/CHANGES.txt lucene/nutch/trunk/src/java/org/apache/nutch/crawl/CrawlDbMerger.java Modified

svn commit: r751774 - in /lucene/nutch/trunk: CHANGES.txt bin/nutch src/java/org/apache/nutch/indexer/solr/SolrConstants.java src/java/org/apache/nutch/indexer/solr/SolrDeleteDuplicates.java

2009-03-09 Thread dogacan
Author: dogacan Date: Mon Mar 9 17:34:51 2009 New Revision: 751774 URL: http://svn.apache.org/viewvc?rev=751774&view=rev Log: NUTCH-684 - Dedup support for Solr Added: lucene/nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrDeleteDuplicates.java Modified: lucene/nutch/trunk

svn commit: r761271 - /lucene/nutch/trunk/src/java/org/apache/nutch/fetcher/OldFetcher.java

2009-04-02 Thread dogacan
Author: dogacan Date: Thu Apr 2 12:46:47 2009 New Revision: 761271 URL: http://svn.apache.org/viewvc?rev=761271&view=rev Log: NUTCH-721 - Commit old fetcher as OldFetcher for now so that we can test Fetcher2 performance. Added: lucene/nutch/trunk/src/java/org/apache/nutch/fetcher

svn commit: r782412 - in /lucene/nutch/trunk: CHANGES.txt src/java/org/apache/nutch/crawl/Crawl.java src/java/org/apache/nutch/util/NutchConfiguration.java

2009-06-07 Thread dogacan
Author: dogacan Date: Sun Jun 7 17:12:18 2009 New Revision: 782412 URL: http://svn.apache.org/viewvc?rev=782412&view=rev Log: NUTCH-735 - crawl-tool.xml must be read before nutch-site.xml when invoked using crawl command. Patch by Susam Pal. Modified: lucene/nutch/trunk/CHANGES.txt

svn commit: r789591 - /lucene/nutch/trunk/src/test/org/apache/nutch/util/TestNodeWalker.java

2009-06-30 Thread dogacan
Author: dogacan Date: Tue Jun 30 07:09:14 2009 New Revision: 789591 URL: http://svn.apache.org/viewvc?rev=789591&view=rev Log: Remove dtd URL from xml in TestNodeWalker to prevent build failures for now. Modified: lucene/nutch/trunk/src/test/org/apache/nutch/util/TestNodeWalker.java Modified

svn commit: r804782 - /lucene/nutch/branches/nutchbase/

2009-08-16 Thread dogacan
Author: dogacan Date: Sun Aug 16 21:30:22 2009 New Revision: 804782 URL: http://svn.apache.org/viewvc?rev=804782&view=rev Log: Creating initial nutchbase branch. Added: lucene/nutch/branches/nutchbase/ - copied from r804781, lucene/nutch/trunk/

svn commit: r807485 - in /lucene/nutch/trunk: CHANGES.txt conf/nutch-default.xml src/java/org/apache/nutch/fetcher/Fetcher.java

2009-08-24 Thread dogacan
Author: dogacan Date: Tue Aug 25 05:45:53 2009 New Revision: 807485 URL: http://svn.apache.org/viewvc?rev=807485&view=rev Log: Fetcher2 slow. Patch contributed by Julien Nioche. Modified: lucene/nutch/trunk/CHANGES.txt lucene/nutch/trunk/conf/nutch-default.xml lucene/nutch/trunk/src

svn commit: r812497 - in /lucene/nutch/trunk: CHANGES.txt src/java/org/apache/nutch/crawl/CrawlDatum.java

2009-09-08 Thread dogacan
Author: dogacan Date: Tue Sep 8 13:15:03 2009 New Revision: 812497 URL: http://svn.apache.org/viewvc?rev=812497&view=rev Log: NUTCH-702 - Lazy Instanciation of Metadata in CrawlDatum. Contributed by Julien Nioche. Modified: lucene/nutch/trunk/CHANGES.txt lucene/nutch/trunk/src/java/org