6698
-tnin 6691
-ag_
-sæ 6663
-ndr 6626
-sen 6593
-ente 6571
-dig 6562
-ga 6555
-kl 6546
-tu 6517
-bes 6500
-fe 6473
-lag 6461
-red 6435
-lin 6431
-ks 6412
-dre 6409
-ment 6405
-kal_ 6387
-skal 6383
-ved 6325
-ab 6321
-sam 6321
-æl 6310
-par 6284
-v_ 6283
-bet 6246
-est 6224
-ner_ 6218
-ve_ 6218
2734
-utan 2733
-aga 2731
-änk 2726
-org 2723
-öj 2723
-ab 2722
-ven_ 2716
-is_ 2711
-dli 2704
-rän 2704
-nkt 2703
-rfö 2698
-dag 2693
-ien 2692
-tti 2689
-bö 2676
-ske 2672
-amt 2669
-and_ 2669
-tvi 2662
-rag 2654
-ckli 2653
-ive 2647
-dd 2646
-rför 2646
-avs 2645
-dern 2645
-beh 2644
-nade
Author: ab
Date: Tue Jul 5 01:56:01 2005
New Revision: 209246
URL: http://svn.apache.org/viewcvs?rev=209246&view=rev
Log:
Activate this as a Parser plugin (it was accidentally omitted).
Also accept an empty content type, if the extension is right.
Modified:
lucene/nutch/trunk/src/plugin/parse-js
Author: ab
Date: Thu Jul 7 15:43:49 2005
New Revision: 209669
URL: http://svn.apache.org/viewcvs?rev=209669&view=rev
Log:
Forgot to add this one, sorry.
Added:
lucene/nutch/trunk/src/plugin/languageidentifier/src/java/org/apache/nutch/analysis/lang/LanguageIndexingFilter.java
(with props
Author: ab
Date: Thu Jul 14 13:54:16 2005
New Revision: 219097
URL: http://svn.apache.org/viewcvs?rev=219097&view=rev
Log:
Fix issues reported in NUTCH-46. Submitted by Piotr Kosiorowski.
Modified:
lucene/nutch/trunk/src/java/org/apache/nutch/fs/NDFSFileSystem.java
lucene/nutch/trunk/src
Author: ab
Date: Thu Jul 21 04:07:10 2005
New Revision: 220034
URL: http://svn.apache.org/viewcvs?rev=220034&view=rev
Log:
Fixes for NUTCH-66, and some cleanup:
* apply the more lenient CookiePolicy. This fixes the problem with poorly
formatted cookies being rejected.
* remove the while() loop
Author: ab
Date: Fri Aug 12 07:23:47 2005
New Revision: 232303
URL: http://svn.apache.org/viewcvs?rev=232303&view=rev
Log:
RSS Parse plugin. Contributed by Chris Mattmann (issue NUTCH-30).
Thank you!
Added:
lucene/nutch/trunk/src/plugin/parse-rss/
lucene/nutch/trunk/src/plugin/parse-rss
Author: ab
Date: Tue Sep 20 00:09:45 2005
New Revision: 290382
URL: http://svn.apache.org/viewcvs?rev=290382&view=rev
Log:
Updated to PDFBox-0.7.2. Starting with this release PDFBox comes in two
versions - one has no logging at all, the other uses log4j (which is
the version used here
Author: ab
Date: Tue Sep 20 00:16:42 2005
New Revision: 290383
URL: http://svn.apache.org/viewcvs?rev=290383&view=rev
Log:
Updated to PDFBox-0.7.2. Starting with this release PDFBox comes in two
versions - one has no logging at all, the other uses log4j (which is
the version used here
Author: ab
Date: Tue Dec 20 03:19:21 2005
New Revision: 357962
URL: http://svn.apache.org/viewcvs?rev=357962&view=rev
Log:
Remove remaining calls to static NutchConf.get() in favor of using
per-instance local configuration getConf().
Modified:
lucene/nutch/trunk/src/java/org/apache/nutch
Author: ab
Date: Thu Dec 22 17:16:31 2005
New Revision: 358674
URL: http://svn.apache.org/viewcvs?rev=358674&view=rev
Log:
Remove traces of the old API FetcherOutput.
The old IndexSegment is now marked broken. In the next step old utilities
should be removed.
Modified:
lucene/nutch/trunk/src
Author: ab
Date: Thu Dec 29 07:28:30 2005
New Revision: 359822
URL: http://svn.apache.org/viewcvs?rev=359822&view=rev
Log:
A framework for using different page signature implementations. Ordinary
MD5 hash of a raw page content is very often unsuitable, when many
near-duplicate pages are crawled
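To illustrate why a raw-content MD5 is a poor duplicate detector, here is a minimal sketch that compares a digest of the raw bytes with a digest of a normalized form of the text. The class and method names are hypothetical and the normalization (lower-casing, collapsing whitespace) is an assumption chosen for the example; it is not the signature implementation added in this commit.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Locale;

// Hypothetical illustration; not the Nutch signature API.
class SignatureSketch {

    // Hex-encoded MD5 of the given bytes.
    static String md5Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(data)) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Signature of the raw content: any byte-level difference changes it.
    static String rawSignature(String content) {
        return md5Hex(content.getBytes(StandardCharsets.UTF_8));
    }

    // Signature of a normalized form (assumed normalization: lower-case,
    // trimmed, runs of whitespace collapsed to a single space).
    static String normalizedSignature(String content) {
        String normalized = content.toLowerCase(Locale.ROOT).trim().replaceAll("\\s+", " ");
        return md5Hex(normalized.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        String a = "Hello   World";
        String b = "hello world\n";
        System.out.println("raw equal: " + rawSignature(a).equals(rawSignature(b)));
        System.out.println("normalized equal: " + normalizedSignature(a).equals(normalizedSignature(b)));
    }
}
```

Two pages that differ only in case or spacing get different raw signatures but the same normalized signature, which is the kind of near-duplicate a pluggable signature is meant to catch.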
Author: ab
Date: Fri Dec 30 03:07:50 2005
New Revision: 360069
URL: http://svn.apache.org/viewcvs?rev=360069&view=rev
Log:
Fix incorrect package declaration.
Modified:
lucene/nutch/trunk/src/java/org/apache/nutch/tools/DmozParser.java
Modified: lucene/nutch/trunk/src/java/org/apache/nutch
Author: ab
Date: Tue Jan 3 00:35:04 2006
New Revision: 365576
URL: http://svn.apache.org/viewcvs?rev=365576&view=rev
Log:
Fixed an NPE: in case of a fetch error we don't have a score value
from Fetcher.
Modified:
lucene/nutch/trunk/src/java/org/apache/nutch/parse/ParseOutputFormat.java
Author: ab
Date: Tue Jan 3 23:32:04 2006
New Revision: 365850
URL: http://svn.apache.org/viewcvs?rev=365850&view=rev
Log:
Update Commons HTTPClient to v. 3.0.
Add some default headers to prefer HTML content, and in English.
Added:
lucene/nutch/trunk/src/plugin/protocol-httpclient/lib
Author: ab
Date: Mon Jan 9 00:58:58 2006
New Revision: 367251
URL: http://svn.apache.org/viewcvs?rev=367251&view=rev
Log:
Replace the custom metadata serialization with the one provided by the
ContentProperties class. This fixes the breakage if multiple property
values per key are in use
Author: ab
Date: Wed Jan 11 15:24:40 2006
New Revision: 368167
URL: http://svn.apache.org/viewcvs?rev=368167&view=rev
Log:
Make sure we always have the segment name and score values in
ParseData.metadata. Sometimes plugins would fail to copy them through,
or a parsing error would produce empty
Author: ab
Date: Fri Jan 20 08:00:04 2006
New Revision: 370854
URL: http://svn.apache.org/viewcvs?rev=370854&view=rev
Log:
Move excessive logging to Level.FINE.
Modified:
lucene/nutch/trunk/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java
Modified:
lucene/nutch
Author: ab
Date: Fri Feb 3 10:54:12 2006
New Revision: 374725
URL: http://svn.apache.org/viewcvs?rev=374725&view=rev
Log:
Encode the query on links in UTF-8.
Modified:
lucene/nutch/trunk/src/web/jsp/search.jsp
Modified: lucene/nutch/trunk/src/web/jsp/search.jsp
URL:
http://svn.apache.org
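As a rough sketch of what encoding the query in UTF-8 means for a link, the snippet below percent-encodes the query string before appending it to a URL. The class name, the `search.jsp?query=` prefix and the parameter name are assumptions for illustration only; the actual change to search.jsp is not shown in this excerpt.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper; the link layout is assumed for the example.
class QueryLinkSketch {

    // Builds a link whose query parameter is percent-encoded as UTF-8,
    // so non-ASCII characters survive the round trip through the URL.
    static String searchLink(String query) {
        try {
            return "search.jsp?query=" + URLEncoder.encode(query, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(searchLink("a b"));
        System.out.println(searchLink("\u00f6l"));
    }
}
```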
Author: ab
Date: Thu Feb 9 16:56:57 2006
New Revision: 376518
URL: http://svn.apache.org/viewcvs?rev=376518&view=rev
Log:
Add metadata to CrawlDatum. Contributed by Stefan Groschupf in
NUTCH-192.
Added:
lucene/nutch/trunk/src/java/org/apache/nutch/crawl/MapWritable.java (with
props
Author: ab
Date: Thu Feb 23 09:21:38 2006
New Revision: 380163
URL: http://svn.apache.org/viewcvs?rev=380163&view=rev
Log:
Modify the cmd-line so that it's possible to perform incremental
updates on existing linkDb. This significantly speeds up the
invertlinks operation.
Modified:
lucene
Author: ab
Date: Tue Mar 7 01:26:54 2006
New Revision: 383829
URL: http://svn.apache.org/viewcvs?rev=383829&view=rev
Log:
Cache instances of ParsePluginList.
Modified:
lucene/nutch/trunk/src/java/org/apache/nutch/parse/ParserFactory.java
Modified: lucene/nutch/trunk/src/java/org/apache
Author: ab
Date: Tue Mar 7 13:04:31 2006
New Revision: 384011
URL: http://svn.apache.org/viewcvs?rev=384011&view=rev
Log:
No-arg constructors are required for RPC. Allow the RPC Server to set
local Configuration. Fix by Marko Bauhardt.
Modified:
lucene/nutch/trunk/src/java/org/apache/nutch
Author: ab
Date: Mon Mar 13 04:16:54 2006
New Revision: 385531
URL: http://svn.apache.org/viewcvs?rev=385531&view=rev
Log:
Print out the full path of plugins directory. Allow using plugins
located outside classpath, eg. in shared repos.
Submitted by Stefan Groschupf, NUTCH-229.
Modified
Author: ab
Date: Sat Mar 18 11:21:11 2006
New Revision: 386875
URL: http://svn.apache.org/viewcvs?rev=386875&view=rev
Log:
Apply patch in NUTCH-230, which provides additional control over which
outlinks are considered for OPIC cash value distribution.
Modified:
lucene/nutch/trunk/src/java/org
Author: ab
Date: Mon Mar 20 15:20:56 2006
New Revision: 387341
URL: http://svn.apache.org/viewcvs?rev=387341&view=rev
Log:
Don't allow Inlink duplicates (NUTCH-235).
Modified:
lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Inlinks.java
lucene/nutch/trunk/src/java/org/apache/nutch
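One common way to rule out duplicates is to keep the links in a set and give the link type value-based equality. The sketch below shows that idea; `InlinksSketch` and its nested `Inlink` are hypothetical stand-ins, and the choice of fields (source URL and anchor) is an assumption, not the actual Inlinks.java change.

```java
import java.util.LinkedHashSet;
import java.util.Objects;
import java.util.Set;

// Hypothetical stand-in; not the actual org.apache.nutch.crawl.Inlinks class.
class InlinksSketch {

    // One incoming link: source URL plus anchor text (assumed fields).
    static final class Inlink {
        final String fromUrl;
        final String anchor;

        Inlink(String fromUrl, String anchor) {
            this.fromUrl = fromUrl;
            this.anchor = anchor;
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof Inlink)) return false;
            Inlink other = (Inlink) o;
            return Objects.equals(fromUrl, other.fromUrl)
                && Objects.equals(anchor, other.anchor);
        }

        @Override
        public int hashCode() {
            return Objects.hash(fromUrl, anchor);
        }
    }

    // A set rejects equal elements; LinkedHashSet keeps insertion order.
    private final Set<Inlink> inlinks = new LinkedHashSet<>();

    // Returns false if an equal inlink was already present.
    boolean add(Inlink inlink) {
        return inlinks.add(inlink);
    }

    int size() {
        return inlinks.size();
    }
}
```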
Author: ab
Date: Tue Mar 21 08:43:09 2006
New Revision: 387578
URL: http://svn.apache.org/viewcvs?rev=387578&view=rev
Log:
Cleanup and JUnit test for Carrot2. Contributed by Dawid Weiss (NUTCH-234).
Added:
lucene/nutch/trunk/src/plugin/clustering-carrot2/src/java/org/apache/nutch/clustering
Author: ab
Date: Thu Mar 30 15:07:48 2006
New Revision: 390275
URL: http://svn.apache.org/viewcvs?rev=390275&view=rev
Log:
Fix a bug where TagSoup would sometimes submit invalid index values.
Modified:
lucene/nutch/trunk/src/plugin/parse-html/src/java/org/apache/nutch/parse/html
Author: ab
Date: Mon Apr 3 06:35:34 2006
New Revision: 391044
URL: http://svn.apache.org/viewcvs?rev=391044&view=rev
Log:
Make sure we use new values for score, metadata, fetch interval
and fetch time.
Modified:
lucene/nutch/trunk/src/java/org/apache/nutch/crawl/CrawlDbReducer.java
Modified
Author: ab
Date: Mon Apr 3 07:36:19 2006
New Revision: 391055
URL: http://svn.apache.org/viewcvs?rev=391055&view=rev
Log:
Forgot to properly initialize the score.
Modified:
lucene/nutch/trunk/src/java/org/apache/nutch/crawl/CrawlDbReducer.java
Modified: lucene/nutch/trunk/src/java/org
Author: ab
Date: Tue Apr 4 22:39:28 2006
New Revision: 391525
URL: http://svn.apache.org/viewcvs?rev=391525&view=rev
Log:
Correct javadoc.
Modified:
lucene/nutch/trunk/src/java/org/apache/nutch/parse/ParseStatus.java
Modified: lucene/nutch/trunk/src/java/org/apache/nutch/parse
Author: ab
Date: Wed Apr 5 03:09:54 2006
New Revision: 391577
URL: http://svn.apache.org/viewcvs?rev=391577&view=rev
Log:
SelectorInverseMapper needs to implement Mapper, otherwise things break.
Noticed by Shawn Gervais.
Modified:
lucene/nutch/trunk/src/java/org/apache/nutch/crawl
Author: ab
Date: Wed Apr 5 10:01:02 2006
New Revision: 391676
URL: http://svn.apache.org/viewcvs?rev=391676&view=rev
Log:
Fix protocol-level redirect code. Patch by Dennis Kubes.
Make it clear that this is a protocol-level redirect, as
opposed to a content-level redirect.
Modified:
lucene
Author: ab
Date: Thu Apr 6 13:01:47 2006
New Revision: 392056
URL: http://svn.apache.org/viewcvs?rev=392056&view=rev
Log:
Pages with only STATUS_DB_GONE were unaccounted for, which caused an NPE.
Modified:
lucene/nutch/trunk/src/java/org/apache/nutch/crawl/CrawlDbReducer.java
Modified
Author: ab
Date: Mon Apr 24 15:56:17 2006
New Revision: 396708
URL: http://svn.apache.org/viewcvs?rev=396708&view=rev
Log:
Fix an NPE, and simplify the logic (NUTCH-254).
Modified:
lucene/nutch/trunk/src/java/org/apache/nutch/fetcher/Fetcher.java
Modified: lucene/nutch/trunk/src/java/org
Author: ab
Date: Tue Apr 25 12:12:48 2006
New Revision: 396955
URL: http://svn.apache.org/viewcvs?rev=396955&view=rev
Log:
Parser for OpenOffice and OpenDocument formats (an updated version of
NUTCH-125).
Development of this plugin was supported by Zaheed Haque. Thank you!
Added:
lucene
Author: ab
Date: Wed Apr 26 03:54:53 2006
New Revision: 397169
URL: http://svn.apache.org/viewcvs?rev=397169&view=rev
Log:
Don't allow CrawlDatum.getMetaData() to return null. Underlying
MapWritable is lazily instantiated to minimize the number of
created objects.
Refactor CrawlDbReducer to use
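The lazy-instantiation pattern described in this log can be sketched as follows. `DatumSketch` is a hypothetical class and a plain `HashMap` stands in for MapWritable; this is an illustration of the pattern under those assumptions, not the CrawlDatum code itself.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a record with optional metadata.
class DatumSketch {

    // Stays null until someone actually asks for the map, so records
    // without metadata do not allocate one.
    private Map<String, String> metaData;

    // Never returns null: the map is created on first access.
    // Note: this simple form is not thread-safe.
    Map<String, String> getMetaData() {
        if (metaData == null) {
            metaData = new HashMap<>();
        }
        return metaData;
    }

    // Checks for metadata without forcing the map to be created.
    boolean hasMetaData() {
        return metaData != null && !metaData.isEmpty();
    }
}
```

Callers can then write `datum.getMetaData().get(key)` without a null check, while code that only needs to know whether metadata exists can call `hasMetaData()` and avoid the allocation.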
Author: ab
Date: Sun Apr 30 16:33:45 2006
New Revision: 398462
URL: http://svn.apache.org/viewcvs?rev=398462&view=rev
Log:
Temporary workaround for a situation where we may end up with a
lone STATUS_SIGNATURE. The real reason for this error is
unknown at this moment, please report if you encounter
Author: ab
Date: Wed May 3 19:42:02 2006
New Revision: 399515
URL: http://svn.apache.org/viewcvs?rev=399515&view=rev
Log:
Use the FileSystem instead of java.io.File.exists().
Modified:
lucene/nutch/trunk/src/java/org/apache/nutch/segment/SegmentReader.java
Modified:
lucene/nutch/trunk/src
Author: ab
Date: Mon May 8 14:48:21 2006
New Revision: 405179
URL: http://svn.apache.org/viewcvs?rev=405179&view=rev
Log:
Fix NUTCH-263.
Modified:
lucene/nutch/trunk/src/java/org/apache/nutch/crawl/MapWritable.java
lucene/nutch/trunk/src/test/org/apache/nutch/crawl/TestMapWritable.java
Author: ab
Date: Mon May 8 14:52:09 2006
New Revision: 405181
URL: http://svn.apache.org/viewcvs?rev=405181&view=rev
Log:
Refactor to make it easier to use these classes programmatically.
Modified:
lucene/nutch/trunk/src/java/org/apache/nutch/crawl/CrawlDbReader.java
lucene/nutch/trunk
Author: ab
Date: Fri May 12 16:35:50 2006
New Revision: 405946
URL: http://svn.apache.org/viewcvs?rev=405946&view=rev
Log:
Fix yet another case where TagSoup supplies invalid parameters.
Modified:
lucene/nutch/trunk/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMBuilder.java
Author: ab
Date: Fri May 12 17:52:33 2006
New Revision: 405967
URL: http://svn.apache.org/viewcvs?rev=405967&view=rev
Log:
Scoring API (NUTCH-240).
Development of this functionality was supported by Krugle.net. Thank you!
Added:
lucene/nutch/trunk/src/java/org/apache/nutch/scoring
Author: ab
Date: Mon May 15 15:18:34 2006
New Revision: 406757
URL: http://svn.apache.org/viewcvs?rev=406757&view=rev
Log:
Fix NUTCH-268. Default settings are still different to avoid DOS-ing
remote DNS servers during fetchlist generation.
Modified:
lucene/nutch/trunk/conf/nutch-default.xml
Author: ab
Date: Wed May 24 17:42:40 2006
New Revision: 409276
URL: http://svn.apache.org/viewvc?rev=409276&view=rev
Log:
Add missing fs.mkdirs() - NUTCH-285. Submitted by Dennis Kubes.
Modified:
lucene/nutch/trunk/src/java/org/apache/nutch/crawl/CrawlDb.java
lucene/nutch/trunk/src/java
Author: ab
Date: Tue May 30 14:12:52 2006
New Revision: 410377
URL: http://svn.apache.org/viewvc?rev=410377&view=rev
Log:
SegmentMerger bug-fixes and improvements:
* replace deprecated use of java.io.File with Hadoop's Path.
* old segment name from Content.metadata needs to be replaced
Author: ab
Date: Mon Jun 26 12:38:39 2006
New Revision: 417285
URL: http://svn.apache.org/viewvc?rev=417285&view=rev
Log:
Add an optional mechanism to time limit long-running queries. This helps to
protect search servers from adverse effects of certain resource-intensive
queries.
Development
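A time limit on a query can be pictured as a collection loop that checks a deadline before accepting each hit and returns whatever it has gathered once the budget runs out. The sketch below shows that shape; the class, the integer "hits" and the injectable clock are all assumptions made for the example and do not reflect the mechanism actually committed here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.LongSupplier;

// Hypothetical illustration of a time-limited collection loop.
class TimeLimitSketch {

    // Collects candidates until they run out or the time budget is spent.
    // The clock is passed in so the behaviour can be exercised with a fake clock.
    static List<Integer> collect(Iterable<Integer> candidates, long maxMillis, LongSupplier clock) {
        long deadline = clock.getAsLong() + maxMillis;
        List<Integer> hits = new ArrayList<>();
        for (Integer doc : candidates) {
            if (clock.getAsLong() > deadline) {
                break; // budget exhausted: return partial results
            }
            hits.add(doc);
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Integer> docs = Arrays.asList(1, 2, 3, 4, 5);
        // With a generous budget and the real clock, all hits are collected.
        System.out.println(collect(docs, 60_000L, System::currentTimeMillis));
    }
}
```

Returning partial results rather than failing is one possible design; a server could equally report the timeout to the caller.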
Author: ab
Date: Tue Jul 18 16:27:49 2006
New Revision: 423291
URL: http://svn.apache.org/viewvc?rev=423291&view=rev
Log:
Add db.max.inlinks with its default value, and document it.
Modified:
lucene/nutch/trunk/conf/nutch-default.xml
Modified: lucene/nutch/trunk/conf/nutch-default.xml
URL
Author: ab
Date: Wed Jul 19 10:35:08 2006
New Revision: 423539
URL: http://svn.apache.org/viewvc?rev=423539&view=rev
Log:
Add ability to limit outlinks to only include initial hosts (NUTCH-173).
Modified:
lucene/nutch/trunk/conf/nutch-default.xml
lucene/nutch/trunk/src/java/org/apache
Author: ab
Date: Wed Jul 19 15:07:48 2006
New Revision: 423630
URL: http://svn.apache.org/viewvc?rev=423630&view=rev
Log:
Add support for Crawl-delay in robots.txt (NUTCH-293).
Modified:
lucene/nutch/trunk/CHANGES.txt
lucene/nutch/trunk/src/plugin/lib-http/src/java/org/apache/nutch
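For readers unfamiliar with the directive, the following sketch extracts a Crawl-delay value from robots.txt text. It is deliberately simplified: it ignores User-agent grouping and returns the first usable value it finds, and the class and method names are hypothetical. It is not the lib-http parser.

```java
import java.util.Locale;

// Hypothetical, simplified reader for the Crawl-delay directive.
class CrawlDelaySketch {

    // Returns the first valid Crawl-delay in milliseconds, or -1 if none is found.
    // Simplification: User-agent sections are not taken into account.
    static long crawlDelayMillis(String robotsTxt) {
        for (String line : robotsTxt.split("\\r?\\n")) {
            int hash = line.indexOf('#');
            if (hash >= 0) {
                line = line.substring(0, hash); // strip trailing comment
            }
            line = line.trim();
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue;
            }
            String key = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            if (!key.equals("crawl-delay")) {
                continue;
            }
            try {
                double seconds = Double.parseDouble(line.substring(colon + 1).trim());
                if (seconds >= 0) {
                    return (long) (seconds * 1000);
                }
            } catch (NumberFormatException e) {
                // Malformed value: keep scanning for a usable one.
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(crawlDelayMillis("User-agent: *\nCrawl-delay: 5\n"));
    }
}
```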
Author: ab
Date: Wed Jul 19 15:39:53 2006
New Revision: 423643
URL: http://svn.apache.org/viewvc?rev=423643&view=rev
Log:
Fix a deficiency in the scoring API (NUTCH-321).
Modified:
lucene/nutch/trunk/src/java/org/apache/nutch/crawl/CrawlDbReducer.java
lucene/nutch/trunk/src/java/org
Author: ab
Date: Wed Jul 19 16:54:51 2006
New Revision: 423665
URL: http://svn.apache.org/viewvc?rev=423665&view=rev
Log:
If a transient exception is thrown, don't mark the page as gone but retry.
Modified:
lucene/nutch/trunk/src/java/org/apache/nutch/fetcher/Fetcher.java
Modified: lucene
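The distinction between "retry later" and "gone" can be sketched as a small classification step. Which exceptions count as transient is an assumption made here for illustration (timeouts and refused connections); the enum, class and method names are hypothetical and the real Fetcher logic is not shown in this excerpt.

```java
import java.net.ConnectException;
import java.net.MalformedURLException;
import java.net.SocketTimeoutException;

// Hypothetical classification of fetch failures.
class FetchStatusSketch {

    enum Status { RETRY, GONE }

    // Assumption: timeouts and refused connections are temporary conditions,
    // so the page should be retried; anything else is treated as permanent.
    static Status classify(Exception e) {
        if (e instanceof SocketTimeoutException || e instanceof ConnectException) {
            return Status.RETRY;
        }
        return Status.GONE;
    }

    public static void main(String[] args) {
        System.out.println(classify(new SocketTimeoutException("read timed out")));
        System.out.println(classify(new MalformedURLException("no protocol")));
    }
}
```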
Author: ab
Date: Mon Jul 24 01:40:19 2006
New Revision: 424965
URL: http://svn.apache.org/viewvc?rev=424965&view=rev
Log:
Set job names (NUTCH-329).
Modified:
lucene/nutch/trunk/src/java/org/apache/nutch/crawl/CrawlDbReader.java
Modified: lucene/nutch/trunk/src/java/org/apache/nutch/crawl
Author: ab
Date: Mon Jul 24 07:41:18 2006
New Revision: 425071
URL: http://svn.apache.org/viewvc?rev=425071&view=rev
Log:
Expire all finished addresses. When sites request long crawl delays
this quickly ties down all threads, and lock expiration happens
rarely and proceeds too slowly to remove all
Author: ab
Date: Tue Jul 25 02:54:58 2006
New Revision: 425354
URL: http://svn.apache.org/viewvc?rev=425354&view=rev
Log:
Change the name of SegmentReader alias to 'readseg' for consistency with other
reading-related commands. Keep the old 'segread' for compatibility, and
give a deprecation
Author: ab
Date: Thu Aug 17 07:53:54 2006
New Revision: 432254
URL: http://svn.apache.org/viewvc?rev=432254&view=rev
Log:
Move toLowerCase where it actually matters. Fix some whitespace.
Modified:
lucene/nutch/trunk/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api
Author: ab
Date: Thu Aug 17 07:56:35 2006
New Revision: 432256
URL: http://svn.apache.org/viewvc?rev=432256&view=rev
Log:
Apply patch in NUTCH-348 - Generator used the lowest score instead of
the highest. Contributed by Chris Schneider and Stefan Groschupf.
Modified:
lucene/nutch/trunk/src
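The class of bug described here, picking the lowest scores where the highest were wanted, typically comes down to sort direction. The sketch below selects the top N entries by sorting in descending score order; `TopScoreSketch` and its `Entry` type are hypothetical and this is not the Generator code or the NUTCH-348 patch.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration of selecting the highest-scoring entries.
class TopScoreSketch {

    static final class Entry {
        final String url;
        final float score;

        Entry(String url, float score) {
            this.url = url;
            this.score = score;
        }
    }

    // Returns up to n entries with the HIGHEST scores (n must be non-negative).
    // Dropping reversed() would sort ascending and return the lowest instead.
    static List<Entry> topN(List<Entry> entries, int n) {
        List<Entry> sorted = new ArrayList<>(entries);
        sorted.sort(Comparator.comparingDouble((Entry e) -> e.score).reversed());
        return new ArrayList<>(sorted.subList(0, Math.min(n, sorted.size())));
    }
}
```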
Author: ab
Date: Thu Aug 17 09:35:35 2006
New Revision: 432287
URL: http://svn.apache.org/viewvc?rev=432287&view=rev
Log:
Apply patch in NUTCH-348 - Generator used the lowest score instead of
the highest. Contributed by Chris Schneider and Stefan Groschupf.
Modified:
lucene/nutch/branches
Author: ab
Date: Thu Aug 17 09:38:21 2006
New Revision: 432290
URL: http://svn.apache.org/viewvc?rev=432290&view=rev
Log:
Update CHANGES.
Modified:
lucene/nutch/branches/branch-0.8/CHANGES.txt
Modified: lucene/nutch/branches/branch-0.8/CHANGES.txt
URL:
http://svn.apache.org/viewvc/lucene
Author: ab
Date: Thu Aug 17 09:41:12 2006
New Revision: 432293
URL: http://svn.apache.org/viewvc?rev=432293&view=rev
Log:
Update CHANGES.
Modified:
lucene/nutch/trunk/CHANGES.txt
Modified: lucene/nutch/trunk/CHANGES.txt
URL:
http://svn.apache.org/viewvc/lucene/nutch/trunk/CHANGES.txt?rev
Author: ab
Date: Fri Aug 18 11:48:29 2006
New Revision: 432674
URL: http://svn.apache.org/viewvc?rev=432674&view=rev
Log:
NUTCH-341 - if -workingdir is specified, always create a unique subdir.
Also, use unique directory names to allow multiple IndexMergers to run
simultaneously.
Modified
Author: ab
Date: Fri Aug 18 11:50:00 2006
New Revision: 432675
URL: http://svn.apache.org/viewvc?rev=432675&view=rev
Log:
NUTCH-341 - if -workingdir is specified, always create a unique subdir.
Also, use unique directory names to allow multiple IndexMergers to run
simultaneously.
Modified
Author: ab
Date: Mon Sep 18 03:43:07 2006
New Revision: 447359
URL: http://svn.apache.org/viewvc?view=rev&rev=447359
Log:
Fix an NPE when using searcher.max.hits, but NOT using time limit.
Modified:
lucene/nutch/branches/branch-0.8/src/java/org/apache/nutch/searcher/LuceneQueryOptimizer.java
Added:
lucene/nutch/trunk/src/plugin/urlnormalizer-regex/src/test/org/apache/nutch/net/urlnormalizer/regex/TestRegexURLNormalizer.java
URL:
Author: ab
Date: Sat Sep 23 12:45:48 2006
New Revision: 449294
URL: http://svn.apache.org/viewvc?view=rev&rev=449294
Log:
NUTCH-350: Urls blocked by http.max.delays incorrectly marked as GONE.
Added:
lucene/nutch/branches/branch-0.8/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http
Author: ab
Date: Mon Sep 25 09:58:49 2006
New Revision: 449738
URL: http://svn.apache.org/viewvc?view=rev&rev=449738
Log:
Don't create dummy Content (throws NPE), just pass null. Reported by
Richard Braman.
Modified:
lucene/nutch/trunk/src/plugin/lib-http/src/java/org/apache/nutch/protocol
Author: ab
Date: Mon Sep 25 10:05:22 2006
New Revision: 449742
URL: http://svn.apache.org/viewvc?view=rev&rev=449742
Log:
Don't create dummy Content (throws NPE), just pass null. Reported by
Richard Braman.
Modified:
lucene/nutch/branches/branch-0.8/src/plugin/lib-http/src/java/org/apache
Author: ab
Date: Mon Sep 25 11:14:31 2006
New Revision: 449765
URL: http://svn.apache.org/viewvc?view=rev&rev=449765
Log:
Catch exception on invalid urls, and continue collecting valid ones.
Modified:
lucene/nutch/branches/branch-0.8/src/java/org/apache/nutch/parse/OutlinkExtractor.java
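The behaviour described in this log, skipping a bad URL instead of aborting the whole extraction, amounts to catching the exception inside the loop rather than around it. A minimal sketch, with a hypothetical class name and `java.net.URL` used as the validity check (an assumption; the check used by OutlinkExtractor is not shown here):

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not the OutlinkExtractor code.
class OutlinkCollectSketch {

    // Parses each candidate and keeps the ones that are well-formed.
    // The try/catch sits inside the loop so one bad URL does not
    // discard the valid ones around it.
    static List<URL> collectValid(List<String> candidates) {
        List<URL> valid = new ArrayList<>();
        for (String candidate : candidates) {
            try {
                valid.add(new URL(candidate));
            } catch (MalformedURLException e) {
                // Invalid URL: skip it and continue with the next candidate.
            }
        }
        return valid;
    }
}
```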
Author: ab
Date: Thu Sep 28 03:48:25 2006
New Revision: 450799
URL: http://svn.apache.org/viewvc?view=rev&rev=450799
Log:
Bring back the '-noAdditions' option. This is useful for running
constrained crawls, where the complete list of URLs is known in
advance.
Modified:
lucene/nutch/trunk/conf
Author: ab
Date: Mon Oct 9 00:13:46 2006
New Revision: 454297
URL: http://svn.apache.org/viewvc?view=rev&rev=454297
Log:
Fix NPE when document properties are null. Reported by Trym Asserson.
Modified:
lucene/nutch/trunk/src/plugin/lib-parsems/src/java/org/apache/nutch/parse/ms
Author: ab
Date: Mon Oct 9 00:22:00 2006
New Revision: 454298
URL: http://svn.apache.org/viewvc?view=rev&rev=454298
Log:
Fix NPE when document properties are null. Reported by Trym Asserson.
Modified:
lucene/nutch/branches/branch-0.8/src/plugin/lib-parsems/src/java/org/apache/nutch/parse/ms
Author: ab
Date: Sat Oct 28 03:32:44 2006
New Revision: 468673
URL: http://svn.apache.org/viewvc?view=rev&rev=468673
Log:
Fix NUTCH-394.
Modified:
lucene/nutch/branches/branch-0.8/build.xml
Modified: lucene/nutch/branches/branch-0.8/build.xml
URL:
http://svn.apache.org/viewvc/lucene/nutch
Author: ab
Date: Tue Oct 31 13:36:01 2006
New Revision: 469662
URL: http://svn.apache.org/viewvc?view=rev&rev=469662
Log:
Update.
Modified:
lucene/nutch/trunk/CHANGES.txt
Modified: lucene/nutch/trunk/CHANGES.txt
URL:
http://svn.apache.org/viewvc/lucene/nutch/trunk/CHANGES.txt?view=diffrev
Author: ab
Date: Tue Oct 31 13:46:26 2006
New Revision: 469667
URL: http://svn.apache.org/viewvc?view=rev&rev=469667
Log:
NUTCH-361, NUTCH-136 - When jobtracker is 'local' generate only one
partition.
Modified:
lucene/nutch/branches/branch-0.8/CHANGES.txt
lucene/nutch/branches/branch-0.8
Author: ab
Date: Tue Nov 14 04:11:30 2006
New Revision: 474756
URL: http://svn.apache.org/viewvc?view=rev&rev=474756
Log:
NUTCH-401: use hadoop.tmp.dir instead of hardcoded /tmp.
Modified:
lucene/nutch/trunk/src/java/org/apache/nutch/segment/SegmentReader.java
Modified:
lucene/nutch/trunk
Author: ab
Date: Tue Nov 14 04:24:48 2006
New Revision: 474763
URL: http://svn.apache.org/viewvc?view=rev&rev=474763
Log:
NUTCH-401: use mapred.temp.dir instead of hardcoded /tmp.
Modified:
lucene/nutch/branches/branch-0.8/src/java/org/apache/nutch/segment/SegmentReader.java
Modified
Author: ab
Date: Tue Nov 14 11:38:06 2006
New Revision: 474934
URL: http://svn.apache.org/viewvc?view=rev&rev=474934
Log:
Add an ObjectWritable decorator.
Added:
lucene/nutch/trunk/src/java/org/apache/nutch/metadata/MetaWrapper.java
(with props)
Added: lucene/nutch/trunk/src/java/org
Author: ab
Date: Tue Nov 28 12:14:58 2006
New Revision: 480188
URL: http://svn.apache.org/viewvc?view=rev&rev=480188
Log:
Move some constants to Nutch.java, so that Metadata could use them properly.
Modified:
lucene/nutch/trunk/src/java/org/apache/nutch/fetcher/Fetcher.java
lucene/nutch
Author: ab
Date: Tue Nov 28 13:02:10 2006
New Revision: 480207
URL: http://svn.apache.org/viewvc?view=rev&rev=480207
Log:
Use SpellCheckedMetadata only when necessary, i.e. only when collecting
metadata from unreliable sources such as HTTP headers.
* Metadata: fix a bug where SpellCheckedMetadata
Author: ab
Date: Tue Dec 5 06:34:13 2006
New Revision: 482674
URL: http://svn.apache.org/viewvc?view=rev&rev=482674
Log:
Refactor robots.txt checking so that it's protocol independent.
Make blocking and robots checking optional inside lib-http. This is
needed for alternative Fetcher
Author: ab
Date: Thu Dec 7 03:21:08 2006
New Revision: 483420
URL: http://svn.apache.org/viewvc?view=rev&rev=483420
Log:
Upgrade to Hadoop 0.9.1.
Added:
lucene/nutch/trunk/lib/hadoop-0.9.1.jar (with props)
Removed:
lucene/nutch/trunk/lib/hadoop-0.7.1.jar
Modified:
lucene/nutch
Author: ab
Date: Mon Dec 11 02:04:59 2006
New Revision: 485587
URL: http://svn.apache.org/viewvc?view=rev&rev=485587
Log:
Remove misplaced cast, which sometimes led to an overflow.
Close readers when done - when using local FS this would prevent us
from deleting temporary dirs.
Modified
Author: ab
Date: Thu Dec 14 00:53:08 2006
New Revision: 487143
URL: http://svn.apache.org/viewvc?view=rev&rev=487143
Log:
Check if paths exist before deleting them. Reported by Renaud Richardet.
Modified:
lucene/nutch/trunk/src/java/org/apache/nutch/crawl/LinkDb.java
Modified: lucene/nutch
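The guard described in this log can be sketched with the standard java.nio.file API, used here as a stand-in for the Hadoop FileSystem that Nutch itself works with (that substitution, and the class and method names, are assumptions for the example).

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical illustration using java.nio.file instead of Hadoop's FileSystem.
class SafeDeleteSketch {

    // Deletes a file or empty directory only if it is present.
    // Returns true if something was deleted, false if the path did not exist.
    static boolean deleteIfPresent(Path path) {
        if (!Files.exists(path)) {
            return false; // nothing to delete, and no exception for a missing path
        }
        try {
            Files.delete(path);
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With java.nio.file specifically, `Files.deleteIfExists(path)` performs the same check and delete in a single call.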
Author: ab
Date: Thu Dec 14 01:06:56 2006
New Revision: 487145
URL: http://svn.apache.org/viewvc?view=rev&rev=487145
Log:
Check if paths exist before deleting them. Reported by Renaud Richardet.
Modified:
lucene/nutch/branches/branch-0.8/src/java/org/apache/nutch/crawl/CrawlDb.java
Author: ab
Date: Fri Jan 5 08:58:29 2007
New Revision: 493085
URL: http://svn.apache.org/viewvc?view=rev&rev=493085
Log:
Fix NUTCH-425 and NUTCH-426.
Modified:
lucene/nutch/trunk/CHANGES.txt
lucene/nutch/trunk/src/plugin/parse-js/src/java/org/apache/nutch/parse/js/JSParseFilter.java
Author: ab
Date: Thu Jan 11 05:25:43 2007
New Revision: 495214
URL: http://svn.apache.org/viewvc?view=rev&rev=495214
Log:
When indexing redirected pages, drop intermediate pages and only index the
final page.
Avoid NPEs in Crawl tool, when no URLs are generated or fetched.
Modified:
lucene
Author: ab
Date: Thu Jan 11 14:00:51 2007
New Revision: 495397
URL: http://svn.apache.org/viewvc?view=rev&rev=495397
Log:
Fix NUTCH-420 - DeleteDuplicates depended on the order of IndexDoc
processing.
Modified:
lucene/nutch/trunk/CHANGES.txt
lucene/nutch/trunk/src/java/org/apache/nutch
Author: ab
Date: Mon Jan 15 15:07:15 2007
New Revision: 496535
URL: http://svn.apache.org/viewvc?view=rev&rev=496535
Log:
Pick the right entry, as indicated by the same generate time.
Modified:
lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Generator.java
Modified: lucene/nutch/trunk/src
Author: ab
Date: Wed Jan 17 11:55:07 2007
New Revision: 497141
URL: http://svn.apache.org/viewvc?view=rev&rev=497141
Log:
NUTCH-68 - ported to use map-reduce.
Added:
lucene/nutch/trunk/src/java/org/apache/nutch/tools/FreeGenerator.java
(with props)
Modified:
lucene/nutch/trunk
Author: ab
Date: Wed Jan 17 13:06:50 2007
New Revision: 497172
URL: http://svn.apache.org/viewvc?view=rev&rev=497172
Log:
Revert accidental change to bin/nutch.
Fix Fetcher.java to correctly split input.
Add Fetcher2 - a queue-based fetcher implementation.
Added:
lucene/nutch/trunk/src/java
Author: ab
Date: Thu Jan 25 12:15:34 2007
New Revision: 499944
URL: http://svn.apache.org/viewvc?view=rev&rev=499944
Log:
Mention the addition of Fetcher2.
Modified:
lucene/nutch/trunk/CHANGES.txt
Modified: lucene/nutch/trunk/CHANGES.txt
URL:
http://svn.apache.org/viewvc/lucene/nutch/trunk
Author: ab
Date: Wed Feb 14 04:15:05 2007
New Revision: 507504
URL: http://svn.apache.org/viewvc?view=rev&rev=507504
Log:
Outlink: when null anchor is supplied replace it with an empty string.
ParseSegment: store segment name in parts that we produce here. Content is
only read, not stored as one
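The null-anchor handling mentioned for Outlink can be sketched as a constructor that normalizes its argument, so later code never sees a null anchor. `OutlinkSketch` is a hypothetical class with assumed field names; it is not the Nutch Outlink class.

```java
// Hypothetical stand-in for a link with optional anchor text.
class OutlinkSketch {

    final String toUrl;
    final String anchor;

    // A null anchor is replaced with an empty string at construction time,
    // so code that serializes or compares anchors needs no null checks.
    OutlinkSketch(String toUrl, String anchor) {
        this.toUrl = toUrl;
        this.anchor = (anchor == null) ? "" : anchor;
    }
}
```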
Author: ab
Date: Wed Mar 7 11:02:56 2007
New Revision: 515698
URL: http://svn.apache.org/viewvc?view=rev&rev=515698
Log:
NUTCH-432 - JAVA_PLATFORM with spaces breaks bin/nutch.
Also, apply the patch proposed in HADOOP-1080 to fix CLASSPATH problems
under Cygwin.
Modified:
lucene/nutch/trunk
Author: ab
Date: Wed Mar 7 13:59:07 2007
New Revision: 515791
URL: http://svn.apache.org/viewvc?view=rev&rev=515791
Log:
Upgrade to Hadoop 0.11.2 and Lucene 2.1.0 releases.
Added:
lucene/nutch/trunk/lib/hadoop-0.11.2-core.jar (with props)
lucene/nutch/trunk/lib/lucene-core-2.1.0.jar
Author: ab
Date: Fri Mar 9 04:27:18 2007
New Revision: 516387
URL: http://svn.apache.org/viewvc?view=rev&rev=516387
Log:
Add the number of active threads to the status report.
Modified:
lucene/nutch/trunk/src/java/org/apache/nutch/fetcher/Fetcher2.java
Modified: lucene/nutch/trunk/src/java
Author: ab
Date: Mon Mar 12 13:35:37 2007
New Revision: 517382
URL: http://svn.apache.org/viewvc?view=rev&rev=517382
Log:
Fix inconsistent end-of-line style. Discovered this when trying to import
to a separate subversion repo.
Modified:
lucene/nutch/trunk/contrib/web2/plugins/web-caching
Author: ab
Date: Mon Mar 19 16:02:56 2007
New Revision: 520154
URL: http://svn.apache.org/viewvc?view=rev&rev=520154
Log:
Update to Hadoop 0.12.1.
Added:
lucene/nutch/trunk/lib/hadoop-0.12.1-core.jar (with props)
lucene/nutch/trunk/lib/jets3t-0.5.0.jar (with props)
Removed:
lucene
Author: ab
Date: Thu Mar 22 03:08:00 2007
New Revision: 521182
URL: http://svn.apache.org/viewvc?view=rev&rev=521182
Log:
NUTCH-246 - incorrect segment size being generated due to time
synchronization issue.
Modified:
lucene/nutch/trunk/CHANGES.txt
lucene/nutch/trunk/src/java/org/apache
Author: ab
Date: Fri Mar 23 15:59:01 2007
New Revision: 521933
URL: http://svn.apache.org/viewvc?view=rev&rev=521933
Log:
Upgrade to Hadoop 0.12.2 release.
Fix whitespace issues in platform name in bin/hadoop under Cygwin.
Replace deprecated method call.
Added:
lucene/nutch/trunk/lib/hadoop
Author: ab
Date: Sat Apr 7 09:44:02 2007
New Revision: 526455
URL: http://svn.apache.org/viewvc?view=rev&rev=526455
Log:
Empty MapWritable would throw an NPE when building a keySet.
Modified:
lucene/nutch/trunk/src/java/org/apache/nutch/crawl/MapWritable.java
Modified: lucene/nutch/trunk