[ http://issues.apache.org/jira/browse/NUTCH-150?page=all ]
Paul Baclace updated NUTCH-150:
---
Attachment: OutlinkExtractor.java.patch
This patch has 3 changes:
1. Adds a comment that non-plain-text can be a problem.
2. Adds quantifiers to the regular expre
OutlinkExtractor extremely slow on some non-plain text
--
Key: NUTCH-150
URL: http://issues.apache.org/jira/browse/NUTCH-150
Project: Nutch
Type: Bug
Versions: 0.8-dev
Environment: All
Reporter: Paul Ba
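A rough illustration of the idea behind the NUTCH-150 patch note above ("Adds quantifiers to the regular expre..."): the preview is cut off, so the actual pattern and quantifiers in OutlinkExtractor are not shown here. The sketch below only assumes that putting explicit upper bounds on the repeated parts of a URL pattern limits how much work a single match attempt can do on non-plain-text input. The class name, the pattern, and the bounds are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; this is NOT the pattern used by Nutch's OutlinkExtractor.
final class BoundedUrlScan {
    // Every repeated part carries an explicit upper bound ({1,255}, {1,5},
    // {0,2000}), so a match attempt stops after a fixed number of characters
    // instead of running across an arbitrarily long stretch of input.
    private static final Pattern URL = Pattern.compile(
        "(?:https?|ftp)://[A-Za-z0-9.-]{1,255}(?::[0-9]{1,5})?(?:/[^\\s<>\"]{0,2000})?");

    // Returns every substring of text that matches the bounded pattern.
    static List<String> extract(String text) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL.matcher(text);
        while (m.find()) {
            urls.add(m.group());
        }
        return urls;
    }
}
```

Calling `BoundedUrlScan.extract(...)` on a string containing `http://lucene.apache.org/nutch/` returns that URL; the bounds themselves are arbitrary and would need tuning against real content.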
Hi all,
It's time to do some cleanup of the trunk/ after the mapred merge. I'm
planning to remove the old classes in trunk/, from the following packages:
* org.apache.nutch.db.* - all classes
* org.apache.nutch.fetcher.*
* org.apache.nutch.indexer.IndexSegment
* org.apache.nutch.indexer.Delete
[ http://issues.apache.org/jira/browse/NUTCH-61?page=comments#action_12361133 ]
Andrzej Bialecki commented on NUTCH-61:
This patch already supports this. Anyway, it needs to be significantly
re-worked to fit into the current development version.
[ http://issues.apache.org/jira/browse/NUTCH-61?page=comments#action_12361131 ]
raghavendra prabhu commented on NUTCH-61:
-
Will the same thing work for a file system?
For a file system, we can directly get the modified date and store it in the db.
The pl
[ http://issues.apache.org/jira/browse/NUTCH-149?page=comments#action_12361130 ]
raghavendra prabhu commented on NUTCH-149:
--
Do the outlinks work only when the HTML has a base tag,
so that the entire link can be constructed?
If not will the base ta
outlinks not shown properly in cached.jsp
-
Key: NUTCH-149
URL: http://issues.apache.org/jira/browse/NUTCH-149
Project: Nutch
Type: Bug
Components: searcher, web gui
Versions: 0.8-dev
Environment: windows xp
apache
[ http://issues.apache.org/jira/browse/NUTCH-148?page=comments#action_12361128 ]
Piotr Kosiorowski commented on NUTCH-148:
-
Do you have Cygwin installed?
Is 'df' working in your cygwin installation?
Do you run crawl from cygwin shell?
Nutch require
Hi,
This is what I did to make NutchConf less static,
without patching any of those 195 places Stefan mentioned.
NutchConf.get() yields the current config.
OpenConf sets a new current config.
Finally, CloseConf closes this config.
But be warned about issues with the plugin cache menti
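One way to picture the get / open / close behaviour described in the message above is the following minimal sketch. It is not the poster's code, which is not shown here: the class name `CurrentConf`, the use of `java.util.Properties` as the config type, and the per-thread stack are all assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Properties;

// Hypothetical sketch of a "current config" holder; not Nutch's NutchConf.
final class CurrentConf {
    // Fallback config returned when nothing has been opened.
    private static final Properties DEFAULT = new Properties();

    // Assumption: one stack of opened configs per thread, so that opening a
    // config in one thread does not change what another thread sees.
    private static final ThreadLocal<Deque<Properties>> STACK =
        ThreadLocal.withInitial(ArrayDeque::new);

    // Yields the current config: the most recently opened one, else the default.
    static Properties get() {
        Properties top = STACK.get().peek();
        return top != null ? top : DEFAULT;
    }

    // Makes conf the current config until the matching close().
    static void open(Properties conf) {
        STACK.get().push(conf);
    }

    // Closes the current config, restoring whichever one was current before it.
    static void close() {
        Deque<Properties> stack = STACK.get();
        if (!stack.isEmpty()) {
            stack.pop();
        }
    }
}
```

Existing callers keep using a static `get()`, which is why no call sites need patching. The trade-off is that anything cached against one config, such as the plugin cache the message warns about, can go stale when another config is opened.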
Stefan Groschupf wrote:
Hi,
Since we know that our httpclient plugin has some problems, maybe it is
sensible to update to the new library.
I guess this is some work, but maybe someone is interested in taking the
job. :)
I'll take it, thanks for the heads-up.
--
Best regards,
Andrzej Bialecki
org.apache.nutch.tools.CrawlTool throws error while doing deleteduplicates
--
Key: NUTCH-148
URL: http://issues.apache.org/jira/browse/NUTCH-148
Project: Nutch
Type: Bug
Components: indexer
Hi,
Since we know that our httpclient plugin has some problems, maybe it is
sensible to update to the new library.
I guess this is some work, but maybe someone is interested in taking the
job. :)
http://www.theserverside.com/news/thread.tss?thread_id=38189
HttpClient 3.0 provides the following n