Hello,
NDFS is still under development, so I would not use it in production. You can
use 'bin/nutch server'.
Regards,
Ferenc
reetesh chandran wrote:
Hello,
We are running Nutch on 3 networked machines running
Linux. We have Apache Tomcat running on all 3
machines. We are able to create a
Please check your crawl-urlfilter.txt. If you use an older version of Nutch
(e.g. 0.6 final), it contains an entry that restricts the crawl to
nutch.org.
Feng (Michael) Ji wrote:
hi there,
If I put multiple web URLs in the plain text file
urls in the following command, will it fetch
about the background of this tool. Can anyone tell me what the idea
behind it is?
Regards
Michael
Andy Liu wrote:
I believe this tool is unfinished and unsupported.
On 7/22/05, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:
I found an IndexOptimizer in Nutch.
When I run it, it drops
Dear Developers!
I tested Nutch 0.7 with all the parser plugins, and found the following:
-
The fetch breaks with, e.g., the following:
-
050901 110915
see the same errors. As I saw a running installation yesterday,
I think it's a configuration mistake. So far I have no idea where.
Have you made any progress?
Regards
Michael
[EMAIL PROTECTED] wrote:
Dear Developers!
I tested nutch 0.7 with all the parser plugins, and found
Hi All,
I need to add dmoz meta-data to my index. I see some people have commented
about it but I didn't find a solution. Can someone read the steps below and
give me some hints or pointers? This is the code that I added:
1) injector.java: datum.setCategory(dmoz-cat);
2) crawldatum.java: add
Reporter: [EMAIL PROTECTED]
OpenSearchServlet does not check text-to-output for illegal XML characters;
depending on the search result, it's possible for OSS to output XML that is not
well-formed. For example, if the text has the FF character in it,
i.e. the ASCII character
[ http://issues.apache.org/jira/browse/NUTCH-110?page=all ]
[EMAIL PROTECTED] updated NUTCH-110:
Attachment: fixIllegalXmlChars.patch
The attached patch runs all XML text through a check for bad XML characters. This
patch is brutal, dropping silently
[ http://issues.apache.org/jira/browse/NUTCH-110?page=all ]
[EMAIL PROTECTED] updated NUTCH-110:
Attachment: NUTCH-110-version2.patch
Patch version 2. This patch benefits from the discussion held on the nutch dev
list. This patch differs from the first
[
http://issues.apache.org/jira/browse/NUTCH-110?page=comments#action_12357300 ]
[EMAIL PROTECTED] commented on NUTCH-110:
-
Scrub NUTCH-110-version2.patch. This patch double-encodes certain entities
(First by the new toValidXmlText method, second
Be explicit about target JVM when building (1.4.x?)
---
Key: NUTCH-130
URL: http://issues.apache.org/jira/browse/NUTCH-130
Project: Nutch
Type: Improvement
Reporter: [EMAIL PROTECTED]
Priority: Minor
Below
[
http://issues.apache.org/jira/browse/NUTCH-130?page=comments#action_12358981 ]
[EMAIL PROTECTED] commented on NUTCH-130:
-
Need to do same for plugin compile:
$ /usr/local/bin/svn diff src/plugin/build-plugin.xml
Index: src/plugin/build
ParseUtil drops reason for failed parse
---
Key: NUTCH-190
URL: http://issues.apache.org/jira/browse/NUTCH-190
Project: Nutch
Type: Bug
Components: fetcher
Versions: 0.8-dev
Environment: linux
Reporter: [EMAIL
[ http://issues.apache.org/jira/browse/NUTCH-190?page=all ]
[EMAIL PROTECTED] updated NUTCH-190:
Attachment: ParseUtil_drops_failure_reason.patch
Attached is a suggested patch against revision 369598.
ParseUtil drops reason for failed parse
[
http://issues.apache.org/jira/browse/NUTCH-190?page=comments#action_12364145 ]
[EMAIL PROTECTED] commented on NUTCH-190:
-
Here's an example of failure output after patch is applied:
060126 141413 task_m_bx2ifn Error parsing:
http
Cannot open filename index.done.crc
---
Key: NUTCH-256
URL: http://issues.apache.org/jira/browse/NUTCH-256
Project: Nutch
Type: Bug
Components: indexer
Versions: 0.8-dev
Reporter: [EMAIL PROTECTED
[ http://issues.apache.org/jira/browse/NUTCH-256?page=all ]
[EMAIL PROTECTED] updated NUTCH-256:
Attachment: index.done.crc.patch
Ensure creation of companion index.done .crc file
Cannot open filename index.done.crc
Type: Bug
Components: searcher
Versions: 0.8-dev
Reporter: [EMAIL PROTECTED]
Priority: Minor
All search result data we display in search results has to be explicitly
Entity.encoded when output in search.jsp (title, url, etc.), except Summaries.
It's already Entity.encoded
[
http://issues.apache.org/jira/browse/NUTCH-257?page=comments#action_12376997 ]
[EMAIL PROTECTED] commented on NUTCH-257:
-
I took a closer look. Turns out Summary is inherently all about rendering HTML
(See the different Summary.Fragment
[
http://issues.apache.org/jira/browse/NUTCH-256?page=comments#action_12376999 ]
[EMAIL PROTECTED] commented on NUTCH-256:
-
Works for me. Thanks. Please close as fixed.
Cannot open filename index.done.crc
CrawlDbReducer: OOME because no upper-bound on inlinks count
Key: NUTCH-269
URL: http://issues.apache.org/jira/browse/NUTCH-269
Project: Nutch
Type: Bug
Reporter: [EMAIL PROTECTED]
Priority
[ http://issues.apache.org/jira/browse/NUTCH-269?page=all ]
[EMAIL PROTECTED] updated NUTCH-269:
Attachment: too-many-links.patch
Add a configurable upper limit to the number of links we'll read.
CrawlDbReducer: OOME because no upper-bound on inlinks
[ http://issues.apache.org/jira/browse/NUTCH-269?page=all ]
[EMAIL PROTECTED] updated NUTCH-269:
Attachment: too-many-links2.patch
Previous patch is useless. This one actually breaks the loop.
CrawlDbReducer: OOME because no upper-bound
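The idea behind the NUTCH-269 patches above (cap the number of links read, and actually leave the loop once the cap is reached) might be sketched roughly as follows. This is not the attached patch: the class BoundedLinkReader, the method readAtMost, and the maxLinks parameter are hypothetical names used only for illustration, and the real reducer works on Nutch/Hadoop types rather than a plain Iterator.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical helper, not part of Nutch: illustrates a bounded read loop.
final class BoundedLinkReader {
    private BoundedLinkReader() {}

    // Reads at most maxLinks items from the iterator and returns them.
    // maxLinks stands in for a value that would come from configuration.
    static <T> List<T> readAtMost(Iterator<T> links, int maxLinks) {
        List<T> kept = new ArrayList<>();
        while (links.hasNext()) {
            if (kept.size() >= maxLinks) {
                // Stop reading entirely; without this break the loop would
                // keep consuming the iterator and memory use could still grow.
                break;
            }
            kept.add(links.next());
        }
        return kept;
    }
}
```

Calling readAtMost with an iterator over three links and maxLinks set to 2 keeps the first two links and leaves the third unread.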
[ http://issues.apache.org/jira/browse/NUTCH-110?page=all ]
[EMAIL PROTECTED] updated NUTCH-110:
Attachment: fixIllegalXmlChars08-v3.patch
Version of the patch that doesn't ...process the String twice if it contains some
illegal characters! Its name
[ http://issues.apache.org/jira/browse/NUTCH-110?page=all ]
[EMAIL PROTECTED] updated NUTCH-110:
Version: 0.8-dev
(was: 0.7)
Was version 0.7. Changed 'Affects Version' to 0.8-dev.
OpenSearchServlet outputs illegal xml characters
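For readers following the NUTCH-110 entries, the kind of check being discussed (drop characters that XML 1.0 does not allow, in a single pass over the string) could look roughly like the sketch below. It is not the code from any of the attached patches: the class XmlTextScrubber and its method names are hypothetical, and the character ranges are taken from the Char production of the XML 1.0 specification.

```java
// Hypothetical helper, not the NUTCH-110 patch: drops characters that are
// not allowed in XML 1.0 text.
final class XmlTextScrubber {
    private XmlTextScrubber() {}

    // True if the code point is allowed by the XML 1.0 Char production.
    static boolean isLegalXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns the input unchanged (same instance) when it is already clean;
    // otherwise returns a copy with the illegal characters removed.
    static String stripIllegalXml(String text) {
        StringBuilder out = null; // allocated only once an illegal char is seen
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            if (isLegalXmlChar(cp)) {
                if (out != null) {
                    out.appendCodePoint(cp);
                }
            } else if (out == null) {
                // First illegal character: copy the clean prefix, skip this one.
                out = new StringBuilder(text.length());
                out.append(text, 0, i);
            }
            i += Character.charCount(cp);
        }
        return out == null ? text : out.toString();
    }
}
```

Silently dropping characters is the simplest policy; replacing them with a placeholder character would be an equally plausible choice.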
[ http://issues.apache.org/jira/browse/NUTCH-110?page=all ]
[EMAIL PROTECTED] updated NUTCH-110:
Attachment: fixIllegalXmlChars08-v4.patch
v3 mistakenly included debugging code.
Attached cleaned up v4.
OpenSearchServlet outputs illegal xml
[ http://issues.apache.org/jira/browse/NUTCH-110?page=all ]
[EMAIL PROTECTED] updated NUTCH-110:
Attachment: fixIllegalXmlChars08-v5.patch
No, the double call to getLegalXml is not intentional. It's a mistake. Thanks
for finding it.
I've attached
Versions: 0.9.0
Reporter: [EMAIL PROTECTED]
Priority: Minor
The basic indexer plugin adds 'host', 'site', 'url', 'content', 'title', and
'anchor'. The query-basic plugin expands queries against the 'default' field
to run against all basic indexer plugin fields
[ http://issues.apache.org/jira/browse/NUTCH-423?page=all ]
[EMAIL PROTECTED] updated NUTCH-423:
Attachment: other-index-basic-query-fields.patch
Add other index-basic fields as query plugins
[
https://issues.apache.org/jira/browse/NUTCH-425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
[EMAIL PROTECTED] updated NUTCH-425:
Attachment: nutch425.patch
parse-js pollutes anchor text with base URL of source page
[
https://issues.apache.org/jira/browse/NUTCH-425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12462291
]
[EMAIL PROTECTED] commented on NUTCH-425:
-
I took a look at what is passed to parse-js both when called from
Components: fetcher
Affects Versions: 0.9.0
Reporter: [EMAIL PROTECTED]
Priority: Minor
The parse-js plugin in getJSLinks tries a regex looking for likely URLs against
a string of JavaScript. Any matches that do not begin with 'www' are given to
java.net.URL with base URL to test
[
https://issues.apache.org/jira/browse/NUTCH-426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
[EMAIL PROTECTED] updated NUTCH-426:
Attachment: nutch426.patch
parse-js skips parsing if found URL fails java.net.URL parse
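The behaviour described in NUTCH-426 (one candidate that java.net.URL rejects should not stop the remaining candidates from being processed) can be illustrated with the sketch below. It is an assumption-laden illustration, not the attached nutch426.patch: the class LenientLinkResolver and the method resolveAll are hypothetical, and the candidate strings are taken as a plain list rather than regex matches from getJSLinks.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of parse-js: resolves candidate links against
// a base URL and skips the ones java.net.URL cannot parse.
final class LenientLinkResolver {
    private LenientLinkResolver() {}

    static List<String> resolveAll(String base, List<String> candidates) {
        URL baseUrl;
        try {
            baseUrl = new URL(base);
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("bad base URL: " + base, e);
        }
        List<String> resolved = new ArrayList<>();
        for (String candidate : candidates) {
            try {
                resolved.add(new URL(baseUrl, candidate).toString());
            } catch (MalformedURLException e) {
                // Log and carry on with the next candidate instead of
                // abandoning the whole list.
                System.err.println("Skipping unparsable link: " + candidate);
            }
        }
        return resolved;
    }
}
```

With this shape, a candidate such as "javascript:void(0)", whose protocol java.net.URL does not recognize, is logged and skipped while the candidates after it are still resolved.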
[
https://issues.apache.org/jira/browse/NUTCH-426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12462307
]
[EMAIL PROTECTED] commented on NUTCH-426:
-
Just attached a patch that catches the MalformedURLException, logs
[
https://issues.apache.org/jira/browse/NUTCH-437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12472841
]
[EMAIL PROTECTED] commented on NUTCH-437:
-
+1. I reviewed and applied patch along with a hadoop-0.11.1-core