Re: [Nutch-dev] Exception Could not obtain new output block

2005-07-13 Thread [EMAIL PROTECTED]
Hello, NDFS is still under development, so I would not use it in production. You can use 'bin/nutch server' instead. Regards, Ferenc reetesh chandran wrote: Hello, We are running Nutch on 3 networked machines running Linux. We have Apache Tomcat running on all 3 machines. We are able to create a

Re: multiple website crawling

2005-07-21 Thread [EMAIL PROTECTED]
Please check your crawl-urlfilter.txt. If you use an older version of Nutch (e.g. 0.6 final), there is an entry that restricts the crawl to nutch.org. Feng (Michael) Ji wrote: hi there, If I put multiple web URLs in the plain text file urls in the following command, will it fetch

Re: IndexOptimizer bug?

2005-08-04 Thread [EMAIL PROTECTED]
about the background of this tool. Can anyone tell me what the idea behind it is? Regards Michael Andy Liu wrote: I believe this tool is unfinished and unsupported. On 7/22/05, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: I found an IndexOptimizer in nutch. When I run it, it drops

nutch 0.7 bug?

2005-09-01 Thread [EMAIL PROTECTED]
Dear Developers! I tested nutch 0.7 with all the parser plugins, and found the following: - The fetch is broken, e.g. by the following: - 050901 110915

Re: nutch 0.7 bug?

2005-09-09 Thread [EMAIL PROTECTED]
see the same errors. Since I saw a running installation yesterday, I think it's a configuration mistake. So far I have no idea where. Have you made any progress? Regards Michael [EMAIL PROTECTED] wrote: Dear Developers! I tested nutch 0.7 with all the parser plugins, and found

adding dmoz meta data to index.

2007-11-06 Thread [EMAIL PROTECTED]
Hi All, I need to add dmoz meta-data to my index. I see some people have commented about it but I didn't find a solution. Can someone read the steps below and give me some hints or pointers? This is the code that I added: 1) injector.java: datum.setCategory(dmoz-cat); 2) crawldatum.java: add

[jira] Created: (NUTCH-110) OpenSearchServlet outputs illegal xml characters

2005-10-12 Thread [EMAIL PROTECTED] (JIRA)
Reporter: [EMAIL PROTECTED] OpenSearchServlet does not check text-to-output for illegal XML characters; depending on the search result, it's possible for OSS to output XML that is not well-formed. For example, if the text has the FF character in it, i.e. the ASCII character

[jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal xml characters

2005-10-12 Thread [EMAIL PROTECTED] (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-110?page=all ] [EMAIL PROTECTED] updated NUTCH-110: Attachment: fixIllegalXmlChars.patch Attached patch runs all XML text through a check for bad XML characters. This patch is brutal, dropping silently
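
The kind of check described in this message can be sketched as follows. This is not the code from fixIllegalXmlChars.patch, which is not shown here; the class name `XmlText` and both method names are hypothetical, and the sketch assumes a Java 5+ runtime. The legal ranges are those of the XML 1.0 `Char` production. Like the patch as described, it silently drops anything outside them.

```java
// Hypothetical helper for illustration; NOT the code from the attached patch.
final class XmlText {
    private XmlText() {}

    /** True if the code point is allowed by the XML 1.0 Char production. */
    static boolean isLegalXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns the input with every illegal code point silently dropped. */
    static String stripIllegalXmlChars(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);   // handles surrogate pairs
            if (isLegalXmlChar(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);   // advance by 1 or 2 chars
        }
        return out.toString();
    }
}
```

Under these assumptions, a string containing a form feed (U+000C) comes back without it, while tabs and newlines pass through unchanged. Text filtered this way still needs the usual entity escaping before being written as XML.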

[jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal xml characters

2005-10-14 Thread [EMAIL PROTECTED] (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-110?page=all ] [EMAIL PROTECTED] updated NUTCH-110: Attachment: NUTCH-110-version2.patch Patch version 2. This patch benefits from discussion held up on nutch dev list. This patch differs from the first

[jira] Commented: (NUTCH-110) OpenSearchServlet outputs illegal xml characters

2005-11-10 Thread [EMAIL PROTECTED] (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-110?page=comments#action_12357300 ] [EMAIL PROTECTED] commented on NUTCH-110: - Scrub NUTCH-110-version2.patch. This patch double-encodes certain entities (first by the new toValidXmlText method, second

[jira] Created: (NUTCH-130) Be explicit about target JVM when building (1.4.x?)

2005-11-29 Thread [EMAIL PROTECTED] (JIRA)
Be explicit about target JVM when building (1.4.x?) --- Key: NUTCH-130 URL: http://issues.apache.org/jira/browse/NUTCH-130 Project: Nutch Type: Improvement Reporter: [EMAIL PROTECTED] Priority: Minor Below

[jira] Commented: (NUTCH-130) Be explicit about target JVM when building (1.4.x?)

2005-11-30 Thread [EMAIL PROTECTED] (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-130?page=comments#action_12358981 ] [EMAIL PROTECTED] commented on NUTCH-130: - Need to do the same for the plugin compile: $ /usr/local/bin/svn diff src/plugin/build-plugin.xml Index: src/plugin/build

[jira] Created: (NUTCH-190) ParseUtil drops reason for failed parse

2006-01-26 Thread [EMAIL PROTECTED] (JIRA)
ParseUtil drops reason for failed parse --- Key: NUTCH-190 URL: http://issues.apache.org/jira/browse/NUTCH-190 Project: Nutch Type: Bug Components: fetcher Versions: 0.8-dev Environment: linux Reporter: [EMAIL

[jira] Updated: (NUTCH-190) ParseUtil drops reason for failed parse

2006-01-26 Thread [EMAIL PROTECTED] (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-190?page=all ] [EMAIL PROTECTED] updated NUTCH-190: Attachment: ParseUtil_drops_failure_reason.patch Attached is a suggested patch against revision 369598. ParseUtil drops reason for failed parse

[jira] Commented: (NUTCH-190) ParseUtil drops reason for failed parse

2006-01-26 Thread [EMAIL PROTECTED] (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-190?page=comments#action_12364145 ] [EMAIL PROTECTED] commented on NUTCH-190: - Here's an example of failure output after patch is applied: 060126 141413 task_m_bx2ifn Error parsing: http

[jira] Created: (NUTCH-256) Cannot open filename ....index.done.crc

2006-04-27 Thread [EMAIL PROTECTED] (JIRA)
Cannot open filename index.done.crc --- Key: NUTCH-256 URL: http://issues.apache.org/jira/browse/NUTCH-256 Project: Nutch Type: Bug Components: indexer Versions: 0.8-dev Reporter: [EMAIL PROTECTED

[jira] Updated: (NUTCH-256) Cannot open filename ....index.done.crc

2006-04-27 Thread [EMAIL PROTECTED] (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-256?page=all ] [EMAIL PROTECTED] updated NUTCH-256: Attachment: index.done.crc.patch Ensure creation of companion index.done .crc file Cannot open filename index.done.crc

[jira] Created: (NUTCH-257) Summary#toString always Entity encodes -- problem for OpenSearchServlet#description field

2006-04-28 Thread [EMAIL PROTECTED] (JIRA)
Type: Bug Components: searcher Versions: 0.8-dev Reporter: [EMAIL PROTECTED] Priority: Minor All search result data we display in search results has to be explicitly Entity.encoded when output in search.jsp (title, url, etc.), except Summaries. It's already Entity.encoded

[jira] Commented: (NUTCH-257) Summary#toString always Entity encodes -- problem for OpenSearchServlet#description field

2006-04-28 Thread [EMAIL PROTECTED] (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-257?page=comments#action_12376997 ] [EMAIL PROTECTED] commented on NUTCH-257: - I took a closer look. Turns out Summary is inherently all about rendering HTML (See the different Summary.Fragment

[jira] Commented: (NUTCH-256) Cannot open filename ....index.done.crc

2006-04-28 Thread [EMAIL PROTECTED] (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-256?page=comments#action_12376999 ] [EMAIL PROTECTED] commented on NUTCH-256: - Works for me. Thanks. Please close as fixed. Cannot open filename index.done.crc

[jira] Created: (NUTCH-269) CrawlDbReducer: OOME because no upper-bound on inlinks count

2006-05-15 Thread [EMAIL PROTECTED] (JIRA)
CrawlDbReducer: OOME because no upper-bound on inlinks count Key: NUTCH-269 URL: http://issues.apache.org/jira/browse/NUTCH-269 Project: Nutch Type: Bug Reporter: [EMAIL PROTECTED] Priority

[jira] Updated: (NUTCH-269) CrawlDbReducer: OOME because no upper-bound on inlinks count

2006-05-15 Thread [EMAIL PROTECTED] (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-269?page=all ] [EMAIL PROTECTED] updated NUTCH-269: Attachment: too-many-links.patch Add configurable upper limit to the number of links we'll read. CrawlDbReducer: OOME because no upper-bound on inlinks

[jira] Updated: (NUTCH-269) CrawlDbReducer: OOME because no upper-bound on inlinks count

2006-05-15 Thread [EMAIL PROTECTED] (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-269?page=all ] [EMAIL PROTECTED] updated NUTCH-269: Attachment: too-many-links2.patch Previous patch is useless. This one actually breaks the loop. CrawlDbReducer: OOME because no upper-bound
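
The idea behind this fix, stop reading links once a configurable limit is reached and actually break out of the loop, might look roughly like the sketch below. It is not the contents of too-many-links2.patch; `BoundedLinks`, `readAtMost`, and the `maxLinks` parameter are hypothetical names, and a plain `Iterator` stands in for whatever values the reducer really receives.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical illustration; NOT the code from the attached patch.
final class BoundedLinks {
    private BoundedLinks() {}

    /**
     * Collects at most maxLinks items from the iterator, so memory use is
     * bounded no matter how many inlinks a popular page has.
     */
    static <T> List<T> readAtMost(Iterator<T> links, int maxLinks) {
        List<T> kept = new ArrayList<>();
        while (links.hasNext()) {
            if (kept.size() >= maxLinks) {
                break;  // leave the loop; remaining links are never read
            }
            kept.add(links.next());
        }
        return kept;
    }
}
```

Checking the limit before calling `next()` means nothing beyond the cap is pulled from the iterator. The trade-off is that which links survive depends on iteration order.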

[jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal xml characters

2006-06-16 Thread [EMAIL PROTECTED] (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-110?page=all ] [EMAIL PROTECTED] updated NUTCH-110: Attachment: fixIllegalXmlChars08-v3.patch Version of patch that doesn't ...process the String twice if it contains some illegal characters!. Its name

[jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal xml characters

2006-06-16 Thread [EMAIL PROTECTED] (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-110?page=all ] [EMAIL PROTECTED] updated NUTCH-110: Version: 0.8-dev (was: 0.7) Was version 0.7. Changed 'Affects Version' to 0.8-dev. OpenSearchServlet outputs illegal xml characters

[jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal xml characters

2006-06-19 Thread [EMAIL PROTECTED] (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-110?page=all ] [EMAIL PROTECTED] updated NUTCH-110: Attachment: fixIllegalXmlChars08-v4.patch v3 mistakenly included debugging code. Attached cleaned up v4. OpenSearchServlet outputs illegal xml

[jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal xml characters

2006-06-20 Thread [EMAIL PROTECTED] (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-110?page=all ] [EMAIL PROTECTED] updated NUTCH-110: Attachment: fixIllegalXmlChars08-v5.patch No, the double call to getLegalXml is not intentional. It's a mistake. Thanks for finding it. I've attached

[jira] Created: (NUTCH-423) Add other index-basic fields as query plugins

2006-12-28 Thread [EMAIL PROTECTED] (JIRA)
Versions: 0.9.0 Reporter: [EMAIL PROTECTED] Priority: Minor The basic indexer plugin adds 'host', 'site', 'url', 'content', 'title', and 'anchor'. The query-basic plugin expands queries against the 'default' field to run against all basic indexer plugin fields

[jira] Updated: (NUTCH-423) Add other index-basic fields as query plugins

2006-12-28 Thread [EMAIL PROTECTED] (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-423?page=all ] [EMAIL PROTECTED] updated NUTCH-423: Attachment: other-index-basic-query-fields.patch Add other index-basic fields as query plugins

[jira] Updated: (NUTCH-425) parse-js pollutes anchor text with base URL of source page

2007-01-04 Thread [EMAIL PROTECTED] (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] [EMAIL PROTECTED] updated NUTCH-425: Attachment: nutch425.patch parse-js pollutes anchor text with base URL of source page

[jira] Commented: (NUTCH-425) parse-js pollutes anchor text with base URL of source page

2007-01-04 Thread [EMAIL PROTECTED] (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12462291 ] [EMAIL PROTECTED] commented on NUTCH-425: - I took a look at what is passed to parse-js both when called from

[jira] Created: (NUTCH-426) parse-js skips parsing if found URL fails java.net.URL parse

2007-01-04 Thread [EMAIL PROTECTED] (JIRA)
Components: fetcher Affects Versions: 0.9.0 Reporter: [EMAIL PROTECTED] Priority: Minor The parse-js plugin in getJSLinks tries a regex looking for likely URLs against a string of javascript. Any matches that do not begin 'www' are given to java.net.URL with base URL to test

[jira] Updated: (NUTCH-426) parse-js skips parsing if found URL fails java.net.URL parse

2007-01-04 Thread [EMAIL PROTECTED] (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] [EMAIL PROTECTED] updated NUTCH-426: Attachment: nutch426.patch parse-js skips parsing if found URL fails java.net.URL parse

[jira] Commented: (NUTCH-426) parse-js skips parsing if found URL fails java.net.URL parse

2007-01-04 Thread [EMAIL PROTECTED] (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12462307 ] [EMAIL PROTECTED] commented on NUTCH-426: - Just attached a patch that catches the MalformedURLException, logs
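
A minimal sketch of the approach this comment describes: catch the MalformedURLException for one candidate, log it, and keep going so that a single bad match does not abort parsing of the rest. This is not nutch426.patch itself; `LinkResolver` and `resolveAll` are hypothetical names, logging to `System.err` stands in for the real logger, and the candidate strings are assumed to be the regex matches mentioned in the issue.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; NOT the code from the attached patch.
final class LinkResolver {
    private LinkResolver() {}

    /** Resolves each candidate against the base URL, skipping malformed ones. */
    static List<URL> resolveAll(String baseUrl, List<String> candidates) {
        final URL base;
        try {
            base = new URL(baseUrl);
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Bad base URL: " + baseUrl, e);
        }
        List<URL> resolved = new ArrayList<>();
        for (String candidate : candidates) {
            try {
                resolved.add(new URL(base, candidate));
            } catch (MalformedURLException e) {
                // Log and continue instead of abandoning the remaining candidates.
                System.err.println("Skipping malformed URL '" + candidate
                        + "': " + e.getMessage());
            }
        }
        return resolved;
    }
}
```

With the try/catch inside the loop, a candidate such as one with an unknown protocol is reported and skipped, and the remaining candidates are still resolved.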

[jira] Commented: (NUTCH-437) MapFile in Hadoop 0.10.2 has changed, must update references

2007-02-13 Thread [EMAIL PROTECTED] (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12472841 ] [EMAIL PROTECTED] commented on NUTCH-437: - +1. I reviewed and applied patch along with a hadoop-0.11.1-core