Re: mapred crawling exception - Job failed!
Yes, it was fixed. Just update your code from trunk.

On Wed, 2006-01-04 at 08:51 +0100, Andrzej Bialecki wrote:
> Lukas Vlcek wrote:
>> Hi, I am trying to use the latest nutch-trunk version, but I am facing an unexpected "Job failed!" exception. It seems that all the crawling work has already been done, but some threads are hung, which results in an exception after some timeout.
>
> This was fixed (or should be fixed :) in revision r365576. Please report if it doesn't fix it for you.
[bug] Re: NegativeArraySizeException in search server
Hi,

I got the same exception. The cause is the default value of the searcher.max.hits property in nutch-default.xml. The default is Integer.MAX_VALUE, but the class org.apache.lucene.util.PriorityQueue increments this value, and the next int after Integer.MAX_VALUE wraps around to -2147483648. You must decrease searcher.max.hits to fix this. But note: the PriorityQueue allocates an array of this size, so if too large a value is defined, an OutOfMemoryError occurs. Any ideas or suggestions on how to fix this?

Marko

On 04.01.2006 at 02:00, Gal Nitzan wrote:
> When trying to use the search server I get the error below. I use the trunk from today...
>
> 060104 025549 13 Server handler 0 on 9004 call error: java.io.IOException: java.lang.NegativeArraySizeException
> java.io.IOException: java.lang.NegativeArraySizeException
>     at org.apache.lucene.util.PriorityQueue.initialize(PriorityQueue.java:35)
>     at org.apache.lucene.search.HitQueue.<init>(HitQueue.java:23)
>     at org.apache.lucene.search.TopDocCollector.<init>(TopDocCollector.java:47)
>     at org.apache.nutch.searcher.LuceneQueryOptimizer$LimitedCollector.<init>(LuceneQueryOptimizer.java:52)
>     at org.apache.nutch.searcher.LuceneQueryOptimizer.optimize(LuceneQueryOptimizer.java:153)
>     at org.apache.nutch.searcher.IndexSearcher.search(IndexSearcher.java:93)
>     at org.apache.nutch.searcher.NutchBean.search(NutchBean.java:155)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>     at java.lang.reflect.Method.invoke(Method.java:324)
>     at org.apache.nutch.ipc.RPC$1.call(RPC.java:186)
>     at org.apache.nutch.ipc.Server$Handler.run(Server.java:200)
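
The wrap-around described in this message can be reproduced in isolation. The following is only a minimal sketch: the class and method names are made up for illustration, and it merely mimics a queue that reserves one slot more than the requested size; it is not the Lucene PriorityQueue code.

```java
// Illustrative only: mimics "requested size + 1" allocation, not Lucene's code.
final class MaxHitsOverflowDemo {

    // A heap-style queue typically reserves one extra slot beyond maxHits.
    static int heapSizeFor(int maxHits) {
        return maxHits + 1; // silently wraps when maxHits == Integer.MAX_VALUE
    }

    // Returns true if allocating the backing array fails with a negative size.
    static boolean allocationFails(int maxHits) {
        try {
            Object[] heap = new Object[heapSizeFor(maxHits)];
            return heap.length < 0; // never true; only keeps the array in use
        } catch (NegativeArraySizeException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(heapSizeFor(Integer.MAX_VALUE));     // -2147483648
        System.out.println(allocationFails(Integer.MAX_VALUE)); // true
        System.out.println(allocationFails(1000));              // false
    }
}
```

Any value small enough for the array to fit in the heap avoids both the negative size and the OutOfMemoryError mentioned above; what a sensible upper bound is depends on the deployment.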
Re: mapred crawling exception - Job failed!
Hmmm... Looking at my local SVN copy, I see that I last updated yesterday, so I have revision 365850 (Update of HTTPClient to v3.0). So this should already be fixed... :-(

Andrzej, since you probably made the fix, is there anything special I should check to be sure I have the fixed version?

Anyway, I will update from SVN again today and give it another try tonight. I will let you know tomorrow.

Thanks,
Lukas

On 1/4/06, Gal Nitzan [EMAIL PROTECTED] wrote:
> Yes, it was fixed. Just update your code from trunk.
>
> On Wed, 2006-01-04 at 08:51 +0100, Andrzej Bialecki wrote:
>> Lukas Vlcek wrote:
>>> Hi, I am trying to use the latest nutch-trunk version, but I am facing an unexpected "Job failed!" exception. It seems that all the crawling work has already been done, but some threads are hung, which results in an exception after some timeout.
>>
>> This was fixed (or should be fixed :) in revision r365576. Please report if it doesn't fix it for you.
Re: mapred crawling exception - Job failed!
It is fixed in the copy I run: I've been able to get my 100k pages indexed without getting that error.

-byron

--- Andrzej Bialecki [EMAIL PROTECTED] wrote:
> Lukas Vlcek wrote:
>> Hi, I am trying to use the latest nutch-trunk version, but I am facing an unexpected "Job failed!" exception. It seems that all the crawling work has already been done, but some threads are hung, which results in an exception after some timeout.
>
> This was fixed (or should be fixed :) in revision r365576. Please report if it doesn't fix it for you.
>
> --
> Best regards,
> Andrzej Bialecki
> Information Retrieval, Semantic Web
> Embedded Unix, System Integration
> http://www.sigram.com Contact: info at sigram dot com
Re: mapred crawling exception - Job failed!
Thanks guys! I really didn't have the latest copy...

L.

On 1/4/06, Byron Miller [EMAIL PROTECTED] wrote:
> It is fixed in the copy I run: I've been able to get my 100k pages indexed without getting that error.
>
> -byron
>
> --- Andrzej Bialecki [EMAIL PROTECTED] wrote:
>> Lukas Vlcek wrote:
>>> Hi, I am trying to use the latest nutch-trunk version, but I am facing an unexpected "Job failed!" exception. It seems that all the crawling work has already been done, but some threads are hung, which results in an exception after some timeout.
>>
>> This was fixed (or should be fixed :) in revision r365576. Please report if it doesn't fix it for you.
>>
>> --
>> Best regards,
>> Andrzej Bialecki
[jira] Created: (NUTCH-163) LogFormatter design
LogFormatter design
---
Key: NUTCH-163
URL: http://issues.apache.org/jira/browse/NUTCH-163
Project: Nutch
Type: Improvement
Environment: All platforms
Reporter: Daniel Feinstein

In the Nutch project, LogFormatter serves two separate purposes:
1) formatting logger records, and
2) handling severe errors.

The first usage is standard and can usually be overridden by a user of the package by modifying the logging.properties file. The second usage is much more problematic because it affects the behavior of the whole application (not only the Nutch package). To support the error handling, LogFormatter forces every class of any application that uses the Nutch package to use this formatter class. This is done by overwriting all the system handlers (class java.util.logging.Handler), which prevents the application from using its own log formatter. It also makes LogFormatter.hasLoggedSevere() sensitive to all severe records in a big system, not only to the relevant ones. Moreover, the flag LogFormatter.loggedSevere is never cleared, which means that if an application has logged even one unrelated severe record, tools like Fetcher will never run until the application is restarted.

I would like to suggest the following solutions:
1) separate the functionality of log formatting and error handling, or
2) change the LogFormatter class so that it is affected only by Nutch package functions.

In my opinion the first solution is much better, especially if error handling is encapsulated for each task. I have found the following usages of LogFormatter.hasLoggedSevere():
- Fetcher
- URLFilterChecker
- ParseSegment

Unfortunately I'm not familiar enough with the usages above to implement this solution; that is why I suggest the second one. I have written my own implementation of the LogFormatter class, which has been used for more than a year in the www.rawsugar.com application. I could provide the file, but I do not know how to attach it to the issue. I hope this change will be accepted by the community.
-- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
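
The second solution proposed in NUTCH-163 (tracking severe records only for the Nutch package, without touching the application's own handlers) could look roughly like the sketch below. It is not the reporter's implementation, which is not shown in the issue; SevereTracker and its methods are hypothetical names, and only standard java.util.logging calls are used.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

// Hypothetical handler: remembers whether a SEVERE record was seen under one
// logger namespace, leaving every other handler and formatter untouched.
final class SevereTracker extends Handler {
    private final AtomicBoolean loggedSevere = new AtomicBoolean(false);
    // Strong reference, so the logger (and this handler) is not garbage collected.
    private final Logger logger;

    private SevereTracker(Logger logger) {
        this.logger = logger;
    }

    // Attaches a tracker to a single namespace, e.g. a package name.
    static SevereTracker attachTo(String loggerName) {
        Logger logger = Logger.getLogger(loggerName);
        SevereTracker tracker = new SevereTracker(logger);
        logger.addHandler(tracker);
        return tracker;
    }

    @Override
    public void publish(LogRecord record) {
        if (record.getLevel().intValue() >= Level.SEVERE.intValue()) {
            loggedSevere.set(true);
        }
    }

    @Override public void flush() { }
    @Override public void close() { }

    boolean hasLoggedSevere() { return loggedSevere.get(); }

    // Lets a long-running application clear the flag between tasks.
    void reset() { loggedSevere.set(false); }

    void detach() { logger.removeHandler(this); }
}
```

Because child loggers pass their records up to parent handlers by default, a tracker attached to a package-level logger sees severe records from loggers below it, while records from unrelated parts of the application never reach it. The reset() method addresses the "never cleared" complaint.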
no static NutchConf
Hi,

To move forward in the direction of having a Nutch GUI, I would love to start removing the static access to NutchConf. Based on experience, I would first like to get general agreement and a 'go' before wasting too much time on a solution that is not accepted.

I suggest:
+ Removing NutchConf.get().
+ Where a lower-level object uses only one or two, but no more than three, parameters from the Nutch configuration, we add these parameters to the constructor of the object (e.g. MapFile.Reader needs only the parameter INDEX_SKIP).
+ For higher-level objects like the fetcher tool, which need more than three parameters for their lower-level objects, we add an instance of NutchConf to the constructor.
+ For all dynamically used objects that implement a specific interface (an interface gives no control over the object's constructor), we use the Configurable interface to set the NutchConf in an inversion-of-control style (e.g. plugin extension implementations like Parser or Protocols).
+ PluginRegistry will no longer be a singleton but will get a constructor that takes a NutchConf instance.
+ Getting an Extension also requires a NutchConf, which is injected if the extension object (e.g. a Parser) implements the Configurable interface.

Any comments, improvement suggestions, more use cases? I would love to do this job; can I get a go from the other developers? From my point of view the static NutchConf is actually a showstopper, since a lot of people run into trouble integrating Nutch into other projects. My suggestions are also required in order to write a Nutch GUI.

Stefan
Re: no static NutchConf
Stefan Groschupf wrote:
> Hi, to move forward in the direction of having a nutch gui, I would love to start removing the static access of NutchConf. Based on experience, I would first like to get general agreement and a 'go' before wasting too much time on a solution that is not accepted.

I agree with the general direction. Some comments below:

> I suggest:
> + removing NutchConf.get().

I'm not sure about this... Somewhere you need to instantiate the default config, and this looks like a good place.

> + in case a lower level object uses only one or two, but not more than 3, parameters from the nutch configuration, we add these parameters to the constructor of this object (e.g. MapFile.Reader needs only the parameter INDEX_SKIP).

I don't fully agree with this. In most such cases, you already have a NutchConf instance in the method or class context, so it makes sense to use it in the constructor. You could add these constructors with all parameters listed explicitly, but I'd expect that the constructors using NutchConf would be used most frequently.

> + for higher level objects like the fetcher tool, which need more than 3 parameters for the lower level objects, we add an instance of NutchConf to the constructor.

Ok.

> + for all dynamically used objects that implement a specific interface (an interface gives no control over the object constructor) we use the Configurable interface to set the NutchConf in an inversion-of-control style (e.g. plugin extension implementations like Parser or Protocols).

Ok.

> + PluginRegistry will no longer be a singleton but will get a constructor with a NutchConf instance.

Definitely yes.

> + Getting an Extension also requires a NutchConf, which is injected in case the extension object (e.g. a Parser) implements the Configurable interface.

Yes. If you remember our discussion, I'd also like to follow a pattern where such instances are cached inside this NutchConf instance, if appropriate (i.e. if they are reusable and multi-threaded).

> Any comments, improvement suggestions, more use-cases? I would love to do this job, can I get a go from the other developers?

+1 from me.

--
Best regards,
Andrzej Bialecki
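
The last two points of the proposal (a registry built from a configuration instance, which injects that configuration into extensions implementing Configurable) can be sketched as follows. This is an assumption-laden illustration: Conf, Registry, Parser and HtmlParser are stand-ins invented here, not the Nutch classes, and the real Configurable interface may differ in its signature.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for a configuration object; not the real NutchConf.
final class Conf {
    private final Map<String, String> values = new HashMap<>();
    Conf set(String key, String value) { values.put(key, value); return this; }
    String get(String key, String fallback) { return values.getOrDefault(key, fallback); }
}

// Assumed shape of the injection interface.
interface Configurable {
    void setConf(Conf conf);
}

interface Parser {
    String describe();
}

// Example extension; "parser.charset" is a made-up key.
final class HtmlParser implements Parser, Configurable {
    private Conf conf;
    @Override public void setConf(Conf conf) { this.conf = conf; }
    @Override public String describe() { return "html/" + conf.get("parser.charset", "utf-8"); }
}

// A registry constructed with a Conf instead of being a JVM-wide singleton.
final class Registry {
    private final Conf conf;

    Registry(Conf conf) { this.conf = conf; }

    <T> T newExtension(Class<T> type) {
        try {
            T instance = type.getDeclaredConstructor().newInstance();
            if (instance instanceof Configurable) {
                ((Configurable) instance).setConf(conf); // inversion of control
            }
            return instance;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot create " + type.getName(), e);
        }
    }
}
```

Two Registry instances built from different Conf objects can then live in the same JVM and hand out differently configured parsers, which is the property the static accessor prevents.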
Re: no static NutchConf
> I don't fully agree with this. In most such cases, you already have a NutchConf instance in the method or class context, so it makes sense to use it in the constructor. You could add these constructors with all parameters listed explicitly, but I'd expect that the constructors using NutchConf would be used most frequently.

My idea is to make it possible to use the low-level things outside of Nutch as well. It may be a philosophical question: in the case of the map file writer, you pass a complete hashmap with a bunch of properties to the object, but the object only reads one int from this hashmap. I personally don't like to use a hashmap to 'transport' just one value. So my suggestion looks like:

    new MapFile.Reader(parameterA, nutchConf.getInt(parameterKey, 0));

If I understand you correctly, you prefer:

    new MapFile.Reader(parameterA, nutchConf);
    ...
    public MapFile(...) {
        this.parameter = nutchConf.getInt(parameterKey, 0);
    }

As mentioned, this is more a question of code philosophy and is not important to me; my only idea was to decouple things as much as possible if we touch the code anyway.

>> + Getting an Extension also requires a NutchConf, which is injected in case the extension object (e.g. a Parser) implements the Configurable interface.
>
> Yes. If you remember our discussion, I'd also like to follow a pattern where such instances are cached inside this NutchConf instance, if appropriate (i.e. if they are reusable and multi-threaded).

I'm afraid I still do not clearly understand your idea here. As discussed, from my point of view it makes no sense to cache any objects in a NutchConf. In particular, extension implementations like parsers are multithreaded and exist as many times as we have threads. Caching would make more sense behind the scenes of the plugin registry, but it may be difficult, since you can run into trouble with resource life-cycle management. PluginClass instances are already cached and work like a kind of singleton for each existing plugin registry.

Also, I see some trouble with using this caching mechanism, since NutchConf can be serialized. Actually I have no idea where this serialization is used, but I guess distributed map reduce will use it heavily. So the cached objects would need to be Serializable as well.

Stefan
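
The two constructor styles debated in this message can coexist in one class, as in the sketch below. IndexReaderSketch and the key "index.skip" are hypothetical, and java.util.Properties stands in for NutchConf; nothing here reflects the actual MapFile.Reader signature.

```java
import java.util.Properties;

// Hypothetical low-level reader offering both constructor styles.
final class IndexReaderSketch {
    static final String SKIP_KEY = "index.skip"; // assumed key name

    private final String path;
    private final int indexSkip;

    // Style 1: the caller extracts the single value, so this class does not
    // depend on any configuration type and is usable outside the framework.
    IndexReaderSketch(String path, int indexSkip) {
        this.path = path;
        this.indexSkip = indexSkip;
    }

    // Style 2: the object pulls what it needs from the whole configuration.
    // Delegating to style 1 keeps the parsing logic in a single place.
    IndexReaderSketch(String path, Properties conf) {
        this(path, Integer.parseInt(conf.getProperty(SKIP_KEY, "0")));
    }

    String path() { return path; }

    int indexSkip() { return indexSkip; }
}
```

Offering both is one way to reconcile the two positions: callers that already hold a configuration use the second constructor, while code outside the framework uses the first.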
Re: IndexSorter optimizer
Byron Miller wrote:
> On optimizing performance, does anyone know if google is exporting its entire dataset as an index or only somehow indexing the topN % (since they only show the first 1000 or so results anyway)

Both. The highest-scoring pages are kept in separate indexes that are searched first. When a query fails to match 1000 or so documents in the high-scoring indexes then the entire dataset is searched. In general there can be multiple levels, e.g.: high-scoring, mid-scoring and low-scoring indexes, with the vast majority of pages in the last category, and the vast majority of queries resolved consulting only the first category.

What I have implemented so far for Nutch is a single-index version of this. The current index-sorting implementation does not yet scale well to indexes larger than ~50M urls. It is a proof-of-concept.

A better long-term approach is to introduce another MapReduce pass that collects Lucene documents (or equivalent) as values, and page scores as keys. Then the indexing MapReduce pass can partition and sort by score before creating indexes. The distributed search code will also need to be modified to search high-score indexes first.

Doug
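
The multi-level lookup described above can be sketched as a loop that consults tiers from highest to lowest score and stops once enough hits are collected. This is only an illustration of the idea under simplifying assumptions: IndexTier and TieredSearcher are invented names, the tiers are assumed to hold disjoint sets of documents, and ranking within a tier is ignored.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical abstraction over one index (e.g. high-, mid- or low-scoring).
interface IndexTier {
    List<String> search(String query);
}

final class TieredSearcher {
    private final List<IndexTier> tiers; // ordered from highest to lowest score
    private final int wanted;            // e.g. 1000 in the description above

    TieredSearcher(List<IndexTier> tiers, int wanted) {
        this.tiers = tiers;
        this.wanted = wanted;
    }

    List<String> search(String query) {
        List<String> hits = new ArrayList<>();
        for (IndexTier tier : tiers) {
            hits.addAll(tier.search(query));
            if (hits.size() >= wanted) {
                break; // lower-scoring tiers are never touched for this query
            }
        }
        return hits.size() > wanted ? new ArrayList<>(hits.subList(0, wanted)) : hits;
    }
}
```

If most queries are satisfied by the first tier, the large low-scoring indexes are searched only for rare queries, which is where the performance gain comes from.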
Re: no static NutchConf
Jérôme Charron wrote:
> Excuse me in advance, I probably missed something, but what are the use cases for having many NutchConf instances with different values?

Running many different tasks in parallel, each using different config, inside the same JVM.

--
Best regards,
Andrzej Bialecki
RE: no static NutchConf
If you are going to be able to reconfigure a Nutch component at runtime, you need to remove any configuration from the constructor and have a method that allows you to get/set the configuration for the component.

The problem with keeping the entire configuration in a single component is displaying and filtering the configuration information for the user, so that the user knows which component is being configured. Eclipse has a very good pattern for handling configuration for each of its components. Basically, each component is responsible for its own configuration, and the tool just provides the framework that allows the configuration to be displayed, updated, and stored. The drawback of that approach is that you really don't have a GUI, or at least have to be able to run without one.

I think that, at the very least, removing the configuration information from the constructor is the first step. You can still have a properties object set the configuration. Then we can discuss the relative merits of displaying, changing, and storing the configuration (for example, how a user is supposed to know which component is affected by which property).

Thanks,
Steve Betts

-----Original Message-----
From: Stefan Groschupf [mailto:[EMAIL PROTECTED]
Sent: Wednesday, January 04, 2006 12:22 PM
To: nutch-dev@lucene.apache.org
Subject: Re: no static NutchConf

[...]
Re: no static NutchConf
>> Excuse me in advance, I probably missed something, but what are the use cases for having many NutchConf instances with different values?
>
> Running many different tasks in parallel, each using different config, inside the same JVM.

OK, I understand this, Andrzej, but it is not really what I would call a use case. It is more a feature that you describe here. What I mean is that I don't understand in which cases it would be useful. And I don't understand how a particular NutchConf would be selected for a particular task...

Regards,
Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/
Re: IndexSorter optimizer
Doug Cutting wrote:
> Byron Miller wrote:
>> On optimizing performance, does anyone know if google is exporting its entire dataset as an index or only somehow indexing the topN % (since they only show the first 1000 or so results anyway)
>
> Both. The highest-scoring pages are kept in separate indexes that are searched first. When a query fails to match 1000 or so documents in the high-scoring indexes then the entire dataset is searched. In general there can be multiple levels, e.g.: high-scoring, mid-scoring and low-scoring indexes, with the vast majority of pages in the last category, and the vast majority of queries resolved consulting only the first category.
>
> What I have implemented so far for Nutch is a single-index version of this. The current index-sorting implementation does not yet scale well to indexes larger than ~50M urls. It is a proof-of-concept.
>
> A better long-term approach is to introduce another MapReduce pass that collects Lucene documents (or equivalent) as values, and page scores as keys. Then the indexing MapReduce pass can partition and sort by score before creating indexes. The distributed search code will also need to be modified to search high-score indexes first.

The WWW2005 conference presented a couple of interesting papers on the subject (http://www2005.org), among others these:

1. http://www2005.org/cdrom/docs/p235.pdf
2. http://www2005.org/cdrom/docs/p245.pdf
3. http://www2005.org/cdrom/docs/p257.pdf

The techniques described in the first paper are not too difficult to implement, especially Carmel's method of index pruning, which gives satisfactory results at moderate cost. The third paper, by Long and Suel, presents the concept of using a cache of intersections for multi-term queries, which we already sort of use with CachingFilters, only they propose to store them on disk instead of limiting the cache to a relatively small number of filters kept in RAM...

--
Best regards,
Andrzej Bialecki
Re: no static NutchConf
Jérôme Charron wrote:
>>> Excuse me in advance, I probably missed something, but what are the use cases for having many NutchConf instances with different values?
>>
>> Running many different tasks in parallel, each using different config, inside the same JVM.
>
> Ok, I understand this Andrzej, but it is not really what I call a use case. It is more a feature that you describe here. In fact, what I mean is that I don't understand in which cases it will be useful. And I don't understand how a particular NutchConf will be selected for a particular task...

Use case: executing multiple tasks on any single tasktracker node, but with drastically different configurations for each task.

Example: what happens now if you try to run more than one fetcher at the same time, where the fetcher parameters differ (or a set of activated plugins differs)? You can't - the local tasks on each tasktracker will use whatever local config is there. What happens if you change the config on a node that submits the job? The changes won't be propagated to the tasktracker nodes, because tasktrackers use local configuration (through the singleton NutchConf.get()) instead of being supplied with a serialized/deserialized instance of the config from the originating node... etc.

NutchConf instances will be created when you create a JobConf. Then they will have to be serialized/deserialized when job descriptors are sent by the jobtracker to tasktrackers on mapred nodes, and used locally by tasktrackers to instantiate local tasks using copies of the original NutchConf instance.

--
Best regards,
Andrzej Bialecki
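
The serialize-and-ship flow described in this reply can be illustrated with plain java.util.Properties. This is a rough sketch only: JobConfSketch is a made-up name, Properties stands in for NutchConf, and the actual wire format used between jobtracker and tasktrackers is not specified in the thread.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper: the submitting node serializes its configuration, and
// each task deserializes its own private copy instead of reading a singleton.
final class JobConfSketch {

    // Runs on the node that submits the job.
    static byte[] serialize(Properties conf) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            conf.store(out, null);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Runs on the tasktracker when it instantiates a local task.
    static Properties deserialize(byte[] data) {
        Properties copy = new Properties();
        try {
            copy.load(new ByteArrayInputStream(data));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return copy;
    }
}
```

Two jobs with different settings can then run side by side in one JVM, each task holding its own copy, and later changes on the submitting node do not leak into tasks that are already running.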
[jira] Commented: (NUTCH-164) Locale (language) choice by first session has global effect to all sessions
[ http://issues.apache.org/jira/browse/NUTCH-164?page=comments#action_12361782 ]

KuroSaka TeruHiko commented on NUTCH-164:
-

Actually, the current language selection scheme needs an overhaul. The locale for the message bundle is determined only by the preferred language setting of the browser, while the selection of the localized JSP is done by clicking on the language code link at the bottom of each page. There is no coordination. The language chosen via the language code link does not persist in the session. See the discussion about locale selection at W3C: http://www.w3.org/International/questions/qa-accept-lang-locales#answer

Locale (language) choice by first session has global effect to all sessions
---
Key: NUTCH-164
URL: http://issues.apache.org/jira/browse/NUTCH-164
Project: Nutch
Type: Bug
Components: web gui
Versions: 0.7.1
Environment: any
Reporter: KuroSaka TeruHiko

Here's a report posted on the nutch-users ML by Sergio [EMAIL PROTECTED] on 1/02/2006:

I just installed nutch on a Fedora Core 3 server. Once installed, I crawled a small site to test it. I opened my navigator (mozilla 1.7, which reports ES-ES locales by default), and everything was ok. Then I asked a friend of mine (the owner of the server) to test it. He did a search with an EN-US locale navigator, and the search page appeared in Spanish. After a few hours, I did the following: I restarted tomcat, I changed the locale of my mozilla to EN, and I opened the search page. Now I always get the English search page, even if I open it with a mozilla ES-ES locale.

I wrote a message to my friend: nutch keeps the locale of the first navigator that makes a request for all other requests. For this reason, as yesterday's first request was from my ES locale browser, you saw the page in Spanish with your browser that reports the EN locale. There is a way to make this work:
* Making sure that, after the server is restarted, the first request is done by a browser that reports the EN locale.

This happened in my environment too.
After taking a look at the code, I believe this is caused by the use of the default message bundle in search.jsp. The code snippet looks like:

    <i18n:bundle baseName="org.nutch.jsp.search"/>
    ...
    <title>Nutch: <i18n:message key="title"/></title>
    ...

The default message bundle probably has application scope. Because of that, the first language setting has a global effect on every session created afterward. The right fix is to limit the scope to the session by inserting the scope specifier, as in:

    <i18n:bundle scope="session" baseName="org.nutch.jsp.search"/>

Other JSP files need to be inspected for the same issue and fixed as well.
Re: no static NutchConf
+1 in general.

In fact I like the approach presented by Stefan of passing only the required parameters, instead of a NutchConf, to objects that have a small number of configurable params. It makes it obvious which parameters such basic objects require in order to run, and since they are usually building blocks for something bigger, it makes it easier to reuse them with different params in different parts of the code. But I like the direction and will not oppose passing the whole NutchConf in this case.

Regards,
Piotr
Re: svn commit: r365850 - in /lucene/nutch/trunk/src/plugin/protocol-httpclient: ./ lib/ src/java/org/apache/nutch/protocol/httpclient/
Andrzej,

Do you think it would be a good idea to commit it to the 0.7 branch for the 0.7.2 release? I personally prefer to use released libraries instead of RCs if possible. It does not require a lot of changes, and you have already tested it with the existing code...

Piotr

[EMAIL PROTECTED] wrote:
> Author: ab
> Date: Tue Jan 3 23:32:04 2006
> New Revision: 365850
>
> URL: http://svn.apache.org/viewcvs?rev=365850&view=rev
> Log:
> Update Commons HTTPClient to v. 3.0. Add some default headers to prefer HTML content, and in English.
Re: svn commit: r365850 - in /lucene/nutch/trunk/src/plugin/protocol-httpclient: ./ lib/ src/java/org/apache/nutch/protocol/httpclient/
Piotr Kosiorowski wrote:
> Andrzej,
> Do you think it would be a good idea to commit it in 0.7 branch for 0.7.2 release? I personally prefer to use released libraries instead of RC if possible. It does not require a lot of changes and you have already tested it with existing code...
> Piotr

I didn't see any problems, I think you can go ahead.

--
Best regards,
Andrzej Bialecki
Re: no static NutchConf
Hi,

Stefan Groschupf wrote:
> [...]
> Any comments, improvement suggestions, more use-cases?

I completely agree with you. I have two more ideas:

1) Make NutchConf an interface (not a class).
2) Make it work as a plugin.

1) If NutchConf is an interface, a NutchConf implementation can be written with a hashmap in mind (like now), or with JMX or commons-configuration.

2) There are only 4 configuration options (plugin.excludes, plugin.includes, plugin.folders, plugin.auto-activation) that the plugin registry requires in order to start up. If these options are provided by a bootstrap configuration, configuration plugins become possible.

If help is needed, I would like to implement a JMX implementation of NutchConf (since I will need it myself ;).

Regards,
Thomas
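
The first idea (a configuration interface with interchangeable back ends) could be sketched as below. ConfSource, MapConf and SystemPropertyConf are invented names for illustration; the thread does not define the methods such an interface would have, so the two accessors shown are an assumption.

```java
import java.util.HashMap;
import java.util.Map;

// Assumed minimal contract; the real method set would need to be agreed on.
interface ConfSource {
    String get(String key, String fallback);

    // Shared convenience logic lives in the interface, not in each back end.
    default int getInt(String key, int fallback) {
        String raw = get(key, null);
        return raw == null ? fallback : Integer.parseInt(raw.trim());
    }
}

// Hashmap-backed implementation, close in spirit to the current behavior.
final class MapConf implements ConfSource {
    private final Map<String, String> values = new HashMap<>();

    MapConf set(String key, String value) {
        values.put(key, value);
        return this;
    }

    @Override
    public String get(String key, String fallback) {
        return values.getOrDefault(key, fallback);
    }
}

// Alternative back end reading JVM system properties; a JMX- or
// database-backed implementation would plug in the same way.
final class SystemPropertyConf implements ConfSource {
    @Override
    public String get(String key, String fallback) {
        return System.getProperty(key, fallback);
    }
}
```

Classes that accept a ConfSource in their constructor then work unchanged regardless of where the values actually come from.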
Re: no static NutchConf
Andrzej Bialecki wrote:
> Example: what happens now if you try to run more than one fetcher at the same time, where the fetcher parameters differ (or a set of activated plugins differs)? You can't - the local tasks on each tasktracker will use whatever local config is there.

That's true when mapred.job.tracker=local, but when things are distributed the config can vary, since each task is spawned in a separate JVM with a separate classpath. The nutch-site.xml on each node can never be overridden. For example, so long as plugin.includes is not specified in nutch-site.xml on each node, each task can override plugin.includes to use different plugins.

Also note that plugin implementations can be submitted in a jar file with the job, and plugin.folders can be overridden in the job to find the new plugins. So a job jar might include a folder named my.plugins and set plugin.folders to "my.plugins, plugins", then alter plugin.includes to include job-specific plugins.

> What happens if you change the config on a node that submits the job? The changes won't be propagated to the tasktracker nodes, because tasktrackers use local configuration (through a singleton NutchConf.get()), instead of supplying a serialized/deserialized instance of the config from the originating node... etc.

Again, I'm not sure this is a problem. Properties that tasks should be able to override should not be specified in nutch-site.xml, but rather in mapred-default.xml. Lots of job-specific properties are currently passed this way.

Another use case for eliminating the static uses of NutchConf is to simplify the construction of a configuration GUI. It would be nice to have a web-based interface which permits one to configure parameters and then have it run the system. This should be able to run multiple Nutch instances in a single JVM. For example, a single Nutch-based search appliance daemon should be able to crawl and search both your intranet and your public websites, each configured separately.

Doug
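
The override precedence described in this message (node-local site settings always win, a job may override anything else, defaults apply last) can be written down as a small lookup chain. This is a sketch of the described behavior only; LayeredConf is a made-up class and is not how NutchConf actually loads its resource files.

```java
import java.util.Properties;

// Hypothetical three-layer lookup mirroring the precedence described above.
final class LayeredConf {
    private final Properties defaults; // e.g. values from the default files
    private final Properties job;      // values shipped with the job
    private final Properties site;     // node-local values, never overridden

    LayeredConf(Properties defaults, Properties job, Properties site) {
        this.defaults = defaults;
        this.job = job;
        this.site = site;
    }

    String get(String key) {
        String value = site.getProperty(key);
        if (value != null) {
            return value;                  // site settings always win
        }
        value = job.getProperty(key);
        if (value != null) {
            return value;                  // the job overrides the defaults
        }
        return defaults.getProperty(key);  // may be null if unset everywhere
    }
}
```

Under this model, a property such as plugin.includes stays job-overridable exactly as long as no node pins it in its site file.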
[jira] Closed: (NUTCH-142) NutchConf should use the thread context classloader
[ http://issues.apache.org/jira/browse/NUTCH-142?page=all ]

Piotr Kosiorowski closed NUTCH-142:
---
Fix Version: 0.7.2-dev 0.8-dev
Resolution: Fixed

NutchConf should use the thread context classloader
---
Key: NUTCH-142
URL: http://issues.apache.org/jira/browse/NUTCH-142
Project: Nutch
Type: Improvement
Versions: 0.7
Reporter: Mike Cannon-Brookes
Fix For: 0.7.2-dev, 0.8-dev

Right now NutchConf uses its own static classloader, which is _evil_ in a J2EE scenario. This is simply fixed. Line 52:

    private ClassLoader classLoader = NutchConf.class.getClassLoader();

should be:

    private ClassLoader classLoader = Thread.currentThread().getContextClassLoader();

This means that no matter where the Nutch classes are loaded from, NutchConf will use the correct J2EE classloader to try to find configuration files (i.e. from WEB-INF/classes).
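
A slightly more defensive variant of the fix in NUTCH-142 is sketched below. The fallback to the class's own loader when no context classloader is set is an addition made here for illustration, not part of the issue, and ResourceLocator is a hypothetical helper name.

```java
import java.net.URL;

// Hypothetical helper showing resource lookup via the thread context classloader.
final class ResourceLocator {

    // Prefers the context classloader (set per web application by a J2EE
    // container) and falls back to the loader that defined this class.
    static ClassLoader effectiveLoader() {
        ClassLoader contextLoader = Thread.currentThread().getContextClassLoader();
        return contextLoader != null ? contextLoader : ResourceLocator.class.getClassLoader();
    }

    // Returns the URL of a configuration file on the classpath, or null if absent.
    static URL find(String resourceName) {
        return effectiveLoader().getResource(resourceName);
    }
}
```

In a servlet container the context classloader normally sees WEB-INF/classes, so a call such as ResourceLocator.find("nutch-site.xml") would pick up the web application's own copy of the file.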
Re: no static NutchConf
Hi Stefan, I think these are fine things to be doing. Just two points: (1) Why not just always pass the NutchConf to the constructor of any class that needs it, instead of distinguishing between classes that use one or two configuration parameters and classes that use more? That would be more consistent. Also, it's possible that a class that CURRENTLY only uses 2 configuration parameters will use 3 or 4 at some point in the future, and it would be a shame to have to rewrite its constructor when that happens. (2) What I'd REALLY like to see is NutchConf as an interface, with methods that allow the retrieval of properties from any source. There could be a class NutchXmlConf which implements the NutchConf interface and works the current way (with nutch-default.xml, nutch-site.xml and so on). Where we need to create a NutchConf, we actually create a NutchXmlConf, but pass it to class constructors whose arguments are of type NutchConf. That way, if I want to use a non-standard mechanism for storing my Nutch parameters (e.g. a properties file, a relational database, the Windows Registry, whatever), I can write my own class that implements the NutchConf interface, then instantiate it and pass it around, without having to rewrite every Nutch class that uses it. The benefits of (2) are legion, in particular for people who want to use a Nutch search engine as part of an existing web application that uses a specific (non-XML) mechanism for storing configuration parameters. It would also give extra flexibility to people working on Nutch installations that sit in multiple environments (Development, System Test, UAT, Production etc.) and get deployed from one environment to the next. Regards, David. From: Stefan Groschupf [EMAIL PROTECTED] Date: Wed, 4 Jan 2006 15:39:38 +0100 Subject: [Nutch-dev] no static NutchConf Hi, to move forward in the direction of having a Nutch GUI, I would love to start removing the static access of NutchConf.
Based on experience, I would first like to get some general agreement and a 'go' before wasting too much time on a solution that is not accepted. I suggest:
+ Remove NutchConf.get().
+ Where a lower-level object uses only one or two, and no more than three, parameters from the Nutch configuration, add those parameters to the constructor of that object (e.g. MapFile.Reader needs only the parameter INDEX_SKIP).
+ For higher-level objects such as the fetcher tool, which need more than three parameters for their lower-level objects, add an instance of NutchConf to the constructor.
+ For all dynamically used objects that implement a specific interface (where we have no control over the object constructor), use the Configurable interface to set the NutchConf in an inversion-of-control style (e.g. plugin extension implementations such as Parsers or Protocols).
+ PluginRegistry will no longer be a singleton but will get a constructor that takes a NutchConf instance.
+ Getting an Extension also requires a NutchConf, which is injected if the extension object (e.g. a Parser) implements the Configurable interface.
Any comments, improvement suggestions, more use cases? I would love to do this job; can I get a go from the other developers? From my point of view NutchConf is actually a showstopper, since a lot of people run into trouble integrating Nutch into other projects, and my suggestions are also required to write a Nutch GUI. Stefan
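The two proposals in this thread (a configuration interface, and passing the configuration in rather than calling a static getter) can be sketched together. Everything below is hypothetical and for illustration only: the names Conf, MapConf, FetcherTool, TextParser and the property key parser.max.content are assumptions, and none of this is the actual Nutch API. An XML-backed implementation would implement the same interface as the map-backed one shown here.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch only; all types here are hypothetical, not Nutch classes. */
final class InjectedConfSketch {

    /** Configuration as an interface, so values can come from any source. */
    interface Conf {
        String get(String name, String defaultValue);

        default int getInt(String name, int defaultValue) {
            String value = get(name, null);
            return (value == null) ? defaultValue : Integer.parseInt(value.trim());
        }
    }

    /** One possible source: an in-memory map. */
    static final class MapConf implements Conf {
        private final Map<String, String> values = new HashMap<>();

        MapConf set(String name, String value) {
            values.put(name, value);
            return this;
        }

        @Override
        public String get(String name, String defaultValue) {
            return values.getOrDefault(name, defaultValue);
        }
    }

    /** For objects whose constructor we do not control (e.g. plugin extensions). */
    interface Configurable {
        void setConf(Conf conf);
    }

    /** Higher-level object: receives the configuration through its constructor. */
    static final class FetcherTool {
        private final int threads;

        FetcherTool(Conf conf) {
            this.threads = conf.getInt("fetcher.threads.fetch", 10);
        }

        int threads() { return threads; }
    }

    /** Extension created reflectively; the configuration is injected afterwards. */
    static final class TextParser implements Configurable {
        private Conf conf;

        @Override
        public void setConf(Conf conf) { this.conf = conf; }

        int maxContent() { return conf.getInt("parser.max.content", 65536); }
    }

    /** Non-singleton registry: each instance carries its own configuration. */
    static final class PluginRegistry {
        private final Conf conf;

        PluginRegistry(Conf conf) { this.conf = conf; }

        <T> T newExtension(Class<T> type) {
            try {
                T instance = type.getDeclaredConstructor().newInstance();
                if (instance instanceof Configurable) {
                    ((Configurable) instance).setConf(conf);
                }
                return instance;
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("cannot create " + type.getName(), e);
            }
        }
    }

    public static void main(String[] args) {
        // Two independently configured instances in one JVM.
        Conf intranet = new MapConf().set("fetcher.threads.fetch", "4");
        Conf publicWeb = new MapConf().set("fetcher.threads.fetch", "32");
        System.out.println(new FetcherTool(intranet).threads());
        System.out.println(new FetcherTool(publicWeb).threads());
        System.out.println(new PluginRegistry(publicWeb).newExtension(TextParser.class).maxContent());
    }
}
```

Because nothing in the sketch reads a static singleton, the intranet and public configurations can coexist in one process, which is the multi-instance case raised earlier in the thread.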
injection infinite loop
If you inject the crawldb with a url file that doesn't end with a line feed, an infinite loop is entered. Anybody else encounter this problem?
060104 160950 Running job: job_7uku5w
060104 160952 map 0%
060104 160954 map 50%
060104 160957 map -2631%
060104 160959 map -259756%
060104 161002 map -538552%
060104 161006 map -818413%
060104 161009 map -1098421%
060104 161011 map -1377851%
060104 161014 map -1657718%
060104 161018 map -1939534%
060104 161021 map -2218515%
060104 161023 map -2588212%
060104 161026 map -2868787%
060104 161030 map -3147637%
Re: mapred crawling exception - Job failed!
I gave it another try tonight and I still have trouble. This is the very end of my log (the full version is attached), and you can see another nasty exception:
...
060104 213644 map 100%
060104 213645 Optimizing index.
java.lang.NullPointerException: value cannot be null
at org.apache.lucene.document.Field.init(Field.java:469)
at org.apache.lucene.document.Field.init(Field.java:412)
at org.apache.lucene.document.Field.UnIndexed(Field.java:195)
at org.apache.nutch.indexer.Indexer.reduce(Indexer.java:199)
at org.apache.nutch.mapred.ReduceTask.run(ReduceTask.java:260)
at org.apache.nutch.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:90)
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:308)
at org.apache.nutch.indexer.Indexer.index(Indexer.java:259)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:121)
I tried turning off most of the parsing plugins but it didn't help, so there is probably some more general issue. Any ideas? Regards, Lukas
On 1/4/06, Lukas Vlcek [EMAIL PROTECTED] wrote: Thanks guys! I really didn't have the latest copy... L.
On 1/4/06, Byron Miller [EMAIL PROTECTED] wrote: Fixed in the copy I run, as I've been able to get my 100k pages indexed without getting that error. -byron
--- Andrzej Bialecki [EMAIL PROTECTED] wrote: Lukas Vlcek wrote: Hi, I am trying to use the latest nutch-trunk version but I am facing an unexpected Job failed! exception. It seems that all crawling work has already been done but some threads are hung, which results in an exception after some timeout. This was fixed (or should be fixed :) in the revision r365576. Please report if it doesn't fix it for you.
-- Best regards, Andrzej Bialecki. Information Retrieval, Semantic Web; Embedded Unix, System Integration. http://www.sigram.com Contact: info at sigram dot com
Re: mapred crawling exception - Job failed!
Lukas Vlcek wrote: I gave it another try tonight and I still have trouble. This is the very end of my log (the full version is attached), and you can see another nasty exception:
Do you use the Fetcher in parsing or non-parsing mode, i.e. do you run a ParseSegment as a separate step?
-- Best regards, Andrzej Bialecki