Re: mapred crawling exception - Job failed!

2006-01-04 Thread Gal Nitzan
Yes, it was fixed. Just update your code from trunk.


On Wed, 2006-01-04 at 08:51 +0100, Andrzej Bialecki wrote:
 Lukas Vlcek wrote:
 
 Hi,
 
 I am trying to use the latest nutch-trunk version but I am hitting an
 unexpected "Job failed!" exception. It seems that all the crawling work
 has already been done, but some threads hang, which results in an
 exception after some timeout.
 
   
 
 
 This was fixed (or should be fixed :) in the revision r365576. Please 
 report if it doesn't fix it for you.
 




[bug] Re: NegativeArraySizeException in search server

2006-01-04 Thread Marko Bauhardt

Hi,
I got the same exception. The cause is the default value of the
searcher.max.hits property in nutch-default.xml, which is
Integer.MAX_VALUE. The class org.apache.lucene.util.PriorityQueue
increments this maximum size by one, and the next number after
Integer.MAX_VALUE is -2147483648 (integer overflow), so the array size
becomes negative. You must decrease searcher.max.hits to fix this.
But note: the PriorityQueue allocates an array of this size. If too
large a value is defined, an OutOfMemoryError occurs.
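
The overflow described above can be reproduced in a few lines. This is a minimal sketch, not Lucene's code: MaxHitsOverflowDemo, heapLength and clampMaxHits are hypothetical names, and the "size + 1" allocation is assumed from the description in this message.

```java
final class MaxHitsOverflowDemo {
    /** Mimics a queue that allocates maxSize + 1 slots (assumed from the message above). */
    static int heapLength(int maxSize) {
        return maxSize + 1; // wraps to Integer.MIN_VALUE when maxSize == Integer.MAX_VALUE
    }

    /** Hypothetical guard: keep the requested size between 1 and cap. */
    static int clampMaxHits(int requested, int cap) {
        return Math.max(1, Math.min(requested, cap));
    }

    public static void main(String[] args) {
        System.out.println(heapLength(Integer.MAX_VALUE));
        try {
            Object[] heap = new Object[heapLength(Integer.MAX_VALUE)];
            System.out.println(heap.length);
        } catch (NegativeArraySizeException e) {
            System.out.println("allocation rejected: " + e);
        }
        // With a clamped value the allocation is small and succeeds.
        int safe = clampMaxHits(Integer.MAX_VALUE, 1000);
        System.out.println(new Object[heapLength(safe)].length);
    }
}
```

A cap like the one in clampMaxHits only avoids the negative size; choosing a cap that also fits in memory is a separate decision.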

Any ideas or suggestions on how to fix this?

Marko




On 04.01.2006 at 02:00, Gal Nitzan wrote:


When trying to use the search server, I get the following error.

I use the trunk from today...

060104 025549 13 Server handler 0 on 9004 call error:
java.io.IOException: java.lang.NegativeArraySizeException
java.io.IOException: java.lang.NegativeArraySizeException
    at org.apache.lucene.util.PriorityQueue.initialize(PriorityQueue.java:35)
    at org.apache.lucene.search.HitQueue.<init>(HitQueue.java:23)
    at org.apache.lucene.search.TopDocCollector.<init>(TopDocCollector.java:47)
    at org.apache.nutch.searcher.LuceneQueryOptimizer$LimitedCollector.<init>(LuceneQueryOptimizer.java:52)
    at org.apache.nutch.searcher.LuceneQueryOptimizer.optimize(LuceneQueryOptimizer.java:153)
    at org.apache.nutch.searcher.IndexSearcher.search(IndexSearcher.java:93)
    at org.apache.nutch.searcher.NutchBean.search(NutchBean.java:155)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:324)
    at org.apache.nutch.ipc.RPC$1.call(RPC.java:186)
    at org.apache.nutch.ipc.Server$Handler.run(Server.java:200)







Re: mapred crawling exception - Job failed!

2006-01-04 Thread Lukas Vlcek
Hmmm...

If I am reading my local SVN copy correctly, I last updated yesterday,
so I have revision 365850 (Update of HTTPClient to v3.0). So this should
already be fixed... :-(

Andrzej, since you probably made the fix, is there anything special I
should check to be sure I have the fixed version?

Anyway, I will update from SVN again today and give it another try
tonight. I will let you know tomorrow.

Thanks,
Lukas

On 1/4/06, Gal Nitzan [EMAIL PROTECTED] wrote:
 Yes it was fixed. just update your code from trunk.


 On Wed, 2006-01-04 at 08:51 +0100, Andrzej Bialecki wrote:
  Lukas Vlcek wrote:
 
  Hi,
  
  I am trying to use the latest nutch-trunk version but I am facing
  unexpected Job failed! exception. It seems that all crawling work
  has been already done but some threads are hunged which results into
  exception after some timeout.
  
  
  
 
  This was fixed (or should be fixed :) in the revision r365576. Please
  report if it doesn't fix it for you.
 





Re: mapred crawling exception - Job failed!

2006-01-04 Thread Byron Miller
It is fixed in the copy I run: I've been able to get my
100k pages indexed without hitting that error.

-byron

--- Andrzej Bialecki [EMAIL PROTECTED] wrote:

 Lukas Vlcek wrote:
 
 Hi,
 
 I am trying to use the latest nutch-trunk version
 but I am facing
 unexpected Job failed! exception. It seems that
 all crawling work
 has been already done but some threads are hunged
 which results into
 exception after some timeout.
 
   
 
 
 This was fixed (or should be fixed :) in the
 revision r365576. Please 
 report if it doesn't fix it for you.
 
 -- 
 Best regards,
 Andrzej Bialecki
  ___. ___ ___ ___ _ _   __
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System Integration
 http://www.sigram.com  Contact: info at sigram dot com
 
 
 



Re: mapred crawling exception - Job failed!

2006-01-04 Thread Lukas Vlcek
Thanks guys!
I really didn't have the latest copy...
L.
On 1/4/06, Byron Miller [EMAIL PROTECTED] wrote:
 Fixed in the copy i run as i've been able to get my
 100k pages indexed without getting that error.

 -byron

 --- Andrzej Bialecki [EMAIL PROTECTED] wrote:

  Lukas Vlcek wrote:
 
  Hi,
  
  I am trying to use the latest nutch-trunk version
  but I am facing
  unexpected Job failed! exception. It seems that
  all crawling work
  has been already done but some threads are hunged
  which results into
  exception after some timeout.
  
  
  
 
  This was fixed (or should be fixed :) in the
  revision r365576. Please
  report if it doesn't fix it for you.
 
 
 
 




[jira] Created: (NUTCH-163) LogFormatter design

2006-01-04 Thread Daniel Feinstein (JIRA)
LogFormatter design
---

 Key: NUTCH-163
 URL: http://issues.apache.org/jira/browse/NUTCH-163
 Project: Nutch
Type: Improvement
 Environment: All platforms
Reporter: Daniel Feinstein


In the Nutch project, LogFormatter combines two separate responsibilities:
1) Formatting logger records, and
2) Handling severe errors

The first usage is standard and can usually be overridden by a user of the
package by modifying the logging.properties file.
The second usage is much more problematic because it affects the behavior of
the whole application (not only the Nutch package). To support the error
handling, LogFormatter enforces the use of its formatter class by all classes
of the whole application that uses the Nutch package. This is done by
overwriting all the system handlers (class java.util.logging.Handler). This
operation prevents the application from using its own log formatter. It also
causes LogFormatter.hasLoggedSevere() to be sensitive to all severe records in
the whole system, not only to the relevant ones. Moreover, the flag
LogFormatter.loggedSevere is never cleared, which means that if an application
has logged even one unrelated severe record, tools like Fetcher will never run
until the application is restarted.

I would like to suggest the following solutions:
1) Separate the functionality of log formatting and error handling, or
2) Change the LogFormatter class to be affected only by Nutch package functions

In my opinion the first solution is much better, especially if error handling
is encapsulated for each task. I have found the following usages of
LogFormatter.hasLoggedSevere():
- Fetcher
- URLFilterChecker
- ParseSegment
Unfortunately I'm not familiar enough with the usages above to implement this
solution; that is why I suggest the second one.
I have written my own implementation of the LogFormatter class, which has been
used for more than a year in the www.rawsugar.com application.
I could provide the file but do not know how to attach it to the issue. I hope
this change will be accepted by the community.
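
To make the second suggestion concrete, here is a minimal sketch of a severe-record tracker that is attached to one logger subtree only and can be reset. It is not the reporter's implementation (which is not shown in this issue); SevereTracker, SevereTrackerDemo and the logger names "demo.nutch" and "demo.other" are hypothetical. Only standard java.util.logging calls are used.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

/** Hypothetical handler: remembers whether a SEVERE record passed through it. */
final class SevereTracker extends Handler {
    private final AtomicBoolean loggedSevere = new AtomicBoolean(false);

    @Override
    public void publish(LogRecord record) {
        if (record.getLevel().intValue() >= Level.SEVERE.intValue()) {
            loggedSevere.set(true);
        }
    }

    @Override public void flush() { }
    @Override public void close() { }

    boolean hasLoggedSevere() { return loggedSevere.get(); }

    /** Clears the flag so that a later task starts from a clean state. */
    void reset() { loggedSevere.set(false); }
}

final class SevereTrackerDemo {
    // Strong references keep the loggers (and the attached handler) alive.
    static final Logger NUTCH = Logger.getLogger("demo.nutch");
    static final Logger FETCHER = Logger.getLogger("demo.nutch.fetcher");
    static final Logger OTHER = Logger.getLogger("demo.other");

    /** Attaches a tracker to the "demo.nutch" subtree only; other handlers stay untouched. */
    static SevereTracker install() {
        SevereTracker tracker = new SevereTracker();
        NUTCH.addHandler(tracker);
        return tracker;
    }

    public static void main(String[] args) {
        SevereTracker tracker = install();
        OTHER.severe("unrelated failure elsewhere in the application");
        System.out.println(tracker.hasLoggedSevere()); // unrelated record is not seen
        FETCHER.severe("fetch failed");
        System.out.println(tracker.hasLoggedSevere()); // child logger record is seen
        tracker.reset();
        System.out.println(tracker.hasLoggedSevere());
    }
}
```

Because the handler is added to a single logger instead of replacing the system handlers, the host application keeps its own formatter, and records from unrelated packages never set the flag.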


-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



no static NutchConf

2006-01-04 Thread Stefan Groschupf

Hi,
to move forward in the direction of having a Nutch GUI, I would love
to start removing the static access to NutchConf.
Based on experience, I would first love to get a kind of general
agreement and a 'go' before wasting too much time on an unaccepted
solution.


I suggest:

+ removing NutchConf.get().
+ in case a lower-level object uses only one or two, but not more than
three, parameters from the Nutch configuration, we add these parameters
to the constructor of this object.

(e.g. MapFile.Reader needs only the parameter INDEX_SKIP)
+ for higher-level objects like the fetcher tool, which need more than
three parameters for their lower-level objects, we add an instance of
NutchConf to the constructor.
+ for all dynamically used objects that implement a specific interface
(interface means no control over the object constructor) we use the
Configurable interface to set the NutchConf in an inversion-of-control
style.

(e.g. plugin extension implementations like Parser or Protocols)
+ PluginRegistry will no longer be a singleton but will get a
constructor that takes a NutchConf instance.
+ Getting an extension also requires a NutchConf, which is injected in
case the extension object (e.g. a Parser) implements the Configurable
interface.
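
The injection style in the list above could look roughly like the following. This is only a sketch under assumptions: Conf stands in for NutchConf, the setConf method name on Configurable is assumed, and DemoParser, PluginRegistrySketch and the key "demo.parser.max.length" are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Stand-in for NutchConf: a plain instance, with no static accessor. */
final class Conf {
    private final Map<String, String> values = new HashMap<>();

    Conf set(String key, String value) {
        values.put(key, value);
        return this;
    }

    int getInt(String key, int defaultValue) {
        String v = values.get(key);
        return v == null ? defaultValue : Integer.parseInt(v.trim());
    }
}

/** Implemented by extensions that want the configuration injected (method name assumed). */
interface Configurable {
    void setConf(Conf conf);
}

/** Hypothetical extension with a no-argument constructor. */
final class DemoParser implements Configurable {
    private int maxLength;

    @Override
    public void setConf(Conf conf) {
        maxLength = conf.getInt("demo.parser.max.length", 1024);
    }

    int maxLength() { return maxLength; }
}

/** Registry bound to one Conf instance instead of being a singleton. */
final class PluginRegistrySketch {
    private final Conf conf;

    PluginRegistrySketch(Conf conf) { this.conf = conf; }

    /** Creates an extension and injects the registry's Conf if the extension asks for it. */
    <T> T newExtension(Supplier<T> factory) {
        T extension = factory.get();
        if (extension instanceof Configurable) {
            ((Configurable) extension).setConf(conf);
        }
        return extension;
    }
}
```

Two registries built from two Conf instances can then live side by side in one JVM, each handing out differently configured extensions.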


Any comments, improvement suggestions, more use-cases?
I would love to do this job, can I get a go from the other developers?
From my point of view NutchConf is actually a showstopper, since a
lot of people run into trouble integrating Nutch into other projects;
my suggestions are also required to write a Nutch GUI.


Stefan




Re: no static NutchConf

2006-01-04 Thread Andrzej Bialecki

Stefan Groschupf wrote:


Hi,
to move forward in the direction of having a nutch gui, I would love  
to start removing the static access of NutchConf.
Based on experience first I would love to get a kind of general  
agreement and a 'go' before wasting to much time for an unaccented  
solution.



I agree with the general direction. Some comments below:



I suggest:

+ removing NutchConf.get().



I'm not sure about this... Somewhere you need to instantiate the default 
config, and this looks like a good place.


+ in case a lower level object use only one, two but not more than 3  
parameters from the nutch configuration, we add this parameter to the  
constructor of this object.

(e.g. MapFile.Reader needs only the parameter INDEX_SKIP)



I don't fully agree with this. In most such cases, you already have a
NutchConf instance in the method or class context, so it makes sense to
use it in the constructor. You could add these constructors with all
parameters listed explicitly, but I'd expect that the constructors using
NutchConf would be used most frequently.


+ for higher level objects like fetcher tool- that need more than 3  
parameters for the lower level object -  we add a instance of  
NutchConf to the Constructor



Ok.

+ for all dynamic used object that implements a specific interface  
(interface  no control over the object constructor) we use the  
Configurable interface to set the NutchConf in a inversion of control  
like style.

(e.g. Plugin Extension Implementations like Parser or Protocols)



Ok.

+ PluginRegestry will not longer a singleton but will get an  
constructor with a NutchConf instance.



Definitely yes.

+ Getting a Extension, require also a NutchConf that is injected in  
case the Extension Object (e.g. a Parser) implements a Configurable  
interface.



Yes. If you remember our discussion, I'd like also to follow a pattern 
where such instances are cached inside this NutchConf instance, if 
appropriate (i.e. if they are reusable and multi-threaded).




Any comments, improvement suggestions, more use-cases?
I would love to do this job, can I get a go from the other developers?



+1 from me.

--
Best regards,
Andrzej Bialecki 
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




Re: no static NutchConf

2006-01-04 Thread Stefan Groschupf


I don't fully agree with this. In most such cases, you already have  
a NutchConf instance in the method or class context, so it makes  
sense to use it in the constructor. You could add these construtors  
with all parameters iterated, but I'd expect that the constructors  
using NutchConf would be used most frequently.


My idea is to be able to use low-level things outside of Nutch as well.
It may be a philosophical question: in the case of the map file writer
you pass a complete hashmap with a bunch of properties to the object,
but the object reads only one int from this hashmap. I personally
don't like using a hashmap to 'transport' just one value.


So my suggestion looks like:
    new MapFile.Reader(parameterA, nutchConf.getInt(parameterKey, 0));
If I understand you correctly, you prefer:
    new MapFile.Reader(parameterA, nutchConf);
    ...
    public Reader(..., NutchConf nutchConf) {
        this.parameter = nutchConf.getInt(parameterKey, 0);
    }

As mentioned, this is more a question of code philosophy and is not
important to me; my only idea was to decouple things as much as
possible if we touch it anyway.
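
The two styles can also coexist in one class, which is roughly what was suggested earlier in the thread. The following is a minimal sketch, not MapFile.Reader itself: SkipReader, the key "demo.index.skip" and the use of java.util.Properties in place of NutchConf are all assumptions.

```java
import java.util.Properties;

/** Hypothetical low-level reader that needs exactly one configuration value. */
final class SkipReader {
    static final String INDEX_SKIP_KEY = "demo.index.skip"; // hypothetical key name

    private final String dir;
    private final int indexSkip;

    /** Explicit style: usable without any configuration object at all. */
    SkipReader(String dir, int indexSkip) {
        this.dir = dir;
        this.indexSkip = indexSkip;
    }

    /** Convenience style: pulls the single value out of a configuration object. */
    SkipReader(String dir, Properties conf) {
        this(dir, Integer.parseInt(conf.getProperty(INDEX_SKIP_KEY, "0")));
    }

    String dir() { return dir; }
    int indexSkip() { return indexSkip; }
}
```

The explicit constructor documents the one dependency and keeps the class reusable outside Nutch; the configuration constructor is a thin delegate for callers that already hold a configuration instance.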


+ Getting a Extension, require also a NutchConf that is injected  
in  case the Extension Object (e.g. a Parser) implements a  
Configurable  interface.



Yes. If you remember our discussion, I'd like also to follow a  
pattern where such instances are cached inside this NutchConf  
instance, if appropriate (i.e. if they are reusable and multi- 
threaded).



I'm afraid I still do not clearly understand your idea here. As
discussed, from my point of view it makes no sense to cache any
objects in a NutchConf.
In particular, extension implementations like parsers are multithreaded
and exist as many times as we have threads. Caching would make more
sense behind the scenes of the plugin registry, but it may be
difficult since you can run into trouble with resource life cycle
management. PluginClass instances are already cached and work like
a kind of singleton for each existing plugin registry.
I also see some trouble when using this caching mechanism since
NutchConf can be serialized. Actually I have no idea where this
mechanism is used, but I guess distributed map reduce will use this
mechanism heavily.

So the cached objects need to be Serializable as well.

Stefan



Re: IndexSorter optimizer

2006-01-04 Thread Doug Cutting

Byron Miller wrote:

On optimizing performance, does anyone know if google
is exporting its entire dataset as an index or only
somehow indexing the topN % (since they only show the
first 1000 or so results anyway)


Both.  The highest-scoring pages are kept in separate indexes that are 
searched first.  When a query fails to match 1000 or so documents in the 
high-scoring indexes then the entire dataset is searched.  In general 
there can be multiple levels, e.g.: high-scoring, mid-scoring and 
low-scoring indexes, with the vast majority of pages in the last 
category, and the vast majority of queries resolved consulting only the 
first category.
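
The tiered lookup described above can be sketched as follows. This is an illustration under assumptions, not Nutch or Lucene code: TieredSearch is a hypothetical name, each tier is modeled as a function from a query to a list of hits, and lower tiers are consulted only when the higher ones did not return enough results.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

final class TieredSearch {
    /**
     * Queries the tiers in order (highest-scoring first) and stops as soon as
     * at least 'wanted' distinct hits have been collected.
     */
    static List<String> search(String query, int wanted,
                               List<Function<String, List<String>>> tiers) {
        List<String> hits = new ArrayList<>();
        for (Function<String, List<String>> tier : tiers) {
            for (String hit : tier.apply(query)) {
                if (!hits.contains(hit)) {
                    hits.add(hit);
                }
            }
            if (hits.size() >= wanted) {
                break; // the remaining (larger) tiers are never touched
            }
        }
        return hits.size() > wanted ? new ArrayList<>(hits.subList(0, wanted)) : hits;
    }

    /** Two toy tiers: a small high-scoring index and a larger low-scoring one. */
    static List<Function<String, List<String>>> demoTiers() {
        List<Function<String, List<String>>> tiers = new ArrayList<>();
        tiers.add(q -> Arrays.asList("a", "b"));
        tiers.add(q -> Arrays.asList("b", "c", "d"));
        return tiers;
    }

    public static void main(String[] args) {
        System.out.println(search("q", 2, demoTiers()));
        System.out.println(search("q", 3, demoTiers()));
    }
}
```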


What I have implemented so far for Nutch is a single-index version of 
this.  The current index-sorting implementation does not yet scale well 
to indexes larger than ~50M urls.  It is a proof-of-concept.


A better long-term approach is to introduce another MapReduce pass that 
collects Lucene documents (or equivalent) as values, and page scores as 
keys.  Then the indexing MapReduce pass can partition and sort by score 
before creating indexes.  The distributed search code will also need to 
be modified to search high-score indexes first.
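
As a rough in-memory illustration of "partition and sort by score" (not a MapReduce job, and not existing Nutch code), documents can be ordered by descending score and cut into consecutive partitions, so that the first partition holds the highest-scoring pages. ScorePartitioner, Doc and the partition size are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class ScorePartitioner {
    /** Minimal stand-in for an indexed document: a URL plus its page score. */
    static final class Doc {
        final String url;
        final float score;

        Doc(String url, float score) {
            this.url = url;
            this.score = score;
        }
    }

    /** Sorts by descending score, then cuts into partitions of at most partitionSize docs. */
    static List<List<Doc>> partitionByScore(List<Doc> docs, int partitionSize) {
        List<Doc> sorted = new ArrayList<>(docs);
        sorted.sort(Comparator.comparingDouble((Doc d) -> d.score).reversed());
        List<List<Doc>> partitions = new ArrayList<>();
        for (int i = 0; i < sorted.size(); i += partitionSize) {
            int end = Math.min(i + partitionSize, sorted.size());
            partitions.add(new ArrayList<>(sorted.subList(i, end)));
        }
        return partitions;
    }
}
```

Each resulting partition would correspond to one index, and a search front end would query partition 0 before the others.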


Doug


Re: no static NutchConf

2006-01-04 Thread Andrzej Bialecki

Jérôme Charron wrote:


Excuse me in advance, I probably missed something, but what are the use
cases for having many NutchConf instances with different values?
 



Running many different tasks in parallel, each using different config, 
inside the same JVM.






RE: no static NutchConf

2006-01-04 Thread Steve Betts
If you are going to be able to reconfigure a Nutch component at runtime, you
need to remove any configuration from the constructor and have a method that
allows you to get/set the configuration for the component. The problem with
keeping the entire configuration in a single component is displaying and
filtering the configuration information for the user, so that the user knows
which component is being configured.

Eclipse has a very good pattern for handling configuration for each of the
components. Basically each component is responsible for its own
configuration, and the tool just provides the framework to allow the
configuration to be displayed, updated, and stored.

The drawback of that approach is that Nutch doesn't really have a GUI, or at
least has to be able to run without one.

I think that, at the very least, removing the configuration information from
the constructor is the first step.  You can still have a properties object
set the configuration. Then we can discuss the relative merits of
displaying, changing, and storing the configuration.  (Like, how a user is
supposed to know what component is affected by which property.)
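
A component that is configured through get/set methods rather than its constructor might look like this. It is a minimal sketch only: ReconfigurableFetcher and the key "demo.fetcher.threads" are hypothetical, and java.util.Properties is used as the "properties object" mentioned above.

```java
import java.util.Properties;

/** Hypothetical component that can be reconfigured after construction. */
final class ReconfigurableFetcher {
    private volatile Properties conf = new Properties();

    /** Replaces the configuration; a copy is stored so later caller edits have no effect. */
    void setConf(Properties newConf) {
        Properties copy = new Properties();
        copy.putAll(newConf);
        this.conf = copy;
    }

    /** Returns a copy so that a GUI can display the values without mutating the component. */
    Properties getConf() {
        Properties copy = new Properties();
        copy.putAll(conf);
        return copy;
    }

    int threads() {
        return Integer.parseInt(conf.getProperty("demo.fetcher.threads", "10"));
    }
}
```

Since each component owns its properties, a front end can show exactly the keys returned by that component's getConf(), which addresses the question of which property affects which component.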

Thanks,

Steve Betts
[EMAIL PROTECTED]
937-477-1797


-Original Message-
From: Stefan Groschupf [mailto:[EMAIL PROTECTED]
Sent: Wednesday, January 04, 2006 12:22 PM
To: nutch-dev@lucene.apache.org
Subject: Re: no static NutchConf


 I don't fully agree with this. In most such cases, you already have
 a NutchConf instance in the method or class context, so it makes
 sense to use it in the constructor. You could add these construtors
 with all parameters iterated, but I'd expect that the constructors
 using NutchConf would be used most frequently.

My  idea is to be able using low level things outside of nutch also.
It is may a philosophically question in case of the map file writer
you pass a complete hashmap with a bunch of properties to the object,
but the objects only reads one int from this hashmap. I personal
don't like to use a hashmap to 'transport' just one value.

So my suggestion looks like:
new MapFile.Reader(parameterA, nutchConf.getInt(parameterKey, 0));
if I understand you correct you prefer:
new MapFile.Reader(parameterA, nutchConf);
...
public MapFile(...){
this.parameter = nutchConf.getInt(parameterKey,0);
}

As mentioned this is more a code philosophy question and this is not
important for me, my only idea was to decouple things as much as
possible if we touch it anyway.

 + Getting a Extension, require also a NutchConf that is injected
 in  case the Extension Object (e.g. a Parser) implements a
 Configurable  interface.


 Yes. If you remember our discussion, I'd like also to follow a
 pattern where such instances are cached inside this NutchConf
 instance, if appropriate (i.e. if they are reusable and multi-
 threaded).


I'm afraid I still do not clearly understand your idea here. As
discussed it makes from my point of view no sense to cache any
objects in a nutchConf.
Especially extension implementation like parsers are multithreaded
and exists that often as we have threads. A caching would make more
sense behind the sense of the plugin registry, but it is may
difficult since you can run in trouble with resource life cycle
management. PluginClass instances are already cached and working like
a kind of singleton for each existing plugin registry.
Also I see some trouble  when using this caching mechanism since
NutchConf can be serialized. Actually I have no idea where this
mechanism is used, but I guess distributed map reduce will use this
mechanism heavily.
So the cached objects need to be Serializable as well.

Stefan




Re: no static NutchConf

2006-01-04 Thread Jérôme Charron
 Excuse me in advance, I probably missed something, but what are the use
 cases for having many NutchConf instances with different values?
 Running many different tasks in parallel, each using different config,
 inside the same JVM.

Ok, I understand this, Andrzej, but it is not really what I call a use case.
It is more a feature that you describe here.
In fact, what I mean is that I don't understand in which cases it will be
useful. And I don't understand how a particular
NutchConfig will be selected for a particular task...

Regards

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


Re: IndexSorter optimizer

2006-01-04 Thread Andrzej Bialecki

Doug Cutting wrote:


Byron Miller wrote:


On optimizing performance, does anyone know if google
is exporting its entire dataset as an index or only
somehow indexing the topN % (since they only show the
first 1000 or so results anyway)



Both.  The highest-scoring pages are kept in separate indexes that are 
searched first.  When a query fails to match 1000 or so documents in 
the high-scoring indexes then the entire dataset is searched.  In 
general there can be multiple levels, e.g.: high-scoring, mid-scoring 
and low-scoring indexes, with the vast majority of pages in the last 
category, and the vast majority of queries resolved consulting only 
the first category.


What I have implemented so far for Nutch is a single-index version of 
this.  The current index-sorting implementation does not yet scale 
well to indexes larger than ~50M urls.  It is a proof-of-concept.


A better long-term approach is to introduce another MapReduce pass 
that collects Lucene documents (or equivalent) as values, and page 
scores as keys.  Then the indexing MapReduce pass can partition and 
sort by score before creating indexes.  The distributed search code 
will also need to be modified to search high-score indexes first.



The WWW2005 conference presented a couple of interesting papers on the 
subject (http://www2005.org), among others these:


1. http://www2005.org/cdrom/docs/p235.pdf
2. http://www2005.org/cdrom/docs/p245.pdf
3. http://www2005.org/cdrom/docs/p257.pdf

The techniques described in the first paper are not too difficult to
implement, especially Carmel's method of index pruning, which gives
satisfactory results at moderate cost.


The third paper, by Long & Suel, presents a concept of using a cache of
intersections for multi-term queries, which we already sort of use with
CachingFilters, only they propose to store them on disk instead of
limiting the cache to a relatively small number of filters kept in RAM...






Re: no static NutchConf

2006-01-04 Thread Andrzej Bialecki

Jérôme Charron wrote:


Excuse me in advance, I probably missed something, but what are the use
cases for having many NutchConf instances with different values?
 


Running many different tasks in parallel, each using different config,
inside the same JVM.
   



Ok, I understand this Andrzej, but it is not really what I call a use case.
It is more a feature that you describe here.
In fact, what I mean is that I don't understand in which cases it will be
usefull. And I don't understand how a particular
NutchConfig will be selected for a particular task...
 



Use case: executing multiple tasks on any single tasktracker node, but
with drastically different configurations for each task.


Example: what happens now if you try to run more than one fetcher at the 
same time, where the fetcher parameters differ (or a set of activated 
plugins differs)? You can't - the local tasks on each tasktracker will 
use whatever local config is there. What happens if you change the 
config on a node that  submits the job? The changes won't be propagated 
to the tasktracker nodes, because tasktrackers use local configuration 
(through a singleton NutchConf.get()), instead of supplying a 
serialized/deserialized instance of the config from the originating 
node... etc.


NutchConf instances will be created when you create a JobConf. Then they 
will have to be serialized/deserialized when job descriptors are sent by 
jobtracker to tasktrackers on mapred nodes, and used locally by 
tasktrackers to instantiate local tasks using copies of the original 
NutchConf instance.
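
The round trip described above (serialize on the submitting node, deserialize an independent copy on the task side) can be sketched with java.util.Properties as a stand-in. NutchConf's actual wire format is not shown in this thread, so ConfShipping and the key used below are purely illustrative.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Properties;

final class ConfShipping {
    /** Submitting side: turn the job's configuration into bytes for the job descriptor. */
    static byte[] serialize(Properties conf) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            conf.store(out, null);
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Task side: rebuild a private copy instead of reading a node-local singleton. */
    static Properties deserialize(byte[] bytes) {
        try {
            Properties copy = new Properties();
            copy.load(new ByteArrayInputStream(bytes));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Properties jobConf = new Properties();
        jobConf.setProperty("demo.fetcher.threads", "5"); // hypothetical key
        Properties taskConf = deserialize(serialize(jobConf));
        System.out.println(taskConf.getProperty("demo.fetcher.threads"));
    }
}
```

Two tasks running in the same JVM would each hold their own deserialized copy, so their settings cannot interfere with each other.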






[jira] Commented: (NUTCH-164) Locale (language) choice by first session has global effect to all sessions

2006-01-04 Thread KuroSaka TeruHiko (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-164?page=comments#action_12361782 ] 

KuroSaka TeruHiko commented on NUTCH-164:
-

Actually, the current language selection scheme needs an overhaul.

The locale for the message bundle is determined only by the preferred language
setting of the browser, while the selection of the localized JSP is done by
clicking on the language code link at the bottom of each page.  There is no
coordination.  The language chosen via the language code link does not persist
in the session.

See the discussion about the locale selection at W3C:
http://www.w3.org/International/questions/qa-accept-lang-locales#answer



 Locale (language) choice by first session has global effect to all sessions
 ---

  Key: NUTCH-164
  URL: http://issues.apache.org/jira/browse/NUTCH-164
  Project: Nutch
 Type: Bug
   Components: web gui
 Versions: 0.7.1
  Environment: any
 Reporter: KuroSaka TeruHiko


 Here's a report posted on nutch-users ML by Sergio [EMAIL PROTECTED] on 
 1/02/2006:
 
 I just installed nutch in a Fedora Core 3 server.
 Once installed, I crawled a small site to test it. I opened my navigator
 (mozilla 1.7 which reports by default ES-ES locales, and everything was ok).
 Then I asked a friend of mine  (the owner of the server) to test it. He did
 a search with an EN-US locale navigator, and the search page appeared in
 Spanish.
 After a few hours, I did the following: I restarted tomcat, I changed the
 locale of my mozilla to EN, and I opened the search page. Now I always get
 English search page even if I open with a mozilla ES-ES locale.
 I wrote a message to my friend:
 nutch keeps the locale of the first navigator that makes a request for all
 other requests. By this reason, yesterday as the first request was from my
 ES locale browser, you saw the page in Spanish with your browser that
 reports EN locale. There is a way to make this work:
 * Making sure that, after the server is restarted, the first request is done
 by a browser that reports EN locale.
 
 This happened in my environment too.  After taking a look at the code, I
 believe this is caused by the use of the default message bundle in
 search.jsp.  The code snippet looks like:
 <i18n:bundle baseName="org.nutch.jsp.search"/>
 ...
 <title>Nutch: <i18n:message key="title"/></title>
 ...
 The default message bundle probably has application scope.  Because of
 that, the first setting of the language has a global effect on every
 session created afterward.
 The right fix is to limit the scope to the session by inserting the scope
 specifier, as in:
 <i18n:bundle scope="session" baseName="org.nutch.jsp.search"/>
 Other JSP files need to be inspected for the same issue and should be fixed
 as well.
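
The suspected scope problem can be modeled without any JSP machinery. The sketch below is not the i18n taglib's code; BundleScopeDemo and its nested classes are hypothetical and only contrast one shared slot (application scope) with one slot per session.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class BundleScopeDemo {
    /** Application scope: a single slot shared by every session (the suspected bug). */
    static final class ApplicationScoped {
        private Locale cached;

        Locale localeFor(String sessionId, Locale requested) {
            if (cached == null) {
                cached = requested; // the first request decides for everyone
            }
            return cached;
        }
    }

    /** Session scope: one slot per session id (the proposed fix). */
    static final class SessionScoped {
        private final Map<String, Locale> bySession = new HashMap<>();

        Locale localeFor(String sessionId, Locale requested) {
            return bySession.computeIfAbsent(sessionId, id -> requested);
        }
    }

    public static void main(String[] args) {
        Locale spanish = Locale.forLanguageTag("es-ES");
        ApplicationScoped app = new ApplicationScoped();
        app.localeFor("s1", spanish);
        System.out.println(app.localeFor("s2", Locale.US));  // still Spanish

        SessionScoped perSession = new SessionScoped();
        perSession.localeFor("s1", spanish);
        System.out.println(perSession.localeFor("s2", Locale.US)); // English
    }
}
```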




Re: no static NutchConf

2006-01-04 Thread Piotr Kosiorowski

+1 in general
In fact I like the approach presented by Stefan of passing only the required
parameters, instead of NutchConf, to objects that have a small number of
configurable params. It makes it obvious which parameters are required
for such basic objects to run, and as they are usually building blocks
for something bigger, it makes it easier to reuse them with different
params in different parts of the code. But I like the direction and will
not oppose passing the whole NutchConf in this case.

Regards
Piotr


Re: svn commit: r365850 - in /lucene/nutch/trunk/src/plugin/protocol-httpclient: ./ lib/ src/java/org/apache/nutch/protocol/httpclient/

2006-01-04 Thread Piotr Kosiorowski

Andrzej,
Do you think it would be a good idea to commit it in 0.7 branch for 
0.7.2 release? I personally prefer to use released libraries instead of 
RC if possible. It does not require a lot of changes and you have 
already tested it with existing code...

Piotr

[EMAIL PROTECTED] wrote:

Author: ab
Date: Tue Jan  3 23:32:04 2006
New Revision: 365850

URL: http://svn.apache.org/viewcvs?rev=365850&view=rev
Log:
Update Commons HTTPClient to v. 3.0.

Add some default headers to prefer HTML content, and in English.





Re: svn commit: r365850 - in /lucene/nutch/trunk/src/plugin/protocol-httpclient: ./ lib/ src/java/org/apache/nutch/protocol/httpclient/

2006-01-04 Thread Andrzej Bialecki

Piotr Kosiorowski wrote:


Andrzej,
Do you think it would be a good idea to commit it in 0.7 branch for 
0.7.2 release? I personally prefer to use released libraries instead 
of RC if possible. It does not require a lot of changes and you have 
already tested it with existing code...

Piotr



I didn't see any problems, I think you can go ahead.





Re: no static NutchConf

2006-01-04 Thread Thomas Jaeger
Hi,

Stefan Groschupf wrote:
[...]
 Any comments, improvement suggestions, more use-cases?

I completely agree with you.

I have two more ideas:
1) create NutchConf as interface (not class)
2) make it work as plugin

1) If NutchConf is an interface, the NutchConf implementation can be
written with a hashmap in mind (like now) or with JMX or
commons-configuration.
2) There are only 4 required configuration options (plugin.excludes,
plugin.includes, plugin.folders, plugin.auto-activation) the plugin
registry needs to start up. If these options are provided by a bootstrap
configuration, configuration plugins will be possible.
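
The interface idea in point 1 could be sketched like this. Everything here is an assumption for illustration: ConfSource stands in for a NutchConf interface, the two implementations are toy examples (a hashmap and JVM system properties rather than JMX or commons-configuration), and the default values passed in PluginBootstrap are made up.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical interface standing in for NutchConf. */
interface ConfSource {
    String get(String key, String defaultValue);
}

/** Hashmap-backed implementation, similar in spirit to the current class. */
final class MapConfSource implements ConfSource {
    private final Map<String, String> values = new HashMap<>();

    MapConfSource with(String key, String value) {
        values.put(key, value);
        return this;
    }

    @Override
    public String get(String key, String defaultValue) {
        return values.getOrDefault(key, defaultValue);
    }
}

/** Alternative implementation backed by JVM system properties. */
final class SystemPropertyConfSource implements ConfSource {
    @Override
    public String get(String key, String defaultValue) {
        return System.getProperty(key, defaultValue);
    }
}

/** The registry only needs a few bootstrap options, whatever the implementation. */
final class PluginBootstrap {
    static String describe(ConfSource bootstrap) {
        return "folders=" + bootstrap.get("plugin.folders", "plugins")
            + " includes=" + bootstrap.get("plugin.includes", ".*");
    }
}
```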

If help is needed, I would like to implement a JMX implementation of
NutchConf (since I will need it myself ;).


Regards,

Thomas


Re: no static NutchConf

2006-01-04 Thread Doug Cutting

Andrzej Bialecki wrote:
Example: what happens now if you try to run more than one fetcher at the 
same time, where the fetcher parameters differ (or a set of activated 
plugins differs)? You can't - the local tasks on each tasktracker will 
use whatever local config is there.


That's true when mapred.job.tracker=local, but when things are
distributed the config can vary since each task is spawned in a separate
JVM with a separate classpath.  The nutch-site.xml on each node can
never be overridden.  For example, so long as plugin.includes is not
specified in nutch-site.xml on each node, then each task can override
plugin.includes to use different plugins.


Also note that plugin implementations can be submitted in a jar file with
the job, and plugin.folders can be overridden in the job to find the new
plugins.  So a job jar might include a folder named my.plugins and set
plugin.folders to "my.plugins, plugins", then alter plugin.includes to
include job-specific plugins.


What happens if you change the 
config on a node that  submits the job? The changes won't be propagated 
to the tasktracker nodes, because tasktrackers use local configuration 
(through a singleton NutchConf.get()), instead of supplying a 
serialized/deserialized instance of the config from the originating 
node... etc.


Again, I'm not sure this is a problem.  Properties which tasks should be 
able to override should not be specified in nutch-site.xml, but rather 
in mapred-default.xml.  Lots of job-specific properties are currently 
passed this way.


Another use case for eliminating the static uses of NutchConf is to 
simplify the construction of a configuration gui.  It would be nice to 
have a web-based interface which permits one to configure parameters and 
then have it run the system.  This should be able to run multiple Nutch 
instances in a single JVM.  For example, a single Nutch-based search 
appliance daemon should be able to crawl and search both your intranet 
and your public websites, each configured separately.


Doug


[jira] Closed: (NUTCH-142) NutchConf should use the thread context classloader

2006-01-04 Thread Piotr Kosiorowski (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-142?page=all ]
 
Piotr Kosiorowski closed NUTCH-142:
---

Fix Version: 0.7.2-dev
 0.8-dev
 Resolution: Fixed

 NutchConf should use the thread context classloader
 ---

  Key: NUTCH-142
  URL: http://issues.apache.org/jira/browse/NUTCH-142
  Project: Nutch
 Type: Improvement
 Versions: 0.7
 Reporter: Mike Cannon-Brookes
  Fix For: 0.7.2-dev, 0.8-dev


 Right now NutchConf uses its own static classloader, which is _evil_ in a 
 J2EE scenario.
 This is simply fixed. Line 52:
private ClassLoader classLoader = NutchConf.class.getClassLoader();
 Should be:
private ClassLoader classLoader = 
 Thread.currentThread().getContextClassLoader();
 This means no matter where Nutch classes are loaded from, it will use the 
 correct J2EE classloader to try to find configuration files (i.e. from 
 WEB-INF/classes).
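
A small sketch of the lookup the issue proposes, as a standalone helper. ConfResourceLoader, pickLoader and find are made-up names, and the fallback to other class loaders is an assumption added here for threads that have no context class loader; the issue itself only proposes the context class loader.

```java
import java.net.URL;

// Illustrative helper, not Nutch code; the class and method names are made up.
class ConfResourceLoader {

    // Prefer the thread context class loader, as the issue proposes.
    // The fallbacks are an assumption added for threads with no context loader.
    static ClassLoader pickLoader() {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            loader = ConfResourceLoader.class.getClassLoader();
        }
        if (loader == null) {
            loader = ClassLoader.getSystemClassLoader();
        }
        return loader;
    }

    // Returns the URL of a configuration resource, or null if it is not found.
    static URL find(String resourceName) {
        return pickLoader().getResource(resourceName);
    }

    public static void main(String[] args) {
        // Prints null unless nutch-default.xml is on the class path.
        System.out.println(find("nutch-default.xml"));
    }
}
```

In a web container the context class loader is normally the web application's loader, which is why resources under WEB-INF/classes become visible this way.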

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



Re: no static NutchConf

2006-01-04 Thread David Wallace
Hi Stefan,
I think these are fine things to be doing.  Just two points:
 
(1) Why not just always pass the NutchConf to the constructor of any
class that needs it, instead of distinguishing between classes that
use 1 or 2 configuration parameters and classes that use more?  That
would be more consistent.  Also, it's possible that a class that
CURRENTLY only uses 2 configuration parameters will use 3 or 4 at some
point in the future, and it would be a shame to have to rewrite its
constructor when that happens.
 
(2) What I'd REALLY like to see is if NutchConf were an interface, with
methods that allow the retrieval of properties from any source.  There
could be a class NutchXmlConf which implements the NutchConf interface,
which works the current way (with nutch-default.xml, nutch-site.xml and
so on).  Where we need to create a NutchConf, we actually create a
NutchXmlConf, but pass it to class constructors whose arguments are of
type NutchConf.  That way, if I want to use a non-standard mechanism for
storing my Nutch parameters (eg, a properties file, a relational
database, the Windows Registry, whatever), I can write my own class that
implements the NutchConf interface; then instantiate it and pass it
around, without having to re-write every Nutch class that uses it.
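
A rough sketch of what (2) could look like, under stated assumptions: the method set of the interface is a guess, PropertiesNutchConf and HitLimiter are hypothetical classes, and the default of 1000 is an arbitrary illustrative value. Only the idea of a NutchConf interface with interchangeable implementations comes from the text above.

```java
import java.util.Properties;

// Assumed shape of the interface; the real method set would need to be agreed on.
interface NutchConf {
    String get(String name);
    int getInt(String name, int defaultValue);
}

// Hypothetical non-XML implementation backed by java.util.Properties.
class PropertiesNutchConf implements NutchConf {
    private final Properties props;

    PropertiesNutchConf(Properties props) {
        this.props = props;
    }

    public String get(String name) {
        return props.getProperty(name);
    }

    public int getInt(String name, int defaultValue) {
        String value = props.getProperty(name);
        if (value == null) {
            return defaultValue;
        }
        try {
            return Integer.parseInt(value.trim());
        } catch (NumberFormatException e) {
            return defaultValue; // fall back when the value is not a number
        }
    }
}

// Hypothetical consumer: it depends only on the interface, never on the source.
class HitLimiter {
    private final int maxHits;

    HitLimiter(NutchConf conf) {
        // 1000 is an arbitrary illustrative default.
        this.maxHits = conf.getInt("searcher.max.hits", 1000);
    }

    int maxHits() {
        return maxHits;
    }
}
```

A NutchXmlConf that reads nutch-default.xml and nutch-site.xml would then be just another implementation passed to the same constructors.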
 
The benefits of (2) are legion, in particular for people who want to
use a Nutch search engine as part of an existing web application, where
that existing application uses a specific (non-XML) mechanism for
storing configuration parameters.  It would also give extra flexibility
to people working on Nutch installations that sit in multiple
environments (Development, System Test, UAT, Production etc) and get
deployed from one environment to the next.
 
Regards,
David.
 
 
 
 
From: Stefan Groschupf [EMAIL PROTECTED]
Date: Wed, 4 Jan 2006 15:39:38 +0100
Subject: [Nutch-dev] no static NutchConf

Hi,
to move toward having a nutch gui, I would love  
to start removing the static access to NutchConf.
Based on experience, I would first love to get some general  
agreement and a 'go' before wasting too much time on an unaccepted  
solution.

I suggest:

+ removing NutchConf.get().
+ in case a lower-level object uses only one or two, but not more than 3,
parameters from the nutch configuration, we add these parameters to the
constructor of this object.
(e.g. MapFile.Reader needs only the parameter INDEX_SKIP)
+ for higher-level objects like the fetcher tool, which need more than 3
parameters for their lower-level objects, we add an instance of
NutchConf to the constructor.
+ for all dynamically used objects that implement a specific interface
(an interface gives no control over the object constructor) we use the
Configurable interface to set the NutchConf in an inversion-of-control
style.
(e.g. plugin extension implementations like Parser or Protocols)
+ PluginRegistry will no longer be a singleton but will get a
constructor with a NutchConf instance.
+ Getting an Extension also requires a NutchConf, which is injected in
case the Extension object (e.g. a Parser) implements the Configurable
interface.
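
The last three points could be sketched roughly as follows. Everything here is a hypothetical stand-in: Conf plays the role of NutchConf, ExtensionFactory the role of a non-singleton registry, and TruncatingParser and the "parser.limit" key exist only for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for NutchConf: a plain instance, no static access.
class Conf {
    private final Map<String, String> values = new HashMap<>();

    void set(String key, String value) { values.put(key, value); }

    int getInt(String key, int defaultValue) {
        String value = values.get(key);
        return value == null ? defaultValue : Integer.parseInt(value.trim());
    }
}

// Objects created reflectively cannot take constructor arguments,
// so the configuration is handed over through this interface instead.
interface Configurable {
    void setConf(Conf conf);
}

interface Parser {
    String parse(String content);
}

// Made-up extension that keeps only the first "parser.limit" characters.
class TruncatingParser implements Parser, Configurable {
    private int limit = Integer.MAX_VALUE;

    public void setConf(Conf conf) {
        limit = conf.getInt("parser.limit", limit);
    }

    public String parse(String content) {
        return content.length() > limit ? content.substring(0, limit) : content;
    }
}

// Stand-in for a registry that is constructed with a configuration instance.
class ExtensionFactory {
    private final Conf conf;

    ExtensionFactory(Conf conf) { this.conf = conf; }

    <T> T create(Class<T> type) {
        try {
            T instance = type.getDeclaredConstructor().newInstance();
            if (instance instanceof Configurable) {
                ((Configurable) instance).setConf(conf); // inject the configuration
            }
            return instance;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot instantiate " + type.getName(), e);
        }
    }
}
```

Two factories built from two different Conf instances can then live side by side in one JVM, which is what a gui that runs several differently configured instances would need.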

Any comments, improvement suggestions, more use-cases?
I would love to do this job, can I get a go from the other developers?
From my point of view NutchConf is actually a showstopper, since a  
lot of people run into trouble integrating nutch into other projects;  
my suggestions are also required to write a nutch gui.

Stefan







injection infinite loop

2006-01-04 Thread Andy Liu
If you inject a url file that doesn't end with a line feed into the crawldb,
the job enters an infinite loop.  Has anybody else encountered this problem?

060104 160950 Running job: job_7uku5w
060104 160952  map 0%
060104 160954  map 50%
060104 160957  map -2631%
060104 160959  map -259756%
060104 161002  map -538552%
060104 161006  map -818413%
060104 161009  map -1098421%
060104 161011  map -1377851%
060104 161014  map -1657718%
060104 161018  map -1939534%
060104 161021  map -2218515%
060104 161023  map -2588212%
060104 161026  map -2868787%
060104 161030  map -3147637%
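
The cause in Nutch is not identified in this message. As a general illustration only, here is a sketch of a line-reading loop that terminates whether or not the input ends with a line feed; UrlLineReader is a made-up class and is not the code the injector uses.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Illustrative only; not the reader used by the Nutch injector.
class UrlLineReader {

    // Reads non-empty lines; stops at end of input even without a final line feed.
    static List<String> readLines(Reader in) throws IOException {
        List<String> lines = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int c;
        while ((c = in.read()) != -1) {   // -1 means end of input: the loop must stop here
            if (c == '\n') {
                flush(current, lines);
            } else if (c != '\r') {
                current.append((char) c);
            }
        }
        flush(current, lines);            // the last line may have no trailing line feed
        return lines;
    }

    // Moves the pending line into the result, skipping empty lines.
    private static void flush(StringBuilder current, List<String> lines) {
        if (current.length() > 0) {
            lines.add(current.toString());
            current.setLength(0);
        }
    }

    // Convenience overload for in-memory text.
    static List<String> readLines(String text) {
        try {
            return readLines(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The two details that matter are treating end of input as a hard stop and emitting whatever is still buffered at that point.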


Re: mapred crawling exception - Job failed!

2006-01-04 Thread Lukas Vlcek
I gave it another try tonight and I am still having trouble.
This is the very end of my log (full version is attached) and you can
see another nasty exception:

...
060104 213644  map 100%
060104 213645 Optimizing index.
java.lang.NullPointerException: value cannot be null
at org.apache.lucene.document.Field.<init>(Field.java:469)
at org.apache.lucene.document.Field.<init>(Field.java:412)
at org.apache.lucene.document.Field.UnIndexed(Field.java:195)
at org.apache.nutch.indexer.Indexer.reduce(Indexer.java:199)
at org.apache.nutch.mapred.ReduceTask.run(ReduceTask.java:260)
at org.apache.nutch.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:90)
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:308)
at org.apache.nutch.indexer.Indexer.index(Indexer.java:259)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:121)

I tried turning off most of the parsing plugins but it didn't help, so
there is probably some more general issue.

Any ideas?

Regards,
Lukas

On 1/4/06, Lukas Vlcek [EMAIL PROTECTED] wrote:
 Thanks guys!
 I really didn't have the latest copy...
 L.
 On 1/4/06, Byron Miller [EMAIL PROTECTED] wrote:
  Fixed in the copy I run, as I've been able to get my
  100k pages indexed without getting that error.
 
  -byron
 
  --- Andrzej Bialecki [EMAIL PROTECTED] wrote:
 
   Lukas Vlcek wrote:
  
   Hi,
   
   I am trying to use the latest nutch-trunk version but I am facing
   an unexpected Job failed! exception. It seems that all crawling work
   has already been done but some threads are hung, which results in an
   exception after some timeout.
   
   
   
  
   This was fixed (or should be fixed :) in the
   revision r365576. Please
   report if it doesn't fix it for you.
  
 
 



Re: mapred crawling exception - Job failed!

2006-01-04 Thread Andrzej Bialecki

Lukas Vlcek wrote:


I gave it another try tonight and I am still having trouble.
This is the very end of my log (full version is attached) and you can
see another nasty exception:

 



Do you use the Fetcher in parsing or non-parsing mode, i.e. do you run a 
ParseSegment as a separate step?


--
Best regards,
Andrzej Bialecki 
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com