Re: found resource parse-plugins.xm?

2006-03-07 Thread Andrzej Bialecki

Hi,

I just applied your fix with minor changes. Thanks!

--
Best regards,
Andrzej Bialecki 
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




Re: found resource parse-plugins.xm?

2006-03-07 Thread Stefan Groschupf

Thanks!






---
company:http://www.media-style.com
forum:http://www.text-mining.org
blog:http://www.find23.net




Re: found resource parse-plugins.xm?

2006-03-06 Thread Stefan Groschupf

Hi Stack, :)
yes! This happens while fetching with parsing switched on, on a single
tasktracker that is trying to crawl a 10 million segment with 800 threads.

:-?
Stefan

On 07.03.2006 at 04:27, [EMAIL PROTECTED] wrote:


Stefan Groschupf wrote:

Hi,
after a short time I already had this line 1602 times in my
tasktracker log files:
060307 022707 task_m_2bu9o4  found resource parse-plugins.xml at file:/home/joa/nutch/conf/parse-plugins.xml


Sounds like this file has been loaded 1602 times (after, let's say,
3 minutes). I guess that wasn't the goal, or am I overlooking something?

Is it being loaded by the same task each time Stefan?
St.Ack



-
blog: http://www.find23.org
company: http://www.media-style.com




RE: found resource parse-plugins.xm?

2006-03-06 Thread Chris Mattmann
Hi Stefan,

 after a short time I already had this line 1602 times in my
 tasktracker log files:
 060307 022707 task_m_2bu9o4  found resource parse-plugins.xml at file:/home/joa/nutch/conf/parse-plugins.xml
 
 Sounds like this file has been loaded 1602 times (after, let's say,
 3 minutes). I guess that wasn't the goal, or am I overlooking something?

It certainly wasn't the goal at all. After NUTCH-88, Jerome and I had the
following line in the ParserFactory.java class:

  /** List of parser plugins. */
  private static final ParsePluginList PARSE_PLUGIN_LIST =
  new ParsePluginsReader().parse();


(see revision 326889)

Looking at the revision history for the ParserFactory file, after
NUTCH-169 was applied, the above changed to:


  private ParsePluginList parsePluginList;

  //... code here

  public ParserFactory(NutchConf nutchConf) {
    this.nutchConf = nutchConf;
    this.extensionPoint = nutchConf.getPluginRepository().getExtensionPoint(
        Parser.X_POINT_ID);
    this.parsePluginList = new ParsePluginsReader().parse(nutchConf);

    if (this.extensionPoint == null) {
      throw new RuntimeException("x point " + Parser.X_POINT_ID + " not found.");
    }
    if (this.parsePluginList == null) {
      throw new RuntimeException(
          "Parse Plugins preferences could not be loaded.");
    }
  }


Thus, every time a ParserFactory is constructed, the parse-plugins.xml
file is read (as a result of the call to
ParsePluginsReader().parse(nutchConf)). So, if the file is loaded 1602 times,
I'd guess that the ParserFactory is constructed 1602 times? Additionally, I'm
wondering why the parse-plugins.xml configuration parameters aren't declared
as final static anymore?
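
A minimal sketch of the effect described above, not the Nutch code itself. CountingReader, PerInstanceFactory and ReloadCountDemo are hypothetical stand-ins: the reader only counts how often parse() runs, and the factory loads the list in its constructor the way the post-NUTCH-169 ParserFactory does.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for ParsePluginsReader; counts how often parse() runs. */
class CountingReader {
  static final AtomicInteger LOADS = new AtomicInteger();

  /** Pretends to read parse-plugins.xml and returns a dummy result. */
  String parse() {
    LOADS.incrementAndGet();
    return "plugin-list";
  }
}

/** Mirrors the per-instance shape: the list is loaded in the constructor. */
class PerInstanceFactory {
  private final String parsePluginList;

  PerInstanceFactory() {
    this.parsePluginList = new CountingReader().parse();
  }

  String getParsePluginList() {
    return parsePluginList;
  }
}

class ReloadCountDemo {
  /** Builds n factories and returns how many loads that caused. */
  static int loadsFor(int n) {
    CountingReader.LOADS.set(0);
    for (int i = 0; i < n; i++) {
      new PerInstanceFactory();
    }
    return CountingReader.LOADS.get();
  }

  public static void main(String[] args) {
    // One load per constructed factory, so this prints 1602.
    System.out.println(loadsFor(1602));
  }
}
```

In this sketch the number of loads always equals the number of factories constructed, which would match the number of log lines seen if each line corresponds to one constructor call.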

 Loading this file just once could be a serious performance
 improvement.

Yup, I think that's the reason we made it final static. If there is no
reason not to have it final static, I would suggest putting it back. There
may be a problem, however: since NUTCH-169, the loading requires an existing
Configuration object, I believe. So we may need a static Configuration
object as well. Thoughts?
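
One possible middle ground between a single static constant and a per-instance load, sketched under assumptions: a static cache with one entry per configuration object. DemoConf and PerConfCache are hypothetical names, the "parsed:" string stands in for the real plugin list, and this is not what NUTCH-88 or NUTCH-169 actually did.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for a configuration object; compared by identity. */
class DemoConf {
  final String pluginFile;

  DemoConf(String pluginFile) {
    this.pluginFile = pluginFile;
  }
}

/** Static cache holding one parsed list per configuration instance. */
class PerConfCache {
  static final AtomicInteger LOADS = new AtomicInteger();
  private static final Map<DemoConf, String> CACHE = new ConcurrentHashMap<>();

  /** Loads the list for this conf on first use, then returns the cached value. */
  static String pluginListFor(DemoConf conf) {
    return CACHE.computeIfAbsent(conf, c -> {
      LOADS.incrementAndGet();          // stands in for reading the XML file
      return "parsed:" + c.pluginFile;  // dummy result
    });
  }
}
```

Note that such a map keeps a strong reference to every configuration it has seen, so a real implementation would need some way to drop stale entries.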

 I was not able to find the code that is logging this statement; does
 anyone have an idea where this happens?

The statement gets logged within the ParsePluginsReader.java class, line 98:

ppInputStream = conf.getConfResourceAsInputStream(
  conf.get(PP_FILE_PROP));

HTH,
  Chris


 
 Thanks.
 Stefan
 -
 blog: http://www.find23.org
 company: http://www.media-style.com




Re: found resource parse-plugins.xm?

2006-03-06 Thread Stefan Groschupf

Hi Chris,
thanks for the clarification.
Do you think we can somehow cache it in the nutchConf instance,
since this is the way we are doing it in other places as well?

Cheers,
Stefan







-
blog: http://www.find23.org
company: http://www.media-style.com




RE: found resource parse-plugins.xm?

2006-03-06 Thread Chris Mattmann
Hi Stefan,


 Hi Chris,
 thanks for the clarification.

No probs. 

 Do you think we can somehow cache it in the nutchConf instance,
 since this is the way we are doing it in other places as well?

Yeah I think we can. Here is a small patch to the ParserFactory that should
do the trick. Give it a test and let me know if it works. If it does, I
would say +1 to the committers to get this into the sources ASAP, no?

Index: src/java/org/apache/nutch/parse/ParserFactory.java
===
--- src/java/org/apache/nutch/parse/ParserFactory.java  (revision 383463)
+++ src/java/org/apache/nutch/parse/ParserFactory.java  (working copy)
@@ -55,7 +55,13 @@
     this.conf = conf;
     this.extensionPoint = PluginRepository.get(conf).getExtensionPoint(
         Parser.X_POINT_ID);
-    this.parsePluginList = new ParsePluginsReader().parse(conf);
+
+    if(conf.getObject("parsePluginList") != null){
+      this.parsePluginList = (ParsePluginList)conf.getObject("parsePluginList");
+    }
+    else{
+      this.parsePluginList = new ParsePluginsReader().parse(conf);
+    }
 
     if (this.extensionPoint == null) {
       throw new RuntimeException("x point " + Parser.X_POINT_ID + " not found.");


Cheers,
  Chris





RE: found resource parse-plugins.xm?

2006-03-06 Thread Chris Mattmann
Sorry,

 My last patch was missing one line. Here's the update:

Index: src/java/org/apache/nutch/parse/ParserFactory.java
===
--- src/java/org/apache/nutch/parse/ParserFactory.java  (revision 383463)
+++ src/java/org/apache/nutch/parse/ParserFactory.java  (working copy)
@@ -55,7 +55,14 @@
     this.conf = conf;
     this.extensionPoint = PluginRepository.get(conf).getExtensionPoint(
         Parser.X_POINT_ID);
-    this.parsePluginList = new ParsePluginsReader().parse(conf);
+
+    if(conf.getObject("parsePluginList") != null){
+      this.parsePluginList = (ParsePluginList)conf.getObject("parsePluginList");
+    }
+    else{
+      this.parsePluginList = new ParsePluginsReader().parse(conf);
+      conf.setObject("parsePluginList", this.parsePluginList);
+    }
 
     if (this.extensionPoint == null) {
       throw new RuntimeException("x point " + Parser.X_POINT_ID + " not found.");
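
For anyone who wants to try the idea outside of Nutch, here is a self-contained sketch of the same get-or-load pattern. ObjectStoreConf is a hypothetical stand-in for the configuration's getObject/setObject store, and CachingFactory and CachingFactoryDemo are illustrative names only. Unlike the patch, the sketch locks on the configuration so that the check and the load happen atomically; that is an addition made here with the 800 fetcher threads in mind, not something the patch does.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for the configuration's object store. */
class ObjectStoreConf {
  private final Map<String, Object> objects = new HashMap<>();

  synchronized Object getObject(String name) {
    return objects.get(name);
  }

  synchronized void setObject(String name, Object value) {
    objects.put(name, value);
  }
}

/** Factory that caches the loaded list in the conf, as the patch does. */
class CachingFactory {
  static final AtomicInteger LOADS = new AtomicInteger();
  private static final String KEY = "parsePluginList";
  private final Object parsePluginList;

  CachingFactory(ObjectStoreConf conf) {
    // Lock on the conf so that check-then-load is atomic across threads.
    synchronized (conf) {
      Object cached = conf.getObject(KEY);
      if (cached == null) {
        cached = expensiveLoad();
        conf.setObject(KEY, cached);
      }
      this.parsePluginList = cached;
    }
  }

  /** Stands in for ParsePluginsReader().parse(conf). */
  private static Object expensiveLoad() {
    LOADS.incrementAndGet();
    return "plugin-list";
  }

  Object getParsePluginList() {
    return parsePluginList;
  }
}

class CachingFactoryDemo {
  /** Constructs one factory per thread on a shared conf; returns the load count. */
  static int loadsWithThreads(int threads) {
    CachingFactory.LOADS.set(0);
    ObjectStoreConf conf = new ObjectStoreConf();
    Thread[] workers = new Thread[threads];
    for (int i = 0; i < threads; i++) {
      workers[i] = new Thread(() -> new CachingFactory(conf));
      workers[i].start();
    }
    try {
      for (Thread t : workers) {
        t.join();
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    return CachingFactory.LOADS.get();
  }
}
```

Without the lock, several threads could all see a null entry at startup and each load the file once; that would be harmless for correctness but would still produce a few extra loads before the cache is populated.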

