[jira] Commented: (NUTCH-309) Uses commons logging Code Guards
[ http://issues.apache.org/jira/browse/NUTCH-309?page=comments#action_12419670 ] Jerome Charron commented on NUTCH-309:

As already discussed, it makes perfect sense and I have planned to work on this issue. Another minor change I would like to make is to replace the log4j.properties by log4j.xml: log4j.xml provides more functionality and flexibility, especially filters, which provide a way to log to different appenders depending on the log level (for instance, I use this to log all levels to a file and the warn and error levels to the console).

Uses commons logging Code Guards
Key: NUTCH-309
URL: http://issues.apache.org/jira/browse/NUTCH-309
Project: Nutch
Type: Improvement
Versions: 0.8-dev
Reporter: Jerome Charron
Assignee: Jerome Charron
Priority: Minor
Fix For: 0.8-dev

Code guards are typically used to guard code that only needs to execute in support of logging, that otherwise introduces undesirable runtime overhead in the general case (logging disabled). Examples are multiple parameters, or expressions (e.g. string + more) for parameters. Use the guard methods of the form log.isPriority() to verify that logging should be performed, before incurring the overhead of the logging method call. Yes, the logging methods will perform the same check, but only after resolving parameters. (description extracted from http://jakarta.apache.org/commons/logging/guide.html#Code_Guards)

--
This message is automatically generated by JIRA.
- If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
- For more information on JIRA, see: http://www.atlassian.com/software/jira
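The filter-based routing Jerome describes could look like the hypothetical log4j.xml fragment below (a sketch, not from the Nutch tree): every level goes to a file appender, while a LevelRangeFilter restricts the console to warn and error.

```xml
<!-- Hypothetical log4j.xml fragment, not the actual Nutch configuration. -->
<appender name="file" class="org.apache.log4j.DailyRollingFileAppender">
  <param name="File" value="logs/nutch.log"/>
  <layout class="org.apache.log4j.PatternLayout">
    <param name="ConversionPattern" value="%d %-5p %c - %m%n"/>
  </layout>
</appender>

<appender name="console" class="org.apache.log4j.ConsoleAppender">
  <layout class="org.apache.log4j.PatternLayout">
    <param name="ConversionPattern" value="%-5p %m%n"/>
  </layout>
  <!-- Filters like this are only available in the XML format,
       not in log4j.properties. -->
  <filter class="org.apache.log4j.varia.LevelRangeFilter">
    <param name="LevelMin" value="WARN"/>
  </filter>
</appender>

<root>
  <level value="DEBUG"/>
  <appender-ref ref="file"/>
  <appender-ref ref="console"/>
</root>
```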
[jira] Resolved: (NUTCH-317) Clarify what the queryLanguage argument of Query.parse(...) means
[ http://issues.apache.org/jira/browse/NUTCH-317?page=all ] Jerome Charron resolved NUTCH-317:

Fix Version: 0.8-dev
Resolution: Fixed

Fixed.

Clarify what the queryLanguage argument of Query.parse(...) means
Key: NUTCH-317
URL: http://issues.apache.org/jira/browse/NUTCH-317
Project: Nutch
Type: Bug
Components: searcher
Versions: 0.8-dev
Reporter: KuroSaka TeruHiko
Fix For: 0.8-dev

The API documentation for Query.parse(String queryString, String queryLang, Configuration conf) does not explain what queryLang is, and it should be explained. There are at least two interpretations: (1) Create a Query that restricts the search to include only the documents written in the specified language. This would be the equivalent of specifying lang:xx, where xx is a two-letter language code. (2) Create a Query interpreting the queryString according to the rules of the specified language. In reality, this is used to select the proper language Analyzer to parse the query string. I am guessing that (2) is intended.
[jira] Commented: (NUTCH-309) Uses commons logging Code Guards
[ http://issues.apache.org/jira/browse/NUTCH-309?page=comments#action_12418404 ] Jerome Charron commented on NUTCH-309:

Dawid, you know, sed, awk and regexes are my friends, so it was not so painful ;-) As I mentioned in a previous mail, it was just a crude pass on logging: a finer one is planned to review log levels and code guards. AspectJ: +1 for using it for logging, but I don't know what the performance impacts are...

Uses commons logging Code Guards
Key: NUTCH-309
URL: http://issues.apache.org/jira/browse/NUTCH-309
Project: Nutch
Type: Improvement
Versions: 0.8-dev
Reporter: Jerome Charron
Assignee: Jerome Charron
Priority: Minor
Fix For: 0.8-dev

Code guards are typically used to guard code that only needs to execute in support of logging, that otherwise introduces undesirable runtime overhead in the general case (logging disabled). Examples are multiple parameters, or expressions (e.g. string + more) for parameters. Use the guard methods of the form log.isPriority() to verify that logging should be performed, before incurring the overhead of the logging method call. Yes, the logging methods will perform the same check, but only after resolving parameters. (description extracted from http://jakarta.apache.org/commons/logging/guide.html#Code_Guards)
[jira] Created: (NUTCH-309) Uses commons logging Code Guards
Uses commons logging Code Guards
Key: NUTCH-309
URL: http://issues.apache.org/jira/browse/NUTCH-309
Project: Nutch
Type: Improvement
Versions: 0.8-dev
Reporter: Jerome Charron
Assigned to: Jerome Charron
Priority: Minor
Fix For: 0.8-dev

Code guards are typically used to guard code that only needs to execute in support of logging, that otherwise introduces undesirable runtime overhead in the general case (logging disabled). Examples are multiple parameters, or expressions (e.g. string + more) for parameters. Use the guard methods of the form log.isPriority() to verify that logging should be performed, before incurring the overhead of the logging method call. Yes, the logging methods will perform the same check, but only after resolving parameters. (description extracted from http://jakarta.apache.org/commons/logging/guide.html#Code_Guards)
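The code-guard idiom described above can be sketched as follows. Nutch uses the commons-logging guard methods (log.isDebugEnabled() and friends); this self-contained sketch shows the same idea with the JDK's java.util.logging, so the message-building cost can be counted:

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class GuardExample {
    static final Logger LOG = Logger.getLogger(GuardExample.class.getName());
    static int buildCount = 0; // counts how often the log message was built

    // Stands in for an expensive expression passed to the log call,
    // e.g. string concatenation of several parameters.
    static String expensiveMessage() {
        buildCount++;
        return "fetched " + "http://example.com/" + " (" + 1024 + " bytes)";
    }

    static void fetchWithGuard() {
        // The guard: the message is only built when FINE logging is enabled.
        // Without it, expensiveMessage() would run even when the level is off.
        if (LOG.isLoggable(Level.FINE)) {
            LOG.fine(expensiveMessage());
        }
    }

    public static void main(String[] args) {
        LOG.setLevel(Level.WARNING); // FINE is disabled
        fetchWithGuard();
        // The guard skipped message construction entirely.
        System.out.println("buildCount=" + buildCount); // prints buildCount=0
    }
}
```

The logging method would perform the same level check internally, but only after the message string has already been concatenated; the guard avoids that work.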
[jira] Created: (NUTCH-310) Review Log Levels
Review Log Levels
Key: NUTCH-310
URL: http://issues.apache.org/jira/browse/NUTCH-310
Project: Nutch
Type: Improvement
Versions: 0.8-dev
Reporter: Jerome Charron
Assigned to: Jerome Charron
Priority: Minor
Fix For: 0.8-dev

Review of log content and log levels (see the Commons Logging Best Practices: http://jakarta.apache.org/commons/logging/guide.html#Message_Priorities_Levels)
[jira] Resolved: (NUTCH-309) Uses commons logging Code Guards
[ http://issues.apache.org/jira/browse/NUTCH-309?page=all ] Jerome Charron resolved NUTCH-309:

Resolution: Fixed

Logging code guards added. http://svn.apache.org/viewvc?view=rev&revision=416346

Uses commons logging Code Guards
Key: NUTCH-309
URL: http://issues.apache.org/jira/browse/NUTCH-309
Project: Nutch
Type: Improvement
Versions: 0.8-dev
Reporter: Jerome Charron
Assignee: Jerome Charron
Priority: Minor
Fix For: 0.8-dev

Code guards are typically used to guard code that only needs to execute in support of logging, that otherwise introduces undesirable runtime overhead in the general case (logging disabled). Examples are multiple parameters, or expressions (e.g. string + more) for parameters. Use the guard methods of the form log.isPriority() to verify that logging should be performed, before incurring the overhead of the logging method call. Yes, the logging methods will perform the same check, but only after resolving parameters. (description extracted from http://jakarta.apache.org/commons/logging/guide.html#Code_Guards)
[jira] Resolved: (NUTCH-307) wrong configured log4j.properties
[ http://issues.apache.org/jira/browse/NUTCH-307?page=all ] Jerome Charron resolved NUTCH-307:

Resolution: Fixed
Assign To: Jerome Charron

Nutch now uses the Hadoop variable names for the file name used by DRFA logging.

wrong configured log4j.properties
Key: NUTCH-307
URL: http://issues.apache.org/jira/browse/NUTCH-307
Project: Nutch
Type: Bug
Reporter: Stefan Groschupf
Assignee: Jerome Charron
Priority: Blocker
Fix For: 0.8-dev

In nutch/conf there is only one log4j.properties, and it defines: log4j.appender.DRFA.File=${nutch.log.dir}/${nutch.log.file} nutch.log.dir and nutch.log.file are only defined in the bin/nutch script. When starting a distributed Nutch instance with bin/start-all, the remote tasktracker crashes with:

java.io.FileNotFoundException: / (Is a directory)
cr06: at java.io.FileOutputStream.openAppend(Native Method)
cr06: at java.io.FileOutputStream.<init>(FileOutputStream.java:177)
cr06: at java.io.FileOutputStream.<init>(FileOutputStream.java:102)
cr06: at org.apache.log4j.FileAppender.setFile(FileAppender.java:289)
cr06: at org.apache.log4j.FileAppender.activateOptions(FileAppender.java:163)
cr06: at org.apache.log4j.DailyRollingFileAppender.activateOptions(DailyRollingFileAppender.java:215)
cr06: at org.apache.log4j.config.PropertySetter.activate(PropertySetter.java:256)

since the Hadoop scripts used to start the tasktrackers and datanodes never define the Nutch log properties, but log4j.properties requires such a definition. I suggest leaving log4j.properties as it is in Hadoop, but defining the Hadoop property names in the bin/nutch script instead of introducing new variable names.
[jira] Commented: (NUTCH-307) wrong configured log4j.properties
[ http://issues.apache.org/jira/browse/NUTCH-307?page=comments#action_12416895 ] Jerome Charron commented on NUTCH-307:

Hi Stefan, thanks for this feedback. In fact, as I mentioned in a previous mail (http://www.mail-archive.com/nutch-dev%40lucene.apache.org/msg03907.html) I had some hesitations about using the hadoop properties instead of introducing some nutch properties. I'll change this right now!

wrong configured log4j.properties
Key: NUTCH-307
URL: http://issues.apache.org/jira/browse/NUTCH-307
Project: Nutch
Type: Bug
Reporter: Stefan Groschupf
Priority: Blocker
Fix For: 0.8-dev

In nutch/conf there is only one log4j.properties, and it defines: log4j.appender.DRFA.File=${nutch.log.dir}/${nutch.log.file} nutch.log.dir and nutch.log.file are only defined in the bin/nutch script. When starting a distributed Nutch instance with bin/start-all, the remote tasktracker crashes with:

java.io.FileNotFoundException: / (Is a directory)
cr06: at java.io.FileOutputStream.openAppend(Native Method)
cr06: at java.io.FileOutputStream.<init>(FileOutputStream.java:177)
cr06: at java.io.FileOutputStream.<init>(FileOutputStream.java:102)
cr06: at org.apache.log4j.FileAppender.setFile(FileAppender.java:289)
cr06: at org.apache.log4j.FileAppender.activateOptions(FileAppender.java:163)
cr06: at org.apache.log4j.DailyRollingFileAppender.activateOptions(DailyRollingFileAppender.java:215)
cr06: at org.apache.log4j.config.PropertySetter.activate(PropertySetter.java:256)

since the Hadoop scripts used to start the tasktrackers and datanodes never define the Nutch log properties, but log4j.properties requires such a definition. I suggest leaving log4j.properties as it is in Hadoop, but defining the Hadoop property names in the bin/nutch script instead of introducing new variable names.
[jira] Commented: (NUTCH-110) OpenSearchServlet outputs illegal xml characters
[ http://issues.apache.org/jira/browse/NUTCH-110?page=comments#action_12416523 ] Jerome Charron commented on NUTCH-110:

This patch processes the String twice if it contains some illegal characters!

OpenSearchServlet outputs illegal xml characters
Key: NUTCH-110
URL: http://issues.apache.org/jira/browse/NUTCH-110
Project: Nutch
Type: Bug
Components: searcher
Versions: 0.7
Environment: linux, jdk 1.5
Reporter: [EMAIL PROTECTED]
Attachments: NUTCH-110-version2.patch, fixIllegalXmlChars.patch, fixIllegalXmlChars08-v2.patch, fixIllegalXmlChars08.patch

OpenSearchServlet does not check text-to-output for illegal xml characters; depending on the search result, it's possible for OSS to output xml that is not well-formed. For example, if the text has the FF character in it, i.e. the ASCII character at position (decimal) 12, the produced XML will show the FF character as '&#12;'. The character/entity '&#12;' is not legal in XML according to http://www.w3.org/TR/2000/REC-xml-20001006#NT-Char.
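A single-pass filter of the kind Jerome is asking for could look like this sketch (class and method names are illustrative, not from the attached patches); it keeps exactly the characters allowed by the XML 1.0 Char production and drops the rest in one traversal:

```java
public class XmlCharFilter {
    // True iff c is legal in XML 1.0 per the Char production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isLegalXml(int c) {
        return c == 0x9 || c == 0xA || c == 0xD
            || (c >= 0x20 && c <= 0xD7FF)
            || (c >= 0xE000 && c <= 0xFFFD)
            || (c >= 0x10000 && c <= 0x10FFFF);
    }

    // Single pass over the string: copy legal code points, drop illegal ones.
    static String stripIllegal(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int c = s.codePointAt(i);
            if (isLegalXml(c)) {
                sb.appendCodePoint(c);
            }
            i += Character.charCount(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String in = "page\u000Cbreak"; // contains the form-feed (decimal 12)
        System.out.println(stripIllegal(in)); // prints pagebreak
    }
}
```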
[jira] Closed: (NUTCH-236) PdfParser and RSSParser Log4j appender redirection
[ http://issues.apache.org/jira/browse/NUTCH-236?page=all ] Jerome Charron closed NUTCH-236:

Fix Version: 0.8-dev
Resolution: Fixed

As a side effect, this issue is solved by NUTCH-303, since nutch now uses Jakarta Commons Logging with the log4j default implementation.

PdfParser and RSSParser Log4j appender redirection
Key: NUTCH-236
URL: http://issues.apache.org/jira/browse/NUTCH-236
Project: Nutch
Type: Bug
Versions: 0.8-dev
Environment: Linux, Nutch embedded in another application
Reporter: Jason Calabrese
Assignee: Chris A. Mattmann
Priority: Minor
Fix For: 0.8-dev
Attachments: NUTCH-236.Mattmann.060806.patch.txt

I just found a bug in the way the log messages from the Hadoop LogFormatter are added as a new appender to the Log4j rootLogger in the PdfParser and RSSParser. Since a new Log4j appender is created and added to the root logger each time these classes are loaded, log messages start getting repeated. I'm using Nutch/Hadoop inside another application, so others may not be seeing this problem. I think the simple fix is as easy as setting a name for the new appender before adding it, and then at the beginning of the constructor checking to see if it's already been added. Also, as the comment says in both the PdfParser and RSSParser, this code should be moved to a common place. I'd be happy to make these changes and submit a patch, but I wanted to know if the change would be welcome first. Also, does anyone know a good place for the new util method? Maybe a new static method on LogFormatter, but then the log4j jar would need to be added to the common lib and the classpath. It would also be good to create a property in nutch-site.xml that could disable this logging appender redirection. Like I said above, I'd be more than happy to do this work, I'll just need some guidance to follow the project's conventions.
[jira] Commented: (NUTCH-258) Once Nutch logs a SEVERE log item, Nutch fails forevermore
[ http://issues.apache.org/jira/browse/NUTCH-258?page=comments#action_12415984 ] Jerome Charron commented on NUTCH-258:

Thanks for this patch Chris - even if it is now outdated by NUTCH-303 :-( Since Nutch no longer uses the deprecated Hadoop LogFormatter, there is no longer a logSevere check in the code. So we quickly need a patch for this issue in order to keep the same behavior. In your patch Chris, you set a severe flag each time a severe log is written. But I'm not sure all these severe logs should be marked as severe (the fatal level is used now). For instance, is it really fatal for the fetcher that the conf file for RegexUrlNormalizer is wrong? Is it really fatal for the fetcher if the language identifier raises an exception while loading ngram profiles? Is it really fatal for the fetcher if the ontology plugin fails on reading an ontology? But it is surely fatal if the user-agent is not correctly set in the http plugins! So, what I suggest is to review all the fatal logs and check if they are really fatal for the whole process. And finally, why not simply throw a RuntimeException that will be caught by the Fetcher if something really wrong occurs?

Once Nutch logs a SEVERE log item, Nutch fails forevermore
Key: NUTCH-258
URL: http://issues.apache.org/jira/browse/NUTCH-258
Project: Nutch
Type: Bug
Components: fetcher
Versions: 0.8-dev
Environment: All
Reporter: Scott Ganyo
Assignee: Chris A. Mattmann
Priority: Critical
Attachments: NUTCH-258.Mattmann.060906.patch.txt, dumbfix.patch

Once a SEVERE log item is written, Nutch shuts down any fetching forevermore. This is from the run() method in Fetcher.java:

public void run() {
  synchronized (Fetcher.this) {activeThreads++;} // count threads
  try {
    UTF8 key = new UTF8();
    CrawlDatum datum = new CrawlDatum();
    while (true) {
      if (LogFormatter.hasLoggedSevere()) // something bad happened
        break; // exit

Notice the last 2 lines. This will prevent Nutch from ever fetching again once this is hit, as LogFormatter stores this data as a static. (Also note that LogFormatter.hasLoggedSevere() is also checked in org.apache.nutch.net.URLFilterChecker and will disable this class as well.) This must be fixed or Nutch cannot be run as any kind of long-running service. Furthermore, I believe it is a poor decision to rely on a logging event to determine the state of the application - this could have any number of side-effects that would be extremely difficult to track down. (As it has already for me.)
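The RuntimeException alternative Jerome floats in his comment could be sketched like this. All names (FatalFetchError, checkUserAgent, runFetchLoop) are illustrative, not actual Fetcher code; the point is that a fatal condition aborts only the current run, with no static flag poisoning later runs:

```java
public class FetcherSketch {
    // An exception type for conditions that really are fatal for a fetch run.
    static class FatalFetchError extends RuntimeException {
        FatalFetchError(String msg) { super(msg); }
    }

    // Example of a genuinely fatal check: a misconfigured user-agent.
    static void checkUserAgent(String agent) {
        if (agent == null || agent.isEmpty()) {
            throw new FatalFetchError("http.agent is not set");
        }
    }

    // Returns true if the fetch run could proceed.
    static boolean runFetchLoop(String agent) {
        try {
            checkUserAgent(agent);
            return true;           // fetching proceeds
        } catch (FatalFetchError e) {
            return false;          // this run stops, but no static state
        }                          // disables future runs
    }

    public static void main(String[] args) {
        System.out.println(runFetchLoop(null));       // prints false: fatal, run aborts
        System.out.println(runFetchLoop("NutchCVS")); // prints true: a later run still works
    }
}
```

Unlike the hasLoggedSevere() static, the exception is scoped to one run, which is what a long-running service needs.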
[jira] Resolved: (NUTCH-303) logging improvements
[ http://issues.apache.org/jira/browse/NUTCH-303?page=all ] Jerome Charron resolved NUTCH-303:

Resolution: Fixed

Nutch now uses the Commons Logging API with log4j as the default implementation. There are 3 log4j.properties configuration files:
1. conf/log4j.properties, used by the back-end. It uses the Daily Rolling File Appender by default. By default, the log file is located at $NUTCH_HOME/logs/nutch.log; another location can be specified with the env. variables $NUTCH_LOG_DIR and $NUTCH_LOGFILE.
2. src/web/log4j.properties, used by the front-end container. It uses the Console Appender by default.
3. src/test/log4j.properties, used by unit tests. It uses the Console Appender by default.
I have tested this patch on the front-end, the back-end and the unit test environments. But please note that I have only one box available, so I have only tested it in a mono-deployment environment.

logging improvements
Key: NUTCH-303
URL: http://issues.apache.org/jira/browse/NUTCH-303
Project: Nutch
Type: Improvement
Versions: 0.8-dev
Reporter: Jerome Charron
Assignee: Jerome Charron
Priority: Minor
Fix For: 0.8-dev

Switch to the Apache Commons Logging facade. See HADOOP-211 and the following thread: http://www.mail-archive.com/nutch-developers%40lists.sourceforge.net/msg08706.html
[jira] Commented: (NUTCH-301) CommonGrams loads analysis.common.terms.file for each query
[ http://issues.apache.org/jira/browse/NUTCH-301?page=comments#action_12415098 ] Jerome Charron commented on NUTCH-301:

We can store the CommonGrams instance in the Configuration, as is already done in many places in the Nutch code.

CommonGrams loads analysis.common.terms.file for each query
Key: NUTCH-301
URL: http://issues.apache.org/jira/browse/NUTCH-301
Project: Nutch
Type: Improvement
Components: searcher
Versions: 0.8-dev
Reporter: Chris Schneider

The move away from static objects toward instance variables has resulted in the CommonGrams constructor parsing its analysis.common.terms.file for each query. I'm not certain how large a performance impact this really is, but it seems like something you'd want to avoid doing for each query. Perhaps the solution is to keep around an instance of the CommonGrams object itself?
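The caching Jerome suggests amounts to building the CommonGrams instance once per configuration and reusing it across queries. The sketch below illustrates that shape with a plain keyed cache; the CommonGrams stand-in, cache key, and helper are assumptions for illustration, not the actual Nutch Configuration API:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class CommonGramsCache {
    // Stand-in for the real CommonGrams class; its constructor is the
    // expensive step (it would parse analysis.common.terms.file).
    static class CommonGrams {
        CommonGrams() {
            // expensive parsing would happen here, once per cache key
        }
    }

    private static final Map<String, CommonGrams> CACHE = new ConcurrentHashMap<>();

    // One instance per configuration key: the constructor runs at most
    // once per key, not once per query.
    static CommonGrams get(String confKey) {
        return CACHE.computeIfAbsent(confKey, k -> new CommonGrams());
    }

    public static void main(String[] args) {
        CommonGrams first = get("default");
        CommonGrams second = get("default");
        System.out.println(first == second); // prints true: same cached instance
    }
}
```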
[jira] Resolved: (NUTCH-275) Fetcher not parsing XHTML-pages at all
[ http://issues.apache.org/jira/browse/NUTCH-275?page=all ] Jerome Charron resolved NUTCH-275:

Fix Version: 0.8-dev
Resolution: Fixed

Magic guessing removed for the xml content-type.

Fetcher not parsing XHTML-pages at all
Key: NUTCH-275
URL: http://issues.apache.org/jira/browse/NUTCH-275
Project: Nutch
Type: Bug
Versions: 0.8-dev
Environment: problem with nightly-2006-05-20; worked fine with same website on 0.7.2
Reporter: Stefan Neufeind
Fix For: 0.8-dev

The server reports the page as text/html, so I thought it would be processed as html. But something, I guess, evaluated the headers of the document and re-labeled it as text/xml (why not text/xhtml?). For some reason there is no plugin to be found for indexing text/xml (why does TextParser not feel responsible?). Links inside this document are NOT indexed at all, so digging this website actually stops here. Funny thing: for some magical reason the dtd-files referenced in the header seem to be valid links for the fetcher and as such are indexed in the next round (if the urlfilter allows).

060521 025018 fetching http://www.secreturl.something/
060521 025018 http.proxy.host = null
060521 025018 http.proxy.port = 8080
060521 025018 http.timeout = 1
060521 025018 http.content.limit = 65536
060521 025018 http.agent = NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
060521 025018 fetcher.server.delay = 1000
060521 025018 http.max.delays = 1000
060521 025018 ParserFactory: Plugin: org.apache.nutch.parse.text.TextParser mapped to contentType text/xml via parse-plugins.xml, but its plugin.xml file does not claim to support contentType: text/xml
060521 025018 ParserFactory: Plugin: org.apache.nutch.parse.rss.RSSParser mapped to contentType text/xml via parse-plugins.xml, but not enabled via plugin.includes in nutch-default.xml
060521 025019 Using Signature impl: org.apache.nutch.crawl.MD5Signature
060521 025019 map 0% reduce 0%
060521 025019 1 pages, 0 errors, 1.0 pages/s, 40 kb/s,
060521 025019 1 pages, 0 errors, 1.0 pages/s, 40 kb/s,
[jira] Created: (NUTCH-303) logging improvements
logging improvements
Key: NUTCH-303
URL: http://issues.apache.org/jira/browse/NUTCH-303
Project: Nutch
Type: Improvement
Versions: 0.8-dev
Reporter: Jerome Charron
Assigned to: Jerome Charron
Priority: Minor
Fix For: 0.8-dev

Switch to the Apache Commons Logging facade. See HADOOP-211 and the following thread: http://www.mail-archive.com/nutch-developers%40lists.sourceforge.net/msg08706.html
[jira] Resolved: (NUTCH-301) CommonGrams loads analysis.common.terms.file for each query
[ http://issues.apache.org/jira/browse/NUTCH-301?page=all ] Jerome Charron resolved NUTCH-301:

Fix Version: 0.8-dev
Resolution: Fixed

Patch applied with some minor modifications. Thanks Stefan.

CommonGrams loads analysis.common.terms.file for each query
Key: NUTCH-301
URL: http://issues.apache.org/jira/browse/NUTCH-301
Project: Nutch
Type: Improvement
Components: searcher
Versions: 0.8-dev
Reporter: Chris Schneider
Fix For: 0.8-dev
Attachments: CommonGramsCacheV1.patch

The move away from static objects toward instance variables has resulted in the CommonGrams constructor parsing its analysis.common.terms.file for each query. I'm not certain how large a performance impact this really is, but it seems like something you'd want to avoid doing for each query. Perhaps the solution is to keep around an instance of the CommonGrams object itself?
[jira] Resolved: (NUTCH-298) if a 404 for a robots.txt is returned a NPE is thrown
[ http://issues.apache.org/jira/browse/NUTCH-298?page=all ] Jerome Charron resolved NUTCH-298:

Resolution: Fixed

Committed + some unit tests to reproduce. Thanks Stefan. As you mentioned in a previous mail, I agree that the RobotRulesParser should be rewritten.

if a 404 for a robots.txt is returned a NPE is thrown
Key: NUTCH-298
URL: http://issues.apache.org/jira/browse/NUTCH-298
Project: Nutch
Type: Bug
Reporter: Stefan Groschupf
Fix For: 0.8-dev
Attachments: fixNpeRobotRuleSet.patch

What happens: if no RobotRuleSet is in the cache for a host, we try to fetch the robots.txt. In case the http response code is not 200 or 403 but, for example, 404, we do robotRules = EMPTY_RULES; (line: 402). EMPTY_RULES is a RobotRuleSet created with the default constructor; tmpEntries and entries are null and will never change. If we now try to fetch a page from that host, EMPTY_RULES is used and we call isAllowed on the RobotRuleSet. In this case a NPE is thrown on this line: if (entries == null) { entries = new RobotsEntry[tmpEntries.size()]; Possible solution: we can initialize tmpEntries by default and also remove the other null checks and initializations.
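The fix Stefan proposes, eager initialization so an empty rule set never dereferences null, can be sketched as below. This is a simplified stand-in, not the actual RobotRulesParser internals; class and method names are illustrative:

```java
import java.util.ArrayList;
import java.util.List;

public class RobotRuleSet {
    static class RobotsEntry {
        final String prefix;
        final boolean allowed;
        RobotsEntry(String prefix, boolean allowed) {
            this.prefix = prefix;
            this.allowed = allowed;
        }
    }

    // Eager initialization: an empty rule set holds an empty list, never null,
    // so isAllowed() cannot throw a NullPointerException.
    private final List<RobotsEntry> entries = new ArrayList<>();

    // Shared empty rule set, used e.g. when robots.txt returns a 404.
    static final RobotRuleSet EMPTY_RULES = new RobotRuleSet();

    void addEntry(String prefix, boolean allowed) {
        entries.add(new RobotsEntry(prefix, allowed));
    }

    boolean isAllowed(String path) {
        for (RobotsEntry e : entries) {
            if (path.startsWith(e.prefix)) return e.allowed;
        }
        return true; // no matching rule: allowed by default
    }

    public static void main(String[] args) {
        // Previously this path hit a NullPointerException; with eager
        // initialization it simply allows the fetch.
        System.out.println(EMPTY_RULES.isAllowed("/index.html")); // prints true
    }
}
```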
[jira] Commented: (NUTCH-275) Fetcher not parsing XHTML-pages at all
[ http://issues.apache.org/jira/browse/NUTCH-275?page=comments#action_12412835 ] Jerome Charron commented on NUTCH-275:

This problem has already been reported by Doug: http://mail-archive.com/nutch-dev%40lucene.apache.org/msg03474.html It is related to magic-based content-type guessing. Nothing has been decided about this for now, but I should work on it. Workarounds:
* deactivate the mime-type magic resolution (mime.type.magic = false)
* or remove the magic offset=0 ... line in mime-types.xml
Thanks for opening a jira issue about this.

Fetcher not parsing XHTML-pages at all
Key: NUTCH-275
URL: http://issues.apache.org/jira/browse/NUTCH-275
Project: Nutch
Type: Bug
Versions: 0.8-dev
Environment: problem with nightly-2006-05-20; worked fine with same website on 0.7.2
Reporter: Stefan Neufeind

The server reports the page as text/html, so I thought it would be processed as html. But something, I guess, evaluated the headers of the document and re-labeled it as text/xml (why not text/xhtml?). For some reason there is no plugin to be found for indexing text/xml (why does TextParser not feel responsible?). Links inside this document are NOT indexed at all, so digging this website actually stops here. Funny thing: for some magical reason the dtd-files referenced in the header seem to be valid links for the fetcher and as such are indexed in the next round (if the urlfilter allows).

060521 025018 fetching http://www.secreturl.something/
060521 025018 http.proxy.host = null
060521 025018 http.proxy.port = 8080
060521 025018 http.timeout = 1
060521 025018 http.content.limit = 65536
060521 025018 http.agent = NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
060521 025018 fetcher.server.delay = 1000
060521 025018 http.max.delays = 1000
060521 025018 ParserFactory: Plugin: org.apache.nutch.parse.text.TextParser mapped to contentType text/xml via parse-plugins.xml, but its plugin.xml file does not claim to support contentType: text/xml
060521 025018 ParserFactory: Plugin: org.apache.nutch.parse.rss.RSSParser mapped to contentType text/xml via parse-plugins.xml, but not enabled via plugin.includes in nutch-default.xml
060521 025019 Using Signature impl: org.apache.nutch.crawl.MD5Signature
060521 025019 map 0% reduce 0%
060521 025019 1 pages, 0 errors, 1.0 pages/s, 40 kb/s,
060521 025019 1 pages, 0 errors, 1.0 pages/s, 40 kb/s,
[jira] Resolved: (NUTCH-134) Summarizer doesn't select the best snippets
[ http://issues.apache.org/jira/browse/NUTCH-134?page=all ] Jerome Charron resolved NUTCH-134:

Fix Version: 0.8-dev
Resolution: Fixed
Assign To: Jerome Charron

Solution proposed by Andrzej implemented.

Summarizer doesn't select the best snippets
Key: NUTCH-134
URL: http://issues.apache.org/jira/browse/NUTCH-134
Project: Nutch
Type: Bug
Components: searcher
Versions: 0.7, 0.8-dev, 0.7.1, 0.7.2
Reporter: Andrzej Bialecki
Assignee: Jerome Charron
Fix For: 0.8-dev
Attachments: summarizer.060506.patch

Summarizer.java tries to select the best fragments from the input text, where the frequency of query terms is the highest. However, the logic in line 223 is flawed in that the excerptSet.add() operation will add new excerpts only if they are not already present - the test is performed using the Comparator that compares only the numUniqueTokens. This means that if there are two or more excerpts which score equally high, only the first of them will be retained, and the rest of the equally-scoring excerpts will be discarded in favor of other excerpts (possibly lower-scoring). To fix this, the Set should be replaced with a List + a sort operation. To keep the relative position of excerpts in the original order, the Excerpt class should be extended with an int order field, and the collected excerpts should be sorted in that order prior to adding them to the summary.
[jira] Updated: (NUTCH-134) Summarizer doesn't select the best snippets
[ http://issues.apache.org/jira/browse/NUTCH-134?page=all ] Jerome Charron updated NUTCH-134: - Attachment: summarizer.060506.patch Here is a patch that add a summarizer extension point and two summarizer plugins : summarizer-basic (the current nutch implementation) and summarizer-lucene (the lucene highlighter implementation). Please notice that the lucene plugin is a very crude implementation : the highlighter directly constructs a text representation of the summary, so we need to parse the text to build a Summary object!!! (improvements are welcome). This is a first step to this issue resolution. If no objection, I will commit this patch in the next few days and then: 1. Fix in the summarizer-basic the original issue reported by Andrzej 2. Add a toString(Encoder, Formatter) method in Summarizer so that a Summary object could be encoded and formatted with many implementations (it is the same logic as the one in Lucene Highlight) - Andrzej, do you prefer this solution or a solution where Summary is Writable? PS: Chris, sorry but the major part of this patch was already done when you added your comment. Summarizer doesn't select the best snippets --- Key: NUTCH-134 URL: http://issues.apache.org/jira/browse/NUTCH-134 Project: Nutch Type: Bug Components: searcher Versions: 0.7.2, 0.7.1, 0.7, 0.8-dev Reporter: Andrzej Bialecki Attachments: summarizer.060506.patch Summarizer.java tries to select the best fragments from the input text, where the frequency of query terms is the highest. However, the logic in line 223 is flawed in that the excerptSet.add() operation will add new excerpts only if they are not already present - the test is performed using the Comparator that compares only the numUniqueTokens. This means that if there are two or more excerpts, which score equally high, only the first of them will be retained, and the rest of equally-scoring excerpts will be discarded, in favor of other excerpts (possibly lower scoring). 
To fix this, the Set should be replaced with a List plus a sort operation. To keep the relative position of excerpts in the original order, the Excerpt class should be extended with an int order field, and the collected excerpts should be sorted in that order prior to adding them to the summary. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
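The proposed fix can be sketched as follows; Excerpt here is an illustrative stand-in for the class in Summarizer.java, not the actual Nutch code. A stable sort on a List keeps equally-scoring excerpts that the SortedSet-based code drops, and a second sort by an order field restores document order.

```java
import java.util.*;

public class ExcerptSelection {
    // Illustrative stand-in for Nutch's Summarizer excerpt
    static class Excerpt {
        final String text;
        final int numUniqueTokens; // score: distinct query terms in the excerpt
        final int order;           // position of the excerpt in the original text
        Excerpt(String text, int numUniqueTokens, int order) {
            this.text = text; this.numUniqueTokens = numUniqueTokens; this.order = order;
        }
    }

    /** Keep the maxExcerpts best-scoring excerpts (ties included),
     *  then restore their original document order. */
    static List<Excerpt> select(List<Excerpt> all, int maxExcerpts) {
        List<Excerpt> sorted = new ArrayList<>(all);
        // A List keeps equal-scoring excerpts; a SortedSet whose comparator
        // looks only at numUniqueTokens silently drops them (the reported bug).
        sorted.sort((a, b) -> b.numUniqueTokens - a.numUniqueTokens);
        List<Excerpt> best = sorted.subList(0, Math.min(maxExcerpts, sorted.size()));
        List<Excerpt> result = new ArrayList<>(best);
        result.sort(Comparator.comparingInt(e -> e.order));
        return result;
    }
}
```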
[jira] Commented: (NUTCH-263) MapWritable.equals() doesn't work properly
[ http://issues.apache.org/jira/browse/NUTCH-263?page=comments#action_12377749 ] Jerome Charron commented on NUTCH-263: -- Andrzej, a small but efficient improvement could be to check the maps' sizes prior to any other tests: if (obj instanceof MapWritable) { MapWritable map = (MapWritable) obj; if (map.fSize == fSize) { ... } } return false; No? MapWritable.equals() doesn't work properly -- Key: NUTCH-263 URL: http://issues.apache.org/jira/browse/NUTCH-263 Project: Nutch Type: Bug Versions: 0.8-dev Reporter: Andrzej Bialecki Attachments: patch1.txt MapWritable.equals() is sensitive to the order in which map entries have been created. E.g. this fails but it should succeed: MapWritable map1 = new MapWritable(); MapWritable map2 = new MapWritable(); map1.put(new UTF8(key1), new UTF8(val1)); map1.put(new UTF8(key2), new UTF8(val2)); map2.put(new UTF8(key2), new UTF8(val2)); map2.put(new UTF8(key1), new UTF8(val1)); assertTrue(map1.equals(map2)); Users expect that this should not be the case, i.e. this class should follow the same rules as Map.equals() (Returns true if the given object is also a map and the two Maps represent the same mappings).
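A minimal sketch of an equals() that follows the java.util.Map.equals() contract and applies the size check suggested above (MyMapWritable and its String entries are illustrative; the real MapWritable stores Writable keys and values):

```java
import java.util.*;

/** Illustrative order-insensitive equals(), following the
 *  java.util.Map.equals() contract; not Nutch's actual MapWritable. */
public class MyMapWritable {
    private final Map<String, String> entries = new HashMap<>();

    public void put(String key, String value) { entries.put(key, value); }

    @Override
    public boolean equals(Object obj) {
        if (obj == this) return true;
        if (obj instanceof MyMapWritable) {
            MyMapWritable map = (MyMapWritable) obj;
            // cheap early exit: compare sizes before comparing mappings
            if (map.entries.size() != entries.size()) return false;
            // then compare mappings irrespective of insertion order
            return map.entries.equals(entries);
        }
        return false;
    }

    @Override
    public int hashCode() { return entries.hashCode(); }
}
```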
[jira] Updated: (NUTCH-261) Multi Language Support
[ http://issues.apache.org/jira/browse/NUTCH-261?page=all ] Jerome Charron updated NUTCH-261: - Attachment: query-lang.patch Here is a patch that provides a language-dependent analysis of the queries. If you have activated some language analysis plugins (such as analysis-fr or analysis-de) during indexing, and these plugins are also activated during the search phase, the analyzer corresponding to the browser's language will be applied: for instance, if you search for the French term moteurs, it will return documents containing moteur or moteurs. Please notice that if no analyzer plugin is activated, Nutch's behavior remains unchanged (backward compatible). There are some well-known issues with the summaries (I plan to solve these very soon). Thanks for reviewing this patch and for your feedback. Regards Jérôme Multi Language Support -- Key: NUTCH-261 URL: http://issues.apache.org/jira/browse/NUTCH-261 Project: Nutch Type: New Feature Components: indexer, searcher Versions: 0.7, 0.8-dev, 0.6, 0.7.1, 0.7.2 Reporter: Jerome Charron Assignee: Jerome Charron Fix For: 0.8-dev Attachments: query-lang.patch Add multi-lingual support in Nutch, as described in http://wiki.apache.org/nutch/MultiLingualSupport The document analysis part is already implemented, and two analysis plugins (fr and de) are provided for testing (not deployed by default). The query analysis part is still missing for complete multi-lingual support.
[jira] Created: (NUTCH-262) Summary excerpts and highlights problems
Summary excerpts and highlights problems Key: NUTCH-262 URL: http://issues.apache.org/jira/browse/NUTCH-262 Project: Nutch Type: Sub-task Components: searcher Versions: 0.8-dev Reporter: Jerome Charron Assigned to: Jerome Charron Fix For: 0.8-dev There are some problems selecting and highlighting snippets for summaries when multi-lingual support is used.
[jira] Commented: (NUTCH-245) DTD for plugin.xml configuration files
[ http://issues.apache.org/jira/browse/NUTCH-245?page=comments#action_12374339 ] Jerome Charron commented on NUTCH-245: -- I would prefer to change the ugly parts of the DTD now (before a future 1.0) and suggest changing it to something like the following (and changing the plugin.xml and Plugin Manifest Reader too): <!ELEMENT implementation (parameter*)> <!ELEMENT parameter EMPTY> <!ATTLIST parameter name CDATA #REQUIRED value CDATA #REQUIRED> DTD for plugin.xml configuration files -- Key: NUTCH-245 URL: http://issues.apache.org/jira/browse/NUTCH-245 Project: Nutch Type: New Feature Components: fetcher, indexer, ndfs, searcher, web gui Versions: 0.7.2, 0.7.1, 0.7, 0.6, 0.8-dev Environment: Power PC Dual Processor 2.0 Ghz, Mac OS X 10.4, although improvement is independent of environment Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Priority: Minor Attachments: NUTCH-245.Mattmann.patch.txt Currently, the plugin.xml file does not have a DTD or XML Schema associated with it, and most people just go look at an existing plugin's plugin.xml file to determine the allowable elements, etc. There should be an explicit plugin DTD file that describes the plugin.xml file. I'll look at the code and attach a plugin.dtd file for the Nutch conf directory later today. This way, people can use the DTD file to automatically (using tools such as XMLSpy) generate plugin.xml files that can then be validated.
[jira] Closed: (NUTCH-244) Inconsistent handling of property values boundaries / unable to set db.max.outlinks.per.page to infinite
[ http://issues.apache.org/jira/browse/NUTCH-244?page=all ] Jerome Charron closed NUTCH-244: Fix Version: 0.8-dev Resolution: Fixed Assign To: Jerome Charron Fixed: http://svn.apache.org/viewcvs.cgi?rev=391958&view=rev Inconsistent handling of property values boundaries / unable to set db.max.outlinks.per.page to infinite Key: NUTCH-244 URL: http://issues.apache.org/jira/browse/NUTCH-244 Project: Nutch Type: Bug Versions: 0.8-dev Reporter: AJ Banck Assignee: Jerome Charron Fix For: 0.8-dev Some properties like file.content.limit support using negative numbers (-1) to 'disable' a limitation. Other properties do not support this. I tried disabling the limit set by db.max.outlinks.per.page, but this isn't possible.
[jira] Commented: (NUTCH-244) Inconsistent handling of property values boundaries / unable to set db.max.outlinks.per.page to infinite
[ http://issues.apache.org/jira/browse/NUTCH-244?page=comments#action_12373393 ] Jerome Charron commented on NUTCH-244: -- While taking a quick look at this, something surprised me in the code. The db.max.outlinks.per.page property is exclusively used in ParseData. In ParseData, the number of outlinks used is filtered in the readFields method... Shouldn't it be filtered directly in the ParseData constructor? Inconsistent handling of property values boundaries / unable to set db.max.outlinks.per.page to infinite Key: NUTCH-244 URL: http://issues.apache.org/jira/browse/NUTCH-244 Project: Nutch Type: Bug Versions: 0.8-dev Reporter: AJ Banck Some properties like file.content.limit support using negative numbers (-1) to 'disable' a limitation. Other properties do not support this. I tried disabling the limit set by db.max.outlinks.per.page, but this isn't possible.
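For illustration, filtering once in the constructor could look like this sketch (the class and field names are hypothetical, not the actual ParseData code), with a negative maximum meaning "unlimited", the convention this issue asks for:

```java
import java.util.Arrays;

/** Sketch of filtering outlinks once, in the constructor, instead of on
 *  every readFields() call. A negative max (e.g. -1) means "no limit",
 *  matching the convention of file.content.limit. Illustrative only. */
public class ParseDataSketch {
    private final String[] outlinks;

    ParseDataSketch(String[] outlinks, int maxOutlinksPerPage) {
        // negative limit disables truncation entirely
        if (maxOutlinksPerPage >= 0 && outlinks.length > maxOutlinksPerPage) {
            outlinks = Arrays.copyOf(outlinks, maxOutlinksPerPage);
        }
        this.outlinks = outlinks;
    }

    String[] getOutlinks() { return outlinks; }
}
```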
[jira] Commented: (NUTCH-244) Inconsistent handling of property values boundaries / unable to set db.max.outlinks.per.page to infinite
[ http://issues.apache.org/jira/browse/NUTCH-244?page=comments#action_12373398 ] Jerome Charron commented on NUTCH-244: -- That makes perfect sense! Thanks Andrzej. Inconsistent handling of property values boundaries / unable to set db.max.outlinks.per.page to infinite Key: NUTCH-244 URL: http://issues.apache.org/jira/browse/NUTCH-244 Project: Nutch Type: Bug Versions: 0.8-dev Reporter: AJ Banck Some properties like file.content.limit support using negative numbers (-1) to 'disable' a limitation. Other properties do not support this. I tried disabling the limit set by db.max.outlinks.per.page, but this isn't possible.
[jira] Commented: (NUTCH-240) Scoring API: extension point, scoring filters and an OPIC plugin
[ http://issues.apache.org/jira/browse/NUTCH-240?page=comments#action_12372948 ] Jerome Charron commented on NUTCH-240: -- +1 Scoring API: extension point, scoring filters and an OPIC plugin Key: NUTCH-240 URL: http://issues.apache.org/jira/browse/NUTCH-240 Project: Nutch Type: Improvement Versions: 0.8-dev Reporter: Andrzej Bialecki Attachments: Generator.patch.txt, patch.txt This patch refactors all places where Nutch manipulates page scores, into a plugin-based API. Using this API it's possible to implement different scoring algorithms. It is also much easier to understand how scoring works. Multiple scoring plugins can be run in sequence, in a manner similar to URLFilters. Included is also an OPICScoringFilter plugin, which contains the current implementation of the scoring algorithm. Together with the scoring API it provides a fully backward-compatible scoring.
[jira] Closed: (NUTCH-196) lib-xml and lib-log4j plugins
[ http://issues.apache.org/jira/browse/NUTCH-196?page=all ] Jerome Charron closed NUTCH-196: Fix Version: 0.8-dev Resolution: Fixed Added a lib-xml that gathers many XML libraries previously used in parse-rss. (http://svn.apache.org/viewcvs?rev=389716&view=rev) lib-xml and lib-log4j plugins - Key: NUTCH-196 URL: http://issues.apache.org/jira/browse/NUTCH-196 Project: Nutch Type: Improvement Versions: 0.8-dev Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki Fix For: 0.8-dev Attachments: NUTCH-196.lib-log4j.patch Many places in Nutch use XML. Parsing XML using the JDK API is painful. I propose to add one (or more) library plugins with JDOM, DOM4J, Jaxen, etc. This should simplify the current deployment, and help plugin writers to use the existing API. Similarly, many plugins use log4j. Either we add it to the /lib, or we could create a lib-log4j plugin.
[jira] Updated: (NUTCH-210) Context.xml file for Nutch web application
[ http://issues.apache.org/jira/browse/NUTCH-210?page=all ] Jerome Charron updated NUTCH-210: - Attachment: NUTCH-210.060325.patch Hi Chris, I made some minor changes to your patch (see my attached patch NUTCH-210.060325.patch): * Refactored the XSL code and added query.* properties to the nutch.xml * Removed the JspUtil class and moved the code to a NutchConfiguration.get(ServletContext) method. I used this patch; it is very useful, I like it. If there are no objections, I will commit it in the next few days. Thanks Chris Jérôme Context.xml file for Nutch web application -- Key: NUTCH-210 URL: http://issues.apache.org/jira/browse/NUTCH-210 Project: Nutch Type: Improvement Components: web gui Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev Environment: iMAC G5 2.3 Ghz, Mac OS X Tiger (10.4.3), 1.5 GB RAM, although improvement is independent of environment Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Priority: Minor Fix For: 0.7.2-dev, 0.8-dev, 0.7.1 Attachments: NUTCH-210.060325.patch, NUTCH-210.Mattmann.patch.txt Currently the Nutch web GUI references a few parameters that are highly dynamic, e.g., searcher.dir. These dynamic properties are read from the configuration files, such as nutch-default.xml. One problem I'm noticing, however, is that in order to change the parameter in the built webapp (the WAR file), I am required to change the parameter first in the checked-out Nutch source tree, rebuild the webapp, then redeploy. Or, if I'm feeling really gutsy, I can go poke around in the unpackaged WAR file, if the servlet container exposes it to me, and try to modify the nutch-default.xml file that way. However, I think that it would be really nice (and highly useful for that matter) to factor out some of the more dynamic parameters of the web application into a separate deliverable Context.xml file that would accompany the webapp. 
The Context.xml file would be deployed in the webapps directory, as opposed to the WAR file itself, and the parameters could be updated there and changed as many times as necessary without rebuilding the WAR file. Of course this will involve making minor modifications in the web GUI as to where some of the dynamic parameters are read from (i.e., make it read them from the Context.xml file, most likely using application.getInitParameter). Right now the only one I can think of is searcher.dir, but I'm sure that there are others (in particular the searcher.dir one is the most annoying for me). The timeframe on this patch will be within the next month. Thanks, Chris
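For illustration, a Tomcat-style context fragment along these lines could carry the dynamic parameter outside the WAR (the file name, path, and value below are made-up examples; searcher.dir is the parameter named in the issue):

```xml
<!-- e.g. a per-application context file deployed next to, not inside, the WAR -->
<Context path="/nutch" docBase="nutch.war">
  <!-- editable as often as needed without rebuilding the WAR -->
  <Parameter name="searcher.dir" value="/data/nutch/crawl" override="false"/>
</Context>
```

The JSPs could then read the value with application.getInitParameter("searcher.dir") instead of the nutch-default.xml packaged inside the WAR.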
[jira] Commented: (NUTCH-233) wrong regular expression hangs reduce process forever
[ http://issues.apache.org/jira/browse/NUTCH-233?page=comments#action_12370685 ] Jerome Charron commented on NUTCH-233: -- Stefan, I have created a small unit test for urlfilter-regexp and I didn't notice any incompatibility in java.util.regex with this regexp. Could you please provide the URLs that cause the problem so that I can add them to my unit tests. Thanks Jérôme wrong regular expression hangs reduce process forever - Key: NUTCH-233 URL: http://issues.apache.org/jira/browse/NUTCH-233 Project: Nutch Type: Bug Versions: 0.8-dev Reporter: Stefan Groschupf Priority: Blocker Fix For: 0.8-dev It looks like the expression .*(/.+?)/.*?\1/.*?\1/ in regex-urlfilter.txt wasn't compatible with java.util.regex, which is actually used in the regex URL filter. Maybe it was missed when the regular expression package was changed. The problem was that while reducing a fetch map output, the reducer hung forever, since the output format applied the URL filter to a URL that caused the hang. 060315 230823 task_r_3n4zga at java.lang.Character.codePointAt(Character.java:2335) 060315 230823 task_r_3n4zga at java.util.regex.Pattern$Dot.match(Pattern.java:4092) 060315 230823 task_r_3n4zga at java.util.regex.Pattern$Curly.match1(Pattern.java: I changed the regular expression to .*(/[^/]+)/[^/]+\1/[^/]+\1/ and now the fetch job works. (thanks to Grant and Chris B. for helping to find the new regex) However, maybe people can review it and suggest improvements, since the old regex would match: abcd/foo/bar/foo/bar/foo/ and so will the new one. But the old regex would also match: abcd/foo/bar/xyz/foo/bar/foo/, which the new regex will not match.
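The behavioral difference between the two expressions can be checked directly with java.util.regex (a small sketch, not the urlfilter-regexp test code; backslashes are doubled for Java string literals):

```java
import java.util.regex.Pattern;

public class RegexLoopFilter {
    // Old and new expressions from the issue, as Java string literals.
    static final Pattern OLD = Pattern.compile(".*(/.+?)/.*?\\1/.*?\\1/");
    static final Pattern NEW = Pattern.compile(".*(/[^/]+)/[^/]+\\1/[^/]+\\1/");

    // The new pattern forbids "/" inside segments, so the repeated segment
    // must recur at fixed alternating positions; the old one lets the
    // backreference span several segments (and backtracks pathologically).
    static boolean matches(Pattern p, String url) {
        return p.matcher(url).find();
    }
}
```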
[jira] Closed: (NUTCH-228) Clustering plugin descriptor broken (fix included)
[ http://issues.apache.org/jira/browse/NUTCH-228?page=all ] Jerome Charron closed NUTCH-228: Fix Version: 0.8-dev Resolution: Fixed Committed: * http://svn.apache.org/viewcvs.cgi?rev=385267&view=rev * http://svn.apache.org/viewcvs.cgi?rev=385268&view=rev Thanks Dawid. Clustering plugin descriptor broken (fix included) -- Key: NUTCH-228 URL: http://issues.apache.org/jira/browse/NUTCH-228 Project: Nutch Type: Bug Reporter: Dawid Weiss Priority: Minor Fix For: 0.8-dev Attachments: clustering.patch The plugin descriptor for clustering-carrot2 is currently broken (points to a missing JAR). I'm adding a patch fixing this to this issue in a minute.
[jira] Resolved: (NUTCH-217) InstantiationException when deserializing Query (no parameterless constructor)
[ http://issues.apache.org/jira/browse/NUTCH-217?page=all ] Jerome Charron resolved NUTCH-217: -- Resolution: Fixed Fixed: http://svn.apache.org/viewcvs.cgi?view=rev&rev=384011 Thanks Dawid. InstantiationException when deserializing Query (no parameterless constructor) -- Key: NUTCH-217 URL: http://issues.apache.org/jira/browse/NUTCH-217 Project: Nutch Type: Bug Components: searcher Versions: 0.8-dev Reporter: Dawid Weiss I've been playing with the trunk. The distributed searcher complains with an InstantiationException when deserializing Query. A quick code inspection shows that Query doesn't have any parameterless constructor.
[jira] Closed: (NUTCH-227) Basic Query Filter no more uses Configuration
[ http://issues.apache.org/jira/browse/NUTCH-227?page=all ] Jerome Charron closed NUTCH-227: Resolution: Fixed Oops... sorry guys, and thanks for your prompt remarks. All is in fact OK. Basic Query Filter no more uses Configuration - Key: NUTCH-227 URL: http://issues.apache.org/jira/browse/NUTCH-227 Project: Nutch Type: Bug Components: searcher Versions: 0.8-dev Reporter: Jerome Charron Assignee: Jerome Charron Fix For: 0.8-dev Since NUTCH-169, the BasicIndexingFilter has no way to retrieve its configuration parameters (query.url.boost, query.anchor.boost, query.title.boost, query.host.boost, query.phrase.boost): the setConf(Configuration) method is never called by the QueryFilters class. More generally, we should provide a way for QueryFilter to be Configurable. Two solutions: 1. The QueryFilters checks that a QueryFilter implements Configurable and then calls the setConf() method. 2. QueryFilter extends Configurable => all QueryFilters must implement Configurable. My preference goes to 1, and if there is no objection, I will commit a patch in the next few days.
[jira] Closed: (NUTCH-219) file.content.limit & ftp.content.limit should be changed to -1 to be consistent with http
[ http://issues.apache.org/jira/browse/NUTCH-219?page=all ] Jerome Charron closed NUTCH-219: Fix Version: 0.8-dev Resolution: Fixed Solved: http://svn.apache.org/viewcvs.cgi?rev=382535&view=rev file.content.limit & ftp.content.limit should be changed to -1 to be consistent with http - Key: NUTCH-219 URL: http://issues.apache.org/jira/browse/NUTCH-219 Project: Nutch Type: Bug Components: fetcher Versions: 0.7.1 Reporter: Richard Braman Priority: Minor Fix For: 0.8-dev file and ftp use 0 for no truncation, but http needs -1. This is easily missed when configuring, even by experienced users. Here is the help I got in nutch-user from Jerome, who is a developer: Edit your nutch-site.xml (or nutch-default.xml) and change the http.content.limit (set it to 0 if you don't want any content truncation at all). Jérôme
[jira] Updated: (NUTCH-204) multiple field values in HitDetails
[ http://issues.apache.org/jira/browse/NUTCH-204?page=all ] Jerome Charron updated NUTCH-204: - Attachment: NUTCH-204.jc.060227.patch Stefan, Here is a proposed patch (NUTCH-204.jc.060227.patch). If you agree, I will commit it. Jérôme multiple field values in HitDetails --- Key: NUTCH-204 URL: http://issues.apache.org/jira/browse/NUTCH-204 Project: Nutch Type: Improvement Components: searcher Versions: 0.8-dev Reporter: Stefan Groschupf Fix For: 0.8-dev Attachments: DetailGetValues070206.patch, NUTCH-204.jc.060227.patch Improvement as Howie Wang suggested. http://mail-archives.apache.org/mod_mbox/lucene-nutch-dev/200601.mbox/[EMAIL PROTECTED]
[jira] Closed: (NUTCH-204) multiple field values in HitDetails
[ http://issues.apache.org/jira/browse/NUTCH-204?page=all ] Jerome Charron closed NUTCH-204: Resolution: Fixed Committed: http://svn.apache.org/viewcvs.cgi?rev=381465&view=rev Thanks Stefan for pointing out the performance issue of my patch. Perhaps in a later patch we can add a cache of field/values to avoid iterating over the whole list each time the getValues method is called. multiple field values in HitDetails --- Key: NUTCH-204 URL: http://issues.apache.org/jira/browse/NUTCH-204 Project: Nutch Type: Improvement Components: searcher Versions: 0.8-dev Reporter: Stefan Groschupf Fix For: 0.8-dev Attachments: DetailGetValues070206.patch, NUTCH-204.jc.060227.patch Improvement as Howie Wang suggested. http://mail-archives.apache.org/mod_mbox/lucene-nutch-dev/200601.mbox/[EMAIL PROTECTED]
[jira] Commented: (NUTCH-61) Adaptive re-fetch interval. Detecting unmodified content
[ http://issues.apache.org/jira/browse/NUTCH-61?page=comments#action_12368050 ] Jerome Charron commented on NUTCH-61: - Not an objection, but a simple comment. Why not make FetchSchedule a new ExtensionPoint, and then make DefaultFetchSchedule and AdaptiveFetchSchedule fetch-schedule plugins? Adaptive re-fetch interval. Detecting unmodified content --- Key: NUTCH-61 URL: http://issues.apache.org/jira/browse/NUTCH-61 Project: Nutch Type: New Feature Components: fetcher Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki Attachments: 20050606.diff, 20051230.txt, 20060227.txt Currently Nutch doesn't automatically adjust its re-fetch period, no matter whether individual pages change seldom or frequently. The goal of these changes is to extend the current codebase to support various possible adjustments to re-fetch times and intervals, and specifically a re-fetch schedule which tries to adapt the period between consecutive fetches to the period of content changes. Also, these patches implement checking if the content has changed since the last fetch; protocol plugins are also changed to make use of this information, so that if content is unmodified it doesn't have to be fetched and processed.
[jira] Commented: (NUTCH-204) multiple field values in HitDetails
[ http://issues.apache.org/jira/browse/NUTCH-204?page=comments#action_12367513 ] Jerome Charron commented on NUTCH-204: -- Hi Stefan, There is something I don't understand with this patch. The way Lucene manages multi-valued fields is to have many mono-valued Field objects with the same name. My question is: why not keep this logic? It would avoid patching the HitSearcher and modifying the HitDetails constructor signature. The idea I have in mind is to add a generic name/value(s) container (like Metadata, but without the syntax-tolerant feature; in fact, the actual Metadata will internally use this generic container) that will be used by HitDetails to store multi-valued fields. What do you think about this? I imagine you are very busy with the admin GUI (it is really a big challenge, and a big new feature), so if you are OK with my proposed solution, I will code it. Regards Jérôme multiple field values in HitDetails --- Key: NUTCH-204 URL: http://issues.apache.org/jira/browse/NUTCH-204 Project: Nutch Type: Improvement Components: searcher Versions: 0.8-dev Reporter: Stefan Groschupf Fix For: 0.8-dev Attachments: DetailGetValues070206.patch Improvement as Howie Wang suggested. http://mail-archives.apache.org/mod_mbox/lucene-nutch-dev/200601.mbox/[EMAIL PROTECTED]
[jira] Commented: (NUTCH-204) multiple field values in HitDetails
[ http://issues.apache.org/jira/browse/NUTCH-204?page=comments#action_12367530 ] Jerome Charron commented on NUTCH-204: -- HitDetails is a Writable, and in the case of multiple search servers distributed in a network, it makes sense to minimize the network IO, since getting details should be as fast as possible. Sure Stefan. I will take this into account, of course. Using a map-like structure in HitDetails will reduce the bytes used by not duplicating keys. I will commit something in the next few days. multiple field values in HitDetails --- Key: NUTCH-204 URL: http://issues.apache.org/jira/browse/NUTCH-204 Project: Nutch Type: Improvement Components: searcher Versions: 0.8-dev Reporter: Stefan Groschupf Fix For: 0.8-dev Attachments: DetailGetValues070206.patch Improvement as Howie Wang suggested. http://mail-archives.apache.org/mod_mbox/lucene-nutch-dev/200601.mbox/[EMAIL PROTECTED]
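A sketch of the Lucene-style approach discussed in this thread: keep repeated (field, value) pairs and scan them to collect every value for a field (illustrative names, not the committed HitDetails code):

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative multi-valued detail container: repeated (field, value)
 *  pairs, Lucene-style, with a getValues() scan. Not Nutch's HitDetails. */
public class HitDetailsSketch {
    private final List<String> fields = new ArrayList<>();
    private final List<String> values = new ArrayList<>();

    void add(String field, String value) {
        fields.add(field);
        values.add(value);
    }

    /** All values stored under a field name, in insertion order. */
    String[] getValues(String field) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i < fields.size(); i++) {
            if (fields.get(i).equals(field)) {
                result.add(values.get(i));
            }
        }
        return result.toArray(new String[0]);
    }
}
```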
[jira] Closed: (NUTCH-188) Add searchable mailing list links to http://lucene.apache.org/nutch/mailing_lists.html
[ http://issues.apache.org/jira/browse/NUTCH-188?page=all ] Jerome Charron closed NUTCH-188: Fix Version: 0.8-dev Resolution: Fixed Duplicate of NUTCH-214 Add searchable mailing list links to http://lucene.apache.org/nutch/mailing_lists.html -- Key: NUTCH-188 URL: http://issues.apache.org/jira/browse/NUTCH-188 Project: Nutch Type: Improvement Reporter: Andy Liu Priority: Trivial Fix For: 0.8-dev Attachments: mailing_list.patch Post links to searchable mail archives on nutch.org
[jira] Commented: (NUTCH-215) Plugin execution order
[ http://issues.apache.org/jira/browse/NUTCH-215?page=comments#action_12367180 ] Jerome Charron commented on NUTCH-215: -- The primary meaning of a plugin dependency is to specify that a plugin relies on the code of another plugin, not that it must be executed after the plugins it depends on. I think that in some particular cases (as in yours), this concept of plugin order is important, and it is really an issue. But I don't think this is the right way to solve it. We should think about a more generic/secure solution (a parse plugin must not be allowed to be declared to be called before a protocol plugin; a plugin can implement many extension points; ...). As a short-term workaround, you can, for instance, directly call the plugin that you need to be called before yours. +1 for this issue to be solved, but -1 for this patch. Jérôme Plugin execution order -- Key: NUTCH-215 URL: http://issues.apache.org/jira/browse/NUTCH-215 Project: Nutch Type: Improvement Versions: 0.8-dev Reporter: Enrico Triolo Priority: Minor Attachments: plugin_order.patch This patch allows Nutch to automatically guess the correct order of execution of plugins, depending on their dependencies. This means that, for example, if plugin A depends on plugin B (as stated in the plugins.xml file), then B will be executed before A.
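The patch's idea of deriving execution order from declared dependencies amounts to a topological sort of the dependency graph; a generic sketch (not the actual patch code, which works on Nutch's plugin descriptors):

```java
import java.util.*;

/** Order plugins so each runs after the plugins it depends on
 *  (depth-first topological sort). Illustrative only. */
public class PluginOrder {
    /** deps maps a plugin id to the ids of the plugins it depends on. */
    static List<String> order(Map<String, List<String>> deps) {
        List<String> sorted = new ArrayList<>();
        Set<String> visited = new HashSet<>();
        Set<String> inProgress = new HashSet<>();
        for (String id : deps.keySet()) {
            visit(id, deps, visited, inProgress, sorted);
        }
        return sorted; // dependencies always precede their dependents
    }

    private static void visit(String id, Map<String, List<String>> deps,
                              Set<String> visited, Set<String> inProgress,
                              List<String> sorted) {
        if (visited.contains(id)) return;
        if (!inProgress.add(id)) {
            throw new IllegalStateException("dependency cycle at " + id);
        }
        for (String dep : deps.getOrDefault(id, Collections.emptyList())) {
            visit(dep, deps, visited, inProgress, sorted);
        }
        inProgress.remove(id);
        visited.add(id);
        sorted.add(id);
    }
}
```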
[jira] Resolved: (NUTCH-212) ant build problem with locale-sr
[ http://issues.apache.org/jira/browse/NUTCH-212?page=all ] Jerome Charron resolved NUTCH-212: -- Fix Version: 0.8-dev Resolution: Fixed Not directly related, but it should be solved with this commit: http://svn.apache.org/viewcvs.cgi?rev=379453&view=rev ant build problem with locale-sr Key: NUTCH-212 URL: http://issues.apache.org/jira/browse/NUTCH-212 Project: Nutch Type: Bug Components: web gui Versions: 0.8-dev Environment: win32 Reporter: Alain Fankhauser Priority: Trivial Fix For: 0.8-dev Problem while executing ant from Eclipse: build.xml: <antcall target="generate-locale"> <param name="doc.locale" value="sr"/> </antcall> error message: generate-locale: [echo] Generating docs for locale=sr [xslt] Transforming into C:\eclipse_projects\nutchTrunk\docs\sr [xslt] Processing C:\eclipse_projects\nutchTrunk\src\web\pages\sr\about.xml to C:\eclipse_projects\nutchTrunk\docs\sr\about.html [xslt] Loading stylesheet C:\eclipse_projects\nutchTrunk\build\docs\sr\nutch-page.xsl [xslt] C:/eclipse_projects/nutchTrunk/src/web/pages/sr/about.xml:1: Fatal Error! Dokumentwurzelelement fehlt [document root element missing] [xslt] Failed to process C:\eclipse_projects\nutchTrunk\src\web\pages\sr\about.xml BUILD FAILED C:\eclipse_projects\nutchTrunk\build.xml:393: The following error occurred while executing this line: C:\eclipse_projects\nutchTrunk\build.xml:324: Fatal error during transformation
[jira] Updated: (NUTCH-139) Standard metadata property names in the ParseData metadata
[ http://issues.apache.org/jira/browse/NUTCH-139?page=all ] Jerome Charron updated NUTCH-139: - Attachment: NUTCH-139.060208.patch A new patch which I hope is compliant with all our requirements (not yet tested on a full fetch/index/query cycle) Standard metadata property names in the ParseData metadata -- Key: NUTCH-139 URL: http://issues.apache.org/jira/browse/NUTCH-139 Project: Nutch Type: Improvement Components: fetcher Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB RAM, although bug is independent of environment Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Priority: Minor Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6 Attachments: NUTCH-139.060105.patch, NUTCH-139.060208.patch Currently, people are free to name their string-based properties anything they want, so that Content-type, content-TyPe, and CONTENT_TYPE can all carry the same meaning. Stefan G., I believe, proposed a solution in which all property names are converted to lower case, but that really only fixes half the problem (identifying that CONTENT_TYPE, conTeNT_TyPE, and all their permutations are the same). What if I named it Content Type, or ContentType? I propose to correct this by creating a standard set of named Strings in the ParseData class that the protocol framework and the parsing framework could use to identify common properties such as Content-type, Creator, Language, etc. The properties would be defined at the top of the ParseData class, something like: public class ParseData { ... public static final String CONTENT_TYPE = "content-type"; public static final String CREATOR = "creator"; } In this fashion, users would at least know the names of the standard properties they can obtain from the ParseData, for example by calling ParseData.getMetadata().get(ParseData.CONTENT_TYPE) to get the content type, or ParseData.getMetadata().set(ParseData.CONTENT_TYPE, "text/xml") to set it. Of course, this wouldn't preclude users from doing what they currently do; it would just provide a standard way of obtaining some of the more common, critical metadata without poring over the code base to figure out what it is named. I'll contribute a patch near the end of this week, or the beginning of next week, that addresses this issue.
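The lower-casing half-fix discussed above can be sketched as a small map wrapper. This is only an illustration of the normalization idea, assuming '-' and '_' are unified as well; it is not a class from the actual patch:

```java
import java.util.*;

// Illustrative sketch: a metadata map that treats "Content-Type",
// "CONTENT_TYPE" and "content-type" as the same key by lower-casing the
// name and mapping '_' to '-'. Hypothetical class name, not Nutch code.
public class CaseFoldingMetadata {
    private final Map<String, String> props = new HashMap<>();

    private static String normalize(String name) {
        return name.toLowerCase(Locale.ROOT).replace('_', '-');
    }

    public void set(String name, String value) {
        props.put(normalize(name), value);
    }

    public String get(String name) {
        return props.get(normalize(name));
    }
}
```

Normalization catches CONTENT_TYPE vs. Content-Type, but not "Content Type" or "ContentType", which is why shared constants remain the stronger contract.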
[jira] Resolved: (NUTCH-139) Standard metadata property names in the ParseData metadata
[ http://issues.apache.org/jira/browse/NUTCH-139?page=all ] Jerome Charron resolved NUTCH-139: -- Fix Version: (was: 0.7.2-dev) (was: 0.7.1) (was: 0.7) (was: 0.6) Resolution: Fixed Tested and committed with some corrections in cached.jsp (missed ContentProperties usage) and build.xml (added the commons-lang jar to the war lib): http://svn.apache.org/viewcvs.cgi?rev=376089&view=rev Standard metadata property names in the ParseData metadata -- Key: NUTCH-139 URL: http://issues.apache.org/jira/browse/NUTCH-139 Fix For: 0.8-dev Attachments: NUTCH-139.060105.patch, NUTCH-139.060208.patch
[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata
[ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12365066 ] Jerome Charron commented on NUTCH-139: -- Sorry for this very late response... The idea behind separate subclasses of Metadata for content and parses is to enforce the semantic separation between content metadata and parse metadata: ContentProperties only defines constants for content-related metadata; ParseProperties only defines constants for parse-related metadata. Does it make sense? Standard metadata property names in the ParseData metadata -- Key: NUTCH-139 URL: http://issues.apache.org/jira/browse/NUTCH-139 Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6 Attachments: NUTCH-139.060105.patch
[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata
[ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12365095 ] Jerome Charron commented on NUTCH-139: -- Ok Doug, your point of view makes sense to me. I hope I can provide a (final) patch next week. Standard metadata property names in the ParseData metadata -- Key: NUTCH-139 URL: http://issues.apache.org/jira/browse/NUTCH-139 Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6 Attachments: NUTCH-139.060105.patch
[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata
[ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12365103 ] Jerome Charron commented on NUTCH-139: -- "except for the sake of purity of OO approach" Andrzej, as you certainly noticed, that is my weakness... ;-) You know, I am still tempted to split the metadata constants into several interfaces (DublinCore, HttpHeaders, ...) ;-) Standard metadata property names in the ParseData metadata -- Key: NUTCH-139 URL: http://issues.apache.org/jira/browse/NUTCH-139 Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6 Attachments: NUTCH-139.060105.patch
[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata
[ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12364218 ] Jerome Charron commented on NUTCH-139: -- I think we're near agreement here. I really hope so... ;-) "We should add an add() method to Metadata, and change set() to replace all values rather than add a new value." I'm not sure we are looking at the same piece of code, since this is how the add() and set() methods work in the last attached patch (http://issues.apache.org/jira/secure/attachment/12321740/NUTCH-139.060105.patch). "MetadataNames belongs in the protocol package, not util" +1 (but in my mind there is no more MetadataNames, only MetaData, ContentProperties and ParseProperties, no?) "We should rename ContentProperties to Metadata" What about having a generic Metadata container extended by ContentProperties and ParseProperties? (as described in a previous comment: http://issues.apache.org/jira/browse/NUTCH-139#action_12362618) By having two separate maps (one for Content and one for Parse in ParseData) we easily handle the problem of original value / final value, and we avoid copying the Content metadata map to the Parse metadata map in all parsers: ContentProperties metadata = new ContentProperties(); metadata.putAll(content.getMetadata()); // copy through Standard metadata property names in the ParseData metadata -- Key: NUTCH-139 URL: http://issues.apache.org/jira/browse/NUTCH-139 Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6 Attachments: NUTCH-139.060105.patch, NUTCH-139.Mattmann.patch.txt, NUTCH-139.jc.review.patch.txt
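The add()/set() contract under discussion can be sketched with a tiny multi-valued map. The class and method names are illustrative, assuming the semantics described in the comments (add() appends another value under the same name, set() replaces all existing values), not the patch's actual code:

```java
import java.util.*;

// Illustrative sketch of a multi-valued metadata container with the
// add()/set() semantics discussed on NUTCH-139. Hypothetical class name.
public class MultiValuedMetadata {
    private final Map<String, List<String>> props = new HashMap<>();

    // add() appends one more value under the same name.
    public void add(String name, String value) {
        props.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    // set() replaces ALL existing values for the name.
    public void set(String name, String value) {
        List<String> values = new ArrayList<>();
        values.add(value);
        props.put(name, values);   // previous values are discarded
    }

    public String[] getValues(String name) {
        return props.getOrDefault(name, Collections.<String>emptyList())
                    .toArray(new String[0]);
    }
}
```

This is the behaviour that makes multi-valued properties (SMTP recipients, HTML tags, ...) representable without losing the single-assignment convenience of set().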
[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata
[ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12364112 ] Jerome Charron commented on NUTCH-139: -- In fact, the more I look at this, the more I agree with Doug's last comment. There is no real need (for now) for such a complicated meta-info container. I would like to summarize the key goals of this issue: 1. Define constants for protocol and content metadata names. 2. Provide correction mechanisms for erroneous protocol header names. 3. Handle multi-valued properties (such as SMTP recipients, or TAGS attached to an html page, ...). 4. Provide an easy way to keep track of the protocol's original values even if they are overridden by parsers (I don't think there is a need for a concept of original value at the parser level: if a parser overrides a value previously set by another parser, the new value must replace the existing one). I really think that one of my comments (13/Jan/06 - http://issues.apache.org/jira/browse/NUTCH-139#action_12362618) covers all these cases. In that proposal, the ParseData object keeps a reference to the protocol's original metadata map (ContentProperties), instead of copying the map into a new one. The policy is then as follows: * The ContentProperties is created at the protocol level and is never modified after that. * The ParseProperties is created by the content parser and is the place to store any kind of metadata in all subsequent Nutch processes. * Any metadata stored in ParseProperties can be overridden (the last one to speak has the last word). Standard metadata property names in the ParseData metadata -- Key: NUTCH-139 URL: http://issues.apache.org/jira/browse/NUTCH-139 Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6 Attachments: NUTCH-139.060105.patch, NUTCH-139.Mattmann.patch.txt, NUTCH-139.jc.review.patch.txt
[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata
[ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12363834 ] Jerome Charron commented on NUTCH-139: -- Andrzej, I really don't like this X-Nutch naming convention. First, it is really protocol-level oriented, and it forces one to map X-Nutch values to the original ones (of course a utility method could easily provide this mapping), and I really don't think that solution is clean (from my point of view). We should perhaps define one more time what a MetaData value is. I suggest defining a new class to represent a metadata value instead of using a simple String. Thus, we can define a class that holds both the original and the final value. The idea is that the only way to set the original value is to construct a new object (I will call this class MetaValue, but native English speakers are encouraged to propose a better name); then, when you set the value of this metadata value, it never overrides the original one, only the final one. Here is a short piece of code:

public class MetaValue {

  private String[] original = null;
  private List actual = null;

  public MetaValue(String[] values) {
    // Constructor for multi value
    original = values;
  }

  public MetaValue(String value) {
    // Constructor for single value
    original = new String[] { value };
  }

  public void setValue(String[] values) {
    // copies the values into a new, empty actual list
  }

  public void addValue(String value) {
    // appends this value to the actual list of values
  }

  public String[] getOriginalValues() { }

  public String[] getFinalValues() { }

  public String[] getValues() {
    // Return the final values if the actual list of values is not null,
    // otherwise return the original values
  }
}

With this approach we can keep the same value (MetaValue) under the same key. Standard metadata property names in the ParseData metadata -- Key: NUTCH-139 URL: http://issues.apache.org/jira/browse/NUTCH-139 Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6 Attachments: NUTCH-139.060105.patch, NUTCH-139.Mattmann.patch.txt, NUTCH-139.jc.review.patch.txt
[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata
[ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12362618 ] Jerome Charron commented on NUTCH-139: -- Here is a new proposal for this issue. org.apache.nutch.util.MetaData * becomes a utility class that is only a container of multi-valued, typo-tolerant String properties (using the same kind of API as JavaMail: the add/set methods mentioned by Doug - already implemented in the current patch). * There are no more metadata name constants in this class, since it becomes a generic object for storing String/String[] mappings. org.apache.nutch.protocol.ContentProperties * This class simply extends the MetaData class. * It defines the content-related constants (Content-Type, and so on). org.apache.nutch.parse.ParseProperties * This class simply extends the MetaData class. * It defines the parse-related constants (Dublin Core constants). org.apache.nutch.parse.ParseData * The constructor becomes ParseData(ParseStatus, String, Outlink[], ContentProperties). * This class holds two metadata sets: 1. ContentProperties for the original metadata set that came from the protocol. 2. ParseProperties for the parse metadata set. * This class provides 3 ways to retrieve a metadata value: 1. public ContentProperties getContentMeta(); 2. public ParseProperties getParseMeta(); 3. public MetaData getMetaData(); // Returns a mix of the two previous ones, where values in parse properties override those in content properties. In all parser implementations: * Remove the copying of content metadata to parse metadata. From my point of view the key benefits are: 1. A clear separation between content metadata and parse metadata. 2. Metadata names are defined in the right places. 3. Keeps the advantage of metadata name normalization and syntax correction. 4. An easy mapping between content metadata names and parse metadata names (both can use the real name of the metadata, without adding an artificial X-Nutch prefix for parse metadata names). Comments are welcome. Jérôme Standard metadata property names in the ParseData metadata -- Key: NUTCH-139 URL: http://issues.apache.org/jira/browse/NUTCH-139 Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6 Attachments: NUTCH-139.060105.patch, NUTCH-139.Mattmann.patch.txt, NUTCH-139.jc.review.patch.txt
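The getMetaData() merge in the proposal above can be sketched with plain maps standing in for ContentProperties and ParseProperties. The names here are illustrative, assuming only the override policy described in the comment (parse values win over content values):

```java
import java.util.*;

// Illustrative sketch of ParseData.getMetaData(): a merged, read-only view in
// which parse metadata overrides content metadata of the same name.
// Hypothetical class, not the actual Nutch ParseData implementation.
public class ParseDataSketch {
    public static Map<String, String> merged(Map<String, String> contentMeta,
                                             Map<String, String> parseMeta) {
        Map<String, String> mix = new HashMap<>(contentMeta);
        mix.putAll(parseMeta);   // parse values override content values
        return mix;
    }
}
```

Because the merge is computed on demand, the original protocol values stay untouched in the content map, which is exactly the original-value guarantee the proposal wants.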
[jira] Resolved: (NUTCH-151) CommandRunner can hang after the main thread exec is finished and has inefficient busy loop
[ http://issues.apache.org/jira/browse/NUTCH-151?page=all ] Jerome Charron resolved NUTCH-151: -- Resolution: Fixed Changes committed: http://svn.apache.org/viewcvs.cgi?rev=368060&view=rev Thanks Paul. CommandRunner can hang after the main thread exec is finished and has inefficient busy loop --- Key: NUTCH-151 URL: http://issues.apache.org/jira/browse/NUTCH-151 Project: Nutch Type: Bug Components: indexer Versions: 0.8-dev Environment: all Reporter: Paul Baclace Fix For: 0.8-dev Attachments: CommandRunner.060110.patch, CommandRunner.java, CommandRunner.java.patch I encountered a case where the JVM of a TaskTracker child did not exit after the main thread returned; a thread dump showed only the threads named STDOUT and STDERR from CommandRunner as non-daemon threads, and both were doing a read(). CommandRunner usually works correctly when the subprocess is expected to finish before the timeout or when no timeout is used. By _usually_, I mean in the absence of external thread interrupts. The busy loop that waits for the process to finish has a sleep that is skipped over by an exception; this causes the waiting main thread to compete with the subprocess in a tight loop and effectively reduces the available CPU by 50%.
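The two fixes the report implies can be sketched as follows. This is an illustrative reconstruction, not the actual CommandRunner code: the stream-pumping threads are made daemons so a blocked read() cannot keep the JVM alive, and the wait loop records an interrupt instead of letting it skip the sleep, so an external interrupt cannot degrade the loop into a CPU-burning spin.

```java
// Illustrative sketch of the two CommandRunner fixes. Class and method names
// are hypothetical.
public class ProcessWaiter {

    // Start a stream-pumping thread as a daemon so that a read() blocked on a
    // dead subprocess cannot prevent JVM exit (the hang in the report).
    public static Thread pumper(Runnable pump, String name) {
        Thread t = new Thread(pump, name);
        t.setDaemon(true);
        t.start();
        return t;
    }

    // Poll until the process exits or the timeout elapses.
    // Returns true if the process finished within timeoutMillis.
    public static boolean waitFor(Process p, long timeoutMillis) {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        boolean interrupted = false;
        try {
            while (System.currentTimeMillis() < deadline) {
                try {
                    p.exitValue();          // throws while still running
                    return true;
                } catch (IllegalThreadStateException stillRunning) {
                    try {
                        Thread.sleep(100);  // pace the loop
                    } catch (InterruptedException e) {
                        interrupted = true; // note it, but keep pacing
                    }
                }
            }
            return false;
        } finally {
            if (interrupted) {
                Thread.currentThread().interrupt(); // restore the flag at the end
            }
        }
    }
}
```

Swallowing the interrupt inside the loop and restoring the flag only on exit is what prevents the tight spin: re-interrupting immediately would make every subsequent sleep() throw at once, reproducing the reported 50% CPU loss.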
[jira] Reopened: (NUTCH-151) CommandRunner can hang after the main thread exec is finished and has inefficient busy loop
[ http://issues.apache.org/jira/browse/NUTCH-151?page=all ] Jerome Charron reopened NUTCH-151: -- Due to the removal of the calling barrier in PumperThread, the process always times out (for instance, the parse-ext unit tests fail) because only the main thread is assumed to be finished. CommandRunner can hang after the main thread exec is finished and has inefficient busy loop --- Key: NUTCH-151 URL: http://issues.apache.org/jira/browse/NUTCH-151 Project: Nutch Type: Bug Components: indexer Versions: 0.8-dev Environment: all Reporter: Paul Baclace Fix For: 0.8-dev Attachments: CommandRunner.java, CommandRunner.java.patch
[jira] Updated: (NUTCH-151) CommandRunner can hang after the main thread exec is finished and has inefficient busy loop
[ http://issues.apache.org/jira/browse/NUTCH-151?page=all ] Jerome Charron updated NUTCH-151: - Attachment: CommandRunner.060110.patch Here is a very small patch that solves this issue. If Paul is ok with this, I will commit. CommandRunner can hang after the main thread exec is finished and has inefficient busy loop --- Key: NUTCH-151 URL: http://issues.apache.org/jira/browse/NUTCH-151 Project: Nutch Type: Bug Components: indexer Versions: 0.8-dev Environment: all Reporter: Paul Baclace Fix For: 0.8-dev Attachments: CommandRunner.060110.patch, CommandRunner.java, CommandRunner.java.patch
[jira] Updated: (NUTCH-169) remove static NutchConf
[ http://issues.apache.org/jira/browse/NUTCH-169?page=all ] Jerome Charron updated NUTCH-169: - Attachment: NutchConf.Http.060111.patch Attached is the patch for the http related classes (lib-http, protocol-http and protocol-httpclient). Pfou, Stefan, it was a huge amount of work since a lot of code was static and used the static NutchConf !!! ;-) But it is ok and it works (with a patch to the Fetcher that I will submit just after). Please notice that it is a raw version, and it probably needs a full review after commit. remove static NutchConf --- Key: NUTCH-169 URL: http://issues.apache.org/jira/browse/NUTCH-169 Project: Nutch Type: Improvement Reporter: Stefan Groschupf Priority: Critical Fix For: 0.8-dev Attachments: NutchConf.Http.060111.patch, nutchConf.patch Removing the static NutchConf.get is required for a set of improvements and new features. + it allows a better integration of nutch in j2ee or other systems. + it allows the management of nutch from a web based gui (a kind of nutch appliance), which will improve the usability and also increase the user acceptance of nutch + it allows configuration properties to be changed at runtime + it allows NutchConf to be implemented as an abstract class or interface, to provide configuration value sources other than xml files. (community request)
[jira] Updated: (NUTCH-169) remove static NutchConf
[ http://issues.apache.org/jira/browse/NUTCH-169?page=all ] Jerome Charron updated NUTCH-169: - Attachment: NutchConf.Fetcher.060111.patch Same as the one provided in Stefan's patch + the Fetcher sets the NutchConf on the protocol. Not sure it is the right way: it might be better for the ProtocolFactory to set the NutchConf on protocols. ??? remove static NutchConf --- Key: NUTCH-169 URL: http://issues.apache.org/jira/browse/NUTCH-169 Project: Nutch Type: Improvement Reporter: Stefan Groschupf Priority: Critical Fix For: 0.8-dev Attachments: NutchConf.Fetcher.060111.patch, NutchConf.Http.060111.patch, nutchConf.patch
[jira] Updated: (NUTCH-169) remove static NutchConf
[ http://issues.apache.org/jira/browse/NUTCH-169?page=all ] Jerome Charron updated NUTCH-169: - Attachment: NutchConf.RegexURLFilter.060111.patch This patch is a merge of the version provided in Stefan's patch and the last changes committed by Doug (use JDK regexp). remove static NutchConf --- Key: NUTCH-169 URL: http://issues.apache.org/jira/browse/NUTCH-169 Project: Nutch Type: Improvement Reporter: Stefan Groschupf Priority: Critical Fix For: 0.8-dev Attachments: NutchConf.Fetcher.060111.patch, NutchConf.Http.060111.patch, NutchConf.RegexURLFilter.060111.patch, nutchConf.patch
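The direction of this refactoring, passing the configuration object in rather than reading a static singleton, can be sketched as follows. The `Conf` stand-in, the protocol class, and the property name are illustrative assumptions, not the actual Nutch API:

```java
// Sketch: dependency-injected configuration instead of a static NutchConf.get().
import java.util.HashMap;
import java.util.Map;

public class ConfDemo {
    // Minimal stand-in for a NutchConf-like configuration object.
    public static class Conf {
        private final Map<String, String> props = new HashMap<>();
        public void set(String k, String v) { props.put(k, v); }
        public String get(String k, String dflt) { return props.getOrDefault(k, dflt); }
    }

    // Before: the protocol read the static NutchConf internally.
    // After: the factory (or Fetcher) hands each plugin its configuration.
    public static class HttpProtocol {
        private final int timeout;
        public HttpProtocol(Conf conf) {
            this.timeout = Integer.parseInt(conf.get("http.timeout", "10000"));
        }
        public int getTimeout() { return timeout; }
    }

    public static void main(String[] args) {
        Conf conf = new Conf();
        conf.set("http.timeout", "5000");  // can now differ per instance, at runtime
        System.out.println(new HttpProtocol(conf).getTimeout());
    }
}
```

Because each instance carries its own configuration, two crawls with different settings can coexist in one JVM, which is exactly what the static singleton prevented.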
[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata
[ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12362061 ] Jerome Charron commented on NUTCH-139: -- I agree with your analysis Andrzej. I suggested to commit this patch because it is a response to this issue: standard metadata names + misspelled/erroneous names. The history is not a new feature = ContentProperties is a kind of history. So after committing this patch, I (and others) could focus on other sub-issues: 1. In fact, by taking a closer look at it, I agree that there is no real need of a metadata history in nutch. 2. What we need: 2.1 MetaData must be used to store multi-valued metadata and not the actual kind of history. 3.1 Only two historical values must be stored: the original one (protocol only) and some extra metadata (that could be, or not, some derived values of the original ones). What I suggest is that the MetaData deals with two collections instead of one: * One for original protocol values : headers * Another one for other metadata Standard metadata property names in the ParseData metadata -- Key: NUTCH-139 URL: http://issues.apache.org/jira/browse/NUTCH-139 Project: Nutch Type: Improvement Components: fetcher Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB RAM, although bug is independent of environment Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Priority: Minor Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6 Attachments: NUTCH-139.060105.patch, NUTCH-139.Mattmann.patch.txt, NUTCH-139.jc.review.patch.txt Currently, people are free to name their string-based properties anything that they want, such as having names of Content-type, content-TyPe, CONTENT_TYPE all having the same meaning. Stefan G. I believe proposed a solution in which all property names be converted to lower case, but in essence this really only fixes half the problem (the case of identifying that CONTENT_TYPE and conTeNT_TyPE and all the permutations are really the same). What about if I named it Content Type, or ContentType? I propose that a way to correct this would be to create a standard set of named Strings in the ParseData class that the protocol framework and the parsing framework could use to identify common properties such as Content-type, Creator, Language, etc. The properties would be defined at the top of the ParseData class, something like: public class ParseData { ... public static final String CONTENT_TYPE = "content-type"; public static final String CREATOR = "creator"; } In this fashion, users could at least know what the names of the standard properties that they can obtain from the ParseData are, for example by making a call to ParseData.getMetadata().get(ParseData.CONTENT_TYPE) to get the content type, or a call to ParseData.getMetadata().set(ParseData.CONTENT_TYPE, "text/xml"); Of course, this wouldn't preclude users from doing what they are currently doing; it would just provide a standard method of obtaining some of the more common, critical metadata without poring over the code base to figure out what they are named. I'll contribute a patch near the end of this week, or the beginning of next week, that addresses this issue.
[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata
[ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12361041 ] Jerome Charron commented on NUTCH-139: -- Ok, Chris and I will implement MetadataNames in this way. Just a few comments: I plan to move the MetadataNames to a class rather than an interface. Two reasons: 1.1 I don't like the design of implementing an interface in order to import some constants into a class: it results in javadoc with a lot of classes with many public constants defined, without any real need to show these constants in the javadoc. 1.2 I want to add a utility method in MetadataNames that tries to find the appropriate Nutch normalized metadata name from a string. It will be based on the Levenshtein Distance (available in commons-lang). More about Levenshtein Distance at http://www.merriampark.com/ld.htm Standard metadata property names in the ParseData metadata -- Key: NUTCH-139 URL: http://issues.apache.org/jira/browse/NUTCH-139 Project: Nutch Type: Improvement Components: fetcher Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB RAM, although bug is independent of environment Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Priority: Minor Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6 Attachments: NUTCH-139.Mattmann.patch.txt, NUTCH-139.jc.review.patch.txt
[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata
[ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12361045 ] Jerome Charron commented on NUTCH-139: -- Andrzej, Do you read my mind? Yes of course, that's the way I want to do it: first check for the most common cases (lower case + keep only letters), then use the Levenshtein distance if needed (last chance). Regards Jérôme Standard metadata property names in the ParseData metadata -- Key: NUTCH-139 URL: http://issues.apache.org/jira/browse/NUTCH-139 Project: Nutch Type: Improvement Components: fetcher Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB RAM, although bug is independent of environment Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Priority: Minor Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6 Attachments: NUTCH-139.Mattmann.patch.txt, NUTCH-139.jc.review.patch.txt
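The two-stage lookup sketched in this comment, a cheap canonical form first and Levenshtein distance as a last chance, could look like the following. The candidate list and distance threshold are assumptions, and commons-lang's Levenshtein implementation is replaced by an inline one here to keep the example self-contained:

```java
// Sketch of normalized metadata-name lookup: canonicalize, then fuzzy-match.
public class NameNormalizer {
    // Illustrative set of standard names; the real list would live in MetadataNames.
    static final String[] NAMES = { "content-type", "content-length", "creator" };

    // Stage 1 canonical form: lower case, letters only.
    static String canonical(String s) {
        return s.toLowerCase().replaceAll("[^a-z]", "");
    }

    // Classic dynamic-programming edit distance (stand-in for commons-lang).
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1], cur = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] t = prev; prev = cur; cur = t;
        }
        return prev[b.length()];
    }

    /** Returns the normalized name, or null if nothing is close enough. */
    public static String normalize(String raw) {
        String c = canonical(raw);
        for (String n : NAMES) if (canonical(n).equals(c)) return n;  // common cases
        String best = null; int bestDist = 3;                          // last chance
        for (String n : NAMES) {
            int d = levenshtein(c, canonical(n));
            if (d < bestDist) { bestDist = d; best = n; }
        }
        return best;
    }
}
```

Stage 1 already absorbs "Content Type", "ContentType" and "CONTENT_TYPE"; the edit-distance fallback only runs for genuine misspellings, so the common path stays cheap.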
[jira] Updated: (NUTCH-139) Standard metadata property names in the ParseData metadata
[ http://issues.apache.org/jira/browse/NUTCH-139?page=all ] Jerome Charron updated NUTCH-139: - Attachment: NUTCH-139.jc.review.patch.txt Here is a new patch from Chris. I reviewed it and tested it. From my point of view, all seems to be ok. So if there are no objections, I will commit it during the day. Regards Jérôme Standard metadata property names in the ParseData metadata -- Key: NUTCH-139 URL: http://issues.apache.org/jira/browse/NUTCH-139 Project: Nutch Type: Improvement Components: fetcher Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB RAM, although bug is independent of environment Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Priority: Minor Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6 Attachments: NUTCH-139.Mattmann.patch.txt, NUTCH-139.jc.review.patch.txt
[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata
[ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12360902 ] Jerome Charron commented on NUTCH-139: -- Andrzej, Thanks for taking time to take a look at the patch. In fact, we had some discussion with Chris about this point (that's why I didn't commit the patch directly; I already have some doubts about this). I will check right now how to handle things in this way. Standard metadata property names in the ParseData metadata -- Key: NUTCH-139 URL: http://issues.apache.org/jira/browse/NUTCH-139 Project: Nutch Type: Improvement Components: fetcher Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB RAM, although bug is independent of environment Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Priority: Minor Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6 Attachments: NUTCH-139.Mattmann.patch.txt, NUTCH-139.jc.review.patch.txt
[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata
[ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12360906 ] Jerome Charron commented on NUTCH-139: -- Andrzej, Here are more comments about my doubts, and how to handle metadata names. If, for instance, a protocol plugin doesn't have any Content-Length information (no header, like in FTP), then it should compute the content length and add it in the X-nutch-content-length attribute. But what do you suggest if a protocol has a Content-Length header (HTTP may provide one)? My feeling is adding the two metadata: 1. One for the Content-Length header in the Content-Length attribute 2. One for the real Content-Length (computed) in the X-nutch-content-length attribute. In other words, and more generally: * When adding a native protocol header, if an equivalent x-nutch attribute exists in MetadataNames, then it must be added too, with the same value or with a more precise value. * If no header information is available, try to fill in as many x-nutch attributes as the protocol level can. Do you agree with that? Standard metadata property names in the ParseData metadata -- Key: NUTCH-139 URL: http://issues.apache.org/jira/browse/NUTCH-139 Project: Nutch Type: Improvement Components: fetcher Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB RAM, although bug is independent of environment Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Priority: Minor Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6 Attachments: NUTCH-139.Mattmann.patch.txt, NUTCH-139.jc.review.patch.txt
[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata
[ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12360920 ] Jerome Charron commented on NUTCH-139: -- And why not use the fact that the ContentProperties object can now handle multi-valued properties? Each piece of code that wants to add more reliable content for a property simply adds its own value to the property = the first value is the raw one (for instance from the protocol level), and the further you iterate over the values of the property, the more reliable the value (the last one should be the most reliable, and is generally the interesting one; or, for other reasons, the original value may be needed, and then it is simply the first value). Yes, you lose one piece of information with this solution: you cannot ensure that the first value of a multi-valued property is the one from the protocol level. But it avoids searching for the same kind of information (the content-type for instance) under many property names (Content-Type for the protocol level and X-Nutch-Content-Type for other levels). We can extend the multi-valued properties by adding a provider attribute when adding a property: public void addProperty(key, value, provider). The provider can be one of PROTOCOL, CONTENT, OTHER, for instance (to be defined) Standard metadata property names in the ParseData metadata -- Key: NUTCH-139 URL: http://issues.apache.org/jira/browse/NUTCH-139 Project: Nutch Type: Improvement Components: fetcher Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB RAM, although bug is independent of environment Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Priority: Minor Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6 Attachments: NUTCH-139.Mattmann.patch.txt, NUTCH-139.jc.review.patch.txt
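The addProperty(key, value, provider) idea from the comment above could be sketched like this. The Provider constants and the accessor names are assumptions for illustration, not committed API:

```java
// Sketch of multi-valued properties with a provider tag per value.
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MultiValuedProps {
    public enum Provider { PROTOCOL, CONTENT, OTHER }

    private static class Entry {
        final String value; final Provider provider;
        Entry(String v, Provider p) { value = v; provider = p; }
    }

    private final Map<String, List<Entry>> props = new LinkedHashMap<>();

    public void addProperty(String key, String value, Provider provider) {
        props.computeIfAbsent(key, k -> new ArrayList<>()).add(new Entry(value, provider));
    }

    /** First (raw) value, typically the one the protocol layer saw. */
    public String getOriginal(String key) {
        List<Entry> l = props.get(key);
        return l == null ? null : l.get(0).value;
    }

    /** Last value added, i.e. the most refined one. */
    public String getBest(String key) {
        List<Entry> l = props.get(key);
        return l == null ? null : l.get(l.size() - 1).value;
    }

    /** First value contributed by a specific provider, or null. */
    public String getFrom(String key, Provider provider) {
        List<Entry> l = props.get(key);
        if (l != null) for (Entry e : l) if (e.provider == provider) return e.value;
        return null;
    }
}
```

Tagging each value with its provider removes the weakness the comment concedes: even if insertion order changes, the protocol-level value can still be recovered explicitly.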
[jira] Closed: (NUTCH-3) multi values of header discarded
[ http://issues.apache.org/jira/browse/NUTCH-3?page=all ] Jerome Charron closed NUTCH-3: -- Fix Version: 0.8-dev Resolution: Fixed Double-checked tests (unit and functional). http://svn.apache.org/viewcvs.cgi?rev=357334&view=rev http://svn.apache.org/viewcvs.cgi?rev=357335&view=rev Thanks Stefan. multi values of header discarded Key: NUTCH-3 URL: http://issues.apache.org/jira/browse/NUTCH-3 Project: Nutch Type: Bug Reporter: Stefan Groschupf Assignee: Stefan Groschupf Fix For: 0.8-dev Attachments: multiValuesPropertyPatch.txt original by: phoebe http://sourceforge.net/tracker/index.php?func=detail&aid=185&group_id=59548&atid=491356 multi values of header discarded Each successive setting of a header value deletes the previous one. This patch allows multiple values to be retained, such as cookies, using lf cr as a delimiter for each value. --- /tmp/HttpResponse.java 2005-01-27 19:57:55.0 -0500 +++ HttpResponse.java 2005-01-27 20:45:01.0 -0500 @@ -324,7 +324,19 @@ } String value = line.substring(valueStart); - headers.put(key, value); +//Spec allows multiple values, such as Set-Cookie - using lf cr as delimiter + if ( headers.containsKey(key)) { + try { + Object obj= headers.get(key); + if ( obj != null) { + String oldvalue= headers.get(key).toString(); + value = oldvalue + "\r\n" + value; + } + } catch (Exception e) { + e.printStackTrace(); + } + } + headers.put(key, value); } private Map parseHeaders(PushbackInputStream in, StringBuffer line) @@ -399,5 +411,3 @@ }
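The patched behavior, extracted from the diff above into a runnable sketch (class name illustrative):

```java
// Repeated headers (e.g. Set-Cookie) are concatenated with a CR/LF
// delimiter instead of overwriting the earlier value.
import java.util.HashMap;
import java.util.Map;

public class Headers {
    private final Map<String, String> headers = new HashMap<>();

    public void put(String key, String value) {
        String old = headers.get(key);
        // Spec allows multiple values for one header; keep them all.
        headers.put(key, old == null ? value : old + "\r\n" + value);
    }

    public String get(String key) { return headers.get(key); }
}
```

A caller that sets Set-Cookie twice then reads back both values, "\r\n"-separated, rather than only the last one.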
[jira] Updated: (NUTCH-135) http header meta data are case insensitive in the real world (e.g. Content-Type or content-type)
[ http://issues.apache.org/jira/browse/NUTCH-135?page=all ] Jerome Charron updated NUTCH-135: - Attachment: cached.jsp.patch cached.jsp must be patched too. http header meta data are case insensitive in the real world (e.g. Content-Type or content-type) Key: NUTCH-135 URL: http://issues.apache.org/jira/browse/NUTCH-135 Project: Nutch Type: Bug Components: fetcher Versions: 0.7, 0.7.1 Reporter: Stefan Groschupf Priority: Critical Fix For: 0.8-dev, 0.7.2-dev Attachments: cached.jsp.patch, contentProperties_patch.txt, contentProperties_patch_WithContentProperties.txt As described in issue NUTCH-133, some webservers return HTTP header metadata that is not standard-conformant in its case. This has many negative side effects: for example, querying the content type from the metadata returns null even when the webserver returned a content type, because the key is not in the standard-conformant case, e.g. lower case. This also affects the PDF parser, which queries the content length, etc.
[jira] Resolved: (NUTCH-135) http header meta data are case insensitive in the real world (e.g. Content-Type or content-type)
[ http://issues.apache.org/jira/browse/NUTCH-135?page=all ] Jerome Charron resolved NUTCH-135: -- Fix Version: (was: 0.7.2-dev) Resolution: Fixed Committed to trunk (to be merged into branch 0.7?) Thanks Stefan. I have performed unit and functional tests, but I don't have resources for a wide and intensive test. If someone can perform such a test, it would be greatly appreciated. Note: During my tests, I noticed some strange content-types returned by de.yahoo.com and all de.yahoo related files. The content-type returned by the protocol layer to the Content constructor is always text/plain, but when performing some wget requests on these sites the content-type in the headers is text/html ... sorry, I don't have time for more investigation. http header meta data are case insensitive in the real world (e.g. Content-Type or content-type) Key: NUTCH-135 URL: http://issues.apache.org/jira/browse/NUTCH-135 Project: Nutch Type: Bug Components: fetcher Versions: 0.7, 0.7.1 Reporter: Stefan Groschupf Priority: Critical Fix For: 0.8-dev Attachments: cached.jsp.patch, contentProperties_patch.txt, contentProperties_patch_WithContentProperties.txt
[jira] Commented: (NUTCH-133) ParserFactory does not work as expected
[ http://issues.apache.org/jira/browse/NUTCH-133?page=comments#action_12359729 ] Jerome Charron commented on NUTCH-133: -- Stefan: Taking a closer look at the ParserFactory patch: 1. You can use the MimeType.clean(String) static method to clean the content-type. 2. In the actual MimeTypes implementation, the getMimeType(String, byte[]) method returns the MimeType from the document name if one matches (without guessing from magic). So, use getMimeType(byte[]) if you want to guess the content type from magic. 3. Your patch doesn't really try to guess the content-type; instead it will try to parse the content by using the parsers declared for the header content-type AND then by the ones declared for the content-type detected from the file extension. It means that you guess the header content-type is more reliable... no? 4. There are too many calls to the .toLowerCase() and .equalsIgnoreCase() methods in your code. One of the major Java bottlenecks is String manipulation, so the basic idea is to use as little string manipulation as you can. 5. Looking at http://www.w3.org/TR/REC-html40/types.html#h-6.7 it seems that content-types are case-insensitive, so the solution to deal with content-type sensitivity is simply to patch the MimeType.clean(String) method so that it performs a toLowerCase on the mime-type. ParserFactory does not work as expected --- Key: NUTCH-133 URL: http://issues.apache.org/jira/browse/NUTCH-133 Project: Nutch Type: Bug Versions: 0.8-dev, 0.7.1, 0.7.2-dev Reporter: Stefan Groschupf Priority: Blocker Attachments: ParserFactoryPatch_nutch.0.7_patch.txt, Parserutil_test_patch.txt Marcel Schnippe detected a set of problems while working with different content and parser types; we worked together to identify the problem source. From our point of view, the problems described here could be the source of many other problems described daily on the mailing lists. Find a conclusion of the problems below.
Problem: Some servers return mixed-case but correct header keys like 'Content-type' or 'content-Length' in the HTTP response header. That's why, for example, a get("Content-Type") fails and a page is detected as zip by the magic content type detection mechanism. We also note that this is a common reason why PDF parsing fails, since "Content-Length" does not return the correct value. Sample: returns text/HTML or application/PDF or Content-length for this url: http://www.lanka.info/dictionary/EnglishToSinhala.jsp Solution: First, write only lower-case keys into the properties, and later convert all keys that are used to query the metadata to lower case as well. e.g.: HttpResponse.java, line 353: use lower case here and for all keys used to query header properties (also content-length). Change: String key = line.substring(0, colonIndex); to String key = line.substring(0, colonIndex).toLowerCase(); Problem: MimeTypes-based discovery (magic and url based) is only done when the content type was not delivered by the web server. This happens not that often; mostly this was a problem with mixed-case keys in the header. See: public Content toContent() { String contentType = getHeader("Content-Type"); if (contentType == null) { MimeType type = null; if (MAGIC) { type = MIME.getMimeType(orig, content); } else { type = MIME.getMimeType(orig); } if (type != null) { contentType = type.getName(); } else { contentType = ""; } } return new Content(orig, base, content, contentType, headers); } Solution: Use the content-type information as it is from the webserver and move the content type discovery from the Protocol plugins to the component where the parsing is done: the ParseFactory. Then just create a list of parsers for the content type returned by the server and the custom detected content type. In the end we can iterate over all parsers until we get a successfully parsed status.
Problem: Content will be parsed even if the protocol reports an exception and has a non-successful status; in such a case the content is new byte[0] in any case. Solution: Fetcher.java, line 243. Change: if (!Fetcher.this.parsing) { .. to if (!Fetcher.this.parsing || !protocolStatus.isSuccess()) { // TODO: we maybe should not write out empty parse text and parse data here; I suggest giving outputPage a parameter parsed true/false outputPage(new FetcherOutput(fle, hash, protocolStatus), content, new ParseText(), new ParseData(new ParseStatus(ParseStatus.NOTPARSED), "", new Outlink[0], new Properties())); return null; } Problem: Actually the configuration of parsers is done based on plugin ids, but one plugin can have several extensions, so normally a plugin
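Point 4 of the comment above (case-insensitive metadata keys) could be realized without lower-casing at every call site, for instance with a case-insensitive map. This is only a sketch of the idea, not the fix that was actually committed to Nutch:

```java
import java.util.Map;
import java.util.TreeMap;

public class CaseInsensitiveHeaders {
    // TreeMap with a case-insensitive comparator: "Content-Type",
    // "content-type" and "conTeNT-TyPE" all resolve to the same entry.
    private final Map<String, String> headers =
        new TreeMap<>(String.CASE_INSENSITIVE_ORDER);

    public void put(String key, String value) { headers.put(key, value); }
    public String get(String key) { return headers.get(key); }

    public static void main(String[] args) {
        CaseInsensitiveHeaders h = new CaseInsensitiveHeaders();
        h.put("Content-type", "text/html");
        System.out.println(h.get("Content-Type")); // text/html
    }
}
```

A design note: this keeps the original casing of the first key written, so headers can still be echoed back as received, while lookups ignore case.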
[jira] Commented: (NUTCH-133) ParserFactory does not work as expected
[ http://issues.apache.org/jira/browse/NUTCH-133?page=comments#action_12359647 ] Jerome Charron commented on NUTCH-133: -- Stefan: 1. URL extensions and also magic content type detection are used. This is the only way protocol-file and protocol-ftp can guess the content-type of a document (see FileResponse.java and FtpResponse.java). So the problem is only for HTTP. As an ASAP solution, I suggest patching the HTTP-related plugins by systematically using the mime-type resolver. But what is the policy to apply if you have both a mime-type from the protocol layer and another one from the mime-type resolver? Which one to use? (We have not yet settled on this...) What do you think about it? 2. I'm ok with Doug. This issue should be split into six separate issues. 3. Unit Tests: I'm ok to commit the tests provided by Stefan about the content-type case. But I'm not sure that TestParseUtil is the right place for them. They don't test the ParseUtil itself, but the way meta-data keys are stored in Nutch. 4. I think we can use case-insensitive metadata keys. I don't know any protocol for which case sensitivity is really used for headers or metadata keys (even if the specification says they are case sensitive). ParserFactory does not work as expected --- Key: NUTCH-133 URL: http://issues.apache.org/jira/browse/NUTCH-133 Project: Nutch Type: Bug Versions: 0.8-dev, 0.7.1, 0.7.2-dev Reporter: Stefan Groschupf Priority: Blocker Attachments: ParserFactoryPatch_nutch.0.7_patch.txt, Parserutil_test_patch.txt
[jira] Closed: (NUTCH-112) Link in cached.jsp page to cached content is an absolute link
[ http://issues.apache.org/jira/browse/NUTCH-112?page=all ] Jerome Charron closed NUTCH-112: Fix Version: 0.8-dev Resolution: Fixed Committed to trunk and mapred. http://svn.apache.org/viewcvs?rev=354575&view=rev http://svn.apache.org/viewcvs?rev=354582&view=rev Thanks Chris. Link in cached.jsp page to cached content is an absolute link - Key: NUTCH-112 URL: http://issues.apache.org/jira/browse/NUTCH-112 Project: Nutch Type: Bug Components: web gui Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev Environment: Windows XP Professional SP2, Intel Pentium M 2.0 GHz, 512 MB RAM, although the bug is independent of environment Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Priority: Trivial Fix For: 0.8-dev Attachments: NUTCH-112.Mattmann.patch.txt The link in the cached.jsp page that points to the cached content uses an absolute link, of the form /servlet/cached?idx=xxx&id=yyy. This causes an error when the user clicks on the link and the Nutch war is not deployed at the root context of the application server. The link should be of the form ./servlet/cached?idx=xxx&id=yyy, i.e., a relative link, to correct this problem. I've attached a small patch that fixes the error. I've tested the patch in my local environment and it fixes the error.
[jira] Commented: (NUTCH-133) ParserFactory does not work as expected
[ http://issues.apache.org/jira/browse/NUTCH-133?page=comments#action_12359487 ] Jerome Charron commented on NUTCH-133: -- Thanks for this really very good description. Just a quick note: I'm currently in the final steps of a new mime-type repository implementation (compliant with the freedesktop specification). So I suggest not focusing on the mime-type issues for now. About the MimeResolution being moved to the parser factory: +1. (As you probably noticed by looking at the comments in the code, it was planned... for when the new mime-type repository becomes available. But unfortunately, it takes more time than expected.) ParserFactory does not work as expected --- Key: NUTCH-133 URL: http://issues.apache.org/jira/browse/NUTCH-133 Project: Nutch Type: Bug Versions: 0.8-dev, 0.7.1, 0.7.2-dev Reporter: Stefan Groschupf Priority: Blocker Attachments: ParserFactoryPatch_nutch.0.7_patch.txt, Parserutil_test_patch.txt Marcel Schnippe detected a set of problems while working with different content and parser types; we worked together to identify the problem source. From our point of view, the problems described here could be the source of many other problems described daily on the mailing lists. Find a conclusion of the problems below. Problem: Some servers return mixed-case but correct header keys like 'Content-type' or 'content-Length' in the HTTP response header. That's why, for example, a get("Content-Type") fails and a page is detected as zip by the magic content type detection mechanism. We also note that this is a common reason why PDF parsing fails, since "Content-Length" does not return the correct value. Sample: returns text/HTML or application/PDF or Content-length for this url: http://www.lanka.info/dictionary/EnglishToSinhala.jsp Solution: First, write only lower-case keys into the properties, and later convert all keys that are used to query the metadata to lower case as well.
e.g.: HttpResponse.java, line 353: use lower case here and for all keys used to query header properties (also content-length). Change: String key = line.substring(0, colonIndex); to String key = line.substring(0, colonIndex).toLowerCase(); Problem: MimeTypes-based discovery (magic and url based) is only done when the content type was not delivered by the web server. This happens not that often; mostly this was a problem with mixed-case keys in the header. See: public Content toContent() { String contentType = getHeader("Content-Type"); if (contentType == null) { MimeType type = null; if (MAGIC) { type = MIME.getMimeType(orig, content); } else { type = MIME.getMimeType(orig); } if (type != null) { contentType = type.getName(); } else { contentType = ""; } } return new Content(orig, base, content, contentType, headers); } Solution: Use the content-type information as it is from the webserver and move the content type discovery from the Protocol plugins to the component where the parsing is done: the ParseFactory. Then just create a list of parsers for the content type returned by the server and the custom detected content type. In the end we can iterate over all parsers until we get a successfully parsed status. Problem: Content will be parsed even if the protocol reports an exception and has a non-successful status; in such a case the content is new byte[0] in any case. Solution: Fetcher.java, line 243. Change: if (!Fetcher.this.parsing) { ..
to if (!Fetcher.this.parsing || !protocolStatus.isSuccess()) { // TODO: we maybe should not write out empty parse text and parse data here; I suggest giving outputPage a parameter parsed true/false outputPage(new FetcherOutput(fle, hash, protocolStatus), content, new ParseText(), new ParseData(new ParseStatus(ParseStatus.NOTPARSED), "", new Outlink[0], new Properties())); return null; } Problem: Actually the configuration of parsers is done based on plugin ids, but one plugin can have several extensions, so normally a plugin can provide several parsers, but this is not limited: just wrong values are used in the configuration process. Solution: Change plugin id to extension id in the parser configuration file and also change the code in the parser factory to use extension ids everywhere. Problem: There is not a clear differentiation between content type and mime type. I notice that some plugins call metaData.get("Content-Type") or content.getContentType(). Actually, in theory, these can return different values, since the content type could be detected by the MimeTypes util and is not the same as delivered in the http response header. As mentioned, actually content type is only detected by the MimeTypes
[jira] Commented: (NUTCH-88) Enhance ParserFactory plugin selection policy
[ http://issues.apache.org/jira/browse/NUTCH-88?page=comments#action_12332568 ] Jerome Charron commented on NUTCH-88: - Corrections are committed (http://svn.apache.org/viewcvs.cgi?rev=326889&view=rev). Sorry for the delay, but I do my best... (thanks Chris for proposing your help) Implementation Note: In this implementation, the MimeType.clean(String) method constructs a new MimeType object (the MimeType constructor cleans the content-type) each time it is called. It was the fastest way to solve this issue. But it is not optimal code, since it would be better for performance (avoiding the instantiation of very short-lived objects) that: 1. The clean method really contains the cleaning code. 2. The MimeType constructors use the clean method. Regards Enhance ParserFactory plugin selection policy - Key: NUTCH-88 URL: http://issues.apache.org/jira/browse/NUTCH-88 Project: Nutch Type: Improvement Components: indexer Versions: 0.7, 0.8-dev Reporter: Jerome Charron Assignee: Jerome Charron Fix For: 0.8-dev The ParserFactory chooses the Parser plugin to use based on the content-types and path-suffix defined in each parser's plugin.xml file. The selection policy is as follows: Content type has priority: the first plugin found whose contentType attribute matches the beginning of the content's type is used. If none match, then the first whose pathSuffix attribute matches the end of the url's path is used. If neither of these match, then the first plugin whose pathSuffix is the empty string is used. This policy has a lot of problems when no match is found, because a random parser is used (and there is a good chance this parser can't handle the content).
On the other hand, the content-type associated with a parser plugin is specified in the plugin.xml of each plugin (this is the value used by the ParserFactory), AND the code of each parser itself checks whether the content-type is ok (it uses a hard-coded content-type value, not the value specified in the plugin.xml => possibility of mismatches between the hard-coded content-type and the content-type declared in plugin.xml). A complete list of problems and discussion about this point is available in: * http://www.mail-archive.com/nutch-user%40lucene.apache.org/msg00744.html * http://www.mail-archive.com/nutch-dev%40lucene.apache.org/msg00789.html
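The refactoring suggested in the implementation note, where the static clean(String) holds the actual cleaning code and the constructors delegate to it so that clean() no longer allocates a throwaway MimeType, might look like this. The cleaning rules shown (lower-casing and stripping parameters such as a charset) are assumptions for illustration, not the exact rules of Nutch's MimeType class:

```java
public class MimeType {
    private final String name;

    // The constructor delegates to clean() instead of the other way around.
    public MimeType(String rawName) {
        this.name = clean(rawName);
    }

    // Static cleaning logic: no MimeType instance is created just to clean a
    // string. Lower-case the type and drop parameters like "; charset=UTF-8".
    public static String clean(String rawName) {
        String cleaned = rawName.trim().toLowerCase();
        int semi = cleaned.indexOf(';');
        return (semi >= 0) ? cleaned.substring(0, semi).trim() : cleaned;
    }

    public String getName() { return name; }

    public static void main(String[] args) {
        System.out.println(MimeType.clean("Text/HTML; charset=UTF-8")); // text/html
    }
}
```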
[jira] Resolved: (NUTCH-88) Enhance ParserFactory plugin selection policy
[ http://issues.apache.org/jira/browse/NUTCH-88?page=all ] Jerome Charron resolved NUTCH-88: - Resolution: Fixed Assign To: Jerome Charron Second step implementation details: http://svn.apache.org/viewcvs.cgi?rev=292865&view=rev And final step implementation details: http://svn.apache.org/viewcvs.cgi?rev=321231&view=rev (some unit test corrections: http://svn.apache.org/viewcvs.cgi?rev=321250&view=rev) Big thanks to Chris Mattmann and Sébastien Le Callonnec. Enhance ParserFactory plugin selection policy - Key: NUTCH-88 URL: http://issues.apache.org/jira/browse/NUTCH-88 Project: Nutch Type: Improvement Components: indexer Versions: 0.7, 0.8-dev Reporter: Jerome Charron Assignee: Jerome Charron Fix For: 0.8-dev
[jira] Commented: (NUTCH-88) Enhance ParserFactory plugin selection policy
[ http://issues.apache.org/jira/browse/NUTCH-88?page=comments#action_12323007 ] Jerome Charron commented on NUTCH-88: - Dawid, Thanks for your pointers on IE MimeType resolution. We have in Nutch a MimeType resolver based on both file extensions and magic byte sequences to find the content-type of a file. It is actually underused, and perhaps some enhancements must be added, such as content-type mapping: allow a content-type to be mapped to a normalized one (e.g. mapping application/powerpoint to application/vnd.ms-powerpoint, so that only the normalized version must be registered in the plugin.xml file). Chris, Thanks in advance for your future work. Could you please synchronize your efforts with Sébastien, since he seems very interested in contributing. Andrzej, The way to express a preference for one plugin over another, if both support the same content type, is to activate the plugin you want to handle a content-type and deactivate the other ones. No? Note: Since the MimeResolver handles associations between file extensions and content-types, the path-suffix in plugin.xml (and in the ParserFactory policy for choosing a Parser) could certainly be removed in order to have only one central point for storing this knowledge. Enhance ParserFactory plugin selection policy - Key: NUTCH-88 URL: http://issues.apache.org/jira/browse/NUTCH-88 Project: Nutch Type: Improvement Components: indexer Versions: 0.7, 0.8-dev Reporter: Jerome Charron Fix For: 0.8-dev
[jira] Created: (NUTCH-88) Enhance ParserFactory plugin selection policy
Enhance ParserFactory plugin selection policy - Key: NUTCH-88 URL: http://issues.apache.org/jira/browse/NUTCH-88 Project: Nutch Type: Improvement Components: indexer Versions: 0.7, 0.8-dev Reporter: Jerome Charron Fix For: 0.8-dev The ParserFactory chooses the Parser plugin to use based on the content-types and path-suffix defined in each parser's plugin.xml file. The selection policy is as follows: Content type has priority: the first plugin found whose contentType attribute matches the beginning of the content's type is used. If none match, then the first whose pathSuffix attribute matches the end of the url's path is used. If neither of these match, then the first plugin whose pathSuffix is the empty string is used. This policy has a lot of problems when no match is found, because a random parser is used (and there is a good chance this parser can't handle the content). On the other hand, the content-type associated with a parser plugin is specified in the plugin.xml of each plugin (this is the value used by the ParserFactory), AND the code of each parser itself checks whether the content-type is ok (it uses a hard-coded content-type value, not the value specified in the plugin.xml => possibility of mismatches between the hard-coded content-type and the content-type declared in plugin.xml). A complete list of problems and discussion about this point is available in: * http://www.mail-archive.com/nutch-user%40lucene.apache.org/msg00744.html * http://www.mail-archive.com/nutch-dev%40lucene.apache.org/msg00789.html
[jira] Resolved: (NUTCH-53) Parser plugin for Zip files
[ http://issues.apache.org/jira/browse/NUTCH-53?page=all ] Jerome Charron resolved NUTCH-53: - Fix Version: 0.8-dev Resolution: Fixed Parser committed after some minor refactoring due to some API changes. (http://svn.apache.org/viewcvs.cgi?rev=278626&view=rev) Thanks to Rohit Kulkarni. Parser plugin for Zip files --- Key: NUTCH-53 URL: http://issues.apache.org/jira/browse/NUTCH-53 Project: Nutch Type: Improvement Components: fetcher Reporter: Rohit Kulkarni Priority: Trivial Fix For: 0.8-dev Attachments: parse-zip.zip Nutch plugin to parse Zip files (using java.util.zip)
[jira] Closed: (NUTCH-21) parser plugin for MS PowerPoint slides
[ http://issues.apache.org/jira/browse/NUTCH-21?page=all ] Jerome Charron closed NUTCH-21: --- Fix Version: 0.8-dev Resolution: Fixed Committed to trunk (http://svn.apache.org/viewcvs.cgi?rev=267226&view=rev) Thanks to Stephan Strittmatter. Note: Take care with the patches attached to this issue, since the unit tests are platform dependent (they only succeed on Windows). The committed code is platform independent (I hope). I tested it on Linux, so if someone can test it on other platforms it would be a good idea. parser plugin for MS PowerPoint slides -- Key: NUTCH-21 URL: http://issues.apache.org/jira/browse/NUTCH-21 Project: Nutch Type: Improvement Components: fetcher Reporter: Stefan Groschupf Priority: Trivial Fix For: 0.8-dev Attachments: MSPowerPointParser.java, build.xml.patch.txt, parse-mspowerpoint.zip, parse-mspowerpoint.zip transferred from: http://sourceforge.net/tracker/index.php?func=detail&aid=1109321&group_id=59548&atid=491356 submitted by: Stephan Strittmatter
[jira] Closed: (NUTCH-65) index-more plugin can't parse large set of modification-date
[ http://issues.apache.org/jira/browse/NUTCH-65?page=all ] Jerome Charron closed NUTCH-65: --- Resolution: Fixed Patch committed (http://svn.apache.org/viewcvs.cgi?rev=265794&view=rev) index-more plugin can't parse large set of modification-date - Key: NUTCH-65 URL: http://issues.apache.org/jira/browse/NUTCH-65 Project: Nutch Type: Bug Components: indexer Versions: 0.7, 0.8-dev Environment: nutch 0.7, java 1.5, linux Reporter: Lutischán Ferenc Fix For: 0.8-dev Attachments: MoreIndexingFilter.diff, MoreIndexingFilter.java, commons-lang-2.1.jar I found a problem in MoreIndexingFilter.java. When I index segments, I get a long list of error messages: can't parse errorenous date: Wed, 10 Sep 2003 11:59:14 or can't parse errorenous date: Wed, 10 Sep 2003 11:59:14GMT I modified the source code (I didn't make a 'patch'): Original (lines 137-138): DateFormat df = new SimpleDateFormat("EEE MMM dd HH:mm:ss zzz"); Date d = df.parse(date); New: DateFormat df = new SimpleDateFormat("EEE, MMM dd HH:mm:ss ", Locale.US); Date d = df.parse(date.substring(0,25)); The modified code works fine.
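A more robust variant of the reported fix is to try several date patterns in order and only give up when none matches; because DateFormat.parse(String) ignores trailing text beyond the pattern, a stray suffix like "GMT" glued to the seconds no longer breaks parsing. The patterns below are illustrative, not the ones MoreIndexingFilter actually uses:

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

public class LenientDateParser {
    // Try a full RFC-1123-style pattern first, then one without a zone.
    private static final String[] PATTERNS = {
        "EEE, dd MMM yyyy HH:mm:ss zzz",
        "EEE, dd MMM yyyy HH:mm:ss",
    };

    public static Date parse(String date) {
        for (String p : PATTERNS) {
            try {
                // parse() stops at the end of the pattern, so trailing
                // garbage such as "14GMT" does not cause a failure here.
                return new SimpleDateFormat(p, Locale.US).parse(date.trim());
            } catch (ParseException ignored) {
                // fall through to the next pattern
            }
        }
        return null; // no pattern matched
    }

    public static void main(String[] args) {
        System.out.println(LenientDateParser.parse("Wed, 10 Sep 2003 11:59:14") != null);
        System.out.println(LenientDateParser.parse("Wed, 10 Sep 2003 11:59:14GMT") != null);
    }
}
```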
[jira] Commented: (NUTCH-21) parser plugin for MS PowerPoint slides
[ http://issues.apache.org/jira/browse/NUTCH-21?page=comments#action_12320717 ] Jerome Charron commented on NUTCH-21: - I want to commit it, but the unit tests failed. parser plugin for MS PowerPoint slides -- Key: NUTCH-21 URL: http://issues.apache.org/jira/browse/NUTCH-21 Project: Nutch Type: Improvement Components: fetcher Reporter: Stefan Groschupf Priority: Trivial Attachments: build.xml.patch.txt, parse-mspowerpoint.zip, parse-mspowerpoint.zip transferred from: http://sourceforge.net/tracker/index.php?func=detail&aid=1109321&group_id=59548&atid=491356 submitted by: Stephan Strittmatter -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Closed: (NUTCH-20) Extract urls from plain texts
[ http://issues.apache.org/jira/browse/NUTCH-20?page=all ] Jerome Charron closed NUTCH-20: --- Fix Version: 0.8-dev Resolution: Fixed Revision 233559 - http://svn.apache.org/viewcvs.cgi?rev=233559&view=rev * Add utility to extract urls from plain text (thanks to Stephan Strittmatter) * Uses the OutlinkExtractor in parse plugins PDF, MSWord, Text, RTF, Ext Note: Take a look at the JSParseFilter in order to use the OutlinkExtractor in it. Extract urls from plain texts -- Key: NUTCH-20 URL: http://issues.apache.org/jira/browse/NUTCH-20 Project: Nutch Type: Improvement Components: fetcher Reporter: Stefan Groschupf Priority: Trivial Fix For: 0.8-dev Attachments: OutlinkExtractor.java, OutlinkExtractor.java, OutlinkExtractor.java, TestOutlink.java, TestOutlink.java, patch.txt Some parsers return no Outlinks, e.g. the Word parser. This class is able to extract (absolute) hyperlinks from a plain String (content) and generates outlinks from them. This would be very useful for parsers that have no explicit extraction of hyperlinks. Example: Outlink[] links = OutlinkExtractor.getOutlinks("Nutch is located at http://www.apache.org and ..."); This will return an array of Outlinks containing the single element http://www.apache.org. transferred from: http://sourceforge.net/tracker/index.php?func=detail&aid=1109328&group_id=59548&atid=491356 submitted by: Stephan Strittmatter -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
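The idea behind OutlinkExtractor can be sketched with a plain regex scan over the text. This is a simplified standalone version (the `OutlinkSketch` class and its URL pattern are assumptions for illustration; the real OutlinkExtractor uses a more elaborate expression and builds Nutch `Outlink` objects rather than strings):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OutlinkSketch {
    // Simplified absolute-URL pattern; the real extractor is more permissive
    // about schemes and URL characters.
    private static final Pattern URL_PATTERN =
        Pattern.compile("https?://[\\w.-]+(?::\\d+)?(?:/[\\w./?&=%-]*)?");

    static List<String> getOutlinks(String plainText) {
        List<String> links = new ArrayList<String>();
        Matcher m = URL_PATTERN.matcher(plainText);
        while (m.find()) {
            links.add(m.group());
        }
        return links;
    }

    public static void main(String[] args) {
        System.out.println(getOutlinks("Nutch is located at http://www.apache.org and ..."));
    }
}
```

A parser that produces no explicit hyperlinks (Word, PDF, RTF) can run its extracted plain text through such a scan and attach the matches as outlinks.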
[jira] Closed: (NUTCH-71) Search web page doesn't focus on query input
[ http://issues.apache.org/jira/browse/NUTCH-71?page=all ] Jerome Charron closed NUTCH-71: --- Fix Version: 0.8-dev Resolution: Fixed Assign To: Jerome Charron Thanks Christophe for reporting it and for your piece of code. Search web page doesn't focus on query input Key: NUTCH-71 URL: http://issues.apache.org/jira/browse/NUTCH-71 Project: Nutch Type: Bug Components: searcher Reporter: Christophe Noel Assignee: Jerome Charron Priority: Minor Fix For: 0.8-dev Attachments: searchQueryFocus.patch In search.html and search.jsp, the keyboard cursor does not focus on the form's query input. I've made a patch for the en and fr search.html and for search.jsp. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Updated: (NUTCH-74) French Analyzer Plugin
[ http://issues.apache.org/jira/browse/NUTCH-74?page=all ] Jerome Charron updated NUTCH-74: Component: indexer Fix Version: 0.8-dev Version: 0.7 0.6 0.8-dev French Analyzer Plugin -- Key: NUTCH-74 URL: http://issues.apache.org/jira/browse/NUTCH-74 Project: Nutch Type: New Feature Components: indexer Versions: 0.7, 0.8-dev, 0.6 Environment: Nutch Reporter: Christophe Noel Assignee: Jerome Charron Fix For: 0.8-dev Attachments: analyze-french.zip, analyzers-050705.patch This is a DRAFT of a new plugin for French analysis (all Java files come from the Lucene project sandbox)... It includes an ISO Latin-1 accent filter, plural-form removal, ... analyze-french should be used instead of NutchDocumentAnalysis, as described by Jerome Charron in the new Language Identifier project. It should also be used as a query parser in the Nutch searcher. We are missing an EXTENSION-POINT to include this kind of plugin in Nutch. Could anyone help me build this new extension point, please? -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
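The accent-filtering step the analyze-french plugin relies on can be illustrated without Lucene: decompose each character, then strip the combining marks, so that "été" and "ete" map to the same index term. A minimal standalone sketch (the `AccentFoldSketch` class is an assumption for illustration; the actual plugin runs Lucene's accent filter inside a TokenStream chain rather than a string helper like this):

```java
import java.text.Normalizer;

public class AccentFoldSketch {
    // Fold accented ISO Latin-1 characters to their base letters.
    static String foldAccents(String token) {
        // NFD splits é into e + combining acute accent...
        String decomposed = Normalizer.normalize(token, Normalizer.Form.NFD);
        // ...and \p{M} matches the combining marks, which we drop.
        return decomposed.replaceAll("\\p{M}", "");
    }

    public static void main(String[] args) {
        System.out.println(foldAccents("éléphant")); // elephant
    }
}
```

Doing this at both index and query time is what makes the same filter chain necessary in the searcher too, which is why the reporter asks for a query-analyzer extension point.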