[jira] Commented: (NUTCH-309) Uses commons logging Code Guards
[ http://issues.apache.org/jira/browse/NUTCH-309?page=comments#action_12419670 ] Jerome Charron commented on NUTCH-309:

As already discussed, it makes perfect sense and I have planned to work on this issue. Another minor change I would like to make is to replace the log4j.properties by log4j.xml: log4j.xml provides more functionality and flexibility, especially filters, which provide a way to log to different appenders depending on the log level (for instance, I use this to log all levels to a file and the warn and error levels to the console).

Uses commons logging Code Guards
Key: NUTCH-309
URL: http://issues.apache.org/jira/browse/NUTCH-309
Project: Nutch
Type: Improvement
Versions: 0.8-dev
Reporter: Jerome Charron
Assignee: Jerome Charron
Priority: Minor
Fix For: 0.8-dev

Code guards are typically used to guard code that only needs to execute in support of logging, that otherwise introduces undesirable runtime overhead in the general case (logging disabled). Examples are multiple parameters, or expressions (e.g. string + more) for parameters. Use the guard methods of the form log.isPriority() to verify that logging should be performed, before incurring the overhead of the logging method call. Yes, the logging methods will perform the same check, but only after resolving parameters. (description extracted from http://jakarta.apache.org/commons/logging/guide.html#Code_Guards)

--
This message is automatically generated by JIRA.
- If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
- For more information on JIRA, see: http://www.atlassian.com/software/jira
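The filter-based routing Jerome describes could look like the hypothetical log4j.xml fragment below (a sketch, not from the Nutch tree): every level goes to a file appender, while a LevelRangeFilter restricts the console to warn and error.

```xml
<!-- Hypothetical log4j.xml fragment, not the actual Nutch configuration. -->
<appender name="file" class="org.apache.log4j.DailyRollingFileAppender">
  <param name="File" value="logs/nutch.log"/>
  <layout class="org.apache.log4j.PatternLayout">
    <param name="ConversionPattern" value="%d %-5p %c - %m%n"/>
  </layout>
</appender>

<appender name="console" class="org.apache.log4j.ConsoleAppender">
  <layout class="org.apache.log4j.PatternLayout">
    <param name="ConversionPattern" value="%-5p %m%n"/>
  </layout>
  <!-- Filters like this are only available in the XML format,
       not in log4j.properties. -->
  <filter class="org.apache.log4j.varia.LevelRangeFilter">
    <param name="LevelMin" value="WARN"/>
  </filter>
</appender>

<root>
  <level value="DEBUG"/>
  <appender-ref ref="file"/>
  <appender-ref ref="console"/>
</root>
```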
[jira] Resolved: (NUTCH-317) Clarify what the queryLanguage argument of Query.parse(...) means
[ http://issues.apache.org/jira/browse/NUTCH-317?page=all ] Jerome Charron resolved NUTCH-317:

Fix Version: 0.8-dev
Resolution: Fixed

Fixed.

Clarify what the queryLanguage argument of Query.parse(...) means
Key: NUTCH-317
URL: http://issues.apache.org/jira/browse/NUTCH-317
Project: Nutch
Type: Bug
Components: searcher
Versions: 0.8-dev
Reporter: KuroSaka TeruHiko
Fix For: 0.8-dev

The API documentation for Query.parse(String queryString, String queryLang, Configuration conf) does not explain what queryLang is, and it should be explained. There are at least two interpretations: (1) Create a Query that restricts the search to include only the documents written in the specified language. This would be the equivalent of specifying lang:xx, where xx is a two-letter language code. (2) Create a Query interpreting the queryString according to the rules of the specified language. In reality, this is used to select the proper language Analyzer to parse the query string. I am guessing that (2) is intended.
[jira] Commented: (NUTCH-309) Uses commons logging Code Guards
[ http://issues.apache.org/jira/browse/NUTCH-309?page=comments#action_12418404 ] Jerome Charron commented on NUTCH-309:

Dawid, you know, sed, awk and regexes are my friends, so it was not so painful ;-) As I mentioned in a previous mail, it was just a crude pass on logging: a finer one is planned to review log levels and code guards. AspectJ: +1 for using it for logging, but I don't know what the performance impacts are...

Uses commons logging Code Guards
Key: NUTCH-309
URL: http://issues.apache.org/jira/browse/NUTCH-309
Project: Nutch
Type: Improvement
Versions: 0.8-dev
Reporter: Jerome Charron
Assignee: Jerome Charron
Priority: Minor
Fix For: 0.8-dev

Code guards are typically used to guard code that only needs to execute in support of logging, that otherwise introduces undesirable runtime overhead in the general case (logging disabled). Examples are multiple parameters, or expressions (e.g. string + more) for parameters. Use the guard methods of the form log.isPriority() to verify that logging should be performed, before incurring the overhead of the logging method call. Yes, the logging methods will perform the same check, but only after resolving parameters. (description extracted from http://jakarta.apache.org/commons/logging/guide.html#Code_Guards)
[jira] Created: (NUTCH-309) Uses commons logging Code Guards
Uses commons logging Code Guards
Key: NUTCH-309
URL: http://issues.apache.org/jira/browse/NUTCH-309
Project: Nutch
Type: Improvement
Versions: 0.8-dev
Reporter: Jerome Charron
Assigned to: Jerome Charron
Priority: Minor
Fix For: 0.8-dev

Code guards are typically used to guard code that only needs to execute in support of logging, that otherwise introduces undesirable runtime overhead in the general case (logging disabled). Examples are multiple parameters, or expressions (e.g. string + more) for parameters. Use the guard methods of the form log.isPriority() to verify that logging should be performed, before incurring the overhead of the logging method call. Yes, the logging methods will perform the same check, but only after resolving parameters. (description extracted from http://jakarta.apache.org/commons/logging/guide.html#Code_Guards)
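The code-guard idiom described above can be sketched as follows. Nutch uses the commons-logging guard methods (log.isDebugEnabled() and friends); this self-contained sketch shows the same idea with the JDK's java.util.logging, so the message-building cost can be counted:

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class GuardExample {
    static final Logger LOG = Logger.getLogger(GuardExample.class.getName());
    static int buildCount = 0; // counts how often the log message was built

    // Stands in for an expensive expression passed to the log call,
    // e.g. string concatenation of several parameters.
    static String expensiveMessage() {
        buildCount++;
        return "fetched " + "http://example.com/" + " (" + 1024 + " bytes)";
    }

    static void fetchWithGuard() {
        // The guard: the message is only built when FINE logging is enabled.
        // Without it, expensiveMessage() would run even when the level is off.
        if (LOG.isLoggable(Level.FINE)) {
            LOG.fine(expensiveMessage());
        }
    }

    public static void main(String[] args) {
        LOG.setLevel(Level.WARNING); // FINE is disabled
        fetchWithGuard();
        // The guard skipped message construction entirely.
        System.out.println("buildCount=" + buildCount); // prints buildCount=0
    }
}
```

The logging method would perform the same level check internally, but only after the message string has already been concatenated; the guard avoids that work.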
[jira] Created: (NUTCH-310) Review Log Levels
Review Log Levels
Key: NUTCH-310
URL: http://issues.apache.org/jira/browse/NUTCH-310
Project: Nutch
Type: Improvement
Versions: 0.8-dev
Reporter: Jerome Charron
Assigned to: Jerome Charron
Priority: Minor
Fix For: 0.8-dev

Review of log content and log levels (see the Commons Logging Best Practices: http://jakarta.apache.org/commons/logging/guide.html#Message_Priorities_Levels)
[jira] Resolved: (NUTCH-309) Uses commons logging Code Guards
[ http://issues.apache.org/jira/browse/NUTCH-309?page=all ] Jerome Charron resolved NUTCH-309:

Resolution: Fixed

Logging code guards added. http://svn.apache.org/viewvc?view=rev&revision=416346

Uses commons logging Code Guards
Key: NUTCH-309
URL: http://issues.apache.org/jira/browse/NUTCH-309
Project: Nutch
Type: Improvement
Versions: 0.8-dev
Reporter: Jerome Charron
Assignee: Jerome Charron
Priority: Minor
Fix For: 0.8-dev

Code guards are typically used to guard code that only needs to execute in support of logging, that otherwise introduces undesirable runtime overhead in the general case (logging disabled). Examples are multiple parameters, or expressions (e.g. string + more) for parameters. Use the guard methods of the form log.isPriority() to verify that logging should be performed, before incurring the overhead of the logging method call. Yes, the logging methods will perform the same check, but only after resolving parameters. (description extracted from http://jakarta.apache.org/commons/logging/guide.html#Code_Guards)
[jira] Resolved: (NUTCH-307) wrong configured log4j.properties
[ http://issues.apache.org/jira/browse/NUTCH-307?page=all ] Jerome Charron resolved NUTCH-307:

Resolution: Fixed
Assign To: Jerome Charron

Nutch now uses the Hadoop variable names for the file name used by DRFA logging.

wrong configured log4j.properties
Key: NUTCH-307
URL: http://issues.apache.org/jira/browse/NUTCH-307
Project: Nutch
Type: Bug
Reporter: Stefan Groschupf
Assignee: Jerome Charron
Priority: Blocker
Fix For: 0.8-dev

In nutch/conf there is only one log4j.properties, and it defines: log4j.appender.DRFA.File=${nutch.log.dir}/${nutch.log.file} nutch.log.dir and nutch.log.file are only defined in the bin/nutch script. When starting a distributed Nutch instance with bin/start-all, the remote tasktracker crashes with:

java.io.FileNotFoundException: / (Is a directory)
cr06: at java.io.FileOutputStream.openAppend(Native Method)
cr06: at java.io.FileOutputStream.<init>(FileOutputStream.java:177)
cr06: at java.io.FileOutputStream.<init>(FileOutputStream.java:102)
cr06: at org.apache.log4j.FileAppender.setFile(FileAppender.java:289)
cr06: at org.apache.log4j.FileAppender.activateOptions(FileAppender.java:163)
cr06: at org.apache.log4j.DailyRollingFileAppender.activateOptions(DailyRollingFileAppender.java:215)
cr06: at org.apache.log4j.config.PropertySetter.activate(PropertySetter.java:256)

since the Hadoop scripts used to start the tasktrackers and datanodes never define the Nutch log properties, but log4j.properties requires such a definition. I suggest leaving log4j.properties as it is in Hadoop, but defining the Hadoop property names in the bin/nutch script instead of introducing new variable names.
[jira] Commented: (NUTCH-307) wrong configured log4j.properties
[ http://issues.apache.org/jira/browse/NUTCH-307?page=comments#action_12416895 ] Jerome Charron commented on NUTCH-307:

Hi Stefan, thanks for this feedback. In fact, as I mentioned in a previous mail (http://www.mail-archive.com/nutch-dev%40lucene.apache.org/msg03907.html) I had some hesitations about using the hadoop properties instead of introducing some nutch properties. I'll change this right now!

wrong configured log4j.properties
Key: NUTCH-307
URL: http://issues.apache.org/jira/browse/NUTCH-307
Project: Nutch
Type: Bug
Reporter: Stefan Groschupf
Priority: Blocker
Fix For: 0.8-dev

In nutch/conf there is only one log4j.properties, and it defines: log4j.appender.DRFA.File=${nutch.log.dir}/${nutch.log.file} nutch.log.dir and nutch.log.file are only defined in the bin/nutch script. When starting a distributed Nutch instance with bin/start-all, the remote tasktracker crashes with:

java.io.FileNotFoundException: / (Is a directory)
cr06: at java.io.FileOutputStream.openAppend(Native Method)
cr06: at java.io.FileOutputStream.<init>(FileOutputStream.java:177)
cr06: at java.io.FileOutputStream.<init>(FileOutputStream.java:102)
cr06: at org.apache.log4j.FileAppender.setFile(FileAppender.java:289)
cr06: at org.apache.log4j.FileAppender.activateOptions(FileAppender.java:163)
cr06: at org.apache.log4j.DailyRollingFileAppender.activateOptions(DailyRollingFileAppender.java:215)
cr06: at org.apache.log4j.config.PropertySetter.activate(PropertySetter.java:256)

since the Hadoop scripts used to start the tasktrackers and datanodes never define the Nutch log properties, but log4j.properties requires such a definition. I suggest leaving log4j.properties as it is in Hadoop, but defining the Hadoop property names in the bin/nutch script instead of introducing new variable names.
[jira] Commented: (NUTCH-110) OpenSearchServlet outputs illegal xml characters
[ http://issues.apache.org/jira/browse/NUTCH-110?page=comments#action_12416523 ] Jerome Charron commented on NUTCH-110:

This patch processes the String twice if it contains some illegal characters!

OpenSearchServlet outputs illegal xml characters
Key: NUTCH-110
URL: http://issues.apache.org/jira/browse/NUTCH-110
Project: Nutch
Type: Bug
Components: searcher
Versions: 0.7
Environment: linux, jdk 1.5
Reporter: [EMAIL PROTECTED]
Attachments: NUTCH-110-version2.patch, fixIllegalXmlChars.patch, fixIllegalXmlChars08-v2.patch, fixIllegalXmlChars08.patch

OpenSearchServlet does not check text-to-output for illegal xml characters; depending on the search result, it's possible for OSS to output xml that is not well-formed. For example, if the text has the FF character in it, i.e. the ASCII character at position (decimal) 12, the produced XML will show the FF character as '&#12;'. The character/entity '&#12;' is not legal in XML according to http://www.w3.org/TR/2000/REC-xml-20001006#NT-Char.
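A single-pass filter of the kind Jerome is asking for could look like this sketch (class and method names are illustrative, not from the attached patches); it keeps exactly the characters allowed by the XML 1.0 Char production and drops the rest in one traversal:

```java
public class XmlCharFilter {
    // True iff c is legal in XML 1.0 per the Char production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isLegalXml(int c) {
        return c == 0x9 || c == 0xA || c == 0xD
            || (c >= 0x20 && c <= 0xD7FF)
            || (c >= 0xE000 && c <= 0xFFFD)
            || (c >= 0x10000 && c <= 0x10FFFF);
    }

    // Single pass over the string: copy legal code points, drop illegal ones.
    static String stripIllegal(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int c = s.codePointAt(i);
            if (isLegalXml(c)) {
                sb.appendCodePoint(c);
            }
            i += Character.charCount(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String in = "page\u000Cbreak"; // contains the form-feed (decimal 12)
        System.out.println(stripIllegal(in)); // prints pagebreak
    }
}
```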
[jira] Closed: (NUTCH-236) PdfParser and RSSParser Log4j appender redirection
[ http://issues.apache.org/jira/browse/NUTCH-236?page=all ] Jerome Charron closed NUTCH-236:

Fix Version: 0.8-dev
Resolution: Fixed

As a side effect, this issue is solved by NUTCH-303, since nutch now uses Jakarta Commons Logging with the log4j default implementation.

PdfParser and RSSParser Log4j appender redirection
Key: NUTCH-236
URL: http://issues.apache.org/jira/browse/NUTCH-236
Project: Nutch
Type: Bug
Versions: 0.8-dev
Environment: Linux, Nutch embedded in another application
Reporter: Jason Calabrese
Assignee: Chris A. Mattmann
Priority: Minor
Fix For: 0.8-dev
Attachments: NUTCH-236.Mattmann.060806.patch.txt

I just found a bug in the way the log messages from the Hadoop LogFormatter are added as a new appender to the Log4j rootLogger in the PdfParser and RSSParser. Since a new Log4j appender is created and added to the root logger each time these classes are loaded, log messages start getting repeated. I'm using Nutch/Hadoop inside another application, so others may not be seeing this problem. I think the simple fix is as easy as setting a name for the new appender before adding it, and then at the beginning of the constructor checking to see if it's already been added. Also, as the comment says in both the PdfParser and RSSParser, this code should be moved to a common place. I'd be happy to make these changes and submit a patch, but I wanted to know if the change would be welcome first. Also, does anyone know a good place for the new util method? Maybe a new static method on LogFormatter, but then the log4j jar would need to be added to the common lib and the classpath. It would also be good to create a property in nutch-site.xml that could disable this logging appender redirection. Like I said above, I'd be more than happy to do this work, I'll just need some guidance to follow the project's conventions.
[jira] Commented: (NUTCH-258) Once Nutch logs a SEVERE log item, Nutch fails forevermore
[ http://issues.apache.org/jira/browse/NUTCH-258?page=comments#action_12415984 ] Jerome Charron commented on NUTCH-258:

Thanks for this patch Chris - even if it is now outdated by NUTCH-303 :-( Since Nutch no longer uses the deprecated Hadoop LogFormatter, there is no longer a logSevere check in the code. So we quickly need a patch for this issue in order to keep the same behavior. In your patch Chris, you set a severe flag each time a severe log is written. But I'm not sure all these severe logs should be marked as severe (the fatal level is used now). For instance, is it really fatal for the fetcher that the conf file for RegexUrlNormalizer is wrong? Is it really fatal for the fetcher if the language identifier raises an exception while loading ngram profiles? Is it really fatal for the fetcher if the ontology plugin fails on reading an ontology? But it is surely fatal if the user-agent is not correctly set in the http plugins! So, what I suggest is to review all the fatal logs and check if they are really fatal for the whole process. And finally, why not simply throw a RuntimeException that will be caught by the Fetcher if something really wrong occurs?

Once Nutch logs a SEVERE log item, Nutch fails forevermore
Key: NUTCH-258
URL: http://issues.apache.org/jira/browse/NUTCH-258
Project: Nutch
Type: Bug
Components: fetcher
Versions: 0.8-dev
Environment: All
Reporter: Scott Ganyo
Assignee: Chris A. Mattmann
Priority: Critical
Attachments: NUTCH-258.Mattmann.060906.patch.txt, dumbfix.patch

Once a SEVERE log item is written, Nutch shuts down any fetching forevermore. This is from the run() method in Fetcher.java:

public void run() {
  synchronized (Fetcher.this) {activeThreads++;} // count threads
  try {
    UTF8 key = new UTF8();
    CrawlDatum datum = new CrawlDatum();
    while (true) {
      if (LogFormatter.hasLoggedSevere()) // something bad happened
        break; // exit

Notice the last 2 lines. This will prevent Nutch from ever fetching again once this is hit, as LogFormatter stores this data as a static. (Also note that LogFormatter.hasLoggedSevere() is also checked in org.apache.nutch.net.URLFilterChecker and will disable this class as well.) This must be fixed or Nutch cannot be run as any kind of long-running service. Furthermore, I believe it is a poor decision to rely on a logging event to determine the state of the application - this could have any number of side-effects that would be extremely difficult to track down. (As it has already for me.)
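The RuntimeException alternative Jerome floats in his comment could be sketched like this. All names (FatalFetchError, checkUserAgent, runFetchLoop) are illustrative, not actual Fetcher code; the point is that a fatal condition aborts only the current run, with no static flag poisoning later runs:

```java
public class FetcherSketch {
    // An exception type for conditions that really are fatal for a fetch run.
    static class FatalFetchError extends RuntimeException {
        FatalFetchError(String msg) { super(msg); }
    }

    // Example of a genuinely fatal check: a misconfigured user-agent.
    static void checkUserAgent(String agent) {
        if (agent == null || agent.isEmpty()) {
            throw new FatalFetchError("http.agent is not set");
        }
    }

    // Returns true if the fetch run could proceed.
    static boolean runFetchLoop(String agent) {
        try {
            checkUserAgent(agent);
            return true;           // fetching proceeds
        } catch (FatalFetchError e) {
            return false;          // this run stops, but no static state
        }                          // disables future runs
    }

    public static void main(String[] args) {
        System.out.println(runFetchLoop(null));       // prints false: fatal, run aborts
        System.out.println(runFetchLoop("NutchCVS")); // prints true: a later run still works
    }
}
```

Unlike the hasLoggedSevere() static, the exception is scoped to one run, which is what a long-running service needs.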
[jira] Resolved: (NUTCH-303) logging improvements
[ http://issues.apache.org/jira/browse/NUTCH-303?page=all ] Jerome Charron resolved NUTCH-303:

Resolution: Fixed

Nutch now uses the Commons Logging API with log4j as the default implementation. There are 3 log4j.properties configuration files:
1. conf/log4j.properties, used by the back-end. It uses the Daily Rolling File Appender by default. By default, the log file is located at $NUTCH_HOME/logs/nutch.log; another location can be specified with the env. variables $NUTCH_LOG_DIR and $NUTCH_LOGFILE.
2. src/web/log4j.properties, used by the front-end container. It uses the Console Appender by default.
3. src/test/log4j.properties, used by unit tests. It uses the Console Appender by default.
I have tested this patch on the front-end, the back-end and the unit test environments. But please note that I have only one box available, so I have only tested it in a mono-deployment environment.

logging improvements
Key: NUTCH-303
URL: http://issues.apache.org/jira/browse/NUTCH-303
Project: Nutch
Type: Improvement
Versions: 0.8-dev
Reporter: Jerome Charron
Assignee: Jerome Charron
Priority: Minor
Fix For: 0.8-dev

Switch to the Apache Commons Logging facade. See HADOOP-211 and the following thread: http://www.mail-archive.com/nutch-developers%40lists.sourceforge.net/msg08706.html
[jira] Commented: (NUTCH-301) CommonGrams loads analysis.common.terms.file for each query
[ http://issues.apache.org/jira/browse/NUTCH-301?page=comments#action_12415098 ] Jerome Charron commented on NUTCH-301:

We can store the CommonGrams instance in the Configuration, as is already done in many places in the Nutch code.

CommonGrams loads analysis.common.terms.file for each query
Key: NUTCH-301
URL: http://issues.apache.org/jira/browse/NUTCH-301
Project: Nutch
Type: Improvement
Components: searcher
Versions: 0.8-dev
Reporter: Chris Schneider

The move away from static objects toward instance variables has resulted in the CommonGrams constructor parsing its analysis.common.terms.file for each query. I'm not certain how large a performance impact this really is, but it seems like something you'd want to avoid doing for each query. Perhaps the solution is to keep around an instance of the CommonGrams object itself?
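The caching Jerome suggests amounts to building the CommonGrams instance once per configuration and reusing it across queries. The sketch below illustrates that shape with a plain keyed cache; the CommonGrams stand-in, cache key, and helper are assumptions for illustration, not the actual Nutch Configuration API:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class CommonGramsCache {
    // Stand-in for the real CommonGrams class; its constructor is the
    // expensive step (it would parse analysis.common.terms.file).
    static class CommonGrams {
        CommonGrams() {
            // expensive parsing would happen here, once per cache key
        }
    }

    private static final Map<String, CommonGrams> CACHE = new ConcurrentHashMap<>();

    // One instance per configuration key: the constructor runs at most
    // once per key, not once per query.
    static CommonGrams get(String confKey) {
        return CACHE.computeIfAbsent(confKey, k -> new CommonGrams());
    }

    public static void main(String[] args) {
        CommonGrams first = get("default");
        CommonGrams second = get("default");
        System.out.println(first == second); // prints true: same cached instance
    }
}
```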
[jira] Resolved: (NUTCH-275) Fetcher not parsing XHTML-pages at all
[ http://issues.apache.org/jira/browse/NUTCH-275?page=all ] Jerome Charron resolved NUTCH-275:

Fix Version: 0.8-dev
Resolution: Fixed

Magic guessing removed for the xml content-type.

Fetcher not parsing XHTML-pages at all
Key: NUTCH-275
URL: http://issues.apache.org/jira/browse/NUTCH-275
Project: Nutch
Type: Bug
Versions: 0.8-dev
Environment: problem with nightly-2006-05-20; worked fine with same website on 0.7.2
Reporter: Stefan Neufeind
Fix For: 0.8-dev

The server reports the page as text/html, so I thought it would be processed as html. But something, I guess, evaluated the headers of the document and re-labeled it as text/xml (why not text/xhtml?). For some reason there is no plugin to be found for indexing text/xml (why does TextParser not feel responsible?). Links inside this document are NOT indexed at all, so digging this website actually stops here. Funny thing: for some magical reason the dtd-files referenced in the header seem to be valid links for the fetcher and as such are indexed in the next round (if the urlfilter allows).

060521 025018 fetching http://www.secreturl.something/
060521 025018 http.proxy.host = null
060521 025018 http.proxy.port = 8080
060521 025018 http.timeout = 1
060521 025018 http.content.limit = 65536
060521 025018 http.agent = NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
060521 025018 fetcher.server.delay = 1000
060521 025018 http.max.delays = 1000
060521 025018 ParserFactory: Plugin: org.apache.nutch.parse.text.TextParser mapped to contentType text/xml via parse-plugins.xml, but its plugin.xml file does not claim to support contentType: text/xml
060521 025018 ParserFactory: Plugin: org.apache.nutch.parse.rss.RSSParser mapped to contentType text/xml via parse-plugins.xml, but not enabled via plugin.includes in nutch-default.xml
060521 025019 Using Signature impl: org.apache.nutch.crawl.MD5Signature
060521 025019 map 0% reduce 0%
060521 025019 1 pages, 0 errors, 1.0 pages/s, 40 kb/s,
060521 025019 1 pages, 0 errors, 1.0 pages/s, 40 kb/s,
[jira] Created: (NUTCH-303) logging improvements
logging improvements
Key: NUTCH-303
URL: http://issues.apache.org/jira/browse/NUTCH-303
Project: Nutch
Type: Improvement
Versions: 0.8-dev
Reporter: Jerome Charron
Assigned to: Jerome Charron
Priority: Minor
Fix For: 0.8-dev

Switch to the Apache Commons Logging facade. See HADOOP-211 and the following thread: http://www.mail-archive.com/nutch-developers%40lists.sourceforge.net/msg08706.html
[jira] Resolved: (NUTCH-301) CommonGrams loads analysis.common.terms.file for each query
[ http://issues.apache.org/jira/browse/NUTCH-301?page=all ] Jerome Charron resolved NUTCH-301:

Fix Version: 0.8-dev
Resolution: Fixed

Patch applied with some minor modifications. Thanks Stefan.

CommonGrams loads analysis.common.terms.file for each query
Key: NUTCH-301
URL: http://issues.apache.org/jira/browse/NUTCH-301
Project: Nutch
Type: Improvement
Components: searcher
Versions: 0.8-dev
Reporter: Chris Schneider
Fix For: 0.8-dev
Attachments: CommonGramsCacheV1.patch

The move away from static objects toward instance variables has resulted in the CommonGrams constructor parsing its analysis.common.terms.file for each query. I'm not certain how large a performance impact this really is, but it seems like something you'd want to avoid doing for each query. Perhaps the solution is to keep around an instance of the CommonGrams object itself?
[jira] Resolved: (NUTCH-298) if a 404 for a robots.txt is returned a NPE is thrown
[ http://issues.apache.org/jira/browse/NUTCH-298?page=all ] Jerome Charron resolved NUTCH-298:

Resolution: Fixed

Committed + some unit tests to reproduce. Thanks Stefan. As you mentioned in a previous mail, I agree that the RobotRulesParser should be rewritten.

if a 404 for a robots.txt is returned a NPE is thrown
Key: NUTCH-298
URL: http://issues.apache.org/jira/browse/NUTCH-298
Project: Nutch
Type: Bug
Reporter: Stefan Groschupf
Fix For: 0.8-dev
Attachments: fixNpeRobotRuleSet.patch

What happens: if no RobotRuleSet is in the cache for a host, we try to fetch the robots.txt. In case the http response code is not 200 or 403 but, for example, 404, we do robotRules = EMPTY_RULES; (line: 402). EMPTY_RULES is a RobotRuleSet created with the default constructor; tmpEntries and entries are null and will never change. If we now try to fetch a page from that host, EMPTY_RULES is used and we call isAllowed on the RobotRuleSet. In this case a NPE is thrown on this line: if (entries == null) { entries = new RobotsEntry[tmpEntries.size()]; Possible solution: we can initialize tmpEntries by default and also remove the other null checks and initializations.
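The fix Stefan proposes, eager initialization so an empty rule set never dereferences null, can be sketched as below. This is a simplified stand-in, not the actual RobotRulesParser internals; class and method names are illustrative:

```java
import java.util.ArrayList;
import java.util.List;

public class RobotRuleSet {
    static class RobotsEntry {
        final String prefix;
        final boolean allowed;
        RobotsEntry(String prefix, boolean allowed) {
            this.prefix = prefix;
            this.allowed = allowed;
        }
    }

    // Eager initialization: an empty rule set holds an empty list, never null,
    // so isAllowed() cannot throw a NullPointerException.
    private final List<RobotsEntry> entries = new ArrayList<>();

    // Shared empty rule set, used e.g. when robots.txt returns a 404.
    static final RobotRuleSet EMPTY_RULES = new RobotRuleSet();

    void addEntry(String prefix, boolean allowed) {
        entries.add(new RobotsEntry(prefix, allowed));
    }

    boolean isAllowed(String path) {
        for (RobotsEntry e : entries) {
            if (path.startsWith(e.prefix)) return e.allowed;
        }
        return true; // no matching rule: allowed by default
    }

    public static void main(String[] args) {
        // Previously this path hit a NullPointerException; with eager
        // initialization it simply allows the fetch.
        System.out.println(EMPTY_RULES.isAllowed("/index.html")); // prints true
    }
}
```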
[jira] Commented: (NUTCH-275) Fetcher not parsing XHTML-pages at all
[ http://issues.apache.org/jira/browse/NUTCH-275?page=comments#action_12412835 ] Jerome Charron commented on NUTCH-275:

This problem has already been reported by Doug: http://mail-archive.com/nutch-dev%40lucene.apache.org/msg03474.html It is related to magic-based content-type guessing. Nothing has been decided about this for now, but I should work on it. Workarounds:
* deactivate the mime-type magic resolution (mime.type.magic = false)
* or remove the magic offset=0 ... line in mime-types.xml
Thanks for opening a jira issue about this.

Fetcher not parsing XHTML-pages at all
Key: NUTCH-275
URL: http://issues.apache.org/jira/browse/NUTCH-275
Project: Nutch
Type: Bug
Versions: 0.8-dev
Environment: problem with nightly-2006-05-20; worked fine with same website on 0.7.2
Reporter: Stefan Neufeind

The server reports the page as text/html, so I thought it would be processed as html. But something, I guess, evaluated the headers of the document and re-labeled it as text/xml (why not text/xhtml?). For some reason there is no plugin to be found for indexing text/xml (why does TextParser not feel responsible?). Links inside this document are NOT indexed at all, so digging this website actually stops here. Funny thing: for some magical reason the dtd-files referenced in the header seem to be valid links for the fetcher and as such are indexed in the next round (if the urlfilter allows).

060521 025018 fetching http://www.secreturl.something/
060521 025018 http.proxy.host = null
060521 025018 http.proxy.port = 8080
060521 025018 http.timeout = 1
060521 025018 http.content.limit = 65536
060521 025018 http.agent = NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
060521 025018 fetcher.server.delay = 1000
060521 025018 http.max.delays = 1000
060521 025018 ParserFactory: Plugin: org.apache.nutch.parse.text.TextParser mapped to contentType text/xml via parse-plugins.xml, but its plugin.xml file does not claim to support contentType: text/xml
060521 025018 ParserFactory: Plugin: org.apache.nutch.parse.rss.RSSParser mapped to contentType text/xml via parse-plugins.xml, but not enabled via plugin.includes in nutch-default.xml
060521 025019 Using Signature impl: org.apache.nutch.crawl.MD5Signature
060521 025019 map 0% reduce 0%
060521 025019 1 pages, 0 errors, 1.0 pages/s, 40 kb/s,
060521 025019 1 pages, 0 errors, 1.0 pages/s, 40 kb/s,
[jira] Resolved: (NUTCH-134) Summarizer doesn't select the best snippets
[ http://issues.apache.org/jira/browse/NUTCH-134?page=all ] Jerome Charron resolved NUTCH-134:

Fix Version: 0.8-dev
Resolution: Fixed
Assign To: Jerome Charron

Solution proposed by Andrzej implemented.

Summarizer doesn't select the best snippets
Key: NUTCH-134
URL: http://issues.apache.org/jira/browse/NUTCH-134
Project: Nutch
Type: Bug
Components: searcher
Versions: 0.7, 0.8-dev, 0.7.1, 0.7.2
Reporter: Andrzej Bialecki
Assignee: Jerome Charron
Fix For: 0.8-dev
Attachments: summarizer.060506.patch

Summarizer.java tries to select the best fragments from the input text, where the frequency of query terms is the highest. However, the logic in line 223 is flawed in that the excerptSet.add() operation will add new excerpts only if they are not already present - the test is performed using the Comparator that compares only the numUniqueTokens. This means that if there are two or more excerpts which score equally high, only the first of them will be retained, and the rest of the equally-scoring excerpts will be discarded in favor of other excerpts (possibly lower-scoring). To fix this, the Set should be replaced with a List + a sort operation. To keep the relative position of excerpts in the original order, the Excerpt class should be extended with an int order field, and the collected excerpts should be sorted in that order prior to adding them to the summary.
[jira] Updated: (NUTCH-134) Summarizer doesn't select the best snippets
[ http://issues.apache.org/jira/browse/NUTCH-134?page=all ] Jerome Charron updated NUTCH-134: - Attachment: summarizer.060506.patch Here is a patch that add a summarizer extension point and two summarizer plugins : summarizer-basic (the current nutch implementation) and summarizer-lucene (the lucene highlighter implementation). Please notice that the lucene plugin is a very crude implementation : the highlighter directly constructs a text representation of the summary, so we need to parse the text to build a Summary object!!! (improvements are welcome). This is a first step to this issue resolution. If no objection, I will commit this patch in the next few days and then: 1. Fix in the summarizer-basic the original issue reported by Andrzej 2. Add a toString(Encoder, Formatter) method in Summarizer so that a Summary object could be encoded and formatted with many implementations (it is the same logic as the one in Lucene Highlight) - Andrzej, do you prefer this solution or a solution where Summary is Writable? PS: Chris, sorry but the major part of this patch was already done when you added your comment. Summarizer doesn't select the best snippets --- Key: NUTCH-134 URL: http://issues.apache.org/jira/browse/NUTCH-134 Project: Nutch Type: Bug Components: searcher Versions: 0.7.2, 0.7.1, 0.7, 0.8-dev Reporter: Andrzej Bialecki Attachments: summarizer.060506.patch Summarizer.java tries to select the best fragments from the input text, where the frequency of query terms is the highest. However, the logic in line 223 is flawed in that the excerptSet.add() operation will add new excerpts only if they are not already present - the test is performed using the Comparator that compares only the numUniqueTokens. This means that if there are two or more excerpts, which score equally high, only the first of them will be retained, and the rest of equally-scoring excerpts will be discarded, in favor of other excerpts (possibly lower scoring). 
To fix this, the Set should be replaced with a List plus a sort operation. To keep the relative position of excerpts in the original order, the Excerpt class should be extended with an int order field, and the collected excerpts should be sorted in that order prior to adding them to the summary. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
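The proposed fix can be sketched as follows; Excerpt here is an illustrative stand-in for the class in Summarizer.java, not the actual Nutch code. A stable sort on a List keeps equally-scoring excerpts that the SortedSet-based code drops, and a second sort by an order field restores document order.

```java
import java.util.*;

public class ExcerptSelection {
    // Illustrative stand-in for Nutch's Summarizer excerpt
    static class Excerpt {
        final String text;
        final int numUniqueTokens; // score: distinct query terms in the excerpt
        final int order;           // position of the excerpt in the original text
        Excerpt(String text, int numUniqueTokens, int order) {
            this.text = text; this.numUniqueTokens = numUniqueTokens; this.order = order;
        }
    }

    /** Keep the maxExcerpts best-scoring excerpts (ties included),
     *  then restore their original document order. */
    static List<Excerpt> select(List<Excerpt> all, int maxExcerpts) {
        List<Excerpt> sorted = new ArrayList<>(all);
        // A List keeps equal-scoring excerpts; a SortedSet whose comparator
        // looks only at numUniqueTokens silently drops them (the reported bug).
        sorted.sort((a, b) -> b.numUniqueTokens - a.numUniqueTokens);
        List<Excerpt> best = sorted.subList(0, Math.min(maxExcerpts, sorted.size()));
        List<Excerpt> result = new ArrayList<>(best);
        result.sort(Comparator.comparingInt(e -> e.order));
        return result;
    }
}
```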
[jira] Commented: (NUTCH-263) MapWritable.equals() doesn't work properly
[ http://issues.apache.org/jira/browse/NUTCH-263?page=comments#action_12377749 ] Jerome Charron commented on NUTCH-263: -- Andrzej, a small but efficient improvement could be to check the maps' sizes prior to any other tests: if (obj instanceof MapWritable) { MapWritable map = (MapWritable) obj; if (map.fSize == fSize) { ... } } return false; No? MapWritable.equals() doesn't work properly -- Key: NUTCH-263 URL: http://issues.apache.org/jira/browse/NUTCH-263 Project: Nutch Type: Bug Versions: 0.8-dev Reporter: Andrzej Bialecki Attachments: patch1.txt MapWritable.equals() is sensitive to the order in which map entries have been created. E.g. this fails but it should succeed: MapWritable map1 = new MapWritable(); MapWritable map2 = new MapWritable(); map1.put(new UTF8(key1), new UTF8(val1)); map1.put(new UTF8(key2), new UTF8(val2)); map2.put(new UTF8(key2), new UTF8(val2)); map2.put(new UTF8(key1), new UTF8(val1)); assertTrue(map1.equals(map2)); Users expect that this should not be the case, i.e. this class should follow the same rules as Map.equals() (Returns true if the given object is also a map and the two Maps represent the same mappings).
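A minimal sketch of an equals() that follows the java.util.Map.equals() contract and applies the size check suggested above (MyMapWritable and its String entries are illustrative; the real MapWritable stores Writable keys and values):

```java
import java.util.*;

/** Illustrative order-insensitive equals(), following the
 *  java.util.Map.equals() contract; not Nutch's actual MapWritable. */
public class MyMapWritable {
    private final Map<String, String> entries = new HashMap<>();

    public void put(String key, String value) { entries.put(key, value); }

    @Override
    public boolean equals(Object obj) {
        if (obj == this) return true;
        if (obj instanceof MyMapWritable) {
            MyMapWritable map = (MyMapWritable) obj;
            // cheap early exit: compare sizes before comparing mappings
            if (map.entries.size() != entries.size()) return false;
            // then compare mappings irrespective of insertion order
            return map.entries.equals(entries);
        }
        return false;
    }

    @Override
    public int hashCode() { return entries.hashCode(); }
}
```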
[jira] Updated: (NUTCH-261) Multi Language Support
[ http://issues.apache.org/jira/browse/NUTCH-261?page=all ] Jerome Charron updated NUTCH-261: - Attachment: query-lang.patch Here is a patch that provides a language-dependent analysis of the queries. If you have activated some language analysis plugins (such as analysis-fr or analysis-de) during indexing, and these plugins are also activated during the search phase, the analyzer corresponding to the browser's language will be applied: for instance, if you search for the French term moteurs, it will return documents containing moteur or moteurs. Please notice that if no analyzer plugin is activated, Nutch's behavior remains unchanged (backward compatible). There are some well-known issues with the summaries (I plan to solve these very soon). Thanks for reviewing this patch and for your feedback. Regards Jérôme Multi Language Support -- Key: NUTCH-261 URL: http://issues.apache.org/jira/browse/NUTCH-261 Project: Nutch Type: New Feature Components: indexer, searcher Versions: 0.7, 0.8-dev, 0.6, 0.7.1, 0.7.2 Reporter: Jerome Charron Assignee: Jerome Charron Fix For: 0.8-dev Attachments: query-lang.patch Add multi-lingual support in Nutch, as described in http://wiki.apache.org/nutch/MultiLingualSupport The document analysis part is already implemented, and two analysis plugins (fr and de) are provided for testing (not deployed by default). The query analysis part is still missing for complete multi-lingual support.
[jira] Created: (NUTCH-262) Summary excerpts and highlights problems
Summary excerpts and highlights problems Key: NUTCH-262 URL: http://issues.apache.org/jira/browse/NUTCH-262 Project: Nutch Type: Sub-task Components: searcher Versions: 0.8-dev Reporter: Jerome Charron Assigned to: Jerome Charron Fix For: 0.8-dev There are some problems selecting and highlighting snippets for summaries when multi-lingual support is used.
[jira] Commented: (NUTCH-245) DTD for plugin.xml configuration files
[ http://issues.apache.org/jira/browse/NUTCH-245?page=comments#action_12374339 ] Jerome Charron commented on NUTCH-245: -- I would prefer to change the ugly parts of the DTD now (before a future 1.0) and suggest changing it to something like the following (and changing the plugin.xml and Plugin Manifest Reader too): <!ELEMENT implementation (parameter*)> <!ELEMENT parameter EMPTY> <!ATTLIST parameter name CDATA #REQUIRED value CDATA #REQUIRED> DTD for plugin.xml configuration files -- Key: NUTCH-245 URL: http://issues.apache.org/jira/browse/NUTCH-245 Project: Nutch Type: New Feature Components: fetcher, indexer, ndfs, searcher, web gui Versions: 0.7.2, 0.7.1, 0.7, 0.6, 0.8-dev Environment: Power PC Dual Processor 2.0 Ghz, Mac OS X 10.4, although improvement is independent of environment Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Priority: Minor Attachments: NUTCH-245.Mattmann.patch.txt Currently, the plugin.xml file does not have a DTD or XML Schema associated with it, and most people just go look at an existing plugin's plugin.xml file to determine the allowable elements, etc. There should be an explicit plugin DTD file that describes the plugin.xml file. I'll look at the code and attach a plugin.dtd file for the Nutch conf directory later today. This way, people can use the DTD file to automatically (using tools such as XMLSpy) generate plugin.xml files that can then be validated.
[jira] Closed: (NUTCH-244) Inconsistent handling of property values boundaries / unable to set db.max.outlinks.per.page to infinite
[ http://issues.apache.org/jira/browse/NUTCH-244?page=all ] Jerome Charron closed NUTCH-244: Fix Version: 0.8-dev Resolution: Fixed Assign To: Jerome Charron Fixed: http://svn.apache.org/viewcvs.cgi?rev=391958&view=rev Inconsistent handling of property values boundaries / unable to set db.max.outlinks.per.page to infinite Key: NUTCH-244 URL: http://issues.apache.org/jira/browse/NUTCH-244 Project: Nutch Type: Bug Versions: 0.8-dev Reporter: AJ Banck Assignee: Jerome Charron Fix For: 0.8-dev Some properties like file.content.limit support using negative numbers (-1) to 'disable' a limitation. Other properties do not support this. I tried disabling the limit set by db.max.outlinks.per.page, but this isn't possible.
[jira] Commented: (NUTCH-244) Inconsistent handling of property values boundaries / unable to set db.max.outlinks.per.page to infinite
[ http://issues.apache.org/jira/browse/NUTCH-244?page=comments#action_12373393 ] Jerome Charron commented on NUTCH-244: -- While taking a quick look at this, something surprised me in the code. The db.max.outlinks.per.page property is exclusively used in ParseData. In ParseData, the number of outlinks used is filtered in the readFields method... Shouldn't it be filtered directly in the ParseData constructor? Inconsistent handling of property values boundaries / unable to set db.max.outlinks.per.page to infinite Key: NUTCH-244 URL: http://issues.apache.org/jira/browse/NUTCH-244 Project: Nutch Type: Bug Versions: 0.8-dev Reporter: AJ Banck Some properties like file.content.limit support using negative numbers (-1) to 'disable' a limitation. Other properties do not support this. I tried disabling the limit set by db.max.outlinks.per.page, but this isn't possible.
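For illustration, filtering once in the constructor could look like this sketch (the class and field names are hypothetical, not the actual ParseData code), with a negative maximum meaning "unlimited", the convention this issue asks for:

```java
import java.util.Arrays;

/** Sketch of filtering outlinks once, in the constructor, instead of on
 *  every readFields() call. A negative max (e.g. -1) means "no limit",
 *  matching the convention of file.content.limit. Illustrative only. */
public class ParseDataSketch {
    private final String[] outlinks;

    ParseDataSketch(String[] outlinks, int maxOutlinksPerPage) {
        // negative limit disables truncation entirely
        if (maxOutlinksPerPage >= 0 && outlinks.length > maxOutlinksPerPage) {
            outlinks = Arrays.copyOf(outlinks, maxOutlinksPerPage);
        }
        this.outlinks = outlinks;
    }

    String[] getOutlinks() { return outlinks; }
}
```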
[jira] Commented: (NUTCH-244) Inconsistent handling of property values boundaries / unable to set db.max.outlinks.per.page to infinite
[ http://issues.apache.org/jira/browse/NUTCH-244?page=comments#action_12373398 ] Jerome Charron commented on NUTCH-244: -- That makes perfect sense! Thanks Andrzej. Inconsistent handling of property values boundaries / unable to set db.max.outlinks.per.page to infinite Key: NUTCH-244 URL: http://issues.apache.org/jira/browse/NUTCH-244 Project: Nutch Type: Bug Versions: 0.8-dev Reporter: AJ Banck Some properties like file.content.limit support using negative numbers (-1) to 'disable' a limitation. Other properties do not support this. I tried disabling the limit set by db.max.outlinks.per.page, but this isn't possible.
[jira] Commented: (NUTCH-240) Scoring API: extension point, scoring filters and an OPIC plugin
[ http://issues.apache.org/jira/browse/NUTCH-240?page=comments#action_12372948 ] Jerome Charron commented on NUTCH-240: -- +1 Scoring API: extension point, scoring filters and an OPIC plugin Key: NUTCH-240 URL: http://issues.apache.org/jira/browse/NUTCH-240 Project: Nutch Type: Improvement Versions: 0.8-dev Reporter: Andrzej Bialecki Attachments: Generator.patch.txt, patch.txt This patch refactors all places where Nutch manipulates page scores, into a plugin-based API. Using this API it's possible to implement different scoring algorithms. It is also much easier to understand how scoring works. Multiple scoring plugins can be run in sequence, in a manner similar to URLFilters. Included is also an OPICScoringFilter plugin, which contains the current implementation of the scoring algorithm. Together with the scoring API it provides a fully backward-compatible scoring.
[jira] Closed: (NUTCH-196) lib-xml and lib-log4j plugins
[ http://issues.apache.org/jira/browse/NUTCH-196?page=all ] Jerome Charron closed NUTCH-196: Fix Version: 0.8-dev Resolution: Fixed Added a lib-xml that gathers many XML libraries previously used in parse-rss. (http://svn.apache.org/viewcvs?rev=389716&view=rev) lib-xml and lib-log4j plugins - Key: NUTCH-196 URL: http://issues.apache.org/jira/browse/NUTCH-196 Project: Nutch Type: Improvement Versions: 0.8-dev Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki Fix For: 0.8-dev Attachments: NUTCH-196.lib-log4j.patch Many places in Nutch use XML. Parsing XML using the JDK API is painful. I propose to add one (or more) library plugins with JDOM, DOM4J, Jaxen, etc. This should simplify the current deployment, and help plugin writers to use the existing API. Similarly, many plugins use log4j. Either we add it to the /lib, or we could create a lib-log4j plugin.
[jira] Updated: (NUTCH-210) Context.xml file for Nutch web application
[ http://issues.apache.org/jira/browse/NUTCH-210?page=all ] Jerome Charron updated NUTCH-210: - Attachment: NUTCH-210.060325.patch Hi Chris, I made some minor changes to your patch (see my attached patch NUTCH-210.060325.patch): * Refactored the XSL code and added query.* properties to the nutch.xml * Removed the JspUtil class and moved the code to a NutchConfiguration.get(ServletContext) method. I used this patch; it is very useful, I like it. If there are no objections, I will commit it in the next few days. Thanks Chris Jérôme Context.xml file for Nutch web application -- Key: NUTCH-210 URL: http://issues.apache.org/jira/browse/NUTCH-210 Project: Nutch Type: Improvement Components: web gui Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev Environment: iMAC G5 2.3 Ghz, Mac OS X Tiger (10.4.3), 1.5 GB RAM, although improvement is independent of environment Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Priority: Minor Fix For: 0.7.2-dev, 0.8-dev, 0.7.1 Attachments: NUTCH-210.060325.patch, NUTCH-210.Mattmann.patch.txt Currently the Nutch web GUI references a few parameters that are highly dynamic, e.g., searcher.dir. These dynamic properties are read from the configuration files, such as nutch-default.xml. One problem I'm noticing, however, is that in order to change the parameter in the built webapp (the WAR file), I am required to change the parameter first in the checked-out Nutch source tree, rebuild the webapp, then redeploy. Or, if I'm feeling really gutsy, I can go poke around in the unpackaged WAR file, if the servlet container exposes it to me, and try to modify the nutch-default.xml file that way. However, I think that it would be really nice (and highly useful for that matter) to factor out some of the more dynamic parameters of the web application into a separate deliverable Context.xml file that would accompany the webapp. 
The Context.xml file would be deployed in the webapps directory, as opposed to the WAR file itself, and the parameters could be updated there and changed as many times as necessary without rebuilding the WAR file. Of course this will involve making minor modifications in the web GUI as to where some of the dynamic parameters are read from (i.e., make it read them from the Context.xml file, most likely using application.getInitParameter). Right now the only one I can think of is searcher.dir, but I'm sure that there are others (in particular the searcher.dir one is the most annoying for me). The timeframe on this patch will be within the next month. Thanks, Chris
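For illustration, a Tomcat-style context fragment along these lines could carry the dynamic parameter outside the WAR (the file name, path, and value below are made-up examples; searcher.dir is the parameter named in the issue):

```xml
<!-- e.g. a per-application context file deployed next to, not inside, the WAR -->
<Context path="/nutch" docBase="nutch.war">
  <!-- editable as often as needed without rebuilding the WAR -->
  <Parameter name="searcher.dir" value="/data/nutch/crawl" override="false"/>
</Context>
```

The JSPs could then read the value with application.getInitParameter("searcher.dir") instead of the nutch-default.xml packaged inside the WAR.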
[jira] Commented: (NUTCH-233) wrong regular expression hangs reduce process forever
[ http://issues.apache.org/jira/browse/NUTCH-233?page=comments#action_12370685 ] Jerome Charron commented on NUTCH-233: -- Stefan, I have created a small unit test for urlfilter-regexp and I didn't notice any incompatibility in java.util.regex with this regexp. Could you please provide the URLs that cause the problem so that I can add them to my unit tests. Thanks Jérôme wrong regular expression hangs reduce process forever - Key: NUTCH-233 URL: http://issues.apache.org/jira/browse/NUTCH-233 Project: Nutch Type: Bug Versions: 0.8-dev Reporter: Stefan Groschupf Priority: Blocker Fix For: 0.8-dev It looks like the expression .*(/.+?)/.*?\1/.*?\1/ in regex-urlfilter.txt wasn't compatible with java.util.regex, which is actually used in the regex URL filter. Maybe it was missed when the regular expression package was changed. The problem was that while reducing a fetch map output, the reducer hung forever, since the output format applied the URL filter to a URL that caused the hang. 060315 230823 task_r_3n4zga at java.lang.Character.codePointAt(Character.java:2335) 060315 230823 task_r_3n4zga at java.util.regex.Pattern$Dot.match(Pattern.java:4092) 060315 230823 task_r_3n4zga at java.util.regex.Pattern$Curly.match1(Pattern.java: I changed the regular expression to .*(/[^/]+)/[^/]+\1/[^/]+\1/ and now the fetch job works. (thanks to Grant and Chris B. for helping to find the new regex) However, maybe people can review it and suggest improvements, since the old regex would match: abcd/foo/bar/foo/bar/foo/ and so will the new one. But the old regex would also match: abcd/foo/bar/xyz/foo/bar/foo/, which the new regex will not match.
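The behavioral difference between the two expressions can be checked directly with java.util.regex (a small sketch, not the urlfilter-regexp test code; backslashes are doubled for Java string literals):

```java
import java.util.regex.Pattern;

public class RegexLoopFilter {
    // Old and new expressions from the issue, as Java string literals.
    static final Pattern OLD = Pattern.compile(".*(/.+?)/.*?\\1/.*?\\1/");
    static final Pattern NEW = Pattern.compile(".*(/[^/]+)/[^/]+\\1/[^/]+\\1/");

    // The new pattern forbids "/" inside segments, so the repeated segment
    // must recur at fixed alternating positions; the old one lets the
    // backreference span several segments (and backtracks pathologically).
    static boolean matches(Pattern p, String url) {
        return p.matcher(url).find();
    }
}
```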
[jira] Closed: (NUTCH-228) Clustering plugin descriptor broken (fix included)
[ http://issues.apache.org/jira/browse/NUTCH-228?page=all ] Jerome Charron closed NUTCH-228: Fix Version: 0.8-dev Resolution: Fixed Committed: * http://svn.apache.org/viewcvs.cgi?rev=385267&view=rev * http://svn.apache.org/viewcvs.cgi?rev=385268&view=rev Thanks Dawid. Clustering plugin descriptor broken (fix included) -- Key: NUTCH-228 URL: http://issues.apache.org/jira/browse/NUTCH-228 Project: Nutch Type: Bug Reporter: Dawid Weiss Priority: Minor Fix For: 0.8-dev Attachments: clustering.patch The plugin descriptor for clustering-carrot2 is currently broken (points to a missing JAR). I'm adding a patch fixing this to this issue in a minute.
[jira] Resolved: (NUTCH-217) InstantiationException when deserializing Query (no parameterless constructor)
[ http://issues.apache.org/jira/browse/NUTCH-217?page=all ] Jerome Charron resolved NUTCH-217: -- Resolution: Fixed Fixed: http://svn.apache.org/viewcvs.cgi?view=rev&rev=384011 Thanks Dawid. InstantiationException when deserializing Query (no parameterless constructor) -- Key: NUTCH-217 URL: http://issues.apache.org/jira/browse/NUTCH-217 Project: Nutch Type: Bug Components: searcher Versions: 0.8-dev Reporter: Dawid Weiss I've been playing with the trunk. The distributed searcher complains with an InstantiationException when deserializing Query. A quick code inspection shows that Query doesn't have any parameterless constructor.
[jira] Closed: (NUTCH-227) Basic Query Filter no more uses Configuration
[ http://issues.apache.org/jira/browse/NUTCH-227?page=all ] Jerome Charron closed NUTCH-227: Resolution: Fixed Oops... sorry guys, and thanks for your prompt remarks. All is in fact OK. Basic Query Filter no more uses Configuration - Key: NUTCH-227 URL: http://issues.apache.org/jira/browse/NUTCH-227 Project: Nutch Type: Bug Components: searcher Versions: 0.8-dev Reporter: Jerome Charron Assignee: Jerome Charron Fix For: 0.8-dev Since NUTCH-169, the BasicIndexingFilter has no way to retrieve its configuration parameters (query.url.boost, query.anchor.boost, query.title.boost, query.host.boost, query.phrase.boost): the setConf(Configuration) method is never called by the QueryFilters class. More generally, we should provide a way for QueryFilter to be Configurable. Two solutions: 1. The QueryFilters checks that a QueryFilter implements Configurable and then calls the setConf() method. 2. QueryFilter extends Configurable => all QueryFilters must implement Configurable. My preference goes to 1, and if there is no objection, I will commit a patch in the next few days.
[jira] Closed: (NUTCH-219) file.content.limit & ftp.content.limit should be changed to -1 to be consistent with http
[ http://issues.apache.org/jira/browse/NUTCH-219?page=all ] Jerome Charron closed NUTCH-219: Fix Version: 0.8-dev Resolution: Fixed Solved: http://svn.apache.org/viewcvs.cgi?rev=382535&view=rev file.content.limit & ftp.content.limit should be changed to -1 to be consistent with http - Key: NUTCH-219 URL: http://issues.apache.org/jira/browse/NUTCH-219 Project: Nutch Type: Bug Components: fetcher Versions: 0.7.1 Reporter: Richard Braman Priority: Minor Fix For: 0.8-dev file and ftp use 0 for no truncation, but http needs -1. This is easily missed when configuring, even by experienced users. Here is the help I got in nutch-user from Jerome, who is a developer: Edit your nutch-site.xml (or nutch-default.xml) and change the http.content.limit (set it to 0 if you don't want any content truncation at all). Jérôme
[jira] Updated: (NUTCH-204) multiple field values in HitDetails
[ http://issues.apache.org/jira/browse/NUTCH-204?page=all ] Jerome Charron updated NUTCH-204: - Attachment: NUTCH-204.jc.060227.patch Stefan, Here is a proposed patch (NUTCH-204.jc.060227.patch). If you agree, I will commit it. Jérôme multiple field values in HitDetails --- Key: NUTCH-204 URL: http://issues.apache.org/jira/browse/NUTCH-204 Project: Nutch Type: Improvement Components: searcher Versions: 0.8-dev Reporter: Stefan Groschupf Fix For: 0.8-dev Attachments: DetailGetValues070206.patch, NUTCH-204.jc.060227.patch Improvement as Howie Wang suggested. http://mail-archives.apache.org/mod_mbox/lucene-nutch-dev/200601.mbox/[EMAIL PROTECTED]
[jira] Closed: (NUTCH-204) multiple field values in HitDetails
[ http://issues.apache.org/jira/browse/NUTCH-204?page=all ] Jerome Charron closed NUTCH-204: Resolution: Fixed Committed: http://svn.apache.org/viewcvs.cgi?rev=381465&view=rev Thanks Stefan for pointing out the performance issue of my patch. Perhaps in a later patch we can add a cache of field/values to avoid iterating over the whole list each time the getValues method is called. multiple field values in HitDetails --- Key: NUTCH-204 URL: http://issues.apache.org/jira/browse/NUTCH-204 Project: Nutch Type: Improvement Components: searcher Versions: 0.8-dev Reporter: Stefan Groschupf Fix For: 0.8-dev Attachments: DetailGetValues070206.patch, NUTCH-204.jc.060227.patch Improvement as Howie Wang suggested. http://mail-archives.apache.org/mod_mbox/lucene-nutch-dev/200601.mbox/[EMAIL PROTECTED]
[jira] Commented: (NUTCH-61) Adaptive re-fetch interval. Detecting unmodified content
[ http://issues.apache.org/jira/browse/NUTCH-61?page=comments#action_12368050 ] Jerome Charron commented on NUTCH-61: - Not an objection, but a simple comment. Why not make FetchSchedule a new ExtensionPoint, and then make DefaultFetchSchedule and AdaptiveFetchSchedule fetch-schedule plugins? Adaptive re-fetch interval. Detecting unmodified content --- Key: NUTCH-61 URL: http://issues.apache.org/jira/browse/NUTCH-61 Project: Nutch Type: New Feature Components: fetcher Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki Attachments: 20050606.diff, 20051230.txt, 20060227.txt Currently Nutch doesn't automatically adjust its re-fetch period, no matter whether individual pages change seldom or frequently. The goal of these changes is to extend the current codebase to support various possible adjustments to re-fetch times and intervals, and specifically a re-fetch schedule which tries to adapt the period between consecutive fetches to the period of content changes. Also, these patches implement checking if the content has changed since the last fetch; protocol plugins are also changed to make use of this information, so that if content is unmodified it doesn't have to be fetched and processed.
[jira] Commented: (NUTCH-204) multiple field values in HitDetails
[ http://issues.apache.org/jira/browse/NUTCH-204?page=comments#action_12367513 ] Jerome Charron commented on NUTCH-204: -- Hi Stefan, There is something I don't understand with this patch. The way Lucene manages multi-valued fields is to have many mono-valued Field objects with the same name. My question is: why not keep this logic? It would avoid patching the HitSearcher and modifying the HitDetails constructor signature. The idea I have in mind is to add a generic name/value(s) container (like Metadata, but without the syntax-tolerant feature; in fact, the actual Metadata will internally use this generic container) that will be used by HitDetails to store multi-valued fields. What do you think about this? I imagine you are very busy with the admin GUI (it is really a big challenge, and a big new feature), so if you are OK with my proposed solution, I will code it. Regards Jérôme multiple field values in HitDetails --- Key: NUTCH-204 URL: http://issues.apache.org/jira/browse/NUTCH-204 Project: Nutch Type: Improvement Components: searcher Versions: 0.8-dev Reporter: Stefan Groschupf Fix For: 0.8-dev Attachments: DetailGetValues070206.patch Improvement as Howie Wang suggested. http://mail-archives.apache.org/mod_mbox/lucene-nutch-dev/200601.mbox/[EMAIL PROTECTED]
[jira] Commented: (NUTCH-204) multiple field values in HitDetails
[ http://issues.apache.org/jira/browse/NUTCH-204?page=comments#action_12367530 ] Jerome Charron commented on NUTCH-204: -- HitDetails is a Writable, and in the case of multiple search servers distributed in a network, it makes sense to minimize the network IO, since getting details should be as fast as possible. Sure Stefan. I will take this into account, of course. Using a map-like structure in HitDetails will reduce the bytes used by not duplicating keys. I will commit something in the next few days. multiple field values in HitDetails --- Key: NUTCH-204 URL: http://issues.apache.org/jira/browse/NUTCH-204 Project: Nutch Type: Improvement Components: searcher Versions: 0.8-dev Reporter: Stefan Groschupf Fix For: 0.8-dev Attachments: DetailGetValues070206.patch Improvement as Howie Wang suggested. http://mail-archives.apache.org/mod_mbox/lucene-nutch-dev/200601.mbox/[EMAIL PROTECTED]
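A sketch of the Lucene-style approach discussed in this thread: keep repeated (field, value) pairs and scan them to collect every value for a field (illustrative names, not the committed HitDetails code):

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative multi-valued detail container: repeated (field, value)
 *  pairs, Lucene-style, with a getValues() scan. Not Nutch's HitDetails. */
public class HitDetailsSketch {
    private final List<String> fields = new ArrayList<>();
    private final List<String> values = new ArrayList<>();

    void add(String field, String value) {
        fields.add(field);
        values.add(value);
    }

    /** All values stored under a field name, in insertion order. */
    String[] getValues(String field) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i < fields.size(); i++) {
            if (fields.get(i).equals(field)) {
                result.add(values.get(i));
            }
        }
        return result.toArray(new String[0]);
    }
}
```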
[jira] Closed: (NUTCH-188) Add searchable mailing list links to http://lucene.apache.org/nutch/mailing_lists.html
[ http://issues.apache.org/jira/browse/NUTCH-188?page=all ] Jerome Charron closed NUTCH-188: Fix Version: 0.8-dev Resolution: Fixed Duplicate of NUTCH-214 Add searchable mailing list links to http://lucene.apache.org/nutch/mailing_lists.html -- Key: NUTCH-188 URL: http://issues.apache.org/jira/browse/NUTCH-188 Project: Nutch Type: Improvement Reporter: Andy Liu Priority: Trivial Fix For: 0.8-dev Attachments: mailing_list.patch Post links to searchable mail archives on nutch.org
[jira] Commented: (NUTCH-215) Plugin execution order
[ http://issues.apache.org/jira/browse/NUTCH-215?page=comments#action_12367180 ] Jerome Charron commented on NUTCH-215: -- The primary meaning of a plugin dependency is to specify that a plugin relies on the code of another plugin, not that it must be executed after the plugins it depends on. I think that in some particular cases (as in yours), this concept of plugin order is important, and it is really an issue. But I don't think this is the right way to solve it. We should think about a more generic/secure solution (a parse plugin must not be allowed to be declared to be called before a protocol plugin; a plugin can implement many extension points; ...). As a short-term workaround, you can, for instance, directly call the plugin that you need to be called before yours. +1 for this issue to be solved, but -1 for this patch. Jérôme Plugin execution order -- Key: NUTCH-215 URL: http://issues.apache.org/jira/browse/NUTCH-215 Project: Nutch Type: Improvement Versions: 0.8-dev Reporter: Enrico Triolo Priority: Minor Attachments: plugin_order.patch This patch allows Nutch to automatically guess the correct order of execution of plugins, depending on their dependencies. This means that, for example, if plugin A depends on plugin B (as stated in the plugins.xml file), then B will be executed before A.
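The patch's idea of deriving execution order from declared dependencies amounts to a topological sort of the dependency graph; a generic sketch (not the actual patch code, which works on Nutch's plugin descriptors):

```java
import java.util.*;

/** Order plugins so each runs after the plugins it depends on
 *  (depth-first topological sort). Illustrative only. */
public class PluginOrder {
    /** deps maps a plugin id to the ids of the plugins it depends on. */
    static List<String> order(Map<String, List<String>> deps) {
        List<String> sorted = new ArrayList<>();
        Set<String> visited = new HashSet<>();
        Set<String> inProgress = new HashSet<>();
        for (String id : deps.keySet()) {
            visit(id, deps, visited, inProgress, sorted);
        }
        return sorted; // dependencies always precede their dependents
    }

    private static void visit(String id, Map<String, List<String>> deps,
                              Set<String> visited, Set<String> inProgress,
                              List<String> sorted) {
        if (visited.contains(id)) return;
        if (!inProgress.add(id)) {
            throw new IllegalStateException("dependency cycle at " + id);
        }
        for (String dep : deps.getOrDefault(id, Collections.emptyList())) {
            visit(dep, deps, visited, inProgress, sorted);
        }
        inProgress.remove(id);
        visited.add(id);
        sorted.add(id);
    }
}
```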
[jira] Resolved: (NUTCH-212) ant build problem with locale-sr
[ http://issues.apache.org/jira/browse/NUTCH-212?page=all ] Jerome Charron resolved NUTCH-212: -- Fix Version: 0.8-dev Resolution: Fixed Not directly related, but it should be solved with this commit: http://svn.apache.org/viewcvs.cgi?rev=379453&view=rev ant build problem with locale-sr Key: NUTCH-212 URL: http://issues.apache.org/jira/browse/NUTCH-212 Project: Nutch Type: Bug Components: web gui Versions: 0.8-dev Environment: win32 Reporter: Alain Fankhauser Priority: Trivial Fix For: 0.8-dev Problem while executing ant from Eclipse: build.xml: <antcall target="generate-locale"> <param name="doc.locale" value="sr"/> </antcall> error message: generate-locale: [echo] Generating docs for locale=sr [xslt] Transforming into C:\eclipse_projects\nutchTrunk\docs\sr [xslt] Processing C:\eclipse_projects\nutchTrunk\src\web\pages\sr\about.xml to C:\eclipse_projects\nutchTrunk\docs\sr\about.html [xslt] Loading stylesheet C:\eclipse_projects\nutchTrunk\build\docs\sr\nutch-page.xsl [xslt] C:/eclipse_projects/nutchTrunk/src/web/pages/sr/about.xml:1: Fatal Error! Dokumentwurzelelement fehlt [document root element missing] [xslt] Failed to process C:\eclipse_projects\nutchTrunk\src\web\pages\sr\about.xml BUILD FAILED C:\eclipse_projects\nutchTrunk\build.xml:393: The following error occurred while executing this line: C:\eclipse_projects\nutchTrunk\build.xml:324: Fatal error during transformation
[jira] Updated: (NUTCH-139) Standard metadata property names in the ParseData metadata
[ http://issues.apache.org/jira/browse/NUTCH-139?page=all ] Jerome Charron updated NUTCH-139: - Attachment: NUTCH-139.060208.patch A new patch which I hope is compliant with all our requirements (not yet tested on a full fetch/index/query cycle) Standard metadata property names in the ParseData metadata -- Key: NUTCH-139 URL: http://issues.apache.org/jira/browse/NUTCH-139 Project: Nutch Type: Improvement Components: fetcher Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB RAM, although bug is independent of environment Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Priority: Minor Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6 Attachments: NUTCH-139.060105.patch, NUTCH-139.060208.patch Currently, people are free to name their string-based properties anything they want, so that Content-type, content-TyPe, and CONTENT_TYPE can all carry the same meaning. Stefan G., I believe, proposed a solution in which all property names are converted to lower case, but that really only fixes half the problem (identifying that CONTENT_TYPE, conTeNT_TyPE, and all their permutations are the same). What if I named it Content Type, or ContentType? I propose to correct this by creating a standard set of named Strings in the ParseData class that the protocol framework and the parsing framework could use to identify common properties such as Content-type, Creator, Language, etc. The properties would be defined at the top of the ParseData class, something like: public class ParseData { ... public static final String CONTENT_TYPE = "content-type"; public static final String CREATOR = "creator"; } In this fashion, users would at least know the names of the standard properties they can obtain from the ParseData, for example by calling ParseData.getMetadata().get(ParseData.CONTENT_TYPE) to get the content type, or ParseData.getMetadata().set(ParseData.CONTENT_TYPE, "text/xml") to set it. Of course, this wouldn't preclude users from doing what they currently do; it would just provide a standard way of obtaining some of the more common, critical metadata without poring over the code base to figure out what it is named. I'll contribute a patch near the end of this week, or the beginning of next week, that addresses this issue.
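The lower-casing half-fix discussed above can be sketched as a small map wrapper. This is only an illustration of the normalization idea, assuming '-' and '_' are unified as well; it is not a class from the actual patch:

```java
import java.util.*;

// Illustrative sketch: a metadata map that treats "Content-Type",
// "CONTENT_TYPE" and "content-type" as the same key by lower-casing the
// name and mapping '_' to '-'. Hypothetical class name, not Nutch code.
public class CaseFoldingMetadata {
    private final Map<String, String> props = new HashMap<>();

    private static String normalize(String name) {
        return name.toLowerCase(Locale.ROOT).replace('_', '-');
    }

    public void set(String name, String value) {
        props.put(normalize(name), value);
    }

    public String get(String name) {
        return props.get(normalize(name));
    }
}
```

Normalization catches CONTENT_TYPE vs. Content-Type, but not "Content Type" or "ContentType", which is why shared constants remain the stronger contract.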
[jira] Resolved: (NUTCH-139) Standard metadata property names in the ParseData metadata
[ http://issues.apache.org/jira/browse/NUTCH-139?page=all ] Jerome Charron resolved NUTCH-139: -- Fix Version: (was: 0.7.2-dev) (was: 0.7.1) (was: 0.7) (was: 0.6) Resolution: Fixed Tested and committed with some corrections in cached.jsp (missed ContentProperties usage) and build.xml (added the commons-lang jar to the war lib): http://svn.apache.org/viewcvs.cgi?rev=376089&view=rev Standard metadata property names in the ParseData metadata -- Key: NUTCH-139 URL: http://issues.apache.org/jira/browse/NUTCH-139 Fix For: 0.8-dev Attachments: NUTCH-139.060105.patch, NUTCH-139.060208.patch
[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata
[ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12365066 ] Jerome Charron commented on NUTCH-139: -- Sorry for this very late response... The idea behind separate subclasses of Metadata for content and parses is to enforce the semantic separation between content metadata and parse metadata: ContentProperties only defines constants for content-related metadata; ParseProperties only defines constants for parse-related metadata. Does it make sense? Standard metadata property names in the ParseData metadata -- Key: NUTCH-139 URL: http://issues.apache.org/jira/browse/NUTCH-139 Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6 Attachments: NUTCH-139.060105.patch
[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata
[ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12365095 ] Jerome Charron commented on NUTCH-139: -- Ok Doug, your point of view makes sense to me. I hope I can provide a (final) patch next week. Standard metadata property names in the ParseData metadata -- Key: NUTCH-139 URL: http://issues.apache.org/jira/browse/NUTCH-139 Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6 Attachments: NUTCH-139.060105.patch
[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata
[ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12365103 ] Jerome Charron commented on NUTCH-139: -- "except for the sake of purity of OO approach" Andrzej, as you certainly noticed, that is my weakness... ;-) You know, I am still tempted to split the metadata constants into several interfaces (DublinCore, HttpHeaders, ...) ;-) Standard metadata property names in the ParseData metadata -- Key: NUTCH-139 URL: http://issues.apache.org/jira/browse/NUTCH-139 Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6 Attachments: NUTCH-139.060105.patch
[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata
[ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12364218 ] Jerome Charron commented on NUTCH-139: -- I think we're near agreement here. I really hope so... ;-) "We should add an add() method to Metadata, and change set() to replace all values rather than add a new value." I'm not sure we are looking at the same piece of code, since this is how the add() and set() methods work in the last attached patch (http://issues.apache.org/jira/secure/attachment/12321740/NUTCH-139.060105.patch). "MetadataNames belongs in the protocol package, not util" +1 (but in my mind there is no more MetadataNames, only MetaData, ContentProperties and ParseProperties, no?) "We should rename ContentProperties to Metadata" What about having a generic Metadata container extended by ContentProperties and ParseProperties? (as described in a previous comment: http://issues.apache.org/jira/browse/NUTCH-139#action_12362618) By having two separate maps (one for Content and one for Parse in ParseData) we easily handle the problem of original value / final value, and we avoid copying the Content metadata map to the Parse metadata map in all parsers: ContentProperties metadata = new ContentProperties(); metadata.putAll(content.getMetadata()); // copy through Standard metadata property names in the ParseData metadata -- Key: NUTCH-139 URL: http://issues.apache.org/jira/browse/NUTCH-139 Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6 Attachments: NUTCH-139.060105.patch, NUTCH-139.Mattmann.patch.txt, NUTCH-139.jc.review.patch.txt
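The add()/set() contract under discussion can be sketched with a tiny multi-valued map. The class and method names are illustrative, assuming the semantics described in the comments (add() appends another value under the same name, set() replaces all existing values), not the patch's actual code:

```java
import java.util.*;

// Illustrative sketch of a multi-valued metadata container with the
// add()/set() semantics discussed on NUTCH-139. Hypothetical class name.
public class MultiValuedMetadata {
    private final Map<String, List<String>> props = new HashMap<>();

    // add() appends one more value under the same name.
    public void add(String name, String value) {
        props.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    // set() replaces ALL existing values for the name.
    public void set(String name, String value) {
        List<String> values = new ArrayList<>();
        values.add(value);
        props.put(name, values);   // previous values are discarded
    }

    public String[] getValues(String name) {
        return props.getOrDefault(name, Collections.<String>emptyList())
                    .toArray(new String[0]);
    }
}
```

This is the behaviour that makes multi-valued properties (SMTP recipients, HTML tags, ...) representable without losing the single-assignment convenience of set().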
[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata
[ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12364112 ] Jerome Charron commented on NUTCH-139: -- In fact, the more I look at this, the more I agree with Doug's last comment. There is no real need (for now) for such a complicated meta-info container. I would like to summarize the key goals of this issue: 1. Define constants for protocol and content metadata names. 2. Provide correction mechanisms for erroneous protocol header names. 3. Handle multi-valued properties (such as SMTP recipients, or TAGS attached to an html page, ...). 4. Provide an easy way to keep track of the protocol's original values even if they are overridden by parsers (I don't think there is a need for a concept of original value at the parser level: if a parser overrides a value previously set by another parser, the new value must replace the existing one). I really think that one of my comments (13/Jan/06 - http://issues.apache.org/jira/browse/NUTCH-139#action_12362618) covers all these cases. In that proposal, the ParseData object keeps a reference to the protocol's original metadata map (ContentProperties), instead of copying the map into a new one. The policy is then as follows: * The ContentProperties is created at the protocol level and is never modified after that. * The ParseProperties is created by the content parser and is the place to store any kind of metadata in all subsequent Nutch processes. * Any metadata stored in ParseProperties can be overridden (the last one to speak has the last word). Standard metadata property names in the ParseData metadata -- Key: NUTCH-139 URL: http://issues.apache.org/jira/browse/NUTCH-139 Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6 Attachments: NUTCH-139.060105.patch, NUTCH-139.Mattmann.patch.txt, NUTCH-139.jc.review.patch.txt
[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata
[ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12363834 ] Jerome Charron commented on NUTCH-139: -- Andrzej, I really don't like this X-Nutch naming convention. First, it is really protocol-level oriented, and it forces one to map X-Nutch values to the original ones (of course a utility method could easily provide this mapping), and I really don't think that solution is clean (from my point of view). We should perhaps define one more time what a MetaData value is. I suggest defining a new class to represent a metadata value instead of using a simple String. Thus, we can define a class that holds both the original and the final value. The idea is that the only way to set the original value is to construct a new object (I will call this class MetaValue, but native English speakers are encouraged to propose a better name); then, when you set the value of this metadata value, it never overrides the original one, only the final one. Here is a short piece of code:

public class MetaValue {

  private String[] original = null;
  private List actual = null;

  public MetaValue(String[] values) {
    // Constructor for multi value
    original = values;
  }

  public MetaValue(String value) {
    // Constructor for single value
    original = new String[] { value };
  }

  public void setValue(String[] values) {
    // copies the values into a new, empty actual list
  }

  public void addValue(String value) {
    // appends this value to the actual list of values
  }

  public String[] getOriginalValues() { }

  public String[] getFinalValues() { }

  public String[] getValues() {
    // Return the final values if the actual list of values is not null,
    // otherwise return the original values
  }
}

With this approach we can keep the same value (MetaValue) under the same key. Standard metadata property names in the ParseData metadata -- Key: NUTCH-139 URL: http://issues.apache.org/jira/browse/NUTCH-139 Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6 Attachments: NUTCH-139.060105.patch, NUTCH-139.Mattmann.patch.txt, NUTCH-139.jc.review.patch.txt
[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata
[ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12362618 ] Jerome Charron commented on NUTCH-139: -- Here is a new proposal for this issue. org.apache.nutch.util.MetaData * becomes a utility class that is only a container of multi-valued, typo-tolerant String properties (using the same kind of API as JavaMail: the add/set methods mentioned by Doug - already implemented in the current patch). * There are no more metadata name constants in this class, since it becomes a generic object for storing String/String[] mappings. org.apache.nutch.protocol.ContentProperties * This class simply extends the MetaData class. * It defines the content-related constants (Content-Type, and so on). org.apache.nutch.parse.ParseProperties * This class simply extends the MetaData class. * It defines the parse-related constants (Dublin Core constants). org.apache.nutch.parse.ParseData * The constructor becomes ParseData(ParseStatus, String, Outlink[], ContentProperties). * This class holds two metadata sets: 1. ContentProperties for the original metadata set that came from the protocol. 2. ParseProperties for the parse metadata set. * This class provides 3 ways to retrieve a metadata value: 1. public ContentProperties getContentMeta(); 2. public ParseProperties getParseMeta(); 3. public MetaData getMetaData(); // Returns a mix of the two previous ones, where values in parse properties override those in content properties. In all parser implementations: * Remove the copying of content metadata to parse metadata. From my point of view the key benefits are: 1. A clear separation between content metadata and parse metadata. 2. Metadata names are defined in the right places. 3. Keeps the advantage of metadata name normalization and syntax correction. 4. An easy mapping between content metadata names and parse metadata names (both can use the real name of the metadata, without adding an artificial X-Nutch prefix for parse metadata names). Comments are welcome. Jérôme Standard metadata property names in the ParseData metadata -- Key: NUTCH-139 URL: http://issues.apache.org/jira/browse/NUTCH-139 Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6 Attachments: NUTCH-139.060105.patch, NUTCH-139.Mattmann.patch.txt, NUTCH-139.jc.review.patch.txt
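The getMetaData() merge in the proposal above can be sketched with plain maps standing in for ContentProperties and ParseProperties. The names here are illustrative, assuming only the override policy described in the comment (parse values win over content values):

```java
import java.util.*;

// Illustrative sketch of ParseData.getMetaData(): a merged, read-only view in
// which parse metadata overrides content metadata of the same name.
// Hypothetical class, not the actual Nutch ParseData implementation.
public class ParseDataSketch {
    public static Map<String, String> merged(Map<String, String> contentMeta,
                                             Map<String, String> parseMeta) {
        Map<String, String> mix = new HashMap<>(contentMeta);
        mix.putAll(parseMeta);   // parse values override content values
        return mix;
    }
}
```

Because the merge is computed on demand, the original protocol values stay untouched in the content map, which is exactly the original-value guarantee the proposal wants.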
[jira] Resolved: (NUTCH-151) CommandRunner can hang after the main thread exec is finished and has inefficient busy loop
[ http://issues.apache.org/jira/browse/NUTCH-151?page=all ] Jerome Charron resolved NUTCH-151: -- Resolution: Fixed Changes committed: http://svn.apache.org/viewcvs.cgi?rev=368060&view=rev Thanks Paul. CommandRunner can hang after the main thread exec is finished and has inefficient busy loop --- Key: NUTCH-151 URL: http://issues.apache.org/jira/browse/NUTCH-151 Project: Nutch Type: Bug Components: indexer Versions: 0.8-dev Environment: all Reporter: Paul Baclace Fix For: 0.8-dev Attachments: CommandRunner.060110.patch, CommandRunner.java, CommandRunner.java.patch I encountered a case where the JVM of a TaskTracker child did not exit after the main thread returned; a thread dump showed only the threads named STDOUT and STDERR from CommandRunner as non-daemon threads, and both were doing a read(). CommandRunner usually works correctly when the subprocess is expected to finish before the timeout or when no timeout is used. By _usually_, I mean in the absence of external thread interrupts. The busy loop that waits for the process to finish has a sleep that is skipped over by an exception; this causes the waiting main thread to compete with the subprocess in a tight loop and effectively reduces the available CPU by 50%.
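The two fixes the report implies can be sketched as follows. This is an illustrative reconstruction, not the actual CommandRunner code: the stream-pumping threads are made daemons so a blocked read() cannot keep the JVM alive, and the wait loop records an interrupt instead of letting it skip the sleep, so an external interrupt cannot degrade the loop into a CPU-burning spin.

```java
// Illustrative sketch of the two CommandRunner fixes. Class and method names
// are hypothetical.
public class ProcessWaiter {

    // Start a stream-pumping thread as a daemon so that a read() blocked on a
    // dead subprocess cannot prevent JVM exit (the hang in the report).
    public static Thread pumper(Runnable pump, String name) {
        Thread t = new Thread(pump, name);
        t.setDaemon(true);
        t.start();
        return t;
    }

    // Poll until the process exits or the timeout elapses.
    // Returns true if the process finished within timeoutMillis.
    public static boolean waitFor(Process p, long timeoutMillis) {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        boolean interrupted = false;
        try {
            while (System.currentTimeMillis() < deadline) {
                try {
                    p.exitValue();          // throws while still running
                    return true;
                } catch (IllegalThreadStateException stillRunning) {
                    try {
                        Thread.sleep(100);  // pace the loop
                    } catch (InterruptedException e) {
                        interrupted = true; // note it, but keep pacing
                    }
                }
            }
            return false;
        } finally {
            if (interrupted) {
                Thread.currentThread().interrupt(); // restore the flag at the end
            }
        }
    }
}
```

Swallowing the interrupt inside the loop and restoring the flag only on exit is what prevents the tight spin: re-interrupting immediately would make every subsequent sleep() throw at once, reproducing the reported 50% CPU loss.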
[jira] Reopened: (NUTCH-151) CommandRunner can hang after the main thread exec is finished and has inefficient busy loop
[ http://issues.apache.org/jira/browse/NUTCH-151?page=all ] Jerome Charron reopened NUTCH-151: -- Due to the removal of the calling barrier in PumperThread, the process always times out (for instance, the parse-ext unit tests fail) because only the main thread is assumed to be finished. CommandRunner can hang after the main thread exec is finished and has inefficient busy loop --- Key: NUTCH-151 URL: http://issues.apache.org/jira/browse/NUTCH-151 Project: Nutch Type: Bug Components: indexer Versions: 0.8-dev Environment: all Reporter: Paul Baclace Fix For: 0.8-dev Attachments: CommandRunner.java, CommandRunner.java.patch
[jira] Updated: (NUTCH-151) CommandRunner can hang after the main thread exec is finished and has inefficient busy loop
[ http://issues.apache.org/jira/browse/NUTCH-151?page=all ] Jerome Charron updated NUTCH-151: - Attachment: CommandRunner.060110.patch Here is a very small patch that solves this issue. If Paul is ok with this, I will commit. CommandRunner can hang after the main thread exec is finished and has inefficient busy loop --- Key: NUTCH-151 URL: http://issues.apache.org/jira/browse/NUTCH-151 Project: Nutch Type: Bug Components: indexer Versions: 0.8-dev Environment: all Reporter: Paul Baclace Fix For: 0.8-dev Attachments: CommandRunner.060110.patch, CommandRunner.java, CommandRunner.java.patch
[jira] Updated: (NUTCH-169) remove static NutchConf
[ http://issues.apache.org/jira/browse/NUTCH-169?page=all ] Jerome Charron updated NUTCH-169: - Attachment: NutchConf.Http.060111.patch Attached is the patch for the http related classes (lib-http, protocol-http and protocol-httpclient). Pfou, Stefan, it was a huge amount of work since a lot of code was static and used the static NutchConf !!! ;-) But it is ok and it works (with a patch to the Fetcher that I will submit just after). Please notice that it is a raw version, and it probably needs a full review after commit. remove static NutchConf --- Key: NUTCH-169 URL: http://issues.apache.org/jira/browse/NUTCH-169 Project: Nutch Type: Improvement Reporter: Stefan Groschupf Priority: Critical Fix For: 0.8-dev Attachments: NutchConf.Http.060111.patch, nutchConf.patch Removing the static NutchConf.get is required for a set of improvements and new features. + it allows a better integration of nutch in j2ee or other systems. + it allows the management of nutch from a web based gui (a kind of nutch appliance), which will improve the usability and also increase the user acceptance of nutch + it allows configuration properties to be changed at runtime + it allows NutchConf to be implemented as an abstract class or interface, to provide configuration value sources other than xml files. (community request)
[jira] Updated: (NUTCH-169) remove static NutchConf
[ http://issues.apache.org/jira/browse/NUTCH-169?page=all ] Jerome Charron updated NUTCH-169: - Attachment: NutchConf.Fetcher.060111.patch Same as the one provided in Stefan's patch + the Fetcher sets the NutchConf on the protocol. Not sure it is the right way: it might be better for the ProtocolFactory to set the NutchConf on protocols. ??? remove static NutchConf --- Key: NUTCH-169 URL: http://issues.apache.org/jira/browse/NUTCH-169 Project: Nutch Type: Improvement Reporter: Stefan Groschupf Priority: Critical Fix For: 0.8-dev Attachments: NutchConf.Fetcher.060111.patch, NutchConf.Http.060111.patch, nutchConf.patch
[jira] Updated: (NUTCH-169) remove static NutchConf
[ http://issues.apache.org/jira/browse/NUTCH-169?page=all ] Jerome Charron updated NUTCH-169: - Attachment: NutchConf.RegexURLFilter.060111.patch This patch is a merge of the version provided in Stefan's patch and the last changes committed by Doug (use JDK regexp). remove static NutchConf --- Key: NUTCH-169 URL: http://issues.apache.org/jira/browse/NUTCH-169 Project: Nutch Type: Improvement Reporter: Stefan Groschupf Priority: Critical Fix For: 0.8-dev Attachments: NutchConf.Fetcher.060111.patch, NutchConf.Http.060111.patch, NutchConf.RegexURLFilter.060111.patch, nutchConf.patch
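The direction of this refactoring, passing the configuration object in rather than reading a static singleton, can be sketched as follows. The `Conf` stand-in, the protocol class, and the property name are illustrative assumptions, not the actual Nutch API:

```java
// Sketch: dependency-injected configuration instead of a static NutchConf.get().
import java.util.HashMap;
import java.util.Map;

public class ConfDemo {
    // Minimal stand-in for a NutchConf-like configuration object.
    public static class Conf {
        private final Map<String, String> props = new HashMap<>();
        public void set(String k, String v) { props.put(k, v); }
        public String get(String k, String dflt) { return props.getOrDefault(k, dflt); }
    }

    // Before: the protocol read the static NutchConf internally.
    // After: the factory (or Fetcher) hands each plugin its configuration.
    public static class HttpProtocol {
        private final int timeout;
        public HttpProtocol(Conf conf) {
            this.timeout = Integer.parseInt(conf.get("http.timeout", "10000"));
        }
        public int getTimeout() { return timeout; }
    }

    public static void main(String[] args) {
        Conf conf = new Conf();
        conf.set("http.timeout", "5000");  // can now differ per instance, at runtime
        System.out.println(new HttpProtocol(conf).getTimeout());
    }
}
```

Because each instance carries its own configuration, two crawls with different settings can coexist in one JVM, which is exactly what the static singleton prevented.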
[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata
[ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12362061 ] Jerome Charron commented on NUTCH-139: -- I agree with your analysis Andrzej. I suggested to commit this patch because it is a response to this issue: standard metadata names + misspelled/erroneous names. The history is not a new feature = ContentProperties is a kind of history. So after committing this patch, I (and others) could focus on other sub-issues: 1. In fact, by taking a closer look at it, I agree that there is no real need of a metadata history in nutch. 2. What we need: 2.1 MetaData must be used to store multi-valued metadata and not the actual kind of history. 3.1 Only two historical values must be stored: the original one (protocol only) and some extra metadata (that could be, or not, some derived values of the original ones). What I suggest is that the MetaData deals with two collections instead of one: * One for original protocol values : headers * Another one for other metadata Standard metadata property names in the ParseData metadata -- Key: NUTCH-139 URL: http://issues.apache.org/jira/browse/NUTCH-139 Project: Nutch Type: Improvement Components: fetcher Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB RAM, although bug is independent of environment Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Priority: Minor Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6 Attachments: NUTCH-139.060105.patch, NUTCH-139.Mattmann.patch.txt, NUTCH-139.jc.review.patch.txt Currently, people are free to name their string-based properties anything that they want, such as having names of Content-type, content-TyPe, CONTENT_TYPE all having the same meaning. Stefan G. I believe proposed a solution in which all property names be converted to lower case, but in essence this really only fixes half the problem (the case of identifying that CONTENT_TYPE and conTeNT_TyPE and all the permutations are really the same). What about if I named it Content Type, or ContentType? I propose that a way to correct this would be to create a standard set of named Strings in the ParseData class that the protocol framework and the parsing framework could use to identify common properties such as Content-type, Creator, Language, etc. The properties would be defined at the top of the ParseData class, something like: public class ParseData { ... public static final String CONTENT_TYPE = "content-type"; public static final String CREATOR = "creator"; } In this fashion, users could at least know what the names of the standard properties that they can obtain from the ParseData are, for example by making a call to ParseData.getMetadata().get(ParseData.CONTENT_TYPE) to get the content type, or a call to ParseData.getMetadata().set(ParseData.CONTENT_TYPE, "text/xml"); Of course, this wouldn't preclude users from doing what they are currently doing; it would just provide a standard method of obtaining some of the more common, critical metadata without poring over the code base to figure out what they are named. I'll contribute a patch near the end of this week, or the beginning of next week, that addresses this issue.
[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata
[ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12361041 ] Jerome Charron commented on NUTCH-139: -- Ok, Chris and I will implement MetadataNames in this way. Just a few comments: I plan to move the MetadataNames to a class rather than an interface. Two reasons: 1.1 I don't like the design of implementing an interface in order to import some constants into a class: it results in javadoc with a lot of classes with many public constants defined, without any real need to show these constants in the javadoc. 1.2 I want to add a utility method in MetadataNames that tries to find the appropriate Nutch normalized metadata name from a string. It will be based on the Levenshtein Distance (available in commons-lang). More about Levenshtein Distance at http://www.merriampark.com/ld.htm Standard metadata property names in the ParseData metadata -- Key: NUTCH-139 URL: http://issues.apache.org/jira/browse/NUTCH-139 Project: Nutch Type: Improvement Components: fetcher Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB RAM, although bug is independent of environment Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Priority: Minor Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6 Attachments: NUTCH-139.Mattmann.patch.txt, NUTCH-139.jc.review.patch.txt
[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata
[ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12361045 ] Jerome Charron commented on NUTCH-139: -- Andrzej, Do you read my mind? Yes of course, that's the way I want to do it: first check for the most common cases (lower case + keep only letters), then use the Levenshtein distance if needed (last chance). Regards Jérôme Standard metadata property names in the ParseData metadata -- Key: NUTCH-139 URL: http://issues.apache.org/jira/browse/NUTCH-139 Project: Nutch Type: Improvement Components: fetcher Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB RAM, although bug is independent of environment Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Priority: Minor Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6 Attachments: NUTCH-139.Mattmann.patch.txt, NUTCH-139.jc.review.patch.txt
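The two-stage lookup sketched in this comment, a cheap canonical form first and Levenshtein distance as a last chance, could look like the following. The candidate list and distance threshold are assumptions, and commons-lang's Levenshtein implementation is replaced by an inline one here to keep the example self-contained:

```java
// Sketch of normalized metadata-name lookup: canonicalize, then fuzzy-match.
public class NameNormalizer {
    // Illustrative set of standard names; the real list would live in MetadataNames.
    static final String[] NAMES = { "content-type", "content-length", "creator" };

    // Stage 1 canonical form: lower case, letters only.
    static String canonical(String s) {
        return s.toLowerCase().replaceAll("[^a-z]", "");
    }

    // Classic dynamic-programming edit distance (stand-in for commons-lang).
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1], cur = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] t = prev; prev = cur; cur = t;
        }
        return prev[b.length()];
    }

    /** Returns the normalized name, or null if nothing is close enough. */
    public static String normalize(String raw) {
        String c = canonical(raw);
        for (String n : NAMES) if (canonical(n).equals(c)) return n;  // common cases
        String best = null; int bestDist = 3;                          // last chance
        for (String n : NAMES) {
            int d = levenshtein(c, canonical(n));
            if (d < bestDist) { bestDist = d; best = n; }
        }
        return best;
    }
}
```

Stage 1 already absorbs "Content Type", "ContentType" and "CONTENT_TYPE"; the edit-distance fallback only runs for genuine misspellings, so the common path stays cheap.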
[jira] Updated: (NUTCH-139) Standard metadata property names in the ParseData metadata
[ http://issues.apache.org/jira/browse/NUTCH-139?page=all ] Jerome Charron updated NUTCH-139: - Attachment: NUTCH-139.jc.review.patch.txt Here is a new patch from Chris. I reviewed it and tested it. From my point of view, all seems to be ok. So if there are no objections, I will commit it during the day. Regards Jérôme Standard metadata property names in the ParseData metadata -- Key: NUTCH-139 URL: http://issues.apache.org/jira/browse/NUTCH-139 Project: Nutch Type: Improvement Components: fetcher Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB RAM, although bug is independent of environment Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Priority: Minor Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6 Attachments: NUTCH-139.Mattmann.patch.txt, NUTCH-139.jc.review.patch.txt
[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata
[ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12360902 ] Jerome Charron commented on NUTCH-139: -- Andrzej, Thanks for taking time to take a look at the patch. In fact, we had some discussion with Chris about this point (that's why I didn't commit the patch directly; I already have some doubts about this). I will check right now how to handle things in this way. Standard metadata property names in the ParseData metadata -- Key: NUTCH-139 URL: http://issues.apache.org/jira/browse/NUTCH-139 Project: Nutch Type: Improvement Components: fetcher Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB RAM, although bug is independent of environment Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Priority: Minor Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6 Attachments: NUTCH-139.Mattmann.patch.txt, NUTCH-139.jc.review.patch.txt
[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata
[ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12360906 ] Jerome Charron commented on NUTCH-139: -- Andrzej, Here are more comments about my doubts, and how to handle metadata names. If, for instance, a protocol plugin doesn't have any Content-Length information (no header, like in FTP), then it should compute the content length and add it in the X-nutch-content-length attribute. But what do you suggest if a protocol has a Content-Length header (HTTP may provide one)? My feeling is adding the two metadata: 1. One for the Content-Length header in the Content-Length attribute 2. One for the real Content-Length (computed) in the X-nutch-content-length attribute. In other words, and more generally: * When adding a native protocol header, if an equivalent x-nutch attribute exists in MetadataNames, then it must be added too, with the same value or with a more precise value. * If no header information is available, try to fill in as many x-nutch attributes as the protocol level can. Do you agree with that? Standard metadata property names in the ParseData metadata -- Key: NUTCH-139 URL: http://issues.apache.org/jira/browse/NUTCH-139 Project: Nutch Type: Improvement Components: fetcher Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB RAM, although bug is independent of environment Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Priority: Minor Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6 Attachments: NUTCH-139.Mattmann.patch.txt, NUTCH-139.jc.review.patch.txt
[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata
[ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12360920 ] Jerome Charron commented on NUTCH-139: -- And why not use the fact that the ContentProperties object can now handle multi-valued properties? Each piece of code that wants to add more reliable content for a property simply adds its own value to the property = the first value is the raw one (for instance from the protocol level), and the further you iterate over the values of the property, the more reliable the value (the last one should be the most reliable, and is generally the interesting one; or, for other reasons, the original value may be needed, and then it is simply the first value). Yes, you lose one piece of information with this solution: you cannot ensure that the first value of a multi-valued property is the one from the protocol level. But it avoids searching for the same kind of information (the content-type for instance) under many property names (Content-Type for the protocol level and X-Nutch-Content-Type for other levels). We can extend the multi-valued properties by adding a provider attribute when adding a property: public void addProperty(key, value, provider). The provider can be one of PROTOCOL, CONTENT, OTHER, for instance (to be defined) Standard metadata property names in the ParseData metadata -- Key: NUTCH-139 URL: http://issues.apache.org/jira/browse/NUTCH-139 Project: Nutch Type: Improvement Components: fetcher Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB RAM, although bug is independent of environment Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Priority: Minor Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6 Attachments: NUTCH-139.Mattmann.patch.txt, NUTCH-139.jc.review.patch.txt
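The addProperty(key, value, provider) idea from the comment above could be sketched like this. The Provider constants and the accessor names are assumptions for illustration, not committed API:

```java
// Sketch of multi-valued properties with a provider tag per value.
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MultiValuedProps {
    public enum Provider { PROTOCOL, CONTENT, OTHER }

    private static class Entry {
        final String value; final Provider provider;
        Entry(String v, Provider p) { value = v; provider = p; }
    }

    private final Map<String, List<Entry>> props = new LinkedHashMap<>();

    public void addProperty(String key, String value, Provider provider) {
        props.computeIfAbsent(key, k -> new ArrayList<>()).add(new Entry(value, provider));
    }

    /** First (raw) value, typically the one the protocol layer saw. */
    public String getOriginal(String key) {
        List<Entry> l = props.get(key);
        return l == null ? null : l.get(0).value;
    }

    /** Last value added, i.e. the most refined one. */
    public String getBest(String key) {
        List<Entry> l = props.get(key);
        return l == null ? null : l.get(l.size() - 1).value;
    }

    /** First value contributed by a specific provider, or null. */
    public String getFrom(String key, Provider provider) {
        List<Entry> l = props.get(key);
        if (l != null) for (Entry e : l) if (e.provider == provider) return e.value;
        return null;
    }
}
```

Tagging each value with its provider removes the weakness the comment concedes: even if insertion order changes, the protocol-level value can still be recovered explicitly.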
[jira] Closed: (NUTCH-3) multi values of header discarded
[ http://issues.apache.org/jira/browse/NUTCH-3?page=all ] Jerome Charron closed NUTCH-3: -- Fix Version: 0.8-dev Resolution: Fixed Double-checked tests (unit and functional). http://svn.apache.org/viewcvs.cgi?rev=357334&view=rev http://svn.apache.org/viewcvs.cgi?rev=357335&view=rev Thanks Stefan. multi values of header discarded Key: NUTCH-3 URL: http://issues.apache.org/jira/browse/NUTCH-3 Project: Nutch Type: Bug Reporter: Stefan Groschupf Assignee: Stefan Groschupf Fix For: 0.8-dev Attachments: multiValuesPropertyPatch.txt original by: phoebe http://sourceforge.net/tracker/index.php?func=detail&aid=185&group_id=59548&atid=491356 multi values of header discarded Each successive setting of a header value deletes the previous one. This patch allows multiple values to be retained, such as cookies, using lf cr as a delimiter for each value. --- /tmp/HttpResponse.java 2005-01-27 19:57:55.0 -0500 +++ HttpResponse.java 2005-01-27 20:45:01.0 -0500 @@ -324,7 +324,19 @@ } String value = line.substring(valueStart); - headers.put(key, value); +//Spec allows multiple values, such as Set-Cookie - using lf cr as delimiter + if ( headers.containsKey(key)) { + try { + Object obj= headers.get(key); + if ( obj != null) { + String oldvalue= headers.get(key).toString(); + value = oldvalue + "\r\n" + value; + } + } catch (Exception e) { + e.printStackTrace(); + } + } + headers.put(key, value); } private Map parseHeaders(PushbackInputStream in, StringBuffer line) @@ -399,5 +411,3 @@ }
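The patched behavior, extracted from the diff above into a runnable sketch (class name illustrative):

```java
// Repeated headers (e.g. Set-Cookie) are concatenated with a CR/LF
// delimiter instead of overwriting the earlier value.
import java.util.HashMap;
import java.util.Map;

public class Headers {
    private final Map<String, String> headers = new HashMap<>();

    public void put(String key, String value) {
        String old = headers.get(key);
        // Spec allows multiple values for one header; keep them all.
        headers.put(key, old == null ? value : old + "\r\n" + value);
    }

    public String get(String key) { return headers.get(key); }
}
```

A caller that sets Set-Cookie twice then reads back both values, "\r\n"-separated, rather than only the last one.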
[jira] Updated: (NUTCH-135) http header meta data are case insensitive in the real world (e.g. Content-Type or content-type)
[ http://issues.apache.org/jira/browse/NUTCH-135?page=all ] Jerome Charron updated NUTCH-135: - Attachment: cached.jsp.patch cached.jsp must be patched too. http header meta data are case insensitive in the real world (e.g. Content-Type or content-type) Key: NUTCH-135 URL: http://issues.apache.org/jira/browse/NUTCH-135 Project: Nutch Type: Bug Components: fetcher Versions: 0.7, 0.7.1 Reporter: Stefan Groschupf Priority: Critical Fix For: 0.8-dev, 0.7.2-dev Attachments: cached.jsp.patch, contentProperties_patch.txt, contentProperties_patch_WithContentProperties.txt As described in issue NUTCH-133, some webservers return HTTP header metadata that is not standard-conformant in its case. This has many negative side effects: for example, querying the content type from the metadata returns null even when the webserver returned a content type, because the key is not in the standard-conformant case, e.g. lower case. This also affects the PDF parser, which queries the content length, etc.
[jira] Resolved: (NUTCH-135) http header meta data are case insensitive in the real world (e.g. Content-Type or content-type)
[ http://issues.apache.org/jira/browse/NUTCH-135?page=all ] Jerome Charron resolved NUTCH-135: -- Fix Version: (was: 0.7.2-dev) Resolution: Fixed Committed to trunk (to be merged into branch 0.7?) Thanks Stefan. I have performed unit and functional tests, but I don't have resources for a wide and intensive test. If someone can perform such a test, it would be greatly appreciated. Note: During my tests, I noticed some strange content-types returned by de.yahoo.com and all de.yahoo related files. The content-type returned by the protocol layer to the Content constructor is always text/plain, but when performing some wget requests on these sites the content-type in the headers is text/html ... sorry, I don't have time for more investigation. http header meta data are case insensitive in the real world (e.g. Content-Type or content-type) Key: NUTCH-135 URL: http://issues.apache.org/jira/browse/NUTCH-135 Project: Nutch Type: Bug Components: fetcher Versions: 0.7, 0.7.1 Reporter: Stefan Groschupf Priority: Critical Fix For: 0.8-dev Attachments: cached.jsp.patch, contentProperties_patch.txt, contentProperties_patch_WithContentProperties.txt
[jira] Commented: (NUTCH-133) ParserFactory does not work as expected
[ http://issues.apache.org/jira/browse/NUTCH-133?page=comments#action_12359729 ] Jerome Charron commented on NUTCH-133: -- Stefan: Taking a closer look at the ParserFactory patch: 1. You can use the MimeType.clean(String) static method to clean the content-type. 2. In the actual MimeTypes implementation, the getMimeType(String, byte[]) method returns the MimeType from the document name if one matches (without guessing from magic). So, use getMimeType(byte[]) if you want to guess the content type from magic. 3. Your patch doesn't really try to guess the content-type; instead it will try to parse the content by using the parsers declared for the header content-type AND then by the ones declared for the content-type detected from the file extension. It means that you guess the header content-type is more reliable... no? 4. There are too many calls to the .toLowerCase() and .equalsIgnoreCase() methods in your code. One of the major Java bottlenecks is String manipulation, so the basic idea is to use as little string manipulation as you can. 5. Looking at http://www.w3.org/TR/REC-html40/types.html#h-6.7 it seems that content-types are case-insensitive, so the solution to deal with content-type sensitivity is simply to patch the MimeType.clean(String) method so that it performs a toLowerCase on the mime-type. ParserFactory does not work as expected --- Key: NUTCH-133 URL: http://issues.apache.org/jira/browse/NUTCH-133 Project: Nutch Type: Bug Versions: 0.8-dev, 0.7.1, 0.7.2-dev Reporter: Stefan Groschupf Priority: Blocker Attachments: ParserFactoryPatch_nutch.0.7_patch.txt, Parserutil_test_patch.txt Marcel Schnippe detected a set of problems while working with different content and parser types; we worked together to identify the problem source. From our point of view, the problems described here could be the source of many other problems described daily on the mailing lists. Find a conclusion of the problems below.
Problem: Some servers return mixed-case but correct header keys like 'Content-type' or 'content-Length' in the HTTP response header. That's why, for example, a get("Content-Type") fails and a page is detected as zip by the magic content type detection mechanism. We also note that this is a common reason why PDF parsing fails, since "Content-Length" does not return the correct value. Sample: returns text/HTML or application/PDF or Content-length for this url: http://www.lanka.info/dictionary/EnglishToSinhala.jsp Solution: First, write only lower-case keys into the properties, and later convert all keys that are used to query the metadata to lower case as well. e.g.: HttpResponse.java, line 353: use lower case here and for all keys used to query header properties (also content-length). Change: String key = line.substring(0, colonIndex); to String key = line.substring(0, colonIndex).toLowerCase(); Problem: MimeTypes-based discovery (magic and url based) is only done when the content type was not delivered by the web server. This happens not that often; mostly this was a problem with mixed-case keys in the header. See: public Content toContent() { String contentType = getHeader("Content-Type"); if (contentType == null) { MimeType type = null; if (MAGIC) { type = MIME.getMimeType(orig, content); } else { type = MIME.getMimeType(orig); } if (type != null) { contentType = type.getName(); } else { contentType = ""; } } return new Content(orig, base, content, contentType, headers); } Solution: Use the content-type information as it is from the webserver and move the content type discovery from the Protocol plugins to the component where the parsing is done: the ParseFactory. Then just create a list of parsers for the content type returned by the server and the custom detected content type. In the end we can iterate over all parsers until we get a successfully parsed status.
Problem: Content will be parsed even if the protocol reports an exception and has a non-successful status; in such a case the content is new byte[0] in any case. Solution: Fetcher.java, line 243. Change: if (!Fetcher.this.parsing) { .. to if (!Fetcher.this.parsing || !protocolStatus.isSuccess()) { // TODO: we maybe should not write out empty parse text and parse data here; I suggest giving outputPage a parameter parsed true/false outputPage(new FetcherOutput(fle, hash, protocolStatus), content, new ParseText(), new ParseData(new ParseStatus(ParseStatus.NOTPARSED), "", new Outlink[0], new Properties())); return null; } Problem: Actually the configuration of parsers is done based on plugin ids, but one plugin can have several extensions, so normally a plugin
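Point 4 of the comment above (case-insensitive metadata keys) could be realized without lower-casing at every call site, for instance with a case-insensitive map. This is only a sketch of the idea, not the fix that was actually committed to Nutch:

```java
import java.util.Map;
import java.util.TreeMap;

public class CaseInsensitiveHeaders {
    // TreeMap with a case-insensitive comparator: "Content-Type",
    // "content-type" and "conTeNT-TyPE" all resolve to the same entry.
    private final Map<String, String> headers =
        new TreeMap<>(String.CASE_INSENSITIVE_ORDER);

    public void put(String key, String value) { headers.put(key, value); }
    public String get(String key) { return headers.get(key); }

    public static void main(String[] args) {
        CaseInsensitiveHeaders h = new CaseInsensitiveHeaders();
        h.put("Content-type", "text/html");
        System.out.println(h.get("Content-Type")); // text/html
    }
}
```

A design note: this keeps the original casing of the first key written, so headers can still be echoed back as received, while lookups ignore case.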
[jira] Commented: (NUTCH-133) ParserFactory does not work as expected
[ http://issues.apache.org/jira/browse/NUTCH-133?page=comments#action_12359647 ] Jerome Charron commented on NUTCH-133: -- Stefan: 1. URL extensions and also magic content type detection are used. This is the only way protocol-file and protocol-ftp can guess the content-type of a document (see FileResponse.java and FtpResponse.java). So the problem is only for HTTP. As an ASAP solution, I suggest patching the HTTP-related plugins by systematically using the mime-type resolver. But what is the policy to apply if you have both a mime-type from the protocol layer and another one from the mime-type resolver? Which one to use? (We have not yet settled on this...) What do you think about it? 2. I'm ok with Doug. This issue should be split into six separate issues. 3. Unit Tests: I'm ok to commit the tests provided by Stefan about the content-type case. But I'm not sure that TestParseUtil is the right place for them. They don't test the ParseUtil itself, but the way meta-data keys are stored in Nutch. 4. I think we can use case-insensitive metadata keys. I don't know any protocol for which case sensitivity is really used for headers or metadata keys (even if the specification says they are case sensitive). ParserFactory does not work as expected --- Key: NUTCH-133 URL: http://issues.apache.org/jira/browse/NUTCH-133 Project: Nutch Type: Bug Versions: 0.8-dev, 0.7.1, 0.7.2-dev Reporter: Stefan Groschupf Priority: Blocker Attachments: ParserFactoryPatch_nutch.0.7_patch.txt, Parserutil_test_patch.txt
[jira] Closed: (NUTCH-112) Link in cached.jsp page to cached content is an absolute link
[ http://issues.apache.org/jira/browse/NUTCH-112?page=all ] Jerome Charron closed NUTCH-112: Fix Version: 0.8-dev Resolution: Fixed Committed to trunk and mapred. http://svn.apache.org/viewcvs?rev=354575&view=rev http://svn.apache.org/viewcvs?rev=354582&view=rev Thanks Chris. Link in cached.jsp page to cached content is an absolute link - Key: NUTCH-112 URL: http://issues.apache.org/jira/browse/NUTCH-112 Project: Nutch Type: Bug Components: web gui Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev Environment: Windows XP Professional SP2, Intel Pentium M 2.0 GHz, 512 MB RAM, although the bug is independent of environment Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Priority: Trivial Fix For: 0.8-dev Attachments: NUTCH-112.Mattmann.patch.txt The link in the cached.jsp page that points to the cached content uses an absolute link, of the form /servlet/cached?idx=xxx&id=yyy. This causes an error when the user clicks on the link and the Nutch war is not deployed at the root context of the application server. The link should be of the form ./servlet/cached?idx=xxx&id=yyy, i.e., a relative link, to correct this problem. I've attached a small patch that fixes the error. I've tested the patch in my local environment and it fixes the error.
[jira] Commented: (NUTCH-133) ParserFactory does not work as expected
[ http://issues.apache.org/jira/browse/NUTCH-133?page=comments#action_12359487 ] Jerome Charron commented on NUTCH-133: -- Thanks for this really very good description. Just a quick note: I'm currently in the final steps of a new mime-type repository implementation (compliant with the freedesktop specification). So I suggest not focusing on the mime-type issues for now. About the MimeResolution being moved to the parser factory: +1. (As you probably noticed by looking at the comments in the code, it was planned... for when the new mime-type repository becomes available. But unfortunately, it takes more time than expected.) ParserFactory does not work as expected --- Key: NUTCH-133 URL: http://issues.apache.org/jira/browse/NUTCH-133 Project: Nutch Type: Bug Versions: 0.8-dev, 0.7.1, 0.7.2-dev Reporter: Stefan Groschupf Priority: Blocker Attachments: ParserFactoryPatch_nutch.0.7_patch.txt, Parserutil_test_patch.txt Marcel Schnippe detected a set of problems while working with different content and parser types; we worked together to identify the problem source. From our point of view, the problems described here could be the source of many other problems described daily on the mailing lists. Find a conclusion of the problems below. Problem: Some servers return mixed-case but correct header keys like 'Content-type' or 'content-Length' in the HTTP response header. That's why, for example, a get("Content-Type") fails and a page is detected as zip by the magic content type detection mechanism. We also note that this is a common reason why PDF parsing fails, since "Content-Length" does not return the correct value. Sample: returns text/HTML or application/PDF or Content-length for this url: http://www.lanka.info/dictionary/EnglishToSinhala.jsp Solution: First, write only lower-case keys into the properties, and later convert all keys that are used to query the metadata to lower case as well.
e.g.: HttpResponse.java, line 353: use lower case here and for all keys used to query header properties (also content-length). Change: String key = line.substring(0, colonIndex); to String key = line.substring(0, colonIndex).toLowerCase(); Problem: MimeTypes-based discovery (magic and url based) is only done when the content type was not delivered by the web server. This happens not that often; mostly this was a problem with mixed-case keys in the header. See: public Content toContent() { String contentType = getHeader("Content-Type"); if (contentType == null) { MimeType type = null; if (MAGIC) { type = MIME.getMimeType(orig, content); } else { type = MIME.getMimeType(orig); } if (type != null) { contentType = type.getName(); } else { contentType = ""; } } return new Content(orig, base, content, contentType, headers); } Solution: Use the content-type information as it is from the webserver and move the content type discovery from the Protocol plugins to the component where the parsing is done: the ParseFactory. Then just create a list of parsers for the content type returned by the server and the custom detected content type. In the end we can iterate over all parsers until we get a successfully parsed status. Problem: Content will be parsed even if the protocol reports an exception and has a non-successful status; in such a case the content is new byte[0] in any case. Solution: Fetcher.java, line 243. Change: if (!Fetcher.this.parsing) { ..
to if (!Fetcher.this.parsing || !protocolStatus.isSuccess()) { // TODO: we maybe should not write out empty parse text and parse data here; I suggest giving outputPage a parameter parsed true/false outputPage(new FetcherOutput(fle, hash, protocolStatus), content, new ParseText(), new ParseData(new ParseStatus(ParseStatus.NOTPARSED), "", new Outlink[0], new Properties())); return null; } Problem: Actually the configuration of parsers is done based on plugin ids, but one plugin can have several extensions, so normally a plugin can provide several parsers, but this is not limited: just wrong values are used in the configuration process. Solution: Change plugin id to extension id in the parser configuration file and also change the code in the parser factory to use extension ids everywhere. Problem: There is not a clear differentiation between content type and mime type. I notice that some plugins call metaData.get("Content-Type") or content.getContentType(). Actually, in theory, these can return different values, since the content type could be detected by the MimeTypes util and is not the same as delivered in the http response header. As mentioned, actually content type is only detected by the MimeTypes
[jira] Commented: (NUTCH-88) Enhance ParserFactory plugin selection policy
[ http://issues.apache.org/jira/browse/NUTCH-88?page=comments#action_12332568 ] Jerome Charron commented on NUTCH-88: - Corrections are committed (http://svn.apache.org/viewcvs.cgi?rev=326889&view=rev). Sorry for the delay, but I do my best... (thanks Chris for proposing your help) Implementation Note: In this implementation, the MimeType.clean(String) method constructs a new MimeType object (the MimeType constructor cleans the content-type) each time it is called. It was the fastest way to solve this issue. But it is not optimal code, since it would be better for performance (avoiding the instantiation of very short-lived objects) that: 1. The clean method really contains the cleaning code. 2. The MimeType constructors use the clean method. Regards Enhance ParserFactory plugin selection policy - Key: NUTCH-88 URL: http://issues.apache.org/jira/browse/NUTCH-88 Project: Nutch Type: Improvement Components: indexer Versions: 0.7, 0.8-dev Reporter: Jerome Charron Assignee: Jerome Charron Fix For: 0.8-dev The ParserFactory chooses the Parser plugin to use based on the content-types and path-suffix defined in each parser's plugin.xml file. The selection policy is as follows: Content type has priority: the first plugin found whose contentType attribute matches the beginning of the content's type is used. If none match, then the first whose pathSuffix attribute matches the end of the url's path is used. If neither of these match, then the first plugin whose pathSuffix is the empty string is used. This policy has a lot of problems when no match is found, because a random parser is used (and there is a good chance this parser can't handle the content).
On the other hand, the content-type associated with a parser plugin is specified in the plugin.xml of each plugin (this is the value used by the ParserFactory), AND the code of each parser itself checks whether the content-type is ok (it uses a hard-coded content-type value, not the value specified in the plugin.xml => possibility of mismatches between the hard-coded content-type and the content-type declared in plugin.xml). A complete list of problems and discussion about this point is available in: * http://www.mail-archive.com/nutch-user%40lucene.apache.org/msg00744.html * http://www.mail-archive.com/nutch-dev%40lucene.apache.org/msg00789.html
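The refactoring suggested in the implementation note, where the static clean(String) holds the actual cleaning code and the constructors delegate to it so that clean() no longer allocates a throwaway MimeType, might look like this. The cleaning rules shown (lower-casing and stripping parameters such as a charset) are assumptions for illustration, not the exact rules of Nutch's MimeType class:

```java
public class MimeType {
    private final String name;

    // The constructor delegates to clean() instead of the other way around.
    public MimeType(String rawName) {
        this.name = clean(rawName);
    }

    // Static cleaning logic: no MimeType instance is created just to clean a
    // string. Lower-case the type and drop parameters like "; charset=UTF-8".
    public static String clean(String rawName) {
        String cleaned = rawName.trim().toLowerCase();
        int semi = cleaned.indexOf(';');
        return (semi >= 0) ? cleaned.substring(0, semi).trim() : cleaned;
    }

    public String getName() { return name; }

    public static void main(String[] args) {
        System.out.println(MimeType.clean("Text/HTML; charset=UTF-8")); // text/html
    }
}
```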
[jira] Resolved: (NUTCH-88) Enhance ParserFactory plugin selection policy
[ http://issues.apache.org/jira/browse/NUTCH-88?page=all ] Jerome Charron resolved NUTCH-88: - Resolution: Fixed Assign To: Jerome Charron Second step implementation details: http://svn.apache.org/viewcvs.cgi?rev=292865&view=rev And final step implementation details: http://svn.apache.org/viewcvs.cgi?rev=321231&view=rev (some unit test corrections: http://svn.apache.org/viewcvs.cgi?rev=321250&view=rev) Big thanks to Chris Mattmann and Sébastien Le Callonnec. Enhance ParserFactory plugin selection policy - Key: NUTCH-88 URL: http://issues.apache.org/jira/browse/NUTCH-88 Project: Nutch Type: Improvement Components: indexer Versions: 0.7, 0.8-dev Reporter: Jerome Charron Assignee: Jerome Charron Fix For: 0.8-dev
[jira] Commented: (NUTCH-88) Enhance ParserFactory plugin selection policy
[ http://issues.apache.org/jira/browse/NUTCH-88?page=comments#action_12323007 ] Jerome Charron commented on NUTCH-88: - Dawid, Thanks for your pointers on IE MimeType resolution. We have in Nutch a MimeType resolver based on both file extensions and magic byte sequences to find the content-type of a file. It is actually underused, and perhaps some enhancements must be added, such as content-type mapping: allow a content-type to be mapped to a normalized one (e.g. mapping application/powerpoint to application/vnd.ms-powerpoint, so that only the normalized version must be registered in the plugin.xml file). Chris, Thanks in advance for your future work. Could you please synchronize your efforts with Sébastien, since he seems very interested in contributing. Andrzej, The way to express a preference for one plugin over another, if both support the same content type, is to activate the plugin you want to handle a content-type and deactivate the other ones. No? Note: Since the MimeResolver handles associations between file extensions and content-types, the path-suffix in plugin.xml (and in the ParserFactory policy for choosing a Parser) could certainly be removed in order to have only one central point for storing this knowledge. Enhance ParserFactory plugin selection policy - Key: NUTCH-88 URL: http://issues.apache.org/jira/browse/NUTCH-88 Project: Nutch Type: Improvement Components: indexer Versions: 0.7, 0.8-dev Reporter: Jerome Charron Fix For: 0.8-dev
[jira] Created: (NUTCH-88) Enhance ParserFactory plugin selection policy
Enhance ParserFactory plugin selection policy - Key: NUTCH-88 URL: http://issues.apache.org/jira/browse/NUTCH-88 Project: Nutch Type: Improvement Components: indexer Versions: 0.7, 0.8-dev Reporter: Jerome Charron Fix For: 0.8-dev The ParserFactory chooses the Parser plugin to use based on the content-types and path-suffix defined in each parser's plugin.xml file. The selection policy is as follows: Content type has priority: the first plugin found whose contentType attribute matches the beginning of the content's type is used. If none match, then the first whose pathSuffix attribute matches the end of the url's path is used. If neither of these match, then the first plugin whose pathSuffix is the empty string is used. This policy has a lot of problems when no match is found, because a random parser is used (and there is a good chance this parser can't handle the content). On the other hand, the content-type associated with a parser plugin is specified in the plugin.xml of each plugin (this is the value used by the ParserFactory), AND the code of each parser itself checks whether the content-type is ok (it uses a hard-coded content-type value, not the value specified in the plugin.xml => possibility of mismatches between the hard-coded content-type and the content-type declared in plugin.xml). A complete list of problems and discussion about this point is available in: * http://www.mail-archive.com/nutch-user%40lucene.apache.org/msg00744.html * http://www.mail-archive.com/nutch-dev%40lucene.apache.org/msg00789.html
[jira] Resolved: (NUTCH-53) Parser plugin for Zip files
[ http://issues.apache.org/jira/browse/NUTCH-53?page=all ] Jerome Charron resolved NUTCH-53: - Fix Version: 0.8-dev Resolution: Fixed Parser committed after some minor refactoring due to some API changes. (http://svn.apache.org/viewcvs.cgi?rev=278626&view=rev) Thanks to Rohit Kulkarni. Parser plugin for Zip files --- Key: NUTCH-53 URL: http://issues.apache.org/jira/browse/NUTCH-53 Project: Nutch Type: Improvement Components: fetcher Reporter: Rohit Kulkarni Priority: Trivial Fix For: 0.8-dev Attachments: parse-zip.zip Nutch plugin to parse Zip files (using java.util.zip)
[jira] Closed: (NUTCH-21) parser plugin for MS PowerPoint slides
[ http://issues.apache.org/jira/browse/NUTCH-21?page=all ] Jerome Charron closed NUTCH-21: --- Fix Version: 0.8-dev Resolution: Fixed Committed to trunk (http://svn.apache.org/viewcvs.cgi?rev=267226&view=rev) Thanks to Stephan Strittmatter. Note: Take care with the patches attached to this issue, since the unit tests are platform dependent (they only succeed on Windows). The committed code is platform independent (I hope). I tested it on Linux, so if someone can test it on other platforms it would be a good idea. parser plugin for MS PowerPoint slides -- Key: NUTCH-21 URL: http://issues.apache.org/jira/browse/NUTCH-21 Project: Nutch Type: Improvement Components: fetcher Reporter: Stefan Groschupf Priority: Trivial Fix For: 0.8-dev Attachments: MSPowerPointParser.java, build.xml.patch.txt, parse-mspowerpoint.zip, parse-mspowerpoint.zip transferred from: http://sourceforge.net/tracker/index.php?func=detail&aid=1109321&group_id=59548&atid=491356 submitted by: Stephan Strittmatter
[jira] Closed: (NUTCH-65) index-more plugin can't parse large set of modification-date
[ http://issues.apache.org/jira/browse/NUTCH-65?page=all ] Jerome Charron closed NUTCH-65: --- Resolution: Fixed Patch committed (http://svn.apache.org/viewcvs.cgi?rev=265794&view=rev) index-more plugin can't parse large set of modification-date - Key: NUTCH-65 URL: http://issues.apache.org/jira/browse/NUTCH-65 Project: Nutch Type: Bug Components: indexer Versions: 0.7, 0.8-dev Environment: nutch 0.7, java 1.5, linux Reporter: Lutischán Ferenc Fix For: 0.8-dev Attachments: MoreIndexingFilter.diff, MoreIndexingFilter.java, commons-lang-2.1.jar I found a problem in MoreIndexingFilter.java. When I index segments, I get a long list of error messages: can't parse errorenous date: Wed, 10 Sep 2003 11:59:14 or can't parse errorenous date: Wed, 10 Sep 2003 11:59:14GMT I modified the source code (I didn't make a 'patch'): Original (lines 137-138): DateFormat df = new SimpleDateFormat("EEE MMM dd HH:mm:ss zzz"); Date d = df.parse(date); New: DateFormat df = new SimpleDateFormat("EEE, MMM dd HH:mm:ss ", Locale.US); Date d = df.parse(date.substring(0,25)); The modified code works fine.
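A more robust variant of the reported fix is to try several date patterns in order and only give up when none matches; because DateFormat.parse(String) ignores trailing text beyond the pattern, a stray suffix like "GMT" glued to the seconds no longer breaks parsing. The patterns below are illustrative, not the ones MoreIndexingFilter actually uses:

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

public class LenientDateParser {
    // Try a full RFC-1123-style pattern first, then one without a zone.
    private static final String[] PATTERNS = {
        "EEE, dd MMM yyyy HH:mm:ss zzz",
        "EEE, dd MMM yyyy HH:mm:ss",
    };

    public static Date parse(String date) {
        for (String p : PATTERNS) {
            try {
                // parse() stops at the end of the pattern, so trailing
                // garbage such as "14GMT" does not cause a failure here.
                return new SimpleDateFormat(p, Locale.US).parse(date.trim());
            } catch (ParseException ignored) {
                // fall through to the next pattern
            }
        }
        return null; // no pattern matched
    }

    public static void main(String[] args) {
        System.out.println(LenientDateParser.parse("Wed, 10 Sep 2003 11:59:14") != null);
        System.out.println(LenientDateParser.parse("Wed, 10 Sep 2003 11:59:14GMT") != null);
    }
}
```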
[jira] Commented: (NUTCH-21) parser plugin for MS PowerPoint slides
[ http://issues.apache.org/jira/browse/NUTCH-21?page=comments#action_12320717 ] Jerome Charron commented on NUTCH-21: - I want to commit it, but the unit tests failed. parser plugin for MS PowerPoint slides -- Key: NUTCH-21 URL: http://issues.apache.org/jira/browse/NUTCH-21 Project: Nutch Type: Improvement Components: fetcher Reporter: Stefan Groschupf Priority: Trivial Attachments: build.xml.patch.txt, parse-mspowerpoint.zip, parse-mspowerpoint.zip transferred from: http://sourceforge.net/tracker/index.php?func=detail&aid=1109321&group_id=59548&atid=491356 submitted by: Stephan Strittmatter -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Closed: (NUTCH-20) Extract urls from plain texts
[ http://issues.apache.org/jira/browse/NUTCH-20?page=all ] Jerome Charron closed NUTCH-20: --- Fix Version: 0.8-dev Resolution: Fixed Revision 233559 - http://svn.apache.org/viewcvs.cgi?rev=233559&view=rev * Add utility to extract urls from plain text (thanks to Stephan Strittmatter) * Uses the OutlinkExtractor in parse plugins PDF, MSWord, Text, RTF, Ext Note: Take a look at the JSParseFilter in order to use the OutlinkExtractor in it. Extract urls from plain texts -- Key: NUTCH-20 URL: http://issues.apache.org/jira/browse/NUTCH-20 Project: Nutch Type: Improvement Components: fetcher Reporter: Stefan Groschupf Priority: Trivial Fix For: 0.8-dev Attachments: OutlinkExtractor.java, OutlinkExtractor.java, OutlinkExtractor.java, TestOutlink.java, TestOutlink.java, patch.txt Some parsers return no Outlinks, e.g. the Word parser. This class is able to extract (absolute) hyperlinks from a plain String (content) and generates outlinks from them. This would be very useful for parsers that have no explicit extraction of hyperlinks. Example: Outlink[] links = OutlinkExtractor.getOutlinks("Nutch is located at http://www.apache.org and ..."); This will return an array of Outlinks containing the single element http://www.apache.org. transferred from: http://sourceforge.net/tracker/index.php?func=detail&aid=1109328&group_id=59548&atid=491356 submitted by: Stephan Strittmatter -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
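The idea behind OutlinkExtractor can be sketched with a plain regex scan over the text. This is a simplified standalone version (the `OutlinkSketch` class and its URL pattern are assumptions for illustration; the real OutlinkExtractor uses a more elaborate expression and builds Nutch `Outlink` objects rather than strings):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OutlinkSketch {
    // Simplified absolute-URL pattern; the real extractor is more permissive
    // about schemes and URL characters.
    private static final Pattern URL_PATTERN =
        Pattern.compile("https?://[\\w.-]+(?::\\d+)?(?:/[\\w./?&=%-]*)?");

    static List<String> getOutlinks(String plainText) {
        List<String> links = new ArrayList<String>();
        Matcher m = URL_PATTERN.matcher(plainText);
        while (m.find()) {
            links.add(m.group());
        }
        return links;
    }

    public static void main(String[] args) {
        System.out.println(getOutlinks("Nutch is located at http://www.apache.org and ..."));
    }
}
```

A parser that produces no explicit hyperlinks (Word, PDF, RTF) can run its extracted plain text through such a scan and attach the matches as outlinks.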
[jira] Closed: (NUTCH-71) Search web page doesn't focus on query input
[ http://issues.apache.org/jira/browse/NUTCH-71?page=all ] Jerome Charron closed NUTCH-71: --- Fix Version: 0.8-dev Resolution: Fixed Assign To: Jerome Charron Thanks Christophe for reporting it and for your piece of code. Search web page doesn't focus on query input Key: NUTCH-71 URL: http://issues.apache.org/jira/browse/NUTCH-71 Project: Nutch Type: Bug Components: searcher Reporter: Christophe Noel Assignee: Jerome Charron Priority: Minor Fix For: 0.8-dev Attachments: searchQueryFocus.patch In search.html and search.jsp, the keyboard cursor does not focus on the form's query input. I've made a patch for the en and fr search.html and for search.jsp. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Updated: (NUTCH-74) French Analyzer Plugin
[ http://issues.apache.org/jira/browse/NUTCH-74?page=all ] Jerome Charron updated NUTCH-74: Component: indexer Fix Version: 0.8-dev Version: 0.7 0.6 0.8-dev French Analyzer Plugin -- Key: NUTCH-74 URL: http://issues.apache.org/jira/browse/NUTCH-74 Project: Nutch Type: New Feature Components: indexer Versions: 0.7, 0.8-dev, 0.6 Environment: Nutch Reporter: Christophe Noel Assignee: Jerome Charron Fix For: 0.8-dev Attachments: analyze-french.zip, analyzers-050705.patch This is a DRAFT of a new plugin for French analysis (all Java files come from the Lucene project sandbox)... It includes an ISO Latin-1 accent filter, plural-form removal, ... analyze-french should be used instead of NutchDocumentAnalysis, as described by Jerome Charron in the new Language Identifier project. It should also be used as a query parser in the Nutch searcher. We are missing an EXTENSION-POINT to include this kind of plugin in Nutch. Could anyone help me build this new extension point, please? -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
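The accent-filtering step the analyze-french plugin relies on can be illustrated without Lucene: decompose each character, then strip the combining marks, so that "été" and "ete" map to the same index term. A minimal standalone sketch (the `AccentFoldSketch` class is an assumption for illustration; the actual plugin runs Lucene's accent filter inside a TokenStream chain rather than a string helper like this):

```java
import java.text.Normalizer;

public class AccentFoldSketch {
    // Fold accented ISO Latin-1 characters to their base letters.
    static String foldAccents(String token) {
        // NFD splits é into e + combining acute accent...
        String decomposed = Normalizer.normalize(token, Normalizer.Form.NFD);
        // ...and \p{M} matches the combining marks, which we drop.
        return decomposed.replaceAll("\\p{M}", "");
    }

    public static void main(String[] args) {
        System.out.println(foldAccents("éléphant")); // elephant
    }
}
```

Doing this at both index and query time is what makes the same filter chain necessary in the searcher too, which is why the reporter asks for a query-analyzer extension point.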