[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata
[ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12361041 ]

Jerome Charron commented on NUTCH-139:
--
OK, Chris and I will implement MetadataNames this way. Just a few comments: I plan to make MetadataNames a class rather than an interface, for two reasons:
1. I don't like the design of implementing an interface in order to import some constants into a class: it clutters the javadoc with many classes exposing public constants that have no real need to appear there.
2. I want to add a utility method to MetadataNames that tries to find the appropriate Nutch normalized metadata name from a string. It will be based on the Levenshtein distance (available in commons-lang). More about the Levenshtein distance at http://www.merriampark.com/ld.htm

Standard metadata property names in the ParseData metadata
--
Key: NUTCH-139
URL: http://issues.apache.org/jira/browse/NUTCH-139
Project: Nutch
Type: Improvement
Components: fetcher
Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 GHz, 1.5 GB RAM, although the bug is independent of environment
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
Priority: Minor
Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
Attachments: NUTCH-139.Mattmann.patch.txt, NUTCH-139.jc.review.patch.txt

Currently, people are free to name their string-based properties anything they want, so Content-type, content-TyPe, and CONTENT_TYPE can all have the same meaning. Stefan G., I believe, proposed a solution in which all property names are converted to lower case, but in essence this only fixes half the problem (identifying that CONTENT_TYPE, conTeNT_TyPE, and all the other case permutations are really the same). What if I named it Content Type, or ContentType? I propose that a way to correct this would be to create a standard set of named Strings in the ParseData class that the protocol framework and the parsing framework could use to identify common properties such as Content-type, Creator, Language, etc. The properties would be defined at the top of the ParseData class, something like:

public class ParseData {
  ...
  public static final String CONTENT_TYPE = "content-type";
  public static final String CREATOR = "creator";
  ...
}

In this fashion, users would at least know the names of the standard properties that they can obtain from the ParseData, for example by making a call to ParseData.getMetadata().get(ParseData.CONTENT_TYPE) to get the content type, or a call to ParseData.getMetadata().set(ParseData.CONTENT_TYPE, "text/xml"). Of course, this wouldn't preclude users from doing what they are currently doing; it would just provide a standard way of obtaining some of the more common, critical metadata without poring over the code base to figure out what it is named. I'll contribute a patch near the end of this week, or the beginning of next week, that addresses this issue.
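A minimal sketch of the utility method Jerome describes above, assuming commons-lang's StringUtils.getLevenshteinDistance; the class shape, the distance threshold, and the set of known names are illustrative assumptions, not the contents of either attached patch:

import org.apache.commons.lang.StringUtils;

public class MetadataNames {

  public static final String CONTENT_TYPE = "content-type";
  public static final String CREATOR = "creator";

  private static final String[] KNOWN_NAMES = { CONTENT_TYPE, CREATOR };

  /**
   * Tries to map an arbitrary property name to the closest normalized
   * metadata name, or returns null if nothing is close enough.
   * The threshold of 3 edits is an arbitrary choice for this sketch.
   */
  public static String normalize(String name) {
    String candidate = name.toLowerCase();
    String best = null;
    int bestDistance = Integer.MAX_VALUE;
    for (int i = 0; i < KNOWN_NAMES.length; i++) {
      int d = StringUtils.getLevenshteinDistance(candidate, KNOWN_NAMES[i]);
      if (d < bestDistance) {
        bestDistance = d;
        best = KNOWN_NAMES[i];
      }
    }
    return (bestDistance <= 3) ? best : null;
  }
}

With this, normalize("Content Type") and normalize("ContentType") would both resolve to "content-type" (edit distance 1 after lowercasing).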
Re: nutch-0.8-dev *mapred.input.subdir* problem ?
Lukas, the input folders are normally set by the tools, so you cannot change that. However, in case you use a Unix box, check that the user that runs Nutch has read and write access to all the folders defined in nutch-site/default.xml. (I guess that could be the problem; Nutch uses e.g. /tmp to write some data.) If this does not solve the problem, just run the commands manually step by step; there is a tutorial in the wiki on how to run the mapred commands step by step. Stefan

On 21.12.2005 at 06:56, Lukas Vlcek wrote:

Hi, I am trying to use nutch-0.8-dev and I have a problem with a crawl run. I did a checkout from SVN and prepared a fresh package (ant package - all went fine). Then I installed Nutch on Linux and made only minor changes to the nutch-site.xml file (turned on some plugins and increased several constants), prepared a file with urls, and started bin/nutch crawl. This worked for nutch-0.7.x, but for nutch-0.8-dev I am receiving the following exception in the log file:

051220 204248 parsing file:/home/lukas/nutch/nutch-0.8-dev/conf/nutch-default.xml
051220 204249 parsing file:/home/lukas/nutch/nutch-0.8-dev/conf/crawl-tool.xml
051220 204249 parsing file:/home/lukas/nutch/nutch-0.8-dev/conf/mapred-default.xml
051220 204249 parsing file:/home/lukas/nutch/nutch-0.8-dev/conf/nutch-site.xml
051220 204249 crawl started in: ./crawl.test
051220 204249 rootUrlDir = urls
051220 204249 threads = 10
051220 204249 depth = 6
051220 204249 parsing file:/home/lukas/nutch/nutch-0.8-dev/conf/nutch-default.xml
051220 204249 parsing file:/home/lukas/nutch/nutch-0.8-dev/conf/crawl-tool.xml
051220 204249 parsing file:/home/lukas/nutch/nutch-0.8-dev/conf/nutch-site.xml
051220 204249 Injector: starting
051220 204249 Injector: crawlDb: ./crawl.test/crawldb
051220 204249 Injector: urlDir: urls
051220 204249 Injector: Converting injected urls to crawl db entries.
051220 204249 parsing file:/home/lukas/nutch/nutch-0.8-dev/conf/nutch-default.xml
051220 204249 parsing file:/home/lukas/nutch/nutch-0.8-dev/conf/crawl-tool.xml
051220 204249 parsing file:/home/lukas/nutch/nutch-0.8-dev/conf/mapred-default.xml
051220 204249 parsing file:/home/lukas/nutch/nutch-0.8-dev/conf/mapred-default.xml
051220 204249 parsing file:/home/lukas/nutch/nutch-0.8-dev/conf/nutch-site.xml
051220 204249 parsing file:/home/lukas/nutch/nutch-0.8-dev/conf/nutch-default.xml
051220 204249 parsing file:/home/lukas/nutch/nutch-0.8-dev/conf/mapred-default.xml
051220 204249 parsing /home/lukas/nutch/mapred/local/localRunner/job_4zwds6.xml
051220 204249 parsing file:/home/lukas/nutch/nutch-0.8-dev/conf/nutch-site.xml
java.io.IOException: No input directories specified in: NutchConf: nutch-default.xml, mapred-default.xml, /home/lukas/nutch/mapred/local/localRunner/job_4zwds6.xml, nutch-site.xml
        at org.apache.nutch.mapred.InputFormatBase.listFiles(InputFormatBase.java:85)
        at org.apache.nutch.mapred.InputFormatBase.getSplits(InputFormatBase.java:95)
        at org.apache.nutch.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:63)
051220 204249 Running job: job_4zwds6
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:308)
        at org.apache.nutch.crawl.Injector.inject(Injector.java:102)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:101)

It seems that the problem is that Nutch is not able to find the mapred.input.subdir setting in any of the config files.
I found that there is a mapred.input.dir property defined in the config for the particular job (job_4zwds6.xml), with a value equal to the name of my urls file, but I don't understand where I should define the mapred.input.subdir property and what value to assign to it (if it needs to be defined manually - note that mapred.input.dir seems to be configured automatically). Does anybody know the answer?
P.S.: Note that the line numbers in the exception trace above for InputFormatBase.java (85, 95) can differ a bit, as I tried to insert some more LOG.debug() calls there in search of the root cause and then removed them again, but it is possible that I left some extra empty lines there.
Thanks, Lukas
[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata
[ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12361043 ]

Andrzej Bialecki commented on NUTCH-139:
-
Regarding the move to a class with public static fields: I don't have any problem with that. Regarding the Levenshtein distance: I think we can do even better before we resort to such generic methods:
1) bring all property names to lowercase
2) remove any non-letters
Example: Content-type vs. ContentType:
1) content-type vs. contenttype
2) contenttype vs. contenttype - match
These two steps could be simply implemented in a custom Comparator for the ContentProperties.
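Andrzej's two steps are easy to sketch; a hedged example of such a Comparator, assuming the property keys are plain Strings (ContentProperties is the class named in the thread; everything else here is illustrative, not the actual patch):

import java.util.Comparator;

/**
 * Compares property names after lowercasing them and stripping all
 * non-letters, so that "Content-type" and "ContentType" compare equal.
 */
public class MetadataNameComparator implements Comparator {

  public int compare(Object o1, Object o2) {
    return normalize((String) o1).compareTo(normalize((String) o2));
  }

  /** Step 1: lowercase. Step 2: keep only letters. */
  static String normalize(String name) {
    StringBuffer buf = new StringBuffer(name.length());
    for (int i = 0; i < name.length(); i++) {
      char c = Character.toLowerCase(name.charAt(i));
      if (Character.isLetter(c)) {
        buf.append(c);
      }
    }
    return buf.toString();
  }
}

Such a comparator could back a sorted map of properties, e.g. new TreeMap(new MetadataNameComparator()), so lookups are normalized transparently.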
[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata
[ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12361045 ]

Jerome Charron commented on NUTCH-139:
--
Andrzej, are you reading my mind? Yes, of course, that's the way I want to do it: first check for the most common cases (lower case + keep only letters), then use the Levenshtein distance if needed (as a last resort). Regards, Jérôme
Re: nutch-0.8-dev *mapred.input.subdir* problem ?
Stefan, Nutch created folders in /tmp, so I think it should be able to create files there as well. I also tried to change all /tmp* entries in the conf file to my home folder, with the same result (i.e. folders were created and several files were dumped there, but it yielded the same exception). Are you able to run Nutch from an up-to-date trunk build? Maybe I didn't explain it clearly - I am using nutch-0.8-dev, which I got from nutch-trunk. Regards, Lukas

On 12/21/05, Stefan Groschupf [EMAIL PROTECTED] wrote:
Lukas, the input folders are normally set by the tools, so you cannot change that. However, in case you use a Unix box, check that the user that runs Nutch has read and write access to all the folders defined in nutch-site/default.xml. (I guess that could be the problem; Nutch uses e.g. /tmp to write some data.) If this does not solve the problem, just run the commands manually step by step; there is a tutorial in the wiki on how to run the mapred commands step by step. Stefan
[...]
Re: nutch-0.8-dev *mapred.input.subdir* problem ?
Yes, I'm able to run it, no problem, but I'm using the step-by-step commands, not the crawl (all-in-one) command. Can you try an "ant test" - do all tests pass?

On 21.12.2005 at 12:52, Lukas Vlcek wrote:
Stefan, Nutch created folders in /tmp, so I think it should be able to create files there as well. I also tried to change all /tmp* entries in the conf file to my home folder, with the same result (i.e. folders were created and several files were dumped there, but it yielded the same exception). Are you able to run Nutch from an up-to-date trunk build? Maybe I didn't explain it clearly - I am using nutch-0.8-dev, which I got from nutch-trunk. Regards, Lukas
[...]
IndexSorter optimizer
Hi, I'm happy to report that further tests performed on a larger index seem to show that the overall impact of the IndexSorter is definitely positive: performance improvements are significant, and the overall quality of results seems at least comparable, if not actually better.

The reason why result quality seems better is quite interesting, and it shows that the simple top-N measures I was using in my benchmarks may have been too simplistic. Using the original index, it was possible for pages with a high tf/idf for a term, but with a low boost value (the OPIC score), to outrank pages with a high boost but lower tf/idf for the term. This phenomenon quite often leads to results that are perceived as junk, e.g. pages with a lot of repeated terms but little other real content, like for example navigation bars.

Using the optimized index, I reported previously that some of the top-scoring results were missing. As it happens, the missing results were typically the junk pages with high tf/idf but low boost. Since we collect up to N hits, going from higher to lower boost values, the junk pages with a low boost value were automatically eliminated. So, overall, the subjective quality of results was improved. On the other hand, some of the legitimate results with decent boost values were also skipped because they didn't fit within the fixed number of hits... ah, well. Perhaps we should limit the number of hits in LimitedCollector using a cutoff boost value, and not the maximum number of hits (or maybe both?).

This again draws attention to the importance of the OPIC score: it represents a query-independent opinion about the quality of the page - whichever way you calculate it. If you use PageRank, it (allegedly) corresponds to other people's opinions about the page, thus providing an objective quality opinion. If you use a simple list of white/black-listed sites that you like/dislike, then it represents your own subjective opinion of the quality of the site; etc., etc. In this way, running a search engine that provides good results is not just a matter of precision, recall, tf/idf and other tangible measures; it's also a sort of political statement by the engine's operator. ;-)

To conclude, I will add IndexSorter.java to the core classes, and I suggest we continue the experiments ...

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web, Embedded Unix, System Integration
http://www.sigram.com
Contact: info at sigram dot com
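The cutoff idea can be sketched against the Lucene 1.9-era HitCollector API. Everything below is an illustrative assumption - this is not Nutch's actual LimitedCollector, and how the per-document boost is obtained (stored field, cached norms, ...) is left abstract:

import org.apache.lucene.search.HitCollector;

public class CutoffCollector extends HitCollector {

  /** Thrown to abort the search early once a limit is reached. */
  public static class StopCollecting extends RuntimeException {}

  private final int maxHits;      // hard cap on collected hits
  private final float minBoost;   // cutoff on the query-independent boost
  private int collected = 0;

  public CutoffCollector(int maxHits, float minBoost) {
    this.maxHits = maxHits;
    this.minBoost = minBoost;
  }

  /** Placeholder: return the document's query-independent boost. */
  protected float boostOf(int doc) {
    return 1.0f; // assumption: read from a stored field or cached norms
  }

  public void collect(int doc, float score) {
    // In a boost-sorted index, boost is non-increasing in doc id, so
    // once either limit trips, no later document can qualify.
    if (collected >= maxHits || boostOf(doc) < minBoost) {
      throw new StopCollecting();
    }
    collected++;
    // ... record (doc, score) ...
  }
}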
Crawling a nutch index with Lucene
Hi, I'm rather new to Nutch, but is there something wrong with the idea of creating an index with Nutch (using the intranet search from the Nutch tutorial) and searching this index with Lucene? I.e. doing something like this:

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.index.IndexReader;
...
Searcher searcher = new IndexSearcher(IndexReader.open(indexDir));

For my setup this leads to the exception pasted at the end of this message. :-( I found a similar question on another list (http://www.opensubscriber.com/message/nutch-general@lists.sourceforge.net/2641952.html) but it looks like people there didn't really get the question, and hence this doesn't help much. Any help with this is greatly appreciated! Best, Oliver

java.lang.ArrayIndexOutOfBoundsException: -1
        at java.util.ArrayList.get(Unknown Source)
        at org.apache.lucene.index.FieldInfos.fieldInfo(FieldInfos.java:155)
        at org.apache.lucene.index.FieldInfos.fieldName(FieldInfos.java:151)
        at org.apache.lucene.index.SegmentTermEnum.readTerm(SegmentTermEnum.java:14
        at org.apache.lucene.index.SegmentTermEnum.next(SegmentTermEnum.java:115)
        at org.apache.lucene.index.TermInfosReader.readIndex(TermInfosReader.java:8
        at org.apache.lucene.index.TermInfosReader.init(TermInfosReader.java:45)
        at org.apache.lucene.index.SegmentReader.initialize(SegmentReader.java:112)
        at org.apache.lucene.index.SegmentReader.init(SegmentReader.java:89)
        at org.apache.lucene.index.IndexReader$1.doBody(IndexReader.java:118)
        at org.apache.lucene.store.Lock$With.run(Lock.java:109)
        at org.apache.lucene.index.IndexReader.open(IndexReader.java:111)
        at org.apache.lucene.index.IndexReader.open(IndexReader.java:95)
        at Lucy.main(Lucy.java:21)
search: (ioe) -1
Re: Crawling a nutch index with Lucene
On Wednesday, 21 December 2005 17:13, Oliver Hummel wrote:
java.lang.ArrayIndexOutOfBoundsException: -1

That's the error you get when you open a Lucene 1.9 index with Lucene 1.4. But I don't know if that's also the case here. Regards, Daniel
--
http://www.danielnaber.de
Re: Crawling a nutch index with Lucene
Yep, that's it. Nutch has Lucene 1.9 in its lib. Many thanks! Oliver

Daniel Naber wrote:
On Wednesday, 21 December 2005 17:13, Oliver Hummel wrote:
java.lang.ArrayIndexOutOfBoundsException: -1
That's the error you get when you open a Lucene 1.9 index with Lucene 1.4. But I don't know if that's also the case here. Regards, Daniel
Re: nutch-0.8-dev *mapred.input.subdir* problem ?
You can ignore mapred.input.subdir; I find it is an unneeded option. Now that the mapred branch is merged to be the trunk, there is a need to clarify the documentation, since a change was made to have the input specified as a directory; all files in that directory are then considered input files (no wildcard needed). I will put that on my ToDo list.

mapred.input.dir is an abstract path on either the local OS filesystem or NDFS, depending on which is in use (if fs.default.name is "local" then the local OS fs is being used; otherwise fs.default.name is something like domainOfMyMasterNode:port). To use NDFS, you need to copy your input file(s) from your local fs to NDFS:

bin/nutch ndfs -put /home/peb/urls_localfs/oneFILENAME /urls

The destination path /urls is arbitrary and is created as a side effect of the -put. Repeat this for each file you have. Paul

Lukas Vlcek wrote:
java.io.IOException: No input directories specified in: NutchConf: nutch-default.xml, mapred-default.xml, /home/lukas/nutch/mapred/local/localRunner/job_4zwds6.xml, nutch-site.xml
        at org.apache.nutch.mapred.InputFormatBase.listFiles(InputFormatBase.java:85)
        at org.apache.nutch.mapred.InputFormatBase.getSplits(InputFormatBase.java:95)
        at org.apache.nutch.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:63)
051220 204249 Running job: job_4zwds6
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:308)
        at org.apache.nutch.crawl.Injector.inject(Injector.java:102)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:101)
Re: IndexSorter optimizer
Hi Andrzej, wow, that's really great news!

[...] Perhaps we should limit the number of hits in LimitedCollector using a cutoff boost value, and not the maximum number of hits (or maybe both?).

As long as we are experimenting, it would be good to have both.

To conclude, I will add IndexSorter.java to the core classes, and I suggest we continue the experiments ...

Maybe someone out there in the community has a commercial search engine running (e.g. a Google appliance or similar), so we could set up Nutch with the same pages and compare the results. I guess it will be difficult to compare Nutch with Yahoo or Google, since none of us has a 4-billion-page index up and running. I would run one on my laptop, but I do not have the bandwidth to fetch for the next two days. :-D Great work! Cheers, Stefan
Re: IndexSorter optimizer
I've got a 400-million-page db I can run this against over the next few days. -byron

--- Stefan Groschupf [EMAIL PROTECTED] wrote:
Hi Andrzej, wow, that's really great news!
[...]
Re: Static initializers
Andrzej, well, I'm not done digging into the problem yet, but I want to ask some more questions. BTW, I counted 195 places that use NutchConf.get(), so this will be a bigger patch. :)

As I mentioned, I would love to go the inversion-of-control way: not using NutchConf in the constructor, but making classes implement the Configurable interface. This, for example, would make sense for all classes realizing an extension. But there are also classes where this makes no sense. For example, I would suggest changing the PluginRegistry from a singleton to a 'normal' object; in this case I guess it makes sense to use the NutchConf in the constructor, since the configuration here only needs to provide the include and exclude regexes for the plugins. So:

Extension.getExtensionInstance() becomes getExtensionInstance(NutchConf): This makes sense; here we can check whether the class implements the Configurable interface and, if so, instantiate the object and set the configuration.

ExtensionPoint.getExtensions() becomes getExtensions(NutchConf): We don't need NutchConf here, since - if I understand it correctly - it is only needed to identify the activated plugins, and that is done at registry instantiation, which in this case takes a NutchConf as a parameter.

PluginRepository.getExtensionPoint(String) becomes getExtensionPoint(String, NutchConf): We don't need it here either, since we use NutchConf at registry instantiation. The other case would be that we have to build up the plugin dependency graph for each method call.

Would you agree to have several plugin registries, possibly with different NutchConfs, and to instantiate extensions with a NutchConf, but not to pass one when querying ExtensionPoints, etc.?

etc, etc... The way this would work would be similar to the mechanism described above: if plugin instances are not created yet, they would be created once (based on the current NutchConf argument), and then cached in this NutchConf instance.

I guess this is difficult. First, we have the plugin class instances - most or maybe all plugins I know of do not have a plugin class implementation. Second, we have the extension classes, which at least do not need to implement a specific interface from the plugin registry's point of view (only things like the index filter interface, etc.). Caching plugin class instances makes sense, since there is actually only one plugin class instance per plugin in the JVM. However, there will be many instances of each extension class, since e.g. the parser or protocol plugins run multithreaded.

And also the plugin implementations would have to extend NutchConfigured, taking NutchConf as the argument to their constructors - because now Extension.getExtensionInstance would pass the current NutchConf instance to their constructors.

In general, my point of view is this: if we touch this issue anyway, I would love to do a radical solution, since I have a different understanding of handling parameters than collecting them in a kind of map and making that map generally accessible. Instead of giving every object access to the configuration object and handling properties like a bazaar, I would prefer to handle configuration only in the first object in the stack - in our case, for example, the indexing tool. Then the indexing tool instantiates the plugin registry with only the required properties as constructor arguments, e.g. the plugin folders, the include/exclude regexes, and the auto-activation flag. Later, the extension instances can also get some more values injected, but in general they have no access to the configuration object.
This would first of all make things more testable, but it would also allow much more flexibility, such as running several different fetchers in one JVM. Anyway, this would be an improvement suggestion from me for Nutch 2.0 or 3.0; for now, we would already be some steps forward just by changing NutchConf access to a non-static style. I hope to find some time over the next few days to do some experiments and will come back with more details. However, we should first reach a general agreement about the way to go, since changing code in 195 places - and the lines that depend on it - for nothing would not be much fun. Stefan
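A minimal sketch of the Configurable idea under discussion; the interface shape and method names are assumptions for illustration, not Nutch's actual API:

public interface Configurable {
  void setConf(NutchConf conf);
}

public class Extension {

  private final String extensionClassName;

  public Extension(String extensionClassName) {
    this.extensionClassName = extensionClassName;
  }

  /**
   * Instead of the extension looking up the static NutchConf.get(),
   * the registry hands the current configuration to the new instance.
   */
  public Object getExtensionInstance(NutchConf conf) throws Exception {
    Object instance = Class.forName(extensionClassName).newInstance();
    if (instance instanceof Configurable) {
      ((Configurable) instance).setConf(conf); // inject configuration
    }
    return instance;
  }
}

Different NutchConf instances could then drive different registries in the same JVM, which is exactly the "several fetchers in one JVM" flexibility mentioned above.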
Re: IndexSorter optimizer
Andrzej Bialecki wrote:
Hi, I'm happy to report that further tests performed on a larger index seem to show that the overall impact of the IndexSorter is definitely positive: performance improvements are significant, and the overall quality of results seems at least comparable, if not actually better.

This is very interesting. What's the computational complexity and disk I/O of index sorting compared to other operations on an index (e.g. adding/deleting N documents and running optimize)?
Re: IndexSorter optimizer
American Jeff Bowden wrote:
[...]
This is very interesting. What's the computational complexity and disk I/O of index sorting compared to other operations on an index (e.g. adding/deleting N documents and running optimize)?

Comparable to optimize(). All index data needs to be read and copied, so the whole process is I/O bound.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web, Embedded Unix, System Integration
http://www.sigram.com
Contact: info at sigram dot com