[jira] Commented: (NUTCH-88) Enhance ParserFactory plugin selection policy
[ http://issues.apache.org/jira/browse/NUTCH-88?page=comments#action_12323007 ]

Jerome Charron commented on NUTCH-88:
-------------------------------------

Dawid, thanks for your pointers on IE MIME-type resolution. In Nutch we have a MIME-type resolver based on both file extensions and file magic sequences to find the content type of a file. It is currently underused, and some enhancements should probably be added, such as content-type mapping: allowing a content type to be mapped to a normalized one (for instance, mapping application/powerpoint to application/vnd.ms-powerpoint, so that only the normalized version must be registered in the plugin.xml file).

Chris, thanks in advance for your future work. Could you please synchronize your efforts with Sébastien, since he seems very interested in contributing?

Andrzej, the way to express a preference for one plugin over another, when both support the same content type, is to activate the plugin you want to handle that content type and deactivate the other ones. No?

Note: since the MimeResolver handles associations between file extensions and content types, the path-suffix in plugin.xml (and in the ParserFactory policy for choosing a Parser) could certainly be removed, in order to have only one central point for storing this knowledge.

Enhance ParserFactory plugin selection policy
---------------------------------------------

         Key: NUTCH-88
         URL: http://issues.apache.org/jira/browse/NUTCH-88
     Project: Nutch
        Type: Improvement
  Components: indexer
    Versions: 0.7, 0.8-dev
    Reporter: Jerome Charron
     Fix For: 0.8-dev

The ParserFactory chooses the Parser plugin to use based on the content types and path suffixes defined in each parser's plugin.xml file. The selection policy is as follows. Content type has priority: the first plugin found whose contentType attribute matches the beginning of the content's type is used. If none match, then the first plugin whose pathSuffix attribute matches the end of the url's path is used. If neither of these match, then the first plugin whose pathSuffix is the empty string is used.
This policy causes a lot of problems when no match is found, because an essentially random parser is used (and there is a good chance this parser can't handle the content). On the other hand, the content type associated with a parser plugin is specified in the plugin.xml of each plugin (this is the value used by the ParserFactory), AND each parser also checks in its own code whether the content type is ok, using a hard-coded content-type value rather than the value specified in plugin.xml; hence the possibility of mismatches between the hard-coded content type and the one declared in plugin.xml. A complete list of problems and a discussion of this point are available at:

* http://www.mail-archive.com/nutch-user%40lucene.apache.org/msg00744.html
* http://www.mail-archive.com/nutch-dev%40lucene.apache.org/msg00789.html

--
This message is automatically generated by JIRA.
- If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
- For more information on JIRA, see: http://www.atlassian.com/software/jira
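For reference, the three-step selection policy described in the issue can be sketched as a small shell function. The plugin ids and registered content types below are illustrative examples, not Nutch's actual plugin registry:

```shell
# Sketch of the ParserFactory selection policy described in the issue.
# Plugin ids and type/suffix tables are invented for illustration.
pick_parser() {
  content_type="$1"; url_path="$2"
  # Step 1: content type has priority - first plugin whose contentType
  # attribute matches the beginning of the content's type.
  case "$content_type" in
    text/html*)        echo parse-html; return ;;
    application/pdf*)  echo parse-pdf;  return ;;
  esac
  # Step 2: first plugin whose pathSuffix matches the end of the url path.
  case "$url_path" in
    *.html) echo parse-html; return ;;
    *.pdf)  echo parse-pdf;  return ;;
  esac
  # Step 3: first plugin whose pathSuffix is the empty string - in effect
  # an arbitrary fallback, which is the problem the issue describes.
  echo parse-text
}

pick_parser "text/html; charset=utf-8" "/index.php"   # → parse-html
```

The fallback in step 3 is what makes the policy fragile: any content with an unregistered type and suffix silently lands on whichever plugin happens to declare an empty pathSuffix.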
Re: [jira] Created: (NUTCH-88) Enhance ParserFactory plugin selection policy
> Jerome: Give me a shout if you need a hand on this. I'll be happy to help
> and, as it happens, I'll be available in the next few weeks.

Sébastien, great! As I mentioned in my last comment on JIRA, please synchronize with Chris on this point. I'm currently coding on other subjects and don't have time to code on this issue, but I can take part in the discussion and I'm happy to review the proposal.

Regards

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/
[jira] Commented: (NUTCH-88) Enhance ParserFactory plugin selection policy
[ http://issues.apache.org/jira/browse/NUTCH-88?page=comments#action_12323009 ]

Dawid Weiss commented on NUTCH-88:
----------------------------------

Yep, I know about the byte-magic MIME detector. I'm just pointing out that Internet Explorer doesn't use it... or at least, it doesn't always use it the way you would expect it to. Whether Nutch should mimic IE in this behaviour is another question.
bug in bin/nutch?
Trying to get the mapred stuff to work, and I find it hard to believe that this is a bug, but just trying to go through the tutorial, I enter

bin/nutch admin db -create

and get

Exception in thread "main" java.lang.NoClassDefFoundError: admin

Looking through bin/nutch, sure enough there isn't a chunk for admin, but there is in trunk. If I add it back in as per my patch below, then it seems to work. But that sure seems like it would be broken for every person that walks through the tutorial on mapred.

Earl

~/nutch/branches/mapred $ svn diff bin/nutch
Index: bin/nutch
===================================================================
--- bin/nutch   (revision 279726)
+++ bin/nutch   (working copy)
@@ -124,6 +124,8 @@
 # figure out which class to run
 if [ "$COMMAND" = "crawl" ] ; then
   CLASS=org.apache.nutch.crawl.Crawl
+elif [ "$COMMAND" = "admin" ] ; then
+  CLASS=org.apache.nutch.tools.WebDBAdminTool
 elif [ "$COMMAND" = "inject" ] ; then
   CLASS=org.apache.nutch.crawl.Injector
 elif [ "$COMMAND" = "generate" ] ; then
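The error above is explained by the script's dispatch logic: an unrecognized COMMAND falls through and is treated as a Java class name, so "admin" ends up passed straight to the JVM. A reduced, standalone sketch of that dispatch (only the branches shown in the patch are included; the fall-through mirrors bin/nutch's behavior):

```shell
# Sketch of bin/nutch's command-to-class dispatch, reduced to a function.
# Only a few branches from the patch are shown; the else branch treats an
# unknown COMMAND as a class name, which is why a missing "admin" branch
# produces "NoClassDefFoundError: admin".
command_to_class() {
  COMMAND="$1"
  if [ "$COMMAND" = "crawl" ] ; then
    CLASS=org.apache.nutch.crawl.Crawl
  elif [ "$COMMAND" = "admin" ] ; then
    CLASS=org.apache.nutch.tools.WebDBAdminTool
  elif [ "$COMMAND" = "inject" ] ; then
    CLASS=org.apache.nutch.crawl.Injector
  else
    CLASS="$COMMAND"   # fall-through: COMMAND itself is run as a class
  fi
  echo "$CLASS"
}

command_to_class admin   # → org.apache.nutch.tools.WebDBAdminTool
```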
tutorial suggestion
Walking through the tutorial http://lucene.apache.org/nutch/tutorial.html, just a little suggestion. For the

s1=`ls -d segments/2* | tail -1`
s2=`ls -d segments/2* | tail -1`
s3=`ls -d segments/2* | tail -1`

I suggest using \ls, just in case users have an alias like

alias ls='ls -lFa'

like me. With such an alias, and without the \ls, echo $s1 gives something like

drwxr-xr-x  8 nutch nutch 4096 Sep  9 03:08 segments/20050909030535/

which isn't going to work so hot. Yeah, kind of dumb, I know, but pretty well any ls alias would break it. It only took me a couple of minutes to figure out, but I don't see a reason not to have \ls.

Thanks,
Earl
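The suggested fix can be demonstrated in a couple of lines; the segment names here are invented for the demo:

```shell
# Demonstration of the \ls suggestion: the leading backslash bypasses any
# shell alias (such as ls='ls -lFa'), so the variable captures only the
# newest segment directory name. Segment names are made up for the demo.
cd "$(mktemp -d)"
mkdir -p segments/20050909030535 segments/20050909031100
s1=$(\ls -d segments/2* | tail -1)   # \ls ignores an ls alias
echo "$s1"                           # → segments/20050909031100
```

With an aliased ls and no backslash, the long-format listing would end up in the variable instead of a plain path.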
Re: bug in bin/nutch?
> The DB format in the mapred branch is completely different. So, what you
> create with "admin db -create" is the old DB format, not used in the
> mapred branch. Please study the code of the Crawl command, this should
> help... Mapred stuff is powerful, but it is also very different from the
> current way of doing things, so there will be a lot to learn...

Guess I figured as much. Can I suggest that someone typing "bin/nutch admin ..." in the mapred branch should get pointed to the proper command, or at least a message saying that admin doesn't exist in the mapred branch, just to save some confusion? There is a dumb patch below that would change the usage line.

I think such differences are all the more reason to have a nice mapred tutorial, which I would be more than willing to help with. I thought I was close, but I have yet to get a mapred crawl/index/search completed. Your comment makes me think I am still a ways off.

Thanks,
Earl

Index: bin/nutch
===================================================================
--- bin/nutch   (revision 279734)
+++ bin/nutch   (working copy)
@@ -29,7 +29,7 @@
   echo "Usage: nutch COMMAND"
   echo "where COMMAND is one of:"
   echo "  crawl             one-step crawler for intranets"
-  echo "  admin             database administration, including creation"
+  echo "  admin             not used in mapred"
   echo "  inject            inject new urls into the database"
   echo "  generate          generate new segments to fetch"
   echo "  fetch             fetch a segment's pages"
Re: bug in bin/nutch?
Earl Cahill wrote:
> Guess I figured as much. Can I suggest that someone typing bin/nutch
> admin ... in the mapred branch should get pointed to the proper command,
> or at least a message saying that

There is no separate command - for now the DB is created when you run Injector or Crawl (which calls Injector as the first step). Other commands from the script should work very similarly, even though they now use different implementations:

* inject - runs Injector to add urls from a plain-text file (one url per line; there may be many input files, and they must be placed inside a directory). This creates the CrawlDB in the destination directory if it didn't exist before, or updates the existing one. Note that the new CrawlDB does NOT contain links - they are stored separately in a LinkDB, and the CrawlDB just stores the equivalents of Page in the former WebDB.

* generate - runs Generate to create new fetchlists to be fetched

* fetch - runs the modified Fetcher to fetch segments

* updatedb - runs CrawlDB.update() to update the CrawlDB with new page information, and to add new unfetched pages.

* invertlinks - creates or updates a LinkDB, containing incoming link information. Note that it takes as an argument the top-level dir where the new segments are contained, not the dir names of the segments...

* index - runs the new modified Indexer to create an index of the fetched segments.

The above commands read the mapred configuration, and for now it defaults to local, which means that all jobs execute within the same JVM, and NDFS also defaults to local. The rest of the commands in bin/nutch have to do with a distributed setup.

> admin doesn't exist in the mapred branch, just to save some confusion.
> There is a dumb patch below that would change the usage line. I think
> such differences are all the more reason to have a nice mapred tutorial,
> which I would be more than willing to help with. I thought I was close,
> but

Yes, I agree.
But there are still some command-line tools missing, or not yet ported to use mapred. At this point a general tutorial would be difficult... unless it would simply be "you need to run ./nutch crawl ...".

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com
Contact: info at sigram dot com
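The command sequence Andrzej lists can be strung together as one crawl cycle. This is a dry-run sketch, not a tested recipe: NUTCH is set to echo so the sequence can be printed without a Nutch checkout (swap in bin/nutch to run it for real), the directory names are illustrative, and the exact arguments may differ in the mapred branch:

```shell
# Dry-run sketch of one mapred crawl cycle using the commands above.
# NUTCH=echo just prints each command; replace with NUTCH=bin/nutch to
# actually run them. Directory names are illustrative, and exact argument
# forms may differ in the mapred branch.
NUTCH=echo
$NUTCH inject crawldb urls            # creates/updates CrawlDB from a url dir
$NUTCH generate crawldb segments      # writes a new fetchlist segment
seg=segments/20050909030535           # illustrative: the newest segment dir
$NUTCH fetch "$seg"                   # fetches the segment's pages
$NUTCH updatedb crawldb "$seg"        # folds fetch results into the CrawlDB
$NUTCH invertlinks linkdb segments    # note: top-level segments dir
$NUTCH index indexes crawldb linkdb segments
```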
Re: nutch 0.7 bug?
Hi Michael,

I am going back to a nightly build. I think this problem is related to the 'fetcher.threads.per.host' value, when it is bigger than 1. There are other possible sources: fetcher.threads.fetch, fetcher.threads.per.host, or parser.threads.parse.

Best Regards,
Ferenc

> Hi Ferenc,
> I see the same errors. As I've seen a running installation yesterday, I
> think it's a configuration mistake. By now I have no idea where. Have
> you made any progress?
> Regards
> Michael

[EMAIL PROTECTED] wrote:

Dear Developers!

I tested nutch 0.7 with all the parser plugins, and found the following: the fetch breaks with e.g. the following:

050901 110915 fetch okay, but can't parse http://www.dienes-eu.sulinet.hu/informatika/2005/tantervek/hpp/9.doc, reason: failed (2,200): org.apache.nutch.parse.msword.FastSavedException: Fast-saved files are unsupported at this time
050901 110915 fetching http://en.mimi.hu/fishing/scad.html
050901 110917 SEVERE error writing output:java.lang.NullPointerException
java.lang.NullPointerException
        at org.apache.nutch.parse.ParseData.write(ParseData.java:109)
        at org.apache.nutch.io.SequenceFile$Writer.append(SequenceFile.java:137)
        at org.apache.nutch.io.MapFile$Writer.append(MapFile.java:127)
        at org.apache.nutch.io.ArrayFile$Writer.append(ArrayFile.java:39)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.outputPage(Fetcher.java:281)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.handleFetch(Fetcher.java:261)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:148)
050901 110917 SEVERE error writing output:java.io.IOException: key out of order: 319 after 319
java.io.IOException: key out of order: 319 after 319
        at org.apache.nutch.io.MapFile$Writer.checkKey(MapFile.java:134)
        at org.apache.nutch.io.MapFile$Writer.append(MapFile.java:120)
        at org.apache.nutch.io.ArrayFile$Writer.append(ArrayFile.java:39)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.outputPage(Fetcher.java:281)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.handleFetch(Fetcher.java:261)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:148)
Exception in thread "main" java.lang.RuntimeException: SEVERE error logged. Exiting fetcher.
        at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:354)
        at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:488)
050901 110921 SEVERE error writing output:java.io.IOException: key out of order: 319 after 319
java.io.IOException: key out of order: 319 after 319
        at org.apache.nutch.io.MapFile$Writer.checkKey(MapFile.java:134)
        at org.apache.nutch.io.MapFile$Writer.append(MapFile.java:120)
        at org.apache.nutch.io.ArrayFile$Writer.append(ArrayFile.java:39)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.outputPage(Fetcher.java:281)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.handleFetch(Fetcher.java:261)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:148)
050901 110921 SEVERE error writing output:java.io.IOException: key out of order: 319 after 319
java.io.IOException: key out of order: 319 after 319
        at org.apache.nutch.io.MapFile$Writer.checkKey(MapFile.java:134)
        at org.apache.nutch.io.MapFile$Writer.append(MapFile.java:120)
        at org.apache.nutch.io.ArrayFile$Writer.append(ArrayFile.java:39)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.outputPage(Fetcher.java:281)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.handleFetch(Fetcher.java:261)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:148)
050901 110921 SEVERE error writing output:java.io.IOException: key out of order: 319 after 319
java.io.IOException: key out of order: 319 after 319
        at org.apache.nutch.io.MapFile$Writer.checkKey(MapFile.java:134)
        at org.apache.nutch.io.MapFile$Writer.append(MapFile.java:120)
        at org.apache.nutch.io.ArrayFile$Writer.append(ArrayFile.java:39)
etc.
---
These are the differences between nutch-site.xml and nutch-default.xml:
---

* nutch-default.xml

  <name>http.timeout</name>
  <value>1</value>
  <description>The default network timeout, in milliseconds.</description>

* nutch-site.xml

  <name>http.timeout</name>
  <value>3</value>
  <description>The default network timeout, in milliseconds.</description>

* nutch-default.xml

  <name>http.max.delays</name>
  <value>3</value>
  <description>The number of times a thread will delay when trying to

* nutch-site.xml

  <name>http.max.delays</name>
  <value>6</value>
  <description>The number of times a thread will delay when trying to
Re: fetch performance
AJ wrote:
> I tried to run 10 cycles of fetch/updatedb. In the 3rd cycle, the fetch
> list had 8810 urls. Fetch ran pretty fast on my laptop until 4000 pages
> were fetched. After 4000 pages, it suddenly switched to a very slow
> speed, about 30 mins for just 100 pages. My laptop also started to run
> at 100% CPU all the time. Is there a threshold for fetch-list size,
> above which fetch performance will be degraded? Or was it because of my
> laptop? I know the -topN option can control the fetch size, but
> topN=4000 seems too small because it will end up with thousands of
> segments. Is there a good rule of thumb for the topN setting? A related
> question is how big a segment should be in order to keep the number of
> segments small without hitting fetch performance too much. For example,
> to crawl 1 million pages in one run (with many fetch cycles), what
> would be a good limit for each fetch list?

There are no artificial limits like that - I'm routinely fetching segments of 1 mln pages. Most likely what happened to you is that:

* you are using a Nutch version with PDFBox 0.7.1 or below,
* you fetched a rare kind of PDF which puts PDFBox in a tight loop,
* the thread that got stuck is consuming 99% of your CPU. :-)

Solution: upgrade PDFBox to the yet unreleased 0.7.2.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com
Contact: info at sigram dot com