Re: Luke and Indexes
Bryan Woliner wrote:
> I have a couple of very basic questions about Luke and indexes in general. Answers to any of these questions are much appreciated:
>
> 1. In the Luke overview tab, what does "Index version" refer to?

It's the time (as in System.currentTimeMillis()) when the index was last modified.

> 2. Also in the overview tab, if "Has Deletions?" is equal to yes, what are the possible sources of deletions? Dedup? Manual deletions through Luke?

Either. Both.

> 3. Is there any way (with Luke or otherwise) to get a file listing all of the docs in an index? Basically, is there an index equivalent of this command (which outputs all the URLs in a segment):
>
>   bin/nutch org.apache.nutch.pagedb.FetchListEntry -dumpurls segmentsDir

You can browse through documents on the Document tab, but there is no option to dump all documents to a file. Besides, fields which are not stored are no longer accessible, so you cannot retrieve them from the index (you may be able to reconstruct them, but it's a lossy operation).

> 4. Finally, my last question is the one I'm most perplexed by: I called bin/nutch segread -list -dir for a particular segments directory and found out that one directory had 93 entries. BUT, when I opened up the index of that segment in Luke, there were only 23 documents (and 3 deletions)! Where did the rest of the URLs go?

Do a segread -dump and check what the protocol status and parse status are for the pages that didn't make it into the index. Most likely you encountered either protocol errors or parsing errors, so there was nothing to index from these entries. In addition, if you ran deduplication, some of the entries in your index may have been deleted because they were considered duplicates.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
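The diagnostic step described above can be run along these lines; the segment name is only an example (substitute your own), and the exact layout of segread's dump output varies between Nutch versions:

```
# Dump the full segment contents to a file, then inspect the protocol/parse
# status entries for pages that are missing from the index.
# (segment path and output filename are examples)
bin/nutch segread -dump -dir segments/20051208120000 > segdump.txt
```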
how to
How can I make Nutch support Chinese? -- My Web: http://kukusky.8800.org MSN: [EMAIL PROTECTED] E-mail: [EMAIL PROTECTED]
Plugin path in Nutch web
Hi everyone,

I'm writing a JSP program to allow crawling via the web. My JSP script follows nutch.tools.CrawlTool, which tries to create the database, inject the database, fetch and index. I have difficulty identifying the plugins. Creating the database is fine, because it doesn't require any plugin. But when it comes to injection, I get this error:

java.lang.ExceptionInInitializerError
  org.apache.nutch.db.WebDBInjector.addPage(WebDBInjector.java:437)
  org.apache.nutch.db.WebDBInjector.injectURLFile(WebDBInjector.java:378)
  org.apache.nutch.db.WebDBInjector.main(WebDBInjector.java:535)
  org.apache.jsp.crawl_jsp._jspService(crawl_jsp.java:151)
  org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:94)
  javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
  org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:324)
  org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:292)
  org.apache.jasper.servlet.JspServlet.service(JspServlet.java:236)
  javax.servlet.http.HttpServlet.service(HttpServlet.java:802)

From the ExceptionInInitializerError, I guess the problem is in loading the plugins (since loading is static). Could anyone suggest how to set the path to the plugins in nutch-default.xml? Thanks a lot.

Regards, Giang
Crawling - dynamically generate web pages with paginations
Hi all, I need to crawl a page whose pagination goes deeper than the depth I specify. Can I crawl all the pages listed in the pagination, or is there any way in Nutch to increase the depth when a page has pagination? Any ideas would help greatly. Thanks in advance. Regards, -Hussain.
Too many open file error -while searching using Nutch
Hi all, I get a "Too many open files" error when I try to search my segments, which number in the hundreds. Is there any way to solve this issue? Does Nutch not close the segments after the search? Kindly send your suggestions for the above issue. Thanks in advance. Regards, -Hussain.
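Until the underlying cause is found, a common workaround for "Too many open files" is to raise the per-process file-descriptor limit in the shell that starts the search server. A minimal sketch (actual limits are system-dependent):

```shell
# Show the current soft limit on open file descriptors.
ulimit -n
# Raise the soft limit up to the hard limit (allowed without root).
ulimit -n "$(ulimit -Hn)"
# Verify the new limit.
ulimit -n
```

Start Tomcat from the same shell so the server process inherits the raised limit.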
Crawling listing (pagination) pages.
Hi all, does Nutch crawl pages reached through listing pages (pages with pagination, as in search engines)? While crawling with Nutch I need to get the pages displayed by the pagination without increasing the depth of the whole crawl. Does Nutch provide any plugin for this? Is there any way to solve this issue? Any help is greatly appreciated. Thanks in advance. Regards, -Hussain
Re: Crawling listing (pagination) pages.
Hi, I am facing the same problem. However, my crawl only focuses on some websites, and I recognize the pagination URLs using regexps and inject them in every fetch cycle. /Jack

On 12/8/05, K.A.Hussain Ali [EMAIL PROTECTED] wrote:
> [...]

-- Keep Discovering ... ... http://www.jroller.com/page/jmars
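Jack's approach can be illustrated outside Nutch with a plain regexp filter over a URL list; the pagination pattern below is made up for the example (real listing pages need site-specific patterns), and in Nutch itself such an expression would belong in the urlfilter-regex configuration:

```shell
# Filter a list of URLs down to the ones that look like pagination links.
# The "?page=N" / "/page/N" pattern is a hypothetical example.
printf '%s\n' \
  'http://example.com/list?page=2' \
  'http://example.com/article/42' \
  'http://example.com/archive/page/3' \
  | grep -E '[?&]page=[0-9]+|/page/[0-9]+'
```

This prints the first and third URLs, which would then be re-injected each fetch cycle.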
How to refresh the application context - to use the merged index
Hi all, while crawling with Nutch I do segment merging and indexing, but the search doesn't look into the new merged segment unless I restart the server. Is there any way to refresh the application context so it picks up the new merged index without stopping the server? Does Nutch provide any option for this? Any help would be greatly appreciated. Thanks in advance. Regards, -Hussain.
After mergesegs
Hi, just wanted to be sure: after I merge segments via the mergesegs tool, I need to run the updatedb tool before dropping the new indexes in, correct? And, as just posted, I need to shut down and restart Tomcat too, yes? Thanks, DaveG
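For reference, the sequence being asked about looks roughly like this; the exact arguments differ between Nutch versions, so check the usage output of each tool before running (argument placeholders are deliberately left unfilled):

```
bin/nutch mergesegs ...   # merge the segments
bin/nutch updatedb ...    # update the web db from the merged segment
bin/nutch index ...       # build the new index
# then restart the servlet container so the webapp reopens the new index
```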
Re: Luke and Indexes
Thank you very much for the helpful answers. Most of the pages that didn't make it into the index were indeed due to protocol errors (mostly exceeding http.max.delays). One quick side note: when I was looking at the Nutch wiki page for bin/nutch segread, I noticed an error on the page and wasn't sure how to go about fixing it, or alerting someone who can. The page currently reads:

  -nocontent    ignore content data
  -noparsedata  ignore parse_data data
  -nocontent    ignore parse_text data

The second -nocontent should probably be -noparsetext, right? Thanks again for the help, Bryan

On 12/8/05, Andrzej Bialecki [EMAIL PROTECTED] wrote:
> [...]
Problem with fetching segment
I have followed the media-style.com quick tutorial, but when I try to fetch my segment, the fetch is killed! I have tried setting the system timer forward 30 days; no anti-virus is running on the systems. Systems: SUSE 9.2 and SUSE 10.

# bin/nutch fetch segments/20060109014654/
060109 014714 parsing file:/home/hkongsgaard/nutch-0.7.1/conf/nutch-default.xml
060109 014715 parsing file:/home/hkongsgaard/nutch-0.7.1/conf/nutch-site.xml
060109 014715 No FS indicated, using default:local
060109 014715 Plugins: looking in: /home/hkongsgaard/nutch-0.7.1/plugins
060109 014715 not including: /home/hkongsgaard/nutch-0.7.1/plugins/query-more
060109 014715 parsing: /home/hkongsgaard/nutch-0.7.1/plugins/query-site/plugin.xml
060109 014715 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.site.SiteQueryFilter
060109 014715 parsing: /home/hkongsgaard/nutch-0.7.1/plugins/parse-html/plugin.xml
060109 014715 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.html.HtmlParser
060109 014715 parsing: /home/hkongsgaard/nutch-0.7.1/plugins/parse-text/plugin.xml
060109 014715 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.text.TextParser
060109 014715 not including: /home/hkongsgaard/nutch-0.7.1/plugins/parse-ext
060109 014715 not including: /home/hkongsgaard/nutch-0.7.1/plugins/parse-pdf
060109 014715 not including: /home/hkongsgaard/nutch-0.7.1/plugins/parse-rss
060109 014715 parsing: /home/hkongsgaard/nutch-0.7.1/plugins/query-basic/plugin.xml
060109 014715 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.basic.BasicQueryFilter
060109 014715 not including: /home/hkongsgaard/nutch-0.7.1/plugins/index-more
060109 014715 not including: /home/hkongsgaard/nutch-0.7.1/plugins/parse-js
060109 014715 parsing: /home/hkongsgaard/nutch-0.7.1/plugins/urlfilter-regex/plugin.xml
060109 014715 impl: point=org.apache.nutch.net.URLFilter class=org.apache.nutch.net.RegexURLFilter
060109 014715 not including: /home/hkongsgaard/nutch-0.7.1/plugins/protocol-ftp
060109 014715 not including: /home/hkongsgaard/nutch-0.7.1/plugins/parse-msword
060109 014715 not including: /home/hkongsgaard/nutch-0.7.1/plugins/creativecommons
060109 014715 not including: /home/hkongsgaard/nutch-0.7.1/plugins/ontology
060109 014715 parsing: /home/hkongsgaard/nutch-0.7.1/plugins/nutch-extensionpoints/plugin.xml
060109 014715 not including: /home/hkongsgaard/nutch-0.7.1/plugins/protocol-file
060109 014715 parsing: /home/hkongsgaard/nutch-0.7.1/plugins/protocol-http/plugin.xml
060109 014715 impl: point=org.apache.nutch.protocol.Protocol class=org.apache.nutch.protocol.http.Http
060109 014715 not including: /home/hkongsgaard/nutch-0.7.1/plugins/clustering-carrot2
060109 014715 not including: /home/hkongsgaard/nutch-0.7.1/plugins/language-identifier
060109 014715 not including: /home/hkongsgaard/nutch-0.7.1/plugins/urlfilter-prefix
060109 014715 parsing: /home/hkongsgaard/nutch-0.7.1/plugins/query-url/plugin.xml
060109 014715 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.url.URLQueryFilter
060109 014715 parsing: /home/hkongsgaard/nutch-0.7.1/plugins/index-basic/plugin.xml
060109 014715 impl: point=org.apache.nutch.indexer.IndexingFilter class=org.apache.nutch.indexer.basic.BasicIndexingFilter
060109 014715 not including: /home/hkongsgaard/nutch-0.7.1/plugins/protocol-httpclient
060109 014715 logging at INFO
060109 014715 fetching http://www.sourceforge.net/
060109 014715 fetching http://www.apache.org/
060109 014715 fetching http://www.nutch.org/
060109 014715 http.proxy.host = null
060109 014715 http.proxy.port = 8080
060109 014715 http.timeout = 1
060109 014715 http.content.limit = -1
060109 014715 http.agent = NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
060109 014715 fetcher.server.delay = 5000
060109 014715 http.max.delays = 52
060109 014718 Using URL normalizer: org.apache.nutch.net.BasicUrlNormalizer
060109 014724 status: segment 20060109014654, 3 pages, 0 errors, 51033 bytes, 8309 ms
060109 014724 status: 0.36105427 pages/s, 47.98355 kb/s, 17011.0 bytes/page
Re: Plugin path in Nutch web
Make sure that you have specified the correct name/path for the plugins directory (plugin.folders). Also, both the plugin and the Nutch source must be from the same version. Another solution is to download the latest trunk version of Nutch from SVN.

On 12/8/05, Nguyen Ngoc Giang [EMAIL PROTECTED] wrote:
> [...]
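A sketch of what the plugin.folders entry looks like; the value path is an example, and local overrides conventionally go in nutch-site.xml rather than nutch-default.xml:

```
<property>
  <name>plugin.folders</name>
  <value>/full/path/to/nutch/plugins</value>
  <description>Directories where Nutch plugins are located.</description>
</property>
```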