Parsed Text and Re-parsing
Hi all, As suggested in the JIRA discussion, I would like to ask about 2 issues: 1. Can I change the default NekoHTML parser behaviour so that it emits a tab instead of a space for block-level elements? 2. Can I re-parse the crawled pages? Does re-parsing change the segments? Thank you for any suggestions and comments. Vinci -- View this message in context: http://www.nabble.com/Parsed-Text-and-Re-parsing-tp16392741p16392741.html Sent from the Nutch - User mailing list archive at Nabble.com.
Delete document from segment/index
Hi all, Is it possible to delete a document from the Nutch index and segments? Thank you, Vinci -- View this message in context: http://www.nabble.com/Delete-document-from-segment-index-tp16254945p16254945.html Sent from the Nutch - User mailing list archive at Nabble.com.
Re: RSS parser plugin bug?
Hi sishen, You should; atom feed support has been broken for quite a long time. If you don't want to replace the original plugin, just use another name, especially since your plugin only works for atom feeds. I think you should use a name like parse-atom or atom-parser; please refer to the rss parser plugin for the naming convention. Follow up: after checking more of the feeds I crawled, apart from the broken characters, I found that not all titles get mis-parsed: some text is parsed correctly, some isn't, but both are well-formed... Thank you, Vinci sishen wrote: > > I also prefer title than description. > > Also, I found there is some problems to parse the atom feed with the lib > "commons-feedparser". > I have implemented a new plugin to fix the problem with > rome<https://rome.dev.java.net/>. > > > But i doubt whether should I submit it to the nutch trunk? > > Best regards. > > sishen > > On Mon, Mar 24, 2008 at 3:36 PM, Vinci <[EMAIL PROTECTED]> wrote: > >> >> Hi all, >> I found that the rss parser plugin is using the content text in >> as anchor text but not the - so that it always >> index >> the description, but the title text is always not indexed or used as >> anchor >> text. >> >> But actually the title is much more valuable and should be used as anchor >> text. >> >> Is this a bug or a misunderstanding of RSS? If this is a bug, can anybody >> post in JIRA ? >> >> Thank you for your attention. >> -- >> View this message in context: >> http://www.nabble.com/RSS-parser-plugin-bug--tp16246578p16246578.html >> Sent from the Nutch - User mailing list archive at Nabble.com. >> >> > > -- View this message in context: http://www.nabble.com/RSS-parser-plugin-bug--tp16246578p16249932.html Sent from the Nutch - User mailing list archive at Nabble.com.
Broken crawled content?
Hi all, I am trying to dump the content with the segment reader (bin/nutch readseg -dump). The output text contains 2 encodings: UTF-8 and another multi-byte character encoding. When I open the dumped page, I find the multi-byte-encoded text is broken - even after I convert it to the correct encoding, the displayed text is still broken. How can I fix the text? Thank you. -- View this message in context: http://www.nabble.com/Broken-crawled-content--tp16246942p16246942.html Sent from the Nutch - User mailing list archive at Nabble.com.
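One way to check whether the dumped bytes are actually corrupted (rather than merely mislabeled) is to validate them with iconv, which exits non-zero at the first byte sequence that is invalid in the declared encoding. A minimal sketch with standard tools; the readseg path is an example, and sample.txt stands in for the dumped file:

```shell
# Dump the raw content first, e.g. (segment path is an example):
#   bin/nutch readseg -dump crawl/segments/20080324000000 dump_out
# Then validate the bytes against the encoding you expect:
printf 'caf\303\251\n' > sample.txt   # 0xC3 0xA9 is a valid UTF-8 sequence
if iconv -f UTF-8 -t UTF-8 sample.txt > /dev/null 2>&1; then
  echo "bytes are valid UTF-8"
else
  echo "bytes are corrupted for this encoding"
fi
```

If iconv already rejects the bytes, the content was damaged at fetch time and no later conversion will fix it; if it passes, the problem is only the encoding label used when displaying the dump.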
RSS parser plugin bug?
Hi all, I found that the rss parser plugin uses the content text in <description> as the anchor text but not the <title> - so it always indexes the description, while the title text is never indexed or used as anchor text. But actually the title is much more valuable and should be used as the anchor text. Is this a bug or a misunderstanding of RSS on my part? If this is a bug, can anybody post it in JIRA? Thank you for your attention. -- View this message in context: http://www.nabble.com/RSS-parser-plugin-bug--tp16246578p16246578.html Sent from the Nutch - User mailing list archive at Nabble.com.
Nutch crawled page status code explanation needed
Hi all, I am beginning to work with Nutch-fetched pages. When I try to dump segments, I see there are many status codes, e.g. 67 (linked), 65 (signature), 33 (fetch_success), etc. I googled but found no further clues; can anyone give a list of those status codes and explain their differences? Thank you. -- View this message in context: http://www.nabble.com/Nutch-crawled-page-status-code-explanation-needed-tp16237183p16237183.html Sent from the Nutch - User mailing list archive at Nabble.com.
RE: Recrawling without deleting crawl directory
Hi, It seems you need to explain what a "modified document" means here. Which case is it? Case 1: you dump the crawled pages from a Nutch segment and do whatever you like with them. If this is the case, you need to decide which action you want: I. modify the documents and then ask Nutch to crawl the modified directory? II. modify the documents, write them back to the segment (the crawl DB), then do the indexing? Case 2: keeping track of document updates. In this case, if you keep re-crawling based on the same crawl DB (you just need to tune the re-crawl interval in days), Nutch will do the update for you. Hope it helps :) Jean-Christophe Alleman wrote: > > > > Hi, > > I have nothing said. This works fine ! It's morning and I'm still not woke > up :-D > > I just want to know if it was possible to re index modified documents ? Or > re index documents which are already in database ? > > Thank's in advance ! > > Jisay > > >> >> Hi Susam Pal and thank's for your help ! >> >> The solution you give to me doesn't work... I have still an error with >> Hadoop... And if I download an older version of the API, will this patch >> work ? I have Nutch-0.9 and I don't know if I compile with an oder Hadoop >> API, this patch will work. But if it will work where can I find an older >> version of Hadoop API ? >> >> Thank's in advance for your help, >> >> Jisay >> >> >>> >>> I am not sure but it seems that this is because of an older version of >>> Hadoop. I don't have older versions of Nutch or Hadoop with me to >>> confirm this. Just try omitting the second argument in: >>> fs.listPaths(indexes, HadoopFSUtil.getPassAllFilter()) and see if it >>> compiles? >>> >>> I guess, fs.listPaths(indexes) should work since I can find such a >>> method (though it is deprecated now) in the latest Hadoop API. >>> >>> Regards, >>> Susam pal >>> >>> On Tue, Mar 18, 2008 at 9:09 PM, Jean-Christophe Alleman >>> wrote: Thank's for your reply Susam Pal ! I have run ant and I have an error I can't resolve... 
Look at this :

debian:~/nutch-0.9# ant
Buildfile: build.xml
init:
    [unjar] Expanding: /root/nutch-0.9/lib/hadoop-0.12.2-core.jar into /root/nutch-0.9/build/hadoop
    [untar] Expanding: /root/nutch-0.9/build/hadoop/bin.tgz into /root/nutch-0.9/bin
    [unjar] Expanding: /root/nutch-0.9/lib/hadoop-0.12.2-core.jar into /root/nutch-0.9/build
compile-core:
    [javac] Compiling 133 source files to /root/nutch-0.9/build/classes
    [javac] /root/nutch-0.9/src/java/org/apache/nutch/crawl/Crawl.java:150: cannot find symbol
    [javac] symbol : variable HadoopFSUtil
    [javac] location: class org.apache.nutch.crawl.Crawl
    [javac] merger.merge(fs.listPaths(indexes, HadoopFSUtil.getPassAllFilter()),
    [javac] ^
    [javac] Note: Some input files use or override a deprecated API.
    [javac] Note: Recompile with -Xlint:deprecation for details.
    [javac] Note: Some input files use unchecked or unsafe operations.
    [javac] Note: Recompile with -Xlint:unchecked for details.
    [javac] 1 error

BUILD FAILED
/root/nutch-0.9/build.xml:106: Compile failed; see the compiler error output for details.

Total time: 8 seconds

I have already corrected 3 errors but I can't correct this one... I don't know what HadoopFSUtil is, so I can't correct the error... Help me please, Thank's for your help ! Jisay > > The patch was generated for Nutch 1.0 development version which is > currently in trunk. So, it is unable to patch your older version > cleanly. > > I also see that you are using NUTCH-601v0.3.patch. However, > NUTCH-601v1.0.patch is the recommended patch. If this patch fails, you > can make the modifications manually. This patch is extremely simple > and if you just open the patch using a text editor, you would find > that 3 lines have been removed from the original source code > (indicated by leading minus signs) and 11 new lines have been added > (indicated by plus signs). You have to make these changes manually to > your Nutch 0.9 source code directory. 
> > Once you make the changes, just build your project again with ant and > you would be ready for recrawl. > > Regards, > Susam Pal > > On Tue, Mar 18, 2008 at 7:12 PM, Jean-Christophe Alleman > wrote: >> >> >> Hi, I'm interested by this patch but I can't patch it. I have some >> problems when I try to patch... >> >> Here is what I do : >> >> debian:~/patch# patch -p0> can't find file to patch at input line 5 >> Perhaps you used the wrong -p or --strip option? >> The text leading up to this was: >> -- >> |Index: src/java/org/apache/nutch/crawl/Crawl.java >> |=== >> |--- s
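Coming back to Case 2 (recrawling against the same crawl DB): the usual recrawl loop is roughly the following. This is a hedged sketch for Nutch 0.9, not a drop-in script; directory names are examples and should match your layout:

```shell
#!/bin/sh
CRAWL=crawl                                          # example crawl directory
bin/nutch generate $CRAWL/crawldb $CRAWL/segments    # select URLs due for refetch
SEG=$CRAWL/segments/`ls $CRAWL/segments | tail -1`   # the newly created segment
bin/nutch fetch $SEG                                 # refetch them
bin/nutch updatedb $CRAWL/crawldb $SEG               # fold new status back into the db
bin/nutch invertlinks $CRAWL/linkdb $SEG             # refresh the link db
bin/nutch index $CRAWL/newindexes $CRAWL/crawldb $CRAWL/linkdb $SEG
```

Pages whose fetch interval has not yet expired are simply not selected by generate, which is what makes repeated runs against the same crawl db behave as an update rather than a full recrawl.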
Re: nutch 0.9, tomcat 6.0.14, nutchbean okay, tomcat search error
Hi, congrats :) btw, unless you set permissions other than 755, there is not much permission handling you need to care about if you use Tomcat. One question: did you change the plugin list? Which plugins are you using? I wonder how you got the language of your query... John Mendenhall wrote: > >> please check the path of the search.dir in property file (nutch-site.xml) >> located in webapps/nutch_depoly_directory/WEB-INF/classes, check it is >> accessable or not. >> >> if you use absolute path then this will be another problem > > Super! Thanks a bunch! That was it. > The property is actually searcher.dir. > We always use absolute paths since it helps tremendously > not having to worry about where one is when the process is > started. > > We had moved it from one machine to another and had > forgotten to make sure the tomcat process owner 'tomcat' > was in the nutch group 'nutch'. Fixed that and it works > like a charm. > > Thanks again! > > JohnM > > -- > john mendenhall > [EMAIL PROTECTED] > surf utopia > internet services > > -- View this message in context: http://www.nabble.com/nutch-0.9%2C-tomcat-6.0.14%2C-nutchbean-okay%2C-tomcat-search-error-tp16073740p16075816.html Sent from the Nutch - User mailing list archive at Nabble.com.
Re: nutch 0.9, tomcat 6.0.14, nutchbean okay, tomcat search error
Hi, please check the path in the searcher.dir property in the property file located in webapps/nutch_deploy_directory/WEB-INF/classes, and check whether it is accessible. If you are already using an absolute path, then the problem is something else. Hope it helps John Mendenhall wrote: > > I am running nutch 0.9, with tomcat 6.0.14. > When I use the NutchBean to search the index, > it works fine. I get back results, no errors. > I have used tomcat before and it has worked > fine. > > Now I am getting an error searching through > tomcat. This is the tomcat error I am seeing > in the catalina.out log file: > > - > 2008-03-15 15:38:38,715 INFO NutchBean - query request from > 192.168.245.58 > 2008-03-15 15:38:38,717 INFO NutchBean - query: penasquitos > 2008-03-15 15:38:38,717 INFO NutchBean - lang: en > Mar 15, 2008 3:38:41 PM org.apache.catalina.core.StandardWrapperValve > invoke > SEVERE: Servlet.service() for servlet jsp threw exception > java.lang.NullPointerException > at > org.apache.nutch.searcher.FetchedSegments.getSummary(FetchedSegments.java:159) > at > org.apache.nutch.searcher.FetchedSegments$SummaryThread.run(FetchedSegments.java:177) > - > > When I run a search using the NutchBean, I > see debug log entries in the hadoop.log. > When I run the search using Tomcat, I never > see any hadoop.log entires. > > We have 1.4 million indexed pages, taking > up 31gb for the nutch/crawl directory. > > The search term doesn't matter. > > My guess is it may be a memory error, > but I am not seeing it anywhere. > Is there a place where I can set the memory > footprint for tomcat to use more memory? > > Or, is there another place I should be looking? > > Thanks in advance for any pointers or assistance. > > JohnM > > -- > john mendenhall > [EMAIL PROTECTED] > surf utopia > internet services > > -- View this message in context: http://www.nabble.com/nutch-0.9%2C-tomcat-6.0.14%2C-nutchbean-okay%2C-tomcat-search-error-tp16073740p16075186.html Sent from the Nutch - User mailing list archive at Nabble.com.
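For reference, the property in question lives in WEB-INF/classes/nutch-site.xml of the deployed web app; a minimal fragment (the path value is an example):

```xml
<property>
  <name>searcher.dir</name>
  <!-- example absolute path; must be readable by the user running Tomcat -->
  <value>/home/nutch/crawl</value>
</property>
```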
incorrect Query tokenization
Hi all, I have changed the NutchAnalyzer used in the indexing phase via a plugin (a plug-in based on analysis-de or analysis-fr), but I found the query is still tokenized the old way - it looks like the query is not parsed with the same tokenizer that indexed the documents... I checked the index; the documents are indexed as I want. I also checked the hadoop log; all plugins are loaded (including the one that changes the indexer). However, both from the NutchBean and from the webapp, the tokenization is not correct. How can I fix it? (*The fastest solution looks like assigning the language of the query [via the language-identifier plugin], but I don't know where to start...) -- View this message in context: http://www.nabble.com/incorrect-Query-tokenization-tp16070144p16070144.html Sent from the Nutch - User mailing list archive at Nabble.com.
Missing zh.ngp for zh locale support for language identifier
Hi all, I found that zh.ngp for the zh locale is missing. I have seen this file in a screenshot, but googling the filename returns nothing for me... can anyone provide this file? Thank you -- View this message in context: http://www.nabble.com/Missing-zh.ngp-for-zh-locate-support-for-language-Identifier-tp16068532p16068532.html Sent from the Nutch - User mailing list archive at Nabble.com.
Re: Confusion of -depth parameter
Hi all, [This is a follow-up post] I found this was my own mistake, so I need to crawl one more level than I expected. Thank you Vinci wrote: > > Hi all, > > I have a confusion of the keyword depth... > > -seed.txt url1 -link1 >-link2 >-link3 >-link4 > url2 -link5 > ...etc > > However, I found the second level link (begin with -link) cannot be > crawled unless I set the depth is 3 but not 2, why? Does the depth 1 is > the seed url file? > -- View this message in context: http://www.nabble.com/Confusion-of--depth-parameter-tp16047305p16067808.html Sent from the Nutch - User mailing list archive at Nabble.com.
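As generally documented, the depth parameter counts fetch rounds: round 1 fetches the seed urls themselves, round 2 fetches the links found on those pages, and so on. So reaching links-of-links takes a command like the following (directory names and topN value are examples):

```shell
bin/nutch crawl urls -dir crawl -depth 3 -topN 1000
```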
Re: Change of analyzer for specific language
Hi all, [Follow-up post] I found the method by myself. 1. Write a plugin for your own language. For the method, you can refer to analysis-de and analysis-fr to see how to wrap a Lucene analyzer in your plugin. 2. Then add it to the plugin.includes list in nutch-site.xml. You also need to add the language-identifier plugin. 3. [For languages not supported by the language identifier, or if you think the language identifier is too slow] There is a 50% chance you will fail if you are writing for a European language, and a 100% chance if you are writing for an East Asian language. The reason is that when language identification fails - your language is not supported - the default indexer does the indexing for you. There are 2 methods: A. Hack the language-identifier plugin. i. Hack all the classes except LanguageIdentifier.java. The details are not given here because there are too many steps and I am writing in a rush, but the 2 principles are: a. remove every reference to a LanguageIdentifier object, including the declaration and the calls made through that reference (this is much easier with an IDE like NetBeans or Eclipse); b. remember to change the language variable in the inner class of HTMLLanguageParser, or change the default language returned when all cases fail. ii. Change langmappings.properties to the actual encoding of your language - include all possible combinations, in lower case, e.g. za = za, zah, utf, utf8. For the full list you can refer to the iconv supported-encodings list - most systems support almost everything, and you will see the variants of your encoding (utf-8 can be written utf-8, utf_8 or utf8!). You may also need to include the first part if the target encoding contains - or _, like utf and utf8 for utf-8 in the example. Then build language-identifier again. *For XML you need to create your own parser based on HTMLLanguageParser. 
But you will fall into the default case quite quickly if the XML is written badly enough to use UTF-8 as the encoding but have no lang element. B. Hack Indexer.java, as mentioned in this post: http://www.mail-archive.com/[EMAIL PROTECTED]/msg05952.html *For CJK, the default CJKAnalyzer can handle most cases (especially if you convert the documents to Unicode...); just let zh/ja/kr go through the default case. Vinci wrote: > > Hi all, > > How can I change the analyzer which is used by the indexer for specific > language? Also, can I use all the analyzer that I see in luke? > > Thank you. > -- View this message in context: http://www.nabble.com/Change-of-analyzer-for-specific-language-tp16065385p16067807.html Sent from the Nutch - User mailing list archive at Nabble.com.
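Step 2 above amounts to overriding plugin.includes in nutch-site.xml. A hedged fragment - analysis-xx stands in for your own analyzer plugin's id, and the rest of the list should be taken from your nutch-default.xml rather than copied from here:

```xml
<property>
  <name>plugin.includes</name>
  <!-- keep your existing plugins; add language-identifier plus your analyzer -->
  <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)|language-identifier|analysis-xx</value>
</property>
```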
Change of analyzer for specific language
Hi all, How can I change the analyzer used by the indexer for a specific language? Also, can I use all of the analyzers that I see in Luke? Thank you. -- View this message in context: http://www.nabble.com/Change-of-analyzer-for-specific-language-tp16065385p16065385.html Sent from the Nutch - User mailing list archive at Nabble.com.
Where is the crawled/cached page html?
Hi all, After looking through several materials, I find that Nutch indexes based on the parsed text - so if I don't want something to be indexed, I most likely need to remove it before the page is parsed to plain text... Also, where is the cached page HTML file located? Is it the pre-parse HTML or another HTML file stored somewhere? Thank you for any answers or discussion -- View this message in context: http://www.nabble.com/Where-is-the-crawled-cached-page-html--tp16048280p16048280.html Sent from the Nutch - User mailing list archive at Nabble.com.
Indexing problem - not to index some word appear in link?
Hi all, I found that the related-topics links are hurting search performance. Besides removing the hyperlinks in the parsing stage, can I avoid indexing the words inside the <a> element? -- View this message in context: http://www.nabble.com/Indexing-problem---not-to-index-some-word-appear-in-link--tp16047313p16047313.html Sent from the Nutch - User mailing list archive at Nabble.com.
Confusion of -depth parameter
Hi all, I have a confusion about the keyword depth...

seed.txt
  url1
    - link1
    - link2
    - link3
    - link4
  url2
    - link5
  ...etc

However, I found the second-level links (beginning with -link) cannot be crawled unless I set the depth to 3 rather than 2 - why? Does depth 1 correspond to the seed url file? -- View this message in context: http://www.nabble.com/Confusion-of--depth-parameter-tp16047305p16047305.html Sent from the Nutch - User mailing list archive at Nabble.com.
Crawler javascript handling, retrieve crawled HTML and modify the html structure?
Hi all, I found there are posts about how to retrieve the parsed text, but how can I get back the HTML version, especially with bin/nutch (like readseg or readdb)? If no command is available, which class should I deal with? Also, if I need to modify the HTML structure (adding or removing tags), is it better for me to transform the dumped HTML and then ask Nutch to crawl it back, or to ask another tool to do the indexing for me? *Does Nutch turn on javascript parsing by default? If so, how can I turn it off? -- View this message in context: http://www.nabble.com/Crawler-javascript-handling%2C-retrieve-crawled-HTML-and-modify-the-html-structure--tp16023197p16023197.html Sent from the Nutch - User mailing list archive at Nabble.com.
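On the javascript question: in Nutch 0.9 the parse-js plugin is pulled in through the default plugin.includes, so javascript link extraction can be switched off by overriding that property in nutch-site.xml with parse-js left out. A hedged sketch - check your own nutch-default.xml for the exact default list and trim accordingly:

```xml
<property>
  <name>plugin.includes</name>
  <!-- same as the default list but without parse-js -->
  <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|more|site|url)|summary-basic|scoring-opic</value>
</property>
```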
Re: Search server bin/nutch server?
Hi, I see your point and I understand the usage - after I start the server I still need the Nutch webapp to receive the query for me. So is there no way other than using the webapp for query processing, or calling the searcher on the command line? Thank you Tomislav Poljak wrote: > > Hi, > I'm not sure if I understand the question, but you can start server in > the background (bin/nutch server 4321 crawl/ &) and use it from Nutch > search web application on the same or any other machine. > > Tomislav > > On Tue, 2008-03-11 at 18:39 -0700, Vinci wrote: >> Hi, >> >> Thank you for the usage of this. >> One more question: If I started a search server to background, can I use >> it >> for receiving direct query from other webpage? >> Thank you >> >> >> Tomislav Poljak wrote: >> > >> > Hi, >> > this is used for Distributed Search, so if you want to use it start >> > server(s): >> > >> > bin/nutch server >> > >> > on the machine(s) where you have index(es) (you can put any free port >> > and crawl dir should point to your crawl folder). Then you should >> > configure Nutch search web app to use this server(s): you have to edit >> > nutch-site.xml in Nutch web application: point searcher.dir to folder >> > containing text file: search-servers.txt and in this file put >> server(s): >> > >> > server_host server_port >> > >> > and start/restart servlet container (Tomcat/Jetty/...) >> > >> > >> > Hope this helps, >> > >> > Tomislav >> > >> > >> > On Tue, 2008-03-11 at 03:06 -0700, Vinci wrote: >> >> How should I use this command to set up a search server to receive >> query? >> > >> > >> > >> > > > -- View this message in context: http://www.nabble.com/Search-server-bin-nutch-server--tp15975737p16002053.html Sent from the Nutch - User mailing list archive at Nabble.com.
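One option for taking queries directly from another page, without writing against the searcher API, is the OpenSearch servlet that ships with the Nutch web app, which returns search results as RSS. A hedged example - host and port are assumptions about your Tomcat setup:

```shell
curl 'http://localhost:8080/opensearch?query=nutch'
```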
Crawling Domain limited the url listed in seed file
Hi, To save resources, I want the crawler not to crawl links outside the domains of the urls in the seed file, so that it focuses on the current websites (each seed url's domain as well as its subdomains). What should I do to achieve this? -- View this message in context: http://www.nabble.com/Crawling-Domain-limited-the-url-listed-in-seed-file-tp16001433p16001433.html Sent from the Nutch - User mailing list archive at Nabble.com.
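With the one-step crawl command this is what conf/crawl-urlfilter.txt is for; a minimal sketch, where example.com stands in for your seed domain (add one accept rule per seed domain):

```
# accept example.com and any of its subdomains
+^http://([a-z0-9-]+\.)*example\.com/
# skip everything else
-.
```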
Re: About link analysis and filter usage, and Recrawling
Hi, Thank you so much. Most likely this will be the final post, and then my work will be done. Enis Soztutar wrote: > >> I need to remove the unnecessary html, xslt transformation(which will >> deal >> with the encoding issue for me), as well as file generation. >> For the program I have, dump out everything and not write back is much >> preferred, but look like If I do so I will lose some information of the >> page >> crawled? >> >> > You can write a parse-html plugin for this, or you can manually > manipulate the parse data by writing a > mapreduce program. > I see... As I understand it, the parsing phase also serves link analysis. If I do processing at this point, will I slow down the crawling? *MapReduce looks interesting, but I don't have enough time to go into depth. Enis Soztutar wrote: > >> Enis Soztutar wrote: >> >>> With the adaptive crawl, after several cycles the fetch frequency of a >>> page will be >>> automatically adjusted. >>> >>> >> so If I keep on crawling based on same crawldb, I will get this effect? >> > yes, exactly. > I see your point... One more question on this: after looking at some of the config files, I found the default recrawl interval is 15 days. However, I want only the urls in the seed file to be recrawled, not the urls found while crawling (because what I crawl are static pages whose main content will not be updated; once they are crawled, recrawling is unnecessary). Is it possible to do this with updatedb, without starting a new crawl and merging? And which part of Nutch is related to the url recrawl schedule - the linkdb or injection? Also, what will Nutch do if a similar/identical url is found while crawling? Thank you. A little off topic: can Nutch use Luke for index management, like Solr? -- View this message in context: http://www.nabble.com/About-link-analysis-and-filter-usage%2C-and-Recrawling-tp15975729p16001325.html Sent from the Nutch - User mailing list archive at Nabble.com.
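On the interval question: the refetch schedule is kept per url in the crawl db (set at inject/update time), not in the linkdb, and the baseline interval can be raised in nutch-site.xml so that already-fetched pages effectively drop out of later generate rounds. A hedged fragment for Nutch 0.9 - the value is in days, and the exact property name may differ between versions, so verify it against your nutch-default.xml:

```xml
<property>
  <name>db.default.fetch.interval</name>
  <!-- example: refetch a page at most once a year -->
  <value>365</value>
</property>
```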
Re: Search server bin/nutch server?
Hi, Thank you for explaining the usage. One more question: if I start a search server in the background, can I use it to receive queries directly from another webpage? Thank you Tomislav Poljak wrote: > > Hi, > this is used for Distributed Search, so if you want to use it start > server(s): > > bin/nutch server > > on the machine(s) where you have index(es) (you can put any free port > and crawl dir should point to your crawl folder). Then you should > configure Nutch search web app to use this server(s): you have to edit > nutch-site.xml in Nutch web application: point searcher.dir to folder > containing text file: search-servers.txt and in this file put server(s): > > server_host server_port > > and start/restart servlet container (Tomcat/Jetty/...) > > > Hope this helps, > > Tomislav > > > On Tue, 2008-03-11 at 03:06 -0700, Vinci wrote: >> How should I use this command to set up a search server to receive query? > > > -- View this message in context: http://www.nabble.com/Search-server-bin-nutch-server--tp15975737p15996260.html Sent from the Nutch - User mailing list archive at Nabble.com.
Re: About link analysis and filter usage, and Recrawling
Hi, please see below for the follow-up questions. Enis Soztutar wrote: > >> 3. If I need to processing the crawled page in more flexible way, Is it >> better I dump the document to process but not write back, or I write my >> plugin on the some phase? If I need to write plugin, which pharse is the >> best point for me to implement my own extension? >> > This depends on what you want to do with want kind of data. You should > be more specific. > I need to remove unnecessary HTML, do an XSLT transformation (which will handle the encoding issue for me), and generate files. For the program I have, dumping everything out and not writing back is much preferred, but it looks like if I do so I will lose some information about the crawled pages? Enis Soztutar wrote: > >> 4. If I set the crawl depth = 1, is linkdb be meaningless in the rest of >> the >> crawling? >> > no, linkdb is used in the indexing phase. > So if I use another indexer like Solr, I need to do additional processing on the pages in order to keep the source link information? (like adding the source link information) Enis Soztutar wrote: > >> 5. Is there any method to avoid nutch recrawl a page in recrawling >> script? >> (e.g. not to crawl a page since no update from last time) Any information >> can provided me to implement this? >> > With the adaptive crawl, after several cycles the fetch frequency of a > page will be > automatically adjusted. > so if I keep crawling based on the same crawldb, I will get this effect? Enis Soztutar wrote: > >> >> Thank you for reading this long post, and any answer or suggestion >> > You're welcome. > > Enis > Thank you for your kind help, it really helps a lot :) -- View this message in context: http://www.nabble.com/About-link-analysis-and-filter-usage%2C-and-Recrawling-tp15975729p15996240.html Sent from the Nutch - User mailing list archive at Nabble.com.
Search server bin/nutch server?
How should I use this command to set up a search server that receives queries? -- View this message in context: http://www.nabble.com/Search-server-bin-nutch-server--tp15975737p15975737.html Sent from the Nutch - User mailing list archive at Nabble.com.
About link analysis and filter usage, and Recrawling
Hi everybody, I am trying to use Nutch to implement my spider algorithm... I need to get information from specific resources, then schedule the crawling based on the links found (i.e. Nutch will be a link analyzer as well as a crawler). Questions: 1. How can I get the links in the linkdb? Is there any method other than bin/nutch readlinkdb -dump? 2. I want all of my pages crawled but not updated; however, I know I will do the recrawling based on those crawled pages. Is there any method other than dumping the crawldb every time? 3. If I need to process the crawled pages in a more flexible way, is it better to dump the documents and process them without writing back, or to write my own plugin for one of the phases? If I need to write a plugin, which phase is the best point to implement my own extension? 4. If I set the crawl depth = 1, is the linkdb meaningless for the rest of the crawl? 5. Is there any method to stop Nutch from recrawling a page in a recrawl script (e.g. not crawling a page that has not been updated since last time)? Any information to help me implement this? Thank you for reading this long post, and for any answers or suggestions -- View this message in context: http://www.nabble.com/About-link-analysis-and-filter-usage%2C-and-Recrawling-tp15975729p15975729.html Sent from the Nutch - User mailing list archive at Nabble.com.
Re: Error when request cache page in 1.0-dev
hi, answered by myself again: the tika jar is not placed in the Tomcat webapp in 1.0-dev, and that causes this exception thank you for your attention, Vinci Vinci wrote: > > Hi all, > > finally I make nutch can crawl and search, but when I click the cache > page, it throw a http 500 to me: > > > screen dump > > type Exception report > > message > > description The server encountered an internal error () that prevented it > from fulfilling this request. > > exception > > org.apache.jasper.JasperException: Exception in JSP: /cached.jsp:63 > > 60: } > 61: } > 62: else > 63: content = new String(bean.getContent(details)); > 64: } > 65: %> > 66: > > > thing I found in log > --- > 2008-01-31 19:04:46,324 INFO NutchBean - cache request from 127.0.0.1 > 2008-01-31 19:04:46,358 ERROR [jsp] - Servlet.service() for servlet jsp > threw exception > java.lang.NoClassDefFoundError: org/apache/tika/mime/MimeTypeException > at java.lang.Class.forName0(Native Method) > at java.lang.Class.forName(Class.java:247) > at > org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:524) > at > org.apache.hadoop.io.WritableName.getClass(WritableName.java:72) > at > org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1405) > at > org.apache.hadoop.io.SequenceFile$Reader.(SequenceFile.java:1360) > at > org.apache.hadoop.io.SequenceFile$Reader.(SequenceFile.java:1349) > at > org.apache.hadoop.io.SequenceFile$Reader.(SequenceFile.java:1344) > at org.apache.hadoop.io.MapFile$Reader.(MapFile.java:254) > at org.apache.hadoop.io.MapFile$Reader.(MapFile.java:242) > at > org.apache.hadoop.mapred.MapFileOutputFormat.getReaders(MapFileOutputFormat.java:91) > at > org.apache.nutch.searcher.FetchedSegments$Segment.getReaders(FetchedSegments.java:90) > at > org.apache.nutch.searcher.FetchedSegments$Segment.getContent(FetchedSegments.java:68) > at > org.apache.nutch.searcher.FetchedSegments.getContent(FetchedSegments.java:139) > at > 
org.apache.nutch.searcher.NutchBean.getContent(NutchBean.java:347) > at org.apache.jsp.cached_jsp._jspService(cached_jsp.java:107) > at > org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:98) > at javax.servlet.http.HttpServlet.service(HttpServlet.java:802) > at > org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:331) > at > org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:329) > at > org.apache.jasper.servlet.JspServlet.service(JspServlet.java:265) > at javax.servlet.http.HttpServlet.service(HttpServlet.java:802) > at > org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:269) > at > org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:188) > at > org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213) > at > org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:174) > at > org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127) > at > org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:117) > at > org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:108) > at > org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:151) > at > org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:874) > at > org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(Http11BaseProtocol.java:665) > at > org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:528) > at > org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:81) > at > org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:689) > at java.lang.Thread.run(Thread.java:619) > -- View this message in context: http://www.nabble.com/Error-when-request-cache-page-in-1.0-dev-tp15202557p15205147.html Sent from the Nutch - User mailing list 
archive at Nabble.com.
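For readers hitting the same NoClassDefFoundError: the fix described above amounts to making the Tika jar visible to the deployed webapp's classloader. A sketch only -- NUTCH_HOME and CATALINA_HOME are placeholders for your Nutch checkout and Tomcat install, and the jar name depends on your build:

```shell
# Sketch, not a definitive recipe: copy the Tika jar into the webapp's lib dir.
cp $NUTCH_HOME/lib/tika-*.jar \
   $CATALINA_HOME/webapps/ROOT/WEB-INF/lib/
# restart Tomcat afterwards so the webapp classloader picks up the new jar
```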
Error when request cache page in 1.0-dev
Hi all, I finally got nutch to crawl and search, but when I click the cached page, it throws an http 500 at me: screen dump type Exception report message description The server encountered an internal error () that prevented it from fulfilling this request. exception org.apache.jasper.JasperException: Exception in JSP: /cached.jsp:63 60: } 61: } 62: else 63: content = new String(bean.getContent(details)); 64: } 65: %> 66:
Cannot parse atom feed with plugin feed installed
Hi, I already added the plugin name to nutch-default.xml, but it still throws the exception "ParseException: parser not found for contentType=application/atom+xml", while rss feeds work fine after I added parse-rss. I checked that the feed plugin supports atom feeds with mime-type application/atom+xml; did I miss any setting? -- View this message in context: http://www.nabble.com/Cannot-parse-atom-feed-with-plugin-feed-installed-tp15191469p15191469.html Sent from the Nutch - User mailing list archive at Nabble.com.
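In case it helps later readers: "parser not found for contentType" usually means no parse plugin is both enabled (via plugin.includes) and mapped to that content type (via conf/parse-plugins.xml). Local overrides belong in conf/nutch-site.xml rather than nutch-default.xml, which gets replaced on upgrade. A sketch of the override -- the plugin list here is only illustrative, keep whatever plugins you already use:

```xml
<property>
  <name>plugin.includes</name>
  <!-- "feed" is the plugin that claims application/atom+xml; the rest of
       this list is an example, not a recommendation -->
  <value>protocol-http|urlfilter-regex|parse-(text|html)|feed|index-basic|query-(basic|site|url)</value>
</property>
```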
Can Nutch use part of the url found for the next crawling?
hi, I have some trouble with a site doing content redirection: nutch can't crawl the site but can crawl its rss. Unfortunately the links in the rss redirect back to the site -- that is the bad part -- but the link I want appears in the url as a get parameter: http://site/disallowpart?url=the_link_i_want I see there is something called url-filter and regex-filter; which one can help me extract the_link_i_want? Thank you. -- View this message in context: http://www.nabble.com/Can-Nutch-use-part-of-the-url-found-for-the-next-crawling--tp15190975p15190975.html Sent from the Nutch - User mailing list archive at Nabble.com.
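For reference: urlfilter-regex can only accept or reject URLs; rewriting one URL into another is the job of urlnormalizer-regex, configured in conf/regex-normalize.xml. A sketch of a rule for the example URL above -- the hostname, path and parameter name are taken from the post, and the exact pattern is an assumption about the site's URL shape:

```xml
<!-- rewrite http://site/disallowpart?url=X into X itself -->
<regex>
  <pattern>^http://site/disallowpart\?url=(.+)$</pattern>
  <substitution>$1</substitution>
</regex>
```

One caveat: if the url= value is percent-encoded, the normalizer will not decode it for you.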
Re: What is that mean? robots_denied(18)
hi, I found the answer: this is generated because robots.txt disallowed crawling of the current url. Hope it can help. Vinci wrote: > > hi, > > I finally got the crawler running without exceptions by building from trunk, > but I found the crawler cannot fetch anything... and then I dumped the crawl > db and saw this in the metadata: > > _pst_:robots_denied(18) > > any idea? > -- View this message in context: http://www.nabble.com/What-is-that-mean--robots_denied%2818%29-tp15188811p15189990.html Sent from the Nutch - User mailing list archive at Nabble.com.
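For context, a robots.txt like the following (a made-up minimal example) is enough to produce robots_denied for every page on a host, since Nutch honors robots.txt by default:

```
User-agent: *
Disallow: /
```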
Re: Fetch issue with Feeds (SOLVED)
Hi, I finally figured out the solution: go to conf/, rename the old mime-types.xml to anything else, then copy tika-mimetypes.xml into the same directory under the name mime-types.xml. The crawler should work now. In short, this is because 1.0-dev uses tika, but the old mime detection config file was still being loaded. Vinci wrote: > > Hi, > > Here is some additional information: before the exception appears, nutch > prints 2 messages: > > fetching http://cnn.com > org.apache.tika.mime.MimeUtils load > INFO loading [mime-types.xml] > fetch of http://www.cnn.com/ failed with: java.lang.NullPointerException > Fetcher: done > > Seems the mime-type detection has a problem... do I need to configure the file it loads? > > > > Vinci wrote: >> >> Hi All, >> >> I get the same exception when trying the nightly build on a static >> page, can anyone help? >> >> >> Vicious wrote: >>> >>> Hi All, >>> >>> Using the latest nightly build I am trying to run a crawl. I have set >>> the agent property and all relevant plugins. However as soon as I run the >>> crawl I get the following error in hadoop.log. I read all the posts here >>> and the only suggestion was that the http.agent property should not be empty. >>> Well in my case it isn't, and yet I see the error. Any help will be >>> appreciated. >>> >>> Thanks- >>> >>> fetcher.Fetcher - fetch of http://feeds.wired.com/CultOfMac failed >>> with: java.lang.NullPointerE >>> http.Http - java.lang.NullPointerException >>> http.Http - at >>> org.apache.nutch.protocol.Content.getContentType(Content.java:327) >>> http.Http - at >>> org.apache.nutch.protocol.Content.(Content.java:95) >>> http.Http - at >>> org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:226) >>> http.Http - at >>> org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:164) >>> >> >> > > -- View this message in context: http://www.nabble.com/Fetch-issue-with-Feeds-tp15114911p15189897.html Sent from the Nutch - User mailing list archive at Nabble.com.
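The steps above can be sketched as shell commands. NUTCH_HOME is set here to a throwaway directory with dummy files so the snippet can be tried safely -- point it at your real Nutch 1.0-dev directory instead:

```shell
# Build a scratch conf/ dir standing in for a real Nutch install.
NUTCH_HOME=$(mktemp -d)
mkdir -p "$NUTCH_HOME/conf"
echo '<mime-info/>' > "$NUTCH_HOME/conf/mime-types.xml"       # the old file
echo '<mime-info/>' > "$NUTCH_HOME/conf/tika-mimetypes.xml"   # shipped for Tika

# The actual fix: park the old file, put the Tika definitions in its place.
cd "$NUTCH_HOME/conf"
mv mime-types.xml mime-types.xml.bak
cp tika-mimetypes.xml mime-types.xml
```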
Re: Fetch issue with Feeds
Hi, here is some additional information: before the exception appears, nutch prints 2 messages: fetching http://cnn.com org.apache.tika.mime.MimeUtils load INFO loading [mime-types.xml] fetch of http://www.cnn.com/ failed with: java.lang.NullPointerException Fetcher: done Seems the mime-type detection has a problem... do I need to configure the file it loads? Vinci wrote: > > Hi All, > > I get the same exception when trying the nightly build on a static > page, can anyone help? > > > Vicious wrote: >> >> Hi All, >> >> Using the latest nightly build I am trying to run a crawl. I have set the >> agent property and all relevant plugins. However as soon as I run the >> crawl I get the following error in hadoop.log. I read all the posts here >> and the only suggestion was that the http.agent property should not be empty. >> Well in my case it isn't, and yet I see the error. Any help will be >> appreciated. >> >> Thanks- >> >> fetcher.Fetcher - fetch of http://feeds.wired.com/CultOfMac failed with: >> java.lang.NullPointerE >> http.Http - java.lang.NullPointerException >> http.Http - at >> org.apache.nutch.protocol.Content.getContentType(Content.java:327) >> http.Http - at org.apache.nutch.protocol.Content.(Content.java:95) >> http.Http - at >> org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:226) >> http.Http - at >> org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:164) >> > > -- View this message in context: http://www.nabble.com/Fetch-issue-with-Feeds-tp15114911p15189590.html Sent from the Nutch - User mailing list archive at Nabble.com.
Re: Fetch issue with Feeds
Hi All, I get the same exception when trying the nightly build on a static page; can anyone help? Vicious wrote: > > Hi All, > > Using the latest nightly build I am trying to run a crawl. I have set the > agent property and all relevant plugins. However as soon as I run the crawl > I get the following error in hadoop.log. I read all the posts here and the > only suggestion was that the http.agent property should not be empty. Well in > my case it isn't, and yet I see the error. Any help will be appreciated. > > Thanks- > > fetcher.Fetcher - fetch of http://feeds.wired.com/CultOfMac failed with: > java.lang.NullPointerE > http.Http - java.lang.NullPointerException > http.Http - at > org.apache.nutch.protocol.Content.getContentType(Content.java:327) > http.Http - at org.apache.nutch.protocol.Content.(Content.java:95) > http.Http - at > org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:226) > http.Http - at > org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:164) > -- View this message in context: http://www.nabble.com/Fetch-issue-with-Feeds-tp15114911p15189123.html Sent from the Nutch - User mailing list archive at Nabble.com.
What is that mean? robots_denied(18)
hi, I finally got the crawler running without exceptions by building from trunk, but I found the crawler cannot fetch anything... and then I dumped the crawl db and saw this in the metadata: _pst_:robots_denied(18) any idea? -- View this message in context: http://www.nabble.com/What-is-that-mean--robots_denied%2818%29-tp15188811p15188811.html Sent from the Nutch - User mailing list archive at Nabble.com.
Dedup: Job Failed and crawl stopped at depth 1
I ran the 0.9 crawler with parameters -depth 2 -threads 1, and I got a Job Failed message for a dynamic-content site: Dedup: starting Dedup: adding indexes in: /var/crawl/indexes Exception in thread "main" java.io.IOException: Job failed! at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604) at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439) at org.apache.nutch.crawl.Crawl.main(Crawl.java:135) in the hadoop.log: 2008-01-30 15:08:12,402 INFO indexer.Indexer - Optimizing index. 2008-01-30 15:08:12,601 INFO indexer.Indexer - Indexer: done 2008-01-30 15:08:12,602 INFO indexer.DeleteDuplicates - Dedup: starting 2008-01-30 15:08:12,622 INFO indexer.DeleteDuplicates - Dedup: adding indexes in: /var/crawl/indexes 2008-01-30 15:08:12,882 WARN mapred.LocalJobRunner - job_b5nenb java.lang.ArrayIndexOutOfBoundsException: -1 at org.apache.lucene.index.MultiReader.isDeleted(MultiReader.java:113) at org.apache.nutch.indexer.DeleteDuplicates$InputFormat$DDRecordReader.next(DeleteDuplicates.java:176) at org.apache.hadoop.mapred.MapTask$1.next(MapTask.java:157) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:46) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:175) at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:126) Also, the crawling stopped at depth=1: 2008-01-30 15:08:10,083 WARN crawl.Generator - Generator: 0 records selected for fetching, exiting ... 2008-01-30 15:08:10,084 INFO crawl.Crawl - Stopping at depth=1 - no more URLs to fetch. I checked the index in Luke and it works; it only fetched the pages of the urls in the list. I tried searching in Luke and it seems to work well, but the nutch searcher returns nothing to me... did I miss some setting, or is this a problem of the aborted indexing? -- View this message in context: http://www.nabble.com/Dedup%3A-Job-Failed-and-crawl-stopped-at-depth-1-tp15176806p15176806.html Sent from the Nutch - User mailing list archive at Nabble.com.
Re: Newbie Questions: http.max.delays, view fetched page, view link db
Hi, thank you. :) Seems I need to write a Java program to write out the files and do the transformation. Another question about the dumped linkdb: I find escaped html appearing at the end of the links; is it the fault of the parser (the html is most likely not valid, but I really don't need that chunk of invalid code)? If I want to change the link parser, what do I need to do (in particular, I would prefer to change it via plugins)? Martin Kuen wrote: > > Hi there, > > On Jan 29, 2008 5:23 PM, Vinci <[EMAIL PROTECTED]> wrote: > >> >> Hi, >> >> Thank you :) >> One more question for the fetched page reading: I prefer I can dump the >> fetched page into a single html file. > > You could modify the Fetcher class (org.apache.nutch.fetch.Fetcher) to > create a seperate file for each downloaded file. > You could modify the SegmentReader class ( > org.apache.nutch.segment.SegmentReader) if you want to do that. > > No other way besides invert the >> inverted file? >> > The index is not inverted if you use the "readseg" command. The fetched > content (e.g html pages) is stored in the "crawl/segments" folder. The > lucene index is stored in "crawl/indexes". This (lucene) index is created > after all crawling has finished. The readseg command (SegmentReader class) > only accesses "crawl/segments", so the lucene index is not touched. lucene > index --> the inverted index > > Best Regards, > > Martin > > >> >> Martin Kuen wrote: >> > >> > Hi, >> > >> > On Jan 29, 2008 11:11 AM, Vinci <[EMAIL PROTECTED]> wrote: >> > >> >> >> >> Hi, >> >> >> >> I am new to nutch and I am trying to run a nutch to fetch something >> from >> >> specific websites. Currently I am running 0.9. >> >> >> >> As I have limited resources, I don't want nutch be too aggressive, so >> I >> >> want >> >> to set some delay, but I am confused with the value of >> http.max.delays, >> >> does >> >> it use milliseconds insteads of seconds? 
(Some people said it is in 3 >> >> second >> >> by default, but I see it is 1000 in crawl-tool.xml in nutch-0.9) >> >> >> > >> > "http.max.delays" doesn't specify a timespan - read the description >> more >> > carefully. I think "fetcher.server.delay" is what you are looking for. >> It >> > is >> > the amount of time the fetcher will at least wait until it fetches >> another >> > url from the same host. Keep in mind that the fetcher obeys robots.txt >> > files >> > (by default) - so if a robots.txt file is present the crawling will >> occur >> > "polite enough". >> > >> > >> >> Also, I need to read the fetched page so that I can do some >> modification >> >> on >> >> the html structure for future parsing, where is the files located? Are >> >> they >> >> store in pure html or they are breaken down into multiple file? if >> this >> >> is >> >> not html file, how can I read the fetched page? >> >> >> > >> > If you are looking for a way to programmatically read the fetched >> content >> > ( >> > e.g. html pages) have a look at the IndexReader class. >> > If you are looking for a way to dump the whole downloaded content to a >> > Text >> > file or want to see some statistical information about it, try the >> > "readseg" >> > command. >> > Check out this link: http://wiki.apache.org/nutch/08CommandLineOptions >> > >> >> >> >> And will the cached page losing all the original html attribute when >> it >> >> viewed in cached page? >> >> >> > The page will be stored character by character, including html tags. >> > >> >> >> >> Also, how can I read the link that nutch found and how can I control >> the >> >> crawling sequence? (change it to breadth-first search at the top >> level, >> >> then >> >> depth-first one by one) >> >> >> > Crawling always occurs breadth-first. If you want fine-grained control >> > over >> > the crawling sequence you should follow the procedure in the nutch >> > tutorial >> > for "whole internet crawling". 
Nevertheless the crawling occurs >> > breath-first. >> > >> >> >> >> Sorry for many questions. >> > >> > >> > HTH, >> > >> > Martin >> > >> > PS: polyu.edu.hk . . . greetings to the HK Polytechnic University . . . >> > (nice semester abroad . . . hehe ;) >> > >> > >> >> -- >> >> View this message in context: >> >> >> http://www.nabble.com/Newbie-Questions%3A-http.max.delays%2C-view-fetched-page%2C-view-link-db-tp15156228p15156228.html >> >> Sent from the Nutch - User mailing list archive at Nabble.com. >> >> >> >> >> > >> > >> >> -- >> View this message in context: >> http://www.nabble.com/Newbie-Questions%3A-http.max.delays%2C-view-fetched-page%2C-view-link-db-tp15156228p15163086.html >> Sent from the Nutch - User mailing list archive at Nabble.com. >> >> > > -- View this message in context: http://www.nabble.com/Newbie-Questions%3A-http.max.delays%2C-view-fetched-page%2C-view-link-db-tp15156228p15175746.html Sent from the Nutch - User mailing list archive at Nabble.com.
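For completeness, the "readseg" dump Martin refers to is run from the Nutch install directory; the segment timestamp below is a placeholder for whatever segment your crawl produced:

```shell
# Dump one fetched segment (content, parse text, etc.) to plain text files.
bin/nutch readseg -dump crawl/segments/20080129101112 dump_out
# List a segment's basic statistics instead of dumping it.
bin/nutch readseg -list crawl/segments/20080129101112
```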
Re: Tomcat query
Hi, here is the answer for q1 and q3: 1. the tomcat is for the online search interface. If you won't include the documentation in the released product, you don't need to include it in the package; just set up tomcat on the server where the index files are located, modify the config file and deploy the war file in the tomcat manager. Also, make sure the html files are there and everybody can read them via a url starting with http. 3. set up the tomcat server mentioned above so everybody can reach the search page and submit queries from their web browser, and find a suitable place in the shipped package to tell the user the url of the online help :) Jaya Ghosh wrote: > > Hello, > > > > I have a query. > > > > I have created an index of our online documentation files (htmls). > Therefore > it is more like an intranet search, that is, the search will be performed > on > static documents only. Now I need to test it. My machine does not have > Tomcat installed. The IT department has informed me that as a Tomcat user > I > need to have root permissions and they need permission from the higher > authority to assign me the same. > > > > My queries are: > > > > 1. If I succeed implementing Nutch in our tool will we have to ship > Tomcat/provide URL to the end-users? > > 2. Is there an alternative to above? > > 3. Am I right in assuming that in static documents the index is built only > once and that is what we would be shipping with the tool? Therefore, the > end > user will not need any permissions as such to perform the search? > > > > As mentioned earlier, I am a writer and hence not technical. > > > > Thanks in advance for any help/response. > > > > Regards, > > Ms.Jaya > > > -- View this message in context: http://www.nabble.com/Tomcat-query-tp15131352p15164964.html Sent from the Nutch - User mailing list archive at Nabble.com.
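A sketch of the deployment step from point 1 -- the war file name and paths are assumptions for a 0.9-style install, not exact instructions:

```shell
# Deploy the Nutch search webapp into Tomcat (war name depends on your release).
cp nutch-0.9.war $CATALINA_HOME/webapps/ROOT.war
# After Tomcat unpacks it, point the webapp at your crawl directory by
# setting the "searcher.dir" property in
# $CATALINA_HOME/webapps/ROOT/WEB-INF/classes/nutch-site.xml
```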
Re: Newbie Questions: http.max.delays, view fetched page, view link db
Hi, thank you :) One more question about reading the fetched pages: I would prefer to dump each fetched page into a single html file. Is there no other way besides inverting the inverted file? Martin Kuen wrote: > > Hi, > > On Jan 29, 2008 11:11 AM, Vinci <[EMAIL PROTECTED]> wrote: > >> >> Hi, >> >> I am new to nutch and I am trying to run a nutch to fetch something from >> specific websites. Currently I am running 0.9. >> >> As I have limited resources, I don't want nutch be too aggressive, so I >> want >> to set some delay, but I am confused with the value of http.max.delays, >> does >> it use milliseconds insteads of seconds? (Some people said it is in 3 >> second >> by default, but I see it is 1000 in crawl-tool.xml in nutch-0.9) >> > > "http.max.delays" doesn't specify a timespan - read the description more > carefully. I think "fetcher.server.delay" is what you are looking for. It > is > the amount of time the fetcher will at least wait until it fetches another > url from the same host. Keep in mind that the fetcher obeys robots.txt > files > (by default) - so if a robots.txt file is present the crawling will occur > "polite enough". > > >> Also, I need to read the fetched page so that I can do some modification >> on >> the html structure for future parsing, where is the files located? Are >> they >> store in pure html or they are breaken down into multiple file? if this >> is >> not html file, how can I read the fetched page? >> > > If you are looking for a way to programmatically read the fetched content > ( > e.g. html pages) have a look at the IndexReader class. > If you are looking for a way to dump the whole downloaded content to a > Text > file or want to see some statistical information about it, try the > "readseg" > command. > Check out this link: http://wiki.apache.org/nutch/08CommandLineOptions > >> >> And will the cached page losing all the original html attribute when it >> viewed in cached page? 
>> > The page will be stored character by character, including html tags. > >> >> Also, how can I read the link that nutch found and how can I control the >> crawling sequence? (change it to breadth-first search at the top level, >> then >> depth-first one by one) >> > Crawling always occurs breadth-first. If you want fine-grained control > over > the crawling sequence you should follow the procedure in the nutch > tutorial > for "whole internet crawling". Nevertheless the crawling occurs > breath-first. > >> >> Sorry for many questions. > > > HTH, > > Martin > > PS: polyu.edu.hk . . . greetings to the HK Polytechnic University . . . > (nice semester abroad . . . hehe ;) > > >> -- >> View this message in context: >> http://www.nabble.com/Newbie-Questions%3A-http.max.delays%2C-view-fetched-page%2C-view-link-db-tp15156228p15156228.html >> Sent from the Nutch - User mailing list archive at Nabble.com. >> >> > > -- View this message in context: http://www.nabble.com/Newbie-Questions%3A-http.max.delays%2C-view-fetched-page%2C-view-link-db-tp15156228p15163086.html Sent from the Nutch - User mailing list archive at Nabble.com.
Newbie Questions: http.max.delays, view fetched page, view link db
Hi, I am new to nutch and I am trying to get nutch to fetch content from specific websites. Currently I am running 0.9. As I have limited resources, I don't want nutch to be too aggressive, so I want to set some delay, but I am confused about the value of http.max.delays: does it use milliseconds instead of seconds? (Some people said it is 3 seconds by default, but I see it is 1000 in crawl-tool.xml in nutch-0.9) Also, I need to read the fetched pages so that I can do some modification to the html structure for future parsing; where are the files located? Are they stored as pure html, or are they broken down into multiple files? If they are not html files, how can I read the fetched pages? And will the cached page lose all the original html attributes when it is viewed as a cached page? Also, how can I read the links that nutch found, and how can I control the crawling sequence? (change it to breadth-first search at the top level, then depth-first one by one) Sorry for the many questions. -- View this message in context: http://www.nabble.com/Newbie-Questions%3A-http.max.delays%2C-view-fetched-page%2C-view-link-db-tp15156228p15156228.html Sent from the Nutch - User mailing list archive at Nabble.com.
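As the reply quoted earlier in this thread notes, the politeness knob is fetcher.server.delay (seconds between two requests to the same host), not http.max.delays. A minimal conf/nutch-site.xml override -- the value 5.0 is only an example:

```xml
<property>
  <name>fetcher.server.delay</name>
  <!-- seconds the fetcher waits between two fetches from the same host;
       5.0 is an example value, not a recommendation -->
  <value>5.0</value>
</property>
```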