Re: newbie questions
Hello Brian,

You're getting a response from another newbie here, so I could be wrong (do excuse me if I am). If you are attempting to run a search index from the filesystem, you need to have the following in your nutch-site.xml:

  <property>
    <name>fs.default.name</name>
    <value>file:</value>
  </property>

The fs.default.name property is required in nutch-site.xml when you build your .war file for deployment to Tomcat. It should be accompanied by the config below, which should point to the directory your index has been copied to; in my case it looks something like this:

  <property>
    <name>searcher.dir</name>
    <value>/home/nutch/nutch/service/crawl/</value>
    <description>
      Path to root of crawl. This directory is searched (in order) for
      either the file search-servers.txt, containing a list of distributed
      search servers, or the directory "index" containing merged indexes,
      or the directory "segments" containing segment indexes.
    </description>
  </property>

Regarding your second question:

  bin/nutch readdb yourcrawldir/crawldb -dump <output_dir> -format csv

gives you a nice flat-file serialisation of your crawl database (note that -dump expects an output directory).

I hope this helps,

Mischa

On 1 Dec 2009, at 08:44, brian wrote:
> also, I would like to know how to extract flat text files of the crawl data.

___
Mischa Tuffield
Email: mischa.tuffi...@garlik.com
Homepage: http://mmt.me.uk/
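P.S. A fuller example of that dump step, with an explicit output directory (the paths here are just from my setup, adjust to yours):

  bin/nutch readdb /home/nutch/nutch/service/crawl/crawldb -dump crawldb_dump -format csv
  head crawldb_dump/part-00000

The dump is written as Hadoop part files inside the output directory.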
Re: Newbie Questions: http.max.delays, view fetched page, view link db
Hi there,

On Jan 29, 2008 5:23 PM, Vinci [EMAIL PROTECTED] wrote:
> Hi, thank you :) One more question about reading the fetched pages: I'd prefer to be able to dump each fetched page into a single html file.

You could modify the Fetcher class (org.apache.nutch.fetcher.Fetcher) to create a separate file for each downloaded page, or you could modify the SegmentReader class (org.apache.nutch.segment.SegmentReader) if you want to do that after fetching.

> No other way besides inverting the inverted file?

The index is not inverted if you use the readseg command. The fetched content (e.g. html pages) is stored in the crawl/segments folder. The Lucene index is stored in crawl/indexes, and is only created after all crawling has finished. The readseg command (SegmentReader class) only accesses crawl/segments, so the Lucene index is not touched. (Lucene index == the inverted index.)

Best Regards,

Martin
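A concrete example of that readseg dump (the segment timestamp below is made up; pick a real one from your crawl/segments directory):

  bin/nutch readseg -dump crawl/segments/20080129101112 segdump
  less segdump/dump

Add flags such as -nofetch, -nogenerate, -noparse, -noparsedata or -noparsetext if you only want part of the data written out.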
Re: Newbie Questions: http.max.delays, view fetched page, view link db
Hi,

Thank you :) One more question about reading the fetched pages: I'd prefer to be able to dump each fetched page into a single html file. Is there no other way besides inverting the inverted file?

Martin Kuen wrote:
> If you are looking for a way to programmatically read the fetched content (e.g. html pages), have a look at the IndexReader class. If you are looking for a way to dump the whole downloaded content to a text file, or want to see some statistical information about it, try the readseg command. Check out this link: http://wiki.apache.org/nutch/08CommandLineOptions
Re: Newbie Questions: http.max.delays, view fetched page, view link db
Hi,

On Jan 29, 2008 11:11 AM, Vinci [EMAIL PROTECTED] wrote:
> Hi, I am new to nutch and I am trying to run nutch to fetch something from specific websites. Currently I am running 0.9. As I have limited resources, I don't want nutch to be too aggressive, so I want to set some delay, but I am confused by the value of http.max.delays: does it use milliseconds instead of seconds? (Some people said it is 3 seconds by default, but I see it is 1000 in crawl-tool.xml in nutch-0.9.)

http.max.delays doesn't specify a timespan - read the description more carefully. I think fetcher.server.delay is what you are looking for. It is the minimum amount of time the fetcher will wait before fetching another url from the same host. Keep in mind that the fetcher obeys robots.txt files (by default) - so if a robots.txt file is present, crawling will be polite enough.

> Also, I need to read the fetched pages so that I can do some modification on the html structure for future parsing. Where are the files located? Are they stored as pure html, or are they broken down into multiple files? If they are not html files, how can I read the fetched pages?

If you are looking for a way to programmatically read the fetched content (e.g. html pages), have a look at the IndexReader class. If you are looking for a way to dump the whole downloaded content to a text file, or want to see some statistical information about it, try the readseg command. Check out this link: http://wiki.apache.org/nutch/08CommandLineOptions

> And will the cached page lose all the original html attributes when it is viewed as a cached page?

The page will be stored character by character, including html tags.

> Also, how can I read the links that nutch found, and how can I control the crawling sequence? (Change it to breadth-first search at the top level, then depth-first one by one.)

Crawling always occurs breadth-first. If you want fine-grained control over the crawling sequence, you should follow the procedure in the nutch tutorial for whole-internet crawling. Nevertheless the crawling occurs breadth-first.

> Sorry for many questions.

HTH,

Martin

PS: polyu.edu.hk . . . greetings to the HK Polytechnic University . . . (nice semester abroad . . . hehe ;)
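For reference, the delay Martin mentions is set in nutch-site.xml like this (the value is a number of seconds; 5.0 is the usual default, shown here only as an example):

  <property>
    <name>fetcher.server.delay</name>
    <value>5.0</value>
  </property>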
Re: Newbie Questions: http.max.delays, view fetched page, view link db
Hi,

Thank you :) It seems I need to write a Java program to write out the files and do the transformation.

Another question about the dumped linkdb: I find escaped html at the end of the links. Is that the fault of the parser? (The html is most likely not valid, but I really don't need that chunk of invalid code.) If I want to change the link parser, what do I need to do? (I'd prefer to change it via a plugin.)

Martin Kuen wrote:
> You could modify the Fetcher class (org.apache.nutch.fetcher.Fetcher) to create a separate file for each downloaded page, or you could modify the SegmentReader class (org.apache.nutch.segment.SegmentReader) if you want to do that after fetching.
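For context, the linkdb dump referred to above can be produced with the readlinkdb command, e.g. (paths are examples):

  bin/nutch readlinkdb crawl/linkdb -dump linkdump
  less linkdump/part-00000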
Re: Newbie questions about followed links
Sir:

On 08/03/07, Jeroen Verhagen [EMAIL PROTECTED] wrote:
> Surely these links look ordinary enough to be seen and followed by nutch? Could someone please tell me what could be causing these links not to be followed?

conf/urlfilter.txt.template contains the line:

  -[?*!@=]

Remove the '?' and the links will be followed.

--
Cheers,
Hasan Diwan
[EMAIL PROTECTED]
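For anyone finding this later, the edited line would then look like this (assuming the stock filter file; the comment is the one that precedes it in the default config):

  # skip URLs containing certain characters as probable queries, etc.
  -[*!@=]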
Re: Newbie questions about followed links
exactly what I was going to say!

Cheers,
Paul

On 3/8/07, Hasan Diwan [EMAIL PROTECTED] wrote:
> conf/urlfilter.txt.template contains the line:
>
>   -[?*!@=]
>
> Remove the '?' and the links will be followed.
Re: Newbie questions about followed links
Hi Hasan,

On 3/8/07, Hasan Diwan [EMAIL PROTECTED] wrote:
> conf/urlfilter.txt.template contains the line:
>
>   -[?*!@=]
>
> Remove the '?' and the links will be followed.

Thanks, that made it work. I had to comment out the whole line '-[?*!@=]' to make it work, though. Even though there doesn't seem to be an @ character in the links, for example?

--
regards,
Jeroen
Re: Newbie questions
Hi Vacuum,

I hope the nutch wiki will help you much :) http://wiki.apache.org/nutch/

Regards
/Jack

On 7/6/05, Vacuum Joe [EMAIL PROTECTED] wrote:
> Hello Nutch-gurus,
>
> I have some very straightforward and yet totally newbie questions which I hope some kind person would answer.
>
> First of all, what is a db? It seems like I have to inject links into the db to get the process started. So the links are in the db, and then I run fetch on them. That brings me to the next question: what's a segment? I notice that it creates timestamped segment directories. What's in them? Does the running Nutch web application automatically pick up new segment files when they are added, or do I have to restart it?
>
> I'm trying to figure this out because I want to get started with automated crawling, so I'll have one or two machines crawling all the time, and then have a cluster of web server machines. I assume that the web server front-end machines need the segments and the crawlers need the db, but I'm not sure exactly what the functions of these are.
>
> Thanks for your help and thanks for the awesome piece of software. Hopefully as we do some work on it, we'll have some code to return to the source.
Re: Newbie questions
> I hope nutch wiki will help you much :) http://wiki.apache.org/nutch/

Hello Jack,

Yes, I have been reading it. The db file contains a database of all the link structure and pages of the web. But what is a segment in this case? I assume a segment contains page content? And then there is the updatedb command, which takes the newly-discovered links in a segment and puts them back in the db, so the new links can be followed in the next segment the next time there is a crawl?

I am more confused about segments than I am about dbs, I guess. Do I need to keep old segments after generating a new one?
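For what it's worth, a rough sketch of the cycle that produces the db and the segments (the commands below follow the later 0.8+ whole-web tutorial; the 0.7-era WebDB commands differ slightly):

  bin/nutch inject crawl/crawldb urls
  bin/nutch generate crawl/crawldb crawl/segments
  s1=`ls -d crawl/segments/* | tail -1`
  bin/nutch fetch $s1
  bin/nutch updatedb crawl/crawldb $s1

Repeating generate / fetch / updatedb follows the newly discovered links in the next round.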