Difference between Feed Parser and RSS Parser
Hi, what is the difference between FeedParser and RssParser? I have RSS feed URLs in seed.txt. Will Nutch call FeedParser or RssParser to parse them?
Re: Difference between Feed Parser and RSS Parser
On Fri, Jul 17, 2009 at 09:21, Saurabh Suman <saurabhsuman...@rediff.com> wrote: Hi, what is the difference between FeedParser and RssParser? I have RSS feed URLs in seed.txt. Will Nutch call FeedParser or RssParser to parse them? Depends on which plugin is included in your conf: the feed plugin extracts each item in an RSS feed into its own entry, while parse-rss creates one page for all of the feed's content. -- Doğacan Güney
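For reference, plugin selection happens through plugin.includes. A minimal sketch for conf/nutch-site.xml follows, assuming a Nutch 1.x layout; the plugin list is illustrative rather than the exact default, so merge it with whatever your installation already enables:

<!-- conf/nutch-site.xml (sketch): enable the feed plugin so each feed item becomes
     its own entry; drop "feed" and keep parse-rss if one page per feed is enough -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|js)|feed|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>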
recrawling
I want my crawl to pick up the updated contents of a web page as soon as the website gets updated. I have used the page info of the web page, but it's not 100% reliable. Can anyone suggest other ways of doing this? Please help, it's urgent.
How segment depends on depth
As I observed, Nutch makes a new folder with the current timestamp in the segments directory for each depth. Does the new folder created under the segments directory while crawling at depth 2 contain all the URLs and parsed text of the previous depth, or does it just overwrite the previous one? If I search for a query string, will it search depth 1 or depth 2?
Issue with Parse metaData while crawling RSSFeed URL
Hi, I am crawling a feed URL, http://blog.taragana.com/n/c/india/feed/, with depth=2, and I am using FeedParser.java to parse it. At depth 1, the parse metadata in the segment's parseData for the URL http://blog.taragana.com/n/30-child-labourers-rescued-in-agra-and-firozabad-111417/ looks like this: Parse Metadata: author=Ani CharEncodingForConversion=utf-8 tag=Agra tag=Firozabad tag=Uttar Pradesh tag=India OriginalCharEncoding=utf-8 feed=http://blog.taragana.com/n published=1247778368000. As you can see, it contains the author. But at depth 2 the parse metadata for the same URL is: Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8, and when I search I do not get the author. My question: (1) does Nutch overwrite the depth-1 parse metadata for this URL with that of depth 2, or does it merge the two? If it overwrites, how can I stop it from doing so, as I need the author and the other information obtained by parsing the RSS feed?
Re: Issue with Parse metaData while crawling RSSFeed URL
On Fri, Jul 17, 2009 at 14:15, Saurabh Suman <saurabhsuman...@rediff.com> wrote: [...] Does Nutch overwrite the depth-1 parse metadata for this URL with that of depth 2, or does it merge the two? If it overwrites, how can I stop it from doing so, as I need the author and the other information obtained by parsing the RSS feed? Searching for RSS data such as author, etc. is not yet implemented; I hope to implement it before the next release. -- Doğacan Güney
Re: Why can't I inject a Google link to the database?
http://www.google.se/robots.txt - Google disallows it: User-agent: * Allow: /searchhistory/ Disallow: /search. Larsson85 wrote: Why isn't Nutch able to handle links from Google? I tried to start a crawl from the following URL: http://www.google.se/search?q=site:se&hl=sv&start=100&sa=N and all I get is "no more URLs to fetch". The reason I want to do this is that I thought maybe I could use Google to generate my start list of URLs by injecting pages of search results. Why won't this page be parsed and its links extracted so the crawl can start?
Re: Why can't I inject a Google link to the database?
It seems that Google is blocking the user agent. I get this reply with lwp-request: "Your client does not have permission to get URL /search?q=site:se&hl=sv&start=100&sa=N from this server. (Client IP address: XX.XX.XX.XX) Please see Google's Terms of Service posted at http://www.google.com/terms_of_service.html". If you set the user agent properties to a client such as Firefox, Google will serve your request. reinhard schwab wrote: http://www.google.se/robots.txt - Google disallows it: User-agent: * Allow: /searchhistory/ Disallow: /search [...]
Re: Why can't I inject a Google link to the database?
Any workaround for this? Making Nutch identify itself as something else, or something similar? reinhard schwab wrote: http://www.google.se/robots.txt - Google disallows it: User-agent: * Allow: /searchhistory/ Disallow: /search [...]
Re: Why can't I inject a Google link to the database?
You can check the response from Google by dumping the segment: bin/nutch readseg -dump crawl/segments/... somedirectory. reinhard schwab wrote: it seems that Google is blocking the user agent; if you set the user agent properties to a client such as Firefox, Google will serve your request. [...]
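A small shell sketch of that check, assuming the default crawl/segments layout (paths are illustrative):

# dump the most recent segment and inspect the raw response Google returned
SEG=$(ls -d crawl/segments/* | sort | tail -1)
bin/nutch readseg -dump "$SEG" segdump
less segdump/dump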
Re: Why can't I inject a Google link to the database?
On Fri, Jul 17, 2009 at 15:23, Larsson85 <kristian1...@hotmail.com> wrote: Any workaround for this? Making Nutch identify itself as something else, or something similar? Also note that Nutch does not crawl anything with '?' or '&' in the URL. Check out crawl-urlfilter.txt or regex-urlfilter.txt (depending on whether you use the crawl command or the inject/generate/fetch/parse commands). [...] -- Doğacan Güney
Re: Why can't I inject a Google link to the database?
2009/7/17 Doğacan Güney <doga...@gmail.com>: Also note that Nutch does not crawl anything with '?' or '&' in the URL. Check out crawl-urlfilter.txt or regex-urlfilter.txt. Oops, I mean Nutch does not crawl any such URL *by default*. [...] -- Doğacan Güney
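For reference, the rule responsible sits in conf/crawl-urlfilter.txt (or conf/regex-urlfilter.txt); the excerpt below is from memory of the stock file, so it may differ slightly in your version. Commenting it out, or narrowing the character class, lets URLs with query strings through:

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]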
Re: Why can't I inject a Google link to the database?
Identify Nutch as a popular user agent such as Firefox. Larsson85 wrote: Any workaround for this? Making Nutch identify itself as something else, or something similar? [...]
Re: Why can't I inject a Google link to the database?
This isn't a user agent problem. No matter what user agent you use, Nutch is still not going to crawl this page, because Nutch is correctly following the robots.txt directives which block access. To change this would be to make the crawler impolite; a well behaved crawler should follow the robots.txt directives. Dennis. reinhard schwab wrote: identify Nutch as a popular user agent such as Firefox. [...]
Re: Why can't I inject a Google link to the database?
I think I need more help on how to do this. I tried using: <property> <name>http.robots.agents</name> <value>Mozilla/5.0*</value> <description>The agent strings we'll look for in robots.txt files, comma-separated, in decreasing order of precedence. You should put the value of http.agent.name as the first agent name, and keep the default * at the end of the list. E.g.: BlurflDev,Blurfl,*</description> </property> If I don't have the star at the end I get the same as earlier, "No URLs to fetch", and if I do I get "0 records selected for fetching, exiting". reinhard schwab wrote: identify Nutch as a popular user agent such as Firefox. [...]
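For reference, the agent is normally configured roughly as in the sketch below in conf/nutch-site.xml (the value is illustrative); http.agent.name must be set and should appear first in http.robots.agents. As the following replies point out, though, this still will not get Nutch past Google's robots.txt:

<property>
  <name>http.agent.name</name>
  <value>MyCrawler</value>
</property>
<property>
  <name>http.robots.agents</name>
  <value>MyCrawler,*</value>
</property>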
Re: Why can't I inject a Google link to the database?
Larsson85, please read the past responses. Google is blocking all crawlers, not just yours, from indexing their search results. Because of their robots.txt directives you will not be able to do this. If you put a sign on your house saying DO NOT ENTER and I entered anyway, you would be very upset. That is what the robots.txt file does for a site: it tells visiting bots what they can enter and what they can't. Jake Jacobson http://www.linkedin.com/in/jakejacobson http://www.facebook.com/jakecjacobson http://twitter.com/jakejacobson "Our greatest fear should not be of failure, but of succeeding at something that doesn't really matter." -- ANONYMOUS On Fri, Jul 17, 2009 at 9:32 AM, Larsson85 <kristian1...@hotmail.com> wrote: I think I need more help on how to do this. I tried using the http.robots.agents property. [...]
Re: Why can't I inject a Google link to the database?
You are right: robots.txt clearly disallows this page, so it will not be fetched. I remember Google has some APIs for accessing its search: http://code.google.com/intl/de-DE/apis/soapsearch/index.html and http://code.google.com/intl/de-DE/apis/ajaxsearch/. reinhard. Dennis Kubes wrote: This isn't a user agent problem. No matter what user agent you use, Nutch is still not going to crawl this page, because Nutch is correctly following the robots.txt directives which block access. [...]
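Purely as an illustration: at the time, the AJAX Search API could be queried over plain HTTP and returned JSON, roughly as below (the endpoint and parameters are from memory and should be treated as assumptions; the API also imposed daily usage limits):

# hypothetical query against the (historical) Google AJAX Search API
curl "http://ajax.googleapis.com/ajax/services/search/web?v=1.0&q=site:se&start=0"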
Re: Why can't I inject a Google link to the database?
1. Save the results page. 2. Grep the links out of it. 3. Put the results in a doc in your urls directory. 4. Run: bin/nutch crawl urls. On Fri, 17 Jul 2009 02:32 -0700, Larsson85 <kristian1...@hotmail.com> wrote: I think I need more help on how to do this. I tried using the http.robots.agents property. [...] -- Brian Ulicny bulicny at alum dot mit dot edu home: 781-721-5746 fax: 360-361-5746
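A rough shell sketch of steps 1-3, assuming the results page was saved as results.html (the grep pattern is approximate and will need tuning for Google's actual markup):

# pull href targets out of the saved page and use them as the seed list
grep -o 'href="http[^"]*"' results.html | sed 's/^href="//; s/"$//' | sort -u > urls/seed.txt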
Re: Why can't I inject a Google link to the database?
Brian Ulicny wrote: 1. Save the results page. 2. Grep the links out of it. 3. Put the results in a doc in your urls directory. 4. Run: bin/nutch crawl urls. Please note, we are not saying this is impossible to do with Nutch (e.g. by setting the agent string to mimic a browser), but we insist on saying that it is RUDE to do this. Anyway, Google monitors such attempts, and after you issue too many requests your IP will be blocked for a duration, so whether you go the polite or the impolite way you won't be able to do this. -- Best regards, Andrzej Bialecki http://www.sigram.com Contact: info at sigram dot com
Re: Why can't I inject a Google link to the database?
You can also use commons-httpclient or HtmlUnit to access the Google search; these tools are not crawlers, and with HtmlUnit it would be easy to get the outlinks. I strongly advise you not to misuse Google search with too many requests; Google will block you, I assume. By using a search API you are allowed to request it 1000 times per day, if I remember correctly; it is mentioned in the terms of use or elsewhere in the documentation. Google returns a maximum of 1000 links for a search and a maximum of 100 links on one page; if you set the search parameter num=100, you will get 100 links per result page. Brian Ulicny wrote: 1. Save the results page. 2. Grep the links out of it. 3. Put the results in a doc in your urls directory. 4. Run: bin/nutch crawl urls. [...]
Re: java heap space problem when using the language identifier
I have never applied a patch so far... so I will do my best. 2009/7/17 Doğacan Güney <doga...@gmail.com>: On Fri, Jul 17, 2009 at 00:30, MilleBii <mille...@gmail.com> wrote: Just trying to index a smaller segment (300k URLs)... and the memory just goes up and up, but it does NOT hit the physical boundary limit. Sounds like a memory leak??? How come? I thought Java was doing the garbage collection automatically. Can you try the patch at https://issues.apache.org/jira/browse/NUTCH-356 (try cache_classes.patch)? 2009/7/16 MilleBii <mille...@gmail.com>: I get more details now on my error. What can I do about it? I have 4GB of memory, but it is not fully used (I think). I use cygwin/windows/local filesystem. java.lang.OutOfMemoryError: Java heap space at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.<init>(MapTask.java:498) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305) at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138) -- Forwarded message -- From: MilleBii <mille...@gmail.com> Date: 2009/7/15 Subject: Error when using language-identifier plugin? To: nutch-user@lucene.apache.org I decided to add the language-identifier plugin, but I get the following error when I start indexing my crawldb. It is not really explicit; if I remove the plugin it works just fine. I also tried on a smaller crawl database that I use for testing and it works fine too. Any idea where to look? 2009-07-15 16:19:54,875 WARN mapred.LocalJobRunner - job_local_0001 2009-07-15 16:19:54,891 FATAL indexer.Indexer - Indexer: java.io.IOException: Job failed! at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232) at org.apache.nutch.indexer.Indexer.index(Indexer.java:72) at org.apache.nutch.indexer.Indexer.run(Indexer.java:92) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.nutch.indexer.Indexer.main(Indexer.java:101) -- -MilleBii- -- Doğacan Güney -- -MilleBii-
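For anyone else who has not applied a patch before, applying a JIRA patch to a Nutch source checkout usually looks roughly like this (a sketch; download cache_classes.patch from NUTCH-356 first, and note that the -p level depends on how the patch was generated):

cd $NUTCH_HOME                               # top of the Nutch source tree
patch -p0 --dry-run < cache_classes.patch    # check that it applies cleanly
patch -p0 < cache_classes.patch
ant clean && ant                             # rebuild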
Re: java heap space problem when using the language identifier
Actually, another question I had when looking at the logs: why are there so many plugin loads? I am missing the logic. 2009/7/17 MilleBii <mille...@gmail.com>: I have never applied a patch so far... so I will do my best. [...] -- -MilleBii-
Re: How segment depends on depth
When you run the Nutch index command and give it the list of segments, it will build one single index; segments are different chunks of your crawldb. I guess what is less clear to me is what happens once the expiry date has passed: URLs will be recrawled and duplicated into different segments, and I am not sure how that is taken care of. 2009/7/17 Saurabh Suman <saurabhsuman...@rediff.com>: As I observed, Nutch makes a new folder with the current timestamp in the segments directory for each depth. [...] -- -MilleBii-
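A sketch of the indexing and de-duplication steps for the 1.0-era command line (paths assumed; check the usage output of bin/nutch for your version). Dedup is the step that collapses URLs that were re-fetched into different segments:

# build one index over all segments, then delete duplicate documents
bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*
bin/nutch dedup crawl/indexes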
Re: java heap space problem when using the language identifier
Looks great: my indexing is now working and I observe constant memory usage instead of the ever-growing slope. Thanks a lot. Why is this patch not in the standard build? I just get a weird message in ant/Eclipse: [jar] Warning: skipping jar archive C:\xxx\workspace\nutch\build\nutch-extensionpoints\nutch-extensionpoints.jar because no files were included. [jar] Building MANIFEST-only jar: C:\xxx\workspace\nutch\build\nutch-extensionpoints\nutch-extensionpoints.jar Not sure what that means. 2009/7/17 Doğacan Güney <doga...@gmail.com>: Can you try the patch at https://issues.apache.org/jira/browse/NUTCH-356 (try cache_classes.patch)? [...] -- -MilleBii-
Re: wrong outlinks
On Fri, Jul 17, 2009 at 22:48, reinhard schwabreinhard.sch...@aon.at wrote: when i crawl a domain such as http://www.weissenkirchen.at/ nutch extracts these outlinks. do they come from some heuristics? These are probably coming from parse-js plugin. Javascript parser does a best effort to extract outlinks but there will be many outlinks that are broken. they seem obvious to be wrong and have status db_gone in crawldb. URL:: http://www.weissenkirchen.at/kirchenwirt/+((110-pesp)/100)+ URL:: http://www.weissenkirchen.at/kirchenwirt//A URL:: http://www.weissenkirchen.at/kirchenwirt//A/TD URL:: http://www.weissenkirchen.at/kirchenwirt//DIV URL:: http://www.weissenkirchen.at/kirchenwirt//FONT URL:: http://www.weissenkirchen.at/kirchenwirt/):(i.iarw0 URL:: http://www.weissenkirchen.at/kirchenwirt/+(i.iarw+2): URL:: http://www.weissenkirchen.at/kirchenwirt/+i.ids+ URL:: http://www.weissenkirchen.at/kirchenwirt/):(i.iicw0 URL:: http://www.weissenkirchen.at/kirchenwirt/+(i.iicw+2): URL:: http://www.weissenkirchen.at/kirchenwirt/kirchenwirt.js URL:: http://www.weissenkirchen.at/kirchenwirt//LAYER URL:: http://www.weissenkirchen.at/kirchenwirt//LAYER/ILAYER/FONT/TD URL:: http://www.weissenkirchen.at/kirchenwirt//LAYER/LAYER URL:: http://www.weissenkirchen.at/kirchenwirt/+ls[2].clip.height+ URL:: http://www.weissenkirchen.at/kirchenwirt/+ls[2].clip.width+ URL:: http://www.weissenkirchen.at/kirchenwirt/+m.maln+ URL:: http://www.weissenkirchen.at/kirchenwirt/+m.mei+ URL:: http://www.weissenkirchen.at/kirchenwirt/)+(nVER=5.5?(pehd!= URL:: http://www.weissenkirchen.at/kirchenwirt/+(nVER5.5?psds:0)+ URL:: http://www.weissenkirchen.at/kirchenwirt/:p.efhd+ URL:: http://www.weissenkirchen.at/kirchenwirt/+(p.isst URL:: http://www.weissenkirchen.at/kirchenwirt/+p.mei+ URL:: http://www.weissenkirchen.at/kirchenwirt/+(p.plmw+2): URL:: http://www.weissenkirchen.at/kirchenwirt/+p.ppad+ URL:: http://www.weissenkirchen.at/kirchenwirt/+p.ppi+ URL:: http://www.weissenkirchen.at/kirchenwirt/+(p.prmw+2): URL:: http://www.weissenkirchen.at/kirchenwirt/+p.pspc+ URL:: http://www.weissenkirchen.at/kirchenwirt/+(p.pver URL:: http://www.weissenkirchen.at/kirchenwirt/+(p.pver?ssiz: URL:: http://www.weissenkirchen.at/kirchenwirt/+(s?p.efsh+ URL:: http://www.weissenkirchen.at/kirchenwirt/+stgme(i).mbnk+ URL:: http://www.weissenkirchen.at/kirchenwirt/)+stittx(i)+(p.pver URL:: http://www.weissenkirchen.at/kirchenwirt//STYLE URL:: http://www.weissenkirchen.at/kirchenwirt/STYLE\n.st_tbcss,.st_tdcss,.st_divcss,.st_ftcss{border:none;padding:0px;margin:0px;}\n/STYLE URL:: http://www.weissenkirchen.at/kirchenwirt//TABLE more than 10 % of the tried pages have status db_gone and many of them are from wrong extracted outlinks. reinh...@thord:bin/dump crawl/dump CrawlDb statistics start: crawl/crawldb Statistics for CrawlDb: crawl/crawldb TOTAL urls: 7199 retry 0: 7048 retry 1: 67 retry 10: 1 retry 12: 1 retry 15: 3 retry 17: 2 retry 18: 2 retry 19: 1 retry 2: 56 retry 4: 1 retry 7: 14 retry 9: 3 min score: 0.0 avg score: 0.014402139 max score: 2.513 status 1 (db_unfetched): 38 status 2 (db_fetched): 6250 status 3 (db_gone): 737 status 4 (db_redir_temp): 148 status 5 (db_redir_perm): 25 status 6 (db_notmodified): 1 CrawlDb statistics: done -- Doğacan Güney
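If those JavaScript-derived links cause more harm than good, one option is simply to leave parse-js out of plugin.includes; a sketch for conf/nutch-site.xml (the rest of the list is illustrative and should match your existing configuration):

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>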
Re: java heap space problem when using the language identifier
On Sat, Jul 18, 2009 at 00:02, MilleBii <mille...@gmail.com> wrote: Looks great: my indexing is now working and I observe constant memory usage instead of the ever-growing slope. Thanks a lot. Why is this patch not in the standard build? Because I never tested it very well, so I never got around to committing the patch. I will try to review it before 1.1 and hopefully include it in the next release. Anyway, I am glad it solves your problem. [...] -- Doğacan Güney
Re: wrong outlinks
Doğacan Güney wrote: These are probably coming from the parse-js plugin. The Javascript parser does a best effort to extract outlinks, but many of the outlinks will be broken. I have looked at JSParseFilter. The heuristic is: private static final String STRING_PATTERN = "(\\\\*(?:\"|\'))([^\\s\"\']+?)(?:\\1)"; // A simple pattern. This allows also invalid URL characters. private static final String URI_PATTERN = "(^|\\s*?)/?\\S+?[/\\.]\\S+($|\\s*)"; // Alternative pattern, which limits valid url characters. If the two patterns match, and the constructed URL is accepted by the URL constructor without a MalformedURLException, the URL is collected. If I understand it correctly, the second pattern matches any non-whitespace run that contains a slash or a dot. In the URLs below I see HTML code and parts of arithmetic expressions; maybe the heuristic can be improved by checking for both cases. I would also appreciate some test code; heuristics especially need to be tested, and until now there is only a main method to test it. [...]
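Until the heuristic improves, a stopgap is to reject the obviously bogus outlinks in conf/regex-urlfilter.txt before they reach the crawldb; the rule below is a hypothetical example and the character class would need tuning per site:

# reject outlinks containing characters that never occur in this site's real URLs
# (JavaScript expression fragments, stray markup)
-[+<>(){}|]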
Re: wrong outlinks
reinhard schwab wrote: [...] If I understand it correctly, the second pattern matches any non-whitespace run that contains a slash or a dot. In the URLs below I see HTML code and parts of arithmetic expressions. The HTML code may come from document.write statements. Maybe the heuristic can be improved by checking for both cases. [...]
Re: dump all outlinks
You can dump segment info to a directory, let's say tmps: $NUTCH_HOME/bin/nutch readseg -dump $segment tmps -nocontent Then go to that directory; you should see a file named dump. grep outlink: dump | cut -f5 -d ' ' > outlinks On Fri, 2009-07-17 at 18:43 +0200, reinhard schwab wrote: Is any tool available to dump all outlinks (filtered outlinks included)? (I know the tools to dump the crawldb, linkdb and segments.) Or do I have to implement such a tool, and if so, how? I want to know the outlinks in order to adapt/manage the URL filters. Parse the contents with the URL filters disabled? reinhard
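Putting the two steps together, a sketch (the segment name is an example, and the cut field number depends on the exact dump format of your Nutch version, so adjust it after eyeballing the dump file):

segment=crawl/segments/20090717123456     # example segment name
$NUTCH_HOME/bin/nutch readseg -dump $segment tmps -nocontent
grep 'outlink:' tmps/dump | cut -d ' ' -f 5 | sort -u > all-outlinks.txt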