Difference between Feed parser and Rss Parser

2009-07-17 Thread Saurabh Suman
Hi, what is the difference between FeedParser and RssParser? I have RSS feed URLs in seed.txt. Will Nutch call FeedParser or RssParser to parse them?

Re: Difference between Feed parser and Rss Parser

2009-07-17 Thread Doğacan Güney
On Fri, Jul 17, 2009 at 09:21, Saurabh Suman saurabhsuman...@rediff.com wrote: Hi, what is the difference between FeedParser and RssParser? I have RSS feed URLs in seed.txt. Will Nutch call FeedParser or RssParser to parse them? Depends on which plugin is included in your conf. Feed plugin extracts
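Which parser handles a feed is decided by the plugin.includes property in the conf. A minimal nutch-site.xml sketch that enables the feed plugin rather than parse-rss (plugin ids as in a typical Nutch 1.x conf; adjust the value to match your own default plugin.includes):

  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(text|html|js)|feed|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    <description>Regular expression of plugin ids to load; listing feed here (or parse-rss instead) picks which feed parser Nutch calls.</description>
  </property>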

recrawling

2009-07-17 Thread Neeti Gupta
I want my crawl to pick up the updated contents of a web page as soon as the website gets updated. I have used the page info of the web page, but it's not 100% reliable. Can anyone suggest any other ways of doing that? Please help, it's urgent.

How segment depends on depth

2009-07-17 Thread Saurabh Suman
As I observed, Nutch makes a new folder with the current timestamp in the segments directory for each depth. Does the new folder made under the segments directory while crawling at depth 2 contain all URLs and parsed text of the previous depth, or does it just overwrite the previous one? If I search for a query

Issue with Parse metaData while crawling RSSFeed URL

2009-07-17 Thread Saurabh Suman
Hi, I am crawling a feed URL, http://blog.taragana.com/n/c/india/feed/. I have set depth = 2. I am using FeedParser.java to parse it. For depth 1, in parseData in the segments folder, the parse metadata for the URL http://blog.taragana.com/n/30-child-labourers-rescued-in-agra-and-firozabad-111417/ is

Re: Issue with Parse metaData while crawling RSSFeed URL

2009-07-17 Thread Doğacan Güney
On Fri, Jul 17, 2009 at 14:15, Saurabh Suman saurabhsuman...@rediff.com wrote: Hi, I am crawling a feed URL, http://blog.taragana.com/n/c/india/feed/. I have set depth = 2. I am using FeedParser.java to parse it. For depth 1, in parseData in the segments folder, the parse metadata for a URL

Re: Why cant I inject a google link to the database?

2009-07-17 Thread reinhard schwab
http://www.google.se/robots.txt google disallows it. User-agent: * Allow: /searchhistory/ Disallow: /search Larsson85 wrote: Why isn't Nutch able to handle links from Google? I tried to start a crawl from the following URL http://www.google.se/search?q=site:se&hl=sv&start=100&sa=N And all

Re: Why cant I inject a google link to the database?

2009-07-17 Thread reinhard schwab
It seems that Google is blocking the user agent. I get this reply with lwp-request: Your client does not have permission to get URL /search?q=site:se&hl=sv&start=100&sa=N from this server. (Client IP address: XX.XX.XX.XX) Please see Google's Terms of Service posted at

Re: Why cant I inject a google link to the database?

2009-07-17 Thread Larsson85
Any workaround for this? Making Nutch identify as something else, or something similar? reinhard schwab wrote: http://www.google.se/robots.txt google disallows it. User-agent: * Allow: /searchhistory/ Disallow: /search Larsson85 wrote: Why isn't Nutch able to handle links from

Re: Why cant I inject a google link to the database?

2009-07-17 Thread reinhard schwab
You can check the response of Google by dumping the segment: bin/nutch readseg -dump crawl/segments/... somedirectory reinhard schwab wrote: it seems that Google is blocking the user agent. I get this reply with lwp-request: Your client does not have permission to get URL

Re: Why cant I inject a google link to the database?

2009-07-17 Thread Doğacan Güney
On Fri, Jul 17, 2009 at 15:23, Larsson85 kristian1...@hotmail.com wrote: Any workaround for this? Making Nutch identify as something else, or something similar? Also note that Nutch does not crawl anything with '?' or '&' in the URL. Check out crawl-urlfilter.txt or regex-urlfilter.txt (depending
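The rule being referred to sits in crawl-urlfilter.txt / regex-urlfilter.txt; in a stock conf it usually looks like the lines below, and commenting it out (or narrowing it) lets URLs with query strings through:

  # skip URLs containing certain characters as probable queries, etc.
  -[?*!@=]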

Re: Why cant I inject a google link to the database?

2009-07-17 Thread Doğacan Güney
2009/7/17 Doğacan Güney doga...@gmail.com: On Fri, Jul 17, 2009 at 15:23, Larsson85 kristian1...@hotmail.com wrote: Any workaround for this? Making Nutch identify as something else, or something similar? Also note that Nutch does not crawl anything with '?' or '&' in the URL. Check out Oops.

Re: Why cant I inject a google link to the database?

2009-07-17 Thread reinhard schwab
Identify Nutch as a popular user agent such as Firefox. Larsson85 wrote: Any workaround for this? Making Nutch identify as something else, or something similar? reinhard schwab wrote: http://www.google.se/robots.txt google disallows it. User-agent: * Allow: /searchhistory/
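A minimal nutch-site.xml sketch of how the agent string is usually changed (property name as in Nutch 1.x; the value shown is only an example, and as the rest of this thread points out, it does not get around robots.txt):

  <property>
    <name>http.agent.name</name>
    <value>Mozilla/5.0 (compatible; MyCrawler)</value>
    <description>HTTP 'User-Agent' request header sent by the fetcher.</description>
  </property>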

Re: Why cant I inject a google link to the database?

2009-07-17 Thread Dennis Kubes
This isn't a user agent problem. No matter what user agent you use, Nutch is still not going to crawl this page because Nutch is correctly following robots.txt directives which block access. To change this would be to make the crawler impolite. A well behaved crawler should follow the

Re: Why cant I inject a google link to the database?

2009-07-17 Thread Larsson85
I think I need more help on how to do this. I tried using <property> <name>http.robots.agents</name> <value>Mozilla/5.0*</value> <description>The agent strings we'll look for in robots.txt files, comma-separated, in decreasing order of precedence. You should put the value of http.agent.name as the

Re: Why cant I inject a google link to the database?

2009-07-17 Thread Jake Jacobson
Larsson85, please read past responses. Google is blocking all crawlers, not just yours, from indexing their search results. Because of their robots.txt directives you will not be able to do this. If you place a sign on your house saying DO NOT ENTER, and I entered, you would be very upset. That

Re: Why cant I inject a google link to the database?

2009-07-17 Thread reinhard schwab
You are right. robots.txt clearly disallows this page; this page will not be fetched. I remember Google has some APIs to access the search. http://code.google.com/intl/de-DE/apis/soapsearch/index.html http://code.google.com/intl/de-DE/apis/ajaxsearch/ reinhard Dennis Kubes wrote: This isn't

Re: Why cant I inject a google link to the database?

2009-07-17 Thread Brian Ulicny
1. Save the results page. 2. Grep the links out of it. 3. Put the results in a doc in your urls directory 4. Do: bin/nutch crawl urls On Fri, 17 Jul 2009 02:32 -0700, Larsson85 kristian1...@hotmail.com wrote: I think I need more help on how to do this. I tried using property
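A rough shell sketch of those four steps, assuming the results page was already saved from a browser as results.html (the grep/sed patterns, file names, and crawl flags are only illustrative):

  # 2. grep the links out of the saved page and put them in the urls directory
  grep -o 'href="http[^"]*"' results.html | sed -e 's/^href="//' -e 's/"$//' > urls/seed.txt
  # 4. crawl from those seeds
  bin/nutch crawl urls -dir crawl -depth 2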

Re: Why cant I inject a google link to the database?

2009-07-17 Thread Andrzej Bialecki
Brian Ulicny wrote: 1. Save the results page. 2. Grep the links out of it. 3. Put the results in a doc in your urls directory 4. Do: bin/nutch crawl urls Please note, we are not saying it is impossible to do this with Nutch (e.g. by setting the agent string to mimic a browser), but we

Re: Why cant I inject a google link to the database?

2009-07-17 Thread reinhard schwab
You can also use commons-httpclient or htmlunit to access the Google search; these tools are not crawlers. With htmlunit it would be easy to get the outlinks. I strongly advise you not to misuse Google search with too many requests; Google will block you, I assume. By using a search API, you are

Re: java heap space problem when using the language identifier

2009-07-17 Thread MilleBii
Never applied a patch so far... so I will do my best. 2009/7/17 Doğacan Güney doga...@gmail.com On Fri, Jul 17, 2009 at 00:30, MilleBii mille...@gmail.com wrote: Just trying indexing a smaller segment, 300k URLs... and the memory is just going up and up... but it does NOT hit the physical
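For reference, applying a patch to a Nutch source checkout usually comes down to something like this (the patch file name is a placeholder; -p0 vs -p1 depends on how the diff was made):

  cd nutch-1.0                  # root of the source tree
  patch -p0 < NUTCH-XXXX.patch  # try --dry-run first if unsure
  ant                           # rebuild; the job jar ends up under build/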

Re: java heap space problem when using the language identifier

2009-07-17 Thread MilleBii
Actually, the question I had when looking at the logs: why are there so many plugins loading? I miss the logic. 2009/7/17 MilleBii mille...@gmail.com Never applied a patch so far... so I will do my best. 2009/7/17 Doğacan Güney doga...@gmail.com On Fri, Jul 17, 2009 at 00:30,

Re: How segment depends on depth

2009-07-17 Thread MilleBii
When you run the Nutch index step and give it the list of segments, it will produce one single index. Segments are different chunks of your crawldb. I guess what is less clear to me is, once the expiry date has passed, URLs will be recrawled and duplicated into different segments; not sure how it is taken
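A sketch of the index step being described, with the Nutch 1.0-era command line (paths are illustrative; check the usage printed by bin/nutch index for your version):

  # one index built from all segments at once
  bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*
  # optionally delete duplicate documents across segments afterwards
  bin/nutch dedup crawl/indexes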

Re: java heap space problem when using the language identifier

2009-07-17 Thread MilleBii
Looks great, my indexing is now working and I observe constant memory usage instead of the ever-growing slope. Thanks a lot; why is this patch not in the standard build? I just get some weird message in Ant/Eclipse: [jar] Warning: skipping jar archive

Re: wrong outlinks

2009-07-17 Thread Doğacan Güney
On Fri, Jul 17, 2009 at 22:48, reinhard schwab reinhard.sch...@aon.at wrote: when I crawl a domain such as http://www.weissenkirchen.at/ Nutch extracts these outlinks. Do they come from some heuristics? These are probably coming from the parse-js plugin. The JavaScript parser does a best effort to
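If those JavaScript-derived outlinks are unwanted, the usual fix is to leave parse-js out of plugin.includes in nutch-site.xml; a sketch, with the plugin list trimmed to what a typical Nutch 1.x conf carries (adjust to your own setup):

  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    <description>parse-(text|html) without |js, so the JavaScript outlink extractor is never loaded.</description>
  </property>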

Re: java heap space problem when using the language identifier

2009-07-17 Thread Doğacan Güney
On Sat, Jul 18, 2009 at 00:02, MilleBii mille...@gmail.com wrote: Looks great, my indexing is now working and I observe constant memory usage instead of the ever-growing slope. Thanks a lot; why is this patch not in the standard build? Because I never tested it very well, so I never got to

Re: wrong outlinks

2009-07-17 Thread reinhard schwab
Doğacan Güney wrote: On Fri, Jul 17, 2009 at 22:48, reinhard schwab reinhard.sch...@aon.at wrote: when I crawl a domain such as http://www.weissenkirchen.at/ Nutch extracts these outlinks. Do they come from some heuristics? These are probably coming from the parse-js plugin.

Re: wrong outlinks

2009-07-17 Thread reinhard schwab
reinhard schwab wrote: Doğacan Güney wrote: On Fri, Jul 17, 2009 at 22:48, reinhard schwab reinhard.sch...@aon.at wrote: when I crawl a domain such as http://www.weissenkirchen.at/ Nutch extracts these outlinks. Do they come from some heuristics? These are

Re: dump all outlinks

2009-07-17 Thread kevin chen
You can dump segment info to a directory, let's say tmps: $NUTCH_HOME/bin/nutch readseg -dump $segment tmps -nocontent Then go to the directory; you should see a file named dump. grep outlink: dump | cut -f5 -d outlinks On Fri, 2009-07-17 at 18:43 +0200, reinhard schwab wrote: is any tool available