Re: http.redirect.max
Hi Alex,

Can you please have a look at NUTCH-1042? Might it be the case that your redirect has a crawl-delay which then falls into the boundary case we witness in that issue? You may want to change your log properties to debug for a while and run some small crawls on your problem URLs, and maybe try adding some LOG.debug statements to see what kind of conditions are being satisfied around the fetcher areas mentioned in NUTCH-1042.

hth

On Thu, Mar 1, 2012 at 8:09 PM, alx...@aim.com wrote:

> Hello,
> I tried 1, 2, and -1 for the config http.redirect.max, but Nutch still postpones redirected URLs to later depths. What is the correct config setting to have Nutch crawl redirected URLs immediately? I need it because I have a restriction that depth be at most 2.
> Thanks.
> Alex.

--
Lewis
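[For reference, raising the fetcher log level in Nutch 1.x is usually a matter of two lines in conf/log4j.properties. The logger names below match the stock 1.x layout with its cmdstdout appender, but this is a sketch from memory, so check your own log4j.properties for the exact appender name:

    # conf/log4j.properties -- raise fetcher/protocol logging to DEBUG
    log4j.logger.org.apache.nutch.fetcher.Fetcher=DEBUG,cmdstdout
    log4j.logger.org.apache.nutch.protocol.http.api=DEBUG,cmdstdout

An extra trace statement of the kind suggested would look roughly like the hypothetical line below; the variable names are illustrative, not the actual Fetcher fields:

    // hypothetical debug trace around the fetcher's redirect handling
    if (LOG.isDebugEnabled()) {
      LOG.debug("redirecting " + url + " -> " + newUrl
          + " (redirects so far: " + redirectCount + ")");
    }
]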
Re: http.redirect.max
Hello,

I tried 1, 2, and -1 for the config http.redirect.max, but Nutch still postpones redirected URLs to later depths. What is the correct config setting to have Nutch crawl redirected URLs immediately? I need it because I have a restriction that depth be at most 2.

Thanks.
Alex.

-----Original Message-----
From: xuyuanme xuyua...@gmail.com
To: user user@nutch.apache.org
Sent: Fri, Feb 24, 2012 1:31 am
Subject: Re: http.redirect.max

> The config file is used for some proof-of-concept testing, so the content might be confusing; please ignore the incorrect parts.
>
> Yes, from my end I can see that the crawl for the website http://www.scotland.gov.uk is redirected as expected. However, the website I tried to crawl is a bit more tricky. Here's what I want to do:
>
> 1. Set http://www.accessdata.fda.gov/scripts/cder/drugsatfda/index.cfm?fuseaction=Search.SearchResults_Browse&DrugInitial=B as the seed page
> 2. Try to crawl one of its links (http://www.accessdata.fda.gov/scripts/cder/drugsatfda/index.cfm?fuseaction=Search.Overview&DrugName=BACIGUENT) as a test
>
> If you click the link, you'll find the website uses redirects and cookies to control page navigation. So I used the protocol-httpclient plugin instead of protocol-http to handle the cookie. However, the redirect does not happen as expected. The only way I can fetch the second link is to manually change the response = getResponse(u, datum, false) call to response = getResponse(u, datum, true) in the org.apache.nutch.protocol.http.api.HttpBase.java file and recompile the lib-http plugin. So my issue is related to this specific site.
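[The property under discussion goes in nutch-site.xml; a minimal sketch follows. The description paraphrases the nutch-default.xml text for 1.x as I recall it (a value of 0 or less records redirect targets for a later fetch round instead of following them immediately, so -1 defers by design), so double-check it against your own release:

    <!-- nutch-site.xml: follow up to 2 redirects within the same fetch -->
    <property>
      <name>http.redirect.max</name>
      <value>2</value>
      <description>Maximum number of redirects the fetcher will follow
      when fetching a page. If 0 (or negative), redirected URLs are not
      followed immediately but are recorded for a later fetch round.
      </description>
    </property>
]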
Re: http.redirect.max
The config file is used for some proof-of-concept testing, so the content might be confusing; please ignore the incorrect parts.

Yes, from my end I can see that the crawl for the website http://www.scotland.gov.uk is redirected as expected. However, the website I tried to crawl is a bit more tricky. Here's what I want to do:

1. Set http://www.accessdata.fda.gov/scripts/cder/drugsatfda/index.cfm?fuseaction=Search.SearchResults_Browse&DrugInitial=B as the seed page
2. Try to crawl one of its links (http://www.accessdata.fda.gov/scripts/cder/drugsatfda/index.cfm?fuseaction=Search.Overview&DrugName=BACIGUENT) as a test

If you click the link, you'll find the website uses redirects and cookies to control page navigation. So I used the protocol-httpclient plugin instead of protocol-http to handle the cookie. However, the redirect does not happen as expected. The only way I can fetch the second link is to manually change the response = getResponse(u, datum, false) call to response = getResponse(u, datum, true) in the org.apache.nutch.protocol.http.api.HttpBase.java file and recompile the lib-http plugin. So my issue is related to this specific site.

lewis john mcgibbney wrote:
> I've checked working with redirects and everything seems to work fine for me. The site I checked, http://www.scotland.gov.uk, temp-redirects to http://home.scotland.gov.uk/home. Nutch gets this fine when I do some tweaking with the nutch-site.xml redirects property: -1 (just to demonstrate; I would usually not set it so).
> Lewis
Re: http.redirect.max
Hi,

Can you post your nutch-site.xml and I will give it a spin.

Thank you

Lewis

On Thu, Feb 23, 2012 at 5:07 AM, xuyuanme xuyua...@gmail.com wrote:

> Just checked the latest code in 1.4, but it's the same. See line 138 at the link below:
>
> http://svn.apache.org/viewvc/nutch/branches/branch-1.4/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java?view=markup
>
> The method just calls getResponse() and sets the followRedirects parameter to false. So I guess the http.redirect.max setting has no effect on it?

--
Lewis
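[For readers without the SVN link handy, the call under discussion sits inside HttpBase.getProtocolOutput(); the local workaround described in this thread amounts to flipping one boolean. This is a sketch of the shape of the code, abbreviated from memory rather than copied from line 138:

    // org.apache.nutch.protocol.http.api.HttpBase (abbreviated sketch)
    public ProtocolOutput getProtocolOutput(Text url, CrawlDatum datum) {
      // ...
      // stock behaviour: the protocol layer never follows redirects itself;
      // redirects are reported back and handled by the fetcher/crawldb
      response = getResponse(u, datum, false);
      // the workaround discussed here: pass true so the HTTP layer follows
      // redirects itself, presumably bypassing the http.redirect.max
      // accounting done above it
      // response = getResponse(u, datum, true);
      // ...
    }
]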
Re: http.redirect.max
Thanks! The config file can be found here:

http://dl.dropbox.com/u/6614015/temp/config.zip

lewis john mcgibbney wrote:
> Hi,
> Can you post your nutch-site.xml and I will give it a spin.
> Thank you
> Lewis
Re: http.redirect.max
I've checked working with redirects and everything seems to work fine for me. The site I checked, http://www.scotland.gov.uk, temp-redirects to http://home.scotland.gov.uk/home. Nutch gets this fine when I do some tweaking with the nutch-site.xml redirects property: -1 (just to demonstrate; I would usually not set it so).

Lewis

On Thu, Feb 23, 2012 at 3:18 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote:

> Additionally, in your nutch-site.xml we don't maintain any query-(plugins), and there is no parse-text plugin either.

On Thu, Feb 23, 2012 at 3:13 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote:

> OK, for starters we don't use crawl-urlfilter.txt anymore; it has been deprecated since Nutch 1.2, iirc. Secondly, what are you trying to achieve here? Your URL filter includes:
>
> +^http://www\.accessdata\.fda\.gov/scripts/cder/drugsatfda/index\.cfm\?fuseaction=Search\.SearchResults_Browse&DrugInitial=B$
> +^http://www\.accessdata\.fda\.gov/scripts/cder/drugsatfda/index\.cfm\?fuseaction=Search\.Overview&DrugName=BACIGUENT$
>
> Your seed URLs are also not exactly what I would expect for a seed list. One last thing: your fetcher.threads.per.host is pretty aggressive; I wouldn't personally set it this high unless it was my own server I was communicating with. So what exactly is it that you are having problems with?
>
> Lewis

On Thu, Feb 23, 2012 at 12:11 PM, xuyuanme xuyua...@gmail.com wrote:

> Thanks! The config file can be found here:
> http://dl.dropbox.com/u/6614015/temp/config.zip

--
Lewis
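[As an aside: if the goal is those two pages plus whatever they redirect to, a regex-urlfilter.txt along the lines below may be less brittle than anchoring two exact URLs. This is a sketch only; the prefix rule is an assumption about what the poster wants, not something stated in the thread:

    # regex-urlfilter.txt -- allow anything under the drugsatfda app,
    # reject everything else
    +^http://www\.accessdata\.fda\.gov/scripts/cder/drugsatfda/
    -.
]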
Re: http.redirect.max
Thanks for the information. But I found that the wiki page http://wiki.apache.org/nutch/RedirectHandling still doesn't have much content about Nutch redirects.

I found that even if I set http.redirect.max=2 and db.ignore.external.links=false, the crawler still can't get redirected pages. With further digging, I found that the lib-http plugin (in Nutch 1.1) contains the following code in org.apache.nutch.protocol.http.api.HttpBase:

  public ProtocolOutput getProtocolOutput(Text url, CrawlDatum datum) {
    ...
    response = getResponse(u, datum, false); // make a request
    ...
  }

  protected abstract Response getResponse(URL url, CrawlDatum datum,
      boolean followRedirects) throws ProtocolException, IOException;

After I changed the call to getResponse(u, datum, true) and recompiled the plugin, the crawler fetches redirected pages as expected. So is this a bug in the lib-http library, or do I have a misunderstanding of how redirects work?

Thanks!

lewis john mcgibbney wrote:
> Hi Rafael,
> The page we are talking about will be added at the link below:
> http://wiki.apache.org/nutch/InternalDocumentation
> and will be available here:
> http://wiki.apache.org/nutch/RedirectHandling
Re: http.redirect.max
Would you give Nutch 1.4 a try? Maybe this bug is already solved?

Remi

On Thursday, February 23, 2012, xuyuanme xuyua...@gmail.com wrote:

> Thanks for the information. But I found that the wiki page http://wiki.apache.org/nutch/RedirectHandling still doesn't have much content about Nutch redirects.
>
> I found that even if I set http.redirect.max=2 and db.ignore.external.links=false, the crawler still can't get redirected pages. With further digging, I found that the lib-http plugin (in Nutch 1.1) calls getResponse(u, datum, false) in org.apache.nutch.protocol.http.api.HttpBase. After I changed the call to getResponse(u, datum, true) and recompiled the plugin, the crawler fetches redirected pages as expected. So is this a bug in the lib-http library, or do I have a misunderstanding of how redirects work?
Re: http.redirect.max
Just checked the latest code in 1.4, but it's the same. See line 138 at the link below:

http://svn.apache.org/viewvc/nutch/branches/branch-1.4/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java?view=markup

The method just calls getResponse() and sets the followRedirects parameter to false. So I guess the http.redirect.max setting has no effect on it?

remi tassing wrote:
> Would you give Nutch 1.4 a try? Maybe this bug is already solved?
> Remi
Re: http.redirect.max
Hi Rafael,

The page we are talking about will be added at the link below:
http://wiki.apache.org/nutch/InternalDocumentation
and will be available here:
http://wiki.apache.org/nutch/RedirectHandling

> I guess the poor documentation of Nutch/Hadoop is the biggest problem for beginners like me. I started with Nutch ~4-6 months ago (not full time, but several hours every week). At first I wrote some plugins (parser/indexer). This was a bit tricky because I had to learn directly from the source, because most of the tutorials/documents were outdated (1.0) or simply wrong.

Please note we are trying to remove as much duplicated documentation regarding Nutch and Hadoop as possible. The Nutch wiki has been updated recently, and this is ongoing work, so hopefully we can improve this more in the near future. As Nutch focuses purely on web crawling, the Hadoop material can be viewed directly in the Hadoop wiki. I've added a link to this on our wiki: Nutch Hadoop Tutorial.

> My crawler is now running and I need to scale it up. The current version runs in local mode, but that's not really fast. So I started to set up a Hadoop cluster (4 nodes) to run Nutch in deploy mode. This is where I am today, and my current questions are:
>
> - I will buy some new hardware for the Hadoop cluster, but I'm not sure about the configuration. Is Nutch I/O-heavy or CPU-heavy?

On a brand new hardware configuration I have not heard of anyone blowing gaskets or anything similar. If there is something wrong, it can usually be fixed by improving configuration.

> - What is the difference between protocol-httpclient and protocol-http? Just SSL and authentication? What about performance?

protocol-httpclient is broken; please see the JIRA issue that has been filed. You will also need to have a look at the code for this, as I am by no means an expert with the protocol-httpclient material.

> - What is a good value for the following configuration parameters?
>   - fetcher.threads.fetch
>   - fetcher.threads.per.queue
>   - mapred.tasktracker.map.tasks.maximum
>   - mapred.tasktracker.reduce.tasks.maximum
>   - mapred.map.tasks
>   - mapred.reduce.tasks

Impossible to say; this varies significantly with the crawl, the network, the nature of the crawl data, etc. You simply need to experiment and read as much existing documentation as possible. Sorry about this one.

> My current hardware is a 4-node cluster of dual-CPU (quad-core Xeon) machines, 32GB RAM, 2x2TB SATA HDD. I know it's impossible to define the always-right value, but a rule of thumb to use as a starting value would be a very great thing and would save me a lot of trial-and-error investigation.

Unfortunately, this is open source software you are using. Maybe Cloudera or some of the other commercially motivated experts can help you with this stuff; this is outwith my experience. Try here: http://wiki.apache.org/nutch/Support

> - What's the difference between fetcher.threads.fetch from the configuration and the -threads option of the crawl command?

This depends on how you wish to monitor/schedule your Nutch crawls. As you know, running individual commands gives you more flexibility/control over how Nutch does the work for you.

> - Is it possible to follow external links only on 301 redirects?

Not got a clue, but I will definitely include this type of material in the wiki page I created above. Maybe you can do a bit of investigation and help me out when I get round to writing up on this stuff.

> - What is happening if a page is marked as db_redir_temp / db_redir_perm? Refetch after db.fetch.interval.default?

Again, we will need to work together to get our heads around this; if you have a look at the code, then maybe we can get something written up in due course.

Sorry about the vague answers, however it's a pretty large task to answer everything fully considering there are ~5-10 questions all in. I'm sure there must be some material in the user@ archives, so please have a look there as well.

hth

Lewis
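[To make the "experiment" advice slightly more concrete, one starting point for the fetcher side might look like the sketch below. The values are illustrative assumptions only, not recommendations from this thread; they mirror what I believe the 1.x defaults to be, as a deliberately polite baseline to tune upward from:

    <!-- nutch-site.xml: conservative fetcher starting points (illustrative) -->
    <property>
      <name>fetcher.threads.fetch</name>
      <value>10</value>
    </property>
    <property>
      <name>fetcher.server.delay</name>
      <value>5.0</value>
    </property>
]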
Re: http.redirect.max
Hi Alex,

This is not really a bug; it's an undocumented feature. db.ignore.external.links prevents the fetcher from breaking out of your set of domains, and this is what you need if you won't crawl the whole web.

Best regards,
Rafael

On 17/Nov/2011, at 23:05, alx...@aim.com wrote:

> Hi,
> Is this issue resolved in https://issues.apache.org/jira/browse/NUTCH-1044 for the case when db.ignore.external.links is set to true?
> Thanks.
> Alex.
>
> -----Original Message-----
> From: Ferdy Galema ferdy.gal...@kalooga.com
> To: user user@nutch.apache.org
> Sent: Thu, Nov 17, 2011 6:01 am
> Subject: Re: http.redirect.max
>
> Thanks for updating the list.
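[The interaction being described: a redirect from domain.com to www.domain.com counts as an external link when db.ignore.external.links is true, so the redirect target is discarded. A sketch of the two properties involved; the values reflect the "follow the redirect" choice, an assumption about what a reader here wants rather than a general recommendation:

    <property>
      <name>db.ignore.external.links</name>
      <value>false</value>
      <description>If true, outlinks (including redirect targets) leading
      outside the source host/domain are discarded.</description>
    </property>
    <property>
      <name>http.redirect.max</name>
      <value>1</value>
    </property>
]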
Re: http.redirect.max
Hi Lewis,

> The honest truth is that there needs to be comprehensive documentation on the wiki for the way that Nutch handles redirects. This is a question that has gone fully unanswered for some time.

That's true.

> In the meantime, can you advise if there is anything over and above the files in nutch-default.xml and the o.a.n.protocol package which you would like to see documented?

I guess the poor documentation of Nutch/Hadoop is the biggest problem for beginners like me. I started with Nutch ~4-6 months ago (not full time, but several hours every week). At first I wrote some plugins (parser/indexer). This was a bit tricky because I had to learn directly from the source, because most of the tutorials/documents were outdated (1.0) or simply wrong.

My crawler is now running and I need to scale it up. The current version runs in local mode, but that's not really fast. So I started to set up a Hadoop cluster (4 nodes) to run Nutch in deploy mode. This is where I am today, and my current questions are:

- I will buy some new hardware for the Hadoop cluster, but I'm not sure about the configuration. Is Nutch I/O-heavy or CPU-heavy?
  http://www.cloudera.com/blog/2010/03/clouderas-support-team-shares-some-basic-hardware-recommendations/
- What is the difference between protocol-httpclient and protocol-http? Just SSL and authentication? What about performance?
- What is a good value for the following configuration parameters?
  - fetcher.threads.fetch
  - fetcher.threads.per.queue
  - mapred.tasktracker.map.tasks.maximum
  - mapred.tasktracker.reduce.tasks.maximum
  - mapred.map.tasks
  - mapred.reduce.tasks

  My current hardware is a 4-node cluster of dual-CPU (quad-core Xeon) machines, 32GB RAM, 2x2TB SATA HDD. I know it's impossible to define the always-right value, but a rule of thumb to use as a starting value would be a very great thing and would save me a lot of trial-and-error investigation.
- What's the difference between fetcher.threads.fetch from the configuration and the -threads option of the crawl command?
- Is it possible to follow external links only on 301 redirects?
- What is happening if a page is marked as db_redir_temp / db_redir_perm? Refetch after db.fetch.interval.default?

I found loads of tutorials, and all of them have the same content: only the very, very basics (how to do your first crawl). I guess comprehensive documentation would be a big step for the amazing Nutch/Hadoop project.

Thanks in advance,
Rafael
Re: http.redirect.max
Hi,

After some investigation I found the problem. I had db.ignore.external.links set to true; this is why the fetcher isn't following the redirection from domain.com to www.domain.com.

Rafael

On 16/Nov/2011, at 20:17, Rafael Pappert wrote:

> Hello List,
> Is it possible to follow HTTP 301 redirects immediately? I tried to set http.redirect.max to 3, but the page is still not indexed. readdb is still showing 1 page as unfetched / db_redir_perm, and I can't find the redirection target in the crawldb. How does Nutch handle redirects?
> Thanks in advance,
> Rafael
Re: http.redirect.max
Thanks for updating the list.

On 11/17/2011 02:52 PM, Rafael Pappert wrote:

> Hi,
> After some investigation I found the problem. I had db.ignore.external.links set to true; this is why the fetcher isn't following the redirection from domain.com to www.domain.com.
> Rafael
Re: http.redirect.max and duplicate fetch/parse
That sounds creepy indeed. It would still need a similar amount of RAM, plus network overhead. Would a bloom filter be useful at all? It takes a lot less space, and I can live with a non-deterministic approach.

On Tuesday 18 October 2011 01:45:20 Sergey A Volkov wrote:

> Hi,
>
> I think some external key-value storage may replace the map. They are fast enough, and the overhead will be insignificant (for many threads). But this is a very creepy solution.
>
> Sergey Volkov

On Tue 18 Oct 2011 03:15:33 AM MSK, Markus Jelsma wrote:

> Anyone?
>
> > Hi,
> >
> > With a >0 value for http.redirect.max there's a possibility of fetching and parsing duplicates. This is especially true for fetch lists with many domains, even with just a few (10+) records per domain/host queue. Assuming there's only one thread per queue, how can we use http.redirect.max and prevent fetching and parsing of duplicates? I'm not a big fan of keeping a map of fetched records in memory, as it'll blow up the heap. We can also not safely remove a record from the fetch queue, as the queue feeder may not have finished and duplicates may still enter a queue.
> >
> > Any thoughts?
> > Thanks,
> > Markus

--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
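[To illustrate the non-deterministic approach being floated: a sketch of per-queue duplicate suppression with Guava's BloomFilter. The class and method names here are hypothetical; Nutch's FetchItemQueue has no such hook, so this is only the shape of the idea, not a patch:

    import com.google.common.hash.BloomFilter;
    import com.google.common.hash.Funnels;
    import java.nio.charset.StandardCharsets;

    // Hypothetical per-queue filter: ~1M URLs at a 1% false-positive rate
    // costs on the order of 1 MB, versus a full in-memory map of URLs.
    public class SeenUrls {
      private final BloomFilter<CharSequence> seen =
          BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8),
              1000000, 0.01);

      // Returns true the first time a URL is offered; a false positive
      // means a URL is occasionally skipped, never fetched twice.
      public synchronized boolean markIfNew(String url) {
        boolean maybeSeen = seen.mightContain(url);
        seen.put(url);
        return !maybeSeen;
      }
    }
]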
Re: http.redirect.max and duplicate fetch/parse
Actually, some KV storages use a bloom filter for a similar purpose. What is your queue size? And what is the redirect rate? If most redirects are not cross-domain and the average number of URLs per domain is not very big, some fixed-size cache in FetchItemQueue may help. But this leads to lots of changes in the fetcher.

On Tue 18 Oct 2011 05:01:06 PM MSK, Markus Jelsma wrote:

> That sounds creepy indeed. It would still need a similar amount of RAM, plus network overhead. Would a bloom filter be useful at all? It takes a lot less space, and I can live with a non-deterministic approach.
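[A fixed-size per-queue cache like the one suggested could be as small as an access-ordered LinkedHashMap that evicts its eldest entry. The capacity and the placement inside FetchItemQueue are assumptions, since no patch was posted in this thread:

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Hypothetical bounded recently-fetched cache for one host queue.
    // Oldest entries are evicted, so memory stays flat no matter how
    // many records pass through the queue.
    public class RecentlyFetched {
      private final Map<String, Boolean> cache;

      public RecentlyFetched(final int capacity) {
        // accessOrder = true turns this into a small LRU
        this.cache = new LinkedHashMap<String, Boolean>(capacity, 0.75f, true) {
          @Override
          protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
            return size() > capacity;
          }
        };
      }

      // Returns true if the URL was not seen among the last
      // `capacity` fetches for this queue.
      public synchronized boolean markIfNew(String url) {
        return cache.put(url, Boolean.TRUE) == null;
      }
    }
]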