DummySSLProtocolSocketFactory problem, please help me!
I have a problem with Nutch 0.8.1 in the DummySSLProtocolSocketFactory class (org.apache.nutch.protocol.httpclient plugin). I have to index pages from a web site that runs over HTTPS and that uses authentication and sessions. My problem is with the management of the sessions. The DummySSLProtocolSocketFactory class implements the ProtocolSocketFactory interface from the HttpClient library. If I modify DummySSLProtocolSocketFactory so that it implements the SecureProtocolSocketFactory interface instead, everything works. Can anyone tell me if this is OK, or if there is another way?
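For context, the "dummy" factory's job is to hand HttpClient sockets that skip certificate validation, so HTTPS pages can be fetched even from hosts with self-signed certificates. Below is a minimal, self-contained sketch of that trust-everything idea using only the JDK's javax.net.ssl API. This illustrates the underlying mechanism only; it is not the Nutch class itself, and the class and method names are invented for the example.

```java
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLSocketFactory;
import javax.net.ssl.TrustManager;
import javax.net.ssl.X509TrustManager;
import java.security.SecureRandom;
import java.security.cert.X509Certificate;

// Hypothetical sketch: an SSLSocketFactory built from a TrustManager that
// accepts any certificate chain. This is the core trick a "dummy" SSL
// factory relies on; a real ProtocolSocketFactory implementation would
// wrap sockets created by this factory.
public class TrustAllSocketFactorySketch {

    public static SSLSocketFactory create() throws Exception {
        TrustManager[] trustAll = new TrustManager[] {
            new X509TrustManager() {
                // Accept every client certificate without checking it.
                public void checkClientTrusted(X509Certificate[] chain, String authType) {}
                // Accept every server certificate without checking it.
                public void checkServerTrusted(X509Certificate[] chain, String authType) {}
                public X509Certificate[] getAcceptedIssuers() {
                    return new X509Certificate[0];
                }
            }
        };
        SSLContext ctx = SSLContext.getInstance("TLS");
        ctx.init(null, trustAll, new SecureRandom());
        return ctx.getSocketFactory();
    }
}
```

Note that this disables certificate validation entirely, which is acceptable for a crawler's test setup but not for anything security-sensitive.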
Has anybody thought of replacing CrawlDb with any kind of relational DB?
Has anybody thought of replacing CrawlDb with any kind of relational DB, MySQL for example? CrawlDb is so difficult to manipulate. I often need to edit several entries in the crawldb, but that would cost too much time waiting for the MapReduce jobs.
Re: Has anybody thought of replacing CrawlDb with any kind of relational DB?
Hi, wangxu. You wrote on 13 April 2007, 1:03:31: Has anybody thought of replacing CrawlDb with any kind of relational DB, MySQL for example? CrawlDb is so difficult to manipulate. I often need to edit several entries in the crawldb, but that would cost too much time waiting for the MapReduce jobs. You think MySQL would give you higher speed? :) Just try DataPark Search with a large number of URLs :) and you will see the difference ;)
Re: Has anybody thought of replacing CrawlDb with any kind of relational DB?
wangxu wrote: Has anybody thought of replacing CrawlDb with any kind of relational DB, MySQL for example? CrawlDb is so difficult to manipulate. I often need to edit several entries in the crawldb, but that would cost too much time waiting for the MapReduce jobs. Please make the following test using your favorite relational DB: * create a table with 300 million rows and 10 columns of mixed type * select 1 million rows, sorted by some value * update 1 million rows to different values. If you find that these operations take less time than with the current crawldb, then we will have to revisit this issue. :) -- Best regards, Andrzej Bialecki. Information Retrieval, Semantic Web, Embedded Unix, System Integration. http://www.sigram.com Contact: info at sigram dot com
Re: Has anybody thought of replacing CrawlDb with any kind of relational DB?
wangxu wrote: Has anybody thought of replacing CrawlDb with any kind of relational DB, MySQL for example? CrawlDb is so difficult to manipulate. I often need to edit several entries in the crawldb, but that would cost too much time waiting for the MapReduce jobs. Once, when I was young and restless, I went down the path of a relational DB. It kind of worked with a few million records. I am not trying to do it anymore. Perhaps your problem is that you process too few records at a time? Quite often I see examples where people fetch a few hundred or a few thousand pages at a time. That might be a good amount for small crawls, but if your goal is bigger, you need bigger segments to get there. -- Sami Siren
Running a Nutch crawler on Eclipse
Hi. I am having certain problems running the Nutch crawler on Eclipse after having followed the tutorial on the Nutch wiki. It says it cannot build the project. Can anyone suggest a good tool? Tanmoy
problem parsing HTML
Hi. I'm trying to figure out how Nutch actually extracts the links out of a piece of HTML. I'm getting confused about what parts TagSoup, NekoHTML, and parse-html play in all this. From what I can see, the regular expression it is using to extract the link is slightly off, but I'm not sure where it actually does this bit. The fragment in question is this: <a href="#|" onclick='s_linkTrackVars="None";s_linkType="o";s_linkName=s_pfxID + ":NewsMaker: National, Political, World, Breaking News and More :" + nm_cur["newsmaker80631"] + " of 8";t=s_account.split(",");s_account2=(t[0].indexOf("aolsvc")==-1?t[0]:t[1]);s_lnk=s_co(this);s_gs(s_account2);return false;' id="newsmaker80631.pre"><img border="0" src="http://cdn...com/ch_news/backbtn" width="25" height="21" alt="Prev"/></a> and it is attempting to find ;s_account2=(t[0].indexOf( TIA Ian -- Ian Holsman [EMAIL PROTECTED]
Re: Has anybody thought of replacing CrawlDb with any kind of relational DB?
Andrzej Bialecki wrote: wangxu wrote: Has anybody thought of replacing CrawlDb with any kind of relational DB, MySQL for example? CrawlDb is so difficult to manipulate. I often need to edit several entries in the crawldb, but that would cost too much time waiting for the MapReduce jobs. Please make the following test using your favorite relational DB: * create a table with 300 million rows and 10 columns of mixed type * select 1 million rows, sorted by some value * update 1 million rows to different values. If you find that these operations take less time than with the current crawldb, then we will have to revisit this issue. :) That is so funny.
Re: problem parsing HTML
It happens in org.apache.nutch.parse.html.DOMContentUtils.getOutlinks(), which is called from org.apache.nutch.parse.html.HtmlParser. Running some simple tests on your fragment below, I get no outlink for it. What version of Nutch are you running? Dennis Kubes. Ian Holsman wrote: Hi. I'm trying to figure out how Nutch actually extracts the links out of a piece of HTML. I'm getting confused about what parts TagSoup, NekoHTML, and parse-html play in all this. From what I can see, the regular expression it is using to extract the link is slightly off, but I'm not sure where it actually does this bit. The fragment in question is this: <a href="#|" onclick='s_linkTrackVars="None";s_linkType="o";s_linkName=s_pfxID + ":NewsMaker: National, Political, World, Breaking News and More :" + nm_cur["newsmaker80631"] + " of 8";t=s_account.split(",");s_account2=(t[0].indexOf("aolsvc")==-1?t[0]:t[1]);s_lnk=s_co(this);s_gs(s_account2);return false;' id="newsmaker80631.pre"><img border="0" src="http://cdn...com/ch_news/backbtn" width="25" height="21" alt="Prev"/></a> and it is attempting to find ;s_account2=(t[0].indexOf( TIA Ian -- Ian Holsman [EMAIL PROTECTED]
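As an aside, one quick way to see why a DOM-based extractor such as the getOutlinks() walk behaves differently from a naive regex scan is to run a regex extractor over markup like the fragment above: a pattern that grabs whatever follows href= happily returns the junk value #| as a "link", whereas a DOM walk can apply per-tag and per-attribute logic. The class below is a hypothetical toy for illustration, not Nutch code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy illustration only: a naive regex-based "outlink extractor".
// Nutch itself parses the page into a DOM (via NekoHTML or TagSoup)
// and walks it in DOMContentUtils.getOutlinks(), which is why inline
// onclick JavaScript should not normally surface as an outlink.
public class NaiveOutlinkSketch {

    // Captures the value following href=, with or without quotes.
    private static final Pattern HREF =
        Pattern.compile("href\\s*=\\s*['\"]?([^'\"\\s>]+)");

    public static List<String> extract(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));  // raw, unvalidated href value
        }
        return links;
    }
}
```

Running this on the fragment in question yields the non-URL value #|, which a DOM-based extractor can simply discard as an invalid outlink.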
Re: Running a Nutch crawler on Eclipse
I run the crawler through Nutch all the time. What are the specific errors that you are getting? Dennis Kubes. Tanmoy Kumar Mukherjee wrote: Hi. I am having certain problems running the Nutch crawler on Eclipse after having followed the tutorial on the Nutch wiki. It says it cannot build the project. Can anyone suggest a good tool? Tanmoy
Re: problem parsing HTML
Hi Dennis, thanks for the fast response. I'm running the SVN head. I'll try narrowing it down a bit further. What led me to believe it was this was looking at what the fetcher was fetching. It could have been we had some bad HTML on our servers, but it's a standard header area. Regards, Ian. On 13/04/2007, at 11:17 AM, Dennis Kubes wrote: It happens in org.apache.nutch.parse.html.DOMContentUtils.getOutlinks(), which is called from org.apache.nutch.parse.html.HtmlParser. Running some simple tests on your fragment below, I get no outlink for it. What version of Nutch are you running? Dennis Kubes. Ian Holsman wrote: Hi. I'm trying to figure out how Nutch actually extracts the links out of a piece of HTML. I'm getting confused about what parts TagSoup, NekoHTML, and parse-html play in all this. From what I can see, the regular expression it is using to extract the link is slightly off, but I'm not sure where it actually does this bit. The fragment in question is this: <a href="#|" onclick='s_linkTrackVars="None";s_linkType="o";s_linkName=s_pfxID + ":NewsMaker: National, Political, World, Breaking News and More :" + nm_cur["newsmaker80631"] + " of 8";t=s_account.split(",");s_account2=(t[0].indexOf("aolsvc")==-1?t[0]:t[1]);s_lnk=s_co(this);s_gs(s_account2);return false;' id="newsmaker80631.pre"><img border="0" src="http://cdn...com/ch_news/backbtn" width="25" height="21" alt="Prev"/></a> and it is attempting to find ;s_account2=(t[0].indexOf( TIA Ian -- Ian Holsman [EMAIL PROTECTED]
RE: Has anybody thought of replacing CrawlDb with any kind of relational DB?
Please make the following test using your favorite relational DB: * create a table with 300 million rows and 10 columns of mixed type * select 1 million rows, sorted by some value * update 1 million rows to different values. If you find that these operations take less time than with the current crawldb, then we will have to revisit this issue. :) That is so funny. I think the original question and the above answer show the big difference in the ways that Nutch is being used. For a small niche search engine with fewer than a few million pages, it would probably be performant to use a relational DB. I have a webdb with 5 million records, and usually fetch 20k pages at a time. It takes me about 1 hour to do an updatedb. To inject just a few dozen new URLs takes about 20 minutes. On a relational DB, I know the injecting would be *much* faster, and I think the updatedb step would be also. Also, for smaller engines, raw throughput doesn't matter as much, and other considerations like robustness and flexibility could be more important. With a relational DB, I could recover from a crashed crawl with a simple SQL update, or remove a set of bogus URLs from the db just as easily. Now, when I want to tweak the webdb in an unanticipated way, I have to write a custom piece of Java to do it. Just thought I'd throw in a perspective from a niche search guy. Howie
RE: Has anybody thought of replacing CrawlDb with any kind of relational DB?
Sorry about the previous crappily formatted message. In brief, my point was that a relational DB might perform better for small niche users, plus you get the flexibility of SQL. No more writing custom code to tweak the webdb. Howie