DummySSLProtocolSocketFactory problem, please help me!!!! 2

2007-04-12 Thread Gavino Marras
I have a problem with Nutch 0.8.1 in the DummySSLProtocolSocketFactory class (org.apache.nutch.protocol.httpclient plugin). I have to index pages from a web site that is served over HTTPS and uses authentication and sessions. My problem concerns the management of the sessions.

Have anybody thought of replacing CrawlDb with any kind of Rational DB?

2007-04-12 Thread wangxu
Has anybody thought of replacing CrawlDb with some kind of relational DB, MySQL for example? CrawlDb is so difficult to manipulate. I often need to edit several entries in the crawldb, but that costs too much time waiting for MapReduce.

Re: Have anybody thought of replacing CrawlDb with any kind of Rational DB?

2007-04-12 Thread Nuther
Hi, wangxu. You wrote on April 13, 2007, 1:03:31: Has anybody thought of replacing CrawlDb with some kind of relational DB, MySQL for example? CrawlDb is so difficult to manipulate. I often need to edit several entries in the crawldb, but that costs too much time waiting for the

Re: Have anybody thought of replacing CrawlDb with any kind of Rational DB?

2007-04-12 Thread Andrzej Bialecki
wangxu wrote: Has anybody thought of replacing CrawlDb with some kind of relational DB, MySQL for example? CrawlDb is so difficult to manipulate. I often need to edit several entries in the crawldb, but that costs too much time waiting for MapReduce. Please make the following

Re: Have anybody thought of replacing CrawlDb with any kind of Rational DB?

2007-04-12 Thread Sami Siren
wangxu wrote: Has anybody thought of replacing CrawlDb with some kind of relational DB, MySQL for example? CrawlDb is so difficult to manipulate. I often need to edit several entries in the crawldb, but that costs too much time waiting for MapReduce. Once when I was young

Runing a nutch crawler on Eclipse

2007-04-12 Thread Tanmoy Kumar Mukherjee
Hi. I am having problems running the Nutch crawler in Eclipse after following the tutorial on the Nutch wiki. It says it cannot build the project. Can anyone suggest a good tool? Tanmoy

problem parsing HTML

2007-04-12 Thread Ian Holsman
Hi. I'm trying to figure out how Nutch actually extracts the links from a piece of HTML. I'm confused about what parts TagSoup, NekoHTML, and parse-html play in all this. From what I can see, the regular expression it is using to extract the link is slightly off, but I'm not

Re: Have anybody thought of replacing CrawlDb with any kind of Rational DB?

2007-04-12 Thread Dennis Kubes
Andrzej Bialecki wrote: wangxu wrote: Has anybody thought of replacing CrawlDb with some kind of relational DB, MySQL for example? CrawlDb is so difficult to manipulate. I often need to edit several entries in the crawldb, but that costs too much time waiting for MapReduce.

Re: problem parsing HTML

2007-04-12 Thread Dennis Kubes
It happens in org.apache.nutch.parse.html.DOMContentUtils.getOutlinks(), which is called from org.apache.nutch.parse.html.HtmlParser. Running some simple tests on your fragment below, I get no outlink for this. What version of Nutch are you running? Dennis Kubes Ian Holsman wrote: Hi. I'm

Re: Runing a nutch crawler on Eclipse

2007-04-12 Thread Dennis Kubes
I run the crawler through Nutch all the time. What are the specific errors that you are getting? Dennis Kubes Tanmoy Kumar Mukherjee wrote: Hi. I am having problems running the Nutch crawler in Eclipse after following the tutorial on the Nutch wiki. It says it cannot build

Re: problem parsing HTML

2007-04-12 Thread Ian Holsman
Hi Dennis, thanks for the fast response. I'm running the SVN head. I'll try narrowing it down a bit further. What led me to believe this was the cause was looking at what the fetcher was fetching. It could be that we had some bad HTML on our servers, but it's a standard header area. regards

RE: Have anybody thought of replacing CrawlDb with any kind of Rational DB?

2007-04-12 Thread Howie Wang
Please make the following test using your favorite relational DB:
* create a table with 300 mln rows and 10 columns of mixed type
* select 1 mln rows, sorted by some value
* update 1 mln rows to different values
If you find that these operations take less time than with

RE: Have anybody thought of replacing CrawlDb with any kind of Rational DB?

2007-04-12 Thread Howie Wang
Sorry about the previous crappily formatted message. In brief, my point was that a relational DB might perform better for small niche users, and plus you get the flexibility of SQL. No more writing custom code to tweak webdb. Howie
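
For readers who want to try the create/select/update test quoted in this thread at a small scale, here is a minimal Python sketch. It is only an illustration under stated assumptions: SQLite (via the standard sqlite3 module) stands in for "your favorite relational DB", the table name, column names and row counts are hypothetical and are not Nutch's actual CrawlDb schema, and the default sizes are far below the 300 mln / 1 mln figures proposed in the thread.

```python
import random
import sqlite3
import time


def run_benchmark(total_rows=100_000, batch=10_000, seed=0):
    """Scaled-down create/select/update test; returns (timings, selected, updated)."""
    rng = random.Random(seed)
    conn = sqlite3.connect(":memory:")
    # Hypothetical 10-column table of mixed types (not the real CrawlDb layout).
    conn.execute(
        "CREATE TABLE crawl ("
        "id INTEGER PRIMARY KEY, url TEXT, status INTEGER, score REAL, "
        "fetch_time INTEGER, retries INTEGER, fetch_interval REAL, "
        "signature TEXT, modified INTEGER, meta TEXT)"
    )
    timings = {}

    # Step 1: create and fill the table.
    start = time.perf_counter()
    rows = (
        (i, f"http://example.com/{i}", rng.randint(0, 5), rng.random(),
         rng.randint(0, 10**9), rng.randint(0, 3), 30.0,
         f"{rng.getrandbits(64):016x}", rng.randint(0, 10**9), "")
        for i in range(total_rows)
    )
    conn.executemany("INSERT INTO crawl VALUES (?,?,?,?,?,?,?,?,?,?)", rows)
    conn.commit()
    timings["insert"] = time.perf_counter() - start

    # Step 2: select a batch of rows, sorted by some value (here: score).
    start = time.perf_counter()
    selected = conn.execute(
        "SELECT id, score FROM crawl ORDER BY score DESC LIMIT ?", (batch,)
    ).fetchall()
    timings["select_sorted"] = time.perf_counter() - start

    # Step 3: update the selected rows to different values.
    start = time.perf_counter()
    conn.executemany(
        "UPDATE crawl SET status = ?, score = ? WHERE id = ?",
        ((9, 0.0, row_id) for row_id, _ in selected),
    )
    conn.commit()
    timings["update"] = time.perf_counter() - start

    # Status 9 is never generated in step 1, so this counts only updated rows.
    updated = conn.execute(
        "SELECT COUNT(*) FROM crawl WHERE status = 9"
    ).fetchone()[0]
    conn.close()
    return timings, len(selected), updated
```

Calling run_benchmark() and printing the returned timings gives a rough per-step cost on one machine. An in-memory SQLite run at this size says little about behaviour at hundreds of millions of rows, so the numbers are only a starting point for the comparison discussed above.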