DummySSLProtocolSocketFactory problem, please help me!

2007-04-12 Thread Gavino Marras


I have a problem with Nutch 0.8.1 in the DummySSLProtocolSocketFactory class
(org.apache.nutch.protocol.httpclient plugin).
I have to index pages from a web site that uses the HTTPS protocol together
with authentication and sessions.

My problem is with the management of sessions.
The DummySSLProtocolSocketFactory class implements the ProtocolSocketFactory
interface from the HttpClient library.
If I modify DummySSLProtocolSocketFactory so that it implements the
SecureProtocolSocketFactory interface, everything works (a rough sketch of
that change is below).


Could anyone tell me whether this is OK, or whether there is another way?

Please help me!
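
For reference, a minimal sketch of the change described above, assuming
Commons HttpClient 3.x. This is not the actual Nutch source; the SSLContext
with its permissive "dummy" trust manager is assumed to be set up elsewhere.
SecureProtocolSocketFactory extends ProtocolSocketFactory and adds one
createSocket overload that layers SSL over an already-open socket:

import java.io.IOException;
import java.net.InetAddress;
import java.net.Socket;
import java.net.UnknownHostException;

import javax.net.ssl.SSLContext;

import org.apache.commons.httpclient.ConnectTimeoutException;
import org.apache.commons.httpclient.params.HttpConnectionParams;
import org.apache.commons.httpclient.protocol.SecureProtocolSocketFactory;

public class DummySSLProtocolSocketFactory implements SecureProtocolSocketFactory {

  // Assumed to be initialized elsewhere with a trust-all ("dummy") trust manager.
  private final SSLContext sslContext;

  public DummySSLProtocolSocketFactory(SSLContext sslContext) {
    this.sslContext = sslContext;
  }

  public Socket createSocket(String host, int port)
      throws IOException, UnknownHostException {
    return sslContext.getSocketFactory().createSocket(host, port);
  }

  public Socket createSocket(String host, int port, InetAddress localAddress,
      int localPort) throws IOException, UnknownHostException {
    return sslContext.getSocketFactory()
        .createSocket(host, port, localAddress, localPort);
  }

  public Socket createSocket(String host, int port, InetAddress localAddress,
      int localPort, HttpConnectionParams params)
      throws IOException, UnknownHostException, ConnectTimeoutException {
    // Connection-timeout handling is omitted in this sketch.
    return sslContext.getSocketFactory()
        .createSocket(host, port, localAddress, localPort);
  }

  // The extra method required by SecureProtocolSocketFactory: wrap an SSL
  // socket around an existing connection, which the plain
  // ProtocolSocketFactory interface does not offer.
  public Socket createSocket(Socket socket, String host, int port, boolean autoClose)
      throws IOException, UnknownHostException {
    return sslContext.getSocketFactory()
        .createSocket(socket, host, port, autoClose);
  }
}

The factory would still be registered for https the usual HttpClient 3.x way,
for example:
Protocol.registerProtocol("https",
    new Protocol("https", new DummySSLProtocolSocketFactory(sslContext), 443));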


Has anybody thought of replacing CrawlDb with any kind of relational DB?

2007-04-12 Thread wangxu
Has anybody thought of replacing CrawlDb with any kind of relational
DB, MySQL for example?

The crawldb is so difficult to manipulate.
I often need to edit several entries in the crawldb,
but that costs too much time waiting for the MapReduce jobs.


Re: Has anybody thought of replacing CrawlDb with any kind of relational DB?

2007-04-12 Thread Nuther
Hi, wangxu.

You wrote on 13 April 2007, 1:03:31:

 Has anybody thought of replacing CrawlDb with any kind of relational
 DB, MySQL for example?

 The crawldb is so difficult to manipulate.
 I often need to edit several entries in the crawldb,
 but that costs too much time waiting for the MapReduce jobs.
You think MySQL would give you higher speed? :)
Just try DataPark Search with a large number of URLs :)
and you will see the difference ;)





Re: Has anybody thought of replacing CrawlDb with any kind of relational DB?

2007-04-12 Thread Andrzej Bialecki

wangxu wrote:

Has anybody thought of replacing CrawlDb with any kind of relational
DB, MySQL for example?

The crawldb is so difficult to manipulate.
I often need to edit several entries in the crawldb,
but that costs too much time waiting for the MapReduce jobs.


Please make the following test using your favorite relational DB:

* create a table with 300 mln rows and 10 columns of mixed type

* select 1 mln rows, sorted by some value

* update 1 mln rows to different values

If you find that these operations take less time than with the current 
crawldb then we will have to revisit this issue. :)
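
A rough JDBC sketch of such a test; the table name, column names, and
connection details below are made up purely for illustration, not taken
from any real schema:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class CrawlDbBenchmark {
  public static void main(String[] args) throws Exception {
    Class.forName("com.mysql.jdbc.Driver");
    Connection conn = DriverManager.getConnection(
        "jdbc:mysql://localhost/crawltest", "user", "password");
    Statement st = conn.createStatement();

    // Select 1 mln rows, sorted by some value (hypothetical 'score' column).
    long start = System.currentTimeMillis();
    ResultSet rs = st.executeQuery(
        "SELECT url, score FROM crawl_entries ORDER BY score DESC LIMIT 1000000");
    int n = 0;
    while (rs.next()) {
      n++;  // iterate to force the full result set to be read
    }
    System.out.println("select of " + n + " rows: "
        + (System.currentTimeMillis() - start) + " ms");

    // Update 1 mln rows to different values.
    start = System.currentTimeMillis();
    int updated = st.executeUpdate(
        "UPDATE crawl_entries SET fetch_interval = fetch_interval + 1 "
        + "WHERE id <= 1000000");
    System.out.println("update of " + updated + " rows: "
        + (System.currentTimeMillis() - start) + " ms");

    conn.close();
  }
}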



--
Best regards,
Andrzej Bialecki 
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Has anybody thought of replacing CrawlDb with any kind of relational DB?

2007-04-12 Thread Sami Siren
wangxu wrote:
 Has anybody thought of replacing CrawlDb with any kind of relational
 DB, MySQL for example?

 The crawldb is so difficult to manipulate.
 I often need to edit several entries in the crawldb,
 but that costs too much time waiting for the MapReduce jobs.
 

Once, when I was young and restless, I went down the relational DB path.
It kind of worked with a few million records. I am not trying to do it
anymore.

Perhaps your problem is that you process too few records at a time?
Quite often I see examples where people fetch a few hundred or a few
thousand pages at a time. That might be a good amount for small crawls,
but if your goal is bigger you need bigger segments to get there.

--
 Sami Siren




Running a Nutch crawler in Eclipse

2007-04-12 Thread Tanmoy Kumar Mukherjee
Hi,
I am having certain problems running the Nutch crawler in Eclipse
after following the tutorial on the Nutch wiki. It says it cannot build
the project. Can anyone suggest a good tool?

Tanmoy


problem parsing HTML

2007-04-12 Thread Ian Holsman

Hi.

I'm trying to figure out how Nutch actually extracts the links out of
a piece of HTML.

I'm getting confused about what parts TagSoup, NekoHTML, and parse-html
play in all this.

From what I can see, the regular expression it is using to extract the
link is slightly off, but I'm not sure where it actually does this bit.

The fragment in question is this:

<a href="#|"
onclick='s_linkTrackVars="None";s_linkType="o";s_linkName=s_pfxID +
":NewsMaker: National, Political, World, Breaking News and More :" +
nm_cur["newsmaker80631"] + " of 8";t=s_account.split(",");s_account2=
(t[0].indexOf("aolsvc")==-1?t[0]:t[1]);s_lnk=s_co(this);s_gs
(s_account2);return false;' id="newsmaker80631.pre"><img border="0"
src="http://cdn...com/ch_news/backbtn" width="25" height="21"
alt="Prev"/></a>


and it is attempting to find ;s_account2=(t[0].indexOf(



TIA
Ian

--
Ian Holsman
[EMAIL PROTECTED]


Re: Has anybody thought of replacing CrawlDb with any kind of relational DB?

2007-04-12 Thread Dennis Kubes



Andrzej Bialecki wrote:

wangxu wrote:

Has anybody thought of replacing CrawlDb with any kind of relational
DB, MySQL for example?

The crawldb is so difficult to manipulate.
I often need to edit several entries in the crawldb,
but that costs too much time waiting for the MapReduce jobs.


Please make the following test using your favorite relational DB:

* create a table with 300 mln rows and 10 columns of mixed type

* select 1 mln rows, sorted by some value

* update 1 mln rows to different values

If you find that these operations take less time than with the current 
crawldb then we will have to revisit this issue. :)


That is so funny.





Re: problem parsing HTML

2007-04-12 Thread Dennis Kubes
It happens in org.apache.nutch.parse.html.DOMContentUtils.getOutlinks(),
which is called from org.apache.nutch.parse.html.HtmlParser.  Running
some simple tests on your fragment below, I get no outlink for it.
What version of Nutch are you running?
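
A small standalone sketch of that kind of test follows. It is not the Nutch
code itself, just a way to see which href values a NekoHTML DOM parse
actually recovers from a fragment; the fragment string and class name here
are shortened stand-ins, not the exact markup from the original mail:

import java.io.StringReader;

import org.cyberneko.html.parsers.DOMParser;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class OutlinkCheck {
  public static void main(String[] args) throws Exception {
    String fragment =
        "<a href=\"#|\" onclick='return false;' id=\"newsmaker.pre\">"
        + "<img src=\"http://example.com/backbtn\" alt=\"Prev\"/></a>";

    // NekoHTML builds a DOM tree even from a bare fragment, wrapping it in
    // HTML/BODY elements; element names are upper-cased by default.
    DOMParser parser = new DOMParser();
    parser.parse(new InputSource(new StringReader(fragment)));

    // Print every anchor and the href value the parser recovered for it.
    NodeList anchors = parser.getDocument().getElementsByTagName("A");
    for (int i = 0; i < anchors.getLength(); i++) {
      Element a = (Element) anchors.item(i);
      System.out.println("href = " + a.getAttribute("href"));
    }
  }
}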


Dennis Kubes

Ian Holsman wrote:

Hi.

I'm trying to figure out how nutch actually extracts the links out of a 
piece of HTML.


I'm getting confused in what parts TagSoup, NekoHTML, and parse-html  
play in all this.


from what I can see the regular expression it is using to extract the 
link is slightly off, but i'm not sure

where it actually does this bit.

the fragment in question is this:

<a href="#|"
onclick='s_linkTrackVars="None";s_linkType="o";s_linkName=s_pfxID +
":NewsMaker: National, Political, World, Breaking News and More :" +
nm_cur["newsmaker80631"] + " of 8";t=s_account.split(",");s_account2=
(t[0].indexOf("aolsvc")==-1?t[0]:t[1]);s_lnk=s_co(this);s_gs
(s_account2);return false;' id="newsmaker80631.pre"><img border="0"
src="http://cdn...com/ch_news/backbtn" width="25" height="21"
alt="Prev"/></a>


and it is attempting to find ;s_account2=(t[0].indexOf(



TIA
Ian

--
Ian Holsman
[EMAIL PROTECTED]


Re: Running a Nutch crawler in Eclipse

2007-04-12 Thread Dennis Kubes
I run the crawler through Eclipse all the time.  What are the specific
errors that you are getting?


Dennis Kubes

Tanmoy Kumar Mukherjee wrote:

Hi,
I am having certain problems running the Nutch crawler in Eclipse
after following the tutorial on the Nutch wiki. It says it cannot build
the project. Can anyone suggest a good tool?


Tanmoy


Re: problem parsing HTML

2007-04-12 Thread Ian Holsman

Hi Dennis,
thanks for the fast response.


I'm running the SVN head.
I'll try narrowing it down a bit further.
What led me to believe it was this was looking at what the fetcher
was fetching. It could be that we had some bad HTML on our servers,
but it's a standard header area.


regards
Ian

On 13/04/2007, at 11:17 AM, Dennis Kubes wrote:

It happens in  
org.apache.nutch.parse.html.DOMContentUtils.getOutlinks() which is  
called from org.apache.nutch.parse.html.HtmlParser.  Running some  
simple tests on your fragment below, I get no outlink for it.
What version of Nutch are you running?


Dennis Kubes

Ian Holsman wrote:

Hi.
I'm trying to figure out how nutch actually extracts the links out  
of a piece of HTML.
I'm getting confused in what parts TagSoup, NekoHTML, and parse- 
html  play in all this.
from what I can see the regular expression it is using to extract  
the link is slightly off, but i'm not sure

where it actually does this bit.
the fragment in question is this:
<a href="#|"
onclick='s_linkTrackVars="None";s_linkType="o";s_linkName=s_pfxID +
":NewsMaker: National, Political, World, Breaking News and More :" +
nm_cur["newsmaker80631"] + " of 8";t=s_account.split(",");s_account2=
(t[0].indexOf("aolsvc")==-1?t[0]:t[1]);s_lnk=s_co(this);s_gs
(s_account2);return false;' id="newsmaker80631.pre"><img border="0"
src="http://cdn...com/ch_news/backbtn" width="25" height="21"
alt="Prev"/></a>

and it is attempting to find ;s_account2=(t[0].indexOf(
TIA
Ian
--
Ian Holsman
[EMAIL PROTECTED]





RE: Has anybody thought of replacing CrawlDb with any kind of relational DB?

2007-04-12 Thread Howie Wang
 Please make the following test using your favorite relational DB:

 * create a table with 300 mln rows and 10 columns of mixed type
 * select 1 mln rows, sorted by some value
 * update 1 mln rows to different values

 If you find that these operations take less time than with the current
 crawldb then we will have to revisit this issue. :)

 That is so funny.

I think the original question and the above answer show the big difference
in the ways that Nutch is being used. For a small niche search engine with
fewer than a few million pages, it would probably be performant to use a
relational DB. I have a webdb with 5 million records, and usually fetch 20k
pages at a time. It takes me about 1 hour to do an updatedb. To inject just
a few dozen new urls takes about 20 minutes. On a relational DB, I know the
injecting would be *much* faster, and I think the updatedb step would be also.

Also, for smaller engines, the raw throughput doesn't matter as much, and
other considerations like robustness and flexibility could be more
important. With a relational DB, I could recover from a crashed crawl with
a simple SQL update. Or I could remove a set of bogus URLs from the db just
as easily. Now when I want to tweak the webdb in an unanticipated way, I
have to write a custom piece of Java to do it.

Just thought I'd throw in a perspective from a niche search guy.

Howie
_
Your friends are close to you. Keep them that way.
http://spaces.live.com/signup.aspx

RE: Has anybody thought of replacing CrawlDb with any kind of relational DB?

2007-04-12 Thread Howie Wang
Sorry about the previous crappily formatted message. In brief, my point was
that a relational DB might perform better for small niche users, and plus
you get the flexibility of SQL. No more writing custom code to tweak the
webdb.

Howie
_
Live Search Maps – find all the local information you need, right when you need 
it.
http://maps.live.com/?icid=wlmtag2FORM=MGAC01