Re: Buggy fetchlist's urls

2006-03-15 Thread Jack Tang
On 3/15/06, Jérôme Charron [EMAIL PROTECTED] wrote:
 
  I am not familiar with the Rhino engine, but it is said that JDK 6 adopts it
  as its embedded JavaScript engine. Could we build a RhinoInterpreter first,
  and then evaluate the JavaScript functions to get the result, rather than
  extracting plain text as we do now?

 Hi Jack,

 I recently wrote a small article about search engines and JavaScript (in
 French, sorry):
 http://www.moteurzine.com/archives/2006/moteurzine127.html#2

 My conclusion is simply this: OK, you can imagine using a JavaScript
 interpreter to extract URLs. But in practice, how could you simulate all the
 user interactions? How could you make the Nutch crawler act like a human
 user? Interpreting JavaScript is one thing; knowing all the possible outputs
 of a script is another. No?
Hi Jérôme,
Thanks for your article, even though I don't know any French.
I agree with you that the Nutch crawler cannot simulate all user
interaction, such as onClick and onKeyDown events. And right now I don't
know how a RhinoInterpreter would deal with form submits and
XMLHttpRequest (I need more time to get to know Rhino).

 Regards

 Jérôme

 --
 http://motrech.free.fr/
 http://www.frutch.org/




--
Keep Discovering ... ...
http://www.jroller.com/page/jmars


Re: Buggy fetchlist's urls

2006-03-14 Thread Jack Tang
Hi Andrzej.

In my previous projects, I bound JavaScript functions to the central URL of
a page, but I know that idea does not fit Nutch.

I am not familiar with the Rhino engine, but it is said that JDK 6 adopts it
as its embedded JavaScript engine. Could we build a RhinoInterpreter first,
and then evaluate the JavaScript functions to get the result, rather than
extracting plain text as we do now?

You can find the Javadoc for Rhino here:
http://xmlgraphics.apache.org/batik/javadoc/index.html
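
To make the idea a bit more concrete, here is a rough, untested sketch using
the standalone org.mozilla.javascript API (the snippet and class names are
only illustrative, not Nutch code):

import org.mozilla.javascript.Context;
import org.mozilla.javascript.Scriptable;

// Untested sketch: evaluate a piece of page JavaScript with Rhino and
// read the resulting value back, instead of only scanning the raw text.
public class JsEvalSketch {
    public static void main(String[] args) {
        Context cx = Context.enter();
        try {
            Scriptable scope = cx.initStandardObjects();
            // Hypothetical snippet as it might appear in a fetched page.
            String js = "var base = 'http://example.com/'; base + 'page.html';";
            Object result = cx.evaluateString(scope, js, "inline-script", 1, null);
            System.out.println(Context.toString(result)); // http://example.com/page.html
        } finally {
            Context.exit();
        }
    }
}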

Regards
/Jack

On 3/14/06, Andrzej Bialecki [EMAIL PROTECTED] wrote:
 Florent Gluck wrote:
  Some URLs are totally bogus.  I haven't investigated what could be causing
  this yet, but it looks like it could be a parsing issue.  Some URLs
  contain JavaScript code and others contain HTML tags.
 

 This is a side-effect of our primitive parse-js, which doesn't really
 parse anything, just uses some heuristic to extract possible URLs.
 Unfortunately, as often as not the strings it extracts don't have anything
 to do with URLs.

 If you have suggestions on how to improve it, I'm all ears.
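
 Roughly speaking, it boils down to something like this (a much simplified
 sketch, not the actual plugin code):

 import java.util.ArrayList;
 import java.util.List;
 import java.util.regex.Matcher;
 import java.util.regex.Pattern;

 // Simplified sketch of the heuristic: pull out string literals from the
 // script and keep anything that merely looks like a URL or a path, which
 // is exactly why so many bogus "URLs" get through.
 public class NaiveJsUrlExtractor {
     private static final Pattern LITERAL = Pattern.compile("([\"'])(.+?)\\1");

     public static List<String> extract(String script) {
         List<String> candidates = new ArrayList<String>();
         Matcher m = LITERAL.matcher(script);
         while (m.find()) {
             String s = m.group(2);
             // Very loose test: absolute URLs, or anything with a dot and a slash.
             if (s.startsWith("http://") || (s.indexOf('.') >= 0 && s.indexOf('/') >= 0)) {
                 candidates.add(s);
             }
         }
         return candidates;
     }
 }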

 --
 Best regards,
 Andrzej Bialecki 
  ___. ___ ___ ___ _ _   __
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System Integration
 http://www.sigram.com  Contact: info at sigram dot com





--
Keep Discovering ... ...
http://www.jroller.com/page/jmars


Re: Buggy fetchlist's urls

2006-03-14 Thread Florent Gluck
Hi Andrzej,

Well, I think for now I'll just disable the parse-js plugin since I
don't really need it anyway.
I'll let you know if I ever work on it (I may need it in the future).
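
In case it's useful to anyone else: I simply removed parse-js from the
plugin.includes property in my nutch-site.xml, along these lines (the exact
default value may differ in your version):

<property>
  <name>plugin.includes</name>
  <!-- parse-js removed from the usual parse-(text|html|js) entry -->
  <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)</value>
</property>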

Thanks,
--Flo

Andrzej Bialecki wrote:

 Florent Gluck wrote:

 Some URLs are totally bogus.  I haven't investigated what could be causing
 this yet, but it looks like it could be a parsing issue.  Some URLs
 contain JavaScript code and others contain HTML tags.


 This is a side-effect of our primitive parse-js, which doesn't really
 parse anything, just uses some heuristic to extract possible URLs.
 Unfortunately, as often as not the strings it extracts don't have
 anything to do with URLs.

 If you have suggestions on how to improve it, I'm all ears.




Re: Buggy fetchlist's urls

2006-03-14 Thread Jérôme Charron

 I am not familiar with the Rhino engine, but it is said that JDK 6 adopts it
 as its embedded JavaScript engine. Could we build a RhinoInterpreter first,
 and then evaluate the JavaScript functions to get the result, rather than
 extracting plain text as we do now?

Hi Jack,

I recently wrote a small article about search engines and JavaScript (in
French, sorry):
http://www.moteurzine.com/archives/2006/moteurzine127.html#2

My conclusion is simply this: OK, you can imagine using a JavaScript
interpreter to extract URLs. But in practice, how could you simulate all the
user interactions? How could you make the Nutch crawler act like a human
user? Interpreting JavaScript is one thing; knowing all the possible outputs
of a script is another. No?

Regards

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


Buggy fetchlist's urls

2006-03-13 Thread Florent Gluck
Hi,

I'm using nutch revision 385671 from the trunk.  I'm running it on a
single machine using the local filesystem.
I just started with a seed of a single URL: http://www.osnews.com
Then I ran a crawl cycle of depth 2 (generate/fetch/updatedb) and
dumped the crawl db.  This is where I got quite a surprise:

[EMAIL PROTECTED]:~/tmp$ nutch readdb crawldb -dump dump
[EMAIL PROTECTED]:~/tmp$ grep ^http dump/part-0
http://a.ads.t-online.de/   Version: 4
http://a.as-eu.falkag.net/  Version: 4
http://a.as-rh4.falkag.net/ Version: 4
http://a.as-rh4.falkag.net/server/asldata.jsVersion: 4
http://a.as-test.falkag.net/Version: 4
http://a.as-us.falkag.net/  Version: 4
http://a.as-us.falkag.net/dat/bfx/  Version: 4
http://a.as-us.falkag.net/dat/bgf/  Version: 4
http://a.as-us.falkag.net/dat/bgf/trpix.gif;Version: 4
http://a.as-us.falkag.net/dat/bjf/  Version: 4
http://a.as-us.falkag.net/dat/brf/  Version: 4
http://a.as-us.falkag.net/dat/cjf/  Version: 4
http://a.as-us.falkag.net/dat/cjf/00/13/60/94.jsVersion: 4
http://a.as-us.falkag.net/dat/cjf/00/13/60/96.jsVersion: 4
http://a.as-us.falkag.net/dat/dlv/);QQt.document.write( Version: 4
http://a.as-us.falkag.net/dat/dlv/);document.write( Version: 4
http://a.as-us.falkag.net/dat/dlv/+((QQPc-QQwA)/1000)+  Version: 4
http://a.as-us.falkag.net/dat/dlv/.ads.t-online.de  Version: 4
http://a.as-us.falkag.net/dat/dlv/.as-eu.falkag.net Version: 4
http://a.as-us.falkag.net/dat/dlv/.as-rh4.falkag.netVersion: 4
http://a.as-us.falkag.net/dat/dlv/.as-us.falkag.net Version: 4
http://a.as-us.falkag.net/dat/dlv/://   Version: 4
http://a.as-us.falkag.net/dat/dlv//bbr  Version: 4
http://a.as-us.falkag.net/dat/dlv//big/bbrVersion: 4
http://a.as-us.falkag.net/dat/dlv//center/td/tr/table/body/html Version: 4
http://a.as-us.falkag.net/dat/dlv//divVersion: 4
http://a.as-us.falkag.net/dat/dlv/Banner-Typ/PopUp  Version: 4
http://a.as-us.falkag.net/dat/dlv/ShockwaveFlash.ShockwaveFlash.   Version: 4
http://a.as-us.falkag.net/dat/dlv/afxplay.jsVersion: 4
http://a.as-us.falkag.net/dat/dlv/application/x-shockwave-flash Version: 4
http://a.as-us.falkag.net/dat/dlv/aslmain.jsVersion: 4
http://a.as-us.falkag.net/dat/dlv/text/javascript   Version: 4
http://a.as-us.falkag.net/dat/dlv/window.blur();Version: 4
http://a.as-us.falkag.net/dat/njf/  Version: 4
http://bilbo.counted.com/0/42699/   Version: 4
http://bilbo.counted.com/7/42699/   Version: 4
http://bw.ads.t-online.de/  Version: 4
http://bw.as-eu.falkag.net/ Version: 4
http://bw.as-us.falkag.net/ Version: 4
http://data.as-us.falkag.net/server/asldata.js  Version: 4
http://denux.org/   Version: 4
...

Some URLs are totally bogus.  I haven't investigated what could be causing
this yet, but it looks like it could be a parsing issue.  Some URLs
contain JavaScript code and others contain HTML tags.

Is anyone aware of this?
I can open a bug if needed.

Thanks,
--Flo