Re: Buggy fetchlist' urls
On 3/15/06, Jérôme Charron [EMAIL PROTECTED] wrote:

> > I am not familiar with the Rhino engine, but it is said that JDK 6 adopted it as its embedded JavaScript engine. Could we build a RhinoInterpreter first, and then evaluate the JavaScript functions to get their results, rather than extracting plain text as we do now?
>
> Hi Jack,
>
> I recently wrote a short article about search engines and JavaScript (in French, sorry):
> http://www.moteurzine.com/archives/2006/moteurzine127.html#2
>
> My conclusion is simple: yes, you can imagine using a JavaScript interpreter to extract URLs. But in practice, how would you simulate all the user interaction? How could you make the Nutch crawler act like a human user? Interpreting JavaScript is one thing; knowing all the possible outputs of a script is another one. No?

Hi Jérôme,

Thanks for your article, even though I don't know any French. I agree with you that the Nutch crawler cannot simulate all user interaction, such as onClick and onKeyDown events. And I don't yet know how a RhinoInterpreter would handle form submits and XMLHttpRequest (I need more time with Rhino).

Regards

> Jérôme
> --
> http://motrech.free.fr/
> http://www.frutch.org/

--
Keep Discovering ... ...
http://www.jroller.com/page/jmars
Re: Buggy fetchlist' urls
Hi Andrzej,

In my previous projects I bound JavaScript functions to the page's URL, and I know that idea does not fit Nutch. I am not familiar with the Rhino engine, but it is said that JDK 6 adopted it as its embedded JavaScript engine. Could we build a RhinoInterpreter first, and then evaluate the JavaScript functions to get their results, rather than extracting plain text as we do now? You can find Javadoc covering Rhino here:
http://xmlgraphics.apache.org/batik/javadoc/index.html

Regards
/Jack

On 3/14/06, Andrzej Bialecki [EMAIL PROTECTED] wrote:

> Florent Gluck wrote:
> > Some urls are totally bogus. I didn't investigate what could be causing this yet, but it looks like it could be a parsing issue. Some urls contain some javascript code and others contain some html tags.
>
> This is a side-effect of our primitive parse-js, which doesn't really parse anything; it just uses some heuristics to extract possible URLs. Unfortunately, as often as not the strings it extracts don't have anything to do with URLs. If you have suggestions on how to improve it, I'm all ears.
>
> --
> Best regards,
> Andrzej Bialecki
> Information Retrieval, Semantic Web
> Embedded Unix, System Integration
> http://www.sigram.com
> Contact: info at sigram dot com

--
Keep Discovering ... ...
http://www.jroller.com/page/jmars
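A minimal sketch of what Jack is proposing, using the javax.script API through which JDK 6 exposed its bundled Rhino engine. The class name, sample script, and the makeLink function are purely illustrative, and on recent JDKs no JavaScript engine is bundled at all, so the lookup may return null:

```java
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;

public class JsUrlEval {
    public static void main(String[] args) throws Exception {
        // Look up whatever JavaScript engine the JDK bundles
        // (Rhino on JDK 6, Nashorn on 8-14, none on 15+).
        ScriptEngine engine = new ScriptEngineManager().getEngineByName("javascript");
        if (engine == null) {
            System.out.println("no javascript engine available");
            return;
        }
        // Evaluate the link-building function instead of scraping its source text.
        Object url = engine.eval(
            "function makeLink(id) { return 'http://www.osnews.com/story.php?id=' + id; }" +
            "makeLink(2 * 7);");
        System.out.println(url);
    }
}
```

This would recover the URL a script actually builds at runtime, which plain text extraction cannot; it still would not cover Jérôme's objection about user-driven events like onClick.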
Re: Buggy fetchlist' urls
Hi Andrzej,

Well, I think for now I'll just disable the parse-js plugin since I don't really need it anyway. I'll let you know if I ever work on it (I may need it in the future).

Thanks,
--Flo

Andrzej Bialecki wrote:

> Florent Gluck wrote:
> > Some urls are totally bogus. I didn't investigate what could be causing this yet, but it looks like it could be a parsing issue. Some urls contain some javascript code and others contain some html tags.
>
> This is a side-effect of our primitive parse-js, which doesn't really parse anything; it just uses some heuristics to extract possible URLs. Unfortunately, as often as not the strings it extracts don't have anything to do with URLs. If you have suggestions on how to improve it, I'm all ears.
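On Andrzej's request for suggestions: one cheap improvement, short of a full interpreter, would be a sanity filter on each candidate before it reaches the fetchlist. A sketch (not actual Nutch code; the class and method names are made up) using only java.net.URI, rejecting strings containing characters that never appear in real links:

```java
import java.net.URI;
import java.net.URISyntaxException;

public class CandidateFilter {
    /** Accept a candidate only if it parses as an absolute http(s) URI
     *  and contains none of the characters typical of code or markup fragments. */
    static boolean looksLikeUrl(String s) {
        if (s.matches(".*[(){};\\s\"'<>].*")) return false; // code/markup debris
        try {
            URI u = new URI(s);
            return u.isAbsolute() && u.getHost() != null
                && (u.getScheme().equals("http") || u.getScheme().equals("https"));
        } catch (URISyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(looksLikeUrl("http://www.osnews.com/"));                              // true
        System.out.println(looksLikeUrl("http://a.as-us.falkag.net/dat/dlv/);document.write(")); // false
        System.out.println(looksLikeUrl("window.blur();"));                                      // false
    }
}
```

This would not make the heuristic extract better candidates, but it would stop the worst fragments from being recorded in the crawldb.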
Re: Buggy fetchlist' urls
> I am not familiar with the Rhino engine, but it is said that JDK 6 adopted it as its embedded JavaScript engine. Could we build a RhinoInterpreter first, and then evaluate the JavaScript functions to get their results, rather than extracting plain text as we do now?

Hi Jack,

I recently wrote a short article about search engines and JavaScript (in French, sorry):
http://www.moteurzine.com/archives/2006/moteurzine127.html#2

My conclusion is simple: yes, you can imagine using a JavaScript interpreter to extract URLs. But in practice, how would you simulate all the user interaction? How could you make the Nutch crawler act like a human user? Interpreting JavaScript is one thing; knowing all the possible outputs of a script is another one. No?

Regards
Jérôme
--
http://motrech.free.fr/
http://www.frutch.org/
Buggy fetchlist' urls
Hi,

I'm using nutch revision 385671 from the trunk. I'm running it on a single machine using the local filesystem. I just started with a seed of one single url: http://www.osnews.com
Then I ran a crawl cycle of depth 2 (generate/fetch/updatedb) and dumped the crawl db. Here is where I got quite surprised:

[EMAIL PROTECTED]:~/tmp$ nutch readdb crawldb -dump dump
[EMAIL PROTECTED]:~/tmp$ grep ^http dump/part-0
http://a.ads.t-online.de/    Version: 4
http://a.as-eu.falkag.net/    Version: 4
http://a.as-rh4.falkag.net/    Version: 4
http://a.as-rh4.falkag.net/server/asldata.js    Version: 4
http://a.as-test.falkag.net/    Version: 4
http://a.as-us.falkag.net/    Version: 4
http://a.as-us.falkag.net/dat/bfx/    Version: 4
http://a.as-us.falkag.net/dat/bgf/    Version: 4
http://a.as-us.falkag.net/dat/bgf/trpix.gif;    Version: 4
http://a.as-us.falkag.net/dat/bjf/    Version: 4
http://a.as-us.falkag.net/dat/brf/    Version: 4
http://a.as-us.falkag.net/dat/cjf/    Version: 4
http://a.as-us.falkag.net/dat/cjf/00/13/60/94.js    Version: 4
http://a.as-us.falkag.net/dat/cjf/00/13/60/96.js    Version: 4
http://a.as-us.falkag.net/dat/dlv/);QQt.document.write(    Version: 4
http://a.as-us.falkag.net/dat/dlv/);document.write(    Version: 4
http://a.as-us.falkag.net/dat/dlv/+((QQPc-QQwA)/1000)+    Version: 4
http://a.as-us.falkag.net/dat/dlv/.ads.t-online.de    Version: 4
http://a.as-us.falkag.net/dat/dlv/.as-eu.falkag.net    Version: 4
http://a.as-us.falkag.net/dat/dlv/.as-rh4.falkag.net    Version: 4
http://a.as-us.falkag.net/dat/dlv/.as-us.falkag.net    Version: 4
http://a.as-us.falkag.net/dat/dlv/://    Version: 4
http://a.as-us.falkag.net/dat/dlv//bbr    Version: 4
http://a.as-us.falkag.net/dat/dlv//big/bbr    Version: 4
http://a.as-us.falkag.net/dat/dlv//center/td/tr/table/body/html    Version: 4
http://a.as-us.falkag.net/dat/dlv//div    Version: 4
http://a.as-us.falkag.net/dat/dlv/Banner-Typ/PopUp    Version: 4
http://a.as-us.falkag.net/dat/dlv/ShockwaveFlash.ShockwaveFlash.    Version: 4
http://a.as-us.falkag.net/dat/dlv/afxplay.js    Version: 4
http://a.as-us.falkag.net/dat/dlv/application/x-shockwave-flash    Version: 4
http://a.as-us.falkag.net/dat/dlv/aslmain.js    Version: 4
http://a.as-us.falkag.net/dat/dlv/text/javascript    Version: 4
http://a.as-us.falkag.net/dat/dlv/window.blur();    Version: 4
http://a.as-us.falkag.net/dat/njf/    Version: 4
http://bilbo.counted.com/0/42699/    Version: 4
http://bilbo.counted.com/7/42699/    Version: 4
http://bw.ads.t-online.de/    Version: 4
http://bw.as-eu.falkag.net/    Version: 4
http://bw.as-us.falkag.net/    Version: 4
http://data.as-us.falkag.net/server/asldata.js    Version: 4
http://denux.org/    Version: 4
...

Some urls are totally bogus. I didn't investigate what could be causing this yet, but it looks like it could be a parsing issue. Some urls contain some javascript code and others contain some html tags. Is anyone aware of this? I can open a bug if needed.

Thanks,
--Flo
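For reference, the kind of heuristic extraction discussed in this thread can be reproduced in a few lines. This is an illustrative sketch, not the actual parse-js code: it pulls every quoted string literal out of the script and resolves it against the page URL, which is exactly how fragments like `text/javascript` end up looking like relative links under `/dat/dlv/`:

```java
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NaiveJsLinks {
    // Grab every single- or double-quoted string literal from the script text.
    private static final Pattern LITERAL = Pattern.compile("\"([^\"]+)\"|'([^']+)'");

    static List<String> extract(String base, String js) throws Exception {
        List<String> out = new ArrayList<>();
        Matcher m = LITERAL.matcher(js);
        while (m.find()) {
            String s = m.group(1) != null ? m.group(1) : m.group(2);
            // Resolve against the page URL, as if every literal were a link.
            out.add(new URL(new URL(base), s).toString());
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        // A made-up snippet in the style of the ad-server scripts on the crawled pages.
        String js = "var u = 'afxplay.js'; QQt.document.write('text/javascript');";
        for (String link : extract("http://a.as-us.falkag.net/dat/dlv/", js)) {
            System.out.println(link);
        }
    }
}
```

Running this prints both a plausible link (afxplay.js) and a bogus one (text/javascript), matching the pattern visible in the dump above: any string literal the script happens to contain becomes a "URL".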