Re: [Zorba-coders] [Merge] lp:~danielturcanu/zorba/web_crawler_tutorial into lp:zorba
This branch was superseded by lp:~zorba-coders/zorba/web_crawler_tutorial.
--
https://code.launchpad.net/~danielturcanu/zorba/web_crawler_tutorial/+merge/78243
Your team Zorba Coders is subscribed to branch lp:zorba.
--
Mailing list: https://launchpad.net/~zorba-coders
Post to     : zorba-coders@lists.launchpad.net
Unsubscribe : https://launchpad.net/~zorba-coders
More help   : https://help.launchpad.net/ListHelp
Re: [Zorba-coders] [Merge] lp:~danielturcanu/zorba/web_crawler_tutorial into lp:zorba
The attempt to merge lp:~danielturcanu/zorba/web_crawler_tutorial into lp:zorba failed. Below is the output from the failed tests.

CMake Error at /home/ceej/zo/testing/zorbatest/tester/TarmacLander.cmake:272 (message):
  Validation queue job web_crawler_tutorial-2011-12-14T14-31-29.81Z is
  finished. The final status was:

  1 tests did not succeed - changes not commited.

Error in read script: /home/ceej/zo/testing/zorbatest/tester/TarmacLander.cmake
--
https://code.launchpad.net/~danielturcanu/zorba/web_crawler_tutorial/+merge/85669
[Zorba-coders] [Merge] lp:~danielturcanu/zorba/web_crawler_tutorial into lp:zorba
The proposal to merge lp:~danielturcanu/zorba/web_crawler_tutorial into lp:zorba has been updated.
Status: Approved => Needs review
For more details, see:
https://code.launchpad.net/~danielturcanu/zorba/web_crawler_tutorial/+merge/85669
[Zorba-coders] [Merge] lp:~danielturcanu/zorba/web_crawler_tutorial into lp:zorba
Validation queue starting for merge proposal.
Log at: http://zorbatest.lambda.nu:8080/remotequeue/web_crawler_tutorial-2011-12-14T14-31-29.81Z/log.html
--
https://code.launchpad.net/~danielturcanu/zorba/web_crawler_tutorial/+merge/85669
[Zorba-coders] [Merge] lp:~danielturcanu/zorba/web_crawler_tutorial into lp:zorba
The proposal to merge lp:~danielturcanu/zorba/web_crawler_tutorial into lp:zorba has been updated.
Status: Needs review => Approved
For more details, see:
https://code.launchpad.net/~danielturcanu/zorba/web_crawler_tutorial/+merge/85669
[Zorba-coders] [Merge] lp:~danielturcanu/zorba/web_crawler_tutorial into lp:zorba
Daniel Turcanu has proposed merging lp:~danielturcanu/zorba/web_crawler_tutorial into lp:zorba.
Requested reviews:
  Zorba Coders (zorba-coders)
For more details, see:
https://code.launchpad.net/~danielturcanu/zorba/web_crawler_tutorial/+merge/85669

Updated the web crawler tutorial

=== added file 'doc/zorba/link_crawler2.dox'
--- doc/zorba/link_crawler2.dox 1970-01-01 00:00:00 +
+++ doc/zorba/link_crawler2.dox 2011-12-14 14:31:06 +
@@ -0,0 +1,238 @@
+/**
+\page link_crawler2 Web Crawler example in XQuery
+\code
+(:
+ : Copyright 2006-2011 The FLWOR Foundation.
+ :
+ : Licensed under the Apache License, Version 2.0 (the "License");
+ : you may not use this file except in compliance with the License.
+ : You may obtain a copy of the License at
+ :
+ : http://www.apache.org/licenses/LICENSE-2.0
+ :
+ : Unless required by applicable law or agreed to in writing, software
+ : distributed under the License is distributed on an "AS IS" BASIS,
+ : WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ : See the License for the specific language governing permissions and
+ : limitations under the License.
+:)
+
+import module namespace http = "http://www.zorba-xquery.com/modules/http-client";
+import module namespace map = "http://www.zorba-xquery.com/modules/store/data-structures/unordered-map";
+import module namespace html = "http://www.zorba-xquery.com/modules/converters/html";
+import module namespace parse-xml = "http://www.zorba-xquery.com/modules/xml";
+import module namespace file = "http://expath.org/ns/file";
+
+declare namespace ann = "http://www.zorba-xquery.com/annotations";
+declare namespace xhtml="http://www.w3.org/1999/xhtml";
+declare namespace output="http://www.w3.org/2010/xslt-xquery-serialization";
+declare namespace err="http://www.w3.org/2005/xqt-errors";
+declare namespace httpsch = "http://expath.org/ns/http-client";
+
+declare variable $top-uri as xs:string := "http://www.zorba-xquery.com/site2/html/index.html";
+declare variable $uri-host as xs:string := "http://www.zorba-xquery.com";
+
+
+
+declare variable $local:processed-internal-links := xs:QName("processed-internal-links");
+declare variable $local:processed-external-links := xs:QName("processed-external-links");
+
+
+declare %ann:sequential function local:create-containers()
+{
+  map:create($local:processed-internal-links, xs:QName("xs:string"));
+  map:create($local:processed-external-links, xs:QName("xs:string"));
+};
+
+declare %ann:sequential function local:delete-containers(){
+  for $x in map:available-maps()
+  return map:delete($x);
+};
+
+declare function local:is-internal($x as xs:string) as xs:boolean
+{
+  starts-with($x, $uri-host)
+};
+
+declare function local:my-substring-before($s1 as xs:string, $s2 as xs:string) as xs:string
+{
+let $sb := fn:substring-before($s1, $s2)
+return if($sb = "") then $s1 else $sb
+};
+
+declare %ann:sequential function local:get-real-link($href as xs:string, $start-uri as xs:string) as xs:string?
+{
+  variable $absuri;
+  try{
+$absuri := local:my-substring-before(resolve-uri(fn:normalize-space($href), $start-uri), "#");
+  }
+  catch *
+  {
+    map:insert($local:processed-external-links, ({$start-uri},
+      malformed,
+      broken), $href);
+  }
+  $absuri
+};
+
+
+declare function local:get-media-type ($http-call as node()) as xs:string
+{
+  local:my-substring-before($http-call/httpsch:header[@name = 'Content-Type'][1]/string(@value), ";")
+};
+
+declare function local:alive($http-call as item()*) as xs:boolean
+{
+  if((count($http-call) ge 1) and
+($http-call[1]/@status eq 200))
+  then true() else fn:trace(false(), "alive")
+};
+
+
+declare %ann:sequential function local:get-out-links-parsed($content as node()*, $uri as xs:string) as xs:string*
+{ distinct-values( for $y in ($content//*:a/string(@href),
+    $content//*:link/string(@href),
+    $content//*:script/string(@src),
+    $content//*:img/string(@src),
+    $content//*:area/string(@href)
+  )
+return local:get-real-link($y, $uri))
+};
+
+
+declare %ann:sequential function local:get-out-links-unparsed($content as xs:string, $uri as xs:string) as xs:string*{
+
+  distinct-values(
+    let $search := fn:analyze-string($content, "(<|<|<)(((a|link|area).+?href)|((script|img).+?src))=([""'])(.*?)\7")
+    for $other-uri2 in $search//group[@nr=8]/string()
+    return local:get-real-link($other-uri2, $uri)
+  )
+};
+
+
+declare %ann:sequential function local:map-insert-result($map-name as xs:QName, $
[Zorba-coders] [Merge] lp:~danielturcanu/zorba/web_crawler_tutorial into lp:zorba
Daniel Turcanu has proposed merging lp:~danielturcanu/zorba/web_crawler_tutorial into lp:zorba.
Requested reviews:
  Sorin Marian Nasoi (sorin.marian.nasoi)
For more details, see:
https://code.launchpad.net/~danielturcanu/zorba/web_crawler_tutorial/+merge/79407

Updated the web crawler tutorial with the latest updates in link_crawler2.xq

=== added file 'doc/zorba/link_crawler2.dox'
--- doc/zorba/link_crawler2.dox 1970-01-01 00:00:00 +
+++ doc/zorba/link_crawler2.dox 2011-10-14 15:00:48 +
@@ -0,0 +1,238 @@
+/**
+\page link_crawler2 Web Crawler example in XQuery
+\code
+(:
+ : Copyright 2006-2011 The FLWOR Foundation.
+ :
+ : Licensed under the Apache License, Version 2.0 (the "License");
+ : you may not use this file except in compliance with the License.
+ : You may obtain a copy of the License at
+ :
+ : http://www.apache.org/licenses/LICENSE-2.0
+ :
+ : Unless required by applicable law or agreed to in writing, software
+ : distributed under the License is distributed on an "AS IS" BASIS,
+ : WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ : See the License for the specific language governing permissions and
+ : limitations under the License.
+:)
+
+import module namespace http = "http://www.zorba-xquery.com/modules/http-client";
+import module namespace map = "http://www.zorba-xquery.com/modules/store/data-structures/unordered-map";
+import module namespace html = "http://www.zorba-xquery.com/modules/converters/html";
+import module namespace parse-xml = "http://www.zorba-xquery.com/modules/xml";
+import module namespace file = "http://expath.org/ns/file";
+
+declare namespace ann = "http://www.zorba-xquery.com/annotations";
+declare namespace xhtml="http://www.w3.org/1999/xhtml";
+declare namespace output="http://www.w3.org/2010/xslt-xquery-serialization";
+declare namespace err="http://www.w3.org/2005/xqt-errors";
+declare namespace httpsch = "http://expath.org/ns/http-client";
+
+declare variable $top-uri as xs:string := "http://www.zorba-xquery.com/site2/html/index.html";
+declare variable $uri-host as xs:string := "http://www.zorba-xquery.com";
+
+
+
+declare variable $local:processed-internal-links := xs:QName("processed-internal-links");
+declare variable $local:processed-external-links := xs:QName("processed-external-links");
+
+
+declare %ann:sequential function local:create-containers()
+{
+  map:create($local:processed-internal-links, xs:QName("xs:string"));
+  map:create($local:processed-external-links, xs:QName("xs:string"));
+};
+
+declare %ann:sequential function local:delete-containers(){
+  for $x in map:available-maps()
+  return map:delete($x);
+};
+
+declare function local:is-internal($x as xs:string) as xs:boolean
+{
+  starts-with($x, $uri-host)
+};
+
+declare function local:my-substring-before($s1 as xs:string, $s2 as xs:string) as xs:string
+{
+let $sb := fn:substring-before($s1, $s2)
+return if($sb = "") then $s1 else $sb
+};
+
+declare %ann:sequential function local:get-real-link($href as xs:string, $start-uri as xs:string) as xs:string?
+{
+  variable $absuri;
+  try{
+$absuri := local:my-substring-before(resolve-uri(fn:normalize-space($href), $start-uri), "#");
+  }
+  catch *
+  {
+    map:insert($local:processed-external-links, ({$start-uri},
+      malformed,
+      broken), $href);
+  }
+  $absuri
+};
+
+
+declare function local:get-media-type ($http-call as node()) as xs:string
+{
+  local:my-substring-before($http-call/httpsch:header[@name = 'Content-Type'][1]/string(@value), ";")
+};
+
+declare function local:alive($http-call as item()*) as xs:boolean
+{
+  if((count($http-call) ge 1) and
+($http-call[1]/@status eq 200))
+  then true() else fn:trace(false(), "alive")
+};
+
+
+declare %ann:sequential function local:get-out-links-parsed($content as node()*, $uri as xs:string) as xs:string*
+{ distinct-values( for $y in ($content//*:a/string(@href),
+    $content//*:link/string(@href),
+    $content//*:script/string(@src),
+    $content//*:img/string(@src),
+    $content//*:area/string(@href)
+  )
+return local:get-real-link($y, $uri))
+};
+
+
+declare %ann:sequential function local:get-out-links-unparsed($content as xs:string, $uri as xs:string) as xs:string*{
+
+  distinct-values(
+    let $search := fn:analyze-string($content, "(<|<|<)(((a|link|area).+?href)|((script|img).+?src))=([""'])(.*?)\7")
+    for $other-uri2 in $search//group[@nr=8]/string()
+    return local:get-real-link($other-uri2, $uri)
+  )
+};
+
+
+declare %ann:sequential function local:map-insert-result($map-name as xs:QName, $url as xs:string
[Zorba-coders] [Merge] lp:~danielturcanu/zorba/web_crawler_tutorial into lp:zorba
Daniel Turcanu has proposed merging lp:~danielturcanu/zorba/web_crawler_tutorial into lp:zorba.
Requested reviews:
  Sorin Marian Nasoi (sorin.marian.nasoi)
For more details, see:
https://code.launchpad.net/~danielturcanu/zorba/web_crawler_tutorial/+merge/78614

Updated the web crawler tutorial with the latest fixes in link_crawler2.xq

=== added file 'doc/zorba/link_crawler2.dox'
--- doc/zorba/link_crawler2.dox 1970-01-01 00:00:00 +
+++ doc/zorba/link_crawler2.dox 2011-10-07 14:41:17 +
@@ -0,0 +1,221 @@
+/**
+\page link_crawler2 Web Crawler example in XQuery
+\code
+(:
+ : Copyright 2006-2011 The FLWOR Foundation.
+ :
+ : Licensed under the Apache License, Version 2.0 (the "License");
+ : you may not use this file except in compliance with the License.
+ : You may obtain a copy of the License at
+ :
+ : http://www.apache.org/licenses/LICENSE-2.0
+ :
+ : Unless required by applicable law or agreed to in writing, software
+ : distributed under the License is distributed on an "AS IS" BASIS,
+ : WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ : See the License for the specific language governing permissions and
+ : limitations under the License.
+:)
+
+import module namespace http = "http://www.zorba-xquery.com/modules/http-client";
+import module namespace map = "http://www.zorba-xquery.com/modules/store/data-structures/unordered-map";
+import module namespace html = "http://www.zorba-xquery.com/modules/converters/html";
+import module namespace parse-xml = "http://www.zorba-xquery.com/modules/xml";
+import module namespace file = "http://expath.org/ns/file";
+
+declare namespace ann = "http://www.zorba-xquery.com/annotations";
+declare namespace xhtml="http://www.w3.org/1999/xhtml";
+declare namespace output="http://www.w3.org/2010/xslt-xquery-serialization";
+declare namespace err="http://www.w3.org/2005/xqt-errors";
+declare namespace httpsch = "http://expath.org/ns/http-client";
+
+declare variable $top-uri as xs:string := "http://www.zorba-xquery.com/site2/html/index.html";
+declare variable $uri-host as xs:string := "http://www.zorba-xquery.com/site2/";
+
+
+
+declare variable $local:processed-internal-links := xs:QName("processed-internal-links");
+declare variable $local:processed-external-links := xs:QName("processed-external-links");
+
+
+declare %ann:sequential function local:create-containers()
+{
+  map:create($local:processed-internal-links, xs:QName("xs:string"));
+  map:create($local:processed-external-links, xs:QName("xs:string"));
+};
+
+declare %ann:sequential function local:delete-containers(){
+  for $x in map:available-maps()
+  return map:delete($x);
+};
+
+declare function local:is-internal($x as xs:string) as xs:boolean
+{
+  starts-with($x, $uri-host)
+};
+
+declare function local:my-substring-before($s1 as xs:string, $s2 as xs:string) as xs:string
+{
+let $sb := fn:substring-before($s1, $s2)
+return if($sb = "") then $s1 else $sb
+};
+
+declare %ann:sequential function local:get-real-link($href as xs:string, $start-uri as xs:string) as xs:string?
+{
+  variable $absuri;
+  try{
+$absuri := local:my-substring-before(resolve-uri(fn:normalize-space($href), $start-uri), "#");
+  }
+  catch *
+  {
+    map:insert($local:processed-external-links, fn:concat("malformed, referenced in page ", $start-uri), $href);
+  }
+  $absuri
+};
+
+
+declare function local:get-media-type ($http-call as node()) as xs:string
+{
+  local:my-substring-before($http-call/httpsch:header[@name = 'Content-Type'][1]/string(@value), ";")
+};
+
+declare function local:alive($http-call as item()*) as xs:boolean
+{
+  if((count($http-call) ge 1) and
+($http-call[1]/@status eq 200))
+  then true() else fn:trace(false(), "alive")
+};
+
+
+declare %ann:sequential function local:get-out-links-parsed($content as node()*, $uri as xs:string) as xs:string*
+{ distinct-values( for $y in ($content//*:a/string(@href),
+    $content//*:link/string(@href),
+    $content//*:script/string(@src),
+    $content//*:img/string(@src),
+    $content//*:area/string(@href)
+  )
+return local:get-real-link($y, $uri))
+};
+
+
+declare %ann:sequential function local:get-out-links-unparsed($content as xs:string, $uri as xs:string) as xs:string*{
+
+  distinct-values(
+    let $search := fn:analyze-string($content, "(<|<|<)(((a|link|area).+?href)|((script|img).+?src))=([""'])(.*?)\7")
+    for $other-uri2 in $search//group[@nr=8]/string()
+    return local:get-real-link($other-uri2, $uri)
+  )
+};
+
+
+
+declare %ann:sequential function local:process-link($x as xs:string, $n as xs:integer) as item()*{
+  if(local:is-internal($x))
+  then local:process-internal-link($x,
Re: [Zorba-coders] [Merge] lp:~danielturcanu/zorba/web_crawler_tutorial into lp:zorba
Review: Needs Fixing
You could add the link to the script in the Doxy page instead of adding a new Doxy page. Something like:
\include zorba/store/sc2_ex1.xq
First you need to add the path to the WebCrawler script to the Doxygen example search path. Edit doc/zorba/doxy.config.in line 504, EXAMPLE_PATH.
--
https://code.launchpad.net/~danielturcanu/zorba/web_crawler_tutorial/+merge/78243
Re: [Zorba-coders] [Merge] lp:~danielturcanu/zorba/web_crawler_tutorial into lp:zorba
Review: Approve
I have checked the changes.
--
https://code.launchpad.net/~danielturcanu/zorba/web_crawler_tutorial/+merge/78243
[Zorba-coders] [Merge] lp:~danielturcanu/zorba/web_crawler_tutorial into lp:zorba
The proposal to merge lp:~danielturcanu/zorba/web_crawler_tutorial into lp:zorba has been updated.
Status: Approved => Needs review
For more details, see:
https://code.launchpad.net/~danielturcanu/zorba/web_crawler_tutorial/+merge/78243
Re: [Zorba-coders] [Merge] lp:~danielturcanu/zorba/web_crawler_tutorial into lp:zorba
Voting does not meet specified criteria. Required: Approve > 0, Disapprove < 1. Got: 1 Pending.
--
https://code.launchpad.net/~danielturcanu/zorba/web_crawler_tutorial/+merge/78243
[Zorba-coders] [Merge] lp:~danielturcanu/zorba/web_crawler_tutorial into lp:zorba
Validation queue job web_crawler_tutorial-2011-10-05T12-23-57.066Z is finished. The final status was:
All tests succeeded!
--
https://code.launchpad.net/~danielturcanu/zorba/web_crawler_tutorial/+merge/78243
[Zorba-coders] [Merge] lp:~danielturcanu/zorba/web_crawler_tutorial into lp:zorba
Validation queue starting for merge proposal.
Log at: http://zorbatest.lambda.nu:8080/remotequeue/web_crawler_tutorial-2011-10-05T12-23-57.066Z/log.html
--
https://code.launchpad.net/~danielturcanu/zorba/web_crawler_tutorial/+merge/78243
[Zorba-coders] [Merge] lp:~danielturcanu/zorba/web_crawler_tutorial into lp:zorba
The proposal to merge lp:~danielturcanu/zorba/web_crawler_tutorial into lp:zorba has been updated.
Status: Needs review => Approved
For more details, see:
https://code.launchpad.net/~danielturcanu/zorba/web_crawler_tutorial/+merge/78243
[Zorba-coders] [Merge] lp:~danielturcanu/zorba/web_crawler_tutorial into lp:zorba
Daniel Turcanu has proposed merging lp:~danielturcanu/zorba/web_crawler_tutorial into lp:zorba.
Requested reviews:
  Zorba Coders (zorba-coders)
For more details, see:
https://code.launchpad.net/~danielturcanu/zorba/web_crawler_tutorial/+merge/78243

=== added file 'doc/zorba/link_crawler2.dox'
--- doc/zorba/link_crawler2.dox 1970-01-01 00:00:00 +
+++ doc/zorba/link_crawler2.dox 2011-10-05 12:23:32 +
@@ -0,0 +1,208 @@
+/**
+\page link_crawler2 Web Crawler example in XQuery
+\code
+(:
+ : Copyright 2006-2011 The FLWOR Foundation.
+ :
+ : Licensed under the Apache License, Version 2.0 (the "License");
+ : you may not use this file except in compliance with the License.
+ : You may obtain a copy of the License at
+ :
+ : http://www.apache.org/licenses/LICENSE-2.0
+ :
+ : Unless required by applicable law or agreed to in writing, software
+ : distributed under the License is distributed on an "AS IS" BASIS,
+ : WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ : See the License for the specific language governing permissions and
+ : limitations under the License.
+:)
+
+import module namespace http = "http://www.zorba-xquery.com/modules/http-client";
+import module namespace map = "http://www.zorba-xquery.com/modules/store/data-structures/unordered-map";
+import module namespace html = "http://www.zorba-xquery.com/modules/converters/html";
+import module namespace parse-xml = "http://www.zorba-xquery.com/modules/xml";
+
+declare namespace ann = "http://www.zorba-xquery.com/annotations";
+declare namespace xhtml="http://www.w3.org/1999/xhtml";
+declare namespace output="http://www.w3.org/2010/xslt-xquery-serialization";
+declare namespace err="http://www.w3.org/2005/xqt-errors";
+declare namespace httpsch = "http://expath.org/ns/http-client";
+
+declare variable $top-uri as xs:string := "http://www.zorba-xquery.com/site2/html/index.html";
+declare variable $uri-host as xs:string := "http://www.zorba-xquery.com/site2/";
+
+
+declare variable $supported-media-types as xs:string+ := ("text/xml", "application/xml", "text/xml-external-parsed-entity", "application/xml-external-parsed-entity",
+    "application/atom+xml", "text/html");
+
+
+declare variable $local:processed-internal-links:=xs:QName("processed-internal-links");
+declare variable $local:processed-external-links :=xs:QName("processed-external-links");
+
+
+declare %ann:sequential function local:create-containers()
+{
+  map:create($local:processed-internal-links, xs:QName("xs:string"));
+  map:create($local:processed-external-links, xs:QName("xs:string"));
+};
+
+declare %ann:sequential function local:delete-containers(){
+  for $x in map:available-maps()
+  return map:delete($x);
+};
+
+declare function local:is-internal($x as xs:string) as xs:boolean
+{
+  starts-with($x, $uri-host)
+};
+
+declare function local:my-substring-before($s1 as xs:string, $s2 as xs:string) as xs:string
+{
+let $sb := fn:substring-before($s1, $s2)
+return if($sb = "") then $s1 else $sb
+};
+
+declare function local:get-real-link($href as xs:string, $start-uri as xs:string) as xs:string
+{
+  local:my-substring-before(resolve-uri($href, $start-uri), "#")
+};
+
+
+declare function local:get-media-type ($http-call as node()) as xs:string
+{
+  local:my-substring-before($http-call/httpsch:header[@name = 'Content-Type'][1]/string(@value), ";")
+};
+
+declare function local:alive($http-call as node()*) as xs:boolean
+{
+  if(($http-call[1]/@status eq 200)) then true() else false()
+};
+
+
+declare function local:get-out-links-parsed($content as node()*, $uri as xs:string) as xs:string*
+{ distinct-values( for $y in ($content//*:a/string(@href),
+    $content//*:link/string(@href),
+    $content//*:script/string(@src),
+    $content//*:img/string(@src),
+    $content//*:area/string(@href)
+  )
+return local:get-real-link($y, $uri))
+};
+
+
+declare function local:get-out-links-unparsed($content as xs:string, $uri as xs:string) as xs:string*{
+
+  distinct-values(
+    let $search := fn:analyze-string($content, "(<|<|<)(((a|link|area).+?href)|((script|img).+?src))=([""'])(.*?)\7")
+    for $other-uri2 in $search//group[@nr=8]/string()
+    let $y:= fn:normalize-space($other-uri2)
+    return local:get-real-link($y, $uri)
+  )
+};
+
+
+
+declare %ann:sequential function local:process-external-link($x as xs:string){
+  if(not(empty(map:get($local:processed-external-links, $x
+  then exit returning false();
+  else {}
+  variable $http-call:=();
+  try{
+$http-call:=http:send-request(, (), ());
+  }
+  catch * {}
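
The `local:get-real-link` helper in the diff above resolves each `href` against the page it was found on and drops any `#fragment`. The following is a minimal standalone sketch of that step, using only standard XQuery functions. The helpers are adapted from the diff (with an explicit `fn:string()` conversion added), and the sample `href` values are hypothetical and chosen only for illustration.

```xquery
(: Sketch only: helpers adapted from the diff above; sample hrefs are hypothetical. :)
declare function local:my-substring-before($s1 as xs:string, $s2 as xs:string) as xs:string
{
  (: fn:substring-before returns "" when $s2 does not occur in $s1
     (and also when $s1 begins with $s2), so fall back to $s1 :)
  let $sb := fn:substring-before($s1, $s2)
  return if ($sb = "") then $s1 else $sb
};

declare function local:get-real-link($href as xs:string, $start-uri as xs:string) as xs:string
{
  (: resolve the link against the referring page, then strip the fragment :)
  local:my-substring-before(fn:string(fn:resolve-uri($href, $start-uri)), "#")
};

let $page := "http://www.zorba-xquery.com/site2/html/index.html"
for $href in ("download.html#linux", "../images/logo.png", "http://www.w3.org/TR/xquery/")
return local:get-real-link($href, $page)
```

Relative links come back as absolute URIs under the referring page's directory, and an `href` that is already absolute is returned as-is apart from fragment removal.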
Re: [Zorba-coders] [Merge] lp:~danielturcanu/zorba/web_crawler_tutorial into lp:zorba
The link crawler has been added to the html module as a compilation test.
--
https://code.launchpad.net/~danielturcanu/zorba/web_crawler_tutorial/+merge/77179
[Zorba-coders] [Merge] lp:~danielturcanu/zorba/web_crawler_tutorial into lp:zorba
The proposal to merge lp:~danielturcanu/zorba/web_crawler_tutorial into lp:zorba has been updated.
Status: Approved => Merged
For more details, see:
https://code.launchpad.net/~danielturcanu/zorba/web_crawler_tutorial/+merge/77179
[Zorba-coders] [Merge] lp:~danielturcanu/zorba/web_crawler_tutorial into lp:zorba
Validation queue job web_crawler_tutorial-2011-10-04T23-35-02.03Z is finished. The final status was:
All tests succeeded!
--
https://code.launchpad.net/~danielturcanu/zorba/web_crawler_tutorial/+merge/77179
Re: [Zorba-coders] [Merge] lp:~danielturcanu/zorba/web_crawler_tutorial into lp:zorba
I think that the code in the tutorial should be literally included and be tested as such to make sure that we don't regress. The tutorial should be linked from a blog entry. Also, the tutorial should provide a link to download the source code.
Daniel, could you please provide Dana with the HTML version of the tutorial. I'm sure she is also interested in reading it before it gets published.
--
https://code.launchpad.net/~danielturcanu/zorba/web_crawler_tutorial/+merge/77179
[Zorba-coders] [Merge] lp:~danielturcanu/zorba/web_crawler_tutorial into lp:zorba
Validation queue starting for merge proposal.
Log at: http://zorbatest.lambda.nu:8080/remotequeue/web_crawler_tutorial-2011-10-04T23-35-02.03Z/log.html
--
https://code.launchpad.net/~danielturcanu/zorba/web_crawler_tutorial/+merge/77179
[Zorba-coders] [Merge] lp:~danielturcanu/zorba/web_crawler_tutorial into lp:zorba
The proposal to merge lp:~danielturcanu/zorba/web_crawler_tutorial into lp:zorba has been updated.
Status: Needs review => Approved
For more details, see:
https://code.launchpad.net/~danielturcanu/zorba/web_crawler_tutorial/+merge/77179
Re: [Zorba-coders] [Merge] lp:~danielturcanu/zorba/web_crawler_tutorial into lp:zorba
Review: Approve
I like it. I'd leave the link from the index page there - having a specific section marked "tutorials" will maybe encourage folks to write some more over time. If not, we can easily move that later.
--
https://code.launchpad.net/~danielturcanu/zorba/web_crawler_tutorial/+merge/77179
Re: [Zorba-coders] [Merge] lp:~danielturcanu/zorba/web_crawler_tutorial into lp:zorba
Review: Abstain
The tutorial is nice, but I am not sure the index page in our Doxygen documentation is the best place to put it.
--
https://code.launchpad.net/~danielturcanu/zorba/web_crawler_tutorial/+merge/77179
[Zorba-coders] [Merge] lp:~danielturcanu/zorba/web_crawler_tutorial into lp:zorba
Daniel Turcanu has proposed merging lp:~danielturcanu/zorba/web_crawler_tutorial into lp:zorba.
Requested reviews: Zorba Coders (zorba-coders)
For more details, see: https://code.launchpad.net/~danielturcanu/zorba/web_crawler_tutorial/+merge/77179
Added tutorial for web crawler script from html module (or script directory in zorba).
-- https://code.launchpad.net/~danielturcanu/zorba/web_crawler_tutorial/+merge/77179 Your team Zorba Coders is requested to review the proposed merge of lp:~danielturcanu/zorba/web_crawler_tutorial into lp:zorba.
=== modified file 'doc/zorba/indexpage.dox.in'
--- doc/zorba/indexpage.dox.in 2011-09-06 16:39:46 +
+++ doc/zorba/indexpage.dox.in 2011-09-27 15:05:56 +
@@ -127,6 +127,14 @@
+
+
+
+
+Tutorials
+
+\ref web_crawler_tutorial
+
=== added file 'doc/zorba/web_crawler.dox'
--- doc/zorba/web_crawler.dox 1970-01-01 00:00:00 +
+++ doc/zorba/web_crawler.dox 2011-09-27 15:05:56 +
@@ -0,0 +1,173 @@
+/**
+\page web_crawler_tutorial Web Crawler example in XQuery
+
+Description of a web crawler example in XQuery.
+
+The idea is to crawl through the pages of a website, store lists of the external and internal pages found, and check whether each page works.
+This example uses Zorba's http module for accessing the webpages, and the html module for converting the html to xml.
+The complete code can be found in the test directory of the html converter module.
+
+\code
+import module namespace http = "http://www.zorba-xquery.com/modules/http-client";
+import module namespace map = "http://www.zorba-xquery.com/modules/store/data-structures/unordered-map";
+import module namespace html = "http://www.zorba-xquery.com/modules/converters/html";
+import module namespace parse-xml = "http://www.zorba-xquery.com/modules/xml";
+\endcode
+
+The internal pages are checked recursively, while the external ones are only checked for existence.
+The distinction between internal and external links is made by comparing the URI with a global string variable $uri-host.
+Change this variable to point to your website, or a subdirectory on your website.
+
+\code
+declare variable $top-uri as xs:string := "http://www.zorba-xquery.com/site2/html/index.html";
+declare variable $uri-host as xs:string := "http://www.zorba-xquery.com/site2/";
+
+declare function local:is-internal($x as xs:string) as xs:boolean
+{
+  starts-with($x, $uri-host)
+};
+
+\endcode
+
+The crawling starts from the URI pointed to by $top-uri.
+
+Visited links are stored as nodes in two maps, one for internal pages and one for external pages.
+The keys are the URIs, and the values are the strings "broken" or "clean".
+The maps are used to avoid parsing the same page twice.
+
+\code
+declare variable $local:processed-internal-links := xs:QName("processed-internal-links");
+declare variable $local:processed-external-links := xs:QName("processed-external-links");
+
+declare %ann:sequential function local:create-containers()
+{
+  map:create($local:processed-internal-links, xs:QName("xs:string"));
+  map:create($local:processed-external-links, xs:QName("xs:string"));
+};
+
+declare %ann:sequential function local:delete-containers(){
+  for $x in map:available-maps()
+  return map:delete($x);
+};
+
+\endcode
+
+After parsing an internal page with the html module, all of its links are extracted and parsed recursively if they have not been parsed already.
+The html module uses the tidy library, so we use tidy options to set up the conversion from html to xml.
+Some html tags are marked to be ignored in the new-inline-tags param; this is specific to this website.
+You can add or remove tags to suit your website needs.
+
+\code
+declare function local:get-out-links-parsed($content as node()*, $uri as xs:string) as xs:string*
+{ distinct-values( for $y in ($content//*:a/string(@href),
+                              $content//*:link/string(@href),
+                              $content//*:script/string(@src),
+                              $content//*:img/string(@src),
+                              $content//*:area/string(@href)
+                             )
+return local:get-real-link($y, $uri))
+};
+
+declare function local:tidy-options()
+{http://www.zorba-xquery.com/modules/converters/html-options"; >
+
+
+
+
+
+
+
+
+};
+
+declare %ann:sequential function local:process-internal-link($x as xs:string, $n as xs:integer){
+  if($n=3) then exit returning (); else {}
+  if(not(empty(map:get($local:processed-internal-links, $x))))
+then exit returning false();
+  else {}
+  variable $http-call:=();
+  try{
+    $http-call:=http:send-reques