Hi Karl,

Thanks for the info. I will check with the people maintaining the Wiki site to see if there is any specific configuration that causes this.

Regards
Kambiz

________________________________
From: Karl Wright <[email protected]>
To: Kambiz Niktabar <[email protected]>
Cc: "[email protected]" <[email protected]>
Sent: Wednesday, October 1, 2014 4:34 PM
Subject: Re: Wiki connector stuck crawling namespaces other than default

The standard mediawiki api for this operation is listed here:
http://www.mediawiki.org/wiki/API:Allpages

Karl

On Wed, Oct 1, 2014 at 10:20 AM, Karl Wright <[email protected]> wrote:

>Hi Kambiz,
>
>I looked deeper into the log, and found that it is looping on trying to seed.
>The reason it is looping is because the wiki server you are crawling is not
>honoring the "apfrom" parameter when the namespace is specified. Please see
>the following response, which is coming back from the query:
>
>DEBUG 2014-10-01 08:34:22,470 (Thread-618) - http-outgoing-7 >> "GET
>/wiki/api.php?format=xml&action=query&list=allpages&apnamespace=404&apfrom=Africa%3ATraining&aplimit=500
> HTTP/1.1[\r][\n]"
>
>This response is *supposed* to start with Africa:Training and go on from
>there. Instead, it seems to be starting from the beginning of the namespace:
>
>>>>>>>
><?xml version="1.0"?><api><query>
><allpages>
><p pageid="10171" ns="404" title="Africa:Arcgis" />
><p pageid="9977" ns="404" title="Africa:Atlas" />
><p pageid="9979" ns="404" title="Africa:CTMargins" />
><p pageid="9727" ns="404" title="Africa:Conferences" />
><p pageid="9386" ns="404" title="Africa:Conferences2010" />
><p pageid="9833" ns="404" title="Africa:Countryprojects" />
><p pageid="9823" ns="404" title="Africa:Databases" />
><p pageid="9976" ns="404" title="Africa:EasternMed" />
><p pageid="10277" ns="404" title="Africa:Farmin" />
><p pageid="9388" ns="404" title="Africa:FieldtripGuides" />
><p pageid="9834" ns="404" title="Africa:Gabon2010" />
><p pageid="9975" ns="404" title="Africa:InteriorRifs" />
><p pageid="10762" ns="404" title="Africa:Kenya2011" />
><p pageid="15660" ns="404" title="Africa:Kenya2012" />
><p pageid="14945" ns="404" title="Africa:Madagascar2012" />
><p pageid="9973" ns="404" title="Africa:Mozambique2011" />
><p pageid="9385" ns="404" title="Africa:New Ventures Africa" />
><p pageid="9812" ns="404" title="Africa:New Ventures Africa Map" />
><p pageid="9969" ns="404" title="Africa:Newsletter" />
><p pageid="19985" ns="404" title="Africa:Project Abyss" />
><p pageid="19986" ns="404" title="Africa:Project Geronimo" />
><p pageid="20079" ns="404" title="Africa:Project Inlet" />
><p pageid="9832" ns="404" title="Africa:Regionalprojects" />
><p pageid="9974" ns="404" title="Africa:Seychelles2011" />
><p pageid="9978" ns="404" title="Africa:TetianCarbonates" />
><p pageid="9822" ns="404" title="Africa:Training" />
></allpages>
></query></api>
><<<<<<
>
>What version of Wiki are you crawling here? Perhaps something has changed in
>the spec, or maybe you are crawling a wiki that is too old to support this
>feature?
>
>Karl
>
>On Wed, Oct 1, 2014 at 9:57 AM, Karl Wright <[email protected]> wrote:
>
>>Hi Kambiz,
>>
>>In the log you sent, I did not see any activity at all other than seeding.
>>Was the log complete?
>>
>>You can get a better sense of what is happening by obtaining a simple history
>>report for this connection, and a document status report for the job. If
>>there are only 27 documents, it should be very clear what is happening by
>>looking at these. Can you include them please?
>>
>>Karl
>>
>>On Wed, Oct 1, 2014 at 9:50 AM, Kambiz Niktabar <[email protected]> wrote:
>>
>>>Hi Karl,
>>>
>>>Snapshot of the job view page is attached. By the way, it seems the number
>>>of pages under that namespace is only 27 and they are not being processed
>>>even after some minutes (see the second snapshot)
>>>
>>>Regards
>>>Kambiz
>>>
>>>________________________________
>>>From: Karl Wright <[email protected]>
>>>To: "[email protected]" <[email protected]>; Kambiz
>>>Niktabar <[email protected]>
>>>Sent: Wednesday, October 1, 2014 2:05 PM
>>>Subject: Re: Wiki connector stuck crawling namespaces other than default
>>>
>>>Hi Kambiz,
>>>
>>>The debugging output indicates that your namespace name is "404". That
>>>doesn't sound correct to me.
>>>
>>>>>>>>>
>>>GET
>>>/wiki/api.php?format=xml&action=query&list=allpages&apnamespace=404&apfrom=Africa%3ATetianCarbonates&aplimit=500
>>> HTTP/1.1
>>><<<<<<
>>>
>>>I've gone back and looked at the code and can find no way that the namespace
>>>would be corrupted. But maybe this is actually correct. Can you send along
>>>a screen shot of the view page for the job?
>>>
>>>Also, the wiki connector seeds documents in batches of 500 at a time. It
>>>uses the last title fetched in order to be able to find the next batch of
>>>500. So if there are a lot of documents, it will take a while to seed them
>>>all. In your log I see signs that this is what is happening. Have a look
>>>at all the GET requests and note the apfrom parameter.
>>>
>>>Thanks,
>>>Karl
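
For readers following the thread, the batched seeding described above (request a batch, then use the last title fetched as "apfrom" for the next request) can be sketched roughly as below. This is only an illustrative sketch in Python, not the connector's actual code. The function names, the "fetch" callable, and the check that raises an error when a batch starts before the requested "apfrom" are all assumptions made for the example; only the query parameters and the /wiki/api.php path are taken from the log in this thread.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET


def build_allpages_query(namespace, apfrom=None, limit=500):
    """Build the request path for one list=allpages batch (hypothetical helper)."""
    params = [
        ("format", "xml"),
        ("action", "query"),
        ("list", "allpages"),
        ("apnamespace", str(namespace)),
    ]
    if apfrom is not None:
        params.append(("apfrom", apfrom))
    params.append(("aplimit", str(limit)))
    # Path taken from the log in this thread; other wikis may use a different one.
    return "/wiki/api.php?" + urlencode(params)


def parse_titles(xml_text):
    """Return the title attribute of every <p> element in an allpages response."""
    root = ET.fromstring(xml_text)
    return [p.get("title") for p in root.iter("p")]


def seed_titles(fetch, namespace, limit=500):
    """Collect all titles in a namespace, one batch at a time.

    fetch is a caller-supplied function (an assumption of this sketch) that
    takes a request path and returns the response body as a string.
    """
    seen = set()
    ordered = []
    apfrom = None
    while True:
        batch = parse_titles(fetch(build_allpages_query(namespace, apfrom, limit)))
        # Assumed sanity check: if the batch starts before the title we asked
        # for, the server did not honor apfrom and the loop would never end.
        if apfrom is not None and batch and batch[0] < apfrom:
            raise RuntimeError("server ignored apfrom=%r" % apfrom)
        new = [t for t in batch if t not in seen]
        for title in new:
            seen.add(title)
            ordered.append(title)
        # Stop when nothing new arrived or the batch was not full.
        if not new or len(batch) < limit:
            break
        # The last title fetched becomes the start of the next batch.
        apfrom = batch[-1]
    return ordered
```

With a server that honors "apfrom", the loop ends once a batch brings no new titles or is shorter than the limit. With a server that restarts from the beginning of the namespace, as in the response quoted in this thread, the sketch stops with an error instead of requesting the same batch repeatedly.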
