The standard MediaWiki API for this operation is documented here: http://www.mediawiki.org/wiki/API:Allpages
Karl

On Wed, Oct 1, 2014 at 10:20 AM, Karl Wright <[email protected]> wrote:

> Hi Kambiz,
>
> I looked deeper into the log, and found that it is looping on trying to
> seed. The reason it is looping is because the wiki server you are crawling
> is not honoring the "apfrom" parameter when the namespace is specified.
> Please see the following response, which is coming back from the query:
>
> DEBUG 2014-10-01 08:34:22,470 (Thread-618) - http-outgoing-7 >> "GET
> /wiki/api.php?format=xml&action=query&list=allpages&apnamespace=404&apfrom=Africa%3ATraining&aplimit=500
> HTTP/1.1[\r][\n]"
>
> This response is *supposed* to start with Africa:Training and go on from
> there. Instead, it seems to be starting from the beginning of the
> namespace:
>
> >>>>>>
> <?xml version="1.0"?><api><query>
> <allpages>
> <p pageid="10171" ns="404" title="Africa:Arcgis" />
> <p pageid="9977" ns="404" title="Africa:Atlas" />
> <p pageid="9979" ns="404" title="Africa:CTMargins" />
> <p pageid="9727" ns="404" title="Africa:Conferences" />
> <p pageid="9386" ns="404" title="Africa:Conferences2010" />
> <p pageid="9833" ns="404" title="Africa:Countryprojects" />
> <p pageid="9823" ns="404" title="Africa:Databases" />
> <p pageid="9976" ns="404" title="Africa:EasternMed" />
> <p pageid="10277" ns="404" title="Africa:Farmin" />
> <p pageid="9388" ns="404" title="Africa:FieldtripGuides" />
> <p pageid="9834" ns="404" title="Africa:Gabon2010" />
> <p pageid="9975" ns="404" title="Africa:InteriorRifs" />
> <p pageid="10762" ns="404" title="Africa:Kenya2011" />
> <p pageid="15660" ns="404" title="Africa:Kenya2012" />
> <p pageid="14945" ns="404" title="Africa:Madagascar2012" />
> <p pageid="9973" ns="404" title="Africa:Mozambique2011" />
> <p pageid="9385" ns="404" title="Africa:New Ventures Africa" />
> <p pageid="9812" ns="404" title="Africa:New Ventures Africa Map" />
> <p pageid="9969" ns="404" title="Africa:Newsletter" />
> <p pageid="19985" ns="404" title="Africa:Project Abyss" />
> <p pageid="19986" ns="404" title="Africa:Project Geronimo" />
> <p pageid="20079" ns="404" title="Africa:Project Inlet" />
> <p pageid="9832" ns="404" title="Africa:Regionalprojects" />
> <p pageid="9974" ns="404" title="Africa:Seychelles2011" />
> <p pageid="9978" ns="404" title="Africa:TetianCarbonates" />
> <p pageid="9822" ns="404" title="Africa:Training" />
> </allpages>
> </query></api>
> <<<<<<
>
> What version of Wiki are you crawling here? Perhaps something has changed
> in the spec, or maybe you are crawling a wiki that is too old to support
> this feature?
>
> Karl
>
>
> On Wed, Oct 1, 2014 at 9:57 AM, Karl Wright <[email protected]> wrote:
>
>> Hi Kambiz,
>>
>> In the log you sent, I did not see any activity at all other than
>> seeding. Was the log complete?
>>
>> You can get a better sense of what is happening by obtaining a simple
>> history report for this connection, and a document status report for the
>> job. If there are only 27 documents, it should be very clear what is
>> happening by looking at these. Can you include them please?
>>
>> Karl
>>
>>
>> On Wed, Oct 1, 2014 at 9:50 AM, Kambiz Niktabar <[email protected]>
>> wrote:
>>
>>> Hi Karl,
>>>
>>> Snapshot of the job view page is attached. By the way, it seems the
>>> number of pages under that namespace is only 27 and they are not being
>>> processed even after some minutes (see the second snapshot)
>>>
>>> Regards
>>> Kambiz
>>>
>>> ------------------------------
>>> *From:* Karl Wright <[email protected]>
>>> *To:* "[email protected]" <[email protected]>; Kambiz
>>> Niktabar <[email protected]>
>>> *Sent:* Wednesday, October 1, 2014 2:05 PM
>>> *Subject:* Re: Wiki connector stuck crawling namespaces other than
>>> default
>>>
>>> Hi Kambiz,
>>>
>>> The debugging output indicates that your namespace name is "404". That
>>> doesn't sound correct to me.
>>>
>>> >>>>>>
>>> GET
>>> /wiki/api.php?format=xml&action=query&list=allpages&apnamespace=404&apfrom=Africa%3ATetianCarbonates&aplimit=500
>>> HTTP/1.1
>>> <<<<<<
>>>
>>> I've gone back and looked at the code and can find no way that the
>>> namespace would be corrupted. But maybe this is actually correct. Can you
>>> send along a screen shot of the view page for the job?
>>>
>>> Also, the wiki connector seeds documents in batches of 500 at a time.
>>> It uses the last title fetched in order to be able to find the next batch
>>> of 500. So if there are a lot of documents, it will take a while to seed
>>> them all. In your log I see signs that this is what is happening. Have a
>>> look at all the GET requests and note the apfrom parameter.
>>>
>>> Thanks,
>>> Karl
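
For reference, the seeding behavior discussed in this thread (fetch a batch, then resume from the last title via "apfrom") can be sketched roughly as below. This is only an illustration in Python, not the connector's actual code: the names allpages_url, titles_in, seed and the fetch callable are all hypothetical, and the stop condition (a batch that adds no title past apfrom) is an assumption. It also shows why a server that ignores apfrom makes a naive loop spin forever, and one way to detect that.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def allpages_url(api_url, namespace, apfrom=None, limit=500):
    """Build a list=allpages query shaped like the GET requests in the log."""
    params = {"format": "xml", "action": "query", "list": "allpages",
              "apnamespace": namespace}
    if apfrom is not None:
        params["apfrom"] = apfrom
    params["aplimit"] = limit
    return api_url + "?" + urlencode(params)


def titles_in(xml_text):
    """Extract the title attribute of every <p> element in a response."""
    return [p.get("title") for p in ET.fromstring(xml_text).iter("p")]


def seed(fetch, api_url, namespace, limit=500):
    """Collect all titles, resuming each batch from the last title seen.

    fetch is a hypothetical callable: it takes a URL and returns the XML body.
    Assumes limit >= 2, because apfrom is inclusive and each batch after the
    first starts with a title we already have.
    """
    seen, apfrom = [], None
    while True:
        batch = titles_in(fetch(allpages_url(api_url, namespace, apfrom, limit)))
        # A title sorting before apfrom means the server restarted from the
        # beginning of the namespace, which is the looping case in this thread.
        if apfrom is not None and any(t < apfrom for t in batch):
            raise RuntimeError("server ignored apfrom=%r" % apfrom)
        new = [t for t in batch if apfrom is None or t > apfrom]
        seen.extend(new)
        if not new:  # nothing past apfrom: seeding is complete
            return seen
        apfrom = new[-1]
```

The comparison relies on plain string ordering, which matches the order of the titles in the response quoted above; a wiki that sorts titles differently would need a different check.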
