Hi Karl,

Thanks for the info. I will check with the people maintaining the Wiki site to 
see if there is any specific configuration that causes this.

Regards
Kambiz


________________________________
 From: Karl Wright <[email protected]>
To: Kambiz Niktabar <[email protected]> 
Cc: "[email protected]" <[email protected]> 
Sent: Wednesday, October 1, 2014 4:34 PM
Subject: Re: Wiki connector stuck crawling namespaces other than default
 


The standard MediaWiki API for this operation is documented here:

http://www.mediawiki.org/wiki/API:Allpages
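
For reference, here is a minimal Python sketch of the kind of allpages query seen in the log further down, plus a rough check of whether a reply respects "apfrom". The helper names (build_allpages_url, parse_titles, honors_apfrom) are hypothetical and are not part of the connector. The check assumes the server returns titles in ascending code-point order and with the same prefix that was passed in "apfrom".

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET


def build_allpages_url(api_path, namespace, apfrom=None, limit=500):
    """Build an allpages query URL shaped like the GET request in the log.

    api_path is whatever path the wiki exposes its API on (an assumption
    here; the log shows /wiki/api.php).
    """
    params = [
        ("format", "xml"),
        ("action", "query"),
        ("list", "allpages"),
        ("apnamespace", str(namespace)),
    ]
    if apfrom is not None:
        # Title at which the server is asked to start enumerating.
        params.append(("apfrom", apfrom))
    params.append(("aplimit", str(limit)))
    # urlencode percent-encodes ":" as %3A, as seen in the logged request.
    return api_path + "?" + urlencode(params)


def parse_titles(xml_text):
    """Return the title attribute of every <p> element in an allpages reply."""
    root = ET.fromstring(xml_text)
    return [p.get("title") for p in root.iter("p")]


def honors_apfrom(titles, apfrom):
    """Rough check: the first returned title should not sort before apfrom.

    An empty batch is treated as fine (nothing left to enumerate).
    """
    if not titles:
        return True
    return titles[0] >= apfrom
```

Usage: build the URL with build_allpages_url("/wiki/api.php", 404, "Africa:Training"), fetch it with any HTTP client, then pass the reply body to parse_titles and the resulting list to honors_apfrom. A reply whose first title is "Africa:Arcgis" would fail the check for apfrom "Africa:Training".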

Karl






On Wed, Oct 1, 2014 at 10:20 AM, Karl Wright <[email protected]> wrote:

Hi Kambiz,
>
>I looked deeper into the log and found that it is looping while trying to seed.  
>It is looping because the wiki server you are crawling is not 
>honoring the "apfrom" parameter when the namespace is specified.  Please see 
>the following query and the response that comes back from it:
>
>DEBUG 2014-10-01 08:34:22,470 (Thread-618) - http-outgoing-7 >> "GET 
>/wiki/api.php?format=xml&action=query&list=allpages&apnamespace=404&apfrom=Africa%3ATraining&aplimit=500
> HTTP/1.1[\r][\n]"
>
>This response is *supposed* to start with Africa:Training and go on from 
>there.  Instead, it seems to be starting from the beginning of the namespace:
>
>
>>>>>>>
><?xml version="1.0"?><api><query>
><allpages>
><p pageid="10171" ns="404" title="Africa:Arcgis" />
><p pageid="9977" ns="404" title="Africa:Atlas" />
><p pageid="9979" ns="404" title="Africa:CTMargins" />
><p pageid="9727" ns="404" title="Africa:Conferences" />
><p pageid="9386" ns="404" title="Africa:Conferences2010" />
><p pageid="9833" ns="404" title="Africa:Countryprojects" />
><p pageid="9823" ns="404" title="Africa:Databases" />
><p pageid="9976" ns="404" title="Africa:EasternMed" />
><p pageid="10277" ns="404" title="Africa:Farmin" />
><p pageid="9388" ns="404" title="Africa:FieldtripGuides" />
><p pageid="9834" ns="404" title="Africa:Gabon2010" />
><p pageid="9975" ns="404" title="Africa:InteriorRifs" />
><p pageid="10762" ns="404" title="Africa:Kenya2011" />
><p pageid="15660" ns="404" title="Africa:Kenya2012" />
><p pageid="14945" ns="404" title="Africa:Madagascar2012" />
><p pageid="9973" ns="404" title="Africa:Mozambique2011" />
><p pageid="9385" ns="404" title="Africa:New Ventures Africa" />
><p pageid="9812" ns="404" title="Africa:New Ventures Africa Map" />
><p pageid="9969" ns="404" title="Africa:Newsletter" />
><p pageid="19985" ns="404" title="Africa:Project Abyss" />
><p pageid="19986" ns="404" title="Africa:Project Geronimo" />
><p pageid="20079" ns="404" title="Africa:Project Inlet" />
><p pageid="9832" ns="404" title="Africa:Regionalprojects" />
><p pageid="9974" ns="404" title="Africa:Seychelles2011" />
><p pageid="9978" ns="404" title="Africa:TetianCarbonates" />
><p pageid="9822" ns="404" title="Africa:Training" />
></allpages>
></query></api>
><<<<<<
>
>
>What version of Wiki are you crawling here?  Perhaps something has changed in 
>the spec, or maybe you are crawling a wiki that is too old to support this 
>feature?
>
>
>Karl
>
>
>
>
>On Wed, Oct 1, 2014 at 9:57 AM, Karl Wright <[email protected]> wrote:
>
>Hi Kambiz,
>>
>>In the log you sent, I did not see any activity at all other than seeding.  
>>Was the log complete?
>>
>>You can get a better sense of what is happening by obtaining a simple history 
>>report for this connection, and a document status report for the job.  If 
>>there are only 27 documents, it should be very clear what is happening by 
>>looking at these. Can you include them please?
>>
>>Karl
>>
>>
>>
>>
>>On Wed, Oct 1, 2014 at 9:50 AM, Kambiz Niktabar <[email protected]> wrote:
>>
>>Hi Karl,
>>>
>>>
>>>A snapshot of the job view page is attached. By the way, it seems there are 
>>>only 27 pages under that namespace, and they are still not being processed 
>>>after several minutes (see the second snapshot).
>>>
>>>
>>>Regards
>>>Kambiz
>>>
>>>
>>>
>>>________________________________
>>> From: Karl Wright <[email protected]>
>>>To: "[email protected]" <[email protected]>; Kambiz 
>>>Niktabar <[email protected]> 
>>>Sent: Wednesday, October 1, 2014 2:05 PM
>>>Subject: Re: Wiki connector stuck crawling namespaces other than default
>>> 
>>>
>>>
>>>Hi Kambiz,
>>>
>>>The debugging output indicates that your namespace name is "404".  That 
>>>doesn't sound correct to me.
>>>
>>>>>>>>>
>>>GET 
>>>/wiki/api.php?format=xml&action=query&list=allpages&apnamespace=404&apfrom=Africa%3ATetianCarbonates&aplimit=500
>>> HTTP/1.1
>>><<<<<<
>>>
>>>I've gone back and looked at the code and can find no way that the namespace 
>>>would be corrupted.  But maybe this is actually correct.  Can you send along 
>>>a screenshot of the view page for the job?
>>>
>>>
>>>Also, the wiki connector seeds documents in batches of 500 at a time.  It 
>>>uses the last title fetched in order to be able to find the next batch of 
>>>500.  So if there are a lot of documents, it will take a while to seed them 
>>>all.  In your log I see signs that this is what is happening.  Have a look 
>>>at all the GET requests and note the apfrom parameter.
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>Thanks,
>>>Karl
>>>
>>>
>>>
>>>
>>>
>>>
>>
>
