Thanks for your answer Karl. I was unsure about that regarding the output 
connections, but it is still the same pipeline after all.
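Karl's point below, that one document takes the sum of the time all pipeline components need, can be sketched as follows. This is a minimal illustration with made-up stage names and timings, not ManifoldCF code:

```java
// A minimal sketch: with ONE pipeline, the per-document time is the sum of
// the times of all stages, outputs included. Names and numbers below are
// illustrative assumptions, not the ManifoldCF API.
public class PipelineTiming {
    // Total time one document spends in the pipeline: stages run in sequence.
    static long perDocumentMillis(long[] stageMillis) {
        long total = 0;
        for (long t : stageMillis) {
            total += t;
        }
        return total;
    }

    public static void main(String[] args) {
        // fetch, transform, fast output, slow output (hypothetical timings)
        long[] stages = {50, 20, 10, 20};
        System.out.println(perDocumentMillis(stages)); // 100
    }
}
```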
-------- Original message --------
From: Karl Wright <daddy...@gmail.com>
Date: 10/09/2019 20:08 (GMT+01:00)
To: user@manifoldcf.apache.org
Subject: Re: Job Multiple Outputs

Hi Julien,

You must understand that a job with a complex pipeline is really not running N 
independent jobs; it's running ONE job. Every document is processed through the 
pipeline only once. The pipeline may have faster components and slower 
components; that doesn't matter; each document takes the sum total of the time 
all components need to fetch and process it.

Karl

On Tue, Sep 10, 2019 at 12:48 PM Julien Massiera 
<julien.massi...@francelabs.com> wrote:
Ok, so to be sure I understood what you are saying:

Suppose a job with two output connections, where one of the outputs is twice 
as fast as the other at indexing documents. At any given time t, both outputs 
will have indexed the same number of documents, no matter that one output is 
faster than the other. In other words: the faster output will never have 
indexed all the crawled documents while the slower one still has half of them 
left to index.

Am I wrong?
    
On 10/09/2019 18:09, Karl Wright wrote:

The output connection contract is that a request to index is made to the 
connector, and the connector returns when it is done. When there are multiple 
output connections, each is handed a copy of the document, one after the 
other, and told to index it. This is all done by one worker thread. Multiple 
worker threads are not used for multiple outputs of the same document.
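The sequential handoff described above can be sketched like this. OutputConnector here is a stand-in type and the names are illustrative, not the real ManifoldCF connector interfaces:

```java
import java.util.ArrayList;
import java.util.List;

// A minimal sketch of the handoff: a single worker thread hands the same
// document to each output in turn; the next output does not start until the
// previous one returns. OutputConnector is a hypothetical stand-in, not the
// real ManifoldCF interface.
public class SequentialOutputs {
    interface OutputConnector {
        void indexDocument(String doc); // contract: return only when done
    }

    static final List<String> trace = new ArrayList<>();

    static OutputConnector named(String name) {
        return doc -> trace.add(name + ":" + doc);
    }

    public static void main(String[] args) {
        List<OutputConnector> outputs =
            List.of(named("fastOutput"), named("slowOutput"));
        // ONE thread pushes the document through every output, in order.
        for (OutputConnector out : outputs) {
            out.indexDocument("doc1");
        }
        System.out.println(trace); // [fastOutput:doc1, slowOutput:doc1]
    }
}
```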
        
The framework is smart enough to not hand a document to a connector if it 
hasn't changed (according to how the connector computes the 
connector-specific output version string).
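That skip logic can be sketched as a version-string comparison. The class and method names below are hypothetical, chosen only to illustrate the idea, not the actual ManifoldCF API:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the version-string check: compare the output version
// string stored for a document last time with the one computed now, and skip
// the hand-off to the connector when they match. Illustrative names only.
public class VersionCheck {
    private final Map<String, String> storedVersions = new HashMap<>();

    boolean needsIndexing(String docId, String newVersionString) {
        String old = storedVersions.get(docId);
        if (newVersionString.equals(old)) {
            return false; // unchanged: don't hand the document to the connector
        }
        storedVersions.put(docId, newVersionString);
        return true;
    }

    public static void main(String[] args) {
        VersionCheck vc = new VersionCheck();
        System.out.println(vc.needsIndexing("doc1", "v1")); // true: first time
        System.out.println(vc.needsIndexing("doc1", "v1")); // false: unchanged
        System.out.println(vc.needsIndexing("doc1", "v2")); // true: changed
    }
}
```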
        
        
        Karl
On Tue, Sep 10, 2019 at 11:00 AM Julien Massiera 
<julien.massi...@francelabs.com> wrote:
        
Hi,

I would like an explanation of the behavior of a job when several outputs are 
configured. My main question is: for each output, how is document ingestion 
managed? More precisely, are the ingest processes synchronized or not? (In 
other words, does the ingestion of the next document wait until the current 
ingestion has completed for both outputs?) Also, if one output is configured 
to send a commit at the end of the job, is that commit pending until the last 
ingestion has occurred in the other output?
          
          Thanks for your help,
          Julien
        
      
    
-- 
Julien MASSIERA
Product Development Director
France Labs – The Search experts
Datafari – Winner of the 2018 Big Data trophy at the Digital Innovation Makers 
Summit
www.francelabs.com
  