Hello.

I increased the RAM to 4GB and manually executed the job to crawl the “Web 
repository” containing 3800 PDF documents.

I understood that “Start” executes a full scan, while “Start minimal” 
executes an incremental scan only over modified documents.


I executed the job with “Start”: it took nearly 20 hours.
After that,
I executed the job with “Start minimal”: it rescanned the same 3800 documents, 
so it also took 20 hours.

Why is this?

Note that no new documents were added between the moment I started the job with 
“Start” and the time I started the job with “Start minimal”.


Thanks for your help!

Mario






From: Karl Wright [mailto:daddy...@gmail.com]
Sent: Tuesday, 12 August 2014 17:26
To: user@manifoldcf.apache.org
Subject: Re: How to delete unreachable documents on continuous crawling?

Hi Mario,

Setting up a schedule does not prevent you from starting the job manually.
But it sounds like you understand the solution.

Thanks,
Karl

On Tue, Aug 12, 2014 at 10:30 AM, Bisonti Mario 
<mario.biso...@vimar.com> wrote:
OK, so I think I understand better now.

But I have 3800 PDF documents, so a “full crawl” through Tika takes very long, 
about 2 days. (Perhaps I need to increase the RAM?)

I am using the Web connector, so I see the “Start minimal” option.

I understand that I can do this:
1) a full crawl on Saturday night, so that it deletes orphaned files
2) a minimal crawl every other night, so that it crawls only changed 
documents

Are 1) and 2) right, or have I misunderstood?
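To make that weekly rotation concrete, here is a minimal Python sketch of the plan in 1) and 2) (illustrative only — this is not ManifoldCF code, and the function name is my own):

```python
from datetime import date

def crawl_type(day: date) -> str:
    """Pick the crawl type for a given night: a full crawl on Saturday
    (so orphaned documents get deleted), and a minimal crawl on every
    other night (changed documents only)."""
    # date.weekday(): Monday == 0 ... Saturday == 5, Sunday == 6
    return "full" if day.weekday() == 5 else "minimal"

print(crawl_type(date(2014, 8, 16)))  # Saturday  -> full
print(crawl_type(date(2014, 8, 17)))  # Sunday    -> minimal
```

In ManifoldCF itself this would correspond to two scheduling records on the same job: one Saturday record without the minimal flag, and one record for the remaining days with it.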


Furthermore, the option “Start even inside a scheduled window” is not so clear 
to me, because I tried “Start when scheduled window start” but I am able to 
start the job manually, too.

Thanks a lot!


Mario



From: Karl Wright [mailto:daddy...@gmail.com]
Sent: Tuesday, 12 August 2014 14:54

To: user@manifoldcf.apache.org
Subject: Re: How to delete unreachable documents on continuous crawling?

Hi Mario,

What I would do is set up a single job.  (Multiple jobs that share the same 
documents may work, but they aren't recommended, because a document must vanish 
from ALL jobs that share it before it is removed.)  There are two different 
possibilities for the schedule, depending on the kind of connector you are 
using:

(1) Repeated full crawls
(2) Mostly minimal crawls, with periodic full crawls

If the connector you are using makes any distinction between minimal and full 
crawls, then (2) would probably be more efficient for you.  But only on full 
crawls will unreachable documents be removed.

To do the setup:
-- you will need multiple scheduling records for (2), but may be able to do (1) 
with a single scheduling record
-- for each day, you want the window to start at midnight, and its length to be 
the equivalent of 24 hours
-- you want to select the option to start crawls in the middle of a window, not 
just at the beginning

This should give you what you want.
Karl
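As a rough illustration of those scheduling rules, here is a sketch of the logic in Python (the names and structure are mine, not ManifoldCF's internal model):

```python
from datetime import datetime, time, timedelta

WINDOW_START = time(0, 0)            # each day's window opens at midnight
WINDOW_LENGTH = timedelta(hours=24)  # and lasts the equivalent of 24 hours

def may_start(now: datetime, start_inside_window: bool) -> bool:
    """Decide whether a job may begin at 'now'.

    With the option to start crawls in the middle of a window selected,
    the job may begin at any point while the window is open; otherwise
    it only starts at the instant the window opens."""
    opens = datetime.combine(now.date(), WINDOW_START)
    in_window = opens <= now < opens + WINDOW_LENGTH
    return in_window and (start_inside_window or now == opens)
```

With a midnight start and a 24-hour length, the window is effectively always open, so the “start inside a window” option is what allows a crawl that begins late, or runs long, to still be picked up mid-window.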
Karl

On Tue, Aug 12, 2014 at 8:43 AM, Bisonti Mario 
<mario.biso...@vimar.com> wrote:
So, I suppose, the best solution could be:
continuous recrawling, plus one periodic recrawl to delete orphaned documents.

Can I overlap the two jobs?

Mario Bisonti
Information and Communications Technology

VIMAR SpA
Tel. +39 0424 488 644
mario.biso...@vimar.com
Take care of the environment. Print only if necessary.





From: Karl Wright [mailto:daddy...@gmail.com]
Sent: Tuesday, 12 August 2014 12:21

To: user@manifoldcf.apache.org
Subject: Re: How to delete unreachable documents on continuous crawling?

Hi Mario,

Yes, periodic recrawling allows ManifoldCF the opportunity to discover 
abandoned documents and remove them.

Karl

On Tue, Aug 12, 2014 at 6:18 AM, Bisonti Mario 
<mario.biso...@vimar.com> wrote:
OK, thanks.

So you suggest that I not use continuous crawling, and instead schedule a 
periodic recrawl of all documents?
Is that better?
Thanks a lot.



Mario





From: Karl Wright [mailto:daddy...@gmail.com]
Sent: Tuesday, 12 August 2014 12:16
To: user@manifoldcf.apache.org
Subject: Re: How to delete unreachable documents on continuous crawling?

Hi Mario,
Please read ManifoldCF in Action, Chapter 1.  Continuous crawling has no 
mechanism for deleting unreachable documents, and never will, because it is 
fundamentally impossible to do.
Thanks,
Karl

On Tue, Aug 12, 2014 at 6:10 AM, Bisonti Mario 
<mario.biso...@vimar.com> wrote:
Hello.
I set up continuous crawling on a folder of a website to index the PDF files 
it contains.

Schedule type: Rescan documents dynamically
Recrawl interval (if continuous): 5

I see that if documents are added to the folder, they are indexed, but if 
documents are deleted, they are not removed from the index.
I see that “ManifoldCF in Action” mentions “…that continuous 
crawling seems to be missing a phase – the “delete unreachable documents” 
phase.”

But how could I solve this problem, please?
Thanks a lot for your help.
Mario







