Hello.
I increased RAM to 4 GB and I manually executed the job to crawl the "Web repository" containing 3800 PDF documents.
I understood that "Start" executes a full scan, while "Start minimal" executes an incremental scan only on modified documents. I executed the job with "Start": it took nearly 20 hours. Then I executed the job with "Start minimal": it rescanned the same 3800 documents, so it again took 20 hours. Why is this? Note that no new documents were added between the moment I started the job with "Start" and the time I started the job with "Start minimal".

Thanks for your help!
Mario

From: Karl Wright [mailto:daddy...@gmail.com]
Sent: Tuesday, August 12, 2014 17:26
To: user@manifoldcf.apache.org
Subject: Re: How delete unreachable documents on continuous crawling?

Hi Mario,

Setting up a schedule does not prevent you from starting the job manually. But it sounds like you understand the solution.

Thanks,
Karl

On Tue, Aug 12, 2014 at 10:30 AM, Bisonti Mario <mario.biso...@vimar.com> wrote:

Ok, so I think I have understood better now. But I have 3800 PDF documents, so a "full crawl" through Tika is very long: it takes 2 days. (Perhaps I need to increase RAM?)

I am using the "web connector", so I see the "Start minimal" option. I understand that I can do this:
1) a full crawl on Saturday night, so that it deletes orphaned files
2) a minimal crawl every night except Saturday, so that it crawls only changed documents

Are 1) and 2) right, or have I misunderstood?

Furthermore, the option "Start even inside a scheduled window" is not entirely clear to me, because I tried "Start when scheduled window starts" but I am able to start the job manually, too.

Thanks a lot!
Mario

From: Karl Wright [mailto:daddy...@gmail.com]
Sent: Tuesday, August 12, 2014 14:54
To: user@manifoldcf.apache.org
Subject: Re: How delete unreachable documents on continuous crawling?

Hi Mario,

What I would do is set up a single job.
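Mario's proposed weekly rotation (a full crawl on Saturday night to remove orphans, minimal crawls on the other nights) can be sketched as a simple mode selector. This is only an illustration of the plan, not ManifoldCF code; the function name is mine:

```python
from datetime import date

def crawl_mode(day: date) -> str:
    """Pick the crawl type for a given night: a full crawl on Saturday
    (the pass that removes orphaned documents), a minimal crawl otherwise."""
    # Python's weekday(): Monday = 0 ... Saturday = 5, Sunday = 6
    return "full" if day.weekday() == 5 else "minimal"
```

For example, Tuesday, August 12, 2014 (the date of this thread) would get a minimal crawl, while Saturday, August 16, 2014 would get the full crawl.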
(Multiple jobs that share the same documents may work, but they aren't recommended, because a document must vanish from ALL jobs that share it before it is removed.)

There are two different possibilities for the schedule, depending on the kind of connector you are using:

(1) Repeated full crawls
(2) Mostly minimal crawls, with periodic full crawls

If the connector you are using makes any distinction between minimal and full crawls, then (2) would probably be more efficient for you. But only on full crawls will unreachable documents be removed.

To do the setup:
-- you will need multiple scheduling records for (2), but may be able to do (1) with a single scheduling record
-- for each day, you want the window to start at midnight, and its length to be the equivalent of 24 hours
-- you want to select the option to start crawls in the middle of a window, not just at the beginning

This should give you what you want.
Karl

On Tue, Aug 12, 2014 at 8:43 AM, Bisonti Mario <mario.biso...@vimar.com> wrote:

So, I suppose, the best solution could be: continuous recrawling, plus one periodic recrawl to delete orphaned documents. Can I superimpose the two jobs?

Mario Bisonti
Information and Communications Technology
VIMAR SpA
Tel. +39 0424 488 644
mario.biso...@vimar.com

Take care of the environment. Print only if necessary.

From: Karl Wright [mailto:daddy...@gmail.com]
Sent: Tuesday, August 12, 2014 12:21
To: user@manifoldcf.apache.org
Subject: Re: How delete unreachable documents on continuous crawling?

Hi Mario,

Yes, periodic recrawling gives ManifoldCF the opportunity to discover abandoned documents and remove them.

Karl

On Tue, Aug 12, 2014 at 6:18 AM, Bisonti Mario <mario.biso...@vimar.com> wrote:

Ok, thanks.
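Karl's setup above (one scheduling record per day, each window starting at midnight with a 24-hour length, and starts permitted inside the window) might be modeled like this. The record fields are purely illustrative and are not the actual ManifoldCF schedule schema:

```python
from datetime import datetime, time, timedelta

# Toy model of the suggested setup: one scheduling record per weekday,
# window starting at midnight, 24 hours long, starts allowed mid-window.
RECORDS = [
    {"weekday": d,                  # 0 = Monday ... 6 = Sunday
     "window_start": time(0, 0),    # midnight
     "window_hours": 24,
     "start_inside_window": True}
    for d in range(7)
]

def may_start(now: datetime) -> bool:
    """True if a crawl is allowed to start at `now` under some record."""
    for rec in RECORDS:
        if rec["weekday"] != now.weekday():
            continue
        start = datetime.combine(now.date(), rec["window_start"])
        end = start + timedelta(hours=rec["window_hours"])
        if rec["start_inside_window"]:
            return start <= now < end   # anywhere inside the window
        return now == start             # only exactly at the window start
    return False
```

With `start_inside_window` set and a midnight-to-midnight window every day, any moment qualifies, which is the point: the job can always be (re)started when its previous run finishes.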
So you suggest not using continuous crawling, and instead scheduling a periodic re-crawl of all documents? Is that better?

Thanks a lot.
Mario

From: Karl Wright [mailto:daddy...@gmail.com]
Sent: Tuesday, August 12, 2014 12:16
To: user@manifoldcf.apache.org
Subject: Re: How delete unreachable documents on continuous crawling?

Hi Mario,

Please read "ManifoldCF in Action", Chapter 1. Continuous crawling has no mechanism for deleting unreachable documents, and never will, because it is fundamentally impossible to do.

Thanks,
Karl

On Tue, Aug 12, 2014 at 6:10 AM, Bisonti Mario <mario.biso...@vimar.com> wrote:

Hello.

I set up continuous crawling on a folder of a website to index the PDF files it contains.

Schedule type: Rescan documents dynamically
Recrawl interval (if continuous): 5

I see that if documents are added to the folder, they are indexed; but if documents are deleted, they aren't deleted from the index.

I see that "ManifoldCF in Action" mentions "…that continuous crawling seems to be missing a phase – the 'delete unreachable documents' phase."

But how could I solve the problem, please?

Thanks a lot for your help.
Mario
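The reason a full crawl can delete orphans while a continuous crawl cannot, as discussed in this thread, comes down to set arithmetic: a full crawl enumerates the complete set of reachable documents, so the index can be diffed against it. A continuous crawl never produces that complete set, so the difference cannot be computed. A minimal sketch (the function name and sets are mine, for illustration):

```python
def unreachable_documents(indexed: set, reachable: set) -> set:
    """Documents still in the index that the full crawl did not reach.

    `reachable` must be the COMPLETE set discovered by a full crawl;
    a continuous crawl only ever sees a partial, rolling subset, so
    this difference would wrongly delete documents it simply hasn't
    visited yet.
    """
    return indexed - reachable
```

For example, if the index holds {"a.pdf", "b.pdf", "c.pdf"} and a full crawl reaches only {"a.pdf", "c.pdf"}, then "b.pdf" is the orphan to remove.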