Hi Martin,

FWIW, the agents startup sequence also does not have logic which deletes
documents or jobs. Nevertheless I will create a ticket and have a look at
this ASAP.

Karl

On Tue, Oct 9, 2012 at 9:25 AM, Martin Gielow <[email protected]> wrote:
> I have just completed testing the behaviour on the unaltered
> multiprocess-example using the provided HSQL instance.
>
> Indeed, when using the file system connector, Manifold works as it should.
> The agent can be stopped and restarted and the previously processed
> documents are retained. When I tried the JDBC (pointed to a MySQL DB) and
> Wiki connectors, however, I received the same results as yesterday - all
> documents are deleted as soon as the agent restarts (not on shutdown but
> when running the agent again after it has been stopped).
>
> For the JDBC connector I could imagine that this may somehow be related to
> flawed seeding or version queries (although I believe them to be ok), but in
> the case of Wiki there are hardly any settings I believe I could have gotten
> wrong.
>
>
> On Mon, Oct 8, 2012 at 6:58 PM, Karl Wright <[email protected]> wrote:
>>
>> I just tried this; the experiment yields no document deletions
>> recorded in the simple history (as expected).
>>
>> So clearly there is a complicating factor somewhere that you will need to
>> find.
>>
>> I would suggest going about the basic process of eliminating
>> variables. For example, try a continuous crawl in your environment
>> using the file system connector on a moderately-sized set of sample
>> documents, and see if it seems to do the same thing as the other
>> connectors you are using. If it does, then that would suggest that
>> one of your modifications was in fact causing the problem. If not,
>> then I should look at trying to repeat the experiment here with one of
>> the connectors you are working with.
>>
>> Thanks,
>> Karl
>>
>> On Mon, Oct 8, 2012 at 12:22 PM, Karl Wright <[email protected]> wrote:
>> > There is no logic whatsoever in agents-shutdown that should delete
>> > documents from the queue and from the index, and I have never seen
>> > this behavior before, but this is really easy to verify. It should be
>> > simple to take an unaltered 1.0 distribution, create a filesystem job
>> > on the multiprocess example, start it crawling continuously, then stop
>> > and restart the agents process, and then look at the simple history to
>> > see whether any documents get deleted or not. I may have time to try
>> > this later in the evening, we'll see.
>> >
>> > Karl
>> >
>> > On Mon, Oct 8, 2012 at 12:06 PM, Martin Gielow <[email protected]> wrote:
>> >> Hi Karl,
>> >>
>> >> thanks for the lightning-speed reply! :)
>> >>
>> >> On Mon, Oct 8, 2012 at 5:23 PM, Karl Wright <[email protected]> wrote:
>> >>>
>> >>> Hi Martin,
>> >>>
>> >>> The behavior you describe is expected only if you are either deleting
>> >>> the job, or the job is set to expire old documents after a certain
>> >>> time interval (and that interval has transpired).
>> >>>
>> >>> Can you tell me what your expiration interval is?
>> >>>
>> >>
>> >> The expiration interval is set to 1440 (minutes, according to the
>> >> interface). I also just tried to leave the box empty, so that there
>> >> should be no expiration, but the behaviour remained the same.
>> >>
>> >>>
>> >>> Also, when you say "shutting down agents process", can you clarify
>> >>> what deployment model you are using? How are you shutting down this
>> >>> process?
>> >>
>> >>
>> >> I am using a slightly modified version of the multiprocess-example with
>> >> postgres as the DBMS. To run and shutdown the agents I use the batch
>> >> files that are provided with the example (start-agents.bat and
>> >> stop-agents.bat).
>> >> I have also tried to run the agents process from Eclipse to be able to
>> >> debug into it and was getting the same results.
>> >>
>> >>>
>> >>> Thanks,
>> >>> Karl
>> >>
>> >>
>> >> Regards,
>> >> Martin
>> >>
>> >>
>> >>>
>> >>>
>> >>> On Mon, Oct 8, 2012 at 11:18 AM, Martin Gielow <[email protected]> wrote:
>> >>> > Hello,
>> >>> >
>> >>> > I'm using Manifold to crawl several data sources using the Wiki and
>> >>> > the JDBC connectors. I have set the associated jobs to run
>> >>> > continuously so that new documents will be added in a timely manner.
>> >>> > The problem I am having with this, is that whenever the Agent is
>> >>> > stopped and then restarted, the jobs will delete all of their
>> >>> > documents (also propagating the deletes to the associated output
>> >>> > connection) before turning themselves inactive (which they shouldn't
>> >>> > as they are set to run continuously).
>> >>> >
>> >>> > If I then restart the job, in case of the JDBC connection, it is not
>> >>> > finding any previously added documents and will set itself inactive
>> >>> > again. In case of the Wiki connection, the documents are also
>> >>> > deleted, but are successfully reindexed when the job is restarted
>> >>> > manually.
>> >>> >
>> >>> > The only way I found to prevent the jobs from deleting their items in
>> >>> > this case, was to manually stop the affected jobs before the Agent is
>> >>> > stopped (using the abort option) and to restart them after the Agent
>> >>> > has been restarted.
>> >>> >
>> >>> >
>> >>> > I am using the 1.0 release of Manifold and couldn't find anything
>> >>> > regarding this behaviour in either the documentation or the wiki.
>> >>> >
>> >>> > Is there an obvious flaw with my setup or something I may have
>> >>> > missed in the configuration?
>> >>> >
>> >>> > Thanks in advance for any tips!
>> >>> >
>> >>> > Regards,
>> >>> > Martin
>> >>
>>
>> >
>
