I have just completed testing the behaviour on the unaltered multiprocess-example using the provided HSQL instance.
Indeed, when using the file system connector, Manifold works as it should. The agent can be stopped and restarted and the previously processed documents are retained.

When I tried the JDBC (pointed to a MySQL DB) and Wiki connectors, however, I received the same results as yesterday: all documents are deleted as soon as the agent restarts (not on shutdown, but when running the agent again after it has been stopped).

For the JDBC connector I could imagine that this may somehow be related to flawed seeding or version queries (although I believe them to be OK), but in the case of Wiki there are hardly any settings I believe I could have gotten wrong.

On Mon, Oct 8, 2012 at 6:58 PM, Karl Wright <[email protected]> wrote:
> I just tried this; the experiment yields no document deletions
> recorded in the simple history (as expected).
>
> So clearly there is a complicating factor somewhere that you will need to
> find.
>
> I would suggest going about the basic process of eliminating
> variables. For example, try a continuous crawl in your environment
> using the file system connector on a moderately-sized set of sample
> documents, and see if it seems to do the same thing as the other
> connectors you are using. If it does, then that would suggest that
> one of your modifications was in fact causing the problem. If not,
> then I should look at trying to repeat the experiment here with one of
> the connectors you are working with.
>
> Thanks,
> Karl
>
> On Mon, Oct 8, 2012 at 12:22 PM, Karl Wright <[email protected]> wrote:
> > There is no logic whatsoever in agents-shutdown that should delete
> > documents from the queue and from the index, and I have never seen
> > this behavior before, but this is really easy to verify. It should be
> > simple to take an unaltered 1.0 distribution, create a filesystem job
> > on the multiprocess example, start it crawling continuously, then stop
> > and restart the agents process, and then look at the simple history to
> > see whether any documents get deleted or not. I may have time to try
> > this later in the evening, we'll see.
> >
> > Karl
> >
> > On Mon, Oct 8, 2012 at 12:06 PM, Martin Gielow <[email protected]> wrote:
> >> Hi Karl,
> >>
> >> thanks for the lightning-speed reply! :)
> >>
> >> On Mon, Oct 8, 2012 at 5:23 PM, Karl Wright <[email protected]> wrote:
> >>>
> >>> Hi Martin,
> >>>
> >>> The behavior you describe is expected only if you are either deleting
> >>> the job, or the job is set to expire old documents after a certain
> >>> time interval (and that interval has transpired).
> >>>
> >>> Can you tell me what your expiration interval is?
> >>>
> >>
> >> The expiration interval is set to 1440 (minutes, according to the
> >> interface). I also just tried to leave the box empty, so that there
> >> should be no expiration, but the behaviour remained the same.
> >>
> >>>
> >>> Also, when you say "shutting down agents process", can you clarify
> >>> what deployment model you are using? How are you shutting down this
> >>> process?
> >>
> >>
> >> I am using a slightly modified version of the multiprocess-example with
> >> postgres as the DBMS. To run and shut down the agents I use the batch
> >> files that are provided with the example (start-agents.bat and
> >> stop-agents.bat). I have also tried to run the agents process from
> >> Eclipse to be able to debug into it and was getting the same results.
> >>
> >>>
> >>> Thanks,
> >>> Karl
> >>
> >>
> >> Regards,
> >> Martin
> >>
> >>
> >>>
> >>>
> >>> On Mon, Oct 8, 2012 at 11:18 AM, Martin Gielow <[email protected]>
> >>> wrote:
> >>> > Hello,
> >>> >
> >>> > I'm using Manifold to crawl several data sources using the Wiki and
> >>> > the JDBC connectors. I have set the associated jobs to run
> >>> > continuously so that new documents will be added in a timely manner.
> >>> > The problem I am having with this is that whenever the Agent is
> >>> > stopped and then restarted, the jobs will delete all of their
> >>> > documents (also propagating the deletes to the associated output
> >>> > connection) before turning themselves inactive (which they
> >>> > shouldn't, as they are set to run continuously).
> >>> >
> >>> > If I then restart the job, in case of the JDBC connection, it is not
> >>> > finding any previously added documents and will set itself inactive
> >>> > again. In case of the Wiki connection, the documents are also
> >>> > deleted, but are successfully reindexed when the job is restarted
> >>> > manually.
> >>> >
> >>> > The only way I found to prevent the jobs from deleting their items in
> >>> > this case was to manually stop the affected jobs before the Agent is
> >>> > stopped (using the abort option) and to restart them after the Agent
> >>> > has been restarted.
> >>> >
> >>> >
> >>> > I am using the 1.0 release of Manifold and couldn't find anything
> >>> > regarding this behaviour in either the documentation or the wiki.
> >>> >
> >>> > Is there an obvious flaw with my setup or something I may have
> >>> > missed in the configuration?
> >>> >
> >>> > Thanks in advance for any tips!
> >>> >
> >>> > Regards,
> >>> > Martin
