Hi, thanks for the quick response. So it seems I didn't miss anything obvious.
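
For the archives, here's roughly the policy I'm planning to implement, sketched as a plain-Java simulation (no Ignite APIs; the RandomFailoverSim name and its signatures are just placeholders of mine, not anything from Ignite). It illustrates why retrying on any node, including a previously failed one, terminates with probability 1 whenever the failure probability is below 1:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Random;

/**
 * Simulation of the failover policy discussed below: on every failure,
 * fail the job over to a random node from the whole topology, repeats
 * allowed. Plain Java, no Ignite dependencies; all names are mine.
 */
public class RandomFailoverSim {

    /**
     * Runs one job until it succeeds, failing each attempt independently
     * with probability failureProb, and returns the number of attempts.
     * Since any node (including one that has already failed this job)
     * may be picked again, the loop terminates with probability 1 for
     * any failureProb < 1.
     */
    public static int attemptsUntilSuccess(List<String> nodes,
            double failureProb, Random rnd) {
        int attempts = 0;
        while (true) {
            // Pick a random node from the full topology each time,
            // unlike AlwaysFailoverSpi, which excludes already-tried
            // nodes and gives up when none are left.
            String node = nodes.get(rnd.nextInt(nodes.size()));
            attempts++;
            if (rnd.nextDouble() >= failureProb) {
                return attempts; // the attempt on 'node' succeeded
            }
            // Otherwise fail over and try again.
        }
    }

    public static void main(String[] args) {
        List<String> topology = Arrays.asList("n1", "n2", "n3");
        System.out.println(attemptsUntilSuccess(topology, 0.5, new Random(42)));
    }
}
```

In the real SPI this selection would live in failover(), using the topology that Ignite passes in.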
Thank you!

Best wishes,
Alexey

On Mon, Jun 29, 2015 at 10:28 PM, Valentin Kulichenko <
[email protected]> wrote:

> Alexey,
>
> I see your point, and it really looks like your use case should be an
> option of AlwaysFailoverSpi (which is the default one). But currently it
> doesn't fail over a job once it has already tried all nodes for that job.
> So you will have to implement your own failover SPI (it should be pretty
> simple - just pick a random node from the topology each time a job is
> trying to fail over).
>
> As for the global nature of the SPI, you're right, but its failover()
> method takes a FailoverContext, which has information about the failed
> job (task name, attributes, exception, etc.), so you can make a decision
> based on this information.
>
> Hope this helps.
>
> Thanks!
>
> On Mon, Jun 29, 2015 at 1:08 PM, Aleksei Valikov <
> [email protected]> wrote:
>
>> Hi,
>>
>> this is basically a copy of
>> http://stackoverflow.com/questions/31124341/how-to-retry-failed-job-on-any-node-with-apache-ignite-gridgain
>>
>> I'm experimenting with fault tolerance
>> <https://apacheignite.readme.io/v1.1/docs/fault-tolerance> in Apache
>> Ignite.
>>
>> What I can't figure out is how to retry a failed job on any node. I have
>> a use case where my jobs call a third-party tool as a system process via
>> a process builder to do some calculations. In some cases the tool may
>> fail, but in most cases it's OK to retry the job on any node, including
>> the one where it previously failed.
>>
>> At the moment Ignite seems to reroute the job to a node which did not
>> have this job before. So, after a while all nodes have been tried and
>> the task fails.
>>
>> What I'm looking for is a way to retry a job on any node.
>>
>> Here's a test to demonstrate my problem.
>>
>> Here's my randomly failing job:
>>
>> public static class RandomlyFailingComputeJob implements ComputeJob {
>>     private static final long serialVersionUID = -8351095134107406874L;
>>     private final String data;
>>
>>     public RandomlyFailingComputeJob(String data) {
>>         Validate.notNull(data);
>>         this.data = data;
>>     }
>>
>>     public void cancel() {
>>     }
>>
>>     public Object execute() throws IgniteException {
>>         final double random = Math.random();
>>         if (random > 0.5) {
>>             throw new IgniteException();
>>         } else {
>>             return StringUtils.reverse(data);
>>         }
>>     }
>> }
>>
>> And below is the task:
>>
>> public static class RandomlyFailingComputeTask extends
>>         ComputeTaskSplitAdapter<String, String> {
>>     private static final long serialVersionUID = 6756691331287458885L;
>>
>>     @Override
>>     public ComputeJobResultPolicy result(ComputeJobResult res,
>>             List<ComputeJobResult> rcvd) throws IgniteException {
>>         if (res.getException() != null) {
>>             return ComputeJobResultPolicy.FAILOVER;
>>         }
>>         return ComputeJobResultPolicy.WAIT;
>>     }
>>
>>     public String reduce(List<ComputeJobResult> results)
>>             throws IgniteException {
>>         final Collection<String> reducedResults = new ArrayList<String>(
>>                 results.size());
>>         for (ComputeJobResult result : results) {
>>             reducedResults.add(result.<String>getData());
>>         }
>>         return StringUtils.join(reducedResults, ' ');
>>     }
>>
>>     @Override
>>     protected Collection<? extends ComputeJob> split(int gridSize,
>>             String arg) throws IgniteException {
>>         final String[] args = StringUtils.split(arg, ' ');
>>         final Collection<ComputeJob> computeJobs = new ArrayList<ComputeJob>(
>>                 args.length);
>>         for (String data : args) {
>>             computeJobs.add(new RandomlyFailingComputeJob(data));
>>         }
>>         return computeJobs;
>>     }
>> }
>>
>> Test code:
>>
>> final Ignite ignite = Ignition.start();
>> final String original = "The quick brown fox jumps over the lazy dog";
>>
>> final String reversed = StringUtils.join(
>>         ignite.compute().execute(new RandomlyFailingComputeTask(),
>>                 original), ' ');
>>
>> As you can see, a failed job should always be failed over. Since the
>> probability of failure != 1, I expect the task to terminate successfully
>> at some point.
>>
>> With a failure threshold of 0.5 and a total of 3 nodes this hardly ever
>> happens. I'm getting an exception like class
>> org.apache.ignite.cluster.ClusterTopologyException: Failed to failover a
>> job to another node (failover SPI returned null). After some debugging
>> I've found out that this is because I eventually run out of nodes - all
>> of them have already been tried.
>>
>> I understand that I can write my own FailoverSpi to handle this, but
>> this just doesn't feel right.
>>
>> First, it seems to be overkill. Second, the SPI is a kind of global
>> thing, while I'd like to decide per job whether it should be retried or
>> failed over. This may, for instance, depend on the exit code of the
>> third-party tool I'm invoking. So configuring failover via the global
>> SPI isn't right.
>>
>> I'd appreciate any pointers.
>>
>> Many thanks and best wishes,
>>
>> Alexey
>>
>
