Consider the following simple example of a nested batch with no retries. This is a very common scenario for batch processing, where an input source is processed until exhausted, but we commit periodically at the end of a "chunk" of processing.
1 | REPEAT(until=exhausted) { | 2 | TX { 3 | REPEAT(size=5) { 3.1 | input; 3.2 | output; | } | } | | }
The input operation (3.1) could be a message-based receive (e.g. JMS), or a file-based read, but to recover and continue processing with a chance of completing the whole job, it must be transactional. The same applies to the operation at (3.2) - it must be either transactional or idempotent.
If the chunk at REPEAT(3) fails because of a database exception at (3.2), then TX(2) will roll back the whole chunk.
It is also useful to use a retry for an operation which is not transactional, like a call to a web-service or other remote resource. For example:
0 | TX { 1 | input; 1.1 | output; 2 | RETRY { 2.1 | remote access; | } | }
This is actually one of the most useful applications of a retry, since a remote call is much more likely to fail and be retryable than a database update. As long as the remote access (2.1) eventually succeeds, the transaction TX(0) will commit. If the remote access (2.1) eventually fails, then the transaction TX(0) is guaranteed to roll back.
The most typical batch processing pattern is to add a retry to the inner block of the chunk in the simple example. Consider this:
1 | REPEAT(until=exhausted, exception=not critical) { | 2 | TX { 3 | REPEAT(size=5) { | 4 | RETRY(stateful, exception=deadlock loser) { 4.1 | input; 5 | } PROCESS { 5.1 | output; 6 | } SKIP and RECOVER { | notify; | } | | } | } | | }
The inner RETRY(4) block is marked as "stateful" - see the typical use case for a description of a stateful retry. This means that if the the retry PROCESS(5) block fails, the behaviour of the RETRY(4) is as follows.
Notice that the notation used for the RETRY(4) in the plan above shows explictly that the the input step (4.1) is part of the retry. It also makes clear that there are two alternate paths for processing: the normal case is denoted by PROCESS(5), and the recovery path is a separate block, RECOVER(6). The two alternate paths are completely distinct: only one is ever taken in normal circumstances.
In special cases (e.g. a special TranscationValidException type), the retry policy might be able to determine that the RECOVER(6) path can be taken on the last attempt after PROCESS(5) has just failed, instead of waiting for the item to be re-presented. This is not the default behavior because it requires detailed knowledge of what has happened inside the PROCESS(5) block, which is not usually available - e.g. if the output included write access before the failure, then the exception should be rethrown to ensure transactional integrity.
The completion policy in the outer, REPEAT(1) is crucial to the success of the above plan. If the output(5.1) fails it may throw an exception (it usually does, as described), in which case the transaction TX(2) fails and the exception could propagate up through the outer batch REPEAT(1). We do not want the whole batch to stop because the RETRY(4) might still be successful if we try again, so we add the exception=not critical to the outer REPEAT(1).
Note, however, that if the TX(2) fails and we do try again, by virtue of the outer completion policy, the item that is next processed in the inner REPEAT(3) is not guaranteed to be the one that just failed. It might well be, but it depends on the implementation of the input(4.1). Thus the output(5.1) might fail again, on a new item, or on the old one. The client of the batch should not assume that each RETRY(4) attempt is going to process the same items as the last one that failed. E.g. if the termination policy for REPEAT(1) is to fail after 10 attempts, it will fail after 10 consecutive attempts, but not necessarily at the same item. This is consistent with the overall retry strategy: it is the inner RETRY(4) that is aware of the history of each item, and can decide whether or not to have another attempt at it.
The inner batches or chunks in the typical example above can be executed concurrently by configuring the outer batch to use an AsyncTaskExecutor. The outer batch waits for all the chunks to complete before completing.
1 | REPEAT(until=exhausted, concurrent, exception=not critical) { | 2 | TX { 3 | REPEAT(size=5) { | 4 | RETRY(stateful, exception=deadlock loser) { 4.1 | input; 5 | } PROCESS { | output; 6 | } RECOVER { | recover; | } | | } | } | | }
The individual items in chunks in the typical can also in principle be processed concurrently. In this case the transaction boundary has to move to the level of the individual item, so that each transaction is on a single thread:
1 | REPEAT(until=exhausted, exception=not critical) { | 2 | REPEAT(size=5, concurrent) { | 3 | TX { 4 | RETRY(stateful, exception=deadlock loser) { 4.1 | input; 5 | } PROCESS { | output; 6 | } RECOVER { | recover; | } | } | | } | | }
This plan sacrifices the optimisation benefit, that the simple plan had, of having all the transactional resources chunked together. It is only useful if the cost of the processing (5) is much higher than the cost of transaction management (3).
There is a tighter coupling between batch-retry and TX management than we would ideally like. In particular a stateless retry cannot be used to retry database operations with a transaction manager that doesn't support NESTED propagation.
For a simple example using retry without repeat, consider this:
1 | TX { | 1.1 | input; 2.2 | database access; 2 | RETRY { 3 | TX { 3.1 | database access; | } | } | | }
Again, and for the same reason, the inner transaction TX(3) can cause the outer transaction TX(1) to fail, even if the RETRY(2) is eventually successful.
Unfortunately the same effect percolates from the retry block up to the surrounding repeat batch if there is one:
1 | TX { | 2 | REPEAT(size=5) { 2.1 | input; 2.2 | database access; 3 | RETRY { 4 | TX { 4.1 | database access; | } | } | } | | }
Now if TX(3) rolls back it can pollute the whole batch at TX(1) and force it to roll back at the end.
What about non-default propagation?
If TX(3) rolls back, TX(1) does not necessarily (but it probably will in practice because the retry will throw a roll back exception).
So NESTED is best if the retry block contains any database access.
Default propagation is always OK for simple cases where there are no nested database transactions. Consider this (where the SESSION and TX are not global XA resources, so their resources are orthogonal):
0 | SESSION { 1 | input; 2 | RETRY { 3 | TX { 3.1 | database access; | } | } | }
Here there is a transactional message SESSION(0), but it doesn't participate in other transactions with PlatformTransactionManager, so doesn't propagate when TX(3) starts. There is no database access outside the RETRY(2) block. If TX(3) fails and then eventually succeeds on a retry, SESSION(0) can commit (it can do this independent of a TX block). This is similar to the vanilla "best-efforts-one-phase-commit" scenario - the worst that can happen is a duplicate message when the RETRY(2) succeeds and the SESSION(0) cannot commit, e.g. because the message system is unavailable.
The distinction between a stateless and a stateful retry in the typical example above is important. It is actually ultimately a transactional constraint that forces the distinction, and this constraint also makes it obvious why the distinction exists.
We start with the observation that there is no way to skip an item that failed and successfully commit the rest of the chunk unless we wrap the item processing in a transaction. So we simplify the typical batch execution plan to look like this:
0 | REPEAT(until=exhausted) { | 1 | TX { 2 | REPEAT(size=5) { | 3 | RETRY(stateless) { 4 | TX { 4.1 | input; 4.2 | database access; | } 5 | } RECOVER { 5.1 | skip; | } | | } | } | | }
Here we have a stateless RETRY(3) with a RECOVER(5) path that kicks in after the final attempt fails. The "stateless" label just means that the block will be repeated without rethrowing any exception up to some limit. This will only work if the transaction TX(4) has propagation NESTED.
If the TX(3) has default propagation properties and it rolls back, it will pollute the outer TX(1). The inner transaction is assumed by the transaction manager to have corrupted the transactional resource, and so it cannot be used again.
Support for NESTED propagation is sufficiently rare that we choose not to support recovery with stateless retries in current versions of Spring Batch. The same effect can always be achieved (at the expense of repeating more processing) using the typical pattern above.