To any experienced batch architect, the overall concepts of batch processing used in Spring Batch should be familiar and comfortable. There are “Jobs” and “Steps”, and developer-supplied processing units called ItemReaders and ItemWriters. However, because of the Spring patterns, operations, templates, callbacks, and idioms, there are opportunities for the following:
significant improvement in adherence to a clear separation of concerns
clearly delineated architectural layers and services provided as interfaces
simple and default implementations that allow for quick adoption and ease of use out of the box
significantly enhanced extensibility
The diagram below is only a slight variation of the batch reference architecture that has been used for decades. It provides an overview of the high-level components, technical services, and basic operations required by a batch architecture. This architecture framework is a blueprint that has been proven through decades of implementations on the last several generations of platforms (COBOL/mainframe, C++/Unix, and now Java/anywhere). JCL and COBOL developers are likely to be as comfortable with the concepts as C++, C#, and Java developers. Spring Batch provides a physical implementation of the layers, components, and technical services commonly found in robust, maintainable systems that address the creation of simple to complex batch applications, along with the infrastructure and extensions needed for very complex processing needs.
The above diagram highlights the interactions and key services provided by the Spring Batch framework. The colors used are important to understanding a developer's responsibilities in Spring Batch. Grey represents an external application, such as an enterprise scheduler or a database. It is important to note that scheduling is grey and should therefore be considered separate from Spring Batch. Blue represents application architecture services. In most cases these are provided by Spring Batch with out-of-the-box implementations, but an architecture team may create implementations that better address its particular needs. Yellow represents the pieces that must be configured by a developer. For example, developers need to configure a job schedule so that the job is kicked off at the appropriate time, and they need to create a job configuration that defines how their job will be run. It is also worth noting that the ItemReader and ItemWriter used by an application may just as easily be custom ones written by the developer for the specific batch job, rather than ones provided by Spring Batch or even an architecture team.
The Batch Application Style is organized into four logical tiers: Run, Job, Application, and Data. The primary goal of organizing an application according to these tiers is to embed what is known as "separation of concerns" within the system. The tiers may be purely conceptual, but they can prove effective in mapping the deployment of the artifacts onto physical components such as Java runtimes and integrations with data sources and targets. Effective separation of concerns reduces the impact of change to the system. The four conceptual tiers containing batch artifacts are:
Run Tier: The Run Tier is concerned with the scheduling and launching of the application. A vendor product is typically used in this tier to allow time-based and interdependent scheduling of batch jobs as well as providing parallel processing capabilities.
Job Tier: The Job Tier is responsible for the overall execution of a batch job. It sequentially executes batch steps, ensuring that all steps are in the correct state and all appropriate policies are enforced.
Application Tier: The Application Tier contains the components required to execute the program. It contains specific tasklets that address the required batch functionality and enforces policies around tasklet execution (e.g., commit intervals, capture of statistics, etc.).
Data Tier: The Data Tier provides the integration with the physical data sources that might include databases, files, or queues.
This section describes stereotypes relating to the concept of a batch job. A job is an entity that encapsulates an entire batch process. As is common with other Spring projects, a Job is wired together via an XML configuration file, which may be referred to as the "job configuration". However, Job is just the top of an overall hierarchy:
A job is represented by a Spring bean that implements the Job interface and contains all of the information necessary to define the operations performed by a job. A job configuration is typically contained within a Spring XML configuration file, and the job's name is determined by the "id" attribute associated with the job configuration bean. The job configuration contains:
The simple name of the job
Definition and ordering of Steps
Whether or not the job is restartable
A default simple implementation of the Job interface is provided by Spring Batch in the form of the SimpleJob class, which adds some standard functionality on top of Job, namely standard execution logic that all jobs should utilize. In general, all jobs should be defined using a bean of type SimpleJob:
<bean id="footballJob" class="org.springframework.batch.core.job.SimpleJob"> <property name="steps"> <list> <!-- Step Bean details ommitted for clarity --> <bean id="playerload" parent="simpleStep" /> <bean id="gameLoad" parent="simpleStep" /> <bean id="playerSummarization" parent="simpleStep" /> </list> </property> <property name="restartable" value="true" /> </bean>
A JobInstance refers to the concept of a logical job run. Consider a batch job that should be run once at the end of the day, such as the 'EndOfDay' job from the diagram above. There is one 'EndOfDay' Job, but each individual run of the Job must be tracked separately. In the case of this job, there will be one logical JobInstance per day. For example, there will be a January 1st run and a January 2nd run. If the January 1st run fails the first time and is run again the next day, it is still the January 1st run. (Usually this corresponds with the data it is processing as well, meaning the January 1st run processes data for January 1st, and so on.) That is to say, each JobInstance can have multiple executions (JobExecution is discussed in more detail below), and only one JobInstance corresponding to a particular Job can be running at a given time. The definition of a JobInstance has absolutely no bearing on the data that will be loaded. It is entirely up to the ItemReader implementation used to determine how data will be loaded. For example, in the EndOfDay scenario, there may be a column on the data that indicates the 'effective date' or 'schedule date' to which the data belongs, so the January 1st run would only load data from the 1st, and the January 2nd run would only use data from the 2nd. Because this determination is likely to be a business decision, it is left up to the ItemReader to decide. What using the same JobInstance does determine, however, is whether or not the 'state' (i.e. the ExecutionContext, which is discussed below) from previous executions will be used. Using a new JobInstance means 'start from the beginning', and using an existing instance generally means 'start from where you left off'.
Having discussed JobInstance and how it differs from Job, the natural question to ask is: "how is one JobInstance distinguished from another?" The answer is: JobParameters. JobParameters are any set of parameters used to start a batch job, which can be used for identification or even as reference data during the run. In the example above, where there are two instances, one for January 1st and another for January 2nd, there is really only one Job, but it has two JobInstances: one started with a job parameter of 01-01-2008 and another started with a parameter of 01-02-2008. Thus, the contract can be defined as: JobInstance = Job + JobParameters. This allows you to effectively control how you define a JobInstance, since you control what parameters are passed in.
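To make this contract concrete, the following sketch shows the same Job being launched twice with different parameters, producing two distinct JobInstances. The parameter key, method, and bean wiring are illustrative assumptions, and package locations may vary between releases:

import java.text.SimpleDateFormat;
import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobParameters;
import org.springframework.batch.core.JobParametersBuilder;
import org.springframework.batch.core.launch.JobLauncher;

public class EndOfDayLaunchExample {

    // jobLauncher and endOfDayJob are assumed to be wired in from the
    // application context (see the JobLauncher section below).
    public void launchTwoInstances(JobLauncher jobLauncher, Job endOfDayJob) throws Exception {
        SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd");

        // JobInstance = Job + JobParameters: two parameter sets for the
        // same Job yield two distinct JobInstances.
        JobParameters jan1 = new JobParametersBuilder()
                .addDate("schedule.date", format.parse("2008-01-01")).toJobParameters();
        JobParameters jan2 = new JobParametersBuilder()
                .addDate("schedule.date", format.parse("2008-01-02")).toJobParameters();

        jobLauncher.run(endOfDayJob, jan1);  // January 1st instance
        jobLauncher.run(endOfDayJob, jan2);  // January 2nd instance
    }
}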
A JobExecution refers to the technical concept of a single attempt to run a Job. An execution may end in failure or success, but the JobInstance corresponding to a given execution will not be marked as complete unless the execution completes successfully. For instance, consider a JobInstance of the EndOfDay job for 01-01-2008, as described above, that fails to successfully complete its work the first time it is run. When we attempt to run it again (with the same job parameters of 01-01-2008), a new job execution will be created.
A Job defines what a job is and how it is to be executed, and a JobInstance is a purely organizational object that groups executions together, primarily to enable correct restart. A JobExecution, however, is the primary storage mechanism for what actually happened during a run, and as such contains many more properties that must be controlled and persisted:
Table 2.1. JobExecution properties
Property | Definition |
status | A BatchStatus object that indicates the status of the execution. While it is running, the status is BatchStatus.STARTED; if it fails, the status is BatchStatus.FAILED; and if it finishes successfully, the status is BatchStatus.COMPLETED. |
startTime | A java.util.Date representing the current system time when the execution was started. |
endTime | A java.util.Date representing the current system time when the execution finished, regardless of whether or not it was successful. |
exitStatus | The ExitStatus indicating the result of the run. It is most important because it contains an exit code that will be returned to the caller. See Chapter 5 for more details. |
These properties are important because they will be persisted and can be used to completely determine the status of an execution. For example, if the EndOfDay job for 01-01 is executed at 9:00 PM and fails at 9:30, the following entries will be made in the batch meta data tables:
Table 2.2. BATCH_JOB_INSTANCE
JOB_INSTANCE_ID | JOB_NAME |
1 | EndOfDayJob |
Table 2.3. BATCH_JOB_PARAMS
JOB_INSTANCE_ID | TYPE_CD | KEY_NAME | DATE_VAL |
1 | DATE | schedule.Date | 2008-01-01 00:00:00 |
Table 2.4. BATCH_JOB_EXECUTION
JOB_EXECUTION_ID | JOB_INSTANCE_ID | START_TIME | END_TIME | STATUS |
1 | 1 | 2008-01-01 21:00:23.571 | 2008-01-01 21:30:17.132 | FAILED |
Extra columns in the tables have been omitted for clarity.
Now that the job has failed, let's assume that it took the entire course of the night for the problem to be determined, so that the 'batch window' is now closed. Assuming the window starts at 9:00 PM, the job will be kicked off again for 01-01, starting where it left off and completing successfully at 9:30. Because it's now the next day, the 01-02 job must be run as well; it is kicked off just afterwards at 9:31 and completes in its normal one-hour time at 10:30. There is no requirement that one JobInstance be kicked off after another, unless there is potential for the two jobs to attempt to access the same data, causing issues with locking at the database level. It is entirely up to the scheduler to determine when to run. Since they're separate JobInstances, Spring Batch will make no attempt to stop them from being run concurrently. (Attempting to run the same JobInstance while another is already running will result in a JobExecutionAlreadyRunningException being thrown.) There should now be an extra entry in both the JobInstance and JobParameters tables, and two extra entries in the JobExecution table:
Table 2.5. BATCH_JOB_INSTANCE
JOB_INSTANCE_ID | JOB_NAME |
1 | EndOfDayJob |
2 | EndOfDayJob |
Table 2.6. BATCH_JOB_PARAMS
JOB_INSTANCE_ID | TYPE_CD | KEY_NAME | DATE_VAL |
1 | DATE | schedule.Date | 2008-01-01 00:00:00 |
2 | DATE | schedule.Date | 2008-01-02 00:00:00 |
Table 2.7. BATCH_JOB_EXECUTION
JOB_EXECUTION_ID | JOB_INSTANCE_ID | START_TIME | END_TIME | STATUS |
1 | 1 | 2008-01-01 21:00 | 2008-01-01 21:30 | FAILED |
2 | 1 | 2008-01-02 21:00 | 2008-01-02 21:30 | COMPLETED |
3 | 2 | 2008-01-02 21:31 | 2008-01-02 22:29 | COMPLETED |
A Step is a domain object that encapsulates an independent, sequential phase of a batch job. Therefore, every Job is composed entirely of one or more steps. A Step should be thought of as a unique processing stream that will be executed in sequence. For example, consider a job with three steps: one that loads a file into a database, one that reads from the database, validates the data, performs processing, and then writes to another table, and one that reads from that table and writes out to a file. Each of these steps is performed completely before moving on to the next step: the file will be completely read into the database before step 2 can begin. As with Job, a Step has an individual StepExecution that corresponds with a unique JobExecution:
A Step contains all of the information necessary to define and control the actual batch processing. This is a necessarily vague description because the contents of any given Step are at the discretion of the developer writing a Job. A Step can be as simple or complex as the developer desires. A simple Step might load data from a file into the database, requiring little or no code (depending upon the implementations used). A more complex Step may have complicated business rules that are applied as part of the processing.
Steps are defined by instantiating implementations of the Step interface. Two step implementation classes are available in the Spring Batch framework, and they are each discussed in detail in Chapter 4 of this guide. For most situations, the ItemOrientedStep implementation is sufficient, but for situations where only one call is needed, such as a stored procedure call or a wrapper around an existing script, a TaskletStep may be a better option.
A StepExecution represents a single attempt to execute a Step. Using the example from JobExecution: if there is a JobInstance for the "EndOfDayJob", with JobParameters of "01-01-2008", that fails to successfully complete its work the first time it is run, a new StepExecution will be created when it is executed again. Each of these step executions may represent a different invocation of the batch framework, but they will all correspond to the same JobInstance, just as multiple JobExecutions belong to the same JobInstance.
Step executions are represented by objects of the StepExecution class. Each execution contains a reference to its corresponding step and JobExecution, along with transaction-related data such as commit and rollback counts and start and end times. Additionally, each step execution will contain an ExecutionContext, which contains any data a developer needs persisted across batch runs, such as statistics or state information needed to restart. The following is a listing of the properties for StepExecution:
Table 2.8. StepExecution properties
Property | Definition |
status | A BatchStatus object that indicates the status of the execution. While it is running, the status is BatchStatus.STARTED; if it fails, the status is BatchStatus.FAILED; and if it finishes successfully, the status is BatchStatus.COMPLETED. |
startTime | A java.util.Date representing the current system time when the execution was started. |
endTime | A java.util.Date representing the current system time when the execution finished, regardless of whether or not it was successful. |
exitStatus | The ExitStatus indicating the result of the execution. It is most important because it contains an exit code that will be returned to the caller. See Chapter 5 for more details. |
executionContext | The 'property bag' containing any user data that needs to be persisted between executions. |
commitCount | The number of transactions that have been committed for this execution. |
itemCount | The number of items that have been processed for this execution. |
An ExecutionContext represents a collection of key/value pairs that are persisted and controlled by the framework in order to give developers a place to store persistent state that is scoped to a StepExecution. For those familiar with Quartz, it is very similar to JobDataMap. The best usage example is restart. Using flat file input as an example, while processing individual lines, the framework periodically persists the ExecutionContext at commit points. This allows the ItemReader to store its state in case a fatal error occurs during the run, or even if the power goes out. All that is needed is to put the current number of lines read into the context, and the framework will do the rest:
executionContext.putLong(getKey(LINES_READ_COUNT), reader.getPosition());
The call above will store the current number of lines read into the ExecutionContext. It should be made just before the framework commits. Being notified before a commit requires one of the various StepListeners, or an ItemStream, both of which are discussed in more detail later in this guide. When the ItemReader is opened, it can check to see if it has any stored state in the context, and initialize itself from there:
if (executionContext.containsKey(getKey(LINES_READ_COUNT))) {
    log.debug("Initializing for restart. Restart data is: " + executionContext);
    long lineCount = executionContext.getLong(getKey(LINES_READ_COUNT));
    LineReader reader = getReader();
    Object record = "";
    // Fast-forward to the line reached before the previous execution ended.
    while (reader.getPosition() < lineCount && record != null) {
        record = readLine();
    }
}
The ExecutionContext can also be used for statistics about the run itself that need to be persisted. For example, if a flat file contains orders for processing that span multiple lines, it may be necessary to store how many orders have been processed (which is quite different from the number of lines read) so that an email can be sent at the end of the Step with the total orders processed in the body. The framework handles storing this for the developer, in order to correctly scope it with an individual JobInstance.
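Such a counter could be stored with the same idiom as the line count above (ORDERS_PROCESSED_COUNT and ordersProcessed are hypothetical names used here for illustration):
executionContext.putLong(getKey(ORDERS_PROCESSED_COUNT), ordersProcessed);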
It can be very difficult to know whether an existing ExecutionContext should be used or not. For example, using the 'EndOfDay' example from above, when the 01-01 run starts again for the second time, the framework recognizes that it is the same JobInstance and, on an individual Step basis, pulls the ExecutionContext out of the database and hands it, as part of the StepExecution, to the Step itself. Conversely, for the 01-02 run the framework recognizes that it is a different instance, so an empty context must be handed to the Step. There are many of these types of determinations that the framework makes for the developer to ensure the state is given to them at the correct time. It is also important to note that exactly one ExecutionContext exists per StepExecution at any given time. Clients of the ExecutionContext should be careful, because this creates a shared keyspace; care should be taken when putting values in to ensure no data is overwritten. However, the Step stores absolutely no data in the context, so there is no way to adversely affect the framework.
JobRepository is the persistence mechanism for all of the stereotypes mentioned above. When a job is first launched, a JobExecution is obtained by calling the repository's createJobExecution method, and during the course of execution, StepExecution and JobExecution are persisted by passing them to the repository:
public interface JobRepository {

    public JobExecution createJobExecution(Job job, JobParameters jobParameters)
            throws JobExecutionAlreadyRunningException, JobRestartException;

    void saveOrUpdate(JobExecution jobExecution);

    void saveOrUpdate(StepExecution stepExecution);

    void saveOrUpdateExecutionContext(StepExecution stepExecution);

    StepExecution getLastStepExecution(JobInstance jobInstance, Step step);

    int getStepExecutionCount(JobInstance jobInstance, Step step);
}
JobLauncher represents a simple interface for launching a Job with a given set of JobParameters:
public interface JobLauncher {

    public JobExecution run(Job job, JobParameters jobParameters)
            throws JobExecutionAlreadyRunningException, JobRestartException;
}
It is expected that implementations will obtain a valid JobExecution from the JobRepository and execute the Job.
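As a usage sketch, a job might be launched from a main method as follows. The configuration file name and bean ids are illustrative assumptions, and package locations may vary between releases:

import java.util.Date;
import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.JobParametersBuilder;
import org.springframework.batch.core.launch.JobLauncher;
import org.springframework.context.ApplicationContext;
import org.springframework.context.support.ClassPathXmlApplicationContext;

public class FootballJobRunner {

    public static void main(String[] args) throws Exception {
        // Load the job configuration; "footballJob.xml" and the bean ids
        // are assumptions made for this example.
        ApplicationContext context = new ClassPathXmlApplicationContext("footballJob.xml");
        JobLauncher jobLauncher = (JobLauncher) context.getBean("jobLauncher");
        Job job = (Job) context.getBean("footballJob");

        // The launcher obtains a valid JobExecution from the JobRepository
        // and executes the Job, returning the execution when finished.
        JobExecution execution = jobLauncher.run(job,
                new JobParametersBuilder().addDate("run.date", new Date()).toJobParameters());

        System.out.println("Exit status: " + execution.getExitStatus());
    }
}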
JobLocator represents an interface for locating a Job:
public interface JobLocator {

    Job getJob(String name) throws NoSuchJobException;
}
This interface is necessary due to the nature of Spring itself: because we cannot guarantee that one ApplicationContext equals one Job, an abstraction is needed to obtain a Job for a given name. It becomes especially useful when launching jobs from within a Java EE application server.
ItemReader is an abstraction that represents the retrieval of input for a Step, one item at a time. When the ItemReader has exhausted the items it can provide, it indicates this by returning null. More details about the ItemReader interface and its various implementations can be found in Chapter 3.
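As a concrete illustration of this contract, the following hypothetical reader returns one line per call and null once input is exhausted. Only the core read method is shown; the full ItemReader interface, including its lifecycle methods, is covered in Chapter 3.

import java.io.BufferedReader;

public class LineItemReader {

    private final BufferedReader input;

    public LineItemReader(BufferedReader input) {
        this.input = input;
    }

    // One item per call; BufferedReader.readLine() returns null at the end
    // of the stream, matching the convention of signalling exhaustion with null.
    public Object read() throws Exception {
        return input.readLine();
    }
}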
ItemWriter is an abstraction that represents the output of a Step, one item at a time. Generally, an item writer has no knowledge of the input it will receive next, only the item passed in on its current invocation. More details about the ItemWriter interface and its various implementations can be found in Chapter 3.
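The output side can be sketched the same way: a hypothetical writer that handles exactly one item per invocation, with no knowledge of the item that will follow. Again, only the core write method is shown; the full ItemWriter interface is covered in Chapter 3.

import java.io.Writer;

public class LineItemWriter {

    private final Writer output;

    public LineItemWriter(Writer output) {
        this.output = output;
    }

    // Writes only the item passed in on this invocation; the writer has no
    // knowledge of what the next item will be.
    public void write(Object item) throws Exception {
        output.write(item + System.getProperty("line.separator"));
    }
}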
A Tasklet represents the execution of a logical unit of work, as defined by its implementation of the Spring Batch provided Tasklet interface. A Tasklet is useful for encapsulating processing logic that is not natural to split into read-(transform)-write phases, such as invoking a system command or a stored procedure.
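For example, a hypothetical tasklet wrapping a system command might look like the following sketch. Only the single-invocation shape of the concept is shown; the exact method signature is defined by the framework's Tasklet interface, and the command name is an assumption made for illustration.

public class PurgeArchiveFilesTasklet {

    private final String command = "purge_archives.sh"; // hypothetical script

    // A single logical unit of work: run the command once and fail if it
    // returns a non-zero exit code.
    public void execute() throws Exception {
        Process process = Runtime.getRuntime().exec(command);
        if (process.waitFor() != 0) {
            throw new IllegalStateException(
                    "Command '" + command + "' failed with exit code " + process.exitValue());
        }
    }
}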