Chapter 2. The Domain Language of Batch

2.1. Introduction

To any experienced batch architect, the overall concepts of batch processing used in Spring Batch should be familiar and comfortable. There are "Jobs" and "Steps" and developer-supplied processing units called ItemReaders and ItemWriters. However, because of the Spring patterns, operations, templates, callbacks, and idioms, there are opportunities for the following:

  • significant improvement in adherence to a clear separation of concerns

  • clearly delineated architectural layers and services provided as interfaces

  • simple and default implementations that allow for quick adoption and ease of use out-of-the-box

  • significantly enhanced extensibility

The diagram below is only a slight variation of the batch reference architecture that has been in use for decades. It provides an overview of the high-level components, technical services, and basic operations required by a batch architecture. This architecture framework is a blueprint that has been proven through decades of implementations on the last several generations of platforms (COBOL/Mainframe, C++/Unix, and now Java/anywhere). JCL and COBOL developers are likely to be as comfortable with the concepts as C++, C#, and Java developers. Spring Batch provides a physical implementation of the layers, components, and technical services commonly found in robust, maintainable systems, addressing the creation of simple-to-complex batch applications, with the infrastructure and extensions to meet very complex processing needs.

2.2. Batch Application Style Interactions and Services

Figure 2.1: Batch Stereotypes

The above diagram highlights the interactions and key services provided by the Spring Batch framework. The colors used are important to understanding the responsibilities of a developer in Spring Batch. Grey represents an external application, such as an enterprise scheduler or a database. It is important to note that scheduling is grey, and should thus be considered separate from Spring Batch. Blue represents application architecture services. In most cases these are provided by Spring Batch with out-of-the-box implementations, but an architecture team may create specific implementations that better address their particular needs. Yellow represents the pieces that must be configured by a developer: for example, the job schedule, so that the job is kicked off at the appropriate time, and a job configuration that defines how the job will be run. It is also worth noting that the ItemReader and ItemWriter used by an application may just as easily be custom ones written by the developer for a specific batch job as ones provided by Spring Batch or an architecture team.

The Batch Application Style is organized into four logical tiers: Run, Job, Application, and Data. The primary goal of organizing an application according to these tiers is to embed what is known as "separation of concerns" within the system. The tiers may be purely conceptual, but can prove effective in mapping the deployment of artifacts onto physical components such as Java runtimes and integrations with data sources and targets. Effective separation of concerns reduces the impact of change on the system. The four conceptual tiers containing batch artifacts are:

  • Run Tier: The Run Tier is concerned with the scheduling and launching of the application. A vendor product is typically used in this tier to allow time-based and interdependent scheduling of batch jobs as well as providing parallel processing capabilities.

  • Job Tier: The Job Tier is responsible for the overall execution of a batch job. It sequentially executes batch steps, ensuring that all steps are in the correct state and all appropriate policies are enforced.

  • Application Tier: The Application Tier contains components required to execute the program. It contains specific tasklets that address the required batch functionality and enforces policies around tasklet execution (e.g., commit intervals, capture of statistics, etc.).

  • Data Tier: The Data Tier provides the integration with the physical data sources that might include databases, files, or queues.

2.3. Job Stereotypes

This section describes stereotypes relating to the concept of a batch job. A job is an entity that encapsulates an entire batch process. As is common with other Spring projects, a Job will be wired together via an XML configuration file. This file may be referred to as the "job configuration". However, Job is just the top of an overall hierarchy:

2.3.1. Job

A job is represented by a Spring bean that implements the Job interface and contains all of the information necessary to define the operations performed by a job. A job configuration is typically contained within a Spring XML configuration file, and the job's name is determined by the "id" attribute associated with the job configuration bean. The job configuration contains:

  • The simple name of the job

  • Definition and ordering of Steps

  • Whether or not the job is restartable

A default simple implementation of the Job interface is provided by Spring Batch in the form of the SimpleJob class, which provides standard execution logic on top of Job that all jobs should utilize. In general, all jobs should be defined using a bean of type SimpleJob:

  <bean id="footballJob"
        class="org.springframework.batch.core.job.SimpleJob">
    <property name="steps">
      <list>
        <!-- Step Bean details omitted for clarity -->
        <bean id="playerLoad" parent="simpleStep" />
        <bean id="gameLoad" parent="simpleStep" />
        <bean id="playerSummarization" parent="simpleStep" />
      </list>
    </property>
    <property name="restartable" value="true" />
  </bean>

2.3.2. JobInstance

A JobInstance refers to the concept of a logical job run. Consider a batch job that should be run once at the end of the day, such as the 'EndOfDay' job from the diagram above. There is one 'EndOfDay' Job, but each individual run of the Job must be tracked separately. In the case of this job, there will be one logical JobInstance per day: a January 1st run, a January 2nd run, and so on. If the January 1st run fails the first time and is run again the next day, it is still the January 1st run. (Usually this corresponds with the data it is processing as well, meaning the January 1st run processes data for January 1st, etc.) That is to say, each JobInstance can have multiple executions (JobExecution is discussed in more detail below), and only one JobInstance corresponding to a particular Job can be running at a given time.

The definition of a JobInstance has absolutely no bearing on the data that will be loaded. It is entirely up to the ItemReader implementation to determine how data is loaded. For example, in the EndOfDay scenario, there may be a column on the data that indicates the 'effective date' or 'schedule date' to which the data belongs, so that the January 1st run loads only data from the 1st, and the January 2nd run only data from the 2nd. Because this determination is likely a business decision, it is left up to the ItemReader. What using the same JobInstance does determine, however, is whether the 'state' (i.e. the ExecutionContext, which is discussed below) from previous executions is used. Using a new JobInstance means 'start from the beginning', and using an existing instance generally means 'start from where you left off'.

2.3.3. JobParameters

Having discussed JobInstance and how it differs from Job, the natural question to ask is: "How is one JobInstance distinguished from another?" The answer is: JobParameters. JobParameters are any set of parameters used to start a batch job, and they can be used for identification or even as reference data during the run. In the example above, where there are two instances, one for January 1st and another for January 2nd, there is really only one Job, but two sets of parameters: one instance was started with a job parameter of 01-01-2008 and the other with a parameter of 01-02-2008. Thus, the contract can be defined as: JobInstance = Job + JobParameters. This allows you to effectively control how you define a JobInstance, since you control what parameters are passed in.
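The contract JobInstance = Job + JobParameters can be sketched in plain Java. This is a hypothetical illustration, not the framework's own classes: instance identity is modelled here as a job name plus a parameter map, so two launches with equal parameters refer to the same logical instance.

```java
import java.util.Map;
import java.util.Objects;

// Hypothetical sketch, not Spring Batch's classes: models the contract
// JobInstance = Job + JobParameters.
class JobInstanceIdentity {

    static final class InstanceKey {
        final String jobName;
        final Map<String, String> parameters;

        InstanceKey(String jobName, Map<String, String> parameters) {
            this.jobName = jobName;
            this.parameters = parameters;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof InstanceKey)) {
                return false;
            }
            InstanceKey other = (InstanceKey) o;
            // Identity is the job name plus the full set of parameters
            return jobName.equals(other.jobName)
                    && parameters.equals(other.parameters);
        }

        @Override
        public int hashCode() {
            return Objects.hash(jobName, parameters);
        }
    }
}
```

Under this model, launching 'EndOfDayJob' twice with schedule.date=2008-01-01 yields the same instance, while 2008-01-02 yields a new one.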

2.3.4. JobExecution

A JobExecution refers to the technical concept of a single attempt to run a Job. An execution may end in failure or success, but the JobInstance corresponding to a given execution will not be marked as complete unless the execution completes successfully. For instance, if we have a JobInstance of the EndOfDay job for 01-01-2008, as described above, that fails to successfully complete its work the first time it is run, when we attempt to run it again (with the same job parameters of 01-01-2008), a new job execution will be created.
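The relationship can be sketched as follows; the types are simplified stand-ins, not the framework's classes. A JobInstance accumulates one JobExecution per attempt, and is considered complete only when some execution finishes successfully.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the framework's classes: one JobExecution is
// created per run attempt, and the owning JobInstance is complete only
// once an execution finishes with COMPLETED status.
class ExecutionTracking {

    enum BatchStatus { STARTED, FAILED, COMPLETED }

    static final class Execution {
        BatchStatus status = BatchStatus.STARTED;
    }

    static final class Instance {
        final List<Execution> executions = new ArrayList<>();

        // Each attempt to run the instance creates a fresh execution
        Execution newExecution() {
            Execution execution = new Execution();
            executions.add(execution);
            return execution;
        }

        // The instance is marked complete only by a successful execution
        boolean isComplete() {
            return executions.stream()
                    .anyMatch(e -> e.status == BatchStatus.COMPLETED);
        }
    }
}
```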

A Job defines what a job is and how it is to be executed, and a JobInstance is a purely organizational object that groups executions together, primarily to enable correct restart semantics. A JobExecution, however, is the primary storage mechanism for what actually happened during a run, and as such contains many more properties that must be controlled and persisted:

Table 2.1. JobExecution properties

  • status: A BatchStatus object that indicates the status of the execution. While running, it is BatchStatus.STARTED; if it fails, BatchStatus.FAILED; and if it finishes successfully, BatchStatus.COMPLETED.
  • startTime: A java.util.Date representing the current system time when the execution was started.
  • endTime: A java.util.Date representing the current system time when the execution finished, regardless of whether or not it was successful.
  • exitStatus: The ExitStatus indicating the result of the run. It is most important because it contains an exit code that will be returned to the caller. See chapter 5 for more details.

These properties are important because they will be persisted and can be used to completely determine the status of an execution. For example, if the EndOfDay job for 01-01 is executed at 9:00 PM, and fails at 9:30, the following entries will be made in the batch meta data tables:

Table 2.2. BATCH_JOB_INSTANCE

JOB_INSTANCE_ID | JOB_NAME
1               | EndOfDayJob

Table 2.3. BATCH_JOB_PARAMS

JOB_INSTANCE_ID | TYPE_CD | KEY_NAME      | DATE_VAL
1               | DATE    | schedule.Date | 2008-01-01 00:00:00

Table 2.4. BATCH_JOB_EXECUTION

JOB_EXECUTION_ID | JOB_INSTANCE_ID | START_TIME              | END_TIME                | STATUS
1                | 1               | 2008-01-01 21:00:23.571 | 2008-01-01 21:30:17.132 | FAILED

Note

Extra columns have been removed from the table for clarity.

Now that the job has failed, assume that it took the entire night for the problem to be determined, so that the 'batch window' is now closed. Assuming the window starts at 9:00 PM, the job is kicked off again for 01-01 the next evening, starting where it left off, and completes successfully at 9:30. Because it is now the next day, the 01-02 job must be run as well; it is kicked off just afterwards at 9:31 and completes in its normal one-hour time at 10:30. There is no requirement that one JobInstance be kicked off after another, unless there is potential for the two jobs to attempt to access the same data, causing locking issues at the database level; it is entirely up to the scheduler to determine when jobs run. Since these are separate JobInstances, Spring Batch makes no attempt to stop them from running concurrently. (Attempting to run the same JobInstance while another execution of it is already running results in a JobExecutionAlreadyRunningException being thrown.) There should now be an extra entry in both the JobInstance and JobParameters tables, and two extra entries in the JobExecution table:

Table 2.5. BATCH_JOB_INSTANCE

JOB_INSTANCE_ID | JOB_NAME
1               | EndOfDayJob
2               | EndOfDayJob

Table 2.6. BATCH_JOB_PARAMS

JOB_INSTANCE_ID | TYPE_CD | KEY_NAME      | DATE_VAL
1               | DATE    | schedule.Date | 2008-01-01 00:00:00
2               | DATE    | schedule.Date | 2008-01-02 00:00:00

Table 2.7. BATCH_JOB_EXECUTION

JOB_EXECUTION_ID | JOB_INSTANCE_ID | START_TIME       | END_TIME         | STATUS
1                | 1               | 2008-01-01 21:00 | 2008-01-01 21:30 | FAILED
2                | 1               | 2008-01-02 21:00 | 2008-01-02 21:30 | COMPLETED
3                | 2               | 2008-01-02 21:31 | 2008-01-02 22:29 | COMPLETED
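The concurrency rules above can be sketched with a hypothetical guard (all types here, including the exception, are stand-ins for the framework's, with the exception made unchecked for brevity): separate JobInstances may run in parallel, but a second execution of an already-running instance is rejected.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch, not the framework's classes: tracks running
// JobInstances and rejects a concurrent second execution of the same one.
class RunningGuard {

    // Stand-in for the framework's exception of the same name
    static class JobExecutionAlreadyRunningException extends RuntimeException {
        JobExecutionAlreadyRunningException(String message) {
            super(message);
        }
    }

    private final Set<String> running = new HashSet<>();

    // instanceKey identifies a JobInstance, e.g. job name plus parameters
    void start(String instanceKey) {
        if (!running.add(instanceKey)) {
            throw new JobExecutionAlreadyRunningException(
                    "A JobExecution is already running for instance: " + instanceKey);
        }
    }

    void finish(String instanceKey) {
        running.remove(instanceKey);
    }
}
```

In this model, 'EndOfDayJob' for 01-01 and 01-02 may run side by side, while a second launch of the still-running 01-01 instance fails fast.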

2.4. Step Stereotypes

A Step is a domain object that encapsulates an independent, sequential phase of a batch job. Therefore, every Job is composed entirely of one or more steps. A Step should be thought of as a unique processing stream that is executed in sequence. For example, a job might have one step that loads a file into a database, another that reads from the database, validates the data, performs processing, and writes to another table, and a third that reads from that table and writes out to a file. Each of these steps is performed completely before moving on to the next: the file is completely read into the database before step 2 can begin. As with Job, a Step has an individual StepExecution that corresponds with a unique JobExecution:

2.4.1. Step

A Step contains all of the information necessary to define and control the actual batch processing. This is a necessarily vague description because the contents of any given Step are at the discretion of the developer writing a Job. A Step can be as simple or complex as the developer desires. A simple Step might load data from a file into the database, requiring little or no code (depending upon the implementations used). A more complex Step may have complicated business rules that are applied as part of the processing.

Steps are defined by instantiating implementations of the Step interface. Two step implementation classes are available in the Spring Batch framework, and each is discussed in detail in Chapter 4 of this guide. For most situations, the ItemOrientedStep implementation is sufficient, but for situations where only one call is needed, such as a stored procedure call or a wrapper around an existing script, a TaskletStep may be a better option.

2.4.2. StepExecution

A StepExecution represents a single attempt to execute a Step. Using the example from JobExecution, if there is a JobInstance for the "EndOfDayJob", with JobParameters of "01-01-2008" that fails to successfully complete its work the first time it is run, when it is executed again, a new StepExecution will be created. Each of these step executions may represent a different invocation of the batch framework, but they will all correspond to the same JobInstance, just as multiple JobExecutions belong to the same JobInstance.

Step executions are represented by objects of the StepExecution class. Each execution contains a reference to its corresponding step and JobExecution, and transaction related data such as commit and rollback count and start and end times. Additionally, each step execution will contain an ExecutionContext, which contains any data a developer needs persisted across batch runs, such as statistics or state information needed to restart. The following is a listing of the properties for StepExecution:

Table 2.8. StepExecution properties

  • status: A BatchStatus object that indicates the status of the execution. While running, the status is BatchStatus.STARTED; if it fails, BatchStatus.FAILED; and if it finishes successfully, BatchStatus.COMPLETED.
  • startTime: A java.util.Date representing the current system time when the execution was started.
  • endTime: A java.util.Date representing the current system time when the execution finished, regardless of whether or not it was successful.
  • exitStatus: The ExitStatus indicating the result of the execution. It is most important because it contains an exit code that will be returned to the caller. See chapter 5 for more details.
  • executionContext: The 'property bag' containing any user data that needs to be persisted between executions.
  • commitCount: The number of transactions that have been committed for this execution.
  • itemCount: The number of items that have been processed for this execution.

2.4.3. ExecutionContext

An ExecutionContext represents a collection of key/value pairs that are persisted and controlled by the framework in order to allow developers a place to store persistent state that is scoped to a StepExecution. For those familiar with Quartz, it is very similar to JobDataMap. The best usage example is restart. Using flat file input as an example, while processing individual lines, the framework periodically persists the ExecutionContext at commit points. This allows the ItemReader to store its state in case a fatal error occurs during the run, or even if the power goes out. All that is needed is to put the current number of lines read into the context, and the framework will do the rest:

executionContext.putLong(getKey(LINES_READ_COUNT), reader.getPosition());

The call above will store the current number of lines read into the ExecutionContext. It should be made just before the framework commits. Being notified before a commit requires one of the various StepListeners, or an ItemStream, which are discussed in more detail later in this guide. When the ItemReader is opened, it can check to see if it has any stored state in the context, and initialize itself from there:

  if (executionContext.containsKey(getKey(LINES_READ_COUNT))) {
    log.debug("Initializing for restart. Restart data is: " + executionContext);

    // Recover the line count persisted at the last successful commit
    long lineCount = executionContext.getLong(getKey(LINES_READ_COUNT));

    LineReader reader = getReader();

    // Skip forward to the last committed position before resuming
    Object record = "";
    while (reader.getPosition() < lineCount && record != null) {
       record = readLine();
    }
  }

The ExecutionContext can also be used for statistics that need to be persisted about the run itself. For example, if a flat file contains orders that span multiple lines, it may be necessary to store how many orders have been processed (which is quite different from the number of lines read), so that an email can be sent at the end of the Step with the total orders processed in the body. The framework handles storing this for the developer, in order to correctly scope it with an individual JobInstance.

It can be very difficult to know whether an existing ExecutionContext should be used or not. For example, using the 'EndOfDay' example from above, when the 01-01 run starts again for the second time, the framework recognizes that it is the same JobInstance and, on an individual Step basis, pulls the ExecutionContext out of the database and hands it to the Step as part of the StepExecution. Conversely, for the 01-02 run the framework recognizes that it is a different instance, so an empty context is handed to the Step. The framework makes many determinations of this type to ensure state is given to the developer at the correct time. It is also important to note that exactly one ExecutionContext exists per StepExecution at any given time. Because this creates a shared keyspace, clients of the ExecutionContext should take care when putting values in to ensure no data is overwritten. The Step itself stores no data in the context, however, so there is no way to adversely affect the framework.
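A minimal sketch of the shared-keyspace point, assuming keys are namespaced per component in the spirit of the getKey(...) calls shown earlier (this class is illustrative, not the framework's ExecutionContext):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not the framework's ExecutionContext: one context
// per StepExecution is a shared keyspace, so each component prefixes its
// own keys to avoid overwriting another component's data.
class SharedContext {

    private final Map<String, Object> entries = new HashMap<>();

    // Namespaced key, e.g. "reader.lines.read.count"
    private static String getKey(String component, String name) {
        return component + "." + name;
    }

    void putLong(String component, String name, long value) {
        entries.put(getKey(component, name), value);
    }

    long getLong(String component, String name) {
        return (Long) entries.get(getKey(component, name));
    }

    boolean containsKey(String component, String name) {
        return entries.containsKey(getKey(component, name));
    }
}
```

With distinct prefixes, a reader and a writer can both keep counts in the same context without colliding.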

2.5. JobRepository

JobRepository is the persistence mechanism for all of the Stereotypes mentioned above. When a job is first launched, a JobExecution is obtained by calling the repository's createJobExecution method, and during the course of execution, StepExecution and JobExecution are persisted by passing them to the repository:

  public interface JobRepository {

    JobExecution createJobExecution(Job job, JobParameters jobParameters)
         throws JobExecutionAlreadyRunningException, JobRestartException;

    void saveOrUpdate(JobExecution jobExecution);

    void saveOrUpdate(StepExecution stepExecution);

    void saveOrUpdateExecutionContext(StepExecution stepExecution);

    StepExecution getLastStepExecution(JobInstance jobInstance, Step step);

    int getStepExecutionCount(JobInstance jobInstance, Step step);
  }

2.6. JobLauncher

JobLauncher represents a simple interface for launching a Job with a given set of JobParameters:

  public interface JobLauncher {

    JobExecution run(Job job, JobParameters jobParameters)
        throws JobExecutionAlreadyRunningException, JobRestartException;
  }

It is expected that implementations will obtain a valid JobExecution from the JobRepository and execute the Job.
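That expectation can be sketched with simplified stand-in types (none of these are the framework's own classes): the launcher obtains an execution from the repository, runs the job, records the outcome, and persists it.

```java
// Hypothetical sketch, not the framework's classes: a launcher obtains a
// JobExecution from a repository, executes the job, and saves the result.
class LauncherSketch {

    enum BatchStatus { STARTED, COMPLETED, FAILED }

    static final class JobExecution {
        BatchStatus status = BatchStatus.STARTED;
    }

    interface Job {
        void execute(JobExecution execution);
    }

    interface Repository {
        JobExecution createJobExecution(String jobName);
        void saveOrUpdate(JobExecution execution);
    }

    static JobExecution run(Job job, String jobName, Repository repository) {
        // A valid execution comes from the repository, not the launcher
        JobExecution execution = repository.createJobExecution(jobName);
        try {
            job.execute(execution);
            execution.status = BatchStatus.COMPLETED;
        } catch (RuntimeException e) {
            execution.status = BatchStatus.FAILED;
        }
        // The outcome is persisted either way
        repository.saveOrUpdate(execution);
        return execution;
    }
}
```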

2.7. JobLocator

JobLocator represents an interface for locating a Job:

  public interface JobLocator {

    Job getJob(String name) throws NoSuchJobException;
  }

This interface is necessary due to the nature of Spring itself. Because there is no guarantee that one ApplicationContext equals one Job, an abstraction is needed to obtain a Job for a given name. It becomes especially useful when launching jobs from within a Java EE application server.
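A map-backed sketch of the idea (all types here are hypothetical stand-ins, with the exception made unchecked for brevity): jobs register under their names, and lookups for unknown names fail loudly.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not the framework's classes: resolve a Job by
// name, independent of how many jobs one ApplicationContext holds.
class MapJobLocator {

    interface Job {
        String getName();
    }

    // Stand-in for the framework's exception; unchecked for brevity
    static class NoSuchJobException extends RuntimeException {
        NoSuchJobException(String message) {
            super(message);
        }
    }

    private final Map<String, Job> jobs = new HashMap<>();

    void register(Job job) {
        jobs.put(job.getName(), job);
    }

    Job getJob(String name) {
        Job job = jobs.get(name);
        if (job == null) {
            throw new NoSuchJobException("No job registered with name: " + name);
        }
        return job;
    }
}
```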

2.8. Item Reader

ItemReader is an abstraction that represents the retrieval of input for a Step, one item at a time. When the ItemReader has exhausted the items it can provide, it indicates this by returning null. More details about the ItemReader interface and its various implementations can be found in Chapter 3.
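The null-on-exhaustion contract can be illustrated with a hypothetical list-backed reader (a stand-in, not one of the framework's implementations):

```java
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch, not a framework implementation: hands out items
// one at a time and returns null once the input is exhausted.
class ListItemReader<T> {

    private final Iterator<T> items;

    ListItemReader(List<T> items) {
        this.items = items.iterator();
    }

    // Returns the next item, or null when there is no more input
    T read() {
        return items.hasNext() ? items.next() : null;
    }
}
```

A Step would call read() in a loop and stop when null is returned.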

2.9. Item Writer

ItemWriter is an abstraction that represents the output of a Step, one item at a time. Generally, an ItemWriter has no knowledge of the input it will receive next, only the item passed in its current invocation. More details about the ItemWriter interface and its various implementations can be found in Chapter 3.

2.10. Tasklet

A Tasklet represents the execution of a logical unit of work, as defined by its implementation of the Spring Batch provided Tasklet interface. A Tasklet is useful for encapsulating processing logic that is not natural to split into read-(transform)-write phases, such as invoking a system command or a stored procedure.
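As a hypothetical sketch (the interface shape here is a stand-in, not the framework's Tasklet), a system command wrapped as a single unit of work might look like:

```java
// Hypothetical sketch, not the framework's Tasklet interface: a single
// logical unit of work that is not naturally split into
// read-(transform)-write phases.
class TaskletSketch {

    interface Tasklet {
        // Performs one unit of work and returns an exit code
        int execute();
    }

    // Example: wrap a system command as one unit of work; the process
    // exit code becomes the tasklet's result. Checked exceptions are
    // rethrown unchecked for brevity.
    static Tasklet systemCommand(String... command) {
        return () -> {
            try {
                return new ProcessBuilder(command).start().waitFor();
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        };
    }
}
```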