The current version is still under development and is not considered stable. For the latest stable version, use the Spring Batch 6.0.2 documentation!

FlatFileItemReader

A flat file is any type of file that contains at most two-dimensional (tabular) data. Reading flat files in the Spring Batch framework is facilitated by the FlatFileItemReader, which provides basic functionality for reading and parsing flat files. The two most important required dependencies of FlatFileItemReader are Resource and LineMapper. The LineMapper interface is explored further in the next sections. The resource property represents a Spring Core Resource. Documentation explaining how to create beans of this type can be found in Spring Framework, Chapter 5: Resources. Therefore, this guide does not go into the details of creating Resource objects beyond showing the following simple example:

Resource resource = new FileSystemResource("resources/trades.csv");

In complex batch environments, the directory structures are often managed by the Enterprise Application Integration (EAI) infrastructure, where drop zones for external interfaces are established for moving files from FTP locations to batch processing locations and vice versa. File moving utilities are beyond the scope of the Spring Batch architecture, but it is not unusual for batch job streams to include file moving steps. The batch architecture only needs to know how to locate the files to be processed. Spring Batch begins the process of feeding the data into the pipeline from this starting point. However, Spring Integration provides many of these types of services.

The other properties of FlatFileItemReader let you further specify how your data is interpreted, as described in the following table:

Table 1. FlatFileItemReader Properties

comments (String[]): Specifies line prefixes that indicate comment rows.

encoding (String): Specifies what text encoding to use. The default value is UTF-8.

lineMapper (LineMapper): Converts a String line into an Object representing the item.

linesToSkip (int): Number of lines to ignore at the top of the file.

recordSeparatorPolicy (RecordSeparatorPolicy): Used to determine where line endings are and to do things such as continue over a line ending inside a quoted string.

resource (Resource): The resource from which to read.

skippedLinesCallback (LineCallbackHandler): Interface that receives the raw line content of the lines in the file to be skipped. If linesToSkip is set to 2, this interface is called twice.

strict (boolean): In strict mode, the reader throws an exception from open(ExecutionContext) if the input resource does not exist. Otherwise, it logs the problem and continues.

LineMapper

As with RowMapper, which takes a low-level construct such as a ResultSet and returns an Object, flat-file processing requires the same construct to convert a String line into an Object, as shown in the following interface definition:

public interface LineMapper<T> {

    T mapLine(String line, int lineNumber) throws Exception;

}

The basic contract is that, given the current line and the line number with which it is associated, the mapper should return a resulting domain object. This is similar to RowMapper, in that each line is associated with its line number, just as each row in a ResultSet is tied to its row number. This allows the line number to be tied to the resulting domain object for identity comparison or for more informative logging. However, unlike RowMapper, the LineMapper is given a raw line, which, as discussed above, only gets you halfway there. The line must first be tokenized into a FieldSet, which can then be mapped to an object, as described later in this document.

LineTokenizer

Because there can be many formats of flat-file data that need to be converted into a FieldSet, an abstraction for turning a line of input into a FieldSet is necessary. In Spring Batch, this interface is the LineTokenizer:

public interface LineTokenizer {

    FieldSet tokenize(String line);

}

The contract of a LineTokenizer is such that, given a line of input (in theory the String could encompass more than one line), a FieldSet representing the line is returned. This FieldSet can then be passed to a FieldSetMapper. Spring Batch contains the following LineTokenizer implementations:

  • DelimitedLineTokenizer: Used for files where fields in a record are separated by a delimiter. The most common delimiter is a comma, but pipes or semicolons are often used as well.

  • FixedLengthTokenizer: Used for files where each field in a record is a "fixed width". The width of each field must be defined for each record type.

  • PatternMatchingCompositeLineTokenizer: Determines which LineTokenizer among a list of tokenizers should be used on a particular line by checking against a pattern.
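Conceptually, delimited tokenization splits a line into positional tokens. The following plain-Java sketch is a hypothetical illustration of that core idea only; it is not the actual DelimitedLineTokenizer, which additionally handles quoted fields, configurable quote characters, and included/excluded columns:

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch of the core idea behind delimited tokenization.
public class DelimitedSketch {

    // Split a line on the given delimiter, keeping trailing empty fields
    // (the -1 limit), so that "a,b,," yields four tokens rather than two.
    public static List<String> tokenize(String line, String delimiter) {
        return Arrays.asList(line.split(Pattern.quote(delimiter), -1));
    }

    public static void main(String[] args) {
        System.out.println(tokenize("AbduKa00,Abdul-Jabbar,Karim,rb,1974,1996", ","));
    }
}
```

The real tokenizer returns a FieldSet rather than a raw list, but the positional nature of the tokens is the same.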

FieldSetMapper

The FieldSetMapper interface defines a single method, mapFieldSet, which takes a FieldSet object and maps its contents to an object. This object may be a custom DTO, a domain object, or an array, depending on the needs of the job. The FieldSetMapper is used in conjunction with the LineTokenizer to translate a line of data from a resource into an object of the desired type, as shown in the following interface definition:

public interface FieldSetMapper<T> {

    T mapFieldSet(FieldSet fieldSet) throws BindException;

}

The pattern used is the same as the RowMapper used by JdbcTemplate.

DefaultLineMapper

Now that the basic interfaces for reading flat files have been defined, it becomes clear that three basic steps are required:

  1. Read one line from the file.

  2. Pass the String line into the LineTokenizer#tokenize() method to retrieve a FieldSet.

  3. Pass the FieldSet returned from tokenizing to a FieldSetMapper, returning the result from the ItemReader#read() method.

The two interfaces described above represent two separate tasks: converting a line into a FieldSet and mapping a FieldSet to a domain object. Because the input of a LineTokenizer matches the input of the LineMapper (a line), and the output of the FieldSetMapper matches the output of the LineMapper, a default implementation that uses both a LineTokenizer and a FieldSetMapper is provided. The DefaultLineMapper, shown in the following class definition, represents the behavior most users need:

public class DefaultLineMapper<T> implements LineMapper<T>, InitializingBean {

    private LineTokenizer tokenizer;

    private FieldSetMapper<T> fieldSetMapper;

    public T mapLine(String line, int lineNumber) throws Exception {
        return fieldSetMapper.mapFieldSet(tokenizer.tokenize(line));
    }

    public void setLineTokenizer(LineTokenizer tokenizer) {
        this.tokenizer = tokenizer;
    }

    public void setFieldSetMapper(FieldSetMapper<T> fieldSetMapper) {
        this.fieldSetMapper = fieldSetMapper;
    }
}

The above functionality is provided in a default implementation, rather than being built into the reader itself (as was done in previous versions of the framework), to give users greater flexibility in controlling the parsing process, especially if access to the raw line is needed.

Simple Delimited File Reading Example

The following example illustrates how to read a flat file with an actual domain scenario. This particular batch job reads in football players from the following file:

ID,lastName,firstName,position,birthYear,debutYear
"AbduKa00,Abdul-Jabbar,Karim,rb,1974,1996",
"AbduRa00,Abdullah,Rabih,rb,1975,1999",
"AberWa00,Abercrombie,Walter,rb,1959,1982",
"AbraDa00,Abramowicz,Danny,wr,1945,1967",
"AdamBo00,Adams,Bob,te,1946,1969",
"AdamCh00,Adams,Charlie,wr,1979,2003"

The contents of this file are mapped to the following Player domain object:

public class Player implements Serializable {

    private String ID;
    private String lastName;
    private String firstName;
    private String position;
    private int birthYear;
    private int debutYear;

    public String toString() {
        return "PLAYER:ID=" + ID + ",Last Name=" + lastName +
            ",First Name=" + firstName + ",Position=" + position +
            ",Birth Year=" + birthYear + ",DebutYear=" +
            debutYear;
    }

    // setters and getters...
}

To map a FieldSet into a Player object, a FieldSetMapper that returns a Player must be defined, as shown in the following example:

protected static class PlayerFieldSetMapper implements FieldSetMapper<Player> {
    public Player mapFieldSet(FieldSet fieldSet) {
        Player player = new Player();

        player.setID(fieldSet.readString(0));
        player.setLastName(fieldSet.readString(1));
        player.setFirstName(fieldSet.readString(2));
        player.setPosition(fieldSet.readString(3));
        player.setBirthYear(fieldSet.readInt(4));
        player.setDebutYear(fieldSet.readInt(5));

        return player;
    }
}

The file can then be read by correctly constructing a FlatFileItemReader and calling read, as shown in the following example:

FlatFileItemReader<Player> itemReader = new FlatFileItemReader<>();
itemReader.setResource(new FileSystemResource("resources/players.csv"));
DefaultLineMapper<Player> lineMapper = new DefaultLineMapper<>();
//DelimitedLineTokenizer defaults to comma as its delimiter
lineMapper.setLineTokenizer(new DelimitedLineTokenizer());
lineMapper.setFieldSetMapper(new PlayerFieldSetMapper());
itemReader.setLineMapper(lineMapper);
itemReader.open(new ExecutionContext());
Player player = itemReader.read();

Each call to read returns a new Player object from each line in the file. When the end of the file is reached, null is returned.
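That read-until-null contract drives the typical consumption loop. The following sketch mirrors the pattern with a plain Iterator-backed reader (a hypothetical stand-in, not FlatFileItemReader itself):

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical read loop showing the usual ItemReader consumption pattern:
// read() returns one item per call and null at the end of input.
public class ReadLoopSketch {

    private final Iterator<String> lines;

    public ReadLoopSketch(List<String> lines) {
        this.lines = lines.iterator();
    }

    // Mirrors ItemReader#read: next item, or null when exhausted.
    public String read() {
        return lines.hasNext() ? lines.next() : null;
    }

    public static void main(String[] args) {
        ReadLoopSketch reader = new ReadLoopSketch(List.of("player1", "player2"));
        List<String> items = new ArrayList<>();
        String item;
        while ((item = reader.read()) != null) {
            items.add(item);
        }
        System.out.println(items);
    }
}
```

In a real job, the step infrastructure performs this loop for you; calling read directly, as in the example above, is mostly useful in tests.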

Mapping Fields by Name

DelimitedLineTokenizerFixedLengthTokenizer 都支持一个额外功能,它在作用上与 JDBC 的 ResultSet 类似。你可以把字段名称注入到这两种 LineTokenizer 实现中,以提升映射函数的可读性。 首先,把平面文件中所有字段的列名注入到 tokenizer 中,如下例所示:

tokenizer.setNames(new String[] {"ID", "lastName", "firstName", "position", "birthYear", "debutYear"});

A FieldSetMapper can use this information as follows:

public class PlayerMapper implements FieldSetMapper<Player> {
    public Player mapFieldSet(FieldSet fs) {

       if (fs == null) {
           return null;
       }

       Player player = new Player();
       player.setID(fs.readString("ID"));
       player.setLastName(fs.readString("lastName"));
       player.setFirstName(fs.readString("firstName"));
       player.setPosition(fs.readString("position"));
       player.setDebutYear(fs.readInt("debutYear"));
       player.setBirthYear(fs.readInt("birthYear"));

       return player;
   }
}
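The by-name reads used in PlayerMapper above boil down to a name-to-index lookup layered over positional tokens. The following plain-Java sketch is a hypothetical illustration of that mechanism, not the real FieldSet API (which also offers typed reads such as readDate):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of by-name access over positional tokens,
// mirroring what tokenizer.setNames(...) enables on a FieldSet.
public class NamedTokens {

    private final Map<String, Integer> indexByName = new HashMap<>();
    private final List<String> tokens;

    public NamedTokens(List<String> names, List<String> tokens) {
        for (int i = 0; i < names.size(); i++) {
            indexByName.put(names.get(i), i);
        }
        this.tokens = tokens;
    }

    // Resolve the column name to its position, then read positionally.
    public String readString(String name) {
        return tokens.get(indexByName.get(name));
    }

    public int readInt(String name) {
        return Integer.parseInt(readString(name));
    }

    public static void main(String[] args) {
        NamedTokens fs = new NamedTokens(
                List.of("ID", "lastName", "firstName", "position", "birthYear", "debutYear"),
                List.of("AbduKa00", "Abdul-Jabbar", "Karim", "rb", "1974", "1996"));
        System.out.println(fs.readString("lastName"));
        System.out.println(fs.readInt("birthYear"));
    }
}
```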

Automapping FieldSets to Domain Objects

For many, having to write a specific FieldSetMapper is equally as cumbersome as writing a specific RowMapper for a JdbcTemplate. Spring Batch makes this easier by providing a FieldSetMapper that automatically maps fields by matching a field name with a setter on the object, using the JavaBean specification.

  • Java

  • XML

Again using the football example, the BeanWrapperFieldSetMapper configuration looks like the following snippet in Java:

Java Configuration
@Bean
public FieldSetMapper fieldSetMapper() {
	BeanWrapperFieldSetMapper fieldSetMapper = new BeanWrapperFieldSetMapper();

	fieldSetMapper.setPrototypeBeanName("player");

	return fieldSetMapper;
}

@Bean
@Scope("prototype")
public Player player() {
	return new Player();
}

Again using the football example, the BeanWrapperFieldSetMapper configuration looks like the following snippet in XML:

XML Configuration
<bean id="fieldSetMapper"
      class="org.springframework.batch.infrastructure.item.file.mapping.BeanWrapperFieldSetMapper">
    <property name="prototypeBeanName" value="player" />
</bean>

<bean id="player"
      class="org.springframework.batch.samples.domain.Player"
      scope="prototype" />

For each entry in the FieldSet, the mapper looks for a corresponding setter on a new instance of the Player object (for this reason, prototype scope is required) in the same way that the Spring container looks for setters matching a property name. Each available field in the FieldSet is mapped, and the resulting Player object is returned, with no code required.

Fixed Length File Formats

So far, only delimited files have been discussed in much detail. However, they represent only half of the file reading picture. Many organizations that use flat files use fixed length formats. An example fixed length file follows:

UK21341EAH4121131.11customer1
UK21341EAH4221232.11customer2
UK21341EAH4321333.11customer3
UK21341EAH4421434.11customer4
UK21341EAH4521535.11customer5

While this looks like one large field, it actually represents 4 distinct fields:

  1. ISIN: Unique identifier for the item being ordered - 12 characters long.

  2. Quantity: Number of the item being ordered - 3 characters long.

  3. Price: Price of the item - 5 characters long.

  4. Customer: ID of the customer ordering the item - 9 characters long.

When configuring the FixedLengthTokenizer, each of these lengths must be provided in the form of ranges.
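A 1-based, inclusive range maps directly onto substring extraction. The following plain-Java sketch is a hypothetical illustration (not the actual FixedLengthTokenizer) of how ranges such as 1-12 and 13-15 carve up a record from the example file above:

```java
// Hypothetical sketch of fixed-width extraction using 1-based, inclusive
// ranges in the style of Spring Batch's Range(1, 12).
public class FixedWidthSketch {

    // Convert the 1-based inclusive range to String.substring indices.
    public static String extract(String line, int startInclusive, int endInclusive) {
        return line.substring(startInclusive - 1, endInclusive);
    }

    public static void main(String[] args) {
        String line = "UK21341EAH4121131.11customer1";
        System.out.println(extract(line, 1, 12));   // ISIN
        System.out.println(extract(line, 13, 15));  // Quantity
        System.out.println(extract(line, 16, 20));  // Price
        System.out.println(extract(line, 21, 29));  // Customer
    }
}
```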

  • Java

  • XML

The following example shows how to define ranges for the FixedLengthTokenizer in Java:

Java Configuration
@Bean
public FixedLengthTokenizer fixedLengthTokenizer() {
	FixedLengthTokenizer tokenizer = new FixedLengthTokenizer();

	tokenizer.setNames("ISIN", "Quantity", "Price", "Customer");
	tokenizer.setColumns(new Range(1, 12),
						new Range(13, 15),
						new Range(16, 20),
						new Range(21, 29));

	return tokenizer;
}

The following example shows how to define ranges for the FixedLengthTokenizer in XML:

XML Configuration
<bean id="fixedLengthLineTokenizer"
      class="org.springframework.batch.infrastructure.item.file.transform.FixedLengthTokenizer">
    <property name="names" value="ISIN,Quantity,Price,Customer" />
    <property name="columns" value="1-12, 13-15, 16-20, 21-29" />
</bean>

Because the FixedLengthTokenizer uses the same LineTokenizer interface as discussed earlier, it returns the same FieldSet as if a delimiter had been used. This allows the same approaches to be used in handling its output, such as using the BeanWrapperFieldSetMapper.

Supporting the preceding range syntax requires that a specialized property editor, RangeArrayPropertyEditor, be configured in the ApplicationContext. However, this bean is declared automatically in an ApplicationContext where the batch namespace is used.


Multiple Record Types within a Single File

All of the file reading examples shown so far have made a key assumption for simplicity's sake: all of the records in a file have the same format. However, this is not always the case. It is very common for a file to contain records with different formats that need to be tokenized differently and mapped to different objects. The following excerpt from a file illustrates this:

USER;Smith;Peter;;T;20014539;F
LINEA;1044391041ABC037.49G201XX1383.12H
LINEB;2134776319DEF422.99M005LI

In this file, there are three types of records: USER, LINEA, and LINEB. A USER line corresponds to a User object. LINEA and LINEB both correspond to Line objects, though a LINEA contains more information than a LINEB.

The ItemReader still reads each line individually, but different LineTokenizer and FieldSetMapper objects must be specified so that the ItemWriter receives the correct items. The PatternMatchingCompositeLineMapper makes this easy by allowing maps of patterns to LineTokenizer instances and patterns to FieldSetMapper instances to be configured.

  • Java

  • XML

The following example shows how to configure the PatternMatchingCompositeLineMapper in Java:

Java Configuration
@Bean
public PatternMatchingCompositeLineMapper orderFileLineMapper() {
	PatternMatchingCompositeLineMapper lineMapper =
		new PatternMatchingCompositeLineMapper();

	Map<String, LineTokenizer> tokenizers = new HashMap<>(3);
	tokenizers.put("USER*", userTokenizer());
	tokenizers.put("LINEA*", lineATokenizer());
	tokenizers.put("LINEB*", lineBTokenizer());

	lineMapper.setTokenizers(tokenizers);

	Map<String, FieldSetMapper> mappers = new HashMap<>(2);
	mappers.put("USER*", userFieldSetMapper());
	mappers.put("LINE*", lineFieldSetMapper());

	lineMapper.setFieldSetMappers(mappers);

	return lineMapper;
}

The following example shows how to configure the PatternMatchingCompositeLineMapper in XML:

XML Configuration
<bean id="orderFileLineMapper"
      class="org.spr...PatternMatchingCompositeLineMapper">
    <property name="tokenizers">
        <map>
            <entry key="USER*" value-ref="userTokenizer" />
            <entry key="LINEA*" value-ref="lineATokenizer" />
            <entry key="LINEB*" value-ref="lineBTokenizer" />
        </map>
    </property>
    <property name="fieldSetMappers">
        <map>
            <entry key="USER*" value-ref="userFieldSetMapper" />
            <entry key="LINE*" value-ref="lineFieldSetMapper" />
        </map>
    </property>
</bean>

In this example, "LINEA" and "LINEB" have separate LineTokenizer instances, but they both use the same FieldSetMapper.

The PatternMatchingCompositeLineMapper uses the PatternMatcher#match method in order to select the correct delegate for each line. The PatternMatcher allows for two wildcard characters with special meaning: the question mark ("?") matches exactly one character, while the asterisk ("*") matches zero or more characters. Note that, in the preceding configuration, all patterns end with an asterisk, making them effectively prefixes to lines. The PatternMatcher always matches the most specific pattern possible, regardless of the order in the configuration. So if "LINE*" and "LINEA*" were both listed as patterns, "LINEA" would match pattern "LINEA*", while "LINEB" would match pattern "LINE*". Additionally, a single asterisk ("*") can serve as a default by matching any line not matched by any other pattern.
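The wildcard semantics described above can be sketched by translating a pattern into an equivalent regular expression. This is a hypothetical illustration only, not the actual PatternMatcher implementation (which additionally selects the most specific matching pattern):

```java
import java.util.regex.Pattern;

// Hypothetical sketch of the "?" (exactly one character) and "*"
// (zero or more characters) wildcard semantics.
public class WildcardSketch {

    // Translate the wildcard pattern into an equivalent regular expression
    // and require it to match the entire line.
    public static boolean matches(String pattern, String line) {
        StringBuilder regex = new StringBuilder();
        for (char c : pattern.toCharArray()) {
            if (c == '*') {
                regex.append(".*");
            } else if (c == '?') {
                regex.append('.');
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return line.matches(regex.toString());
    }

    public static void main(String[] args) {
        System.out.println(matches("LINEA*", "LINEA;1044391041ABC037.49"));
        System.out.println(matches("LINE*", "LINEB;2134776319DEF422.99"));
        System.out.println(matches("USER*", "LINEA;1044391041ABC037.49"));
    }
}
```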

  • Java

  • XML

The following example shows how to match a line not matched by any other pattern in Java:

Java Configuration
...
tokenizers.put("*", defaultLineTokenizer());
...

The following example shows how to match a line not matched by any other pattern in XML:

XML Configuration
<entry key="*" value-ref="defaultLineTokenizer" />

There is also a PatternMatchingCompositeLineTokenizer that can be used for tokenization alone.

It is also common for a flat file to contain records that each span multiple lines. To handle this situation, a more complex strategy is required. A demonstration of this common pattern can be found in the multiLineRecords sample.

Exception Handling in Flat Files

There are many scenarios when tokenizing a line may cause exceptions to be thrown. Many flat files are imperfect and contain incorrectly formatted records. Many users choose to skip these erroneous lines while logging the issue, the original line, and the line number. These logs can later be inspected manually or by another batch job. For this reason, Spring Batch provides a hierarchy of exceptions for handling parse exceptions: FlatFileParseException and FlatFileFormatException. FlatFileParseException is thrown by the FlatFileItemReader when any errors are encountered while trying to read a file. FlatFileFormatException is thrown by implementations of the LineTokenizer interface and indicates a more specific error encountered while tokenizing.

IncorrectTokenCountException

Both DelimitedLineTokenizer and FixedLengthTokenizer have the ability to specify column names that can be used for creating a FieldSet. However, if the number of column names does not match the number of columns found while tokenizing a line, the FieldSet cannot be created, and an IncorrectTokenCountException is thrown, which contains the number of tokens encountered and the number expected, as shown in the following example:

tokenizer.setNames(new String[] {"A", "B", "C", "D"});

try {
    tokenizer.tokenize("a,b,c");
}
catch (IncorrectTokenCountException e) {
    assertEquals(4, e.getExpectedCount());
    assertEquals(3, e.getActualCount());
}

Because the tokenizer was configured with 4 column names but only 3 tokens were found in the file, an IncorrectTokenCountException was thrown.

IncorrectLineLengthException

Files formatted in a fixed-length format have additional requirements when parsing because, unlike a delimited format, each column must strictly adhere to its predefined width. If the total length of the line does not match the end of the furthest-reaching range, an exception is thrown, as shown in the following example:

tokenizer.setColumns(new Range[] { new Range(1, 5),
                                   new Range(6, 10),
                                   new Range(11, 15) });
try {
    tokenizer.tokenize("12345");
    fail("Expected IncorrectLineLengthException");
}
catch (IncorrectLineLengthException ex) {
    assertEquals(15, ex.getExpectedLength());
    assertEquals(5, ex.getActualLength());
}

The configured ranges for the tokenizer above are: 1-5, 6-10, and 11-15. Consequently, the total length of the line is 15. However, in the preceding example, a line of length 5 was passed in, causing an IncorrectLineLengthException to be thrown. Throwing an exception here rather than only mapping the first column allows the processing of the line to fail earlier and with more information than it would contain if it failed while trying to read in column 2 in a FieldSetMapper. However, there are scenarios where the length of the line is not always constant. For this reason, validation of line length can be turned off via the 'strict' property, as shown in the following example:

tokenizer.setColumns(new Range[] { new Range(1, 5), new Range(6, 10) });
tokenizer.setStrict(false);
FieldSet tokens = tokenizer.tokenize("12345");
assertEquals("12345", tokens.readString(0));
assertEquals("", tokens.readString(1));

The preceding example is almost identical to the one before it, except that tokenizer.setStrict(false) was called. This setting tells the tokenizer to not enforce line lengths when tokenizing the line. A FieldSet is now correctly created and returned. However, it contains only empty tokens for the remaining values.
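The clamping behavior implied by non-strict tokenization can be sketched in plain Java. This is a hypothetical illustration of the idea, not the actual FixedLengthTokenizer:

```java
// Hypothetical sketch of non-strict fixed-width extraction: a range that
// extends past the end of a short line yields an empty (or truncated)
// token instead of an IncorrectLineLengthException.
public class LenientFixedWidthSketch {

    public static String extract(String line, int startInclusive, int endInclusive) {
        // Clamp both ends of the 1-based inclusive range to the line length.
        int from = Math.min(startInclusive - 1, line.length());
        int to = Math.min(endInclusive, line.length());
        return from >= to ? "" : line.substring(from, to);
    }

    public static void main(String[] args) {
        System.out.println(extract("12345", 1, 5));    // full first column
        System.out.println(extract("12345", 6, 10));   // empty second column
        System.out.println(extract("1234567", 6, 10)); // truncated second column
    }
}
```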