
safe-docx · Parsing Heuristics

Null label for plain paragraph text

When interpreting paragraph text as possible legal-document list structure, a parser has to separate explicit list labels from ordinary prose. A list label is a leading marker such as a section heading, article heading, numbered heading, parenthesized number, Roman numeral, or letter. Because a label can drive later paragraph classification, prose without such a marker should not receive one.

extractListLabel applies that parsing heuristic to plain text and returns null label fields when no supported leading pattern matches. The result keeps the absence of a list label explicit, so downstream operations do not target the wrong text.[1]
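To illustrate why an explicit null matters downstream, here is a minimal sketch of a consumer that strips a label only when one was detected. The field names label and label_type come from the scenario below; the SketchLabelResult type, the LabelType union, and the stripLabel helper are hypothetical and are not part of safe-docx.

```typescript
// Hypothetical result shape. Only the field names label and label_type
// come from the scenario; the union members are assumptions.
type LabelType = 'section' | 'article' | 'numbered' | 'number' | 'roman' | 'letter';

interface SketchLabelResult {
  label: string | null;
  label_type: LabelType | null;
}

// Hypothetical downstream helper: remove the label only when one exists.
function stripLabel(text: string, result: SketchLabelResult): string {
  if (result.label === null) {
    // No label was detected, so the paragraph text is left untouched.
    return text;
  }
  return text.startsWith(result.label)
    ? text.slice(result.label.length).trimStart()
    : text;
}

const plain = stripLabel('This is just a normal paragraph.', {
  label: null,
  label_type: null,
});
const labelled = stripLabel('(a) the Borrower shall', {
  label: '(a)',
  label_type: 'letter',
});
```

With a null label the helper returns the text unchanged; with a detected label it removes only that leading marker.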

The test scenario below covers the baseline unmatched case for extractListLabel: plain paragraph text that contains no list-label pattern returns null label fields.

The scenario

Given plain paragraph text with no list-label pattern,
When extractListLabel is called,
Then

  • result.label is null.
  • result.label_type is null.

The Test Fixture

The fixture calls extractListLabel with plain paragraph text and asserts that the label fields remain null. Those assertions are the observable behavior for this scenario.[2]

Below is the test fixture code.

test.openspec('null label for plain text without list patterns')('Scenario: null label for plain text without list patterns', async ({ when, then, attachPrettyJson }: AllureBddContext) => {
  const text = 'This is just a normal paragraph with no list label.';

  let result!: ReturnType<typeof extractListLabel>;
  await when('extractListLabel is called', async () => {
    result = extractListLabel(text);
    await attachPrettyJson('Result', result);
  });

  await then('label and label_type SHALL be null', () => {
    expect(result.label).toBeNull();
    expect(result.label_type).toBeNull();
  });
});

The Expected Result Shape

The scenario asserts the two label fields on the returned ListLabelResult. It does not assert match_end.

Below are the assertions that define the expected result of extractListLabel for this scenario.

expect(result.label).toBeNull();
expect(result.label_type).toBeNull();

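Below is a description of the expected fields:

  • label is null because the text does not begin with a supported list-label pattern.
  • label_type is null for the same reason: with no label, there is no label type to report.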

A Non-Obvious Detail

The implementation checks several leading-pattern families before returning the null label fields. Because each pattern is anchored at the start of the text, ordinary prose that only contains similar words later in the paragraph is not classified as a list label.
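The anchoring behavior described above can be sketched as follows. This is an illustration under stated assumptions, not the safe-docx implementation: the pattern list, the type names, and the function sketchExtractListLabel are all hypothetical, and the real pattern families may differ.

```typescript
// Hypothetical result shape mirroring the two asserted fields.
interface SketchResult {
  label: string | null;
  label_type: string | null;
}

// Illustrative pattern families. Each regex starts with ^ so it can
// only match at the beginning of the text.
const SKETCH_PATTERNS: ReadonlyArray<{ type: string; regex: RegExp }> = [
  { type: 'section', regex: /^Section\s+\d+(?:\.\d+)*/i },
  { type: 'article', regex: /^Article\s+[IVXLCDM\d]+/i },
  { type: 'numbered', regex: /^\d+(?:\.\d+)*\./ },
  { type: 'parenthesized', regex: /^\((?:\d+|[ivxlcdm]+|[a-z])\)/i },
];

function sketchExtractListLabel(text: string): SketchResult {
  // Try each family in order and return the first leading match.
  for (const { type, regex } of SKETCH_PATTERNS) {
    const match = regex.exec(text);
    if (match) {
      return { label: match[0], label_type: type };
    }
  }
  // No family matched at the start of the text: explicit null fields.
  return { label: null, label_type: null };
}

// "Section 2.1" and "(a)" appear mid-sentence, so no anchored pattern matches.
const midSentence = sketchExtractListLabel(
  'The parties refer to Section 2.1 and clause (a) below.',
);
// The same words at the start of the text do match.
const leading = sketchExtractListLabel('Section 2.1 Definitions');
```

In this sketch, midSentence has null for both fields, while leading carries the label 'Section 2.1' with type 'section'. Dropping the ^ anchor from any pattern would cause the first call to misclassify ordinary prose.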