
safe-docx · Parsing Heuristics

List label removal with leading whitespace

When preparing legal-document list text for comparison, a leading list label (i.e., text-formatted numbering at the start of a paragraph) must be removed without changing the item wording that follows. The removal also needs to discard leading whitespace after the list label, because that whitespace belongs to the label boundary rather than to the item wording.

The stripListLabel primitive (i.e., a small text operation used by document-processing code) applies the list-label parser and returns the item wording after the detected list label. When a list label is found, the primitive slices the input after the parsed label boundary and trims only the leading whitespace from the remaining item wording.[1]
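The slice-then-trim behavior described above can be illustrated with a minimal sketch. This is not the safe-docx implementation. The names parseListLabelSketch and stripListLabelSketch, the end field, and the single regular expression are hypothetical stand-ins for the real list-label parser. Returning the input unchanged when no label is detected is also an assumption.

```typescript
// Hypothetical shapes for illustration; the real safe-docx types may differ.
interface ParsedListLabel {
  label: string | null; // e.g. "(a)"; null when no label is detected
  end: number;          // assumed field: index just past the label in the input
}

interface StripResult {
  stripped_text: string;
  result: ParsedListLabel;
}

// Illustrative parser: recognizes only a parenthesized label such as "(a)" or "(12)".
function parseListLabelSketch(text: string): ParsedListLabel {
  const match = /^\((?:[a-z]+|\d+)\)/i.exec(text);
  return match
    ? { label: match[0], end: match[0].length }
    : { label: null, end: 0 };
}

function stripListLabelSketch(text: string): StripResult {
  const result = parseListLabelSketch(text);
  if (result.label === null) {
    // Assumption: with no label detected, the text is returned unchanged.
    return { stripped_text: text, result };
  }
  // Slice after the parsed label boundary, then trim leading whitespace only.
  return { stripped_text: text.slice(result.end).trimStart(), result };
}

const example = stripListLabelSketch('(a) First item of the agreement');
console.log(example.stripped_text); // "First item of the agreement"
console.log(example.result.label);  // "(a)"
```

Because only leading whitespace is trimmed, any trailing whitespace in the item wording passes through untouched in this sketch.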

Scenario

The scenario records the behavior for a lettered list label at the start of list text. The asserted outcome is that the returned item wording no longer includes the list label or the following whitespace, while the parsed label remains available in the nested result.

The scenario below covers the baseline successful case of stripListLabel: a lettered list label and the whitespace that follows it are removed from the item wording.

Given the list text '(a) First item of the agreement',
When stripListLabel is called,
Then stripped_text SHALL have label and leading whitespace removed.

  • result.label SHALL be (a).

Test Fixture

The test fixture calls stripListLabel with list text that starts with a parenthesized letter label, then checks the returned item wording and parsed label.[2]

Below is the test fixture code.

test.openspec('stripListLabel removes label and leading whitespace')(
  'Scenario: stripListLabel removes label and leading whitespace',
  async ({ when, then, attachPrettyJson }: AllureBddContext) => {
    // Given: list text that starts with a parenthesized letter label.
    const text = '(a) First item of the agreement';

    // Assigned inside the `when` step, hence the definite-assignment assertion.
    let result!: ReturnType<typeof stripListLabel>;
    await when('stripListLabel is called', async () => {
      result = stripListLabel(text);
      await attachPrettyJson('Result', result);
    });

    await then('stripped_text SHALL have label and leading whitespace removed', () => {
      // The item wording is returned without the label or the whitespace after it.
      expect(result.stripped_text).toBe('First item of the agreement');
      // The parsed label stays available on the nested result.
      expect(result.result.label).toBe('(a)');
    });
  },
);

Expected Result Shape

The expected result shape mirrors the fields asserted by the scenario. The nested result object is shown only for the asserted label field, because the scenario does not assert the other parsed-label fields.

Below is the result that stripListLabel is expected to return for this scenario.

{
  stripped_text: 'First item of the agreement',
  result: {
    label: '(a)',
  },
}
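Because the shape above lists only the asserted fields, a comparison against it should be a partial match, not strict equality. The sketch below shows one way to express that. The matchesSubset helper and the extra kind field are hypothetical and not part of safe-docx. Test frameworks with a Jest-style expect offer toMatchObject for the same purpose.

```typescript
// Returns true when every field present in `expected` has an equal value in
// `actual`; extra fields on `actual` are ignored.
function matchesSubset(actual: unknown, expected: unknown): boolean {
  if (expected === null || typeof expected !== 'object') {
    return Object.is(actual, expected);
  }
  if (actual === null || typeof actual !== 'object') {
    return false;
  }
  const a = actual as Record<string, unknown>;
  const e = expected as Record<string, unknown>;
  return Object.keys(e).every((key) => matchesSubset(a[key], e[key]));
}

// Only the fields asserted by the scenario.
const expectedShape = {
  stripped_text: 'First item of the agreement',
  result: { label: '(a)' },
};

// Hypothetical actual value carrying an extra, unasserted parsed-label field.
const actualValue = {
  stripped_text: 'First item of the agreement',
  result: { label: '(a)', kind: 'letter' },
};

console.log(matchesSubset(actualValue, expectedShape)); // true
```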

Below is a description of the expected fields: