safe-docx · Parsing Heuristics

Multi-character Roman numeral list labels

When classifying text-formatted list labels (typed prefix markers that appear in paragraph text rather than in OOXML numbering metadata) in legal documents, multi-character Roman numerals need a distinct list-label type, assigned before the item body is handled. The classification matters because a prefix of the same parenthesized shape can feed style inference and label stripping differently when it is a Roman numeral than when it is an ordinary lettered label.

The extractListLabel primitive, a small parsing operation, inspects the beginning of a paragraph string and returns a ListLabelResult with label, label_type, and match_end fields.[1] Roman numeral detection runs before letter detection, and a Roman numeral candidate must contain more than one character, so single-letter legal labels such as (i) or (v) remain letter labels.
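The detection order described above can be sketched in TypeScript. This is a minimal illustration under stated assumptions, not the safe-docx implementation: the function name extractListLabelSketch, both regular expressions, the field types, the string values of LabelType, and the choice to let match_end include trailing whitespace are all hypothetical. Only the three field names, the Roman-before-letter ordering, and the more-than-one-character rule come from the text.

```typescript
// Stand-in for the library's LabelType; the member values are assumptions.
const LabelType = { ROMAN: 'roman', LETTER: 'letter' } as const;
type LabelType = (typeof LabelType)[keyof typeof LabelType];

// Field names follow the article; the field types are assumptions.
interface ListLabelResult {
  label: string | null;
  label_type: LabelType | null;
  match_end: number;
}

// Hypothetical patterns: a parenthesized run of lowercase Roman-numeral
// characters, and a parenthesized single lowercase letter.
const ROMAN_PATTERN = /^\(([ivxlcdm]+)\)\s*/;
const LETTER_PATTERN = /^\(([a-z])\)\s*/;

function extractListLabelSketch(text: string): ListLabelResult {
  // Roman detection runs first, but only multi-character candidates qualify.
  const roman = ROMAN_PATTERN.exec(text);
  if (roman !== null && (roman[1] ?? '').length > 1) {
    return {
      label: `(${roman[1]})`,
      label_type: LabelType.ROMAN,
      match_end: roman[0].length,
    };
  }

  // Single-character forms such as "(i)" fall through to letter detection.
  const letter = LETTER_PATTERN.exec(text);
  if (letter !== null) {
    return {
      label: `(${letter[1]})`,
      label_type: LabelType.LETTER,
      match_end: letter[0].length,
    };
  }

  // No recognized label at the start of the paragraph.
  return { label: null, label_type: null, match_end: 0 };
}
```

In this sketch, extractListLabelSketch('(ii) Second item') yields a ROMAN label, while extractListLabelSketch('(i) First item') yields a LETTER label because the one-character candidate is rejected before letter detection runs.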

The test scenario below covers the baseline success case for extractListLabel: a multi-character Roman numeral list label is classified as a Roman label.

The scenario

Given paragraph text begins with (ii) Second item,
When extractListLabel is called,
Then result.label_type is LabelType.ROMAN.

Test fixture

The fixture calls extractListLabel with paragraph text that begins with a parenthesized Roman numeral and then checks the returned label type.[2]

Below is the test fixture code.

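test.openspec('extract multi-char roman numeral labels')(
  'Scenario: extract multi-char roman numeral labels',
  async ({ when, then, attachPrettyJson }: AllureBddContext) => {
    // Given: paragraph text that begins with a parenthesized multi-character Roman numeral.
    const text = '(ii) Second item';

    // Definite assignment (!): result is set in the `when` step before the `then` step reads it.
    let result!: ReturnType<typeof extractListLabel>;
    await when('extractListLabel is called', async () => {
      result = extractListLabel(text);
      // Attach the full result to the test report for inspection.
      await attachPrettyJson('Result', result);
    });

    // Then: only the label type is asserted.
    await then('the result SHALL have label_type ROMAN', () => {
      expect(result.label_type).toBe(LabelType.ROMAN);
    });
  },
);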

The expected result shape

The scenario asserts one field from the returned ListLabelResult, so the expected result is the literal field-level assertion rather than a fabricated full object.

Below is the assertion that the result returned by extractListLabel is expected to satisfy for this scenario.

expect(result.label_type).toBe(LabelType.ROMAN);

The label_type field is expected to be LabelType.ROMAN, because the label prefix is a multi-character Roman numeral candidate.

A non-obvious detail

Roman numeral matching is intentionally narrower than the regular expression alone would suggest. The implementation first matches parenthesized Roman-numeral characters, then applies a candidate check that rejects single-character forms, and only then falls through to letter-label handling.
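To make the gap between the pattern and the final decision concrete, the sketch below separates the two steps. It is a hedged illustration only: ROMAN_CHARS and isRomanLabelCandidate are hypothetical names, and the pattern is an assumed stand-in for whatever expression safe-docx actually uses.

```typescript
// Hypothetical pattern: accepts any parenthesized run of Roman-numeral characters,
// including single characters such as "(i)" or "(v)".
const ROMAN_CHARS = /^\(([ivxlcdm]+)\)/;

// Hypothetical candidate check: the pattern must match AND the numeral must be
// longer than one character; shorter forms are left for letter-label handling.
function isRomanLabelCandidate(text: string): boolean {
  const match = ROMAN_CHARS.exec(text);
  if (match === null) {
    return false;
  }
  const numeral = match[1] ?? '';
  return numeral.length > 1;
}

// Compare what the pattern accepts with what the candidate check accepts.
const samples = ['(i) First', '(ii) Second', '(v) Fifth', '(iv) Fourth', '(a) Alpha'];
for (const sample of samples) {
  console.log(sample, ROMAN_CHARS.test(sample), isRomanLabelCandidate(sample));
}
```

Under these assumptions, the pattern alone accepts (i), (ii), (v), and (iv), while the candidate check keeps only (ii) and (iv); (a) is rejected by both. Keeping the length rule outside the expression leaves the pattern simple and makes the single-letter exception explicit in code.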