Exact match for a literal substring

When replacing text in a paragraph, the text to be replaced (the "substring") must first be located. Locating the substring can be challenging because paragraph text in a Word file (.docx) is often fragmented across multiple run elements (<w:r>). Additionally, when locating the substring, it is important to know whether more than one match exists, so that downstream operations do not target the wrong text.
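One way to picture the fragmentation problem is to join the text of each run into a single paragraph string while recording each run's offsets, so that a match found in the joined string can be mapped back to the runs it touches. The sketch below is an illustration only: RunText, RunSpan, joinRuns, and runsCovering are hypothetical names, not part of safe-docx, and the run split shown is invented for the example.

```typescript
// Hypothetical shape: one entry per <w:r> run, holding only its visible text.
interface RunText {
  runIndex: number;
  text: string;
}

// Where a run landed inside the joined paragraph string.
interface RunSpan {
  runIndex: number;
  start: number; // offset of the run's first character in the joined text
  end: number;   // exclusive offset, one past the run's last character
}

// Join run texts into one string and remember the span of each run.
function joinRuns(runs: RunText[]): { text: string; spans: RunSpan[] } {
  let text = '';
  const spans: RunSpan[] = [];
  for (const run of runs) {
    const start = text.length;
    text += run.text;
    spans.push({ runIndex: run.runIndex, start, end: text.length });
  }
  return { text, spans };
}

// Return the indexes of runs that overlap the half-open range [start, end).
function runsCovering(spans: RunSpan[], start: number, end: number): number[] {
  return spans
    .filter((s) => s.start < end && s.end > start)
    .map((s) => s.runIndex);
}

// An invented split of one sentence across three runs.
const runs: RunText[] = [
  { runIndex: 0, text: 'The Purch' },
  { runIndex: 1, text: 'ase Pri' },
  { runIndex: 2, text: 'ce shall be paid at Closing.' },
];
const joined = joinRuns(runs);
// 'Purchase Price' occupies [4, 18) in the joined text and spans all three runs.
const touched = runsCovering(joined.spans, 4, 18);
```

Searching the joined string rather than individual runs is what lets a needle that straddles run boundaries be found at all. The span table is then needed only when the edit is applied.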

In the safe-docx repo, a primitive function named findUniqueSubstringMatch[1] accepts a paragraph's plain text and a search string (the "needle") and returns a result describing where the needle was found. The result has one of three statuses: not_found, multiple, or unique. When the status is unique, the caller can use the result's start, end, and matchedText fields to unambiguously locate the substring to be modified or inserted.
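To make the three outcomes concrete, here is a minimal sketch of an exact-only matcher. It is not the safe-docx implementation: the type names, the function name findUniqueExactMatch, the fields carried by the multiple outcome, the treatment of an empty needle, and the choice to count overlapping occurrences are all assumptions made for illustration. Only the status values, the mode names, and the fields of the unique result come from this page.

```typescript
// Mode names as listed on this page; the union type itself is an assumption.
type MatchMode =
  | 'exact'
  | 'quote_normalized'
  | 'flexible_whitespace'
  | 'quote_optional'
  | 'clean';

// Hypothetical result type: one variant per status.
type SubstringMatchResult =
  | { status: 'not_found' }
  | { status: 'multiple'; mode: MatchMode; matchCount: number }
  | {
      status: 'unique';
      mode: MatchMode;
      matchCount: number;
      start: number;
      end: number;
      matchedText: string;
    };

// Exact-only sketch: counts literal occurrences and reports uniqueness.
function findUniqueExactMatch(haystack: string, needle: string): SubstringMatchResult {
  // Assumption: an empty needle is treated as not found.
  if (needle.length === 0) return { status: 'not_found' };

  let count = 0;
  let first = -1;
  let from = 0;
  while (true) {
    const idx = haystack.indexOf(needle, from);
    if (idx === -1) break;
    if (count === 0) first = idx;
    count += 1;
    from = idx + 1; // advance by one so overlapping occurrences are counted too
  }

  if (count === 0) return { status: 'not_found' };
  if (count > 1) return { status: 'multiple', mode: 'exact', matchCount: count };

  const end = first + needle.length; // exclusive end offset
  return {
    status: 'unique',
    mode: 'exact',
    matchCount: 1,
    start: first,
    end,
    matchedText: haystack.slice(first, end),
  };
}
```

Counting every occurrence before deciding on a status is what separates this from a plain indexOf call, which would report the first match and hide the ambiguity.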

The test scenario below covers the baseline success case of findUniqueSubstringMatch[2]: the needle appears in the paragraph text exactly once and verbatim.

The scenario

Given paragraph text containing the needle exactly once as a verbatim substring,
When findUniqueSubstringMatch is called with the paragraph text and the needle,
Then the result has only one match, no normalization was applied to find the match, and the matched text is preserved verbatim.

Below is the test fixture code.

const haystack = 'The Purchase Price shall be paid at Closing.';
const needle = 'Purchase Price';

const result = findUniqueSubstringMatch(haystack, needle);

The expected result shape

Below is the result that findUniqueSubstringMatch is expected to return for this scenario.

{
  status: 'unique',
  mode: 'exact',
  matchCount: 1,
  start: 4,
  end: 18,
  matchedText: 'Purchase Price'
}
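The offsets in this result can be checked against the fixture string directly. The short sketch below assumes that start and end are offsets into the paragraph text with an exclusive end, which is what the numbers in this fixture imply. It does not call safe-docx, and the variable names are illustrative.

```typescript
// Fixture string and the offset fields copied from the expected result.
const haystack = 'The Purchase Price shall be paid at Closing.';
const expected = { start: 4, end: 18, matchedText: 'Purchase Price' };

// Slicing [start, end) out of the paragraph text should reproduce matchedText.
const sliced = haystack.slice(expected.start, expected.end);
const sliceMatches = sliced === expected.matchedText;

// With an exclusive end, end - start equals the length of the matched text.
const lengthMatches = expected.end - expected.start === expected.matchedText.length;
```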

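The expected fields are as follows. status is 'unique' because the needle occurs exactly once in the paragraph text. mode is 'exact' because no normalization was needed to find the match. matchCount is 1. start and end are offsets into the paragraph text; in this fixture end is exclusive, since the 14-character needle begins at offset 4 and 4 + 14 = 18. matchedText is the matched substring exactly as it appears in the paragraph.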

What this scenario does not cover

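This scenario is deliberately limited to a single literal occurrence of the needle inside a single paragraph string. It does not exercise the other outcomes the same call could produce: a multiple match, a normalized-only match, or no match at all.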

The assertions only test the result type, the mode, and the matched text for this fixture.
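The three checks described above could be written roughly as follows. This is a sketch and not the repo's test code: MatchResultLike and checkExactUnique are hypothetical names, and the function runs against a literal copy of the expected result instead of calling findUniqueSubstringMatch.

```typescript
// Hypothetical result shape, reduced to the fields the checks read.
interface MatchResultLike {
  status: string;
  mode?: string;
  matchedText?: string;
}

// Collect failures instead of throwing, so every mismatch is reported at once.
function checkExactUnique(result: MatchResultLike, needle: string): string[] {
  const failures: string[] = [];
  if (result.status !== 'unique') failures.push(`status was ${result.status}`);
  if (result.mode !== 'exact') failures.push(`mode was ${String(result.mode)}`);
  if (result.matchedText !== needle) {
    failures.push(`matchedText was ${String(result.matchedText)}`);
  }
  return failures;
}

// Literal copy of the expected result for this fixture.
const expectedResult = {
  status: 'unique',
  mode: 'exact',
  matchCount: 1,
  start: 4,
  end: 18,
  matchedText: 'Purchase Price',
};
// An empty list means all three checks passed.
const failures = checkExactUnique(expectedResult, 'Purchase Price');
```

Checking mode alongside status is what would catch a match that is unique but was found only after normalization.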

A non-obvious detail

A successful match is not merely "the result is non-empty." The expected status and mode together distinguish an exact, unique literal match from the other outcomes the same call could produce: a multiple match, a normalized-only match (one of quote_normalized, flexible_whitespace, quote_optional, or clean), or no match at all. The distinction matters because the status determines what downstream operations can safely do.

Consider an instruction to bold the word Closing in a paragraph. If the paragraph contains the word Closing four times, an operation that bolds the first match silently does the wrong thing. The multiple status surfaces this case, so that the caller can return an error, ask the user to disambiguate, or apply the operation to all matches. Returning the match count as a first-class signal, rather than letting callers assume a single match, comes from running this primitive against real contracts where the same defined term often appears many times.
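A caller can make this decision explicit by branching on the status before touching the document. The sketch below is one possible shape for that branch, assuming a simplified result type. MatchOutcome, BoldPlan, and planBold are hypothetical names, and rejecting ambiguous matches is only one of the options mentioned above.

```typescript
// Simplified, hypothetical view of the match result, keyed by status.
type MatchOutcome =
  | { status: 'not_found' }
  | { status: 'multiple'; matchCount: number }
  | { status: 'unique'; start: number; end: number; matchedText: string };

// What the caller decides to do with a bold instruction.
type BoldPlan =
  | { action: 'apply'; start: number; end: number }
  | { action: 'reject'; reason: string };

// Only a unique match is applied; every other status is surfaced as an error.
function planBold(outcome: MatchOutcome): BoldPlan {
  switch (outcome.status) {
    case 'unique':
      return { action: 'apply', start: outcome.start, end: outcome.end };
    case 'multiple':
      return {
        action: 'reject',
        reason: `needle matched ${outcome.matchCount} times; disambiguate first`,
      };
    case 'not_found':
      return { action: 'reject', reason: 'needle not found' };
  }
}

// The "Closing appears four times" case from the paragraph above.
const ambiguousPlan = planBold({ status: 'multiple', matchCount: 4 });
```

Because the switch covers every status, adding a new status to the union would produce a compile-time error here, so the new case cannot be silently ignored.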