When replacing text in a paragraph, the text to be replaced (the "substring") must first be located. Locating the substring can be challenging because paragraph text in a Word file (.docx) is often fragmented across multiple run elements (<w:r>). Additionally, when locating the substring, it is important to know whether more than one match exists, so that downstream operations do not target the wrong text.
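The fragmentation problem can be illustrated with a small sketch. The run texts and the joinRunTexts helper below are hypothetical stand-ins rather than safe-docx code; each string simply represents the text of one <w:r> element.

```typescript
// Hypothetical: each entry stands in for the text content of one <w:r> run.
const runTexts: string[] = ['The Purch', 'ase Price shall be paid at ', 'Closing.'];

// Hypothetical helper: concatenates run texts into the paragraph's plain text.
function joinRunTexts(runs: string[]): string {
  return runs.join('');
}

const needleText = 'Purchase Price';

// No single run contains the needle, because it straddles a run boundary.
const foundInSingleRun = runTexts.some((t) => t.indexOf(needleText) !== -1);
console.log(foundInSingleRun); // false

// Searching the joined paragraph text finds it.
console.log(joinRunTexts(runTexts).indexOf(needleText)); // 4
```

Searching run by run would miss this needle, which is why the search is better performed on the paragraph's joined plain text.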
In the safe-docx repo, a primitive function named findUniqueSubstringMatch[1] accepts a paragraph's plain text and a search string (the "needle") and returns a result describing where the needle was found. The result has one of three statuses: not_found, multiple, or unique. When the status is unique, the caller can use the start, end, and matchedText fields of the result to pinpoint, unambiguously, where text should be modified or inserted.
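As a rough illustration of the three outcomes, here is a minimal sketch that handles exact matching only. The names MatchResultSketch and findUniqueExactSketch are hypothetical, the fields on the not_found and multiple variants are assumptions, and the real findUniqueSubstringMatch also supports normalized modes that are not modelled here.

```typescript
// Hypothetical result type. Only the 'unique' shape is taken from the
// expected result in this document; the other two variants are assumptions.
type MatchResultSketch =
  | { status: 'not_found' }
  | { status: 'multiple'; matchCount: number }
  | {
      status: 'unique';
      mode: 'exact';
      matchCount: 1;
      start: number;
      end: number;
      matchedText: string;
    };

// Hypothetical exact-only stand-in for findUniqueSubstringMatch.
function findUniqueExactSketch(haystack: string, needle: string): MatchResultSketch {
  // An empty needle is reported as not_found.
  if (needle.length === 0) return { status: 'not_found' };

  // Count every occurrence and remember where the first one starts.
  // Overlapping occurrences are counted too (an assumption of this sketch).
  let count = 0;
  let first = -1;
  let idx = haystack.indexOf(needle);
  while (idx !== -1) {
    if (count === 0) first = idx;
    count += 1;
    idx = haystack.indexOf(needle, idx + 1);
  }

  if (count === 0) return { status: 'not_found' };
  if (count > 1) return { status: 'multiple', matchCount: count };

  const end = first + needle.length; // exclusive end index
  return {
    status: 'unique',
    mode: 'exact',
    matchCount: 1,
    start: first,
    end,
    matchedText: haystack.slice(first, end),
  };
}
```

Modelling the result as a discriminated union on status means the compiler only allows access to start, end, and matchedText after the caller has checked for unique.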
Below is a test scenario of the baseline successful case of findUniqueSubstringMatch[2]: the needle appears in the paragraph text exactly once and verbatim.
The scenario
Given paragraph text containing the needle exactly once as a verbatim substring,
When findUniqueSubstringMatch is called with the paragraph text and the needle,
Then the result has only one match, no normalization was applied to find the match, and the matched text is preserved verbatim.
Below is the test fixture code.
const haystack = 'The Purchase Price shall be paid at Closing.';
const needle = 'Purchase Price';
const result = findUniqueSubstringMatch(haystack, needle);
The expected result shape
Below is the result that findUniqueSubstringMatch is expected to return for this scenario.
{
  status: 'unique',
  mode: 'exact',
  matchCount: 1,
  start: 4,
  end: 18,
  matchedText: 'Purchase Price'
}
Below is a description of the expected fields:
- The status field is expected to be 'unique', meaning there is exactly one match in the paragraph.
- The mode field is expected to be 'exact', meaning no normalization was applied to find the match.
- The matchCount field is expected to be 1, because the search returned a single match.
- The start field is expected to be the integer 4, the index of the first character of the match within the paragraph (zero-based).
- The end field is expected to be the integer 18, one past the index of the last character of the match.
- The matchedText field is expected to be the string 'Purchase Price', byte-identical to the substring at [4, 18) in the paragraph.
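The half-open interval described above can be checked with plain string slicing. The sketch below assumes start and end are offsets into the JavaScript string (UTF-16 code units, which is what slice uses); replaceRangeSketch and the replacement text are hypothetical and not part of safe-docx.

```typescript
// Hypothetical helper, not part of safe-docx: replaces the half-open range
// [start, end) of `text` with `replacement`.
function replaceRangeSketch(text: string, start: number, end: number, replacement: string): string {
  return text.slice(0, start) + replacement + text.slice(end);
}

const paragraphText = 'The Purchase Price shall be paid at Closing.';
const expectedMatch = { start: 4, end: 18, matchedText: 'Purchase Price' };

// Slicing the half-open range returns the matched text itself.
const slicedText = paragraphText.slice(expectedMatch.start, expectedMatch.end);
console.log(slicedText === expectedMatch.matchedText); // true

// Because end is exclusive, end - start equals the length of the match.
console.log(expectedMatch.end - expectedMatch.start === expectedMatch.matchedText.length); // true

// The same two offsets are enough to splice in a replacement.
console.log(replaceRangeSketch(paragraphText, expectedMatch.start, expectedMatch.end, 'Total Consideration'));
// The Total Consideration shall be paid at Closing.
```

An exclusive end index keeps the arithmetic simple: no off-by-one adjustment is needed either to extract the match or to replace it.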
What this scenario does not cover
This scenario is deliberately limited to a single literal occurrence of the needle inside a single paragraph string. It does not exercise:
- duplicate matches in the paragraph (covered by the sibling scenario multiple when needle appears more than once),
- the needle being absent (covered by not_found when needle is absent),
- empty needle inputs (covered by not_found for empty needle),
- punctuation or whitespace normalization (covered by the quote_normalized, flexible_whitespace, and quote_optional sibling scenarios),
- run-boundary reconstruction, which is the responsibility of a separate primitive that runs before findUniqueSubstringMatch sees the paragraph text,
- fuzzy matching of any kind; this primitive matches and never approximates.
The assertions only test the result type, the mode, and the matched text for this fixture.
A non-obvious detail
A successful match is not merely "the result is non-empty." The expected status and mode together distinguish an exact, unique literal match from the other outcomes the same call could produce: a multiple match, a normalized-only match (one of quote_normalized, flexible_whitespace, quote_optional, or clean), or no match at all. The distinction matters because the status determines what downstream operations can safely do.
Consider an instruction to bold the word Closing in a paragraph. If the paragraph contains the word Closing four times, an operation that bolds the first match silently does the wrong thing. The multiple status surfaces this case, so that the caller can return an error, ask the user to disambiguate, or apply the operation to all matches. Returning the match count as a first-class signal, rather than letting callers assume a single match, comes from running this primitive against real contracts where the same defined term often appears many times.
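One way a caller might branch on the status is sketched below. The StatusSketch and EditDecision types and the decideEdit function are hypothetical, and the sketch takes the strictest of the options mentioned above by rejecting ambiguous matches rather than applying the edit to all of them.

```typescript
// Hypothetical, trimmed-down view of the result: only the fields this
// caller needs. The fields on the non-unique variants are assumptions.
type StatusSketch =
  | { status: 'not_found' }
  | { status: 'multiple'; matchCount: number }
  | { status: 'unique'; start: number; end: number };

// Hypothetical decision type for a downstream edit such as "bold this text".
type EditDecision =
  | { action: 'apply'; start: number; end: number }
  | { action: 'reject'; reason: string };

function decideEdit(result: StatusSketch): EditDecision {
  switch (result.status) {
    case 'unique':
      // Exactly one match: the range is safe to edit.
      return { action: 'apply', start: result.start, end: result.end };
    case 'multiple':
      // Ambiguous: refuse instead of silently editing the first match.
      return {
        action: 'reject',
        reason: `needle matched ${result.matchCount} times; disambiguation needed`,
      };
    case 'not_found':
      return { action: 'reject', reason: 'needle not found' };
  }
}

// A paragraph containing "Closing" four times would be rejected:
console.log(decideEdit({ status: 'multiple', matchCount: 4 }).action); // reject
```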