From c75c5ef077db72060ecb99f4f93f944c7f008dbd Mon Sep 17 00:00:00 2001 From: Rainer Kottenhoff Date: Thu, 26 Feb 2026 19:13:23 +0100 Subject: [PATCH] chore: Claude plan to replace retired Oniguruma by actively maintained PCRE2 --- plans/replace_oniguruma_by_pcre2.md | 824 ++++++++++++++++++++++++++++ src/Notepad3.c | 4 +- 2 files changed, 826 insertions(+), 2 deletions(-) create mode 100644 plans/replace_oniguruma_by_pcre2.md diff --git a/plans/replace_oniguruma_by_pcre2.md b/plans/replace_oniguruma_by_pcre2.md new file mode 100644 index 000000000..ccfbc86ac --- /dev/null +++ b/plans/replace_oniguruma_by_pcre2.md @@ -0,0 +1,824 @@ +# Plan: Replace Oniguruma by PCRE2 in Notepad3/Scintilla + +## Background & Motivation + +Oniguruma (https://github.com/kkos/oniguruma) was **archived on April 24, 2025** by its +maintainer K. Kosako. The repository is now read-only. No actively maintained community +fork exists. Security vulnerabilities will not be patched. + +**PCRE2** (https://github.com/PCRE2Project/pcre2) is the recommended replacement: +actively maintained, C library, UTF-8 native, feature-complete superset of Oniguruma +for Notepad3's use case. + +--- + +## Current Architecture + +### Scintilla Regex Engine Interface + +**File**: `scintilla/src/Document.h` (lines 107-119) + +```cpp +class RegexSearchBase { +public: + virtual ~RegexSearchBase() = default; + virtual Sci::Position FindText(Document *doc, Sci::Position minPos, Sci::Position maxPos, + const char *s, bool caseSensitive, bool word, bool wordStart, + Scintilla::FindOption flags, Sci::Position *length) = 0; + virtual const char *SubstituteByPosition(Document *doc, const char *text, + Sci::Position *length) = 0; +}; +extern RegexSearchBase *CreateRegexSearch(CharClassify *charClassTable); +``` + +### Factory Selection (compile-time) + +**File**: `scintilla/src/Document.cxx` (lines 3604-3610) + +```cpp +#ifndef SCI_OWNREGEX +RegexSearchBase *Scintilla::Internal::CreateRegexSearch(CharClassify *charClassTable) { + return new BuiltinRegex(charClassTable); // fallback: RESearch +} +#endif +``` + +When `SCI_OWNREGEX` is defined, the factory is provided by the custom engine file instead. + +### Current Oniguruma Implementation + +**File**: `scintilla/oniguruma/scintilla/OnigurumaRegExEngine.cxx` (904 lines) + +This file contains **two classes**: + +1. **`OnigurumaRegExEngine`** (line 162-244) — implements `RegexSearchBase` for Scintilla's + find/replace system. This is the main engine. +2. **`SimpleRegExEngine`** (line 753-805) — lightweight standalone engine used via the + exported `OnigRegExFind()` function for simple pattern matching outside Scintilla + (URL detection, lexer file matching). + +**Factory** (line 248-251): +```cpp +RegexSearchBase *Scintilla::Internal::CreateRegexSearch(CharClassify *charClassTable) { + return new OnigurumaRegExEngine(charClassTable); +} +``` + +**Exported C function** (line 888-901): +```cpp +extern "C" __declspec(dllexport) +ptrdiff_t WINAPI OnigRegExFind(const char *pchPattern, const char *pchText, + const bool caseSensitive, const int eolMode, int *matchLen_out); +``` + +### Callers of OnigRegExFind (outside Scintilla) + +**Declaration**: `src/SciCall.h` (line 107-108) +```cpp +ptrdiff_t WINAPI OnigRegExFind(const char *pchPattern, const char *pchText, + const bool caseSensitive, const int eolMode, int *matchLen_out); +``` + +**Call sites**: +- `src/Edit.c:2126` — URL validation via hyperlink regex pattern +- `src/Styles.c:2533` — Lexer file-path pattern matching + +Both call sites use forward-only search on small strings (not document buffers). + +### Build Configuration + +**Two vcxproj files** reference Oniguruma: + +#### `scintilla/Scintilla.vcxproj` (static library) +- Preprocessor: `NO_CXX11_REGEX;SCI_OWNREGEX;SCI_EMPTYCATALOGUE;ONIG_STATIC` +- Include dirs: `../oniguruma/src` +- Source files (lines 523-541): 19 Oniguruma `.c` files + `OnigurumaRegExEngine.cxx` +- Header files (lines 587-593): 7 Oniguruma `.h` files + +#### `scintilla/ScintillaDLL.vcxproj` (DLL) +- Preprocessor: `NO_CXX11_REGEX;SCI_OWNREGEX;ONIG_EXTERN=extern;SCINTILLA_DLL` +- Include dirs: `../oniguruma/src` +- Source files (lines 227-247): 20 Oniguruma `.c` files + `OnigurumaRegExEngine.cxx` +- Header files (lines 341-347): 7 Oniguruma `.h` files + +### Oniguruma Source Files (vendored) + +Located in `scintilla/oniguruma/src/`: +``` +ascii.c mktable.c onig_init.c regcomp.c regenc.c regerror.c +regexec.c regext.c reggnu.c regparse.c regposix.c regsyntax.c +regtrav.c regversion.c st.c unicode.c unicode_egcb_data.c +unicode_fold1_key.c unicode_fold2_key.c unicode_fold3_key.c +unicode_fold_data.c unicode_property_data.c unicode_property_data_posix.c +unicode_unfold_key.c unicode_wb_data.c utf8.c +``` + +Headers: `config.h oniggnu.h oniguruma.h regenc.h regint.h regparse.h st.h` + +--- + +## Notepad3 Regex Feature Usage (What Must Be Preserved) + +### Syntax features exposed to users via find/replace + +| Feature | Oniguruma syntax | PCRE2 equivalent | Action needed | +|---------|-----------------|------------------|---------------| +| Basic matching `.` `*` `+` `?` `{n,m}` | Same | Same | None | +| Character classes `[...]` | Same | Same | None | +| POSIX classes `[:alpha:]` | Same | Same | None | +| Alternation `\|` | Same | Same | None | +| Capturing groups `(...)` | Same | Same | None | +| Non-capturing groups `(?:...)` | Same | Same | None | +| Backreferences `\1`..`\99` | Same | Same | None | +| Named groups `(?...)` | Same | Same | None | +| Named groups `(?P...)` | Same | Same | None | +| Lookbehind `(?<=...)` `(?`** word boundaries | Oniguruma-specific | **Not supported** | **Translate** | +| **`\h` `\H`** horiz. space | Currently translated to char class | **Native in PCRE2** | **Simplify** | +| **`\uHHHH`** Unicode escape | Oniguruma-specific | `\x{HHHH}` in PCRE2 | **Translate** | + +### Replacement string features + +The replacement logic is **not** handled by Oniguruma itself — it's custom code in +`SubstituteByPosition()` (lines 458-522). It processes: +- `$1`..`$99` and `\1`..`\9` — numbered group references +- `${name}` and `$+{name}` — named group references (uses `onig_name_to_backref_number()`) +- `$$` and `\\` — literal escapes +- `\n \r \t \a \b \f \v` — escape sequences +- `\xHH` and `\uHHHH` — hex/Unicode escapes (converted to UTF-8 via `WideCharToMultiByte`) + +The `convertReplExpr()` method (lines 640-743) converts `\1`..`\9` to `$1`..`$9` format +and processes escape sequences into literal bytes before `SubstituteByPosition()` handles +group substitution. + +### Compile options mapping + +Current NP3 configuration in `SetSimpleOptions()` (lines 97-158): + +| Oniguruma option | PCRE2 equivalent | NP3 default | +|-----------------|------------------|-------------| +| `ONIG_OPTION_IGNORECASE` | `PCRE2_CASELESS` | OFF (case-sensitive) | +| `ONIG_OPTION_MULTILINE` (dot-matches-all) | `PCRE2_DOTALL` | OFF (enabled by `FindOption::DotMatchAll`) | +| `ONIG_OPTION_SINGLELINE` | (none needed) | OFF | +| `ONIG_OPTION_NOT_BEGIN_STRING` | `PCRE2_NOTBOL` | Dynamic (based on range position) | +| `ONIG_OPTION_NOT_END_STRING` | `PCRE2_NOTEOL` | Dynamic (based on range position) | + +Custom syntax modifications in constructor (lines 78-88): +- **Removed**: `ONIG_SYN_OP2_ESC_H_XDIGIT` (replaced `\h`/`\H` with custom char classes) +- **Added**: `ONIG_SYN_OP_ESC_LTGT_WORD_BEGIN_END` (enables `\<` `\>`) +- **Added**: `ONIG_SYN_OP2_ESC_U_HEX4` (enables `\uHHHH`) + +--- + +## Implementation Plan + +### Step 1: Obtain PCRE2 Sources + +Download PCRE2 source release from https://github.com/PCRE2Project/pcre2/releases +(latest stable, currently 10.44+). + +Create directory `scintilla/pcre2/` (parallel to existing `scintilla/oniguruma/`). + +Minimal files needed for static compilation: +``` +pcre2.h (public API — generated from pcre2.h.in) +pcre2_internal.h +pcre2_intmodedep.h +pcre2_ucp.h +pcre2_chartables.c (generated or use default) +pcre2_auto_possess.c +pcre2_chkdint.c +pcre2_compile.c +pcre2_config.c +pcre2_context.c +pcre2_dfa_match.c +pcre2_error.c +pcre2_extuni.c +pcre2_find_bracket.c +pcre2_maketables.c +pcre2_match.c +pcre2_match_data.c +pcre2_newline.c +pcre2_ord2utf.c +pcre2_pattern_info.c +pcre2_script_run.c +pcre2_string_utils.c +pcre2_study.c +pcre2_substitute.c +pcre2_substring.c +pcre2_tables.c +pcre2_ucd.c +pcre2_valid_utf.c +pcre2_xclass.c +``` + +Generate `config.h` and `pcre2.h` for MSVC (or use the CMake config step, or create +manually — PCRE2 provides `pcre2.h.generic` as a starting point). + +Key defines for `config.h`: +```c +#define PCRE2_CODE_UNIT_WIDTH 8 // UTF-8 mode +#define HAVE_MEMMOVE 1 +#define PCRE2_STATIC 1 // static linking +#define SUPPORT_UNICODE 1 // Unicode properties, scripts, \p{}, \X, etc. +#define SUPPORT_UCP 1 // Unicode character properties +#define SUPPORT_UTF 1 // UTF-8/16/32 support +#define BSR_ANYCRLF 0 // \R matches any Unicode newline by default +#define NEWLINE_DEFAULT 0 // 0 = any newline +#define LINK_SIZE 2 // internal link size +#define PARENS_NEST_LIMIT 250 // parentheses nesting limit +#define MATCH_LIMIT 10000000 // default match limit +#define MATCH_LIMIT_DEPTH 10000000 // default depth limit (prevents stack overflow) +#define MAX_NAME_SIZE 128 // max named group name length +#define MAX_NAME_COUNT 10000 // max number of named groups +``` + +### Step 2: Create PCRE2RegExEngine.cxx + +New file: `scintilla/pcre2/scintilla/PCRE2RegExEngine.cxx` + +Structure mirrors `OnigurumaRegExEngine.cxx` exactly: + +```cpp +// encoding: UTF-8 +/** + * @file PCRE2RegExEngine.cxx + * @brief integrate PCRE2 regex engine for Scintilla library + * Replaces Oniguruma (archived April 2025) + * + * uses PCRE2 - Perl Compatible Regular Expressions (v10.x) + * https://github.com/PCRE2Project/pcre2 + */ + +#ifdef SCI_OWNREGEX + +#include +#include +#include +#include +#include + +#define VC_EXTRALEAN 1 +#define NOMINMAX 1 +#include + +#include "Geometry.h" +#include "Platform.h" +#include "Scintilla.h" +#include "ScintillaTypes.h" +#include "ILexer.h" +#include "SplitVector.h" +#include "Partitioning.h" +#include "CellBuffer.h" +#include "CaseFolder.h" +#include "RunStyles.h" +#include "Decoration.h" +#include "CharClassify.h" +#include "CharacterCategoryMap.h" +#include "Document.h" + +#define PCRE2_CODE_UNIT_WIDTH 8 +#define PCRE2_STATIC +#include "pcre2.h" + +// ... implementation ... + +#endif //SCI_OWNREGEX +``` + +#### Class: PCRE2RegExEngine + +```cpp +class PCRE2RegExEngine : public RegexSearchBase { +public: + explicit PCRE2RegExEngine(CharClassify* charClassTable); + ~PCRE2RegExEngine() override; + + Sci::Position FindText(Document* doc, Sci::Position minPos, Sci::Position maxPos, + const char* pattern, bool caseSensitive, bool word, bool wordStart, + Scintilla::FindOption searchFlags, Sci::Position *length) override; + + const char* SubstituteByPosition(Document* doc, const char* text, + Sci::Position* length) override; + +private: + void clear(); + std::string translateRegExpr(const std::string& regExprStr, bool wholeWord, + bool wordStart, EndOfLine eolMode); + std::string convertReplExpr(const std::string& replStr); + + std::string m_RegExprStrg; + uint32_t m_CompileOptions; + pcre2_code* m_CompiledPattern; + pcre2_match_data* m_MatchData; + pcre2_match_context* m_MatchContext; + EOLmode m_EOLmode; + Sci::Position m_RangeBeg; + Sci::Position m_RangeEnd; + char m_ErrorInfo[256]; + Sci::Position m_MatchPos; + Sci::Position m_MatchLen; + +public: + std::string m_SubstBuffer; +}; +``` + +#### Member mapping (Oniguruma -> PCRE2) + +| Oniguruma member | PCRE2 member | Notes | +|-----------------|--------------|-------| +| `OnigRegex m_RegExpr` | `pcre2_code* m_CompiledPattern` | Compiled pattern handle | +| `OnigRegion m_Region` | `pcre2_match_data* m_MatchData` | Match result storage | +| `OnigSyntaxType m_OnigSyntax` | (not needed) | PCRE2 has fixed syntax | +| `OnigOptionType m_CmplOptions` | `uint32_t m_CompileOptions` | Compile-time flags | +| — | `pcre2_match_context* m_MatchContext` | **New**: needed for offset limit | + +#### Constructor + +```cpp +PCRE2RegExEngine::PCRE2RegExEngine(CharClassify* /*charClassTable*/) + : m_CompileOptions(PCRE2_UTF | PCRE2_UCP) // UTF-8 + Unicode properties always on + , m_CompiledPattern(nullptr) + , m_MatchData(nullptr) + , m_MatchContext(nullptr) + , m_EOLmode(EOLmode::UDEF) + , m_RangeBeg(-1) + , m_RangeEnd(-1) + , m_ErrorInfo() + , m_MatchPos(-1) + , m_MatchLen(0) +{ + m_MatchContext = pcre2_match_context_create(nullptr); + // Set match limit to prevent catastrophic backtracking + pcre2_set_match_limit(m_MatchContext, 10000000); + pcre2_set_depth_limit(m_MatchContext, 10000000); +} +``` + +#### Destructor + +```cpp +PCRE2RegExEngine::~PCRE2RegExEngine() { + clear(); + if (m_MatchContext) pcre2_match_context_free(m_MatchContext); +} +``` + +#### clear() + +```cpp +void PCRE2RegExEngine::clear() { + m_RegExprStrg.clear(); + if (m_MatchData) { pcre2_match_data_free(m_MatchData); m_MatchData = nullptr; } + if (m_CompiledPattern) { pcre2_code_free(m_CompiledPattern); m_CompiledPattern = nullptr; } + m_RangeBeg = -1; + m_RangeEnd = -1; + m_MatchPos = -1; + m_MatchLen = 0; +} +``` + +#### FindText() — Key implementation + +This is the most complex method. Key differences from Oniguruma: + +**1. Compile options assembly:** +```cpp +uint32_t options = PCRE2_UTF | PCRE2_UCP; // always UTF-8 + Unicode properties +if (!caseSensitive) options |= PCRE2_CASELESS; +if (FlagSet(searchFlags, FindOption::DotMatchAll)) options |= PCRE2_DOTALL; +``` + +**2. Pattern compilation:** +```cpp +int errorcode; +PCRE2_SIZE erroroffset; +m_CompiledPattern = pcre2_compile( + (PCRE2_SPTR)sRegExprStrg.c_str(), + sRegExprStrg.length(), + options, + &errorcode, + &erroroffset, + nullptr // default compile context +); +if (!m_CompiledPattern) { + pcre2_get_error_message(errorcode, (PCRE2_UCHAR*)m_ErrorInfo, sizeof(m_ErrorInfo)); + return SciPos(-2); +} +m_MatchData = pcre2_match_data_create_from_pattern(m_CompiledPattern, nullptr); +// Optional: JIT compile for performance +pcre2_jit_compile(m_CompiledPattern, PCRE2_JIT_COMPLETE); +``` + +**3. Search execution (FORWARD):** + +```cpp +// Get document buffer pointer +auto const docBegPtr = (PCRE2_SPTR)doc->BufferPointer(); +auto const docLen = (PCRE2_SIZE)doc->Length(); + +// Set offset limit to constrain search end +pcre2_set_offset_limit(m_MatchContext, (PCRE2_SIZE)rangeEnd); + +// Match options +uint32_t matchOptions = 0; +if (rangeBeg != 0) matchOptions |= PCRE2_NOTBOL; +if (rangeEnd != docLen) matchOptions |= PCRE2_NOTEOL; + +int rc = pcre2_match( + m_CompiledPattern, + docBegPtr, + docLen, + (PCRE2_SIZE)rangeBeg, // start offset + matchOptions, + m_MatchData, + m_MatchContext +); + +if (rc > 0) { + PCRE2_SIZE* ovector = pcre2_get_ovector_pointer(m_MatchData); + m_MatchPos = SciPos(ovector[0]); + m_MatchLen = SciPos(ovector[1]) - m_MatchPos; +} +``` + +**4. Search execution (BACKWARD) — THE HARD PART:** + +PCRE2 has no native backward search. Oniguruma does this by swapping range pointers. +Implementation strategy: + +```cpp +// Backward search: iterate forward, keep last match within range +Sci::Position lastMatchPos = -1; +Sci::Position lastMatchLen = 0; +PCRE2_SIZE searchStart = (PCRE2_SIZE)rangeBeg; + +while (true) { + int rc = pcre2_match( + m_CompiledPattern, docBegPtr, docLen, + searchStart, matchOptions, + m_MatchData, m_MatchContext + ); + if (rc <= 0) break; + + PCRE2_SIZE* ovector = pcre2_get_ovector_pointer(m_MatchData); + Sci::Position pos = SciPos(ovector[0]); + Sci::Position len = SciPos(ovector[1]) - pos; + + if (pos > rangeEnd) break; // past search range + if (pos <= minPos) { // within backward search range (minPos > maxPos) + lastMatchPos = pos; + lastMatchLen = len; + } + + // Advance past this match (at least 1 byte, handle UTF-8) + searchStart = ovector[1]; + if (searchStart == ovector[0]) { + // zero-length match — advance by one UTF-8 character + searchStart = doc->MovePositionOutsideChar(searchStart + 1, 1, true); + } +} +m_MatchPos = lastMatchPos; +m_MatchLen = lastMatchLen; +``` + +**Performance note**: For backward search in large files, this iterates all forward matches. +Optimization: restrict the search range using `pcre2_set_offset_limit()` to only search +within `[maxPos, minPos]` (the relevant backward range), then find the last match in that window. + +**Alternative backward search (line-by-line):** +Search backward line by line from `minPos`, stopping at `maxPos`. This avoids scanning +the entire document but is more complex to implement. Scintilla's built-in `BuiltinRegex` +uses this approach. + +#### SubstituteByPosition() — Nearly identical + +The replacement logic is custom and doesn't use Oniguruma APIs except for one call: +`onig_name_to_backref_number()` (line 499) to resolve named group references. + +**PCRE2 replacement**: Use `pcre2_substring_number_from_name()`: + +```cpp +// Where Oniguruma had: +int grpNum = onig_name_to_backref_number(m_RegExpr, name_beg, name_end, &m_Region); + +// PCRE2 equivalent: +int grpNum = pcre2_substring_number_from_name(m_CompiledPattern, (PCRE2_SPTR)name_str); +``` + +**Important**: The name must be null-terminated for PCRE2 (Oniguruma took begin/end pointers). +Extract the name into a temporary buffer first. + +For numbered group access, replace `m_Region.beg[n]`/`m_Region.end[n]` with: +```cpp +PCRE2_SIZE* ovector = pcre2_get_ovector_pointer(m_MatchData); +auto rBeg = SciPos(ovector[2 * grpNum]); // was: m_Region.beg[grpNum] +auto rEnd = SciPos(ovector[2 * grpNum + 1]); // was: m_Region.end[grpNum] +auto len = rEnd - rBeg; +// Check for unset group: ovector[2*n] == PCRE2_UNSET +``` + +Number of capture groups: `pcre2_get_ovector_count(m_MatchData)` replaces `m_Region.num_regs`. + +The rest of `SubstituteByPosition()` and `convertReplExpr()` can be copied verbatim — +they are pure string processing with no Oniguruma dependencies. + +#### translateRegExpr() — Pattern translation + +```cpp +std::string PCRE2RegExEngine::translateRegExpr(const std::string& regExprStr, + bool wholeWord, bool wordStart, EndOfLine eolMode) +{ + std::string transRegExpr; + + // Word boundary wrapping (same as Oniguruma version) + if (wholeWord || wordStart) { + transRegExpr.push_back('\\'); + transRegExpr.push_back('b'); + transRegExpr.append(regExprStr); + if (wholeWord) { + transRegExpr.push_back('\\'); + transRegExpr.push_back('b'); + } + replaceAll(transRegExpr, ".", R"(\w)"); + } else { + transRegExpr.append(regExprStr); + } + + // PCRE2 does NOT support \< and \> — translate to lookaround equivalents + replaceAll(transRegExpr, R"(\<)", R"((?)", R"((?<=\w)(?!\w))"); // word end + // NOTE: must also handle escaped versions \\< and \\> (literal < >) — do NOT translate those + + // \h and \H are NATIVE in PCRE2 — no translation needed! + // (Oniguruma version had: replaceAll \h -> [^\S\n\v\f\r\u2028\u2029]) + // REMOVE those translations. + + // \uHHHH -> \x{HHHH} for PCRE2 + // This requires a smarter regex-based replacement (not simple string replace) + // to transform \uHHHH patterns to \x{HHHH} + // Implementation: scan for \u followed by exactly 4 hex digits + translateUnicodeEscapes(transRegExpr); // helper function + + return transRegExpr; +} +``` + +**Helper for `\uHHHH` -> `\x{HHHH}` translation:** +```cpp +static void translateUnicodeEscapes(std::string& s) { + std::string result; + result.reserve(s.length()); + for (size_t i = 0; i < s.length(); ++i) { + if (s[i] == '\\' && (i + 5) <= s.length() && s[i+1] == 'u' && + isxdigit(s[i+2]) && isxdigit(s[i+3]) && isxdigit(s[i+4]) && isxdigit(s[i+5])) { + result += "\\x{"; + result += s[i+2]; result += s[i+3]; result += s[i+4]; result += s[i+5]; + result += '}'; + i += 5; + } else { + result += s[i]; + } + } + s = result; +} +``` + +**Edge case warning for `\<` / `\>` translation**: The naive `replaceAll` approach will +also incorrectly translate escaped `\\<` (literal `<`). A proper implementation must +skip escaped backslashes. Consider scanning character by character instead of using +`replaceAll`. + +#### SimpleRegExEngine / OnigRegExFind export + +Replace with a PCRE2-based equivalent. This is simpler — forward-only, no document +buffer concerns: + +```cpp +class SimplePCRE2Engine { +public: + explicit SimplePCRE2Engine(EOLmode eolMode); + ~SimplePCRE2Engine(); + ptrdiff_t Find(const char* pattern, const char* text, + bool caseSensitive, int* matchLen_out = nullptr); +private: + EOLmode m_EOLmode; + pcre2_code* m_Code; + pcre2_match_data* m_MatchData; +}; + +// Exported function — keep same signature for binary compatibility +extern "C" __declspec(dllexport) +ptrdiff_t WINAPI OnigRegExFind(const char *pchPattern, const char *pchText, + const bool caseSensitive, const int eolMode, int *matchLen_out) { + SimplePCRE2Engine engine(static_cast(eolMode)); + return engine.Find(pchPattern, pchText, caseSensitive, matchLen_out); +} +``` + +**IMPORTANT**: Keep the exported function name as `OnigRegExFind` (despite using PCRE2) +to avoid changing `src/SciCall.h`, `src/Edit.c`, and `src/Styles.c`. The function +signature is the same. Consider renaming to `NP3RegExFind` in a follow-up if desired. + +### Step 3: Update Build Configuration + +#### scintilla/Scintilla.vcxproj + +**Preprocessor** — All configurations, change: +``` +ONIG_STATIC --> PCRE2_CODE_UNIT_WIDTH=8;PCRE2_STATIC;HAVE_CONFIG_H +``` +Keep `SCI_OWNREGEX` and `NO_CXX11_REGEX` unchanged. + +**Include directories** — All configurations, change: +``` +../oniguruma/src --> ../pcre2/src +``` + +**Source files** — Replace Oniguruma `.c` files with PCRE2 `.c` files: +```xml + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +``` + +**Header files** — Replace Oniguruma headers: +```xml + + + + + + + +``` + +#### scintilla/ScintillaDLL.vcxproj + +Same changes as above, but paths use `../pcre2/` prefix instead of `../oniguruma/`. + +**Preprocessor** change: `ONIG_EXTERN=extern` --> `PCRE2_CODE_UNIT_WIDTH=8;PCRE2_STATIC;HAVE_CONFIG_H` + +### Step 4: Optional — JIT Support + +PCRE2 includes an optional JIT compiler (`pcre2_jit_compile`) that can significantly +speed up repeated matches. Enable it by: + +1. Including `pcre2_jit_compile.c` in the build (already listed above) +2. Including the sljit sources that PCRE2 bundles (`sljit/` subdirectory) +3. Calling `pcre2_jit_compile(m_CompiledPattern, PCRE2_JIT_COMPLETE)` after compilation +4. Using `pcre2_jit_match()` instead of `pcre2_match()` for JIT-compiled patterns + +If JIT adds too much complexity or build size, it can be omitted — the interpreter +is already faster than Oniguruma for most patterns. + +### Step 5: Testing Checklist + +1. **Forward search**: Basic patterns, regex patterns, case-sensitive/insensitive +2. **Backward search**: Find Previous with regex enabled +3. **Replace**: `$1`, `${name}`, escape sequences in replacement +4. **Replace All**: Full document replacement +5. **Word boundary**: `\` patterns still work (via translation) +6. **Unicode**: `\p{L}`, `\p{N}`, Unicode text in documents +7. **Lookbehind**: `(?<=prefix)match` patterns +8. **Possessive**: `a++b` type patterns +9. **\h \H**: Horizontal space matching (now native, should be same behavior) +10. **\uHHHH**: Unicode escapes in patterns (translated to `\x{HHHH}`) +11. **\K**: Match reset patterns +12. **\R**: Newline sequence matching +13. **OnigRegExFind**: URL detection in Edit.c still works +14. **OnigRegExFind**: Lexer file pattern matching in Styles.c still works +15. **Large files**: Performance regression test with multi-MB files +16. **All platforms**: Build x86, x64, x64_AVX2, ARM64 + +### Step 6: Cleanup + +After confirming everything works: +1. Remove `scintilla/oniguruma/` directory entirely +2. Update `CLAUDE.md` vendored dependencies table +3. Update `.github/copilot-instructions.md` if it references Oniguruma +4. Consider renaming `OnigRegExFind` export to `NP3RegExFind` (requires updating + `src/SciCall.h:107`, `src/Edit.c:2126`, `src/Styles.c:2533`) + +--- + +## API Quick Reference + +### PCRE2 Core Functions (8-bit) + +```c +// Compilation +pcre2_code *pcre2_compile(PCRE2_SPTR pattern, PCRE2_SIZE length, + uint32_t options, int *errorcode, PCRE2_SIZE *erroroffset, + pcre2_compile_context *ccontext); +void pcre2_code_free(pcre2_code *code); + +// JIT (optional) +int pcre2_jit_compile(pcre2_code *code, uint32_t options); + +// Matching +pcre2_match_data *pcre2_match_data_create_from_pattern(const pcre2_code *code, + pcre2_general_context *gcontext); +void pcre2_match_data_free(pcre2_match_data *match_data); + +int pcre2_match(const pcre2_code *code, PCRE2_SPTR subject, PCRE2_SIZE length, + PCRE2_SIZE startoffset, uint32_t options, + pcre2_match_data *match_data, pcre2_match_context *mcontext); + +// Results +PCRE2_SIZE *pcre2_get_ovector_pointer(pcre2_match_data *match_data); +uint32_t pcre2_get_ovector_count(pcre2_match_data *match_data); + +// Named groups +int pcre2_substring_number_from_name(const pcre2_code *code, PCRE2_SPTR name); + +// Context (for offset limit, match limits) +pcre2_match_context *pcre2_match_context_create(pcre2_general_context *gcontext); +void pcre2_match_context_free(pcre2_match_context *mcontext); +int pcre2_set_offset_limit(pcre2_match_context *mcontext, PCRE2_SIZE value); +int pcre2_set_match_limit(pcre2_match_context *mcontext, uint32_t value); +int pcre2_set_depth_limit(pcre2_match_context *mcontext, uint32_t value); + +// Error messages +int pcre2_get_error_message(int errorcode, PCRE2_UCHAR *buffer, PCRE2_SIZE bufflen); +``` + +### Key PCRE2 Compile Options + +```c +PCRE2_UTF // Treat pattern and subject as UTF-8 +PCRE2_UCP // Use Unicode properties for \d, \w, \s, \b +PCRE2_CASELESS // Case-insensitive matching +PCRE2_DOTALL // . matches any character including newline +PCRE2_MULTILINE // ^ and $ match at line boundaries (NOT same as Onig MULTILINE!) +PCRE2_BSR_UNICODE // \R matches any Unicode newline sequence +``` + +**IMPORTANT naming confusion**: Oniguruma `ONIG_OPTION_MULTILINE` = "dot matches newline" += PCRE2 `PCRE2_DOTALL`. PCRE2 `PCRE2_MULTILINE` = "^ $ match line boundaries" which +is a different concept. + +--- + +## Risk Assessment + +| Risk | Impact | Mitigation | +|------|--------|------------| +| Backward search performance | Medium | Start with iterate-forward approach; optimize later if needed | +| `\<` `\>` translation edge cases | Low | Thorough testing; character-by-character scanner | +| `\uHHHH` translation edge cases | Low | Simple pattern; well-defined 4-hex-digit format | +| PCRE2 compile/config complexity | Low | Use `.h.generic` files; minimal config | +| Binary size change | Low | PCRE2 is similar size to Oniguruma | +| User-visible regex syntax changes | Low | All features preserved; `\h`/`\H` now native (better) | + +## Files to Create/Modify Summary + +| Action | File | +|--------|------| +| **CREATE** | `scintilla/pcre2/src/` — PCRE2 source + headers | +| **CREATE** | `scintilla/pcre2/scintilla/PCRE2RegExEngine.cxx` | +| **MODIFY** | `scintilla/Scintilla.vcxproj` — swap source/header/define/include | +| **MODIFY** | `scintilla/ScintillaDLL.vcxproj` — swap source/header/define/include | +| **DELETE** | `scintilla/oniguruma/` — entire directory (after validation) | +| **Optional MODIFY** | `src/SciCall.h:107` — rename `OnigRegExFind` to `NP3RegExFind` | +| **Optional MODIFY** | `src/Edit.c:2126` — update function call name | +| **Optional MODIFY** | `src/Styles.c:2533` — update function call name | diff --git a/src/Notepad3.c b/src/Notepad3.c index b0d1e3426..49886c50b 100644 --- a/src/Notepad3.c +++ b/src/Notepad3.c @@ -10,8 +10,8 @@ * * * (c) Rizonesoft 2008-2026 * * https://rizonesoft.com * - * * - * * +* * +* * *******************************************************************************/ #include "Helpers.h"