chacana.grammar

PEG grammar definition and input preprocessing.

Arpeggio PEG grammar for Chacana tensor expressions.

normalize_input(expr: str) -> str

Apply Unicode NFC normalization and reject out-of-scope characters.

This function MUST be called before parsing to prevent visual spoofing and ensure canonical representation of characters.

Raises:

ChacanaParseError: if the expression contains characters from disallowed Unicode blocks (e.g., Cyrillic, Arabic, CJK).
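
The two goals named above (canonical representation and protection against visual spoofing) are separate concerns, which is why normalization alone is not enough. The following standalone sketch uses only the standard library, not Chacana itself, and its variable names are illustrative. It shows that NFC unifies composed and decomposed forms of the same character but leaves look-alike letters from other scripts untouched.

```python
import unicodedata

decomposed = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT
composed = "\u00e9"     # precomposed LATIN SMALL LETTER E WITH ACUTE

# The two strings render identically but compare unequal until normalized.
same_before = decomposed == composed
same_after = unicodedata.normalize("NFC", decomposed) == composed

# NFC does not merge look-alike letters from different scripts:
# Cyrillic 'а' (U+0430) stays distinct from Latin 'a' (U+0061).
latin_a = "a"
cyrillic_a = "\u0430"
still_distinct = unicodedata.normalize("NFC", cyrillic_a) != latin_a
cyrillic_name = unicodedata.name(cyrillic_a)
```

Because the Cyrillic letter survives normalization unchanged, an explicit script check is still needed to catch it.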

Source code in src/chacana/grammar.py
def normalize_input(expr: str) -> str:
    """Apply Unicode NFC normalization and reject out-of-scope characters.

    This function MUST be called before parsing to prevent visual spoofing
    and ensure canonical representation of characters.

    Raises:
        ChacanaParseError: if the expression contains characters from
            disallowed Unicode blocks (e.g., Cyrillic, Arabic, CJK).
    """
    # Step 1: NFC normalization
    normalized = unicodedata.normalize("NFC", expr)

    # Step 2: Reject disallowed Unicode blocks (fast blocklist check)
    match = _DISALLOWED_UNICODE_RE.search(normalized)
    if match:
        char = match.group()
        codepoint = ord(char)
        try:
            name = unicodedata.name(char)
        except ValueError:
            name = "UNKNOWN"
        if 0x0400 <= codepoint <= 0x052F:
            block = "Cyrillic"
        elif 0x0600 <= codepoint <= 0x06FF:
            block = "Arabic"
        elif 0x3000 <= codepoint <= 0x9FFF:
            block = "CJK"
        else:
            block = "disallowed"
        raise ChacanaParseError(
            f"Unicode character U+{codepoint:04X} ({name}) from {block} "
            f"block is not allowed. Only Latin and Greek characters are "
            f"permitted in Chacana expressions."
        )

    # Step 3: Allowlist check for any remaining exotic characters
    if not _ALLOWED_LETTER_RE.match(normalized):
        # Find the offending character
        for i, ch in enumerate(normalized):
            if not re.match(
                r"[a-zA-Z0-9\u00C0-\u024F\u0300-\u036F"
                r"\u0386\u0388-\u038A\u038C\u038E-\u038F"
                r"\u0391-\u03A1\u03A3-\u03A9\u03AC-\u03CE"
                r"\u1E00-\u1EFF\s"
                r"\+\-\*/\^\{\}\[\]\(\)_;,\.@]",
                ch,
            ):
                codepoint = ord(ch)
                try:
                    name = unicodedata.name(ch)
                except ValueError:
                    name = "UNKNOWN"
                raise ChacanaParseError(
                    f"Unicode character U+{codepoint:04X} ({name}) at "
                    f"position {i} is not allowed. Only Latin and Greek "
                    f"characters are permitted in Chacana expressions."
                )

    return normalized
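
The per-character scan in step 3 can also be expressed with a pattern compiled once at import time. The sketch below is one possible refactoring, not the module's actual code: `_ALLOWED_CHAR_RE` and `first_disallowed` are hypothetical names, and the character class is copied from the listing above.

```python
import re
from typing import Optional, Tuple

# Hypothetical module-level constant: the same character class used by the
# per-character check in step 3, compiled a single time.
_ALLOWED_CHAR_RE = re.compile(
    r"[a-zA-Z0-9\u00C0-\u024F\u0300-\u036F"
    r"\u0386\u0388-\u038A\u038C\u038E-\u038F"
    r"\u0391-\u03A1\u03A3-\u03A9\u03AC-\u03CE"
    r"\u1E00-\u1EFF\s"
    r"\+\-\*/\^\{\}\[\]\(\)_;,\.@]"
)


def first_disallowed(text: str) -> Optional[Tuple[int, str]]:
    """Return (index, character) of the first character outside the
    allowlist, or None if every character is allowed."""
    for i, ch in enumerate(text):
        if not _ALLOWED_CHAR_RE.match(ch):
            return i, ch
    return None
```

Returning the position and character together keeps the error-message formatting in the caller, which could then build the same `ChacanaParseError` text as the original loop.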

parse_and_validate(expr: str) -> Any

Parse an expression string and apply post-parse validations.

This function obtains a cached parser, parses the input, and runs post-parse checks such as nested symmetrization rejection. It does not normalize the input itself, so callers should pass the string returned by normalize_input.

Parameters:

expr (str, required): The expression string (should already be NFC-normalized).

Returns:

Any: The Arpeggio parse tree.

Raises:

ChacanaParseError: on parse failure or validation failure.

NoMatch: on grammar mismatch (re-raised as-is for callers that catch it directly).

Source code in src/chacana/grammar.py
def parse_and_validate(expr: str) -> Any:
    """Parse an expression string and apply post-parse validations.

    This function uses a cached parser, parses the (already normalized) input,
    and runs post-parse checks such as nested symmetrization rejection.

    Args:
        expr: The expression string (should already be NFC-normalized).

    Returns:
        The Arpeggio parse tree.

    Raises:
        ChacanaParseError: on parse failure or validation failure.
        arpeggio.NoMatch: on grammar mismatch (re-raised as-is for
            callers that catch it directly).
    """
    parser = _get_parser()
    tree = parser.parse(expr)
    _reject_nested_symmetrization(tree)
    return tree
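
The body of `_get_parser` is not shown on this page. As a rough illustration of what "cached parser" can mean, the sketch below uses `functools.lru_cache` so that the parser object is built once and reused on every later call. This is an assumption about the pattern, not the module's implementation; `StubParser` and `get_cached_parser` are hypothetical stand-ins.

```python
from functools import lru_cache


class StubParser:
    """Hypothetical stand-in for a real parser; counts how often it is built."""

    instances = 0

    def __init__(self) -> None:
        StubParser.instances += 1

    def parse(self, text: str) -> list:
        # Placeholder "parse": split on whitespace.
        return text.split()


@lru_cache(maxsize=1)
def get_cached_parser() -> StubParser:
    # The body runs only on the first call; later calls return the
    # same object from the cache.
    return StubParser()


def parse_twice() -> int:
    get_cached_parser().parse("a + b")
    get_cached_parser().parse("c * d")
    return StubParser.instances
```

Building a PEG parser from grammar rules has a setup cost, so reusing one instance across calls is a common choice when many short expressions are parsed.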

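create_parser(**kwargs: Any) -> ParserPython

Create an Arpeggio ParserPython rooted at the expression rule, with whitespace skipping enabled. Any extra keyword arguments are forwarded to ParserPython.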

Source code in src/chacana/grammar.py
def create_parser(**kwargs: Any) -> ParserPython:
    return ParserPython(expression, skipws=True, **kwargs)