PILE Base: pUtf8

pUtf8 provides functions for working with UTF-8 strings.

See also: pUtf8Conv, for converting between UTF-8 and UTF-16.

local pUtf8 = require("p_utf8")

print(pUtf8.check("good string"))
--> 11

print(pUtf8.check("bad" .. string.char(0xff) .. "string"))
--> nil; unknown UTF-8 byte length marker; 4

Table of Contents

API
Module Notes
- Terminology
- Links

API

pUtf8.check

Checks a UTF-8 string for encoding problems.

local ok, err, byte = pUtf8.check(s, [i], [j])

s: The string to check.
[i]: (empty string: 0; non-empty string: 1) The first byte index.
[j]: (#s) The last byte index. Cannot be lower than i.

Returns: If no problems were found, the total number of code points scanned. Otherwise, nil, an error string, and the byte index where the problem was found.

Notes

As a special case, this function will return 0 when given an empty string and values of zero for i and j. (In other words, pUtf8.check("") will always return 0.)

For non-empty strings, if the range arguments are specified, then i needs to point to a UTF-8 Start Byte, and j needs to point to the last byte of a UTF-8-encoded character.
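To illustrate how i and j line up with sequence boundaries, here is a minimal pure-Lua sketch. The helper seqLength is hypothetical and not part of pUtf8. It only reads the length marker of a Start Byte, and it makes no claim about how the library does this internally.

```lua
-- Hypothetical helper (not part of pUtf8): returns the expected length of a
-- UTF-8 Sequence from its first byte, or nil if the byte cannot start one.
-- It only reads the length marker. A full validator would also reject
-- 0xC0, 0xC1 and 0xF5..0xF7.
local function seqLength(byte)
    if byte < 0x80 then return 1        -- 0xxxxxxx: single-byte (ASCII)
    elseif byte < 0xC0 then return nil  -- 10xxxxxx: Continuation Byte
    elseif byte < 0xE0 then return 2    -- 110xxxxx
    elseif byte < 0xF0 then return 3    -- 1110xxxx
    elseif byte < 0xF8 then return 4    -- 11110xxx
    end
    return nil                          -- 0xF8..0xFF: no valid length marker
end

-- "a", then U+00E9 encoded as two bytes, then "b".
local s = "a" .. string.char(0xC3, 0xA9) .. "b"

local i = 2                              -- byte 2 is a Start Byte
local j = i + seqLength(s:byte(i)) - 1   -- last byte of that same sequence
print(j) --> 3
```

With these values, the range i = 2, j = 3 covers exactly one encoded character, which is the kind of range described above.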

pUtf8.codes

A loop iterator for code points in a UTF-8 string, where i is the byte position, c is the code point number, and u is the code point’s UTF-8 substring.

This function raises an error if it encounters a problem with the UTF-8 encoding.

for i, c, u in pUtf8.codes(s) do --[[...]] end

s: The string to iterate.

Returns: The byte position i, the code point number c, and the code point’s UTF-8 string representation u.

pUtf8.codeFromString

Gets a Unicode Code Point and its isolated UTF-8 Sequence from a string.

local code, u8_seq = pUtf8.codeFromString(s, [i])

s: The UTF-8 string to read. Cannot be empty.
[i]: (1) The byte position to read from. Must point to a valid UTF-8 Start Byte.

Returns: The code point number and its equivalent UTF-8 Sequence as a string, or nil plus an error string if unsuccessful.
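As a rough picture of what reading one code point involves, the following pure-Lua sketch decodes a single UTF-8 Sequence at a byte position. The function decodeAt and its error strings are hypothetical. It is deliberately minimal (no checks for overlong forms, surrogates, or values above U+10FFFF) and should not be read as the library's implementation.

```lua
-- Hypothetical helper for illustration; not pUtf8's implementation.
-- Checks the length marker and the Continuation Bytes only.
local function decodeAt(s, i)
    i = i or 1
    local b1 = s:byte(i)
    if not b1 then
        return nil, "index out of range"
    end

    -- Read the sequence length and the payload bits of the Start Byte.
    local len, code
    if b1 < 0x80 then len, code = 1, b1
    elseif b1 < 0xC0 then return nil, "not a Start Byte"
    elseif b1 < 0xE0 then len, code = 2, b1 - 0xC0
    elseif b1 < 0xF0 then len, code = 3, b1 - 0xE0
    elseif b1 < 0xF8 then len, code = 4, b1 - 0xF0
    else return nil, "unknown length marker"
    end

    -- Each Continuation Byte (10xxxxxx) contributes six more bits.
    for k = i + 1, i + len - 1 do
        local b = s:byte(k)
        if not b or b < 0x80 or b > 0xBF then
            return nil, "bad or missing Continuation Byte"
        end
        code = code * 0x40 + (b - 0x80)
    end

    return code, s:sub(i, i + len - 1)
end

-- "a", then U+20AC encoded as three bytes, then "b".
local s = "a" .. string.char(0xE2, 0x82, 0xAC) .. "b"
local code, u8_seq = decodeAt(s, 2)
print(code, #u8_seq) -- code is 8364 (0x20AC), and the sequence is 3 bytes long
```

Pointing i at byte 3 (a Continuation Byte) makes the sketch return nil plus an error string, mirroring the requirement that i point to a Start Byte.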

pUtf8.concatCodes

Creates a UTF-8 string from one or more code point numbers.

This function raises an error if it encounters a problem with the code point numbers.

local str = pUtf8.concatCodes(...)

...: Code point numbers.

Returns: A concatenated UTF-8 string.

Notes

This function allocates a temporary table. To convert single code points, pUtf8.stringFromCode may be used instead.

pUtf8.getCheckSurrogates

Gets the library’s setting for checking surrogate values.

local enabled = pUtf8.getCheckSurrogates()

Returns: true if surrogates are rejected as invalid, false if they are ignored.

pUtf8.scrub

Replaces bad UTF-8 Sequences in a string.

local str = pUtf8.scrub(s, repl)

s: The string to scrub.
repl: A replacement string to use in place of the bad UTF-8 Sequences. Use an empty string to remove the invalid bytes.

Returns: The scrubbed UTF-8 string.

pUtf8.setCheckSurrogates

Default: true

Sets the library to check or ignore surrogate values.

pUtf8.setCheckSurrogates(enabled)

enabled: true to reject surrogates as invalid, false/nil to ignore them.

pUtf8.step

Looks for a Start Byte from a byte position through to the end of the string.

This function does not validate the encoding.

local index = pUtf8.step(s, i)

s: The string to search.
i: Starting position; bytes after this index are checked. Can be from 0 to #s.

Returns: Index of the next Start Byte, or nil if the end of the string is reached.

Notes

With empty strings, the only accepted position for i is 0.
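The forward search can be pictured with a short pure-Lua sketch. The function nextStartByte is a hypothetical stand-in, not pUtf8.step itself, and it assumes that any byte which is not a Continuation Byte counts as a Start Byte (no validation).

```lua
-- Hypothetical stand-in for illustration only; not pUtf8's implementation.
-- Returns the index of the first non-continuation byte after position i,
-- or nil if the end of the string is reached.
local function nextStartByte(s, i)
    for pos = i + 1, #s do
        local b = s:byte(pos)
        -- Continuation Bytes have the form 10xxxxxx (0x80..0xBF).
        if b < 0x80 or b > 0xBF then
            return pos
        end
    end
    return nil
end

-- "a", then U+20AC encoded as three bytes, then "b" (5 bytes in total).
local s = "a" .. string.char(0xE2, 0x82, 0xAC) .. "b"

print(nextStartByte(s, 0)) --> 1
print(nextStartByte(s, 1)) --> 2
print(nextStartByte(s, 2)) --> 5
print(nextStartByte(s, 5)) --> nil
```

Starting at 0 and feeding each result back in as the next i visits bytes 1, 2 and 5, which are the starts of the three characters.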

pUtf8.stepBack

Looks for a Start Byte from a byte position, traveling backwards, to the start of the string.

This function does not validate the encoding.

local index = pUtf8.stepBack(s, i)

s: The string to search.
i: Starting position; bytes before this index are checked. Can be from 1 to #s + 1.

Returns: Index of the previous Start Byte, or nil if the start of the string is reached.

Notes

With empty strings, the only accepted position for i is 1.
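The backward search can be sketched the same way in pure Lua. The function prevStartByte is a hypothetical stand-in, not pUtf8.stepBack itself, and it assumes that any byte which is not a Continuation Byte counts as a Start Byte (no validation).

```lua
-- Hypothetical stand-in for illustration only; not pUtf8's implementation.
-- Returns the index of the first non-continuation byte before position i,
-- or nil if the start of the string is reached.
local function prevStartByte(s, i)
    for pos = i - 1, 1, -1 do
        local b = s:byte(pos)
        -- Continuation Bytes have the form 10xxxxxx (0x80..0xBF).
        if b < 0x80 or b > 0xBF then
            return pos
        end
    end
    return nil
end

-- "a", then U+20AC encoded as three bytes, then "b" (5 bytes in total).
local s = "a" .. string.char(0xE2, 0x82, 0xAC) .. "b"

print(prevStartByte(s, #s + 1)) --> 5
print(prevStartByte(s, 5))      --> 2
print(prevStartByte(s, 2))      --> 1
print(prevStartByte(s, 1))      --> nil
```

Starting at #s + 1 lets the walk include the final character, which is why the accepted range for i extends one past the end of the string.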

pUtf8.stringFromCode

Converts a code point in numeric form to a UTF-8 Sequence (string).

local u8_seq, err = pUtf8.stringFromCode(c)

c: The code point number.

Returns: The UTF-8 Sequence (string), or nil plus an error string if unsuccessful.
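For readers curious about the conversion itself, here is a minimal pure-Lua sketch of encoding a code point as UTF-8. The function encodeCodePoint and its error strings are hypothetical. It always rejects surrogates, whereas the library makes that configurable, and it is not the library's implementation. Arithmetic is used instead of bitwise operators so the sketch also runs on Lua 5.1.

```lua
-- Hypothetical helper for illustration; not pUtf8's implementation.
local function encodeCodePoint(c)
    local floor, char = math.floor, string.char

    if type(c) ~= "number" or c ~= floor(c) or c < 0 or c > 0x10FFFF then
        return nil, "code point out of range"
    elseif c >= 0xD800 and c <= 0xDFFF then
        return nil, "surrogate value"
    end

    if c < 0x80 then
        -- 1 byte: 0xxxxxxx
        return char(c)
    elseif c < 0x800 then
        -- 2 bytes: 110xxxxx 10xxxxxx
        return char(0xC0 + floor(c / 0x40),
                    0x80 + c % 0x40)
    elseif c < 0x10000 then
        -- 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return char(0xE0 + floor(c / 0x1000),
                    0x80 + floor(c / 0x40) % 0x40,
                    0x80 + c % 0x40)
    end
    -- 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return char(0xF0 + floor(c / 0x40000),
                0x80 + floor(c / 0x1000) % 0x40,
                0x80 + floor(c / 0x40) % 0x40,
                0x80 + c % 0x40)
end

print(encodeCodePoint(65))                   --> A
print(encodeCodePoint(0x20AC):byte(1, -1))   -- bytes 226, 130, 172
print(encodeCodePoint(0xD800))               -- nil plus an error string
```

The return shape (a string on success, nil plus an error string on failure) follows the convention documented for pUtf8.stringFromCode above.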

Module Notes

Terminology

Code Point: a Unicode Code Point, stored as a Lua number. E.g. 65 (for A)

UTF-8 Sequence: A single Unicode Code Point, encoded in UTF-8 and stored as a Lua string. E.g. "A"

Start Byte: The first byte in a UTF-8 Sequence. The length of the sequence is encoded in the start byte.

Continuation Byte: The second, third or fourth byte in a UTF-8 Sequence. A UTF-8 Sequence may not be longer than 4 bytes.

Surrogate: Values in the range of U+D800 to U+DFFF are reserved for surrogate pairs in UTF-16. They are not valid Unicode scalar values and must not be encoded in UTF-8.

Links

VERSION: 2.106