pUtf8 provides functions for working with UTF-8 strings.

Dependencies: pAssert, pInterp, pName

See also: pUtf8Conv, for converting between UTF-8 and UTF-16.

local pUtf8 = require("p_utf8")

print(pUtf8.check("good string"))
--> 11

print(pUtf8.check("bad" .. string.char(0xff) .. "string"))
--> nil; unknown UTF-8 byte length marker; 4

API

pUtf8.check

Checks a UTF-8 string for encoding problems.

local ok, err, byte = pUtf8.check(s, [i], [j])
  • s: The string to check.

  • [i]: (empty string: 0; non-empty string: 1) The first byte index.

  • [j]: (#s) The last byte index. Cannot be lower than i.

Returns: If no problems were found, the total number of code points scanned. Otherwise, nil, error string, and byte index.

Notes

As a special case, this function will return 0 when given an empty string and values of zero for i and j. (In other words, pUtf8.check("") will always return 0.)

For non-empty strings, if the range arguments are specified, then i needs to point to a UTF-8 Start Byte, and j needs to point to the last byte of a UTF-8-encoded character.
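
The alignment rule for i and j can be illustrated with a small pure-Lua sketch. The helper names (is_continuation, range_is_aligned) are hypothetical and not part of pUtf8, and the check assumes the string is otherwise well-formed UTF-8; how pUtf8.check itself reports a misaligned range is not shown here.

```lua
-- Hypothetical helpers for illustration only; not part of pUtf8.

-- A Continuation Byte has the bit pattern 10xxxxxx (0x80 to 0xBF).
local function is_continuation(b)
  return b ~= nil and b >= 0x80 and b <= 0xBF
end

-- In a well-formed string, i is aligned when it is not a Continuation Byte,
-- and j is aligned when the byte after it is not a Continuation Byte
-- (or when j is the last byte of the string).
local function range_is_aligned(s, i, j)
  return not is_continuation(s:byte(i))
     and not is_continuation(s:byte(j + 1))
end

-- "a", then U+00E9 encoded as two bytes (195, 169), then "b".
local s = "a\195\169b"

print(range_is_aligned(s, 2, 3)) --> true  (exactly the two-byte character)
print(range_is_aligned(s, 1, 2)) --> false (j stops mid-character)
print(range_is_aligned(s, 3, 4)) --> false (i is a Continuation Byte)
```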


pUtf8.codes

A loop iterator for code points in a UTF-8 string, where i is the byte position, c is the code point number, and u is the code point’s UTF-8 substring.

This function raises an error if it encounters a problem with the UTF-8 encoding.

for i, c, u in pUtf8.codes(s) do --[[...]] end
  • s: The string to iterate.

Returns: The byte position i, the code point number c, and the code point’s UTF-8 string representation u.


pUtf8.codeFromString

Gets a Unicode Code Point and its isolated UTF-8 Sequence from a string.

local code, u8_seq = pUtf8.codeFromString(s, [i])
  • s: The UTF-8 string to read. Cannot be empty.

  • [i]: (1) The byte position to read from. Must point to a valid UTF-8 Start Byte.

Returns: The code point number and its equivalent UTF-8 Sequence as a string, or nil plus an error string if unsuccessful.
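
As a rough illustration of what reading one code point involves, here is a minimal pure-Lua decoding sketch. It is not the library's implementation: decode_at is a hypothetical name, the error strings are invented, and the sketch skips checks a full validator would make (overlong forms, surrogates, values above U+10FFFF).

```lua
-- Hypothetical sketch; not pUtf8's implementation.
-- Returns the code point and its UTF-8 substring, or nil plus an error string.
local function decode_at(s, i)
  i = i or 1
  local b1 = s:byte(i)
  if not b1 then
    return nil, "index out of range"
  end

  -- The Start Byte encodes the sequence length and the highest payload bits.
  local len, code
  if b1 < 0x80 then
    len, code = 1, b1
  elseif b1 >= 0xC0 and b1 < 0xE0 then
    len, code = 2, b1 - 0xC0
  elseif b1 >= 0xE0 and b1 < 0xF0 then
    len, code = 3, b1 - 0xE0
  elseif b1 >= 0xF0 and b1 < 0xF8 then
    len, code = 4, b1 - 0xF0
  else
    return nil, "not a Start Byte"
  end

  -- Each Continuation Byte contributes six more bits.
  for k = 1, len - 1 do
    local b = s:byte(i + k)
    if not b or b < 0x80 or b > 0xBF then
      return nil, "missing or bad Continuation Byte"
    end
    code = code * 64 + (b - 0x80)
  end

  return code, s:sub(i, i + len - 1)
end

print(decode_at("A"))                 --> 65; A
print((decode_at("a\195\169b", 2)))   --> 233
```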


pUtf8.concatCodes

Creates a UTF-8 string from one or more code point numbers.

This function raises an error if it encounters a problem with the code point numbers.

local str = pUtf8.concatCodes(...)
  • ...: Code point numbers.

Returns: A concatenated UTF-8 string.

Notes

This function allocates a temporary table. To convert single code points, pUtf8.stringFromCode may be used instead.


pUtf8.getCheckSurrogates

Gets the library’s setting for checking surrogate values.

local enabled = pUtf8.getCheckSurrogates()

Returns: true if surrogates are rejected as invalid, false if they are ignored.


pUtf8.scrub

Replaces bad UTF-8 Sequences in a string.

local str = pUtf8.scrub(s, repl)
  • s: The string to scrub.

  • repl: A replacement string to use in place of the bad UTF-8 Sequences. Use an empty string to remove the invalid bytes.

Returns: The scrubbed UTF-8 string.


pUtf8.setCheckSurrogates

Default: true

Sets the library to check or ignore surrogate values.

pUtf8.setCheckSurrogates(enabled)
  • enabled: true to reject surrogates as invalid, false/nil to ignore them.


pUtf8.step

Searches forward from a byte position for the next Start Byte, up to the end of the string.

This function does not validate the encoding.

local index = pUtf8.step(s, i)
  • s: The string to search.

  • i: Starting position; bytes after this index are checked. Can be from 0 to #s.

Returns: Index of the next Start Byte, or nil if the end of the string is reached.

Notes

With empty strings, the only accepted position for i is 0.
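
The forward search described above can be sketched in pure Lua. This is an illustration, not the library's code: step_sketch is a hypothetical name, and it assumes that any byte outside the Continuation Byte range (0x80 to 0xBF) counts as a Start Byte.

```lua
-- Hypothetical sketch; not pUtf8's implementation.
-- Scans the bytes after index i and returns the index of the first byte
-- that is not a Continuation Byte, or nil if none is found.
local function step_sketch(s, i)
  for pos = i + 1, #s do
    local b = s:byte(pos)
    if b < 0x80 or b >= 0xC0 then
      return pos
    end
  end
  return nil
end

-- "a", then U+00E9 encoded as two bytes (195, 169), then "b".
local s = "a\195\169b"

print(step_sketch(s, 0)) --> 1
print(step_sketch(s, 2)) --> 4 (byte 3 is a Continuation Byte and is skipped)
print(step_sketch(s, 4)) --> nil
```

Starting at 0 and feeding each result back in as i visits every Start Byte in order, which matches the documented range of 0 to the string length.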


pUtf8.stepBack

Searches backward from a byte position for the previous Start Byte, down to the start of the string.

This function does not validate the encoding.

local index = pUtf8.stepBack(s, i)
  • s: The string to search.

  • i: Starting position; bytes before this index are checked. Can be from 1 to #s + 1.

Returns: Index of the previous Start Byte, or nil if the start of the string is reached.

Notes

With empty strings, the only accepted position for i is 1.
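
The backward search can be sketched the same way. Again, this is only an illustration under the assumption that any byte outside 0x80 to 0xBF counts as a Start Byte; step_back_sketch is a hypothetical name, not a pUtf8 function.

```lua
-- Hypothetical sketch; not pUtf8's implementation.
-- Scans the bytes before index i, moving toward the start of the string,
-- and returns the index of the first byte that is not a Continuation Byte.
local function step_back_sketch(s, i)
  for pos = i - 1, 1, -1 do
    local b = s:byte(pos)
    if b < 0x80 or b >= 0xC0 then
      return pos
    end
  end
  return nil
end

-- "a", then U+00E9 encoded as two bytes (195, 169), then "b".
local s = "a\195\169b"

print(step_back_sketch(s, #s + 1)) --> 4
print(step_back_sketch(s, 4))      --> 2 (byte 3 is a Continuation Byte and is skipped)
print(step_back_sketch(s, 1))      --> nil
```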


pUtf8.stringFromCode

Converts a code point in numeric form to a UTF-8 Sequence (string).

local u8_seq, err = pUtf8.stringFromCode(c)
  • c: The code point number.

Returns: The UTF-8 Sequence (string), or nil plus an error string if unsuccessful.
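
For readers unfamiliar with the encoding, the conversion from a code point to a UTF-8 Sequence can be sketched in pure Lua as follows. This is not the library's implementation: encode_sketch is a hypothetical name, the error strings are invented, and the surrogate rejection simply mirrors the documented default of pUtf8.setCheckSurrogates.

```lua
-- Hypothetical sketch; not pUtf8's implementation.
-- Returns the UTF-8 Sequence for code point c, or nil plus an error string.
local function encode_sketch(c)
  if type(c) ~= "number" or c % 1 ~= 0 or c < 0 or c > 0x10FFFF then
    return nil, "code point out of range"
  elseif c >= 0xD800 and c <= 0xDFFF then
    return nil, "surrogate value"
  elseif c < 0x80 then
    -- One byte: 0xxxxxxx
    return string.char(c)
  elseif c < 0x800 then
    -- Two bytes: 110xxxxx 10xxxxxx
    return string.char(
      0xC0 + math.floor(c / 64),
      0x80 + c % 64)
  elseif c < 0x10000 then
    -- Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
    return string.char(
      0xE0 + math.floor(c / 4096),
      0x80 + math.floor(c / 64) % 64,
      0x80 + c % 64)
  else
    -- Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return string.char(
      0xF0 + math.floor(c / 262144),
      0x80 + math.floor(c / 4096) % 64,
      0x80 + math.floor(c / 64) % 64,
      0x80 + c % 64)
  end
end

print(encode_sketch(65))          --> A
print(#encode_sketch(0x20AC))     --> 3
print(encode_sketch(0xD800))      --> nil; surrogate value
```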

Module Notes

Terminology

Code Point: A Unicode Code Point, stored as a Lua number. E.g. 65 (for "A")

UTF-8 Sequence: A single Unicode Code Point, encoded in UTF-8 and stored as a Lua string. E.g. "A"

Start Byte: The first byte in a UTF-8 Sequence. The length of the sequence is encoded in the start byte.

Continuation Byte: The second, third or fourth byte in a UTF-8 Sequence. A UTF-8 Sequence may not be longer than 4 bytes.

Surrogate: Values in the range of U+D800 to U+DFFF are reserved for surrogate pairs in UTF-16, and are not valid in UTF-8-encoded text.
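
The terms above map onto byte and number ranges that can be written down directly. The following pure-Lua sketch is for illustration only; classify_byte and is_surrogate are hypothetical names and not part of pUtf8.

```lua
-- Hypothetical helpers for illustration only; not part of pUtf8.

-- Classifies a single byte value (0 to 255).
-- For a Start Byte, also returns the sequence length it announces.
local function classify_byte(b)
  if b < 0x80 then
    return "start", 1        -- 0xxxxxxx
  elseif b < 0xC0 then
    return "continuation"    -- 10xxxxxx
  elseif b < 0xE0 then
    return "start", 2        -- 110xxxxx
  elseif b < 0xF0 then
    return "start", 3        -- 1110xxxx
  elseif b < 0xF8 then
    return "start", 4        -- 11110xxx
  else
    return "invalid"         -- no valid length marker (e.g. 0xFF)
  end
end

-- True if the code point number falls in the UTF-16 surrogate range.
local function is_surrogate(c)
  return c >= 0xD800 and c <= 0xDFFF
end

print(classify_byte(0x41))  --> start; 1
print(classify_byte(0xA9))  --> continuation
print(classify_byte(0xFF))  --> invalid
print(is_surrogate(0xD800)) --> true
```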