Terminology

  • The term URI is used in place of namespace name because there are already too many things in the XML spec with the word name in their names.

  • PEReference is used in place of parameter entity reference.

Handling of PEReferences

(This section originates from a pile of miscellaneous comments for the function sym.PEReference() in lxl_in.lua.)

As a non-validating processor, we are not obligated to process entity declarations within the replacement text of a PEReference. (See: Well-formedness constraint: Entity Declared.) In practice, this means that LXL never expands PEReferences.

Unless standalone is yes, we are not required to process declarations after the first PEReference that is ignored in the internal subset (see: §4.4.8 Included as PE).

In the DTD internal subset, PEReferences cannot appear inside markup declarations. (No nesting, basically.) This rule doesn’t apply to external subsets, which LXL doesn’t touch because it’s non-validating. (See: Well-formedness constraint: PEs in Internal Subset.)

There is a rule to expand PEReferences when they appear in the text of an EntityValue (see: §4.4.5 Included in Literal). In the internal subset, this rule never comes into play because it is overridden by Well-formedness constraint: PEs in Internal Subset.

As mentioned in Tim Bray’s annotated version of the original XML spec, the text of Well-formedness constraint: In DTD is a little misleading. It says that PEReferences must not appear outside of the DTD. What it means is that substrings like %foobar; just won’t be recognized as PEReferences elsewhere.
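The point about recognition can be illustrated with a small, self-contained Lua sketch. It is not taken from lxl_in.lua: the function names are hypothetical, and the pattern is an ASCII-only approximation of the XML Name production. It shows that the same %foobar; substring only counts as a PEReference when the caller says it is inside the DTD.

```lua
-- ASCII-only approximation of the XML Name production (an assumption made
-- for brevity; real XML names also allow many non-ASCII characters).
local PE_REF_PATTERN = "%%([%a_:][%w_:%.%-]*);"

-- Returns a list of the names found in PEReference-like substrings of `text`.
local function findPEReferenceCandidates(text)
    local names = {}
    for name in text:gmatch(PE_REF_PATTERN) do
        names[#names + 1] = name
    end
    return names
end

-- Whether a candidate counts as a PEReference depends on where it appears.
local function countPEReferences(text, in_dtd)
    if not in_dtd then
        -- Outside the DTD, "%foobar;" is ordinary character data.
        return 0
    end
    return #findPEReferenceCandidates(text)
end
```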

Unexpanded Entity References

Here are some more notes about Unexp nodes.

Normally, an unexpanded entity reference halts processing. Unexp nodes can appear in the output only when all of the following conditions are met:

  • The standalone property is no or was not set.

  • A PEReference is encountered in the DTD internal subset and not read. (This processor skips all PEReferences; see §4.4.8 Included as PE.)

  • Later, a general entity reference is encountered in the document content, and the XML processor did not collect a declaration that defines it.

You will not encounter unexpanded references in a document that lacks a DTD internal subset, or if standalone is yes, or if the internal subset contains no PEReferences. You will also not see them within attribute values, where failing to dereference a general entity is always an error.
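The conditions above can be summarized as a small decision function. This is a minimal sketch, not LXL's code: classifyEntityReference and its arguments are hypothetical names, and the function only restates the rules in this section.

```lua
-- Hypothetical helper; not part of LXL. The arguments describe what a
-- parser has seen by the time it reaches a general entity reference.
--   standalone:     "yes", "no", or nil (not set)
--   skipped_pe_ref: true if a PEReference in the internal subset was skipped
--   is_declared:    true if a declaration for the entity was collected
--   in_attribute:   true if the reference appears in an attribute value
local function classifyEntityReference(standalone, skipped_pe_ref, is_declared, in_attribute)
    if is_declared then
        return "expand"
    end
    if in_attribute then
        -- Failing to dereference inside an attribute value is always an error.
        return "error"
    end
    if standalone ~= "yes" and skipped_pe_ref then
        -- The missing declaration might have come from the skipped PEReference.
        return "unexpanded"
    end
    return "error"
end
```

For example, classifyEntityReference(nil, true, false, false) yields "unexpanded", while the same call with standalone set to "yes" yields "error".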

The Unexp node addresses the requirement in §4.4.3 Included If Validating to inform the application that an entity was recognized but not dereferenced.

You can halt on all unexpanded entities with XmlParser:setRejectUnexpandedEntities(true).

Serializing DOCTYPE

LXL does not fully capture the state of the DOCTYPE tag, and it does not serialize out the Doctype node (which just contains the DOCTYPE name, and any comments or PIs found within the DTD internal subset). That said, the following methods are provided for collecting the DOCTYPE substring from a document and writing it out.

Collecting DOCTYPE with XmlParser

Call XmlParser:setCopyDoctype(true). If the incoming document contains a DOCTYPE tag, then its substring will be assigned to xml_object.doctype_str.

Writing DOCTYPE with XmlObject

Call XmlParser:setWriteDoctype(true). If xml_object.doctype_str is a string, then its contents will be inserted before the root element when serializing out. Note that the well-formedness of doctype_str is not verified at all.
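The behaviour described above can be sketched with a toy serializer. This is an illustration only: serializeWithDoctype is a hypothetical function, the root element is passed in as an already-serialized string, and the newline separator is an assumption rather than something LXL is known to emit.

```lua
-- Toy stand-in for a serializer; not LXL's implementation.
--   xml_object:    any table that may carry a doctype_str field
--   root_markup:   the already-serialized root element
--   write_doctype: mirrors the "write DOCTYPE" setting
local function serializeWithDoctype(xml_object, root_markup, write_doctype)
    local parts = {}
    -- doctype_str is copied through as-is; nothing checks that it is well-formed.
    if write_doctype and type(xml_object.doctype_str) == "string" then
        parts[#parts + 1] = xml_object.doctype_str
    end
    parts[#parts + 1] = root_markup
    -- The newline separator is an assumption made for readability.
    return table.concat(parts, "\n")
end
```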

Invalid Namespace State

When an XmlParser is configured to handle XML Namespaces, it halts on any namespace-related problem that it encounters while converting a string to an XmlObject.

The namespace state is invalid when:

  • Entity names, processing instruction targets, or notation declaration names contain :.

  • An element or attribute name contains more than one :, or the prefix or local parts of a QName are empty (:bar, foo:).

  • Any prefix used in an element or attribute name is not declared in the current scope.

  • A namespace declaration attempts to bind the namespace http://www.w3.org/XML/1998/namespace to any prefix other than xml.

  • A namespace declaration attempts to bind the namespace http://www.w3.org/2000/xmlns/.

  • (XML Namespaces 1.0) A namespace declaration attempts to undeclare a prefix by binding it to an empty string (xmlns:foo="").

  • An element contains duplicate namespaced attributes which resolve to the same URI + local name pair.
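Two of the checks above, the QName shape rule and the duplicate-attribute rule, can be sketched in plain Lua. The functions splitQName and findDuplicateAttribute are hypothetical names, not LXL's API, and the sketch assumes the prefix bindings in scope are already available as a simple prefix-to-URI table.

```lua
-- Splits a QName. Returns true, prefix, local_part on success (prefix is ""
-- for an unprefixed name), or false plus an error message.
local function splitQName(qname)
    local prefix, local_part = qname:match("^([^:]*):([^:]*)$")
    if not prefix then
        if qname:find(":", 1, true) then
            return false, "more than one ':' in name"
        end
        return true, "", qname -- unprefixed name
    end
    if prefix == "" or local_part == "" then
        return false, "empty prefix or local part"
    end
    return true, prefix, local_part
end

-- `attrs` is a list of attribute QNames; `bindings` maps prefixes to URIs.
-- Returns the first QName that repeats an earlier URI + local name pair,
-- or nil if there is none. On a malformed name or undeclared prefix,
-- returns nil plus an error message.
-- Unprefixed attributes are in no namespace and are skipped here; plain
-- duplicate names are already a well-formedness error in XML itself.
local function findDuplicateAttribute(attrs, bindings)
    local seen = {} -- seen[uri][local_part] = true
    for _, qname in ipairs(attrs) do
        local ok, prefix, local_part = splitQName(qname)
        if not ok then
            return nil, prefix -- second return value of splitQName is the message
        end
        if prefix ~= "" then
            local uri = bindings[prefix]
            if not uri then
                return nil, "undeclared prefix: " .. prefix
            end
            seen[uri] = seen[uri] or {}
            if seen[uri][local_part] then
                return qname
            end
            seen[uri][local_part] = true
        end
    end
    return nil
end
```

With bindings { a = "urn:one", b = "urn:one" }, the attributes a:x and b:x collide even though their prefixes differ, because both resolve to the same URI and local name.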

The situation is different when manipulating an XmlObject node tree. The namespace state in elements is not cached (attempts to do so resulted in brittle code), and the tree’s namespace state can be rendered invalid by just renaming an element. You can call XmlObject:checkNamespaceState() after changing the tree to perform the same checks as the parser.
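To see why a rename alone can invalidate the tree, consider the following toy model. The node shape (name, xmlns, children) and the checkPrefixes function are assumptions made for this sketch and do not reflect LXL's actual node structure or the full set of checks in XmlObject:checkNamespaceState(). It only verifies that element name prefixes are declared in scope.

```lua
local XML_URI = "http://www.w3.org/XML/1998/namespace"

-- Toy node shape (hypothetical, not LXL's):
--   { name = "prefix:local", xmlns = { prefix = uri }, children = { ... } }
-- Walks the tree and reports the first element whose prefix is not in scope.
-- Attribute names are not checked in this sketch.
local function checkPrefixes(node, parent_scope)
    parent_scope = parent_scope or { xml = XML_URI }
    -- Each element inherits its parent's bindings and may add its own.
    local scope = setmetatable({}, { __index = parent_scope })
    for prefix, uri in pairs(node.xmlns or {}) do
        scope[prefix] = uri
    end
    local prefix = node.name:match("^([^:]+):")
    if prefix and not scope[prefix] then
        return false, "undeclared prefix '" .. prefix .. "' in <" .. node.name .. ">"
    end
    for _, child in ipairs(node.children or {}) do
        local ok, err = checkPrefixes(child, scope)
        if not ok then
            return false, err
        end
    end
    return true
end

-- A tree that starts out valid and is broken by a rename alone.
local tree = {
    name = "a:root",
    xmlns = { a = "urn:example:a" },
    children = { { name = "a:child" } },
}
local valid_before = checkPrefixes(tree)
tree.children[1].name = "b:child" -- no declaration for "b" exists
local valid_after = checkPrefixes(tree)
```

Because nothing is cached, the scope is rebuilt on every walk, which is why an explicit check after editing is the simplest way to catch this kind of breakage.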