This document records all known errors in the Second Edition of the Extensible Markup Language (XML) 1.0 Specification; for updates see the latest version.
The errata are numbered, classified as Substantive, Editorial or Clarification and listed in reverse chronological order of their date of publication. Changes to the text of the spec are indicated thus: deleted text, new text, modified text.
Please email error reports to xml-editor@w3.org.
Last paragraph: add a new 3rd sentence:
"Specifically, it is a fatal error if an entity encoded in UTF-8 contains any irregular code unit sequences, as defined in Unicode 3.1."
with a reference to Unicode 3.1.
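As a rough illustration of the kind of check this erratum calls for, the sketch below uses Python's strict UTF-8 decoder to reject ill-formed input. It is only an approximation under stated assumptions: Python follows a later Unicode definition of well-formed UTF-8, which happens to reject the six-byte surrogate-pair encodings that Unicode 3.1 calls irregular, and the function name is hypothetical.

```python
def is_acceptable_utf8(data: bytes) -> bool:
    """Return True if data decodes under Python's strict UTF-8 rules."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# U+10000 encoded directly as four bytes is accepted.
direct = "\U00010000".encode("utf-8")
# The same character written as two three-byte surrogate encodings
# (six bytes in total) is rejected by the strict decoder.
via_surrogates = b"\xed\xa0\x80\xed\xb0\x80"
```

A processor built this way would report a fatal error whenever the function returns False for a UTF-8 entity.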
Change the [Unicode3] entry (leaving the anchor name unchanged) to read:
The Unicode Consortium. The Unicode Standard, Version 3.1, defined by: The Unicode Standard, Version 3.0 (Reading, MA, Addison-Wesley, 2000. ISBN 0-201-61633-5), as amended by the Unicode Standard Annex #27: Unicode 3.1 (http://www.unicode.org/reports/tr27).
Rewrite the paragraph beginning "[Definition: The SystemLiteral is called the entity's system identifier.", the following paragraph and the following numbered list, so that they read:
[Definition: The SystemLiteral is called the entity's system identifier. It is meant to be converted to a URI reference (as defined in [IETF RFC 2396], updated by [IETF RFC 2732]), as part of the process of dereferencing it to obtain input for the XML processor to construct the entity's replacement text.] It is an error for a fragment identifier (beginning with a # character) to be part of a system identifier. Unless otherwise provided by information outside the scope of this specification (e.g. a special XML element type defined by a particular DTD, or a processing instruction defined by a particular application specification), relative URIs are relative to the location of the resource within which the entity declaration occurs. A URI might thus be relative to the document entity, to the entity containing the external DTD subset, or to some other external parameter entity.
System identifiers (and other XML strings meant to be used as URI references) may contain characters that, according to [IETF RFC 2396] and [IETF RFC 2732], must be escaped before a URI can be used to retrieve the referenced resource. The characters to be escaped are the control characters #x0 to #x1F and #x7F (most of which cannot appear in XML), space #x20, the delimiters '<' #x3C, '>' #x3E and '"' #x22, the unwise characters '{' #x7B, '}' #x7D, '|' #x7C, '\' #x5C, '^' #x5E and '`' #x60, as well as all characters above #x7F. Since escaping is not always a fully reversible process, it must be performed only when absolutely necessary and as late as possible in a processing chain. In particular, neither the process of converting a relative URI to an absolute one nor the process of passing a URI reference to a process or software component responsible for dereferencing it should trigger escaping. When escaping does occur, it must be performed as follows:
1. Each disallowed character to be escaped is represented in UTF-8 [IETF RFC 2279] as one or more bytes.
2. The resulting bytes are escaped with the URI escaping mechanism (that is, converted to %HH, where HH is the hexadecimal notation of the byte value).
3. The original character is replaced by the resulting character sequence.
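The escaping steps above can be sketched as follows. This is a minimal illustration, not a normative implementation: the function name is hypothetical, the use of uppercase hexadecimal digits is an assumption, and the character set is taken directly from the list in the erratum text.

```python
# Characters the erratum lists as needing escaping: controls #x0-#x1F,
# space #x20, #x7F, the delimiters and the "unwise" characters.
# Everything above #x7F is handled separately below.
_DISALLOWED_ASCII = set(map(chr, range(0x00, 0x21))) | {"\x7f"} | set('<>"{}|\\^`')


def escape_uri_reference(text: str) -> str:
    """Escape disallowed characters as %HH over their UTF-8 bytes."""
    out = []
    for ch in text:
        if ch in _DISALLOWED_ASCII or ord(ch) > 0x7F:
            # Step 1: represent the character in UTF-8.
            # Step 2: convert each resulting byte to %HH.
            out.extend(f"%{byte:02X}" for byte in ch.encode("utf-8"))
        else:
            # Step 3 applies only to disallowed characters; others pass through.
            out.append(ch)
    return "".join(out)
```

Note that '%' and '#' are not in the list, so existing escapes and fragment separators are left untouched. In line with the erratum, such a function would be called only at the last moment before dereferencing.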
Amend the second sentence of the next-to-last paragraph to read:
An XML processor attempting to retrieve the entity's content may use any combination of the public and system identifiers as well as additional information outside the scope of this specification to try to generate an alternative URI reference.
Change the last sentence of the third paragraph to read:
The right angle bracket (>) may be represented using the string "&gt;", and must, for compatibility, be escaped using either "&gt;" or a character reference when it appears in the string "]]>" in content, when that string is not marking the end of a CDATA section.
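A serializer might apply this rule roughly as in the sketch below. The function name is hypothetical, and the choice to leave other occurrences of '>' unescaped is an assumption permitted by "may" in the sentence above.

```python
def escape_character_data(text: str) -> str:
    """Escape character data for use in element content."""
    # '&' must be handled first so later replacements are not re-escaped.
    text = text.replace("&", "&amp;").replace("<", "&lt;")
    # '>' only has to be escaped when it would complete the string "]]>".
    return text.replace("]]>", "]]&gt;")
```

For example, escape_character_data("a ]]> b") yields "a ]]&gt; b", while a lone '>' is passed through unchanged.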
Amend the last sentence of the next-to-last paragraph to read:
Unless an encoding is determined by a higher-level protocol, it is also a fatal error if an XML entity contains no encoding declaration and its content is not legal UTF-8 or UTF-16.
Entities encoded in UTF-16 MUST and entities encoded in UTF-8 MAY begin with the Byte Order Mark described by Annex F of [ISO/IEC 10646], Annex H of [ISO/IEC 10646-2000], section 2.4 of [Unicode], and section 2.7 of [Unicode3] (the ZERO WIDTH NO-BREAK SPACE character, #xFEFF). This is an encoding signature, not part of either the markup or the character data of the XML document. XML processors must be able to use this character to differentiate between UTF-8 and UTF-16 encoded documents.
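The differentiation described above can be sketched with the byte order mark constants in Python's codecs module. The function name is hypothetical, and the sketch deliberately covers only UTF-8 and UTF-16; other encodings whose signature begins with the same bytes (UTF-32, for instance) are outside its scope.

```python
import codecs
from typing import Optional


def sniff_signature(data: bytes) -> Optional[str]:
    """Guess UTF-8 or UTF-16 from a leading byte order mark, if any."""
    if data.startswith(codecs.BOM_UTF8):
        return "UTF-8"
    if data.startswith(codecs.BOM_UTF16_BE) or data.startswith(codecs.BOM_UTF16_LE):
        return "UTF-16"
    # No signature: the caller must fall back on other information.
    return None
```

Because the mark is an encoding signature rather than content, a caller would strip it before handing the remaining text to the parser.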
Add a new production [28b] and modify production [28] to refer to it:
[28] doctypedecl ::= '<!DOCTYPE' S Name (S ExternalID)? S? ('[' intSubset ']' S?)? '>' [VC: Root Element Type] [WFC: External Subset]
[28a] DeclSep ::= PEReference | S [WFC: PE Between Declarations]
[28b] intSubset ::= (markupdecl | DeclSep)*
[29] markupdecl ::= elementdecl | AttlistDecl | EntityDecl | NotationDecl | PI | Comment [VC: Proper Declaration/PE Nesting] [WFC: PEs in Internal Subset]
Change productions [6] Names and [8] Nmtokens to use #x20 (a single space character) instead of S:
[6] Names ::= Name (#x20 Name)*
[8] Nmtokens ::= Nmtoken (#x20 Nmtoken)*
Add a note after production 8:
Note: The Names and Nmtokens productions are used to define the validity of tokenized attribute values after normalization (see 3.3.1 Attribute Types).
This restores first edition erratum E62, which was rescinded by E108. It seems likely that when E108 was adopted the productions were incorrectly thought to apply to unnormalized attribute values, which would have prevented the use of non-#x20 whitespace (tabs and newlines) as separators in tokenized attribute values. In fact, it only prohibits the use of character references to these characters.
This change restores SGML compatibility (cf. the "name list" and "name token list" productions in SGML).
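To illustrate the effect of the revised productions, the sketch below checks an already-normalized attribute value against Nmtokens. It rests on a loud assumption: the token pattern is an ASCII-only approximation of NameChar, whereas the real production covers much more of Unicode. The function name is hypothetical.

```python
import re

# ASCII-only approximation of NameChar (assumption; not the full production).
_NMTOKEN = re.compile(r"[A-Za-z0-9._:-]+")


def is_nmtokens(normalized_value: str) -> bool:
    """Check Nmtoken (#x20 Nmtoken)* against a normalized attribute value."""
    # Splitting on a single space means a doubled space yields an empty
    # token, and a tab stays inside a token; both then fail the match.
    return all(_NMTOKEN.fullmatch(tok) for tok in normalized_value.split(" "))
```

Tabs and newlines typed literally in the document never reach this check, because attribute-value normalization has already turned them into #x20 and collapsed the runs; only characters introduced by character references can survive to be rejected here.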
Modify the third sentence of the first paragraph, so that it reads:
The actual replacement text that is included (or included in literal) as described above must contain the replacement text of any parameter entities referred to, and must contain the character referred to, in place of any character references in the literal entity value; however, general-entity references must be left as-is, unexpanded.
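The rule can be illustrated with a small sketch that builds replacement text from a literal entity value. The function name, the dictionary of already-declared parameter entities, and the ASCII-only name pattern are all assumptions made for the illustration.

```python
import re

# Matches hexadecimal character references, decimal character references,
# and parameter-entity references (ASCII-only names, as an approximation).
_REF = re.compile(r"&#x([0-9A-Fa-f]+);|&#([0-9]+);|%([A-Za-z_:][A-Za-z0-9._:-]*);")


def replacement_text(literal: str, parameter_entities: dict) -> str:
    """Expand character and parameter-entity references in a literal value."""
    def expand(match):
        hex_digits, dec_digits, pe_name = match.groups()
        if hex_digits is not None:
            return chr(int(hex_digits, 16))
        if dec_digits is not None:
            return chr(int(dec_digits))
        # Raises KeyError for an undeclared parameter entity.
        return parameter_entities[pe_name]

    # General-entity references such as &rights; do not match the pattern
    # and are therefore left as-is, unexpanded.
    return _REF.sub(expand, literal)
```

With pub defined as replacement_text("&#xc9;ditions Gallimard", {}), the literal "&#xA9; 1947 %pub;. &rights;" becomes "© 1947 Éditions Gallimard. &rights;": the character reference and the parameter entity are replaced, while the general-entity reference survives.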
To the sentence:
Unless otherwise provided by information outside the scope of this specification (e.g. a special XML element type defined by a particular DTD, or a processing instruction defined by a particular application specification), relative URIs are relative to the location of the resource within which the entity declaration occurs.
(inside the paragraph following the Notation declared VC), append the following:
This is defined to be the external entity containing the '<' which starts the declaration, at the point when it is parsed as a declaration.
This clarifies exactly where a declaration occurs, for purposes of determining the base for relative URIs. Given the example:
example.xml:
  <!DOCTYPE foo [
  <!ENTITY % pe SYSTEM "subdir1/pe">
  %pe;
  %intpe;
  ]>
  <foo>&ent;</foo>
subdir1/pe:
  <!ENTITY % extpe SYSTEM "../subdir2/extpe">
  <!ENTITY % intpe "%extpe;">
subdir2/extpe:
  <!ENTITY ent SYSTEM 'entfile'>
Though the characters making up the declaration of ent appear in
subdir2/extpe, they are not parsed as a declaration there. They are
just treated as characters making up the replacement text of intpe.
They are not parsed as a declaration until intpe is parsed, at which
point the containing external entity is the document entity, so the
relevant base URI is that of example.xml.
The fact that it is the containing external entity that is used may be summed up by saying that internal entities do not carry any base URI with them; indeed, they consist only of their replacement text.
If example.xml contained %extpe; instead of %intpe; the situation
would be different: the contents of subdir2/extpe would be parsed as
a declaration, and the relevant base URI would be that of subdir2
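The two cases can be worked through with urllib.parse.urljoin. The absolute location chosen for example.xml is purely hypothetical; only the relative references come from the example above.

```python
from urllib.parse import urljoin

# Hypothetical location of the document entity.
DOC = "http://example.org/docs/example.xml"

# pe is declared in example.xml, so its system identifier resolves against it.
pe_uri = urljoin(DOC, "subdir1/pe")
# extpe is declared (and parsed as a declaration) inside subdir1/pe.
extpe_uri = urljoin(pe_uri, "../subdir2/extpe")

# Case 1: the declaration of ent is parsed when %intpe; is expanded in
# example.xml, so the base is the document entity.
ent_via_intpe = urljoin(DOC, "entfile")
# Case 2: had example.xml used %extpe; directly, the declaration would be
# parsed inside the external entity subdir2/extpe.
ent_via_extpe = urljoin(extpe_uri, "entfile")
```

Under these assumptions the first case resolves entfile next to example.xml, and the second resolves it inside subdir2.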
From the definition for "A | B", delete "but not both".
Move the entries for [IETF RFC 2396] and [IETF RFC 2732] from A.2 (informative) to A.1 (normative).
Rewrite the Element valid VC as follows:
Validity constraint: Element Valid
An element is valid if there is a declaration matching elementdecl where the Name matches the element type, and one of the following holds:
The declaration matches EMPTY and the element has no content (not even entity references, comments, PIs or white space).
The declaration matches children and the sequence of child elements (after replacing any entity references with their replacement text) belongs to the language generated by the regular expression in the content model, with optional white space (characters matching the nonterminal S), comments and PIs (i.e. markup matching production [27] Misc) between the start-tag and the first child element, between child elements, or between the last child element and the end-tag. Note that a CDATA section containing only white space or a reference to an entity whose replacement text is character references expanding to white space do not match the nonterminal S, and hence cannot appear in these positions; however, a reference to an internal entity with a literal value consisting of character references expanding to white space does match S, since its replacement text is the white space resulting from expansion of the character references.
The declaration matches Mixed and the content (after replacing any entity references with their replacement text) consists of character data, comments, PIs and child elements whose types match names in the content model.
The declaration matches ANY, and the types of any child elements (after replacing any entity references with their replacement text) have been declared.
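For the children case, "belongs to the language generated by the regular expression in the content model" can be made concrete with an ordinary regular-expression engine. The sketch below is an illustration only: the content model (head, (p | list)*, foot?) is hypothetical, it has been translated to a Python pattern by hand, and white space, comments and PIs between children are assumed to have been discarded beforehand.

```python
import re

# Hand translation of the hypothetical content model (head, (p | list)*, foot?).
# Each child name is followed by one space so that names cannot run together.
MODEL = re.compile(r"head (?:(?:p|list) )*(?:foot )?")


def children_valid(child_names) -> bool:
    """Check a sequence of child element types against MODEL."""
    sequence = "".join(name + " " for name in child_names)
    return MODEL.fullmatch(sequence) is not None
```

A real validator would derive the pattern (or an equivalent automaton) from the elementdecl instead of hard-coding it.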
In the paragraph just after production [43] content, amend the definition of empty element so that the word "content" within the definition is a link to production [43].
Amend the last paragraph so that it reads:
A consequence of well-formedness in general entities is that the logical and physical structures in an XML document are properly nested; no start-tag, end-tag, empty-element tag, element, comment, processing instruction, character reference, or entity reference can begin in one entity and end in another.
"General" is added because:
This clarifies that the following from the OASIS test suite:
xmltest/invalid/001.xml:
  <!DOCTYPE doc SYSTEM "001.ent">
  <doc></doc>
with 001.ent:
  <!ELEMENT doc EMPTY>
  <!ENTITY % e "<!--">
  %e; -->
is well-formed but violates a validity constraint.
In the first paragraph after the example, replace "overriden" with "overridden" (two d's) in the sentence "This declared intent is considered to apply to all elements within the content of the element where it
is specified, unless overridden with another instance of the xml:space
attribute."
Change the [IETF RFC 2376] reference to [IETF RFC 3023] (keeping the same #RFC2376 fragment identifier in order not to break existing links).
Change the IETF RFC 2376 entry to:
Amend the next to last paragraph so that it reads:
This specification, together with associated standards (Unicode and ISO/IEC 10646 for characters, Internet RFC 3066 for language identification tags, ISO 639 for language name codes, and ISO 3166 for country name codes), provides all the information necessary to understand XML Version 1.0 and construct computer programs to process it.
[The only change is that "RFC 1766" becomes "RFC 3066".]
Change all [IETF RFC 1766] references to [IETF RFC 3066] (keeping the same #RFC1766 fragment identifier in order not to break existing links).
Remove the last sentence of the Note: "It is expected that the successor to [IETF RFC 1766] will introduce three-letter language codes for languages not presently covered by [ISO 639]."
Change the IETF RFC 1766 entry to:
Just after the paragraph beginning "All attributes for which no declaration has been read..." (just before the examples), append the following paragraph:
It is an error if an attribute refers to an entity when there is a declaration for that entity which the processor has not read. This can happen only when a non-validating processor is being used.
Change the title and the text of Attribute Default Legal Validity Constraint to:
Validity Constraint: Attribute Default Value Syntactically Correct
The declared default value must meet the syntactic constraints of the declared attribute type.
Note that only the syntactic constraints of the type are required here; other constraints (e.g. that the value be the name of a declared unparsed entity, for an attribute of type ENTITY) may come into play if the declared default value is actually used (an element without a specification for this attribute occurs).
Change the first sentence of the second paragraph of the Entity Declared WFC (not the VC of the same name) to read:
Note that non-validating processors are not obligated to read and process entity declarations occurring in parameter entities or in the external subset.
Remove the word "internal" from the title of the section.
Change the first paragraph, in particular removing the word "internal", so that it reads:
In discussing the treatment of entities, it is useful to distinguish two forms of the entity's value. [Definition: For an internal entity, the literal entity value is the quoted string actually present in the entity declaration, corresponding to the non-terminal EntityValue.] [Definition: For an external entity, the literal entity value is the exact text contained in the entity.] [Definition: For an internal entity, the replacement text is the content of the entity, after replacement of character references and parameter-entity references.] [Definition: For an external entity, the replacement text is the content of the entity, after stripping the text declaration (leaving any surrounding whitespace) if there is one but without any replacement of character references or parameter-entity references.]
Modify the second example in the table at the end of the section to read as follows (add a &#x20; in the middle):
|
A #x20 B |
#x20 #x20 A #x20 #x20 #x20 B #x20 #x20 |
Replace the last sentence of the paragraph beginning with "URI references require encoding and escaping of certain characters." with the following:
The XML processor must escape disallowed characters as follows:
After the sentence reading "A URI might thus be relative to the document entity, to the entity containing the external DTD subset, or to some other external parameter entity.", which follows the definition of SystemLiteral, add the following:
Attempts to retrieve the resource identified by a URI may be redirected at the parser
level (for example, in an entity resolver) or below (at the protocol level, for example, via an HTTP
Location: header). In the absence of additional information outside the scope of this specification
within the resource, the base URI of a resource is always the URI of the actual resource returned. In other words,
it is the URI of the resource retrieved after all redirection has occurred.
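The consequence for relative references can be sketched without any network access by simulating the redirects. Everything here is hypothetical: the function name, the dictionary standing in for HTTP Location headers, and the example URIs.

```python
from urllib.parse import urljoin


def final_base_uri(requested: str, locations: dict, max_hops: int = 10) -> str:
    """Follow simulated redirects and return the URI actually retrieved.

    locations maps a URI to the Location value it redirects to; it is a
    stand-in for real protocol-level or resolver-level redirection.
    """
    current = requested
    for _ in range(max_hops):
        if current not in locations:
            return current
        # A Location value may itself be relative to the redirecting URI.
        current = urljoin(current, locations[current])
    raise RuntimeError("too many redirects")


redirects = {"http://example.org/dtd/doc.dtd": "/v2/doc.dtd"}
base = final_base_uri("http://example.org/dtd/doc.dtd", redirects)
# Relative system identifiers inside the DTD resolve against the final URI.
common = urljoin(base, "common.ent")
```

When fetching for real with urllib.request.urlopen, the response object's geturl() method reports the URL after redirection, which plays the role of base here.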
Add a validity constraint applying to productions [58] NotationType and [59] Enumeration
as follows:
Validity constraint: No duplicate tokens
The notation names in a single NotationType attribute declaration, as well as the NmTokens in a single Enumeration attribute declaration, must all be distinct.
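A validator could enforce this constraint with a check along the following lines. The function name is hypothetical, and the input is assumed to be just the parenthesized token list of a NotationType or Enumeration, already extracted from the attribute declaration.

```python
from collections import Counter


def duplicate_tokens(enumeration: str):
    """Return the tokens that occur more than once in '(a | b | ...)'."""
    inner = enumeration.strip().strip("()")
    tokens = [token.strip() for token in inner.split("|")]
    return sorted(token for token, count in Counter(tokens).items() if count > 1)
```

An empty result means the declaration satisfies the constraint; for "(left | right | left)" the function reports ["left"].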