中文XML论坛 (Chinese XML Forum): How to read the syntax rules in the XML 1.0 Recommendation

http://hi.baidu.com/sunnybill/blog/item/a19a9bd4c5738002a08bb738.html

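Why you need to know this: to write XML documents, you must understand which arrangements of characters are allowed in an XML document, and know how to find and interpret the syntax rules defined in the specification.

An XML document is a sequence of UCS characters organized according to certain patterns. These patterns provide a way to express data that has a logical hierarchy (a tree structure).

The XML 1.0 specification establishes conventions under which some UCS character sequences represent data and others represent markup. The markup lets the data and its logical hierarchy be expressed together.

The specification defines these conventions partly in ordinary prose and partly as a formal grammar written in EBNF (Extended Backus-Naur Form) notation. The notation is described briefly in section 6 of the specification. Knowing how to read the EBNF productions is very useful, because they state precisely what counts as correct syntax.

The EBNF productions do little more than enumerate the UCS character sequences that are allowed. Basic sequences are assigned to symbols, and those symbols become the building blocks for more complex combinations of symbols and other character sequences. The sequences build on one another until an entire XML document can be expressed by the following EBNF production: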

document ::= prolog element Misc*

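This production says that the symbol document (which represents a well-formed XML document) consists of one prolog, followed by one element, followed by zero or more Misc items. Each of these symbols is in turn defined in terms of other symbols and character sequences.

Note that the XML 1.0 specification refers to UCS characters by their Unicode scalar values (USVs), written as #x followed by only as many hexadecimal digits as needed. For example, #x9 in an EBNF production means the abstract character that Unicode 3.1's "U+" notation writes as U+0009. It does not necessarily mean a byte with the hexadecimal value 9.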
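
The difference between a code point and a byte can be made concrete with a small Python sketch. The helper name ucs_notation is made up for this illustration; it is not part of the specification or of any library. The sketch prints a few code points in "U+" notation next to the bytes they become in two encodings.

```python
# "#x9" in the grammar names the code point U+0009, not a byte value.
# ucs_notation is a hypothetical helper written for this illustration.
def ucs_notation(cp: int) -> str:
    """Format a code point in "U+" notation (at least four hex digits)."""
    return f"U+{cp:04X}"


for cp in (0x9, 0xE9, 0x10000):
    ch = chr(cp)
    utf8_hex = ch.encode("utf-8").hex()
    utf16_hex = ch.encode("utf-16-be").hex()
    print(ucs_notation(cp), utf8_hex, utf16_hex)

# Prints:
# U+0009 09 0009
# U+00E9 c3a9 00e9
# U+10000 f0908080 d800dc00
```

The same abstract character occupies one, two, or four bytes depending on the encoding, which is why the grammar talks about characters rather than bytes.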

Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
S ::= (#x20 | #x9 | #xD | #xA)+

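The first line says that Char is any single character in the ranges listed on the right-hand side. Note that the characters U+0000 through U+0008, along with several other ranges, are not Chars and cannot be used in an XML document. The second line says that S is a sequence of one or more whitespace characters, of which there are four. Comment is defined as follows:

Comment ::= '<!--' ((Char - '-') | ('-' (Char - '-')))* '-->'

This says that a Comment consists of the 4 characters <!-- and the 3 characters -->, with zero or more items in between, each of which is either a single Char other than - or a - followed by a Char other than -.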
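
Because these productions only enumerate character sequences, they translate almost directly into regular expressions. The following Python sketch is one possible transcription of Char, S and Comment; the constant and function names (CHAR, is_char and so on) are invented for this example, and the Comment production is quoted from the XML 1.0 Recommendation in the code comment.

```python
import re

# Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
CHAR = r"[\t\n\r\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"

# (Char - '-'): the same set with U+002D (the hyphen) cut out of the first range.
CHAR_NOT_HYPHEN = r"[\t\n\r\x20-\x2C\x2E-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"

# S ::= (#x20 | #x9 | #xD | #xA)+
S = r"[\x20\t\r\n]+"

# Comment ::= '<!--' ((Char - '-') | ('-' (Char - '-')))* '-->'
COMMENT = rf"<!--(?:{CHAR_NOT_HYPHEN}|-{CHAR_NOT_HYPHEN})*-->"


def is_char(ch: str) -> bool:
    """True if ch is exactly one character allowed by the Char production."""
    return re.fullmatch(CHAR, ch) is not None


def is_s(text: str) -> bool:
    """True if text is one or more of the four whitespace characters."""
    return re.fullmatch(S, text) is not None


def is_comment(text: str) -> bool:
    """True if the whole of text matches the Comment production."""
    return re.fullmatch(COMMENT, text) is not None


print(is_char("\t"), is_char("\x08"))   # True False
print(is_s(" \t\r\n"), is_s(""))        # True False
print(is_comment("<!-- a - b -->"))     # True
print(is_comment("<!-- a -- b -->"))    # False: "--" inside the comment
print(is_comment("<!-- a --->"))        # False: the comment may not end in "--->"
```

The last two cases follow from the production itself: a - may appear only when the next character is a Char other than -, so two hyphens in a row can occur only as part of the closing -->.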

Misc ::= Comment | PI | S

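This says that Misc is one of Comment, PI, or S. The definition of PI is too long to list here (note: PI stands for processing instruction).

Since Comment and S have already been defined, Misc could equally be written as:

Misc ::= '<!--' ((Char - '-') | ('-' (Char - '-')))* '-->' | PI | (#x20 | #x9 | #xD | #xA)+

The other parts of a document are defined in the same way. In other words, a well-formed XML document is a sequence of UCS characters organized according to certain patterns.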

===Original context from: http://skew.org/xml/tutorial/===============================

How to read the syntax rules in the XML 1.0 Recommendation

Why you need to know this: In order to author XML documents, one must understand what sequences of what characters are allowed in an XML document, and how to find and interpret the syntax rules that are defined in the spec.

An XML document is a UCS character sequence that follows certain patterns. These patterns provide a means of representing a logical hierarchy (a tree) of data.

The XML 1.0 Recommendation establishes conventions for using certain UCS character sequences to represent data and certain other UCS character sequences to represent markup. The markup allows the logical hierarchy to be expressed in the document along with the data itself.

The Recommendation defines these conventions partly with prose explanations and partly with a formal grammar written as a set of "productions" in Extended Backus-Naur Form (EBNF) notation. This notation is described briefly in section 6 of the spec. It is helpful to know how to read the EBNF productions because they are the definitive reference for proper syntax.

The EBNF productions do little more than enumerate allowable UCS character sequences. Basic sequences are assigned to symbols, which in turn are the foundation for more advanced combinations of symbols and other character sequences. These sequences build upon each other to the point where an entire XML document can be expressed with the following EBNF production:

document ::= prolog element Misc*

This production says that the symbol named document (which represents a well-formed XML document), consists simply of one prolog followed by one element followed by zero or more Miscs. Each of these symbols is defined in terms of other symbols and character sequences.
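
One way to watch this production at work is to hand a few strings to an existing XML parser and see which ones it accepts. The sketch below uses Python's standard xml.dom.minidom module; the function name is_well_formed and the sample strings are made up for this illustration.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError


def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts text as a complete XML document."""
    try:
        minidom.parseString(text)
        return True
    except ExpatError:
        return False


# prolog (XML declaration and a comment), one element, then Misc* (a comment and whitespace)
ok = '<?xml version="1.0"?><!-- in the prolog --><root/><!-- trailing Misc -->\n'

# Two top-level elements: the production allows exactly one element.
two_roots = "<a/><b/>"

# Character data after the element is not a Comment, PI, or S, so it is not a Misc.
trailing_text = "<a/>text"

print(is_well_formed(ok))             # True
print(is_well_formed(two_roots))      # False
print(is_well_formed(trailing_text))  # False
```

The parser is doing far more than checking this single production, but the two rejected strings fail precisely because they do not fit the shape prolog element Misc*.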

Note that the XML 1.0 Recommendation refers to UCS characters by their Unicode scalar values, using a notation of #x followed by only as many hex digits as needed. So #x9 in the EBNF productions means the abstract character that would be represented in Unicode 3.1's "U+" notation as U+0009. It does not necessarily mean a byte with hex value 9.

Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
S ::= (#x20 | #x9 | #xD | #xA)+

The first line means that Char is any single character in the listed ranges. Note that characters U+0000 through U+0008 and several other ranges are not considered Chars and are not allowed in XML documents. The second line shows that S is a sequence of one or more instances of any of the 4 "whitespace" characters. The definition of a Comment is given as:

Comment ::= '<!--' ((Char - '-') | ('-' (Char - '-')))* '-->'

This means that Comment is the 4 characters <!-- and the 3 characters -->, in between which are 0 or more instances of either a Char that is not -, or the character - followed by a Char that is not -.
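
The hyphen rule is easy to misread, so it can help to try a few comments against a real parser. This sketch uses the expat bindings from Python's standard library; the helper name parses and the wrapper element r are assumptions made only for this example.

```python
import xml.parsers.expat


def parses(text: str) -> bool:
    """Return True if expat accepts text as a well-formed document."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(text, True)
        return True
    except xml.parsers.expat.ExpatError:
        return False


# A single hyphen followed by a non-hyphen character is allowed.
print(parses("<r><!-- a - b --></r>"))   # True

# Two hyphens in a row inside the comment body are not.
print(parses("<r><!-- a -- b --></r>"))  # False

# Nor may the body end with a hyphen right before the closing delimiter.
print(parses("<r><!-- a ---></r>"))      # False
```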

Misc ::= Comment | PI | S

This means that Misc is one of Comment, PI, or S. The definition of PI is too lengthy to include here, so we'll just leave it as it is.

Since Comment and S have been defined, it would be just as accurate to say:

Misc ::= '<!--' ((Char - '-') | ('-' (Char - '-')))* '-->' | PI | (#x20 | #x9 | #xD | #xA)+

The other components of document are defined in the same way. It follows that a well-formed XML document is a UCS character sequence that follows certain patterns.

