Chapter 12: XML

April 19, 2016, by skiron

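(1) XML is a general-purpose way to describe hierarchical, structured data. An XML document contains one or more elements, each delimited by a start tag and an end tag.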

<foo> ①
</foo> ②

This is the start tag of the foo element.
This is the matching end tag of the foo element. Every start tag must be closed by a corresponding end tag.

Elements can be nested to any depth:
<foo>
<bar></bar>
</foo>

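The first element in an XML document is called the root element. An XML document can have only one root element. The following is not an XML document, because it has two root elements: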
<foo></foo>
<bar></bar>
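
The single-root rule can be checked with Python's standard-library parser. This is a minimal sketch; the helper name is_well_formed is made up for illustration and is not part of any library.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if text parses as an XML document with a single root."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True

one_root = "<foo><bar></bar></foo>"
two_roots = "<foo></foo><bar></bar>"

print(is_well_formed(one_root))   # True
print(is_well_formed(two_roots))  # False: content after the root element is rejected
```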

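Elements can have attributes, which are name-value pairs. Attributes are listed inside the start tag of an element and separated by whitespace. Attribute names cannot be repeated within an element, and attribute values must be quoted, with either single or double quotes.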

<foo lang='en'> ①
<bar id='papayawhip' lang="fr"></bar> ②
</foo>

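The foo element has one attribute, named lang. The value of its lang attribute is en.
The bar element has two attributes, named id and lang. The value of its lang attribute is fr. This does not conflict with the lang attribute of foo; each element has its own set of attributes. Attributes have no defined order, and an element can have any number of them.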
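
As a rough illustration of these attribute rules, the snippet above (with the id attribute written as id='papayawhip') can be fed to the standard-library parser. The variable names are arbitrary; the last part only demonstrates that a repeated attribute name is rejected.

```python
import xml.etree.ElementTree as ET

doc = """<foo lang='en'>
  <bar id='papayawhip' lang="fr"></bar>
</foo>"""

foo = ET.fromstring(doc)
bar = foo[0]  # first (and only) child of foo

print(foo.attrib)       # {'lang': 'en'}
print(bar.get("lang"))  # fr, independent of the lang attribute on foo
print(bar.get("id"))    # papayawhip

# Repeating an attribute name inside one start tag is not well-formed XML.
try:
    ET.fromstring("<foo lang='en' lang='fr'/>")
    duplicate_allowed = True
except ET.ParseError:
    duplicate_allowed = False
print(duplicate_allowed)  # False
```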

Elements can have text content:
<foo lang='en'>
<bar lang='fr'>PapayaWhip</bar>
</foo>

An element that contains no text and no children is empty:
<foo></foo>

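There is a shorthand for empty elements. Putting a slash (/) at the end of the start tag lets you omit the end tag, so the example above can be written as: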
<foo />

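Just as Python lets you define different modules, XML lets you define different namespaces. A namespace usually looks like a URL. The xmlns declaration defines a default namespace. A namespace declaration looks like an attribute, but it serves a different purpose.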

<feed xmlns='http://www.w3.org/2005/Atom'> ①
<title>dive into mark</title> ②
</feed>

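The feed element is in the http://www.w3.org/2005/Atom namespace.
The title element is also in the http://www.w3.org/2005/Atom namespace. A namespace declaration applies to the element on which it is declared and to all of its child elements.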

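You can also use an xmlns:prefix declaration to define a namespace and associate it with a prefix. Every element in that namespace must then be written explicitly with the prefix.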

<atom:feed xmlns:atom='http://www.w3.org/2005/Atom'> ①
<atom:title>dive into mark</atom:title> ②
</atom:feed>

The feed element is in the http://www.w3.org/2005/Atom namespace.
The title element is also in the http://www.w3.org/2005/Atom namespace.

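As far as an XML parser is concerned, the previous two XML documents are identical (namespace + element name = XML identity). Prefixes exist only to refer to namespaces, so the actual prefix (atom:) is irrelevant. If the namespaces, element names, attributes, and text content of every element match, the two XML documents are the same.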
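
One way to see that the prefix does not matter is to parse both spellings and compare what the parser reports. The sketch below uses the standard-library ElementTree module; describe is a hypothetical helper written for this comparison, not a library function.

```python
import xml.etree.ElementTree as ET

default_ns = """<feed xmlns='http://www.w3.org/2005/Atom'>
  <title>dive into mark</title>
</feed>"""

prefixed = """<atom:feed xmlns:atom='http://www.w3.org/2005/Atom'>
  <atom:title>dive into mark</atom:title>
</atom:feed>"""

def describe(element):
    """Return (tag, attributes, stripped text, children) recursively."""
    return (
        element.tag,
        dict(element.attrib),
        (element.text or "").strip(),
        [describe(child) for child in element],
    )

a = describe(ET.fromstring(default_ns))
b = describe(ET.fromstring(prefixed))

print(a[0])    # {http://www.w3.org/2005/Atom}feed
print(a == b)  # True: same namespace, names, attributes, and text
```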

Finally, an XML document can carry character encoding information on its first line, before the root element:
<?xml version='1.0' encoding='utf-8'?>

(2) The structure of an Atom feed

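Think of a weblog, or any website with frequently updated content. The site itself has a title, a subtitle, a last-updated date, and a list of articles posted at different times. Each article also has a title, a first-published date (and possibly a last-updated date), and a unique URL.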

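The Atom syndication format is designed to capture all of this information in a standard format. Different sites may vary in design, scope, and audience, but they all share some basic information, such as a title and an author.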

At the top level is the root element, which every Atom feed shares: the feed element in the Atom namespace.

<feed xmlns='http://www.w3.org/2005/Atom' ①
xml:lang='en'> ②

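http://www.w3.org/2005/Atom is the Atom namespace.
Any element can carry an xml:lang attribute, which declares the language of the element and its children. In this example, xml:lang is declared once on the root element, which means the entire feed is in English.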

An Atom feed contains several pieces of information about the feed itself. These are declared as children of the root-level feed element.

<feed xmlns='http://www.w3.org/2005/Atom' xml:lang='en'>
<title>dive into mark</title> ①
<subtitle>currently between addictions</subtitle> ②
<id>tag:diveintomark.org,2001-07-29:/</id> ③
<updated>2009-03-27T21:56:07Z</updated> ④
<link rel='alternate' type='text/html' href='http://diveintomark.org/'/> ⑤

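The title of this feed is dive into mark.
The subtitle of this feed is currently between addictions.
Every feed needs a globally unique identifier. (See http://www.ietf.org/rfc/rfc4151.txt for how to create one.)
This feed was last updated on March 27, 2009, at 21:56:07 UTC.
Now it gets interesting. The link element has no text content, but it has three attributes: rel, type, and href. The rel value tells you what kind of link this is; rel='alternate' means that this is a link to an alternate representation of this feed. The type='text/html' attribute means that the link target is an HTML page. The address of the link target is given in the href attribute.
Taken together: this is a feed for a site named "dive into mark", which is available at http://diveintomark.org/ and was last updated on March 27, 2009.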

Note: although the order of elements can matter in some XML documents, it does not matter in an Atom feed.

After the feed-level metadata comes the list of the most recent articles. An article looks like this:
<entry>
<author> ①
   <name>Mark</name>
   <uri>http://diveintomark.org/</uri>
</author>
<title>Dive into history, 2009 edition</title> ②
<link rel='alternate' type='text/html' ③
   href='http://diveintomark.org/archives/2009/03/27/dive-into-history-2009-edition'/>
<id>tag:diveintomark.org,2009-03-27:/archives/20090327172042</id> ④
<updated>2009-03-27T21:56:07Z</updated> ⑤
<published>2009-03-27T17:20:42Z</published>
<category scheme='http://diveintomark.org' term='diveintopython'/> ⑥
<category scheme='http://diveintomark.org' term='docbook'/>
<category scheme='http://diveintomark.org' term='html'/>
<summary type='html'>Putting an entire chapter on one page sounds ⑦
    bloated, but consider this &amp;mdash; my longest chapter so far
    would be 75 printed pages, and it loads in under 5 seconds&amp;hellip;
    On dialup.</summary>
</entry> ⑧

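The author element tells who wrote the article.
The title element gives the title of the article.
As with the feed-level alternate link, this link element gives the address of the HTML version of this article.
Entries, like feeds, need a unique identifier.
Entries have two dates: a first-published date and a last-modified date.
Entries can have any number of categories. This article is filed under diveintopython, docbook, and html.
The summary element gives a brief summary of the article. (There is also a content element, for including the complete article text in the feed.) This summary element carries a type='html' attribute, which specifies that the summary is a snippet of HTML rather than plain text. That matters because the text can contain HTML-specific entities (&mdash; and &hellip;), which should be rendered as a dash and an ellipsis rather than displayed literally.
Finally, the end tag of the entry element signals the end of the metadata for this article.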

(2) Parsing XML

Python can parse XML documents in several ways. It has traditional DOM and SAX parsers, but the one covered here is a different library called ElementTree.

>>> import xml.etree.ElementTree as etree ①
>>> tree = etree.parse('examples/feed.xml') ②
>>> root = tree.getroot() ③
>>> root ④
<Element {http://www.w3.org/2005/Atom}feed at cd1eb0>

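The ElementTree library is part of the Python standard library, in xml.etree.ElementTree.
The primary entry point of the library is the parse() function, which takes a filename or a file-like object. It parses the entire document at once. If memory is tight, there are ways to parse an XML document incrementally instead.
The parse() function returns an object that represents the entire document. This is not the root element. To get a reference to the root element, call the getroot() method.
As expected, the root element is the feed element in the http://www.w3.org/2005/Atom namespace. The string representation of this object reinforces an important point: an XML element is a combination of its namespace and its tag name (also called the local name). Every element in this document is in the Atom namespace, so the root element is represented as {http://www.w3.org/2005/Atom}feed.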

Note: ElementTree represents XML elements as {namespace}localname. You will see this format throughout the ElementTree API.
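
The points above can be tried without the examples/feed.xml file. The sketch below assumes a tiny inline feed (FEED) standing in for that file, and split_tag is a hypothetical helper for taking a {namespace}localname string apart. It also shows iterparse(), one of the standard-library ways to read a document incrementally.

```python
import io
import xml.etree.ElementTree as ET

# Inline stand-in for examples/feed.xml (an assumption for this sketch).
FEED = b"""<?xml version='1.0' encoding='utf-8'?>
<feed xmlns='http://www.w3.org/2005/Atom' xml:lang='en'>
  <title>dive into mark</title>
  <entry><title>first</title></entry>
  <entry><title>second</title></entry>
</feed>"""

def split_tag(tag):
    """Split '{namespace}localname' into (namespace, localname)."""
    if tag.startswith("{"):
        namespace, _, localname = tag[1:].partition("}")
        return namespace, localname
    return "", tag

# parse() accepts a file-like object as well as a filename.
root = ET.parse(io.BytesIO(FEED)).getroot()
print(split_tag(root.tag))  # ('http://www.w3.org/2005/Atom', 'feed')

# iterparse() yields each element once its end tag has been read,
# so large documents do not have to be held in memory all at once.
entry_titles = []
for event, element in ET.iterparse(io.BytesIO(FEED), events=("end",)):
    namespace, localname = split_tag(element.tag)
    if localname == "entry":
        entry_titles.append(element[0].text)
        element.clear()  # drop children that are no longer needed
print(entry_titles)  # ['first', 'second']
```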

(3) Elements are lists

In the ElementTree API, an element acts like a list. The items of the list are the element's children.

# continuing from the previous example
>>> root.tag ①
'{http://www.w3.org/2005/Atom}feed'
>>> len(root) ②
8
>>> for child in root: ③
...   print(child) ④
...
<Element {http://www.w3.org/2005/Atom}title at e2b5d0>
<Element {http://www.w3.org/2005/Atom}subtitle at e2b4e0>
<Element {http://www.w3.org/2005/Atom}id at e2b6c0>
<Element {http://www.w3.org/2005/Atom}updated at e2b6f0>
<Element {http://www.w3.org/2005/Atom}link at e2b4b0>
<Element {http://www.w3.org/2005/Atom}entry at e2b720>
<Element {http://www.w3.org/2005/Atom}entry at e2b510>
<Element {http://www.w3.org/2005/Atom}entry at e2b750>

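Continuing from the previous example, the root element is {http://www.w3.org/2005/Atom}feed.
The length of the root element is the number of its child elements.
An element is itself iterable, so you can loop over all of its child elements.
As the output shows, there are 8 child elements: the five pieces of feed-level metadata (title, subtitle, id, updated, and link), followed by three entry elements.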

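One thing to note: the list of child elements includes only direct children. Each entry element has children of its own, but those are not included in the feed-level list; they appear only in the list of the entry itself. Two ways of finding elements at any nesting depth are covered later.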
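
To make the difference between direct children and all descendants concrete, here is a small sketch on an assumed inline feed. It uses Element.iter() from the standard library, which walks the whole subtree (starting with the element itself); this is offered as one option and may not be one of the two approaches the chapter goes on to describe.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"
FEED = """<feed xmlns='http://www.w3.org/2005/Atom'>
  <title>dive into mark</title>
  <entry>
    <title>Dive into history, 2009 edition</title>
    <author><name>Mark</name></author>
  </entry>
</feed>"""

root = ET.fromstring(FEED)

# Iterating over an element visits its direct children only.
direct = [child.tag.replace(ATOM, "") for child in root]

# iter() visits the element itself and every descendant, in document order.
everything = [element.tag.replace(ATOM, "") for element in root.iter()]

print(len(root))   # 2
print(direct)      # ['title', 'entry']
print(everything)  # ['feed', 'title', 'entry', 'title', 'author', 'name']
```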

(4) Attributes are dictionaries

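XML is not just a collection of elements; each element can also have its own set of attributes. Once you have a reference to a specific element, you can easily get its attributes as a Python dictionary.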

# continuing from the previous example
>>> root.attrib ①
{'{http://www.w3.org/XML/1998/namespace}lang': 'en'}
>>> root[4] ②
<Element {http://www.w3.org/2005/Atom}link at e181b0>
>>> root[4].attrib ③
{'href': 'http://diveintomark.org/',
'type': 'text/html',
'rel': 'alternate'}
>>> root[3] ④
<Element {http://www.w3.org/2005/Atom}updated at e2b4e0>
>>> root[3].attrib ⑤
{}

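The attrib property is a dictionary of the element's attributes.
The fifth child, [4] in a 0-based list, is the link element.
The link element has three attributes: href, type, and rel.
The fourth child, [3] in a 0-based list, is the updated element.
The updated element has no attributes, so its attrib is an empty dictionary.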
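
Because attrib is an ordinary dictionary, the usual dictionary idioms apply, and elements also offer get() and keys() as shortcuts. A minimal sketch on an assumed two-child feed (not the real feed.xml, so the indexes differ from the session above):

```python
import xml.etree.ElementTree as ET

# The xml: prefix is bound to this fixed namespace by the XML specification.
XML_NS = "{http://www.w3.org/XML/1998/namespace}"
FEED = """<feed xmlns='http://www.w3.org/2005/Atom' xml:lang='en'>
  <updated>2009-03-27T21:56:07Z</updated>
  <link rel='alternate' type='text/html' href='http://diveintomark.org/'/>
</feed>"""

root = ET.fromstring(FEED)
updated, link = root[0], root[1]

print(root.attrib[XML_NS + "lang"])     # en
print(link.get("href"))                 # http://diveintomark.org/
print(sorted(link.keys()))              # ['href', 'rel', 'type']
print(updated.attrib)                   # {}
print(link.get("hreflang", "unknown"))  # unknown (default for a missing attribute)
```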

(5) Searching for nodes within an XML document

>>> import xml.etree.ElementTree as etree
>>> tree = etree.parse('examples/feed.xml')
>>> root = tree.getroot()
>>> root.findall('{http://www.w3.org/2005/Atom}entry') ①
[<Element {http://www.w3.org/2005/Atom}entry at e2b4e0>,
<Element {http://www.w3.org/2005/Atom}entry at e2b510>,
<Element {http://www.w3.org/2005/Atom}entry at e2b540>]
>>> root.tag
'{http://www.w3.org/2005/Atom}feed'
>>> root.findall('{http://www.w3.org/2005/Atom}feed') ②
[]
>>> root.findall('{http://www.w3.org/2005/Atom}author') ③
[]

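The findall() method finds all child elements that match a query.
Every element, including the root element and its children, has a findall() method. It searches only among the element's own children. This query returns an empty list because the root feed element has no child named feed.
This query also returns an empty list, because author is not a direct child of the root element; the author elements are children of the entry elements.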
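
Writing the full {namespace} in every query gets long. The standard-library findall() also accepts a prefix-to-namespace mapping as its second argument, which the sketch below uses on an assumed inline feed; the prefix name atom is an arbitrary choice made here.

```python
import xml.etree.ElementTree as ET

NAMESPACES = {"atom": "http://www.w3.org/2005/Atom"}
FEED = """<feed xmlns='http://www.w3.org/2005/Atom'>
  <title>dive into mark</title>
  <entry><author><name>Mark</name></author></entry>
  <entry><author><name>Mark</name></author></entry>
</feed>"""

root = ET.fromstring(FEED)

entries = root.findall("atom:entry", NAMESPACES)
# author is a grandchild of feed, so a plain child query finds nothing...
authors_direct = root.findall("atom:author", NAMESPACES)
# ...while spelling out the path through entry does.
authors_nested = root.findall("atom:entry/atom:author", NAMESPACES)

print(len(entries), len(authors_direct), len(authors_nested))  # 2 0 2
```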

There is also a find() method, which returns the first matching element.

>>> entries = tree.findall('{http://www.w3.org/2005/Atom}entry') ①
>>> len(entries)
3
>>> title_element = entries[0].find('{http://www.w3.org/2005/Atom}title') ②
>>> title_element.text
'Dive into history, 2009 edition'
>>> foo_element = entries[0].find('{http://www.w3.org/2005/Atom}foo') ③
>>> foo_element
>>> type(foo_element)
<class 'NoneType'>

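Find all the atom:entry elements.
The find() method returns the first matching element.
No element named foo was found, so find() returns None.
To test whether find() returned an element, do not rely on the truth value of the result, because an element with no children has historically evaluated as false in a boolean context. Compare against None explicitly:
if element.find('...') is not None: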
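
The explicit None check can be wrapped in a small helper. In the sketch below, child_text is a hypothetical function written for illustration; findtext() is the standard-library shortcut that does roughly the same job.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"
ENTRY = """<entry xmlns='http://www.w3.org/2005/Atom'>
  <title>Dive into history, 2009 edition</title>
</entry>"""

def child_text(element, localname, default=None):
    """Return the text of the first matching Atom child, or default."""
    found = element.find(ATOM + localname)
    if found is not None:  # explicit comparison, not a truth-value test
        return found.text
    return default

entry = ET.fromstring(ENTRY)
print(child_text(entry, "title"))                         # Dive into history, 2009 edition
print(child_text(entry, "foo", default="(missing)"))      # (missing)
print(entry.findtext(ATOM + "foo", default="(missing)"))  # (missing)
```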

(6) Searching at any depth

>>> all_links = tree.findall('//{http://www.w3.org/2005/Atom}link') ①
>>> all_links
[<Element {http://www.w3.org/2005/Atom}link at e181b0>,
<Element {http://www.w3.org/2005/Atom}link at e2b570>,
<Element {http://www.w3.org/2005/Atom}link at e2b480>,
<Element {http://www.w3.org/2005/Atom}link at e2b5a0>]
>>> all_links[0].attrib ②
{'href': 'http://diveintomark.org/',
'type': 'text/html',
'rel': 'alternate'}
>>> all_links[1].attrib ③
{'href': 'http://diveintomark.org/archives/2009/03/27/dive-into-history-2009-edition',
'type': 'text/html',
'rel': 'alternate'}
>>> all_links[2].attrib
{'href': 'http://diveintomark.org/archives/2009/03/21/accessibility-is-a-harsh-mistress',
'type': 'text/html',
'rel': 'alternate'}
>>> all_links[3].attrib
{'href': 'http://diveintomark.org/archives/2008/12/18/give-part-1-container-formats',
'type': 'text/html',
'rel': 'alternate'}

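This query is almost the same as the previous ones, except for the two slashes (//) in front of the element name. The two slashes mean "search at any depth", so the result contains four link elements rather than just one.
The first result is a direct child of the root element. Its attributes show that it is the feed-level alternate link, which points to the HTML version of the site.
The other three results are the entry-level alternate links.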
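
When the search starts from an element rather than from the tree object, the standard library expects the relative form, with a leading dot: './/'. The sketch below compares a child-only query, an any-depth query, and Element.iter() on an assumed inline feed with one feed-level and one entry-level link.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"
FEED = """<feed xmlns='http://www.w3.org/2005/Atom'>
  <link rel='alternate' href='http://diveintomark.org/'/>
  <entry>
    <link rel='alternate' href='http://diveintomark.org/archives/2009/03/27/dive-into-history-2009-edition'/>
  </entry>
</feed>"""

root = ET.fromstring(FEED)

direct_links = root.findall(ATOM + "link")       # children of feed only
all_links = root.findall(".//" + ATOM + "link")  # any depth below feed
iter_links = list(root.iter(ATOM + "link"))      # tree walk filtered by tag

print(len(direct_links), len(all_links))  # 1 2
print([l.get("href") for l in all_links] == [l.get("href") for l in iter_links])  # True
```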

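Overall, ElementTree's findall() method is a powerful feature, but its query language can be surprising. The official documentation describes it as having "limited support for XPath expressions." XPath is a W3C standard for querying XML documents. ElementTree's query language is similar enough to XPath for basic searching, but different enough to be annoying if you already know XPath. The next section looks at a third-party XML library that extends the ElementTree API with full XPath support.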

(7) lxml

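lxml is an open-source third-party library built on the libxml2 parser. It provides an API that is 100% compatible with ElementTree, and extends it with full XPath 1.0 support and a few other features. It must be installed separately.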

>>> from lxml import etree ①
>>> tree = etree.parse('examples/feed.xml') ②
>>> root = tree.getroot() ③
>>> root.findall('{http://www.w3.org/2005/Atom}entry') ④
[<Element {http://www.w3.org/2005/Atom}entry at e2b4e0>,
<Element {http://www.w3.org/2005/Atom}entry at e2b510>,
<Element {http://www.w3.org/2005/Atom}entry at e2b540>]

Once imported, lxml provides the same API as the built-in ElementTree library.
The parse() function is the same as in ElementTree.
The getroot() method is also the same.
The findall() method is exactly the same.
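
Because the two libraries share this API, code that sticks to the common subset can try lxml first and fall back to the standard library when lxml is not installed. This is a sketch of that pattern on an assumed inline feed; it behaves the same whichever import succeeds.

```python
try:
    from lxml import etree  # third-party implementation, if installed
except ImportError:
    import xml.etree.ElementTree as etree  # standard-library fallback

# Tiny inline feed, an assumption standing in for examples/feed.xml.
FEED = b"<feed xmlns='http://www.w3.org/2005/Atom'><entry/><entry/></feed>"

root = etree.fromstring(FEED)
entries = root.findall("{http://www.w3.org/2005/Atom}entry")
print(len(entries))  # 2
```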

lxml is not only faster than ElementTree; its findall() method also supports more complicated expressions.

>>> import lxml.etree ①
>>> tree = lxml.etree.parse('examples/feed.xml')
>>> tree.findall('//{http://www.w3.org/2005/Atom}*[@href]') ②
[<Element {http://www.w3.org/2005/Atom}link at eeb8a0>,
<Element {http://www.w3.org/2005/Atom}link at eeb990>,
<Element {http://www.w3.org/2005/Atom}link at eeb960>,
<Element {http://www.w3.org/2005/Atom}link at eeb9c0>]
>>> tree.findall("//{http://www.w3.org/2005/Atom}*[@href='http://diveintomark.org/']") ③
[<Element {http://www.w3.org/2005/Atom}link at eeb930>]
>>> NS = '{http://www.w3.org/2005/Atom}'
>>> tree.findall('//{NS}author[{NS}uri]'.format(NS=NS)) ④
[<Element {http://www.w3.org/2005/Atom}author at eeba80>,
<Element {http://www.w3.org/2005/Atom}author at eebba0>]

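This example uses the lxml library.
This query finds all elements in the Atom namespace, at any depth, that have an href attribute. The // at the start of the query means "elements anywhere, not just children of the root element". {http://www.w3.org/2005/Atom} means "only elements in the Atom namespace". * means "elements with any local name". [@href] means "has an href attribute".
This query finds all Atom elements whose href attribute has the value http://diveintomark.org/.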
After doing some quick string formatting (because otherwise these compound queries get ridiculously long),
this query searches for Atom author elements that have an Atom uri element as a child. This only returns
two author elements, the ones in the first and second entry. The author in the last entry contains only a
name, not a uri.
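
For readers without lxml installed, the same three kinds of predicate can be tried with the standard library, assuming a reasonably recent Python: its XPath subset includes [@attrib], [@attrib='value'], and [tag]. The sketch below uses an assumed inline feed and a prefix mapping (NS) instead of string formatting; it is not a substitute for lxml's fuller XPath support.

```python
import xml.etree.ElementTree as ET

NS = {"atom": "http://www.w3.org/2005/Atom"}
FEED = """<feed xmlns='http://www.w3.org/2005/Atom'>
  <link rel='alternate' href='http://diveintomark.org/'/>
  <entry>
    <author><name>Mark</name><uri>http://diveintomark.org/</uri></author>
    <link rel='alternate' href='http://diveintomark.org/archives/2009/03/27/dive-into-history-2009-edition'/>
  </entry>
  <entry>
    <author><name>Mark</name></author>
  </entry>
</feed>"""

root = ET.fromstring(FEED)

# Any element, at any depth, that has an href attribute.
with_href = root.findall(".//*[@href]")
# Any element whose href attribute has exactly this value.
home_links = root.findall(".//*[@href='http://diveintomark.org/']")
# Atom author elements that have an Atom uri child.
authors_with_uri = root.findall(".//atom:author[atom:uri]", NS)

print(len(with_href), len(home_links), len(authors_with_uri))  # 2 1 1
```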

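Never mind, I am going to stop reading this chapter here. I do not need XML right now, so I will come back to it when I do. Noting for the record that section 12.7 is where I should pick up.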
