Chapter 13: Serializing Python Objects

April 26, 2016, skiron

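(1) The pickle module

What types can the pickle module store?

All of Python's native data types: booleans, integers, floating point numbers, complex numbers, strings, bytes objects, byte arrays, and None.
All native compound types: lists, tuples, dictionaries, and sets containing any combination of native data types.
Any nested combination of the native basic and compound types above (up to the maximum nesting depth Python supports, see sys.getrecursionlimit()).
Functions, classes, and instances of classes (with caveats).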
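
To make the list above concrete, here is a minimal sketch that round-trips a nested mix of native types and then shows one of the caveats for functions. The `sample` dictionary and its keys are made up for illustration.

```python
import pickle

# A nested mix of native types: dict -> tuple / list / set / bytes ...
sample = {
    'flags': (True, None, 3.14, 2 + 3j),
    'blob': b'\x00\xff',
    'buffer': bytearray(b'abc'),
    'tags': {'a', 'b'},
    'nested': [[1, 2], {'k': ('v',)}],
}
restored = pickle.loads(pickle.dumps(sample))
print(restored == sample)

# One caveat: functions are pickled by name, so an anonymous lambda,
# which has no importable name, cannot be pickled.
try:
    pickle.dumps(lambda x: x)
    lambda_pickled = True
except (pickle.PicklingError, AttributeError):
    lambda_pickled = False
print(lambda_pickled)
```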

(2) Saving data with the pickle module

>>> entry = {}
>>> entry['title'] = 'Dive into history, 2009 edition'
>>> entry['article_link'] = 'http://diveintomark.org/archives/2009/03/27/dive-into-history-2009-edition'
>>> entry['comments_link'] = None
>>> entry['internal_id'] = b'\xDE\xD5\xB4\xF8'
>>> entry['tags'] = ('diveintopython', 'docbook', 'html')
>>> entry['published'] = True
>>> import time
>>> entry['published_date'] = time.strptime('Fri Mar 27 22:20:42 2009')
>>> entry['published_date']
time.struct_time(tm_year=2009, tm_mon=3, tm_mday=27, tm_hour=22, tm_min=20, tm_sec=42, tm_wday=4, tm_yday=86, tm_isdst=-1)

>>> import pickle ①
>>> with open('entry.pickle', 'wb') as f: ②
...     pickle.dump(entry, f) ③

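① Import the pickle module.
② Open a file in binary mode; the stream object is f.
③ The dump() function in the pickle module serializes the data structure into a binary, Python-specific format and saves it to the open file. When no protocol is given it uses pickle.DEFAULT_PROTOCOL, which on recent Python versions is not necessarily the newest one (pickle.HIGHEST_PROTOCOL).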

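That last sentence matters, so to spell it out:

The pickle module saves a Python data structure to a file.
It serializes the data structure using the pickle protocol.
The pickle protocol is specific to Python; there is no guarantee that other languages can read it.
Not every Python data structure can be serialized. The pickle protocol has been revised several times as new data types were added to Python, but limitations remain.
A pickle is not guaranteed to be readable by every Python version. Newer versions can read older formats, but older versions may not be able to read newer ones (because they lack the newer data types).
Unless you specify a protocol version, pickle uses its default protocol (pickle.DEFAULT_PROTOCOL).
Make sure to open the file in binary mode, otherwise writing the serialized data will fail.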
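
The last two points can be checked with a short sketch. The temporary file path and the `data` dictionary are hypothetical; `protocol=2` is just one example of pinning an older protocol for readers running an older Python.

```python
import os
import pickle
import tempfile

data = {'title': 'Dive into history, 2009 edition', 'id': 256}
path = os.path.join(tempfile.mkdtemp(), 'entry.pickle')  # hypothetical location

# Text mode fails: pickle writes bytes, and a text stream only accepts str.
try:
    with open(path, 'w') as f:
        pickle.dump(data, f)
    text_mode_ok = True
except TypeError:
    text_mode_ok = False

# Binary mode, with an explicitly chosen older protocol.
with open(path, 'wb') as f:
    pickle.dump(data, f, protocol=2)

with open(path, 'rb') as f:
    header = f.read(2)   # protocol >= 2 pickles start with b'\x80' + version
    f.seek(0)
    loaded = pickle.load(f)

print(text_mode_ok, header, loaded == data)
print(pickle.DEFAULT_PROTOCOL, pickle.HIGHEST_PROTOCOL)
```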

(3) Loading data back from a pickle file

>>> import pickle
>>> with open('entry.pickle', 'rb') as f: ①
...     entry = pickle.load(f) ②
...
>>> entry
{'comments_link': None,
'internal_id': b'\xDE\xD5\xB4\xF8',
'title': 'Dive into history, 2009 edition',
'tags': ('diveintopython', 'docbook', 'html'),
'article_link':
'http://diveintomark.org/archives/2009/03/27/dive-into-history-2009-edition',
'published_date': time.struct_time(tm_year=2009, tm_mon=3, tm_mday=27, tm_hour=22, tm_min=20, tm_sec=42, tm_wday=4, tm_yday=86, tm_isdst=-1),
'published': True}

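① Use a with statement to open the pickle file in binary mode; the stream object is f.
② pickle.load() takes a stream object, reads the serialized data from it, creates a new Python object, rebuilds the serialized data in that new object, and returns it.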
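
The phrase "creates a new Python object" is worth a quick check. In this small sketch (the file path and the `original` dictionary are hypothetical), the loaded object has the same contents as the one that was dumped but is a separate object in memory.

```python
import os
import pickle
import tempfile

original = {'tags': ['diveintopython', 'docbook', 'html'], 'published': True}
path = os.path.join(tempfile.mkdtemp(), 'entry.pickle')  # hypothetical location

with open(path, 'wb') as f:
    pickle.dump(original, f)

with open(path, 'rb') as f:
    restored = pickle.load(f)

print(restored == original)   # same contents
print(restored is original)   # but not the same object
```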

(4) Pickling without a file

The serialized result does not have to be saved to a file; it can be kept in memory instead.

>>> b = pickle.dumps(entry) ①
>>> type(b) ②
<class 'bytes'>
>>> entry3 = pickle.loads(b) ③
>>> entry3 == entry ④
True

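① pickle.dumps() (note the s at the end of the name) performs the same serialization as pickle.dump(). The difference is that it returns the serialized data instead of writing it to a file on disk.
② Because the pickle protocol uses a binary data format, pickle.dumps() returns a bytes object.
③ pickle.loads() (again, note the trailing s) performs the same deserialization as pickle.load(), but instead of a stream object it takes a bytes object holding the serialized data.
④ The unsurprising result: the two objects are equal.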
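
One possible use of in-memory pickling, sketched here under made-up names, is keeping a snapshot of an object's state as bytes and rebuilding it later. This is only an illustration of dumps() and loads(), not something the original text prescribes.

```python
import pickle

entry = {'title': 'Dive into history, 2009 edition', 'tags': ['docbook', 'html']}

snapshot = pickle.dumps(entry)      # bytes held in memory, no file involved
entry['tags'].append('draft')       # a later change to the live object

earlier = pickle.loads(snapshot)    # rebuild the state at snapshot time
print(earlier['tags'])
print(entry['tags'])
```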

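(5) Revisiting the past once more: bytes and strings

The pickle protocol has been around for many years and has matured along with Python itself. As of Python 3.0 there were four versions of the pickle protocol (Python 3.4 later added version 4, and Python 3.8 added version 5).

Python 1.x had two pickle protocols: the original text-based format (version 0) and a binary format (version 1).
Python 2.3 introduced pickle protocol version 2 to handle new functionality in Python class objects.
Python 3.0 introduced pickle protocol version 3, which added explicit support for bytes objects and byte arrays.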
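
The differences between protocol versions are visible in the raw output. The following sketch uses made-up sample values; it compares the text-based protocol 0 with a binary protocol and checks that a bytes object survives a round trip under every protocol the running interpreter supports.

```python
import pickle

value = {'title': 'Dive into history', 'id': 256}

# Protocol 0 is the original text-based format; for this simple value
# the payload is plain ASCII (decode() would raise otherwise).
text_pickle = pickle.dumps(value, protocol=0)
as_ascii = text_pickle.decode('ascii')
print(repr(as_ascii))

# Protocols 2 and up begin with a PROTO opcode (0x80) and the version number.
binary_pickle = pickle.dumps(value, protocol=3)
print(binary_pickle[:2])

# Every supported protocol round-trips a bytes object on Python 3.
blob = b'\xDE\xD5\xB4\xF8'
results = {p: pickle.loads(pickle.dumps(blob, protocol=p)) == blob
           for p in range(pickle.HIGHEST_PROTOCOL + 1)}
print(results)
```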

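So, take another look at the difference between bytes and strings. (The original text points out that if you don't know the difference, you didn't read the earlier material carefully... well, after all this translating I've forgotten what I wrote earlier too! See chapter 4, section 6, Strings vs. Bytes.)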

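(6) Letting other languages read serialized Python objects: JSON

JSON is text-based and case-sensitive. Because it is text-based, whitespace comes into play: any amount of whitespace may appear between values, and JSON encoders and decoders ignore it. That lets you display or print the data with sensible indentation. JSON must be stored in a Unicode encoding (UTF-32, UTF-16, or UTF-8).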
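
A short sketch of both properties, using made-up JSON strings: whitespace between values does not change the decoded result, while the capitalization of a literal does matter.

```python
import json

compact = '{"id":256,"tags":["docbook","html"],"published":true}'
spaced = '''
{
    "id"   :   256,
    "tags" : [ "docbook",   "html" ],
    "published" :
        true
}
'''
print(json.loads(compact) == json.loads(spaced))

# Case sensitivity: the literal must be lowercase "true", not "True".
try:
    json.loads('{"published": True}')
    accepted = True
except ValueError:   # json.JSONDecodeError is a subclass of ValueError
    accepted = False
print(accepted)
```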

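JSON-serialized data can be "decoded" with JavaScript's eval() function, although eval() executes arbitrary code, so for untrusted input JSON.parse() is the safer choice.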

>>> basic_entry = {} ①
>>> basic_entry['id'] = 256
>>> basic_entry['title'] = 'Dive into history, 2009 edition'
>>> basic_entry['tags'] = ('diveintopython', 'docbook', 'html')
>>> basic_entry['published'] = True
>>> basic_entry['comments_link'] = None
>>> import json
>>> with open('basic.json', mode='w', encoding='utf-8') as f: ②
...     json.dump(basic_entry, f) ③

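① Create a new data structure (a dictionary).
② JSON is text-based, so open the file in text write mode with UTF-8 encoding. (Stick to UTF-8 whenever you can; you can't go wrong with it.)
③ Like the pickle module, the json module defines a dump() function that takes a Python data structure and a stream object.

After JSON serialization, the data stored in the file looks like this: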

{"published": true, "tags": ["diveintopython", "docbook", "html"], "comments_link": null,
"id": 256, "title": "Dive into history, 2009 edition"}

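To make the output easier to read, pass the indent parameter to dump():

>>> with open('basic-pretty.json', mode='w', encoding='utf-8') as f:
...     json.dump(basic_entry, f, indent=2)

The indent parameter makes the serialized data more readable. An indent of 0 means "put each value on its own line"; a value greater than 0 means "put each value on its own line, and use this many spaces to indent nested data structures".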

{
  "published": true,
  "tags": [
    "diveintopython",
    "docbook",
    "html"
  ],
  "comments_link": null,
  "id": 256,
  "title": "Dive into history, 2009 edition"
}
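
The three indent settings can be compared side by side with json.dumps(), the string-returning sibling of json.dump(). The `data` dictionary below is a shortened, made-up version of the entry.

```python
import json

data = {'id': 256, 'tags': ['docbook', 'html']}

compact = json.dumps(data)                  # default: everything on one line
newlines_only = json.dumps(data, indent=0)  # one value per line, no indentation
pretty = json.dumps(data, indent=2)         # one value per line, 2 spaces per level

print(compact)
print(newlines_only)
print(pretty)
```

All three strings decode back to the same dictionary; only the whitespace differs.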

How JSON data types map to Python 3 data types:

Notes	JSON	Python 3
	object	dictionary
	array	list
	string	string
	integer	integer
	real number	float
*	true	True
*	false	False
*	null	None
* All JSON values are case-sensitive.

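Did you notice what is missing? Tuples and bytes! JSON has an array type, which the json module maps to a Python list, but it has no separate type for tuples. And although JSON handles strings very well, it has no support for bytes objects or byte arrays.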
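
Both gaps are easy to observe. In this minimal sketch a tuple goes out as a JSON array and comes back as a list, and a bytes object is rejected outright.

```python
import json

# A tuple is written as a JSON array, so it comes back as a list.
tags = ('diveintopython', 'docbook', 'html')
round_tripped = json.loads(json.dumps(tags))
print(round_tripped)

# bytes has no JSON counterpart at all, so the encoder refuses it.
try:
    json.dumps(b'\xDE\xD5\xB4\xF8')
    error = None
except TypeError as exc:
    error = str(exc)
print(error)
```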

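(7) Serializing data types unsupported by JSON

To serialize data types that JSON does not support, you need to provide custom encoders and decoders.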

>>> entry ①
{'comments_link': None,
'internal_id': b'\xDE\xD5\xB4\xF8',
'title': 'Dive into history, 2009 edition',
'tags': ('diveintopython', 'docbook', 'html'),
'article_link': 'http://diveintomark.org/archives/2009/03/27/dive-into-history-2009-edition',
'published_date': time.struct_time(tm_year=2009, tm_mon=3, tm_mday=27, tm_hour=22, tm_min=20, tm_sec=42, tm_wday=4, tm_yday=86, tm_isdst=-1),
'published': True}
>>> import json
>>> with open('entry.json', 'w', encoding='utf-8') as f: ②
...     json.dump(entry, f) ③
...
Traceback (most recent call last):
  File "<stdin>", line 5, in <module>
  File "C:\Python31\lib\json\__init__.py", line 178, in dump
    for chunk in iterable:
  File "C:\Python31\lib\json\encoder.py", line 408, in _iterencode
    for chunk in _iterencode_dict(o, _current_indent_level):
  File "C:\Python31\lib\json\encoder.py", line 382, in _iterencode_dict
    for chunk in chunks:
  File "C:\Python31\lib\json\encoder.py", line 416, in _iterencode
    o = _default(o)
  File "C:\Python31\lib\json\encoder.py", line 170, in default
    raise TypeError(repr(o) + " is not JSON serializable")
TypeError: b'\xDE\xD5\xB4\xF8' is not JSON serializable

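① Look at this data structure again: it contains a bytes value.
② Open the file for writing with UTF-8 encoding.
③ It fails, with the message: TypeError: b'\xDE\xD5\xB4\xF8' is not JSON serializable

The error occurs because json.dump() does not support serializing bytes objects. If you really need to store bytes objects, you have to define your own serialization format: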

def to_json(python_object): ①
    if isinstance(python_object, bytes): ②
        return {'__class__': 'bytes', '__value__': list(python_object)} ③
    raise TypeError(repr(python_object) + ' is not JSON serializable') ④

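① Define the serialization function; its parameter is the object whose type JSON does not support.
② Check the type. This is not strictly required, but it matters: if your function only ever serializes one type it makes little difference, but once it handles several types it has to tell them apart.
③ In this example I chose to convert the bytes object into a dictionary. The __class__ key holds the original data type (as the string 'bytes'), and the __value__ key holds the "actual" value. "Actual" here does not mean the bytes object itself but something JSON can serialize, in this case a list, because a bytes object is a sequence of integers in the range 0 to 255. So list() turns b'\xDE\xD5\xB4\xF8' into the list [222, 213, 180, 248]. (Hexadecimal \xDE is decimal 222.)
④ This line is important. If the data type is neither natively supported by JSON nor handled by your function, you must raise a TypeError so that json.dump() knows your custom serializer did not recognize the type.

That's it; you don't need to do anything else. You don't have to do the whole serialization yourself, only convert the value into a type JSON supports. json.dump() takes care of the rest.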
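
A self-contained sketch of that division of labor, using json.dumps() so no file is needed. The to_json() function repeats the one above; the shortened `entry` dictionary and the complex-number check are made up for illustration.

```python
import json

def to_json(python_object):
    # Tag the value with its original type so it can be recognized later.
    if isinstance(python_object, bytes):
        return {'__class__': 'bytes', '__value__': list(python_object)}
    raise TypeError(repr(python_object) + ' is not JSON serializable')

entry = {'title': 'Dive into history, 2009 edition',
         'internal_id': b'\xDE\xD5\xB4\xF8'}

# json.dumps() encodes the str and dict values itself and calls to_json()
# only for the bytes value it cannot handle.
text = json.dumps(entry, default=to_json)
print(text)

# A type that to_json() does not recognize still ends in a TypeError.
try:
    json.dumps({'n': 1 + 2j}, default=to_json)
    complex_ok = True
except TypeError:
    complex_ok = False
print(complex_ok)
```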

>>> import customserializer ①
>>> with open('entry.json', 'w', encoding='utf-8') as f: ②
...     json.dump(entry, f, default=customserializer.to_json) ③
...
Traceback (most recent call last):
  File "<stdin>", line 9, in <module>
    json.dump(entry, f, default=customserializer.to_json)
  File "C:\Python31\lib\json\__init__.py", line 178, in dump
    for chunk in iterable:
  File "C:\Python31\lib\json\encoder.py", line 408, in _iterencode
    for chunk in _iterencode_dict(o, _current_indent_level):
  File "C:\Python31\lib\json\encoder.py", line 382, in _iterencode_dict
    for chunk in chunks:
  File "C:\Python31\lib\json\encoder.py", line 416, in _iterencode
    o = _default(o)
  File "/Users/pilgrim/diveintopython3/examples/customserializer.py", line 12, in to_json
    raise TypeError(repr(python_object) + ' is not JSON serializable') ④
TypeError: time.struct_time(tm_year=2009, tm_mon=3, tm_mday=27, tm_hour=22, tm_min=20, tm_sec=42, tm_wday=4, tm_yday=86, tm_isdst=-1) is not JSON serializable

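① The customserializer module is where the to_json() function you just defined lives.
② Text write mode, UTF-8 encoding.
③ This is the important part: hook your custom conversion function into json.dump() by passing it as the default parameter.
④ It still raises an error, this time because the time.struct_time object cannot be serialized.

Modify the function once more: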

import time
def to_json(python_object):
    if isinstance(python_object, time.struct_time): ①
        return {'__class__': 'time.asctime', '__value__': time.asctime(python_object)} ②
    if isinstance(python_object, bytes):
        return {'__class__': 'bytes', '__value__': list(python_object)}
    raise TypeError(repr(python_object) + ' is not JSON serializable')

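① Add this to the existing customserializer.to_json() function to check whether python_object is a time.struct_time.
② If so, do something similar to the bytes conversion: convert the time.struct_time object into a type that JSON can serialize. In this example, time.asctime() converts the time.struct_time into the string 'Fri Mar 27 22:20:42 2009'.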

Now run it again:

>>> with open('entry.json', 'w', encoding='utf-8') as f:
...     json.dump(entry, f, default=customserializer.to_json)

This time no error is raised.
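
The time conversion used above can be tried on its own, as in this sketch. One thing to check on your own interpreter: on recent Python 3 versions time.struct_time appears to be a tuple subclass, in which case the json encoder may write it as a plain JSON array without ever calling default. The traceback shown earlier comes from Python 3.1; the last line of the sketch prints what your interpreter reports.

```python
import time

published = time.strptime('Fri Mar 27 22:20:42 2009')

as_text = time.asctime(published)   # struct_time -> str
back = time.strptime(as_text)       # str -> struct_time

print(as_text)
print(back == published)

# If this prints True, json can encode a struct_time as a list directly
# and the default hook is not consulted for it.
print(isinstance(published, tuple))
```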

(8) Loading data back from a JSON file

Like the pickle module, the json module has a load() function that takes a stream object and reads JSON-encoded data from it.

>>> import json
>>> with open('entry.json', 'r', encoding='utf-8') as f:
...     entry = json.load(f) ①
...
>>> entry ②
{'comments_link': None,
'internal_id': {'__class__': 'bytes', '__value__': [222, 213, 180, 248]},
'title': 'Dive into history, 2009 edition',
'tags': ['diveintopython', 'docbook', 'html'],
'article_link': 'http://diveintomark.org/archives/2009/03/27/dive-into-history-2009-edition',
'published_date': {'__class__': 'time.asctime', '__value__': 'Fri Mar 27 22:20:42 2009'},
'published': True}

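① As with pickle's load(), pass load() a stream object and it returns a Python object.
② The data structure is restored, but 'internal_id' and 'published_date' come back as dictionaries, because load() knows nothing about the custom conversion function used when saving.

To get back the original data structure, you also need to define the inverse of to_json(), a function called from_json():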

def from_json(json_object): ①
    if '__class__' in json_object: ②
        if json_object['__class__'] == 'time.asctime':
            return time.strptime(json_object['__value__']) ③
        if json_object['__class__'] == 'bytes':
            return bytes(json_object['__value__']) ④
    return json_object

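① This conversion function takes one parameter and returns one value. The parameter is not a string, though; it is a Python object, the result of decoding a JSON-encoded string into Python (see the mapping between JSON and Python 3 data types above).
② Check for the marker: if the object contains '__class__', restore it the way we defined.
③ time.asctime() and time.strptime() are inverses: asctime() turns a time.struct_time into a string, and strptime() turns the string back into a struct_time.
④ Convert the list back into a bytes object.

With the groundwork done, call it like this: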

>>> import customserializer
>>> with open('entry.json', 'r', encoding='utf-8') as f:
...     entry = json.load(f, object_hook=customserializer.from_json) ①
...
>>> entry ②
{'comments_link': None,
'internal_id': b'\xDE\xD5\xB4\xF8',
'title': 'Dive into history, 2009 edition',
'tags': ['diveintopython', 'docbook', 'html'],
'article_link': 'http://diveintomark.org/archives/2009/03/27/dive-into-history-2009-edition',
'published_date': time.struct_time(tm_year=2009, tm_mon=3, tm_mday=27, tm_hour=22, tm_min=20, tm_sec=42, tm_wday=4, tm_yday=86, tm_isdst=-1),
'published': True}

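① Hook the custom from_json() function into load() by passing it to json.load() as the object_hook parameter.
② Print entry to inspect its structure. Note that 'tags' comes back as a list rather than a tuple, because JSON makes no distinction between the two.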
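
The whole round trip can be sketched in memory with json.dumps() and json.loads(). This version handles only the bytes case; the two functions are trimmed copies of the ones above, and the shortened `entry` dictionary is made up for illustration.

```python
import json

def to_json(python_object):
    if isinstance(python_object, bytes):
        return {'__class__': 'bytes', '__value__': list(python_object)}
    raise TypeError(repr(python_object) + ' is not JSON serializable')

def from_json(json_object):
    # The decoder calls this for every JSON object (dict) it builds.
    if json_object.get('__class__') == 'bytes':
        return bytes(json_object['__value__'])
    return json_object

entry = {'title': 'Dive into history, 2009 edition',
         'internal_id': b'\xDE\xD5\xB4\xF8',
         'tags': ('diveintopython', 'docbook', 'html')}

text = json.dumps(entry, default=to_json)
plain = json.loads(text)                            # no hook: tagged dict
restored = json.loads(text, object_hook=from_json)  # hook: bytes again

print(plain['internal_id'])
print(restored['internal_id'])
print(restored['tags'])
```

Without the hook the tagged dictionary is returned as-is; with it the bytes object is rebuilt, while the tuple still comes back as a list.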
