基本类型

数据类型

PyODPS 对 MaxCompute 中的类型支持实现于 odps.types 包中。所有的数据类型均表示为 odps.types.DataType 类的子类生成的实例。例如，64位整数类型 bigint 使用 odps.types.Bigint 的实例表示，而32位整数数组类型 array<int> 使用 odps.types.Array 的实例表示，且其 value_type 属性为 odps.types.Int 类型。

备注

PyODPS 默认不开放对 bigint、string、double、boolean、datetime、decimal 类型外其他类型的完整支持。需要完整使用除这些类型外的其他类型，需要设置选项 options.sql.use_odps2_extension = True。关于设置选项可参考这份文档。

通过字符串指定类型实例

通常情况下，在 PyODPS 中，你都可以直接用 MaxCompute DDL 中表示类型的字符串来表示类型，这可以避免了解类型的实现细节。例如，当我们创建一个列实例，可以直接传入 array<int> 代表一个32位整数数组，而不需要关心使用哪个类去实现：

>>> import odps.types as odps_types
>>>
>>> column = odps_types.Column("col", "array<int>")
>>> print(type(column.type))
<class 'odps.types.Array'>
>>> print(type(column.type.value_type))
<class 'odps.types.Int'>

如果需要，你可以使用 validate_data_type() 函数获取字符串表示的 MaxCompute 类型实例。

>>> from odps.types import validate_data_type
>>>
>>> array_type = validate_data_type("array<bigint>")
>>> print(array_type.value_type)
bigint

可定义大小的类型

MaxCompute 部分类型可定义类型的大小，例如 char / varchar 可以定义最大长度，Decimal 可以定义精度（precision）和小数位数（scale）。定义这些类型时，可以构造对应类型描述类的实例，例如

>>> from odps.types import validate_data_type
>>>
>>> # 定义 char / varchar 类型实例，长度为 10
>>> char_type = validate_data_type('char(10)')
>>> varchar_type = validate_data_type('varchar(10)')
>>> # 定义 decimal 类型实例，精度为 10，小数位数为 2
>>> decimal_type = validate_data_type('decimal(10, 2)')

char / varchar 类型实例的大小可通过 size_limit 属性获取，而 decimal 类型实例的精度和小数位数可通过 precision 和 scale 属性获取。

>>> from odps.types import validate_data_type
>>>
>>> # 获取 char / varchar 类型长度
>>> char_type = validate_data_type('char(10)')
>>> print("size_limit:", char_type.size_limit)
size_limit: 10
>>> # 获取 decimal 类型精度和小数位数
>>> decimal_type = validate_data_type('decimal(10, 2)')
>>> print("precision:", decimal_type.precision, "scale:", decimal_type.scale)
precision: 10 scale: 2

复合类型

MaxCompute 支持的复合类型有 Array、Map 和 Struct，可通过构造函数或者类型字符串获取对应的类型描述类实例。下面的例子展示了如何创建 Array 和 Map 类型描述实例。

>>> import odps.types as odps_types
>>>
>>> # 创建值类型为 bigint 的 Array 类型描述实例
>>> array_type = odps_types.Array(odps_types.bigint)
>>> # 创建关键字类型为 string，值类型为 array<bigint> 的 Map 类型描述实例
>>> map_type = odps_types.Map(odps_types.string, odps_types.Array(odps_types.bigint))

使用字符串生成相同的类型：

>>> from odps.types import validate_data_type
>>>
>>> # 创建值类型为 bigint 的 Array 类型描述实例
>>> array_type = validate_data_type("array<bigint>")
>>> # 创建关键字类型为 string，值类型为 array<bigint> 的 Map 类型描述实例
>>> map_type = validate_data_type("map<string, array<bigint>>")

Array 类型描述实例的元素类型可通过 value_type 属性获取。Map 类型描述实例的关键字类型可通过 key_type 属性获取，而值类型可通过 value_type 属性获取。

>>> from odps.types import validate_data_type
>>>
>>> # 获取 Array 类型元素类型
>>> array_type = validate_data_type("array<bigint>")
>>> print("value_type:", array_type.value_type)
value_type: bigint
>>> # 获取 Map 类型关键字类型和值类型
>>> map_type = validate_data_type("map<string, array<bigint>>")
>>> print("key_type:", map_type.key_type, "value_type:", map_type.value_type)
key_type: string value_type: array<bigint>

你可以通过 dict[str, DataType] 或者 list[tuple[str, DataType]] 创建 Struct 类型描述实例。对于 dict 类型，需要注意在 Python 3.6 及之前版本，Python 不保证 dict 的顺序，这可能导致定义的字段类型与预期不符。下面的例子展示了如何创建 Struct 类型描述实例。

>>> import odps.types as odps_types
>>>
>>> # 通过 tuple 列表创建一个 Struct 类型描述实例，其中包含两个字段，
>>> # 分别名为 a 和 b，类型分别为 bigint 和 string
>>> struct_type = odps_types.Struct(
>>>     [("a", odps_types.bigint), ("b", odps_types.string)]
>>> )
>>> # 通过 dict 创建一个相同的 Struct 类型描述实例
>>> struct_type = odps_types.Struct(
>>>     {"a": odps_types.bigint, "b": odps_types.string}
>>> )

使用字符串生成相同的类型：

>>> from odps.types import validate_data_type
>>>
>>> struct_type = validate_data_type("struct<a:bigint, b:string>")

Struct 类型描述实例的各个字段类型可通过 field_types 属性获取，该属性为一个由字段名和字段类型组成的 OrderedDict 实例。

>>> from odps.types import validate_data_type
>>>
>>> # 获取 Struct 类型各个字段类型
>>> struct_type = validate_data_type("struct<a:bigint, b:string>")
>>> for field_name, field_type in struct_type.field_types.items():
>>>     print("field_name:", field_name, "field_type:", field_type)
field_name: a field_type: bigint
field_name: b field_type: string

表结构及相关类

备注

本章节中的代码对 PyODPS 0.11.3 及后续版本有效。对早于 0.11.3 版本的 PyODPS，请使用 odps.models.Schema 代替 odps.models.TableSchema。

TableSchema 类型用于表示表的结构，其中包含字段名称和类型。你可以使用表的列以及（可选的）分区来初始化。

>>> from odps.models import TableSchema, Column, Partition
>>>
>>> columns = [
>>>     Column(name='num', type='bigint', comment='the column'),
>>>     Column(name='num2', type='double', comment='the column2'),
>>>     Column(name='arr', type='array<int>', comment='the column3'),
>>> ]
>>> partitions = [Partition(name='pt', type='string', comment='the partition')]
>>> schema = TableSchema(columns=columns, partitions=partitions)
>>> print(schema)
odps.Schema {
  num     bigint      # the column
  num2    double      # the column2
  arr     array<int>  # the column3
}
Partitions {
  pt      string      # the partition
}

第二种方法是使用 TableSchema.from_lists() 方法。这种方法更容易调用，但无法直接设置列和分区的注释。

>>> from odps.models import TableSchema, Column, Partition
>>>
>>> schema = TableSchema.from_lists(
>>>    ['num', 'num2', 'arr'], ['bigint', 'double', 'array<int>'], ['pt'], ['string']
>>> )
>>> print(schema)
odps.Schema {
  num     bigint
  num2    double
  arr     array<int>
}
Partitions {
  pt      string
}

你可以从 TableSchema 实例中获取表的一般字段和分区字段。simple_columns 和 partitions 属性分别指代一般列和分区列，而 columns 属性则指代所有字段。这三个属性的返回值均为 Column 或 Partition 类型组成的列表。你也可以通过 names 和 types 属性分别获取非分区字段的名称和类型。

>>> from odps.models import TableSchema, Column, Partition
>>>
>>> schema = TableSchema.from_lists(
>>>    ['num', 'num2', 'arr'], ['bigint', 'double', 'array<int>'], ['pt'], ['string']
>>> )
>>> print(schema.columns)  # 类型为 Column 的列表
[<column num, type bigint>,
 <column num2, type double>,
 <column arr, type array<int>>,
 <partition pt, type string>]
>>> print(schema.simple_columns)  # 类型为 Column 的列表
[<column num, type bigint>,
 <column num2, type double>,
 <column arr, type array<int>>]
>>> print(schema.partitions)  # 类型为 Partition 的列表
[<partition pt, type string>]
>>> print(schema.simple_columns[-1].type.value_type)  # 获取最后一列数组的值类型
int
>>> print(schema.names)  # 获取非分区字段的字段名
['num', 'num2']
>>> print(schema.types)  # 获取非分区字段的字段类型
[bigint, double]

在使用 TableSchema 时，Column 和 Partition 类型分别用于表示表的字段和分区。你可以通过字段名和类型创建 Column 实例，也可以同时指定列注释以及字段是否可以为空。你也可以通过相应的字段获取字段的名称、类型等属性，其中类型为:ref:`数据类型 <data_types>`中的类型实例。

>>> from odps.models import Column
>>>
>>> col = Column(name='num_col', type='array<int>', comment='comment of the col', nullable=False)
>>> print(col)
<column num_col, type array<int>, not null>
>>> print(col.name)
num_col
>>> print(col.type)
array<int>
>>> print(col.type.value_type)
int
>>> print(col.comment)
comment of the col
>>> print(col.nullable)
False

相比 Column 类型，Partition 类型仅仅是类名有差异，此处不再介绍。

行记录（Record）

Record 类型表示表的一行记录，为 Table.open_reader() / Table.open_reader() 当 arrow=False 时所使用的数据结构，也用于 TableDownloadSession.open_record_reader() / TableUploadSession.open_record_writer() 。我们在 Table 对象上调用 new_record 就可以创建一个新的 Record。

下面的例子中，假定表结构为

odps.Schema {
  c_int_a                 bigint
  c_string_a              string
  c_bool_a                boolean
  c_datetime_a            datetime
  c_array_a               array<string>
  c_map_a                 map<bigint,string>
  c_struct_a              struct<a:bigint,b:string>
}

该表对应 record 的修改和读取示例为

>>> import datetime
>>> t = o.get_table('mytable')
>>> r = t.new_record([1024, 'val1', False, datetime.datetime.now(), None, None])  # 值的个数必须等于表schema的字段数
>>> r2 = t.new_record()  # 初始化时也可以不传入值
>>> r2[0] = 1024  # 可以通过偏移设置值
>>> r2['c_string_a'] = 'val1'  # 也可以通过字段名设置值
>>> r2.c_string_a = 'val1'  # 通过属性设置值
>>> r2.c_array_a = ['val1', 'val2']  # 设置 array 类型的值
>>> r2.c_map_a = {1: 'val1'}  # 设置 map 类型的值
>>> r2.c_struct_a = (1, 'val1')  # 使用 tuple 设置 struct 类型的值，当 PyODPS >= 0.11.5
>>> r2.c_struct_a = {"a": 1, "b": 'val1'}  # 也可以使用 dict 设置 struct 类型的值
>>>
>>> print(record[0])  # 取第0个位置的值
>>> print(record['c_string_a'])  # 通过字段取值
>>> print(record.c_string_a)  # 通过属性取值
>>> print(record[0: 3])  # 切片操作
>>> print(record[0, 2, 3])  # 取多个位置的值
>>> print(record['c_int_a', 'c_double_a'])  # 通过多个字段取值

MaxCompute 不同数据类型在 Record 中对应 Python 类型的关系如下：

MaxCompute 类型	Python 类型	说明
`tinyint`, `smallint`, `int`, `bigint`	`int`
`float`, `double`	`float`
`string`	`str`	见说明1
`binary`	`bytes`
`datetime`	`datetime.datetime`	见说明2
`date`	`datetime.date`
`boolean`	`bool`
`decimal`	`decimal.Decimal`	见说明3
`map`	`dict`
`array`	`list`
`struct`	`tuple` / `namedtuple`	见说明4
`timestamp`	`pandas.Timestamp`	见说明2，需要安装 pandas
`timestamp_ntz`	`pandas.Timestamp`	结果不受时区影响，需要安装 pandas
`interval_day_time`	`pandas.Timedelta`	需要安装 pandas
`interval_year_month`	`odps.Monthdelta`	见说明5

对部分类型的说明如下。

PyODPS 默认 string 类型对应 Unicode 字符串，在 Python 3 中为 str，在 Python 2 中为 unicode。对于部分在 string 中存储 binary 的情形，可能需要设置 options.tunnel.string_as_binary = True 以避免可能的编码问题。
PyODPS 默认使用 Local Time 作为时区，如果要使用 UTC 则需要设置 options.local_timezone = False。如果要使用其他时区，需要设置该选项为指定时区，例如 Asia/Shanghai。MaxCompute 不会存储时区值，因而在写入数据时，会将该时间转换为 Unix Timestamp 进行存储。
对于 Python 2，当安装 cdecimal 包时，会使用 cdecimal.Decimal。
对于 PyODPS < 0.11.5，MaxCompute struct 对应 Python dict 类型。PyODPS >= 0.11.5 则默认对应 namedtuple 类型。如果要使用旧版行为则需要设置选项 options.struct_as_dict = True。DataWorks 环境下，为保持历史兼容性，该值默认为 False。为 Record 设置 struct 类型的字段值时，PyODPS >= 0.11.5 可同时接受 dict 和 tuple 类型，旧版则只接受 dict 类型。

Monthdelta 可使用年 / 月进行初始化，使用示例如下：

>>> from odps import Monthdelta
>>>
>>> md = Monthdelta(years=1, months=2)
>>> print(md.years)
1
>>> print(md.months)
1
>>> print(md.total_months)
14

关于如何设置 options.xxx，请参考文档配置选项。