Basic types
Data types
Support for MaxCompute data types is located in the odps.types package. All data types are represented as instances of subclasses of odps.types.DataType. For instance, the 64-bit integer type is represented by an instance of odps.types.Bigint, and an array of 32-bit integers, array<int>, is represented by an instance of odps.types.Array whose value_type attribute is an instance of odps.types.Int.
Note
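By default, PyODPS does not enable full support for types other than bigint, string, double, boolean, datetime and decimal. To make full use of the other types, set the option
options.sql.use_odps2_extension = True. For how to set options, see the configuration documentation.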
Specify types by strings
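In most cases, you can represent a type in PyODPS directly with the string used for it in MaxCompute DDL, which spares you from learning how types are implemented. For example, when creating a column instance, you can pass array<int> to stand for an array of 32-bit integers without caring which class implements it: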
>>> import odps.types as odps_types
>>>
>>> column = odps_types.Column("col", "array<int>")
>>> print(type(column.type))
<class 'odps.types.Array'>
>>> print(type(column.type.value_type))
<class 'odps.types.Int'>
You can use the validate_data_type() function to get type instances from string representations if needed.
>>> from odps.types import validate_data_type
>>>
>>> array_type = validate_data_type("array<bigint>")
>>> print(array_type.value_type)
bigint
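To make the idea of nested type strings more concrete, the following is a minimal, standard-library-only sketch of how a string such as map<string, array<bigint>> can be broken into a tree. It is not how PyODPS implements validate_data_type(); the helper names split_top_level and parse_type_string and the tuple-based output are assumptions made for illustration, and only array and map style nesting is handled.

```python
def split_top_level(text):
    """Split on commas that are not nested inside <...> or (...)."""
    parts, depth, start = [], 0, 0
    for i, ch in enumerate(text):
        if ch in "<(":
            depth += 1
        elif ch in ">)":
            depth -= 1
        elif ch == "," and depth == 0:
            parts.append(text[start:i].strip())
            start = i + 1
    parts.append(text[start:].strip())
    return parts


def parse_type_string(text):
    """Hypothetical helper: turn a nested type string into a nested tuple."""
    text = text.strip().lower()
    if "<" not in text:
        return text  # simple type such as 'bigint' or 'decimal(10, 2)'
    name, _, rest = text.partition("<")
    if not rest.endswith(">"):
        raise ValueError("unbalanced angle brackets in %r" % text)
    args = split_top_level(rest[:-1])
    return (name.strip(),) + tuple(parse_type_string(arg) for arg in args)


parsed = parse_type_string("map<string, array<bigint>>")
print(parsed)  # ('map', 'string', ('array', 'bigint'))
```

In real code, prefer validate_data_type(), which returns proper type instances instead of tuples.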
Variable-length types
MaxCompute supports variable-length types such as char / varchar, for which a maximum length can be defined, and decimal, for which a precision and a number of decimal digits (scale) can be defined. You can construct instances of these types by calling the constructors of the corresponding type descriptor classes or, as in the example below, from type strings. For instance,
>>> from odps.types import validate_data_type
>>>
>>> # define char / varchar type instances with size limit 10
>>> char_type = validate_data_type('char(10)')
>>> varchar_type = validate_data_type('varchar(10)')
>>> # define decimal type instance with precision 10 and decimal scale 2
>>> decimal_type = validate_data_type('decimal(10, 2)')
The size limit of char / varchar type instances is available through the size_limit attribute, while the precision and decimal scale of decimal type instances are available through the precision and scale attributes.
>>> from odps.types import validate_data_type
>>>
>>> # get size limitation of char and varchar type
>>> char_type = validate_data_type('char(10)')
>>> print("size_limit:", char_type.size_limit)
size_limit: 10
>>> # get precision and decimal scale of decimal type
>>> decimal_type = validate_data_type('decimal(10, 2)')
>>> print("precision:", decimal_type.precision, "scale:", decimal_type.scale)
precision: 10 scale: 2
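As a rough illustration of what precision and scale mean for a value, the sketch below checks whether a number fits decimal(precision, scale) without rounding: at most scale digits after the decimal point and at most precision - scale digits before it. The helper fits_decimal is hypothetical and uses only the standard library decimal module; it does not describe how PyODPS or MaxCompute validate or round values.

```python
from decimal import Decimal, localcontext


def fits_decimal(value, precision, scale):
    """Hypothetical check: is `value` exactly representable as decimal(precision, scale)?"""
    d = Decimal(value)
    if not d.is_finite():
        return False  # NaN and Infinity never fit
    with localcontext() as ctx:
        ctx.prec = 100  # ample working precision for the comparison itself
        quantum = Decimal(1).scaleb(-scale)         # e.g. Decimal('0.01') for scale 2
        limit = Decimal(10) ** (precision - scale)  # first value with too many integer digits
        return d == d.quantize(quantum) and abs(d) < limit


print(fits_decimal("12345678.90", 10, 2))  # True: 8 integer digits, 2 decimal digits
print(fits_decimal("123456789", 10, 2))    # False: 9 integer digits
print(fits_decimal("1.234", 10, 2))        # False: 3 decimal digits
```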
Composite types
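MaxCompute supports the composite types Array, Map and Struct. You can obtain the corresponding type descriptor instances through constructors or type strings. The following example shows how to create Array and Map type descriptors.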
>>> import odps.types as odps_types
>>>
>>> # create an array type descriptor with value type as bigint
>>> array_type = odps_types.Array(odps_types.bigint)
>>> # create a map type descriptor with key type as string and value type as array<bigint>
>>> map_type = odps_types.Map(odps_types.string, odps_types.Array(odps_types.bigint))
Use a type string to create the same type instance:
>>> from odps.types import validate_data_type
>>>
>>> # create an array type descriptor with value type as bigint
>>> array_type = validate_data_type("array<bigint>")
>>> # create a map type descriptor with key type as string and value type as array<bigint>
>>> map_type = validate_data_type("map<string, array<bigint>>")
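The element type of an Array type descriptor can be obtained through the value_type attribute. For a Map type descriptor, the key type can be obtained through the key_type attribute and the value type through the value_type attribute.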
>>> from odps.types import validate_data_type
>>>
>>> # get value type of an array instance
>>> array_type = validate_data_type("array<bigint>")
>>> print("value_type:", array_type.value_type)
value_type: bigint
>>> # get key and value type of a map instance
>>> map_type = validate_data_type("map<string, array<bigint>>")
>>> print("key_type:", map_type.key_type, "value_type:", map_type.value_type)
key_type: string value_type: array<bigint>
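You can create a Struct type descriptor from a dict[str, DataType] or a list[tuple[str, DataType]]. For dict, note that Python 3.6 and earlier do not guarantee dict ordering, so the resulting field order may differ from what you expect. The following example shows how to create Struct type descriptors.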
>>> import odps.types as odps_types
>>>
>>> # create a Struct instance with a list of tuples, containing two fields
>>> # a and b, whose types are bigint and string
>>> struct_type = odps_types.Struct(
...     [("a", odps_types.bigint), ("b", odps_types.string)]
... )
>>> # create a Struct instance identical to the one above, from a dict
>>> struct_type = odps_types.Struct(
...     {"a": odps_types.bigint, "b": odps_types.string}
... )
Use a type string to create the same type instance:
>>> from odps.types import validate_data_type
>>>
>>> struct_type = validate_data_type("struct<a:bigint, b:string>")
The field types of a Struct type descriptor can be obtained through the field_types attribute, which is an OrderedDict instance mapping field names to field types.
>>> from odps.types import validate_data_type
>>>
>>> # obtain field types of the Struct instance
>>> struct_type = validate_data_type("struct<a:bigint, b:string>")
>>> for field_name, field_type in struct_type.field_types.items():
...     print("field_name:", field_name, "field_type:", field_type)
field_name: a field_type: bigint
field_name: b field_type: string
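Field order is part of a struct definition, which is why a list of tuples or an ordered mapping is the safer input on older Python versions. The sketch below uses only the standard library to show an ordered field mapping built from (name, type string) pairs. The helper to_struct_string is hypothetical and not a PyODPS API.

```python
from collections import OrderedDict

# Field definitions as (name, type string) pairs; this order is the order
# the struct fields are meant to have.
fields = [("b", "string"), ("a", "bigint")]

# An OrderedDict keeps insertion order on every Python version, unlike a
# plain dict on Python 3.6 and earlier.
field_types = OrderedDict(fields)


def to_struct_string(ordered_fields):
    """Hypothetical helper: render an ordered field mapping as a type string."""
    inner = ", ".join(
        "%s:%s" % (name, type_name) for name, type_name in ordered_fields.items()
    )
    return "struct<%s>" % inner


print(list(field_types))              # ['b', 'a']
print(to_struct_string(field_types))  # struct<b:string, a:bigint>
```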
Table schema and related classes
Note
Code in this section is only guaranteed to work under PyODPS 0.11.3 and later versions. For PyODPS earlier than 0.11.3, please replace class odps.models.TableSchema with odps.models.Schema.
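The TableSchema type represents the structure of a table, including column names and types. You can initialize it with the table's columns and, optionally, its partitions.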
>>> from odps.models import TableSchema, Column, Partition
>>> columns = [Column(name='num', type='bigint', comment='the column'),
...            Column(name='num2', type='double', comment='the column2'),
...            Column(name='arr', type='array<int>', comment='the column3')]
>>> partitions = [Partition(name='pt', type='string', comment='the partition')]
>>> schema = TableSchema(columns=columns, partitions=partitions)
>>> print(schema)
odps.Schema {
num bigint # the column
num2 double # the column2
arr array<int> # the column3
}
Partitions {
pt string # the partition
}
Alternatively, you can use TableSchema.from_lists() to initialize the schema. This method is simpler, but you cannot directly set comments for the columns and partitions.
>>> from odps.models import TableSchema, Column, Partition
>>>
>>> schema = TableSchema.from_lists(
...     ['num', 'num2', 'arr'], ['bigint', 'double', 'array<int>'], ['pt'], ['string']
... )
>>> print(schema)
odps.Schema {
num bigint
num2 double
arr array<int>
}
Partitions {
pt string
}
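Because from_lists() takes names and types as separate parallel lists, it can help to keep each name next to its type and split them just before the call. The sketch below does that with plain Python. The names column_defs, partition_defs and unzip are illustrative assumptions, and the PyODPS call is only attempted when the package can be imported.

```python
# (name, type string) pairs kept together so names and types cannot drift apart
column_defs = [("num", "bigint"), ("num2", "double"), ("arr", "array<int>")]
partition_defs = [("pt", "string")]


def unzip(pairs):
    """Split (name, type) pairs into a list of names and a list of types."""
    names = [name for name, _ in pairs]
    types = [type_name for _, type_name in pairs]
    return names, types


col_names, col_types = unzip(column_defs)
pt_names, pt_types = unzip(partition_defs)
print(col_names, col_types, pt_names, pt_types)

try:
    from odps.models import TableSchema  # requires PyODPS 0.11.3 or later
except ImportError:
    TableSchema = None  # PyODPS not available; skip the schema construction

if TableSchema is not None:
    schema = TableSchema.from_lists(col_names, col_types, pt_names, pt_types)
    print(schema)
```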
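You can get the ordinary columns and the partition columns of a table from a TableSchema instance. The simple_columns and partitions attributes refer to ordinary columns and partition columns respectively, while the columns attribute refers to all columns. All three attributes return lists of Column or Partition instances. You can also get the names and types of non-partition columns through the names and types attributes.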
>>> from odps.models import TableSchema, Column, Partition
>>>
>>> schema = TableSchema.from_lists(
...     ['num', 'num2', 'arr'], ['bigint', 'double', 'array<int>'], ['pt'], ['string']
... )
>>> print(schema.columns) # list of Column type
[<column num, type bigint>,
<column num2, type double>,
<column arr, type array<int>>,
<partition pt, type string>]
>>> print(schema.simple_columns) # list of Column type
[<column num, type bigint>,
<column num2, type double>,
<column arr, type array<int>>]
>>> print(schema.partitions) # list of Partition type
[<partition pt, type string>]
>>> print(schema.simple_columns[-1].type.value_type) # value type of the last array column
int
>>> print(schema.names)  # get column names of non-partition columns
['num', 'num2', 'arr']
>>> print(schema.types)  # get column types of non-partition columns
[bigint, double, array<int>]
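When working with TableSchema, the Column and Partition types represent the columns and partitions of a table. You can create a Column instance from a column name and a type, optionally specifying a comment and whether the column is nullable. You can also read the name, type and other properties of a column through the corresponding attributes, where the type is one of the type instances described in Data types.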
>>> from odps.models import Column
>>>
>>> col = Column(name='num_col', type='array<int>', comment='comment of the col', nullable=False)
>>> print(col)
<column num_col, type array<int>, not null>
>>> print(col.name)
num_col
>>> print(col.type)
array<int>
>>> print(col.type.value_type)
int
>>> print(col.comment)
comment of the col
>>> print(col.nullable)
False
As Partition is simply a subclass of Column that differs only in name, it is not described separately here.
Records
The Record type represents a single row of a table. It is the data structure used by Table.open_reader() / Table.open_writer() when arrow=False, and also by TableDownloadSession.open_record_reader() / TableUploadSession.open_record_writer(). Calling new_record on a Table object creates a new Record.
Assume that the table used in the example below has the following schema:
odps.Schema {
c_int_a bigint
c_string_a string
c_bool_a boolean
c_datetime_a datetime
c_array_a array<string>
c_map_a map<bigint,string>
c_struct_a struct<a:bigint,b:string>
}
Records of this table can be created, written and read as follows:
>>> import datetime
>>>
>>> # `o` is assumed to be an already initialized ODPS entry object
>>> t = o.get_table('mytable')
>>> # the number of values must match the number of columns in the schema
>>> r = t.new_record([1024, 'val1', False, datetime.datetime.now(), None, None, None])
>>> r2 = t.new_record()  # initializing without values is also acceptable
>>> r2[0] = 1024  # values can be set via column indices
>>> r2['c_string_a'] = 'val1'  # values can also be set via column names
>>> r2.c_string_a = 'val1'  # values can also be set via attributes
>>> r2.c_array_a = ['val1', 'val2']  # set value of fields with array type
>>> r2.c_map_a = {1: 'val1'}  # set value of fields with map type
>>> r2.c_struct_a = (1, 'val1')  # struct fields accept Python tuples when PyODPS >= 0.11.5
>>> r2.c_struct_a = {"a": 1, "b": 'val1'}  # Python dicts can also be used for struct fields
>>>
>>> print(r[0])  # get the value of column 0
>>> print(r['c_string_a'])  # get value via column name
>>> print(r.c_string_a)  # get value via attribute
>>> print(r[0: 3])  # slice over the columns
>>> print(r[0, 2, 3])  # get multiple values via indices
>>> print(r['c_int_a', 'c_string_a'])  # get multiple values via column names
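The tuple and dict forms used for c_struct_a above carry the same information: one identifies fields by position, the other by name. The standard-library sketch below uses a namedtuple as a stand-in for a struct<a:bigint,b:string> value to show how the two forms relate. The class name StructAB is an assumption for this sketch, which does not reproduce the conversion PyODPS performs internally.

```python
from collections import namedtuple

# Stand-in for the Python value of a struct<a:bigint,b:string> field
StructAB = namedtuple("StructAB", ["a", "b"])

from_tuple = StructAB(*(1, "val1"))            # fields given by position
from_dict = StructAB(**{"a": 1, "b": "val1"})  # fields given by name

print(from_tuple == from_dict)      # True
print(from_tuple.a, from_tuple[1])  # 1 val1
print(dict(from_tuple._asdict()))   # {'a': 1, 'b': 'val1'}
```

A namedtuple supports both attribute and index access, so code written against either the tuple or the dict form can be adapted with little effort.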
The mapping between MaxCompute data types and Python types in Records is listed below.
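| MaxCompute Type | Python Type | Comments |
|---|---|---|
| tinyint, smallint, int, bigint | int | |
| float, double | float | |
| string | str | See Note 1 |
| binary | bytes | |
| datetime | datetime.datetime | See Note 2 |
| date | datetime.date | |
| boolean | bool | |
| decimal | decimal.Decimal | See Note 3 |
| map | dict | |
| array | list | |
| struct | tuple / namedtuple | See Note 4 |
| timestamp | pandas.Timestamp | See Note 2. Pandas is needed. |
| timestamp_ntz | pandas.Timestamp | Results not affected by time zone. Pandas is needed. |
| interval_day_time | pandas.Timedelta | Pandas is needed. |
| interval_year_month | odps.Monthdelta | See Note 5 |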
Comments for some types are listed as follows.
1. PyODPS reads MaxCompute strings as Python unicode strings, that is, as str in Python 3 and unicode in Python 2. If you store binary data in string fields, you may have to set options.tunnel.string_as_binary = True to avoid encoding issues.
2. PyODPS uses the local time zone by default. To use UTC instead, set options.local_timezone = False. To use another time zone, set options.local_timezone to that time zone, for instance Asia/Shanghai. Note that MaxCompute does not store time zone values, so time values are converted to Unix timestamps before being stored.
3. For Python 2, PyODPS uses cdecimal.Decimal when the cdecimal package is installed.
4. For PyODPS < 0.11.5, a MaxCompute struct maps to a Python dict. For PyODPS >= 0.11.5, it maps to a namedtuple by default. To restore the old behavior, set options.struct_as_dict = True. In DataWorks, this option defaults to True to stay compatible with existing code. When setting a struct field on a Record, PyODPS >= 0.11.5 accepts both dict and tuple values, while earlier versions accept only dict values.
5. Monthdelta can be initialized with years and months. An example of using the Monthdelta class is shown below.
>>> from odps import Monthdelta
>>>
>>> md = Monthdelta(years=1, months=2)
>>> print(md.years)
1
>>> print(md.months)
2
>>> print(md.total_months)
14
For details about how to configure options.xxx, please take a look at the configuration documentation.
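Since only a Unix timestamp is stored, the time zone in effect when writing decides which instant is recorded, and the time zone in effect when reading decides how it is displayed. The standard-library sketch below illustrates this with fixed-offset zones. It does not use PyODPS and does not reproduce its internal conversion; the sample date is arbitrary.

```python
from datetime import datetime, timedelta, timezone

# Fixed-offset zones keep the sketch independent of the system tz database
utc = timezone.utc
shanghai = timezone(timedelta(hours=8), "Asia/Shanghai")

# One wall-clock reading interpreted in two different time zones
wall_clock = datetime(2024, 1, 1, 8, 0, 0)
as_shanghai = wall_clock.replace(tzinfo=shanghai)
as_utc = wall_clock.replace(tzinfo=utc)

# Only the Unix timestamp survives storage, so the zone used when writing
# determines the stored instant.
ts_shanghai = as_shanghai.timestamp()
ts_utc = as_utc.timestamp()
print(ts_utc - ts_shanghai)  # 28800.0, i.e. 8 hours apart

# Reading the stored timestamp back in another zone shifts the wall clock
read_back = datetime.fromtimestamp(ts_shanghai, tz=utc)
print(read_back.isoformat())  # 2024-01-01T00:00:00+00:00
```

Writing and reading with the same options.local_timezone setting avoids this kind of shift.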