Model objects
- class odps.models.Project(**kwargs)[source]
Project is the counterpart of a database in an RDBMS.
By getting a Project object, users can access properties such as name, owner, comment, creation_time, last_modified_time, and so on. These properties are not loaded from the remote ODPS service until users access them explicitly. To check the latest status, use the reload method.
- Example:
>>> project = odps.get_project('my_project')
>>> project.last_modified_time  # this property will be fetched from the remote ODPS service
>>> project.last_modified_time  # once loaded, the property will not trigger a remote call
>>> project.owner  # so it is with the other properties; they are fetched together
>>> project.reload()  # force an update of all the properties
>>> project.last_modified_time  # already updated
- class odps.models.Table(**kwargs)[source]
A Table is the counterpart of a table in an RDBMS; besides, a table can consist of partitions.
Table properties behave like those of odps.models.Project: they are not loaded from the remote ODPS service until users access them. To write data into a table, users should call the open_writer method within a with statement. Likewise, the open_reader method provides the ability to read records from a table or its partition.
- Example:
>>> table = odps.get_table('my_table')
>>> table.owner  # the first access will load from remote
>>> table.reload()  # reload to update the properties
>>>
>>> for record in table.head(5):
>>>     # check the first 5 records
>>> for record in table.head(5, partition='pt=test', columns=['my_column']):
>>>     # only check the `my_column` column from a certain partition of this table
>>>
>>> with table.open_reader() as reader:
>>>     count = reader.count  # number of records of a table or its partition
>>>     for record in reader[0: count]:
>>>         # read all data; it is actually better to split the reads into several ranges
>>>
>>> with table.open_writer() as writer:
>>>     writer.write(records)
>>> with table.open_writer(partition='pt=test', blocks=[0, 1]) as writer:
>>>     writer.write(0, gen_records(block=0))
>>>     writer.write(1, gen_records(block=1))  # we can do this in parallel
- name
Name of the table
- comment
Comment of the table
- owner
Owner of the table
- creation_time
Creation time of the table in local time.
- last_data_modified_time
Last data modified time of the table in local time.
- table_schema
Schema of the table, in TableSchema type.
- type
Type of the table, can be managed_table, external_table, view or materialized_view.
- size
Logical size of the table.
- lifecycle
Lifecycle of the table in days.
- add_columns(columns, if_not_exists=False, async_=False, hints=None)[source]
Add columns to the table.
- Parameters:
columns -- columns to add; can be a list of Column instances or a string of column definitions
if_not_exists -- if True, will not raise an exception when a column already exists
- Example:
>>> table = odps.create_table('test_table', schema=TableSchema.from_lists(['name', 'id'], ['string', 'string']))
>>> # add a column by Column instance
>>> table.add_columns([Column('id2', 'string')])
>>> # add columns by a string of column definitions
>>> table.add_columns("fid double, fid2 double")
- change_partition_spec(old_partition_spec, new_partition_spec, async_=False, hints=None)[source]
Change the partition spec of a specified partition of the table.
- Parameters:
old_partition_spec -- old partition spec
new_partition_spec -- new partition spec
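- Example (an illustrative sketch; the table name and partition values are hypothetical):
>>> table = odps.get_table('my_table')
>>> table.change_partition_spec('pt=20230101', 'pt=20230102')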
- create_partition(partition_spec, if_not_exists=False, async_=False, hints=None)[source]
Create a partition within the table.
- Parameters:
partition_spec -- specification of the partition.
if_not_exists -- if True, will not raise an exception when the partition already exists
hints
async_ -- run asynchronously if True
- Returns:
partition object
- Return type:
odps.models.partition.Partition
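- Example (an illustrative sketch; assumes the table is partitioned by pt):
>>> partition = table.create_partition('pt=test', if_not_exists=True)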
- delete_columns(columns, async_=False, hints=None)[source]
Delete columns from the table.
- Parameters:
columns -- columns to delete; can be a list of column names
- delete_partition(partition_spec, if_exists=False, async_=False, hints=None)[source]
Delete a partition within the table.
- Parameters:
partition_spec -- specification of the partition.
if_exists -- if True, will not raise an exception when the partition does not exist
hints
async_ -- run asynchronously if True
- drop(async_=False, if_exists=False, hints=None)[source]
Drop this table.
- Parameters:
async_ -- run asynchronously if True
if_exists -- if True, will not raise an exception when the table does not exist
hints
- Returns:
None
- exist_partition(partition_spec)[source]
Check if a partition exists within the table.
- Parameters:
partition_spec -- specification of the partition.
- exist_partitions(prefix_spec=None)[source]
Check if partitions with the provided conditions exist.
- Parameters:
prefix_spec -- prefix of the partition
- Returns:
whether partitions exist
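- Example (an illustrative sketch; assumes a table partitioned first by pt and then by hh):
>>> table.exist_partition('pt=test,hh=09')  # check one exact partition
>>> table.exist_partitions('pt=test')  # check by the leading partition field only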
- get_ddl(with_comments=True, if_not_exists=False, force_table_ddl=False)[source]
Get the DDL SQL statement for the given table.
- Parameters:
with_comments -- append comments for the table and each column
if_not_exists -- generate an IF NOT EXISTS clause in the generated DDL
force_table_ddl -- force generating table DDL if the object is a view
- Returns:
DDL statement
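- Example (an illustrative sketch):
>>> print(table.get_ddl())  # print the CREATE TABLE statement for this table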
- get_max_partition(spec=None, skip_empty=True, reverse=False)[source]
Get the partition with the maximal value within a certain spec.
- Parameters:
spec -- parent partitions. If specified, will return the partition with the maximal value within the specified parent partition
skip_empty -- if True, will skip partitions without data
reverse -- if True, will return the partition with the minimal value
- Returns:
Partition
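- Example (an illustrative sketch; assumes a table partitioned by pt, optionally with a second-level field):
>>> table.get_max_partition()  # get the latest partition that contains data
>>> table.get_max_partition(skip_empty=False)  # include partitions without data
>>> table.get_max_partition(spec='pt=20230101')  # max second-level partition under pt=20230101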
- get_partition(partition_spec)[source]
Get a partition with the given specification.
- Parameters:
partition_spec -- specification of the partition.
- Returns:
partition object
- Return type:
odps.models.partition.Partition
- head(limit, partition=None, columns=None, use_legacy=True, timeout=None, tags=None)[source]
Get the head records of a table or its partition.
- Parameters:
limit (int) -- number of records, 10000 at most
partition -- partition of this table
columns (list) -- the columns, which are a subset of the table columns
- Returns:
records
- Return type:
list
- iter_pandas(partition=None, columns=None, batch_size=None, start=None, count=None, quota_name=None, append_partitions=None, tags=None, **kwargs)[source]
Iterate over table data in blocks as pandas DataFrames.
- Parameters:
partition -- partition of this table
columns (list) -- columns to read
batch_size (int) -- size of DataFrame batch to read
start (int) -- start row index from 0
count (int) -- data count to read
append_partitions (bool) -- if True, partition values will be appended to the output
quota_name (str) -- name of tunnel quota to use
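- Example (an illustrative sketch; assumes pandas is installed):
>>> for batch in table.iter_pandas(batch_size=10000):
>>>     print(len(batch))  # each batch is a pandas DataFrame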
- iterate_partitions(spec=None, reverse=False)[source]
Create an iterable object to iterate over partitions.
- Parameters:
spec -- specification of the partition.
reverse -- output partitions in reversed order
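- Example (an illustrative sketch):
>>> for partition in table.iterate_partitions():
>>>     print(partition.name)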
- new_record(values=None)[source]
Generate a record of the table.
- Parameters:
values (list) -- the values of this record
- Returns:
record
- Return type:
odps.models.Record
- Example:
>>> table = odps.create_table('test_table', schema=TableSchema.from_lists(['name', 'id'], ['string', 'string']))
>>> record = table.new_record()
>>> record[0] = 'my_name'
>>> record[1] = 'my_id'
>>> record = table.new_record(['my_name', 'my_id'])
- open_reader(partition=None, reopen=False, endpoint=None, download_id=None, timeout=None, arrow=False, columns=None, quota_name=None, async_mode=True, append_partitions=None, tags=None, **kw)[source]
Open a reader to read the entire set of records from this table or its partition.
- Parameters:
partition -- partition of this table
reopen (bool) -- by default the reader reuses the last one; if reopen is True, a new reader is opened.
endpoint -- the tunnel service URL
download_id -- use an existing download_id to download table contents
arrow -- use the arrow tunnel to read data
columns -- columns to read
quota_name -- name of the tunnel quota
async_mode -- enable async mode to create tunnels; can be set to True if session creation takes a long time.
compress_option (odps.tunnel.CompressOption) -- compression algorithm, level and strategy
compress_algo -- compression algorithm, works when compress_option is not provided; can be zlib or snappy
compress_level -- used for zlib, works when compress_option is not provided
compress_strategy -- used for zlib, works when compress_option is not provided
append_partitions (bool) -- if True, partition values will be appended to the output
- Returns:
reader; count means the full size, status means the tunnel status
- Example:
>>> with table.open_reader() as reader:
>>>     count = reader.count  # number of records of a table or its partition
>>>     for record in reader[0: count]:
>>>         # read all data; it is actually better to split the reads into several ranges
- open_writer(partition=None, blocks=None, reopen=False, create_partition=False, commit=True, endpoint=None, upload_id=None, arrow=False, quota_name=None, tags=None, mp_context=None, **kw)[source]
Open a writer to write records into this table or its partition.
- Parameters:
partition -- partition of this table
blocks -- block ids to open
reopen (bool) -- by default the writer reuses the last one; if reopen is True, a new writer is opened.
create_partition (bool) -- if True, the partition will be created if it does not exist
endpoint -- the tunnel service URL
upload_id -- use an existing upload_id to upload data
arrow -- use the arrow tunnel to write data
quota_name -- name of the tunnel quota
overwrite (bool) -- if True, will overwrite existing data
compress_option (odps.tunnel.CompressOption) -- compression algorithm, level and strategy
compress_algo -- compression algorithm, works when compress_option is not provided; can be zlib or snappy
compress_level -- used for zlib, works when compress_option is not provided
compress_strategy -- used for zlib, works when compress_option is not provided
- Returns:
writer; status means the tunnel writer status
- Example:
>>> with table.open_writer() as writer:
>>>     writer.write(records)
>>> with table.open_writer(partition='pt=test', blocks=[0, 1]) as writer:
>>>     writer.write(0, gen_records(block=0))
>>>     writer.write(1, gen_records(block=1))  # we can do this in parallel
- rename_column(old_column_name, new_column_name, comment=None, async_=False, hints=None)[source]
Rename a column in the table.
- Parameters:
old_column_name -- old column name
new_column_name -- new column name
comment -- new column comment, optional
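- Example (an illustrative sketch; column names are hypothetical):
>>> table.rename_column('id', 'user_id', comment='renamed from id')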
- set_cluster_info(new_cluster_info, async_=False, hints=None)[source]
Set cluster info of current table.
- set_comment(new_comment, async_=False, hints=None)[source]
Set the comment of the current table.
- Parameters:
new_comment -- new comment
- set_lifecycle(days, async_=False, hints=None)[source]
Set the lifecycle of the current table.
- Parameters:
days -- lifecycle in days
- set_owner(new_owner, async_=False, hints=None)[source]
Set the owner of the current table.
- Parameters:
new_owner -- account of the new owner
- set_storage_tier(storage_tier, partition_spec=None, async_=False, hints=None)[source]
Set the storage tier of the current table or a specific partition.
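- Example (an illustrative sketch of the setters above; the values are hypothetical):
>>> table.set_comment('a new comment')
>>> table.set_lifecycle(30)  # table data will be recycled 30 days after the last data modification
>>> table.set_owner('new_owner_account')  # account of the new owner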
- to_pandas(partition=None, columns=None, start=None, count=None, n_process=1, quota_name=None, append_partitions=None, tags=None, **kwargs)[source]
Read table data into a pandas DataFrame.
- Parameters:
partition -- partition of this table
columns (list) -- columns to read
start (int) -- start row index from 0
count (int) -- data count to read
n_process (int) -- number of processes to accelerate reading
append_partitions (bool) -- if True, partition values will be appended to the output
quota_name (str) -- name of tunnel quota to use
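- Example (an illustrative sketch; assumes pandas is installed and the table has a pt partition):
>>> df = table.to_pandas(partition='pt=test', columns=['name'], start=0, count=1000)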
- class odps.models.partition.Partition(**kwargs)[source]
A partition is a collection of rows in a table whose partition columns are equal to specific values.
To write data into a partition, users should call the open_writer method within a with statement. Likewise, the open_reader method provides the ability to read records from a partition. These methods behave the same as their counterparts in the Table class, except that there is no 'partition' parameter.
- change_partition_spec(new_partition_spec, async_=False, hints=None)[source]
Change the partition spec of the current partition.
- Parameters:
new_partition_spec -- new partition spec
- drop(async_=False, if_exists=False)[source]
Drop this partition.
- Parameters:
async_ -- run asynchronously if True
if_exists -- if True, will not raise an exception when the partition does not exist
- Returns:
None
- head(limit, columns=None)[source]
Get the head records of a partition.
- Parameters:
limit -- number of records, 10000 at most
columns (list) -- the columns, which are a subset of the table columns
- Returns:
records
- Return type:
list
- iter_pandas(columns=None, batch_size=None, start=None, count=None, quota_name=None, append_partitions=None, tags=None, **kwargs)[source]
Iterate over partition data in blocks as pandas DataFrames.
- Parameters:
columns (list) -- columns to read
batch_size (int) -- size of DataFrame batch to read
start (int) -- start row index from 0
count (int) -- data count to read
quota_name (str) -- name of tunnel quota to use
append_partitions (bool) -- if True, partition values will be appended to the output
- open_reader(**kw)[source]
Open a reader to read the entire set of records from this partition.
- Parameters:
reopen (bool) -- by default the reader reuses the last one; if reopen is True, a new reader is opened.
endpoint -- the tunnel service URL
compress_option (odps.tunnel.CompressOption) -- compression algorithm, level and strategy
compress_algo -- compression algorithm, works when compress_option is not provided; can be zlib or snappy
compress_level -- used for zlib, works when compress_option is not provided
compress_strategy -- used for zlib, works when compress_option is not provided
- Returns:
reader; count means the full size, status means the tunnel status
- Example:
>>> with partition.open_reader() as reader:
>>>     count = reader.count  # number of records of a partition
>>>     for record in reader[0: count]:
>>>         # read all data; it is actually better to split the reads into several ranges
- set_storage_tier(storage_tier, async_=False, hints=None)[source]
Set the storage tier of the current partition.
- to_pandas(columns=None, start=None, count=None, n_process=1, quota_name=None, append_partitions=None, tags=None, **kwargs)[source]
Read partition data into a pandas DataFrame.
- Parameters:
columns (list) -- columns to read
start (int) -- start row index from 0
count (int) -- data count to read
n_process (int) -- number of processes to accelerate reading
quota_name (str) -- name of tunnel quota to use
append_partitions (bool) -- if True, partition values will be appended to the output
- class odps.models.Instance(**kwargs)[source]
An Instance represents a run of an ODPS task. The status property reflects the current state of an instance; the is_terminated method indicates whether the instance has finished; the is_successful method indicates whether the instance ran successfully; the wait_for_success method blocks the main process until the instance has finished.
For a SQL instance, we can use open_reader to read the results.
- Example:
>>> instance = odps.execute_sql('select * from dual')  # this SQL returns structured data
>>> with instance.open_reader() as reader:
>>>     # handle the record
>>>
>>> instance = odps.execute_sql('desc dual')  # this SQL does not return structured data
>>> with instance.open_reader() as reader:
>>>     print(reader.raw)  # just return the raw result
- exception DownloadSessionCreationError(msg, request_id=None, code=None, host_id=None, instance_id=None, endpoint=None, tag=None, response_headers=None)[source]
- class Task(**kwargs)[source]
Task stands for each task inside an instance.
It has a name, a task type, start and end times, and a running status.
- class TaskProgress(**kwargs)[source]
TaskProgress represents the progress of a task.
A single TaskProgress may consist of several stages.
- Example:
>>> progress = instance.get_task_progress('task_name')
>>> progress.get_stage_progress_formatted_string()
2015-11-19 16:39:07 M1_Stg1_job0:0/0/1[0%] R2_1_Stg1_job0:0/0/1[0%]
- get_logview_address(hours=None, use_legacy=None)[source]
Get the logview address of the instance object by hours.
- Parameters:
hours -- expiration of the logview address, in hours
- Returns:
logview address
- Return type:
str
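- Example (an illustrative sketch):
>>> print(instance.get_logview_address())  # open the printed URL in a browser to inspect the instance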
- get_sql_task_cost()[source]
Get cost information of the SQL cost task, including input data size, number of UDFs, and complexity of the SQL task.
NOTE: DO NOT use this function directly, as it cannot be applied to instances returned from SQL. Use o.execute_sql_cost instead.
- Returns:
cost info in dict format
- get_task_cost(task_name=None)[source]
Get task cost.
- Parameters:
task_name -- name of the task
- Returns:
task cost
- Return type:
Instance.TaskCost
- Example:
>>> cost = instance.get_task_cost(instance.get_task_names()[0])
>>> cost.cpu_cost
200
>>> cost.memory_cost
4096
>>> cost.input_size
0
- get_task_detail(task_name=None)[source]
Get a task's detail.
- Parameters:
task_name -- task name
- Returns:
the task's detail
- Return type:
list or dict according to the JSON
- get_task_detail2(task_name=None, **kw)[source]
Get a task's detail, v2.
- Parameters:
task_name -- task name
- Returns:
the task's detail
- Return type:
list or dict according to the JSON
- get_task_info(task_name, key, raise_empty=False)[source]
Get task-related information.
- Parameters:
task_name -- name of the task
key -- key of the information item
raise_empty -- if True, will raise an error when the response is empty
- Returns:
a string of the task information
- get_task_progress(task_name=None)[source]
Get a task's current progress.
- Parameters:
task_name -- task name
- Returns:
the task's progress
- Return type:
Instance.TaskProgress
- get_task_quota(task_name=None)[source]
Get queueing info of the task. Note that the time between two calls should be larger than 30 seconds; otherwise an empty dict is returned.
- Parameters:
task_name -- name of the task
- Returns:
quota info in dict format
- get_task_result(task_name=None, timeout=None, retry=True)[source]
Get a single task result.
- Parameters:
task_name -- task name
- Returns:
task result
- Return type:
str
- get_task_results(timeout=None, retry=True)[source]
Get all the task results.
- Returns:
a dict whose keys are task names and whose values are task results as strings
- Return type:
dict
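- Example (an illustrative sketch):
>>> for task_name, result in instance.get_task_results().items():
>>>     print(task_name, result)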
- get_task_statuses(retry=True, timeout=None)[source]
Get all tasks' statuses.
- Returns:
a dict whose keys are task names and whose values are odps.models.Instance.Task objects
- Return type:
dict
- get_task_summary(task_name=None)[source]
Get a task's summary, mostly used for MapReduce.
- Parameters:
task_name -- task name
- Returns:
summary as a dict parsed from JSON
- Return type:
dict
- get_task_workers(task_name=None, json_obj=None)[source]
Get workers from the task.
- Parameters:
task_name -- task name
json_obj -- json object parsed from get_task_detail2
- Returns:
list of workers
- get_worker_log(log_id, log_type, size=0)[source]
Get logs from a worker.
- Parameters:
log_id -- id of the log, can be retrieved from details.
log_type -- type of logs. Possible log types include coreinfo, hs_err_log, jstack, pstack, stderr, stdout and waterfall_summary
size -- length of the log to retrieve
- Returns:
log content
- is_running(retry=True, blocking=False, retry_timeout=None)[source]
Check whether this instance is still running.
- Returns:
True if still running, else False
- Return type:
bool
- is_successful(retry=True, retry_timeout=None)[source]
Check whether the instance ran successfully.
- Returns:
True if successful, else False
- Return type:
bool
- is_terminated(retry=True, blocking=False, retry_timeout=None)[source]
Check whether this instance has finished.
- Returns:
True if finished, else False
- Return type:
bool
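- Example (an illustrative polling sketch built on the status checks above; wait_for_success below is usually more convenient):
>>> import time
>>> while not instance.is_terminated():
>>>     time.sleep(5)
>>> if not instance.is_successful():
>>>     raise RuntimeError('instance failed')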
- iter_pandas(columns=None, limit=None, batch_size=None, start=None, count=None, quota_name=None, tags=None, **kwargs)[source]
Iterate over instance result data in blocks as pandas DataFrames. The limit argument follows the definition of the open_reader API.
- Parameters:
columns (list) -- columns to read
limit (bool) -- if True, enable the limitation
batch_size (int) -- size of DataFrame batch to read
start (int) -- start row index from 0
count (int) -- data count to read
quota_name (str) -- name of tunnel quota to use
- open_reader(*args, **kwargs)[source]
Open a reader to read records from the result of the instance. If tunnel is True, the instance tunnel will be used; otherwise the conventional routine will be used. If the instance tunnel is not available and tunnel is not specified, the method will fall back to the conventional routine. Note that under instance tunnel mode the number of records returned is limited unless options.limited_instance_tunnel is set to False or limit=False is specified; under the conventional routine the number of records returned is always limited.
- Parameters:
tunnel -- if True, use the instance tunnel to read from the instance; if False, use the conventional routine; if absent, options.tunnel.use_instance_tunnel will be used and automatic fallback is enabled.
limit (bool) -- if True, enable the limitation
reopen (bool) -- by default the reader reuses the last one; if reopen is True, a new reader is opened.
endpoint -- the tunnel service URL
compress_option (odps.tunnel.CompressOption) -- compression algorithm, level and strategy
compress_algo -- compression algorithm, works when compress_option is not provided; can be zlib or snappy
compress_level -- used for zlib, works when compress_option is not provided
compress_strategy -- used for zlib, works when compress_option is not provided
- Returns:
reader; count means the full size, status means the tunnel status
- Example:
>>> with instance.open_reader() as reader:
>>>     count = reader.count  # number of records of the result
>>>     for record in reader[0: count]:
>>>         # read all data; it is actually better to split the reads into several ranges
- put_task_info(task_name, key, value, check_location=False, raise_empty=False)[source]
Put information into a task.
- Parameters:
task_name -- name of the task
key -- key of the information item
value -- value of the information item
check_location -- raise an error if the Location header is missing
raise_empty -- if True, will raise an error when the response is empty
- to_pandas(columns=None, limit=None, start=None, count=None, n_process=1, quota_name=None, tags=None, **kwargs)[source]
Read instance result data into a pandas DataFrame. The limit argument follows the definition of the open_reader API.
- Parameters:
columns (list) -- columns to read
limit (bool) -- if True, enable the limitation
start (int) -- start row index from 0
count (int) -- data count to read
n_process (int) -- number of processes to accelerate reading
quota_name (str) -- name of tunnel quota to use
- wait_for_completion(interval=1, timeout=None, max_interval=None, blocking=True)[source]
Wait for the instance to complete, regardless of whether it succeeds.
- Parameters:
interval -- time interval to check
max_interval -- if specified, the check interval will be doubled on each attempt until max_interval is reached.
timeout -- maximum wait time
blocking -- whether to block waiting at server side. Note that this option does not affect client behavior.
- 返回:
None
- wait_for_success(interval=1, timeout=None, max_interval=None, blocking=True)[source]
Wait for the instance to complete, and check if the instance is successful.
- Parameters:
interval -- time interval to check
max_interval -- if specified, the check interval will be doubled on each attempt until max_interval is reached.
timeout -- maximum wait time
blocking -- whether to block waiting at server side. Note that this option does not affect client behavior.
- Returns:
None
- Raises:
odps.errors.ODPSError if the instance failed
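- Example (an illustrative asynchronous SQL workflow; the query is hypothetical):
>>> instance = odps.run_sql('select * from dual')  # submit without waiting for completion
>>> instance.wait_for_success()  # raises odps.errors.ODPSError if the instance failed
>>> with instance.open_reader() as reader:
>>>     for record in reader:
>>>         pass  # handle each record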
- class odps.models.Resource(**kwargs)[source]
Resource is useful when writing UDFs or MapReduce. This is an abstract class.
Basically, a resource can be either a file resource or a table resource. A file resource can be of type file, py, jar, or archive.
- class odps.models.FileResource(**kwargs)[source]
A file resource represents a file.
Use the open method to open this resource as a file-like object.
- flush()[source]
Commit the changes to ODPS if any change happens. close will do this automatically.
- Returns:
None
- open(mode='r', encoding='utf-8', stream=False, overwrite=None)[source]
The argument mode stands for the open mode for this file resource. It can be a binary mode if 'b' is included. For instance, 'rb' means opening the resource in read binary mode, while 'r+b' means opening the resource in read+write binary mode. This matters most when the file is actually binary, such as a tar or jpeg file, so be sure to open the file with the correct mode.
Basically, the text mode can be 'r', 'w', 'a', 'r+', 'w+', 'a+', just like the builtin Python open method:
r means read only
w means write only; the file will be truncated when opening
a means append only
r+ means read+write without constraint
w+ will truncate first, then open for read+write
a+ can read+write; however, the written content can only be appended to the end
- Parameters:
mode -- the mode of opening the file, described as above
encoding -- utf-8 as default
stream -- open in stream mode
overwrite -- if True, will overwrite the existing resource. True by default.
- Returns:
file-like object
- Example:
>>> with resource.open('r') as fp:
>>>     fp.read(1)  # read one unicode character
>>>     fp.write('test')  # wrong, cannot write under read mode
>>>
>>> with resource.open('wb') as fp:
>>>     fp.readlines()  # wrong, cannot read under write mode
>>>     fp.write(b'hello world')  # write bytes
>>>
>>> with resource.open('r+') as fp:  # open in read+write mode
>>>     fp.seek(5)
>>>     fp.truncate()
>>>     fp.flush()
- read(size=-1)[source]
Read the file resource; reads all by default.
- Parameters:
size -- unicode character or byte length, depending on text or binary mode.
- Returns:
unicode or bytes, depending on text or binary mode
- Return type:
str or unicode (Py2); bytes or str (Py3)
- readline(size=-1)[source]
Read a single line.
- Parameters:
size -- if the size argument is present and non-negative, it is a maximum byte count (including the trailing newline) and an incomplete line may be returned. When size is not 0, an empty string is returned only when EOF is encountered immediately.
- Returns:
unicode or bytes, depending on text or binary mode
- Return type:
str or unicode (Py2); bytes or str (Py3)
- readlines(sizehint=-1)[source]
Read as lines.
- Parameters:
sizehint -- if the optional sizehint argument is present, instead of reading up to EOF, whole lines totalling approximately sizehint bytes (possibly after rounding up to an internal buffer size) are read.
- Returns:
lines
- Return type:
list
- seek(pos, whence=0)[source]
Seek to some position.
- Parameters:
pos -- position to seek to
whence -- if set to 2, will seek relative to the end
- Returns:
None
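- Example (an illustrative sketch; assumes the resource holds text):
>>> with resource.open('r') as fp:
>>>     fp.seek(5)  # move to the sixth character
>>>     rest = fp.read()  # read from there to the end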
- class odps.models.ArchiveResource(**kwargs)[source]
A file resource representing a compressed file such as .zip/.tgz/.tar.gz/.tar/.jar.
- class odps.models.TableResource(**kwargs)[source]
Take a table as a resource.
- property partition
Get the source table partition.
- Returns:
the source table partition
- property table
Get the table object.
- Returns:
source table
- Return type:
odps.models.Table
- class odps.models.Function(**kwargs)[source]
A Function can be used as a UDF when users write SQL.
- property resources
Return all the resources that this function refers to.
- Returns:
resources
- Return type:
list
- class odps.models.Worker(**kwargs)[source]
Class for retrieving worker information and logs.
- class odps.models.ml.OfflineModel(**kwargs)[source]
Represents an ODPS offline model.