Instructions for running MaxCompute in DataWorks¶
Create flow nodes¶
Flow nodes include the Python on MaxCompute (PyODPS) node. You can create a PyODPS node in your flow.

Use the ODPS object¶
The PyODPS node in DataWorks includes the global variable odps or o, which is the ODPS object. You do not need to manually define the ODPS object.
print(o.exist_table('pyodps_iris'))
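A minimal sketch of other common operations on the o object, reusing the pyodps_iris table from this document's examples:

t = o.get_table('pyodps_iris')   # fetch a table object
print(t.schema)                  # inspect its columns and types
for table in o.list_tables(prefix='pyodps_'):
    print(table.name)            # iterate over tables whose names share a prefix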
Execute SQL statements¶
For more information, see Execute SQL statements.
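A minimal sketch of running SQL in a PyODPS node and iterating over the result, reusing the pyodps_iris table from the examples above:

instance = o.execute_sql('select * from pyodps_iris limit 10')  # blocks until the SQL statement finishes
with instance.open_reader() as reader:
    for record in reader:
        print(record.values)  # each record behaves like a tuple of column values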
Note
Instance tunnel is not enabled by default in DataWorks, so at most 10000 records can be fetched. When instance tunnel is enabled, reader.count returns the number of records, and you need to disable the limit to fetch all data by iteration.
To enable instance tunnel globally, use the code below.
from odps import options

options.tunnel.use_instance_tunnel = True
options.tunnel.limit_instance_tunnel = False  # disable the record limit to fetch all data
with instance.open_reader() as reader:
    # you can fetch all data through instance tunnel
    for record in reader:
        pass  # process every record here
You can also pass tunnel=True to open_reader to enable instance tunnel for this reader only, and limit=False to disable the limit and fetch all data.
with instance.open_reader(tunnel=True, limit=False) as reader:
    # use instance tunnel and fetch all data without the record limit
    for record in reader:
        pass  # process every record here
Note that some projects may limit downloading all data from tables, in which case you may get a permission error after configuring these options. You can contact your project owner for help, or process the data in MaxCompute rather than downloading and processing it locally.
DataFrame¶
Execution¶
To execute a DataFrame in DataWorks, you need to explicitly call actions that trigger execution, such as execute and head.
from odps.df import DataFrame
iris = DataFrame(o.get_table('pyodps_iris'))
for record in iris[iris.sepal_width < 3].execute():  # the filter is executed immediately when execute() is called
    pass  # process every record here
If you want actions to be triggered automatically when results are printed, set options.interactive to True.
from odps import options
from odps.df import DataFrame
options.interactive = True # configure at the start of code
iris = DataFrame(o.get_table('pyodps_iris'))
print(iris.sepal_width.sum()) # sum() will be executed immediately because we use print here
Print details¶
To print details, you need to set options.verbose. By default, this parameter is set to True in DataWorks, and the system prints the logview and other details during execution.
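A minimal sketch of toggling this option explicitly at the start of your node code:

from odps import options

options.verbose = True   # print logview URLs and other details (already the default in DataWorks)
# options.verbose = False  # uncomment to silence the detailed output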
Obtain scheduling parameters¶
Different from SQL nodes in DataWorks, PyODPS nodes do NOT automatically replace placeholder strings like ${param_name}, to avoid interfering with your Python code in unpredictable ways. Instead, a PyODPS node provides a dict named args among the global variables, which contains all the scheduling parameters. For instance, if you set ds=${yyyymmdd} under Schedule -> Parameter in DataWorks, you can use the following code to obtain the value of ds:
print('ds=' + args['ds'])
The output is:
ds=20161116
Specifically, if you want to get the table partition ds=${yyyymmdd}, you can use the code below:
o.get_table('table_name').get_partition('ds=' + args['ds'])
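A related sketch, writing records into the partition of the current schedule; the table name table_name and the record values are placeholders for illustration:

t = o.get_table('table_name')
partition_spec = 'ds=' + args['ds']
# write into the scheduled partition, creating it if it does not exist yet
with t.open_writer(partition=partition_spec, create_partition=True) as writer:
    writer.write([['some_value', 1]])  # values must match the table schema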
Feature restriction¶
DataWorks does not have the matplotlib library. Therefore, the following features may be restricted:
- DataFrame plot function
Custom functions in DataFrame need to be submitted to MaxCompute before execution. Due to the Python sandbox, only third-party libraries written in pure Python or referencing only numpy can be executed without uploading auxiliary libraries. Other libraries, including pandas, should be uploaded before use; see support for third-party libraries for more details. Code outside custom functions can use the numpy and pandas pre-installed in DataWorks. Other third-party libraries with binary code are currently not supported.
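A minimal sketch of a custom function that references only numpy and therefore runs without uploading auxiliary libraries; the column name sepal_width follows the pyodps_iris examples above:

from odps.df import DataFrame

iris = DataFrame(o.get_table('pyodps_iris'))

def log_width(v):
    # numpy can be imported inside the custom function without uploading extra libraries
    import numpy as np
    return float(np.log1p(v))

# the function is submitted to MaxCompute and executed there
print(iris.sepal_width.map(log_width, 'float').head(5))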
For compatibility reasons, options.tunnel.use_instance_tunnel in DataWorks is set to False by default. To enable Instance Tunnel globally, you need to manually set options.tunnel.use_instance_tunnel to True.
For implementation reasons, the Python atexit package is not supported. You need to use the try-finally structure to implement related features.
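A minimal sketch of replacing atexit-style cleanup with try-finally; the temporary table name is only an example:

o.create_table('tmp_pyodps_cleanup_demo', 'key string, value bigint', if_not_exists=True)
try:
    # ... run the job that uses the temporary table ...
    pass
finally:
    # cleanup that would otherwise have been registered with atexit
    o.delete_table('tmp_pyodps_cleanup_demo', if_exists=True)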
Usage restrictions¶
To avoid putting pressure on the DataWorks gateway when running PyODPS nodes, CPU and memory usage is restricted, and DataWorks manages this restriction centrally.
If the system displays Got killed, the process has been terminated because it ran out of memory. Therefore, we do not recommend performing heavy data operations locally.
However, the preceding restrictions do not apply to SQL and DataFrame tasks (except to_pandas) initiated by PyODPS, because those tasks run in MaxCompute rather than on the node itself.
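A sketch of keeping heavy computation in MaxCompute instead of pulling data into the node; the result table name is a placeholder for illustration:

from odps.df import DataFrame

iris = DataFrame(o.get_table('pyodps_iris'))
filtered = iris[iris.sepal_width < 3]

# runs in MaxCompute and writes the result to a table, avoiding local memory pressure
filtered.persist('pyodps_iris_filtered')

# avoid this for large data: to_pandas() loads the whole result into the node's memory
# local_df = filtered.to_pandas()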