Instructions for running PyODPS in DataWorks

Create flow nodes

Flow nodes in DataWorks include the Python on MaxCompute (PyODPS) node. You can create a PyODPS node in your workflow.

(Figure: creating a PyODPS node in DataWorks)

Use the ODPS object

The PyODPS node in DataWorks provides the global variables odps and o, which both refer to the ODPS object. You do not need to define the ODPS object manually.

print(o.exist_table('pyodps_iris'))
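
Beyond checking whether a table exists, the same object can be used for other common operations; a minimal sketch (the pyodps_iris table and the prefix are only illustrations):

# a few common operations through the global ODPS object;
# the table name and prefix below are only illustrations
t = o.get_table('pyodps_iris')            # obtain a table object
print(t.schema)                           # inspect its schema
for table in o.list_tables(prefix='pyodps'):
    print(table.name)                     # list tables whose names start with the prefix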

Execute SQL statements

For more information, see Execute SQL statements.
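
As a minimal sketch, assuming a table named pyodps_iris exists in the project, you can execute a SQL statement and read its result like this:

# run a SQL statement through the global ODPS object and read the result;
# the table name pyodps_iris is only an illustration
instance = o.execute_sql('select * from pyodps_iris limit 10')
with instance.open_reader() as reader:
    for record in reader:
        print(record)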

Note

The instance tunnel is not enabled by default in DataWorks, so at most 10,000 records can be fetched. When the instance tunnel is enabled, reader.count returns the number of records, and the limit must be disabled to fetch all data by iteration.

To enable the instance tunnel globally, use the following code:

from odps import options

options.tunnel.use_instance_tunnel = True
options.tunnel.limit_instance_tunnel = False  # disable the limit to fetch all data

with instance.open_reader() as reader:
    # all data can be fetched through the instance tunnel
    for record in reader:
        ...  # process every record

You can also pass tunnel=True to open_reader to enable the instance tunnel for this reader only, and pass limit=False to disable the limit and fetch all data.

with instance.open_reader(tunnel=True, limit=False) as reader:
    # use the instance tunnel and fetch all data without the limit
    for record in reader:
        ...  # process every record

Note that some projects may restrict downloading all data from tables, in which case you may get a permission error after configuring these options. You can contact your project owner for help, or process the data in MaxCompute rather than downloading and processing it locally.

DataFrame

Execution

To execute a DataFrame in DataWorks, you need to explicitly call immediately-executed methods such as execute and head.

from odps.df import DataFrame

iris = DataFrame(o.get_table('pyodps_iris'))
# the filter is executed immediately because execute() is called
for record in iris[iris.sepal_width < 3].execute():
    ...  # process every record

To have such methods called automatically when results are printed, set options.interactive to True.

from odps import options
from odps.df import DataFrame

options.interactive = True  # configure at the start of code

iris = DataFrame(o.get_table('pyodps_iris'))
print(iris.sepal_width.sum())  # sum() will be executed immediately because we use print here

Print details

To print execution details, set options.verbose. In DataWorks, this option is set to True by default, so the system prints the Logview URL and other details during execution.
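
If you need to control this behavior explicitly, a minimal sketch:

from odps import options

options.verbose = True  # print the Logview URL and other details (the default in DataWorks)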

Obtain scheduling parameters

Different from SQL nodes in DataWorks, a PyODPS node does not automatically replace placeholder strings such as ${param_name}, to avoid intruding on your Python code, which might lead to unpredictable consequences. Instead, the PyODPS node creates a dict named args in the global variables, which contains all the scheduling parameters. For instance, if you set ds=${yyyymmdd} under Schedule -> Parameter in DataWorks, you can use the following code to obtain the value of ds:

print('ds=' + args['ds'])
# prints: ds=20161116

Specifically, to get the table partition ds=${yyyymmdd}, you can use the following code:

o.get_table('table_name').get_partition('ds=' + args['ds'])
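
Building on that, a minimal sketch that checks whether the partition exists before reading it (table_name is a placeholder):

# check that the partition for the current scheduling date exists before reading it;
# table_name is a placeholder
t = o.get_table('table_name')
part = 'ds=' + args['ds']
if t.exist_partition(part):
    with t.open_reader(partition=part) as reader:
        print(reader.count)  # number of records in the partition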

Feature restriction

DataWorks does not have the matplotlib library. Therefore, the following features may be restricted:

  • DataFrame plot function

Custom functions in DataFrame need to be submitted to MaxCompute before execution. Because of the Python sandbox, third-party libraries that are written in pure Python or that rely only on numpy can be executed without uploading auxiliary libraries; other libraries, including pandas, must be uploaded before use. See Support for third-party libraries for more details. Code outside custom functions can use the numpy and pandas packages pre-installed in DataWorks; other third-party libraries that contain binary code are currently not supported.
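
For example, a pure-Python function used in a DataFrame expression can run in MaxCompute without uploading extra libraries; a minimal sketch using the pyodps_iris table:

from odps.df import DataFrame

iris = DataFrame(o.get_table('pyodps_iris'))
# the lambda below is a custom function; it is submitted to MaxCompute
# together with the expression when head() triggers execution
scaled = iris.sepal_width.map(lambda v: v * 10)
print(scaled.head(5))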

For compatibility reasons, options.tunnel.use_instance_tunnel is set to False by default in DataWorks. To enable the instance tunnel globally, you need to set options.tunnel.use_instance_tunnel to True manually.

For implementation reasons, the Python atexit package is not supported. Use a try-finally structure to implement the related features.
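
A minimal sketch of the pattern, with a placeholder cleanup step instead of an atexit handler:

# put cleanup logic in a finally block instead of registering it with atexit;
# the SQL statement and the cleanup step are only placeholders
try:
    o.execute_sql('select * from pyodps_iris limit 10')
finally:
    print('node finished, running cleanup')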

Usage restrictions

To avoid putting pressure on the DataWorks gateway when PyODPS runs in DataWorks, CPU and memory usage is restricted. DataWorks provides central management of this restriction.

If the system displays Got killed, the process has been terminated because the memory limit was exceeded. Therefore, we do not recommend performing local data operations.

However, the preceding restrictions do not apply to SQL and DataFrame tasks (except to_pandas) initiated by PyODPS, because they run in MaxCompute rather than locally.