Create and use third-party libraries

Create third-party libraries

Since version 0.11.3, PyODPS provides pyodps-pack, a pip-like command line tool for creating third-party library bundles that can be used in PyODPS and DataWorks nodes. The tool packs all of your dependencies into a single .tar.gz archive built against the Python environments in MaxCompute or DataWorks. It can also pack Python packages that you develop yourself.

Prerequisites

Docker mode

You need to install Docker to run pyodps-pack in Docker mode. You do not need to run pyodps-pack inside a Docker container yourself; the tool calls Docker automatically. Linux users can install Docker by following the official document. Personal macOS or Windows users can use Docker Desktop. Enterprise users without a commercial license for Docker Desktop can use Rancher Desktop, or minikube with the extra steps described in this document. We do not test other tools that provide Docker environments, and availability of pyodps-pack on them is not guaranteed.

To create packages for legacy MaxCompute or DataWorks deployments in private clouds, use the --legacy-image option. On Windows, macOS, or Linux with certain kernels, this option may produce errors; in that case, see this article for solutions.

For Windows users, your Docker service may depend on the Server service of the Windows system, which is often disabled in corporate environments. In this case, create packages under Linux or try starting the service. Rancher Desktop is known to misbehave with containerd as the container engine; switch to dockerd instead. Details about switching container engines can be found in this article.

If your MaxCompute or DataWorks is deployed on the ARM64 architecture (usually within proprietary clouds), you need to add an extra --arch aarch64 argument to specify the architecture of the package. Components for cross-architecture packaging such as binfmt are usually already included in Docker Desktop or Rancher Desktop. You can also run the command below to install the related emulation support manually.

docker run --privileged --rm tonistiigi/binfmt --install arm64

This command requires a Linux kernel version above 4.8. Details of the command can be found in this article.

Non-Docker mode

Note

We recommend using Docker mode to create packages whenever possible. Use non-Docker mode only when Docker is not available, as it may produce malfunctioning packages.

If you have problems installing Docker, you can try non-Docker mode by adding the --without-docker argument. Non-Docker mode requires pip in your Python installation. Windows users also need to install Git bash, which is included in Git for Windows.

Pack all dependencies

Note

We recommend using Python 3 for new projects. We do not guarantee that the methods below work for Python 2. Consider migrating legacy projects to Python 3 to reduce maintenance difficulties in the future.

On Linux, add sudo when calling pyodps-pack to make sure Docker is invoked correctly.

After PyODPS is installed, you can use the command below to pack pandas and all of its dependencies.

pyodps-pack pandas

If you want to pack in non-Docker mode, you can use

pyodps-pack --without-docker pandas

If you need to specify the version of pandas, you may use

pyodps-pack pandas==1.2.5

After the packing process finishes, the utility shows the versions of the packed packages

Package         Version
--------------- -------
numpy           1.21.6
pandas          1.2.5
python-dateutil 2.8.2
pytz            2022.6
six             1.16.0

and generates a packages.tar.gz archive containing all of the dependencies listed above.

If you need to create packages for Python 2.7, first check whether your package will be used in MaxCompute or DataWorks. If you are not sure which environment you are using, you may take a look at this article. If you want to use Python 2.7 packages in MaxCompute, use the command below.

pyodps-pack --mcpy27 pandas

If you want to use Python 2.7 packages in DataWorks, you can use the command below.

pyodps-pack --dwpy27 pandas

Pack custom source code

pyodps-pack supports packing user-defined source code organized with setup.py or pyproject.toml. If you want to know how to build Python packages with these files, you can take a look at this link for more information.

We show how to pack custom code by creating a package with pyproject.toml and packing it with pyodps-pack. Assume that the directory structure of the project looks like

test_package_root
├── test_package
│   ├── __init__.py
│   ├── mod1.py
│   └── subpackage
│       ├── __init__.py
│       └── mod2.py
└── pyproject.toml

while the content of pyproject.toml is

[project]
name = "test_package"
description = "pyodps-pack example package"
version = "0.1.0"
dependencies = [
    "pandas>=1.0.5"
]

After the package is developed, we can pack it and all of its dependencies into packages.tar.gz, where <path_to_package> is the parent directory of test_package_root.

pyodps-pack /<path_to_package>/test_package_root

Pack code in a Git repository

pyodps-pack supports packing remote Git repositories. We take the PyODPS repository as an example to show how to pack one.

pyodps-pack git+https://github.com/aliyun/aliyun-odps-python-sdk.git

If you want to pack a certain branch or tag, you may use

pyodps-pack git+https://github.com/aliyun/aliyun-odps-python-sdk.git@v0.11.2.2

If dependencies need to be installed at build time, for instance cython, you can specify them with the --install-requires argument. You may also create a text file, install-requires.txt, whose format is similar to requirements.txt, and reference it with --install-requires-file. For instance, if you need to install Cython before packing PyODPS, you can call

pyodps-pack \
    --install-requires cython \
    git+https://github.com/aliyun/aliyun-odps-python-sdk.git@v0.11.2.2

It is also possible to write an install-requires.txt with the content

cython>0.29

and the pack command can be written as

pyodps-pack \
    --install-requires-file install-requires.txt \
    git+https://github.com/aliyun/aliyun-odps-python-sdk.git@v0.11.2.2

A more complicated case: adding binary dependencies

Some third-party libraries depend on extra binary components, for instance dynamically-linked libraries that need to be built and installed. You can use the --run-before argument of pyodps-pack to specify a bash script that installs these binary dependencies. We take the geospatial library GDAL as an example to show how to pack this kind of package.

First, we need to determine which binary dependencies to install. According to the documentation of GDAL 3.6.0.1 on PyPI, we need to install libgdal >= 3.6.0, and the build hints of GDAL show that it depends on PROJ >= 6.0. Both dependencies can be built with CMake, so we write a bash script, install-gdal.sh, to install them.

#!/bin/bash
set -e

cd /tmp
curl -o proj-6.3.2.tar.gz https://download.osgeo.org/proj/proj-6.3.2.tar.gz
tar xzf proj-6.3.2.tar.gz
cd proj-6.3.2
mkdir build && cd build
cmake ..
cmake --build .
cmake --build . --target install

cd /tmp
curl -o gdal-3.6.0.tar.gz http://download.osgeo.org/gdal/3.6.0/gdal-3.6.0.tar.gz
tar xzf gdal-3.6.0.tar.gz
cd gdal-3.6.0
mkdir build && cd build
cmake ..
cmake --build .
cmake --build . --target install

Then use pyodps-pack to pack the GDAL Python library.

pyodps-pack --install-requires oldest-supported-numpy --run-before install-gdal.sh gdal==3.6.0.1

In some scenarios, binary dependencies are loaded dynamically from Python (for instance, with ctypes.cdll.LoadLibrary). You may use the --dynlib argument to include these binary libraries, specified by path or by library name under /lib. The binary dependency will be packed into packages/dynlibs inside the package. For instance, the Python library unrar dynamically loads the binary library libunrar, and we can use the script install-libunrar.sh shown below to compile and install it.

#!/bin/bash
curl -o unrar.tar.gz https://www.rarlab.com/rar/unrarsrc-6.0.3.tar.gz
tar xzf unrar.tar.gz
cd unrar
make -j4 lib
# Code below sets SONAME to libunrar.so for the package,
# which is required by LoadLibrary in Python.
# This is not needed for most binary libraries.
patchelf --set-soname libunrar.so libunrar.so
make install-lib

Then use pyodps-pack to pack the unrar Python library.

pyodps-pack --run-before install-libunrar.sh --dynlib unrar unrar

In the above command, the value unrar for --dynlib omits the lib prefix, and what pyodps-pack actually looks for is /lib/libunrar.so. If you need to include multiple dynamically-linked libraries, specify --dynlib multiple times.

Due to the complexity of dynamically-linked libraries, you may need to load them manually before importing your Python library. For instance,

import ctypes
ctypes.cdll.LoadLibrary("work/packages.tar.gz/packages/dynlibs/libunrar.so")
import unrar

Details about the path used in LoadLibrary in the code above can be found in the directions for using third-party libraries in Python UDFs.

Command details

The arguments of pyodps-pack are listed below, and a combined usage example follows the list:

  • -r, --requirement <file>

    Pack packages listed in the given requirement file. Can be specified multiple times.

  • -o, --output <file>

    Specify file name of the target package, packages.tar.gz by default.

  • --install-requires <item>

    Specify build-time requirements, which might not be included in the final package.

  • --install-requires-file <file>

    Specify build-time requirements in files, which might not be included in the final package.

  • --run-before <script-file>

    Specify the name of a bash script to run before packing; it can be used to install binary dependencies.

  • -X, --exclude <dependency>

    Specify dependencies to exclude from the final package. Can be specified multiple times.

  • --no-deps

    If specified, will not include dependencies of specified requirements.

  • --pre

    If specified, will include pre-release and development versions. By default, pyodps-pack only finds stable versions.

  • --proxy <proxy>

    Specify a proxy in the form scheme://[user:passwd@]proxy.server:port.

  • --retries <retries>

    Maximum number of retries each connection should attempt (default 5 times).

  • --timeout <secs>

    Set the socket timeout (default 15 seconds).

  • -i, --index-url <url>

    Specify the base URL of the Python package index. If absent, the global.index-url value from the pip config list command is used by default.

  • --extra-index-url <url>

    Extra URLs of package indexes to use in addition to --index-url. Should follow the same rules as --index-url.

  • --trusted-host <host>

    Specify domains whose certificates are trusted when package index URLs use HTTPS.

  • -l, --legacy-image

    If specified, will use CentOS 5 to pack, making the final package available under old environments such as legacy proprietary clouds.

  • --mcpy27

    If specified, will build packages for Python 2.7 on MaxCompute and assume --legacy-image is enabled.

  • --dwpy27

    If specified, will build packages for Python 2.7 on DataWorks and assume --legacy-image is enabled.

  • --prefer-binary

    If specified, will prefer older binary packages over newer source packages.

  • --arch <architecture>

    Specify the hardware architecture of the package. Currently only x86_64 and aarch64 (or equivalently arm64) are supported; x86_64 is the default. If you are not running your code inside a proprietary cloud, do not add this argument.

  • --python-version <version>

    Specify Python version for the package. You may use 3.6 or 36 to stand for Python 3.6. If you are not running your code inside a proprietary cloud, do not add this argument.

  • --dynlib <lib-name>

    Specify .so libraries to link dynamically. You may specify a path to the required library, or just the name of the library (with or without lib prefix). The command will seek these libraries under /lib, /lib64, /usr/lib or /usr/lib64, and put them into packages/dynlibs in the package. You may need to call ctypes.cdll.LoadLibrary() with paths to these libraries manually to reference them.

  • --docker-args <args>

    Specify extra arguments needed for Docker command. If there are more than one argument, please put them within quote marks. For instance, --docker-args "--ip 192.168.1.10".

  • --without-docker

    Use non-Docker mode to run pyodps-pack. You might receive errors or get malfunctioning packages with this mode when there are binary dependencies.

  • --without-merge

    Skip building .tar.gz package and keep .whl files after downloading or creating Python wheels.

  • --skip-scan-pkg-resources

    Skip scanning and resolving pkg_resources dependencies in the package. This may save time when there are a large number of dependencies.

  • --debug

    If specified, will output details when executing the command. This argument is for debugging purposes.
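
For instance, a hypothetical invocation that combines several of the arguments above: it packs a requirements file into a custom archive name and excludes one dependency from the result (requirements.txt and the excluded numpy are only illustrative).

pyodps-pack -r requirements.txt -o my-bundle.tar.gz -X numpy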

You can also specify environment variables to control the build; an example invocation follows the list of variables.

  • DOCKER_PATH="path to docker installation"

    Specify the path to the Docker binaries, which should contain the docker executable.

  • BEFORE_BUILD="command before build"

    Specify commands to run before build.

  • AFTER_BUILD="command after build"

    Specify commands to run after tar packages are created.

  • DOCKER_IMAGE="quay.io/pypa/manylinux2010_x86_64"

    Customize the Docker image to use. It is recommended to base custom Docker images on the pypa/manylinux images.
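
For instance, a minimal sketch that overrides the Docker image used for packing; the image tag below is only an illustration, and any image based on pypa/manylinux should work similarly.

DOCKER_IMAGE="quay.io/pypa/manylinux2014_x86_64" pyodps-pack pandas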

Use third-party libraries

Upload third-party libraries

Make sure your packages are uploaded as MaxCompute resources of the archive type. You may use the code below to upload resources; note that you need to change packages.tar.gz to the path of your own package.

import os
from odps import ODPS

# Make sure the environment variable ALIBABA_CLOUD_ACCESS_KEY_ID is set to the user's Access Key ID
# and the environment variable ALIBABA_CLOUD_ACCESS_KEY_SECRET is set to the user's Access Key Secret.
# It is not recommended to hardcode the Access Key ID or Access Key Secret in your code.
o = ODPS(
    os.getenv('ALIBABA_CLOUD_ACCESS_KEY_ID'),
    os.getenv('ALIBABA_CLOUD_ACCESS_KEY_SECRET'),
    project='**your-project**',
    endpoint='**your-endpoint**',
)
o.create_resource("test_packed.tar.gz", "archive", fileobj=open("packages.tar.gz", "rb"))

You can also upload packages in DataWorks by following the steps below.

  1. Go to the DataStudio page.

    1. Log on to the DataWorks console.

    2. In the top navigation bar, click the region list.

    3. Select the region where your workspace resides, find the workspace, and then click Data Analytics in the Actions column.

  2. On the Data Analytics tab, move the pointer over the Create icon and choose MaxCompute > Resource > Python.

    Alternatively, you can click the required workflow in the Business Flow section, right-click MaxCompute, and then choose Create > Resource > Python.

  3. In the Create Resource dialog box, set the Resource Name and Location parameters.

  4. Click Upload and select the file that you want to upload.

  5. Click Create.

  6. Click the Submit icon in the top toolbar to commit the resource to the development environment.

More details can be seen in this article.

Use third-party libraries in Python UDFs

You need to modify your UDF code to use the uploaded packages: add references to the packages in the __init__ method of your UDF class, and import and use them inside the UDF methods, for instance evaluate or process.

We take the psi function in scipy as an example to show how to use third-party libraries in a Python UDF. First, pack the dependencies with the command below:

pyodps-pack -o scipy-bundle.tar.gz scipy

Then write the code below and save it as test_psi_udf.py.

import sys
from odps.udf import annotate


@annotate("double->double")
class MyPsi(object):
    def __init__(self):
        # add line below if and only if protobuf is a dependency
        sys.setdlopenflags(10)
        # add extracted package path into sys.path
        sys.path.insert(0, "work/scipy-bundle.tar.gz/packages")

    def evaluate(self, arg0):
        # keep import statements inside evaluate function body
        from scipy.special import psi

        return float(psi(arg0))

Some explanations of the code above:

  1. When protobuf is a dependency, you need to add sys.setdlopenflags(10) to the __init__ method; pyodps-pack will notify you when this is needed. Adding this line avoids conflicts between the binary versions used by your libraries and by MaxCompute itself.

  2. In the __init__ method, work/scipy-bundle.tar.gz/packages is inserted into sys.path, because MaxCompute extracts all archive resources referenced by the UDF into the work directory, and packages is the subdirectory created by pyodps-pack when packing your dependencies. If you need to load dynamically-linked libraries packed with --dynlib via LoadLibrary, that code can also be added here (see the sketch after this list).

  3. The import statement for scipy is placed inside the body of the evaluate method because third-party libraries are only available while the UDF is being executed; when the UDF is being resolved by the MaxCompute service, the packages are not available, and import statements outside method bodies would cause errors.
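
For reference, here is a minimal sketch of loading a dynamically-linked library in __init__, assuming an archive resource named packages.tar.gz that was packed with --dynlib unrar as in the earlier example; the class and its signature are only illustrative.

import sys
from odps.udf import annotate


@annotate("string->bigint")
class MyDynlibUDF(object):
    def __init__(self):
        import ctypes
        # load the dynamically-linked library packed with --dynlib first
        ctypes.cdll.LoadLibrary("work/packages.tar.gz/packages/dynlibs/libunrar.so")
        # then make the packed Python packages importable
        sys.path.insert(0, "work/packages.tar.gz/packages")

    def evaluate(self, arg0):
        # import and use your Python library here, as in the scipy example above
        return len(arg0)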

Then upload test_psi_udf.py as a MaxCompute Python resource and scipy-bundle.tar.gz as an archive resource. After that, create a Python UDF named test_psi_udf that references both resource files and specifies test_psi_udf.MyPsi as the class name.

The code to accomplish the steps above with PyODPS is shown below.

import os
from odps import ODPS

# Make sure the environment variable ALIBABA_CLOUD_ACCESS_KEY_ID is set to the user's Access Key ID
# and the environment variable ALIBABA_CLOUD_ACCESS_KEY_SECRET is set to the user's Access Key Secret.
# It is not recommended to hardcode the Access Key ID or Access Key Secret in your code.
o = ODPS(
    os.getenv('ALIBABA_CLOUD_ACCESS_KEY_ID'),
    os.getenv('ALIBABA_CLOUD_ACCESS_KEY_SECRET'),
    project='**your-project**',
    endpoint='**your-endpoint**',
)
bundle_res = o.create_resource(
    "scipy-bundle.tar.gz", "archive", fileobj=open("scipy-bundle.tar.gz", "rb")
)
udf_res = o.create_resource(
    "test_psi_udf.py", "py", fileobj=open("test_psi_udf.py", "rb")
)
o.create_function(
    "test_psi_udf", class_type="test_psi_udf.MyPsi", resources=[bundle_res, udf_res]
)

If you want to accomplish these steps with the MaxCompute console, you may type the commands below.

add archive scipy-bundle.tar.gz;
add py test_psi_udf.py;
create function test_psi_udf as test_psi_udf.MyPsi using test_psi_udf.py,scipy-bundle.tar.gz;

After that, you can call the UDF you just created with SQL.

set odps.pypy.enabled=false;
set odps.isolation.session.enable=true;
select test_psi_udf(sepal_length) from iris;

Use third-party libraries in PyODPS DataFrame

PyODPS DataFrame supports using the third-party libraries created above by adding a libraries argument when calling methods like execute or persist. We take the map method as an example; the same procedure applies to the apply and map_reduce methods.

First, create a package for scipy with the command below.

pyodps-pack -o scipy-bundle.tar.gz scipy

Assume that the table is named test_float_col and contains only one column of float values.

   col1
0  3.75
1  2.51

Write the code below to compute the value of psi(col1).

import os
from odps import ODPS, options

def psi(v):
    from scipy.special import psi

    return float(psi(v))

# If isolation is enabled in your project, the option below is not compulsory.
options.sql.settings = {"odps.isolation.session.enable": True}

# Make sure the environment variable ALIBABA_CLOUD_ACCESS_KEY_ID is set to the user's Access Key ID
# and the environment variable ALIBABA_CLOUD_ACCESS_KEY_SECRET is set to the user's Access Key Secret.
# It is not recommended to hardcode the Access Key ID or Access Key Secret in your code.
o = ODPS(
    os.getenv('ALIBABA_CLOUD_ACCESS_KEY_ID'),
    os.getenv('ALIBABA_CLOUD_ACCESS_KEY_SECRET'),
    project='**your-project**',
    endpoint='**your-endpoint**',
)
df = o.get_table("test_float_col").to_df()
# Execute directly and fetch result
df.col1.map(psi).execute(libraries=["scipy-bundle.tar.gz"])
# Store to another table
df.col1.map(psi).persist("result_table", libraries=["scipy-bundle.tar.gz"])

If you want to use the same third-party packages across multiple executions, you can configure them globally:

from odps import options
options.df.libraries = ["scipy-bundle.tar.gz"]

After that, these third-party libraries will be used whenever DataFrames are executed.

Use third-party libraries in DataWorks

PyODPS nodes in DataWorks come with several third-party libraries preinstalled. The load_resource_package method is also provided to load packages that are not preinstalled. Details of its usage can be seen here.
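
For instance, a minimal sketch inside a DataWorks PyODPS node, assuming a package created by pyodps-pack that contains pandas has been uploaded as an archive resource named packages.tar.gz:

# load the uploaded archive resource before importing libraries from it
load_resource_package("packages.tar.gz")

import pandas as pd

print(pd.DataFrame([[1, 2]], columns=["a", "b"]))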

Upload and use third-party libraries manually

Note

The documentation below is only a reference for maintaining legacy projects or projects in legacy environments. For new projects, please use pyodps-pack directly.

Some legacy projects use the old-style method of deploying and using third-party libraries, i.e., manually uploading all dependent wheel packages and referencing them in code. Other projects are deployed in legacy MaxCompute environments that do not support binary wheel packages. This chapter is written for these scenarios. Take the python-dateutil package as an example.

First, you can use the pip download command to download the package and its dependencies to a specific path. Two packages are downloaded: six-1.10.0-py2.py3-none-any.whl and python_dateutil-2.5.3-py2.py3-none-any.whl. Note that the packages must support the Linux environment, so it is recommended to run this command under Linux.

pip download python-dateutil -d /to/path/

Then upload the files to MaxCompute as resources.

>>> # make sure that file extensions are correct
>>> odps.create_resource('six.whl', 'file', file_obj=open('six-1.10.0-py2.py3-none-any.whl', 'rb'))
>>> odps.create_resource('python_dateutil.whl', 'file', file_obj=open('python_dateutil-2.5.3-py2.py3-none-any.whl', 'rb'))

Now assume you have a DataFrame object that contains only a string field.

>>> df
               datestr
0  2016-08-26 14:03:29
1  2015-08-26 14:03:29

Set the third-party libraries as global:

>>> from odps import options
>>>
>>> def get_year(t):
>>>     from dateutil.parser import parse
>>>     return parse(t).strftime('%Y')
>>>
>>> options.df.libraries = ['six.whl', 'python_dateutil.whl']
>>> df.datestr.map(get_year)
   datestr
0     2016
1     2015

Or specify the packages with the libraries argument of an action such as execute:

>>> def get_year(t):
>>>     from dateutil.parser import parse
>>>     return parse(t).strftime('%Y')
>>>
>>> df.datestr.map(get_year).execute(libraries=['six.whl', 'python_dateutil.whl'])
   datestr
0     2016
1     2015

By default, PyODPS supports third-party libraries that contain pure Python code but no file operations. In newer versions of MaxCompute, PyODPS also supports Python libraries that contain binary code or file operations. These libraries must be suffixed with certain strings, which can be looked up in the table below.

Platform         Python version   Suffixes available
---------------  ---------------  ------------------
RHEL 5 x86_64    Python 2.7       cp27-cp27m-manylinux1_x86_64
RHEL 5 x86_64    Python 3.7       cp37-cp37m-manylinux1_x86_64
RHEL 7 x86_64    Python 2.7       cp27-cp27m-manylinux1_x86_64, cp27-cp27m-manylinux2010_x86_64, cp27-cp27m-manylinux2014_x86_64
RHEL 7 x86_64    Python 3.7       cp37-cp37m-manylinux1_x86_64, cp37-cp37m-manylinux2010_x86_64, cp37-cp37m-manylinux2014_x86_64
RHEL 7 ARM64     Python 3.7       cp37-cp37m-manylinux2014_aarch64

All these .whl packages need to be uploaded in the archive format, and the .whl files must be renamed to .zip files. You also need to enable the odps.isolation.session.enable option or enable isolation in your project. The following example demonstrates how to upload and use the special functions in scipy:

>>> # packages containing binaries should be uploaded with archive method,
>>> # replacing extension .whl with .zip.
>>> odps.create_resource('scipy.zip', 'archive', file_obj=open('scipy-0.19.0-cp27-cp27m-manylinux1_x86_64.whl', 'rb'))
>>>
>>> # if your project has already been configured with isolation, the line below can be omitted
>>> options.sql.settings = { 'odps.isolation.session.enable': True }
>>>
>>> def psi(value):
>>>     # it is recommended to import third-party libraries inside your function
>>>     # in case that structures of the same package differ between different systems.
>>>     from scipy.special import psi
>>>     return float(psi(value))
>>>
>>> df.float_col.map(psi).execute(libraries=['scipy.zip'])

Packages that contain binary code but are distributed only as source can be built into .whl files under a Linux shell and then uploaded. .whl files generated on macOS or Windows are not usable in MaxCompute:

python setup.py bdist_wheel
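
After building, the wheel is generated under dist/ and can be uploaded like the packages above. A minimal sketch with a hypothetical file name, assuming the wheel contains binary code and is therefore uploaded as an archive:

>>> # the file name below is hypothetical; use the wheel generated under dist/
>>> odps.create_resource('custom_package.zip', 'archive', file_obj=open('dist/custom_package-1.0.0-cp37-cp37m-linux_x86_64.whl', 'rb'))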