Create and use third-party libraries

Create third-party libraries

Since 0.11.3, PyODPS provides a pip-like command line tool, pyodps-pack, for creating third-party library bundles that can be used in PyODPS and DataWorks nodes. The tool packs all your dependencies into a single .tar.gz archive built against the Python environments in MaxCompute or DataWorks. It can also pack Python packages that you have written yourself.

Prerequisites

Docker mode

You need to install Docker to run pyodps-pack in Docker mode. Linux users can install Docker following the official documentation. For personal use on macOS or Windows, Docker Desktop can be used. Enterprise users without a commercial Docker Desktop license may use Rancher Desktop instead. We have not tested other tools that provide Docker environments, such as minikube, and availability of pyodps-pack on them is not guaranteed.

If you need to create packages for legacy MaxCompute or DataWorks deployments in private clouds, use the --legacy-image option. On Windows, macOS, or Linux with certain kernel versions, this option may raise errors. In that case, take a look at this article for solutions.

For Windows users, the Docker service may depend on the Server service of the Windows system, which is often disabled in corporate environments. In that case, create your packages under Linux or try starting the Server service. Rancher Desktop is known to misbehave with containerd as the container engine; switch to dockerd instead. Details about switching container engines can be found in this article.

If your MaxCompute or DataWorks is deployed on the Arm64 architecture, add the extra --arch aarch64 argument to specify the package architecture. Components needed for cross-architecture packaging, such as binfmt, are usually already included in Docker Desktop or Rancher Desktop. You can also run the command below to install the related emulation support manually.

docker run --privileged --rm tonistiigi/binfmt --install arm64

This command requires Linux kernel version 4.8 or above. Details of the command can be found in this article.

Non-Docker mode

Note

We recommend using Docker mode to create packages whenever possible. Use non-Docker mode only when Docker is not available, as it may produce malfunctioning packages.

If you have trouble installing Docker, you can try non-Docker mode by adding the --without-docker argument. Non-Docker mode requires pip in your Python installation. Windows users also need Git bash, which is included in Git for Windows.

Pack all dependencies

Note

It is recommended to use Python 3 for new projects. We do not guarantee that the methods below work for Python 2. Try to migrate your legacy projects to Python 3 to reduce maintenance difficulties in the future.

On Linux, add sudo when calling pyodps-pack to make sure Docker is invoked correctly.

After PyODPS is installed, you can use the command below to pack pandas and all its dependencies.

pyodps-pack pandas

If you want to pack in non-Docker mode, use

pyodps-pack --without-docker pandas

If you need to specify the pandas version, use

pyodps-pack pandas==1.2.5

After the packing process finishes, the utility shows the versions of the packed packages.

Package         Version
--------------- -------
numpy           1.21.6
pandas          1.2.5
python-dateutil 2.8.2
pytz            2022.6
six             1.16.0

and generates a packages.tar.gz archive containing all the dependencies listed above.
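
If you want to check what went into the archive, a minimal sketch like the one below lists its contents (assuming packages.tar.gz sits in the current directory); the dependencies are stored under a packages subdirectory inside the archive.

import tarfile

# list every file bundled into the archive created by pyodps-pack
with tarfile.open("packages.tar.gz", "r:gz") as tar:
    for name in tar.getnames():
        print(name)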

If you need to create packages for Python 2.7, first check whether your package will be used in MaxCompute or DataWorks. If you are not sure which environment you are using, take a look at this article. To use Python 2.7 packages in MaxCompute, use the command below.

pyodps-pack --mcpy27 pandas

To use Python 2.7 packages in DataWorks, use the command below.

pyodps-pack --dwpy27 pandas

Pack custom source code

pyodps-pack supports packing user-defined source code organized with setup.py or pyproject.toml. If you want to learn how to build Python packages with these files, take a look at this link for more information.

Below we show how to pack custom code by creating a package with pyproject.toml and packing it with pyodps-pack. Assume that the directory structure of the project looks like

test_package_root
├── test_package
│   ├── __init__.py
│   ├── mod1.py
│   └── subpackage
│       ├── __init__.py
│       └── mod2.py
└── pyproject.toml

while the content of pyproject.toml is

[project]
name = "test_package"
description = "pyodps-pack example package"
version = "0.1.0"
dependencies = [
    "pandas>=1.0.5"
]
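
The module files can contain ordinary Python code. For illustration, a hypothetical test_package/mod1.py might look like the sketch below (the function and its body are invented for this example):

# test_package/mod1.py -- hypothetical module body for this example
import pandas as pd


def make_frame(values):
    # build a single-column DataFrame from an iterable of values
    return pd.DataFrame({"value": list(values)})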

After the package is developed, we can pack it together with all its dependencies into packages.tar.gz. (path_to_package is the parent directory of test_package_root.)

pyodps-pack /<path_to_package>/test_package_root

Pack code in a Git repository

pyodps-pack also supports packing remote Git repositories. We take the PyODPS repository as an example.

pyodps-pack git+https://github.com/aliyun/aliyun-odps-python-sdk.git

If you want to pack a specific branch or tag, use

pyodps-pack git+https://github.com/aliyun/aliyun-odps-python-sdk.git@v0.11.2.2

If you need dependencies at build time, for instance Cython, you can use the --install-requires argument to specify them. You may also create a text file, for instance install-requires.txt, whose format is similar to requirements.txt, and reference it with --install-requires-file. For instance, if you need Cython installed before packing PyODPS, you can call

pyodps-pack \
    --install-requires cython \
    git+https://github.com/aliyun/aliyun-odps-python-sdk.git@v0.11.2.2

It is also possible to write an install-requires.txt with the content

cython>0.29

and the pack command can be written as

pyodps-pack \
    --install-requires-file install-requires.txt \
    git+https://github.com/aliyun/aliyun-odps-python-sdk.git@v0.11.2.2

A more complicated case: adding binary dependencies

Some third-party libraries depend on extra binary libraries, for instance dynamically-linked libraries that need to be built and installed first. You can use the --run-before argument of pyodps-pack to specify a bash script that installs these binary dependencies. We take the geospatial library GDAL as an example to show how to pack this kind of package.

First, we need to find out which dependencies have to be installed. According to the documentation of GDAL 3.6.0.1 on PyPI, we need to install libgdal >= 3.6.0, and the GDAL build hints show that it in turn depends on PROJ >= 6.0. Both dependencies can be built with CMake, so we write a bash script, install-gdal.sh, to install them.

#!/bin/bash
set -e

cd /tmp
curl -o proj-6.3.2.tar.gz https://download.osgeo.org/proj/proj-6.3.2.tar.gz
tar xzf proj-6.3.2.tar.gz
cd proj-6.3.2
mkdir build && cd build
cmake ..
cmake --build .
cmake --build . --target install

cd /tmp
curl -o gdal-3.6.0.tar.gz http://download.osgeo.org/gdal/3.6.0/gdal-3.6.0.tar.gz
tar xzf gdal-3.6.0.tar.gz
cd gdal-3.6.0
mkdir build && cd build
cmake ..
cmake --build .
cmake --build . --target install

Then use pyodps-pack to pack the GDAL Python library.

pyodps-pack --install-requires oldest-supported-numpy --run-before install-gdal.sh gdal==3.6.0.1

Command details

The arguments of pyodps-pack are listed below:

  • -r, --requirement <file>

    Pack packages from the specified requirement file. Can be used multiple times.

  • -o, --output <file>

    Specify file name of the target package, packages.tar.gz by default.

  • --install-requires <item>

    Specify build-time requirements, which might not be included in the final package.

  • --install-requires-file <file>

    Specify a file containing build-time requirements, which might not be included in the final package.

  • --run-before <script-file>

    Specify name of bash script to run before packing, can be used to install binary dependencies.

  • -X, --exclude <dependency>

    Specify dependencies to be excluded from the final package. Can be specified multiple times.

  • --no-deps

    If specified, will not include dependencies of specified requirements.

  • -i, --index-url <index-url>

    Specify the base URL of the PyPI index. If absent, the global.index-url value shown by the pip config list command is used by default.

  • --trusted-host <host>

    Specify domains whose certificates are trusted when PyPI URLs use HTTPS.

  • -l, --legacy-image

    If specified, a CentOS 5 image is used for packing, making the final package usable in old environments such as legacy private clouds.

  • --mcpy27

    If specified, will build packages for Python 2.7 on MaxCompute and assume --legacy-image is enabled.

  • --dwpy27

    If specified, will build packages for Python 2.7 on DataWorks and assume --legacy-image is enabled.

  • --prefer-binary

    If specified, will prefer older binary packages over newer source packages.

  • --arch <architecture>

    Specify the hardware architecture of the package. Currently only x86_64 and aarch64 (or equivalently arm64) are supported, with x86_64 as the default. If you are not running your code inside a proprietary cloud, do not add this argument.

  • --python-version <version>

    Specify Python version for the package. You may use 3.6 or 36 to stand for Python 3.6. If you are not running your code inside a proprietary cloud, do not add this argument.

  • --docker-args <args>

    Specify extra arguments for the Docker command. If there is more than one argument, put them within quotation marks, for instance --docker-args "--ip 192.168.1.10".

  • --without-docker

    Use non-Docker mode to run pyodps-pack. You might receive errors or get malfunctioning packages with this mode when there are binary dependencies.

  • --without-merge

    Skip building .tar.gz package and keep .whl files after downloading or creating Python wheels.

  • --debug

    If specified, details of command execution are printed. This argument is for debugging purposes.

You can also specify environment variables to control the build.

  • BEFORE_BUILD="command before build"

    Specify commands to run before build.

  • AFTER_BUILD="command after build"

    Specify commands to run after tar packages are created.

  • DOCKER_IMAGE="quay.io/pypa/manylinux2010_x86_64"

    Customize the Docker image to use. It is recommended to build custom Docker images based on pypa/manylinux images.

Use third-party libraries

Upload third-party libraries

Make sure your packages are uploaded as MaxCompute resources of the archive type. You can upload resources with the code below. Note that you need to replace packages.tar.gz with the path to your package.

from odps import ODPS

o = ODPS("<access_id>", "<secret_access_key>", "<project_name>", "<endpoint>")
o.create_resource("test_packed.tar.gz", "archive", fileobj=open("packages.tar.gz", "rb"))

You can also upload packages with DataWorks by following the steps below.

  1. Go to the DataStudio page.

    1. Log on to the DataWorks console.
    2. In the top navigation bar, click list of regions.
    3. Select the region where your workspace resides, find the workspace, and then click Data Analytics in the Actions column.
  2. On the Data Analytics tab, move the pointer over the Create icon and choose MaxCompute > Resource > Python.

    Alternatively, you can click the required workflow in the Business Flow section, right-click MaxCompute, and then choose Create > Resource > Python.

  3. In the Create Resource dialog box, set the Resource Name and Location parameters.

  4. Click Upload and select the file that you want to upload.

  5. Click Create.

  6. Click the Submit icon in the top toolbar to commit the resource to the development environment.

More details can be seen in this article.

Use third-party libraries in Python UDFs

You need to modify your UDF code to use the uploaded packages. Add references to your packages in the __init__ method of your UDF class, then import and use them in the UDF body, for instance in the evaluate or process method.

We take the psi function in scipy as an example to show how to use third-party libraries in a Python UDF. First, pack the dependencies with the command below:

pyodps-pack -o scipy-bundle.tar.gz scipy

Then write the code below and store it as test_psi_udf.py.

import sys
from odps.udf import annotate


@annotate("double->double")
class MyPsi(object):
    def __init__(self):
        # add extracted package path into sys.path
        sys.path.insert(0, "work/scipy-bundle.tar.gz/packages")

    def evaluate(self, arg0):
        # keep import statements inside evaluate function body
        from scipy.special import psi

        return float(psi(arg0))

Some explanations of the code above: in the __init__ method, work/scipy-bundle.tar.gz/packages is inserted into sys.path, because MaxCompute extracts all archive resources referenced by the UDF into the work directory, and packages is the subdirectory created by pyodps-pack when packing your dependencies. The import statement for scipy is placed inside the body of evaluate because third-party libraries are only available while the UDF is being executed; when the UDF is being resolved by the MaxCompute service, the packages are not available and import statements outside method bodies would cause errors.

Then upload test_psi_udf.py as a MaxCompute Python resource and scipy-bundle.tar.gz as an archive resource. After that, create a Python UDF named test_psi_udf that references both resource files and uses test_psi_udf.MyPsi as the class name.

Code to accomplish these steps with PyODPS is shown below.

from odps import ODPS

o = ODPS("<access_id>", "<secret_access_key>", "<project_name>", "<endpoint>")
bundle_res = o.create_resource(
    "scipy-bundle.tar.gz", "archive", fileobj=open("scipy-bundle.tar.gz", "rb")
)
udf_res = o.create_resource(
    "test_psi_udf.py", "py", fileobj=open("test_psi_udf.py", "rb")
)
o.create_function(
    "test_psi_udf", class_type="test_psi_udf.MyPsi", resources=[bundle_res, udf_res]
)

If you want to use the MaxCompute Console to accomplish these steps, type the commands below.

add archive scipy-bundle.tar.gz;
add py test_psi_udf.py;
create function test_psi_udf as test_psi_udf.MyPsi using test_psi_udf.py,scipy-bundle.tar.gz;

After that, you can call the UDF you just created with SQL.

set odps.pypy.enabled=false;
set odps.isolation.session.enable=true;
select test_psi_udf(sepal_length) from iris;
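
The same query can also be submitted through PyODPS. Below is a minimal sketch (assuming the iris table exists in your project) that passes the two settings as SQL hints:

from odps import ODPS

o = ODPS("<access_id>", "<secret_access_key>", "<project_name>", "<endpoint>")
# submit the query with the same settings passed as hints
instance = o.execute_sql(
    "select test_psi_udf(sepal_length) from iris",
    hints={"odps.pypy.enabled": "false", "odps.isolation.session.enable": True},
)
with instance.open_reader() as reader:
    for record in reader:
        print(record[0])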

Use third-party libraries in PyODPS DataFrame

PyODPS DataFrame supports the third-party libraries created above: add a libraries argument when calling methods like execute or persist. We take the map method as an example; the same procedure applies to the apply and map_reduce methods.

First, create a package for scipy with the command below.

pyodps-pack -o scipy-bundle.tar.gz scipy

Assume that the table is named test_float_col and contains only one float column, col1.

Write the code below to compute the value of psi(col1).

from odps import ODPS, options

def psi(v):
    from scipy.special import psi

    return float(psi(v))

# If isolation is enabled in your project, the option below is not compulsory.
options.sql.settings = {"odps.isolation.session.enable": True}

o = ODPS("<access_id>", "<secret_access_key>", "<project_name>", "<endpoint>")
df = o.get_table("test_float_col").to_df()
# Execute directly and fetch result
df.col1.map(psi).execute(libraries=["scipy-bundle.tar.gz"])
# Store to another table
df.col1.map(psi).persist("result_table", libraries=["scipy-bundle.tar.gz"])

If you want to use the same third-party packages across multiple calls, you can configure them globally:

from odps import options
options.df.libraries = ["scipy-bundle.tar.gz"]

After that, these third-party libraries are used whenever DataFrames are executed, without passing the libraries argument each time.
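
For instance, with the global option set, the map call from the previous example can be executed without passing libraries explicitly (assuming df and psi are defined as above):

# options.df.libraries applies to all following executions
df.col1.map(psi).execute()
df.col1.map(psi).persist("result_table")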

Use third-party libraries in DataWorks

PyODPS nodes in DataWorks have several third-party libraries preinstalled. The load_resource_package method is also provided to load packages that are not preinstalled. Details of its usage can be seen here.
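
As a rough sketch of how this is typically used inside a DataWorks PyODPS node (the resource name pack.tar.gz and the bundled package are hypothetical; check the linked documentation for the exact argument form of load_resource_package):

# load an archive resource created by pyodps-pack, then import packages from it
# (resource and package names below are hypothetical)
load_resource_package("pack.tar.gz")

import sklearn  # assumed to be inside the bundle
print(sklearn.__version__)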

Upload and use third-party libraries manually

Note

The documentation below is provided only as a reference for maintaining legacy projects or projects in legacy environments. For new projects, please use pyodps-pack directly.

Some legacy projects use the old-style method of deploying third-party libraries, that is, manually uploading all dependent wheel packages and referencing them in code. Other projects are deployed in legacy MaxCompute environments that do not support binary wheel packages. This chapter covers these scenarios, taking the python-dateutil package as an example.

First, use the pip download command to download the package and its dependencies to a specific path. Two packages are downloaded: six-1.10.0-py2.py3-none-any.whl and python_dateutil-2.5.3-py2.py3-none-any.whl. Note that the packages must support the Linux environment, so it is recommended to run this command under Linux.

pip download python-dateutil -d /to/path/

Then upload the files to MaxCompute as resources.

>>> # make sure that file extensions are correct
>>> odps.create_resource('six.whl', 'file', file_obj=open('six-1.10.0-py2.py3-none-any.whl', 'rb'))
>>> odps.create_resource('python_dateutil.whl', 'file', file_obj=open('python_dateutil-2.5.3-py2.py3-none-any.whl', 'rb'))

Now suppose you have a DataFrame object that contains only a string field.

>>> df
               datestr
0  2016-08-26 14:03:29
1  2015-08-26 14:03:29

Set the third-party libraries globally:

>>> from odps import options
>>>
>>> def get_year(t):
>>>     from dateutil.parser import parse
>>>     return parse(t).strftime('%Y')
>>>
>>> options.df.libraries = ['six.whl', 'python_dateutil.whl']
>>> df.datestr.map(get_year)
   datestr
0     2016
1     2015

Or specify the packages with the libraries argument of an action:

>>> def get_year(t):
>>>     from dateutil.parser import parse
>>>     return parse(t).strftime('%Y')
>>>
>>> df.datestr.map(get_year).execute(libraries=['six.whl', 'python_dateutil.whl'])
   datestr
0     2016
1     2015

By default, PyODPS supports third-party libraries that contain pure Python code and no file operations. Newer versions of MaxCompute also support Python libraries that contain binary code or file operations. Such libraries must have file names with specific suffixes, which are listed in the table below.

Platform        Python version   Suffixes available
RHEL 5 x86_64   Python 2.7       cp27-cp27m-manylinux1_x86_64
RHEL 5 x86_64   Python 3.7       cp37-cp37m-manylinux1_x86_64
RHEL 7 x86_64   Python 2.7       cp27-cp27m-manylinux1_x86_64, cp27-cp27m-manylinux2010_x86_64, cp27-cp27m-manylinux2014_x86_64
RHEL 7 x86_64   Python 3.7       cp37-cp37m-manylinux1_x86_64, cp37-cp37m-manylinux2010_x86_64, cp37-cp37m-manylinux2014_x86_64
RHEL 7 Arm64    Python 3.7       cp37-cp37m-manylinux2014_aarch64

These .whl packages need to be uploaded as archive resources, and the .whl files must be renamed to .zip first. You also need to enable the odps.isolation.session.enable option or enable isolation in your project. The following example demonstrates how to upload scipy and use its special functions:

>>> # packages containing binaries should be uploaded with archive method,
>>> # replacing extension .whl with .zip.
>>> odps.create_resource('scipy.zip', 'archive', file_obj=open('scipy-0.19.0-cp27-cp27m-manylinux1_x86_64.whl', 'rb'))
>>>
>>> # if your project has already been configured with isolation, the line below can be omitted
>>> options.sql.settings = { 'odps.isolation.session.enable': True }
>>>
>>> def psi(value):
>>>     # it is recommended to import third-party libraries inside your function
>>>     # in case that structures of the same package differ between different systems.
>>>     from scipy.special import psi
>>>     return float(psi(value))
>>>
>>> df.float_col.map(psi).execute(libraries=['scipy.zip'])

For packages with binary components that are only distributed as source code, you can build them into .whl files in a Linux shell and then upload them. .whl files generated on macOS or Windows are not usable in MaxCompute:

python setup.py bdist_wheel