Create and use third-party libraries
Create third-party libraries
PyODPS provides a pip-like command line tool, pyodps-pack, to create third-party library bundles that can be used in PyODPS and DataWorks nodes since version 0.11.3. The tool packs all your dependencies into a single .tar.gz archive built against the Python environments in MaxCompute or DataWorks. It can also pack Python packages created by yourself.
Prerequisites
Docker mode
You need to install Docker to run pyodps-pack correctly in Docker mode. You do not need to run pyodps-pack manually inside a Docker container; it calls Docker for you automatically. Linux users can install Docker by following the official document. Personal macOS or Windows users can use Docker Desktop. Enterprise users without commercial licenses of Docker Desktop may use Rancher Desktop instead, or use minikube with the extra steps described in this document. We have not tested other tools that provide Docker environments, and availability of the tool on those environments is not guaranteed.
For users who want to create packages for legacy MaxCompute / DataWorks in private clouds, the --legacy-image option can be used. On Windows, macOS, or Linux with certain kernel versions, this option may produce errors; in that case, take a look at this article for solutions.
For Windows users, your Docker service may depend on the Server service of the Windows system, which is often disabled in many companies. In that case, please create packages under Linux or try starting the service. Rancher Desktop is known to misbehave with containerd as the container engine; switch to dockerd instead. Details about switching container engines can be found in this article.
If your MaxCompute or DataWorks is deployed on the ARM64 architecture (usually within proprietary clouds), you need to add an extra --arch aarch64 argument to specify the target architecture of the package. Components for cross-architecture packaging such as binfmt are usually already included in Docker Desktop or Rancher Desktop. You can also run the command below to install the related virtual environments manually.
docker run --privileged --rm tonistiigi/binfmt --install arm64
This command requires Linux kernel version 4.8 or above. Details of the command can be found in this article.
Non-Docker mode
Note
We recommend using Docker mode to create packages whenever possible. Use non-Docker mode only when Docker is not available, as it may produce malfunctioning packages.
If you have problems installing Docker, you can try non-Docker mode by adding the --without-docker argument. Non-Docker mode requires pip to be available in your Python installation. Windows users need Git bash, which is included in Git for Windows, to use non-Docker mode.
Pack all dependencies
Note
It is recommended to use Python 3 for new projects. We do not guarantee availability of the methods below for Python 2. Please migrate your legacy projects to Python 3 to reduce maintenance difficulties in the future.
Please add sudo when calling pyodps-pack on Linux to make sure Docker is called correctly.
After PyODPS is installed, you can use the command below to pack pandas and all its dependencies.
pyodps-pack pandas
If you want to pack in non-Docker mode, you can use
pyodps-pack --without-docker pandas
If you need to specify the version of pandas, you may use
pyodps-pack pandas==1.2.5
After a series of packing steps, the utility shows the versions of the packed packages
Package Version
--------------- -------
numpy 1.21.6
pandas 1.2.5
python-dateutil 2.8.2
pytz 2022.6
six 1.16.0
and generates a packages.tar.gz containing all the dependency items listed above.
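If you want to verify what went into the archive, you can inspect it locally. A minimal sketch using Python's standard tarfile module (the helper name is ours, not part of pyodps-pack):

```python
import tarfile

def list_bundle(pack_path):
    """List member paths inside a pyodps-pack archive (a plain .tar.gz)."""
    with tarfile.open(pack_path, "r:gz") as tf:
        return [member.name for member in tf.getmembers()]

# e.g. list_bundle("packages.tar.gz") should show entries under the
# packages/ subdirectory created by pyodps-pack
```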
If you need to create packages for Python 2.7, please check which environment your package will work with, MaxCompute or DataWorks. If you are not sure which environment you are using, you may take a look at this article. If you want to use Python 2.7 packages in MaxCompute, you can use the command below.
pyodps-pack --mcpy27 pandas
If you want to use Python 2.7 packages in DataWorks, you can use the command below.
pyodps-pack --dwpy27 pandas
Pack custom source code
pyodps-pack supports packing user-defined source code organized with setup.py or pyproject.toml. If you want to know how to build Python packages with these files, take a look at this link for more information.
We show how to pack custom code by creating a package with pyproject.toml and packing it with pyodps-pack. Assume the directory structure of the project looks like
test_package_root
├── test_package
│ ├── __init__.py
│ ├── mod1.py
│ └── subpackage
│ ├── __init__.py
│ └── mod2.py
└── pyproject.toml
while the content of pyproject.toml is
[project]
name = "test_package"
description = "pyodps-pack example package"
version = "0.1.0"
dependencies = [
"pandas>=1.0.5"
]
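To experiment with the layout above, you can scaffold it locally with a few lines of standard-library Python; a sketch (the module files are left empty, and the scaffold function is our own, not part of pyodps-pack):

```python
from pathlib import Path

PYPROJECT = '''[project]
name = "test_package"
description = "pyodps-pack example package"
version = "0.1.0"
dependencies = ["pandas>=1.0.5"]
'''

def scaffold(root):
    """Create the example project layout under the given root directory."""
    base = Path(root) / "test_package_root"
    for rel in ("test_package/__init__.py",
                "test_package/mod1.py",
                "test_package/subpackage/__init__.py",
                "test_package/subpackage/mod2.py"):
        target = base / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.touch()
    (base / "pyproject.toml").write_text(PYPROJECT)
    return base
```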
After development of the package, we can pack it and all its dependencies into packages.tar.gz with the command below, where <path_to_package> is the parent directory of test_package_root.
pyodps-pack /<path_to_package>/test_package_root
Pack code in a Git repository
pyodps-pack supports packing remote Git repositories. We take the PyODPS repository as an example to show how to pack one.
pyodps-pack git+https://github.com/aliyun/aliyun-odps-python-sdk.git
If you want to pack a certain branch or tag, you may use
pyodps-pack git+https://github.com/aliyun/aliyun-odps-python-sdk.git@v0.11.2.2
If you need dependencies at build time, for instance cython, you can use the --install-requires argument to specify a build-time dependency. You may also create a text file, install-requires.txt, with a format similar to requirements.txt, and reference it with --install-requires-file. For instance, if you need to install Cython before packing PyODPS, you can call
pyodps-pack \
--install-requires cython \
git+https://github.com/aliyun/aliyun-odps-python-sdk.git@v0.11.2.2
It is also possible to write an install-requires.txt with content
cython>0.29
and the pack command can be written as
pyodps-pack \
--install-requires-file install-requires.txt \
git+https://github.com/aliyun/aliyun-odps-python-sdk.git@v0.11.2.2
A more complicated case: adding binary dependencies
Some third-party libraries depend on extra binary libraries, for instance dynamically-linked libraries that need to be built and installed. You can use the --run-before argument of pyodps-pack to specify a bash script that installs these binary dependencies. We take the geospatial library GDAL as an example to show how to pack this kind of package.
First, we need to find out which dependencies need to be installed. The document of GDAL 3.6.0.1 on PyPI tells us that we need libgdal >= 3.6.0, and the build hints of GDAL show that it depends on PROJ >= 6.0. Both dependencies can be built with CMake. Thus we write a bash script, install-gdal.sh, to install them.
#!/bin/bash
set -e
cd /tmp
curl -o proj-6.3.2.tar.gz https://download.osgeo.org/proj/proj-6.3.2.tar.gz
tar xzf proj-6.3.2.tar.gz
cd proj-6.3.2
mkdir build && cd build
cmake ..
cmake --build .
cmake --build . --target install
cd /tmp
curl -o gdal-3.6.0.tar.gz http://download.osgeo.org/gdal/3.6.0/gdal-3.6.0.tar.gz
tar xzf gdal-3.6.0.tar.gz
cd gdal-3.6.0
mkdir build && cd build
cmake ..
cmake --build .
cmake --build . --target install
Then use pyodps-pack to pack the GDAL Python library.
pyodps-pack --install-requires oldest-supported-numpy --run-before install-gdal.sh gdal==3.6.0.1
In some scenarios binary dependencies are loaded dynamically from Python (for instance, with ctypes.cdll.LoadLibrary). You can use the --dynlib argument to include such binary libraries, given either a path or a library name under /lib. The binary dependency will be packed into packages/dynlibs inside the package. For instance, the Python library unrar dynamically links the binary library libunrar, and we can use the script install-libunrar.sh shown below to compile and install it.
#!/bin/bash
curl -o unrar.tar.gz https://www.rarlab.com/rar/unrarsrc-6.0.3.tar.gz
tar xzf unrar.tar.gz
cd unrar
make -j4 lib
# Code below sets SONAME to libunrar.so for the package,
# which is required by LoadLibrary in Python.
# This is not needed for most of binary libraries.
patchelf --set-soname libunrar.so libunrar.so
make install-lib
Then use pyodps-pack to pack the unrar Python library.
pyodps-pack --run-before install-libunrar.sh --dynlib unrar unrar
In the command above, the value unrar for --dynlib omits the lib prefix, and what pyodps-pack actually finds is /lib/libunrar.so. If you need to include multiple dynamically-linked libraries, specify --dynlib multiple times.
Due to the complexity of dynamically-linked libraries, you may need to load them manually before importing your Python library. For instance,
import ctypes
ctypes.cdll.LoadLibrary("work/packages.tar.gz/packages/dynlibs/libunrar.so")
import unrar
Details about the path used in LoadLibrary in the code above can be found in the directions on using third-party libraries in Python UDFs.
Command details
Arguments of pyodps-pack are listed below:
-r, --requirement <file>
  Pack the given requirement file. Can be used multiple times.
-o, --output <file>
  Specify the file name of the target package, packages.tar.gz by default.
--install-requires <item>
  Specify build-time requirements, which might not be included in the final package.
--install-requires-file <file>
  Specify build-time requirements in files, which might not be included in the final package.
--run-before <script-file>
  Specify the name of a bash script to run before packing, which can be used to install binary dependencies.
-X, --exclude <dependency>
  Specify dependencies to exclude from the final package. Can be used multiple times.
--no-deps
  If specified, do not include dependencies of the specified requirements.
--pre
  If specified, include pre-release and development versions. By default, pyodps-pack only finds stable versions.
--proxy <proxy>
  Specify a proxy in the form scheme://[user:passwd@]proxy.server:port.
--retries <retries>
  Maximum number of retries each connection should attempt (default 5 times).
--timeout <secs>
  Set the socket timeout (default 15 seconds).
-i, --index-url <url>
  Specify the base URL of the PyPI package index. If absent, the global.index-url value shown by the pip config list command is used by default.
--extra-index-url <url>
  Extra URLs of package indexes to use in addition to --index-url. Should follow the same rules as --index-url.
--trusted-host <host>
  Specify domains whose certificates are trusted when PyPI URLs use HTTPS.
-l, --legacy-image
  If specified, use CentOS 5 to pack, making the final package available under old environments such as legacy proprietary clouds.
--mcpy27
  If specified, build packages for Python 2.7 on MaxCompute and assume --legacy-image is enabled.
--dwpy27
  If specified, build packages for Python 2.7 on DataWorks and assume --legacy-image is enabled.
--prefer-binary
  If specified, prefer older binary packages over newer source packages.
--arch <architecture>
  Specify the hardware architecture of the package. Currently only x86_64 and aarch64 (or equivalently arm64) are supported, x86_64 by default. If you are not running your code inside a proprietary cloud, do not add this argument.
--python-version <version>
  Specify the Python version of the package. You may use 3.6 or 36 to stand for Python 3.6. If you are not running your code inside a proprietary cloud, do not add this argument.
--dynlib <lib-name>
  Specify .so libraries to link dynamically. You may specify a path to the required library, or just the name of the library (with or without the lib prefix). The command seeks these libraries under /lib, /lib64, /usr/lib, and /usr/lib64, and puts them into packages/dynlibs in the package. You may need to call ctypes.cdll.LoadLibrary() with paths to these libraries manually to reference them.
--docker-args <args>
  Specify extra arguments for the Docker command. If there is more than one argument, put them within quote marks, for instance --docker-args "--ip 192.168.1.10".
--without-docker
  Run pyodps-pack in non-Docker mode. You might receive errors or get malfunctioning packages with this mode when there are binary dependencies.
--without-merge
  Skip building the .tar.gz package and keep .whl files after downloading or creating Python wheels.
--skip-scan-pkg-resources
  Skip scanning and resolving dependencies of pkg_resources in the package. May save time when there are a large number of dependencies.
--debug
  If specified, output details when executing the command. This argument is for debugging purposes.
You can also specify environment variables to control the build.
DOCKER_PATH="path to docker installation"
  Specify the path to Docker executable files, which should contain the docker executable.
BEFORE_BUILD="command before build"
  Specify commands to run before the build.
AFTER_BUILD="command after build"
  Specify commands to run after the tar package is created.
DOCKER_IMAGE="quay.io/pypa/manylinux2010_x86_64"
  Customize the Docker image to use. It is recommended to build the Docker image based on pypa/manylinux images.
Use third-party libraries
Upload third-party libraries
Please make sure your packages are uploaded as MaxCompute resources of the archive type. To upload resources, you may use the code below. Note that you need to change packages.tar.gz to the path of your package.
import os
from odps import ODPS
# Make sure environment variable ALIBABA_CLOUD_ACCESS_KEY_ID already set to Access Key ID of user
# while environment variable ALIBABA_CLOUD_ACCESS_KEY_SECRET set to Access Key Secret of user.
# Not recommended to hardcode Access Key ID or Access Key Secret in your code.
o = ODPS(
os.getenv('ALIBABA_CLOUD_ACCESS_KEY_ID'),
os.getenv('ALIBABA_CLOUD_ACCESS_KEY_SECRET'),
project='**your-project**',
endpoint='**your-endpoint**',
)
o.create_resource("test_packed.tar.gz", "archive", fileobj=open("packages.tar.gz", "rb"))
You can also upload packages with DataWorks following the steps below.
1. Go to the DataStudio page.
   a. Log on to the DataWorks console.
   b. In the top navigation bar, click the list of regions.
   c. Select the region where your workspace resides, find the workspace, and then click Data Analytics in the Actions column.
2. On the Data Analytics tab, move the pointer over the Create icon and choose MaxCompute > Resource > Python. Alternatively, click the required workflow in the Business Flow section, right-click MaxCompute, and then choose Create > Resource > Python.
3. In the Create Resource dialog box, set the Resource Name and Location parameters.
4. Click Upload and select the file that you want to upload.
5. Click Create.
6. Click the Submit icon in the top toolbar to commit the resource to the development environment.
More details can be seen in this article.
Use third-party libraries in Python UDFs
You need to modify your UDF code to use the uploaded packages. Add references to your packages in the __init__ method of your UDF class, and use the packages inside your UDF code, for instance in the evaluate or process method.
We take the psi function in scipy as an example to show how to use third-party libraries in a Python UDF. First, pack the dependencies with the command below:
pyodps-pack -o scipy-bundle.tar.gz scipy
Then write the code below and store it as test_psi_udf.py.
import sys
from odps.udf import annotate
@annotate("double->double")
class MyPsi(object):
def __init__(self):
# add line below if and only if protobuf is a dependency
sys.setdlopenflags(10)
# add extracted package path into sys.path
sys.path.insert(0, "work/scipy-bundle.tar.gz/packages")
def evaluate(self, arg0):
# keep import statements inside evaluate function body
from scipy.special import psi
return float(psi(arg0))
Some explanations of the code above:
When protobuf is a dependency, you need to add sys.setdlopenflags(10) to the __init__ method. pyodps-pack will notify you when you need to do this. Adding this line avoids conflicts between different versions of the binaries of your libraries and of MaxCompute itself.
In the __init__ method, work/scipy-bundle.tar.gz/packages is inserted into sys.path, as MaxCompute extracts all archive resources referenced by the UDF into the work directory, while packages is the subdirectory created by pyodps-pack when packing your dependencies. If you need to load dynamically-linked libraries packed with --dynlib via LoadLibrary, that code can also be added here.
The import statement for scipy is put inside the body of the evaluate method because third-party libraries are only available when the UDF is being executed. When the UDF is being resolved by the MaxCompute service, the packages are not yet available, and import statements for them outside method bodies will cause errors.
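The path manipulation done in __init__ can be factored into a small helper; a sketch (the helper name is ours, not a PyODPS API) that mirrors what the UDF does:

```python
import sys

def add_bundle_to_path(archive_name, subdir="packages"):
    """Prepend the extracted bundle directory to sys.path.

    MaxCompute extracts archive resources under the ``work`` directory,
    and pyodps-pack places dependencies in the ``packages`` subdirectory.
    """
    bundle_path = "work/%s/%s" % (archive_name, subdir)
    if bundle_path not in sys.path:
        sys.path.insert(0, bundle_path)
    return bundle_path
```

Inside the UDF's __init__, the sys.path line is then equivalent to calling add_bundle_to_path("scipy-bundle.tar.gz").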
Then upload test_psi_udf.py as a MaxCompute Python resource and scipy-bundle.tar.gz as an archive resource. After that, create a Python UDF named test_psi_udf that references both resource files, with the class name test_psi_udf.MyPsi.
The code to accomplish the steps above with PyODPS is shown below.
import os
from odps import ODPS
# Make sure environment variable ALIBABA_CLOUD_ACCESS_KEY_ID already set to Access Key ID of user
# while environment variable ALIBABA_CLOUD_ACCESS_KEY_SECRET set to Access Key Secret of user.
# Not recommended to hardcode Access Key ID or Access Key Secret in your code.
o = ODPS(
os.getenv('ALIBABA_CLOUD_ACCESS_KEY_ID'),
os.getenv('ALIBABA_CLOUD_ACCESS_KEY_SECRET'),
project='**your-project**',
endpoint='**your-endpoint**',
)
bundle_res = o.create_resource(
"scipy-bundle.tar.gz", "archive", fileobj=open("scipy-bundle.tar.gz", "rb")
)
udf_res = o.create_resource(
"test_psi_udf.py", "py", fileobj=open("test_psi_udf.py", "rb")
)
o.create_function(
"test_psi_udf", class_type="test_psi_udf.MyPsi", resources=[bundle_res, udf_res]
)
If you want to use the MaxCompute console to accomplish these steps, you may type the commands below.
add archive scipy-bundle.tar.gz;
add py test_psi_udf.py;
create function test_psi_udf as test_psi_udf.MyPsi using test_psi_udf.py,scipy-bundle.tar.gz;
After that, you can call the UDF you just created with SQL.
set odps.pypy.enabled=false;
set odps.isolation.session.enable=true;
select test_psi_udf(sepal_length) from iris;
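The same flags can also be supplied from PyODPS when running the query; a sketch (the execute_sql call is commented out because it needs a configured ODPS entry object, assumed to be named o):

```python
# session-level flags required for using third-party packages in UDFs
hints = {
    "odps.pypy.enabled": "false",
    "odps.isolation.session.enable": "true",
}
sql = "select test_psi_udf(sepal_length) from iris"

# with o as an initialized ODPS entry object:
# instance = o.execute_sql(sql, hints=hints)
```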
Use third-party libraries in PyODPS DataFrame
PyODPS DataFrame supports using the third-party libraries created above by adding a libraries argument when calling methods like execute or persist. We take the map method as an example; the same procedure works for the apply or map_reduce method.
First, create a package for scipy with the command below.
pyodps-pack -o scipy-bundle.tar.gz scipy
Assume the table is named test_float_col and contains only one column with float values.
col1
0 3.75
1 2.51
Write the code below to compute the value of psi(col1).
import os
from odps import ODPS, options
def psi(v):
from scipy.special import psi
return float(psi(v))
# If isolation is enabled in Project, option below is not compulsory.
options.sql.settings = {"odps.isolation.session.enable": True}
# Make sure environment variable ALIBABA_CLOUD_ACCESS_KEY_ID already set to Access Key ID of user
# while environment variable ALIBABA_CLOUD_ACCESS_KEY_SECRET set to Access Key Secret of user.
# Not recommended to hardcode Access Key ID or Access Key Secret in your code.
o = ODPS(
os.getenv('ALIBABA_CLOUD_ACCESS_KEY_ID'),
os.getenv('ALIBABA_CLOUD_ACCESS_KEY_SECRET'),
project='**your-project**',
endpoint='**your-endpoint**',
)
df = o.get_table("test_float_col").to_df()
# Execute directly and fetch result
df.col1.map(psi).execute(libraries=["scipy-bundle.tar.gz"])
# Store to another table
df.col1.map(psi).persist("result_table", libraries=["scipy-bundle.tar.gz"])
If you want to use the same third-party packages every time, you can configure them globally:
from odps import options
options.df.libraries = ["scipy-bundle.tar.gz"]
After that, you can use these third-party libraries when DataFrames are being executed.
Use third-party libraries in DataWorks
PyODPS nodes in DataWorks come with several third-party libraries preinstalled. The load_resource_package method is also provided to load packages that are not preinstalled. Details of its usage can be seen here.
Upload and use third-party libraries manually
Note
The documentation below is only a reference for the maintenance of legacy projects or projects in legacy environments. For new projects, please use pyodps-pack directly.
Some legacy projects might use the old-style method to deploy and use third-party libraries, i.e., manually uploading all dependent wheel packages and referencing them in code. Other projects are deployed in legacy MaxCompute environments that do not support binary wheel packages. This chapter is written for these scenarios. Take the python-dateutil package as an example.
First, use the pip download command to download the package and its dependencies to a specific path. Two packages are downloaded: six-1.10.0-py2.py3-none-any.whl and python_dateutil-2.5.3-py2.py3-none-any.whl. Note that the packages must support the Linux environment, so it is recommended to run this command under Linux.
pip download python-dateutil -d /to/path/
Then upload the files to MaxCompute as resources.
>>> # make sure that file extensions are correct
>>> odps.create_resource('six.whl', 'file', file_obj=open('six-1.10.0-py2.py3-none-any.whl', 'rb'))
>>> odps.create_resource('python_dateutil.whl', 'file', file_obj=open('python_dateutil-2.5.3-py2.py3-none-any.whl', 'rb'))
Now you have a DataFrame object that only contains a string field.
>>> df
datestr
0 2016-08-26 14:03:29
1 2015-08-26 14:03:29
Set the third-party library as global:
>>> from odps import options
>>>
>>> def get_year(t):
>>> from dateutil.parser import parse
>>> return parse(t).strftime('%Y')
>>>
>>> options.df.libraries = ['six.whl', 'python_dateutil.whl']
>>> df.datestr.map(get_year)
datestr
0 2016
1 2015
Or use the libraries attribute of an action to specify the packages:
>>> def get_year(t):
>>> from dateutil.parser import parse
>>> return parse(t).strftime('%Y')
>>>
>>> df.datestr.map(get_year).execute(libraries=['six.whl', 'python_dateutil.whl'])
datestr
0 2016
1 2015
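Before uploading, you can sanity-check the mapper logic locally. A stdlib-only sketch (the real mapper uses dateutil.parser.parse, which accepts more formats than strptime):

```python
from datetime import datetime

def get_year_local(t):
    """Local stand-in for the UDF mapper: extract the year from a timestamp."""
    return datetime.strptime(t, "%Y-%m-%d %H:%M:%S").strftime("%Y")

print(get_year_local("2016-08-26 14:03:29"))  # -> 2016
```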
By default, PyODPS supports third-party libraries that contain pure Python code and no file operations. In newer versions of MaxCompute, PyODPS also supports Python libraries that contain binary code or file operations. These libraries must carry certain suffixes, which are listed in the table below.
Platform      | Python version | Suffixes available
RHEL 5 x86_64 | Python 2.7     | cp27-cp27m-manylinux1_x86_64
RHEL 5 x86_64 | Python 3.7     | cp37-cp37m-manylinux1_x86_64
RHEL 7 x86_64 | Python 2.7     | cp27-cp27m-manylinux1_x86_64, cp27-cp27m-manylinux2010_x86_64, cp27-cp27m-manylinux2014_x86_64
RHEL 7 x86_64 | Python 3.7     | cp37-cp37m-manylinux1_x86_64, cp37-cp37m-manylinux2010_x86_64, cp37-cp37m-manylinux2014_x86_64
RHEL 7 ARM64  | Python 3.7     | cp37-cp37m-manylinux2014_aarch64
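To check whether a downloaded wheel carries a usable suffix, you can parse its file name; a sketch (the helper is our own, based on the standard wheel naming convention {dist}-{version}[-{build}]-{python}-{abi}-{platform}.whl):

```python
def wheel_tag(wheel_name):
    """Return the {python}-{abi}-{platform} tag of a wheel file name."""
    stem = wheel_name[:-len(".whl")]
    return "-".join(stem.split("-")[-3:])

# suffixes accepted on RHEL 7 x86_64 / Python 3.7, from the table above
ALLOWED = {
    "cp37-cp37m-manylinux1_x86_64",
    "cp37-cp37m-manylinux2010_x86_64",
    "cp37-cp37m-manylinux2014_x86_64",
}

tag = wheel_tag("scipy-0.19.0-cp27-cp27m-manylinux1_x86_64.whl")
print(tag)             # -> cp27-cp27m-manylinux1_x86_64
print(tag in ALLOWED)  # -> False (a cp27 wheel does not match the Python 3.7 row)
```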
The .whl packages need to be uploaded in the archive format, which means they must first be renamed to .zip files. You also need to enable the odps.isolation.session.enable option or enable isolation in your project. The following example demonstrates how to upload and use the special functions in scipy:
>>> # packages containing binaries should be uploaded with archive method,
>>> # replacing extension .whl with .zip.
>>> odps.create_resource('scipy.zip', 'archive', file_obj=open('scipy-0.19.0-cp27-cp27m-manylinux1_x86_64.whl', 'rb'))
>>>
>>> # if your project has already been configured with isolation, the line below can be omitted
>>> options.sql.settings = { 'odps.isolation.session.enable': True }
>>>
>>> def psi(value):
>>> # it is recommended to import third-party libraries inside your function
>>> # in case that structures of the same package differ between different systems.
>>> from scipy.special import psi
>>> return float(psi(value))
>>>
>>> df.float_col.map(psi).execute(libraries=['scipy.zip'])
Packages that contain only source code can be built into .whl files under a Linux shell and then uploaded. Note that .whl files generated on Mac or Windows are not usable in MaxCompute:
python setup.py bdist_wheel