Create and use third-party libraries¶
Create third-party libraries¶
Since version 0.11.3, PyODPS provides a pip-like command line tool, pyodps-pack, for creating bundles of third-party libraries that can be used in PyODPS and DataWorks nodes. It packs all your dependencies into a .tar.gz archive built against the Python environments in MaxCompute or DataWorks. It can also pack Python packages you have created yourself.
Prerequisites¶
Docker mode¶
You need to install Docker to run pyodps-pack correctly in Docker mode. Linux users can install Docker by following the official documentation. For personal macOS or Windows users, Docker Desktop can be used. Enterprise users without a commercial license for Docker Desktop may use Rancher Desktop instead. We have not tested other tools that provide Docker environments, such as minikube, and cannot guarantee that pyodps-pack works with them.
If you want to create packages for legacy MaxCompute or DataWorks deployments in private clouds, you can use the --legacy-image option. On Windows, macOS, or Linux with certain kernel versions, this option may report errors; in that case, take a look at this article for solutions.
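For example, the command below packs pandas with the legacy image enabled:
pyodps-pack --legacy-image pandas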
For Windows users, your Docker service may depend on the Server service of the Windows system, which is often disabled by company policy. In that case, please create packages under Linux or try starting the service. Rancher Desktop is known to misbehave when containerd is used as the container engine; switch to dockerd instead. Details about switching container engines can be found in this article.
If your MaxCompute or DataWorks service is deployed on the Arm64 architecture, you need to add an extra --arch aarch64 argument to specify the target architecture of the package. Components for cross-architecture packaging, such as binfmt, are usually already included in Docker Desktop or Rancher Desktop. You can also run the command below to install the related emulation support manually.
docker run --privileged --rm tonistiigi/binfmt --install arm64
This command requires Linux kernel 4.8 or later. Details of the command can be found in this article.
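For example, the command below packs pandas for an Arm64 environment:
pyodps-pack --arch aarch64 pandas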
Non-Docker mode¶
Note
We recommend using Docker mode to create packages whenever possible. Use non-Docker mode only when Docker is not available, as it may produce malfunctioning packages.
If you have problems installing Docker, you can try non-Docker mode by adding the --without-docker argument. Non-Docker mode requires pip to be available in your Python installation. Windows users need Git Bash, which is included in Git for Windows, to use non-Docker mode.
Pack all dependencies¶
Note
We recommend using Python 3 for new projects. We do not guarantee that the methods below work for Python 2. Please try to migrate your legacy projects to Python 3 to reduce maintenance difficulties in the future.
Please add sudo when calling pyodps-pack on Linux to make sure Docker is invoked correctly.
After PyODPS is installed, you can use the command below to pack pandas and all its dependencies.
pyodps-pack pandas
If you want to use non-Docker mode to pack, you can use
pyodps-pack --without-docker pandas
If you need to specify the version of pandas, you may use
pyodps-pack pandas==1.2.5
After the packing process finishes, the utility shows the versions of the packed packages.
Package Version
--------------- -------
numpy 1.21.6
pandas 1.2.5
python-dateutil 2.8.2
pytz 2022.6
six 1.16.0
and generates a packages.tar.gz archive containing all the dependencies listed above.
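If you want to check what has been packed, you can list the contents of the archive. The dependencies are placed under a packages subdirectory inside the archive; the exact file list depends on the packages you packed.
tar tzf packages.tar.gz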
If you need to create packages for Python 2.7, please check whether your package will be used in MaxCompute or DataWorks. If you are not sure which environment you are using, take a look at this article. If you want to use Python 2.7 packages in MaxCompute, you can use the command below.
pyodps-pack --mcpy27 pandas
If you want to use Python 2.7 packages in DataWorks, you can use the command below.
pyodps-pack --dwpy27 pandas
Pack custom source code¶
pyodps-pack supports packing user-defined source code organized with setup.py or pyproject.toml. If you want to know how to build Python packages with these files, you can take a look at this link for more information.
We show how to pack custom code by creating a package with pyproject.toml and packing it with pyodps-pack. Assume that the directory structure of the project looks like this:
test_package_root
├── test_package
│   ├── __init__.py
│   ├── mod1.py
│   └── subpackage
│       ├── __init__.py
│       └── mod2.py
└── pyproject.toml
while the content of pyproject.toml is:
[project]
name = "test_package"
description = "pyodps-pack example package"
version = "0.1.0"
dependencies = [
"pandas>=1.0.5"
]
After developing the package, we can pack it and all its dependencies into packages.tar.gz with the command below, where <path_to_package> is the parent directory of test_package_root.
pyodps-pack /<path_to_package>/test_package_root
Pack code in a Git repository¶
pyodps-pack supports packing remote Git repositories. We take the PyODPS repository as an example to show how to pack one.
pyodps-pack git+https://github.com/aliyun/aliyun-odps-python-sdk.git
If you want to pack a certain branch or tag, you may use
pyodps-pack git+https://github.com/aliyun/aliyun-odps-python-sdk.git@v0.11.2.2
If extra dependencies are needed at build time, for instance cython, you can use the --install-requires argument to specify them. You may also create a text file, install-requires.txt, whose format is similar to that of requirements.txt, and reference it with --install-requires-file. For instance, if you need Cython installed before packing PyODPS, you can call
pyodps-pack \
--install-requires cython \
git+https://github.com/aliyun/aliyun-odps-python-sdk.git@v0.11.2.2
It is also possible to write an install-requires.txt file with the content
cython>0.29
and the pack command can be written as
pyodps-pack \
--install-requires-file install-requires.txt \
git+https://github.com/aliyun/aliyun-odps-python-sdk.git@v0.11.2.2
A more complicated case: adding binary dependencies¶
Some third-party libraries depend on extra binary libraries, for instance, dynamic-link libraries that need to be built and installed. You can use the --run-before argument of pyodps-pack to specify a bash script that installs these binary dependencies. We take the geospatial library GDAL as an example to show how to pack this kind of package.
First, we need to find out which binary dependencies to install. According to the documentation of GDAL 3.6.0.1 on PyPI, we need to install libgdal >= 3.6.0, and the build hints of GDAL show that it also depends on PROJ >= 6.0. Both dependencies can be built with CMake. Thus we write a bash script, install-gdal.sh, to install these dependencies.
#!/bin/bash
set -e
cd /tmp
curl -o proj-6.3.2.tar.gz https://download.osgeo.org/proj/proj-6.3.2.tar.gz
tar xzf proj-6.3.2.tar.gz
cd proj-6.3.2
mkdir build && cd build
cmake ..
cmake --build .
cmake --build . --target install
cd /tmp
curl -o gdal-3.6.0.tar.gz http://download.osgeo.org/gdal/3.6.0/gdal-3.6.0.tar.gz
tar xzf gdal-3.6.0.tar.gz
cd gdal-3.6.0
mkdir build && cd build
cmake ..
cmake --build .
cmake --build . --target install
Then use pyodps-pack to pack the GDAL Python library.
pyodps-pack --install-requires oldest-supported-numpy --run-before install-gdal.sh gdal==3.6.0.1
Command details¶
The arguments of pyodps-pack are listed below:
-r, --requirement <file>
    Pack according to the given requirement file. Can be used multiple times.
-o, --output <file>
    Specify the file name of the target package, packages.tar.gz by default.
--install-requires <item>
    Specify a build-time requirement, which might not be included in the final package.
--install-requires-file <file>
    Specify a file listing build-time requirements, which might not be included in the final package.
--run-before <script-file>
    Specify the name of a bash script to run before packing; it can be used to install binary dependencies.
-X, --exclude <dependency>
    Specify dependencies to exclude from the final package. Can be specified multiple times.
--no-deps
    If specified, dependencies of the specified requirements are not included.
-i, --index-url <index-url>
    Specify the base URL of the PyPI index. If absent, the global.index-url value shown by the pip config list command is used by default.
--trusted-host <host>
    Specify hosts whose certificates are trusted when PyPI URLs use HTTPS.
-l, --legacy-image
    If specified, uses a CentOS 5 image to pack, making the final package usable in old environments such as legacy private clouds.
--mcpy27
    If specified, builds packages for Python 2.7 on MaxCompute and assumes --legacy-image is enabled.
--dwpy27
    If specified, builds packages for Python 2.7 on DataWorks and assumes --legacy-image is enabled.
--prefer-binary
    If specified, prefers older binary packages over newer source packages.
--arch <architecture>
    Specify the hardware architecture for the package. Currently only x86_64 and aarch64 (or equivalently arm64) are supported; x86_64 by default. If you are not running your code inside a proprietary cloud, do not add this argument.
--python-version <version>
    Specify the Python version for the package. You may use 3.6 or 36 to stand for Python 3.6. If you are not running your code inside a proprietary cloud, do not add this argument.
--docker-args <args>
    Specify extra arguments for the Docker command. If there is more than one argument, put them within quotation marks, for instance, --docker-args "--ip 192.168.1.10".
--without-docker
    Run pyodps-pack in non-Docker mode. You might receive errors or get malfunctioning packages in this mode when there are binary dependencies.
--without-merge
    Skip building the .tar.gz package and keep .whl files after downloading or building Python wheels.
--debug
    If specified, outputs details while executing the command. This argument is for debugging purposes.
You can also specify environment variables to control the build.
BEFORE_BUILD="command before build"
    Specify commands to run before the build.
AFTER_BUILD="command after build"
    Specify commands to run after the tar package is created.
DOCKER_IMAGE="quay.io/pypa/manylinux2010_x86_64"
    Customize the Docker image to use. It is recommended to build custom Docker images based on pypa/manylinux images.
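For instance, to build with a customized image, you can set the variable in the shell that invokes pyodps-pack. The sketch below uses the example image name from above:
export DOCKER_IMAGE="quay.io/pypa/manylinux2010_x86_64"
pyodps-pack pandas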
Use third-party libraries¶
Upload third-party libraries¶
Please make sure your packages are uploaded as MaxCompute resources of the archive type. To upload a resource, you can use the code below. Note that you need to replace packages.tar.gz with the path to your package.
from odps import ODPS
o = ODPS("<access_id>", "<secret_access_key>", "<project_name>", "<endpoint>")
o.create_resource("test_packed.tar.gz", "archive", fileobj=open("packages.tar.gz", "rb"))
You can also upload packages with DataWorks by following the steps below.
1. Go to the DataStudio page.
   - Log on to the DataWorks console.
   - In the top navigation bar, select the region where your workspace resides.
   - Find the workspace and click Data Analytics in the Actions column.
2. On the Data Analytics tab, move the pointer over the Create icon and choose MaxCompute > Resource > Archive. Alternatively, click the required workflow in the Business Flow section, right-click MaxCompute, and then choose Create > Resource > Archive.
3. In the Create Resource dialog box, set the Resource Name and Location parameters.
4. Click Upload and select the file that you want to upload.
5. Click Create.
6. Click the Submit icon in the top toolbar to commit the resource to the development environment.
More details can be seen in this article.
Use third-party libraries in Python UDFs¶
You need to modify your UDF code to use the uploaded packages: add references to your packages in the __init__ method of your UDF class, and then use them in your UDF code, for instance, in the evaluate or process method. We take the psi function in scipy as an example to show how to use third-party libraries in a Python UDF. First, pack the dependencies with the command below:
pyodps-pack -o scipy-bundle.tar.gz scipy
Then write the code below and save it as test_psi_udf.py.
import sys
from odps.udf import annotate
@annotate("double->double")
class MyPsi(object):
def __init__(self):
# add extracted package path into sys.path
sys.path.insert(0, "work/scipy-bundle.tar.gz/packages")
def evaluate(self, arg0):
# keep import statements inside evaluate function body
from scipy.special import psi
return float(psi(arg0))
Some explanation of the code above: in the __init__ method, work/scipy-bundle.tar.gz/packages is inserted into sys.path, because MaxCompute extracts every archive resource referenced by the UDF into the work directory, and packages is the subdirectory created by pyodps-pack when packing your dependencies. The import statement for scipy is placed inside the body of evaluate because third-party libraries are only available when the UDF is being executed; when the UDF is being resolved by the MaxCompute service, the packages are not available, and import statements for them outside method bodies would cause errors.
Then upload test_psi_udf.py as a MaxCompute Python resource and scipy-bundle.tar.gz as an archive resource. After that, create a Python UDF named test_psi_udf that references both resource files and uses test_psi_udf.MyPsi as its class name.
The code to accomplish these steps with PyODPS is shown below.
from odps import ODPS
o = ODPS("<access_id>", "<secret_access_key>", "<project_name>", "<endpoint>")
bundle_res = o.create_resource(
"scipy-bundle.tar.gz", "archive", fileobj=open("scipy-bundle.tar.gz", "rb")
)
udf_res = o.create_resource(
"test_psi_udf.py", "py", fileobj=open("test_psi_udf.py", "rb")
)
o.create_function(
"test_psi_udf", class_type="test_psi_udf.MyPsi", resources=[bundle_res, udf_res]
)
If you want to use the MaxCompute console to accomplish these steps, you can type the commands below.
add archive scipy-bundle.tar.gz;
add py test_psi_udf.py;
create function test_psi_udf as test_psi_udf.MyPsi using test_psi_udf.py,scipy-bundle.tar.gz;
After that, you can call the UDF you just created with SQL.
set odps.pypy.enabled=false;
set odps.isolation.session.enable=true;
select test_psi_udf(sepal_length) from iris;
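If you prefer submitting the statement through PyODPS, a minimal sketch is shown below. It passes the settings above as hints and assumes the iris table from the query exists in your project.
from odps import ODPS
o = ODPS("<access_id>", "<secret_access_key>", "<project_name>", "<endpoint>")
# pass the flags above as hints so the UDF can load the archive resource
hints = {"odps.pypy.enabled": False, "odps.isolation.session.enable": True}
instance = o.execute_sql("select test_psi_udf(sepal_length) from iris", hints=hints)
with instance.open_reader() as reader:
    for record in reader:
        print(record[0])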
Use third-party libraries in PyODPS DataFrame¶
PyODPS DataFrame supports using the third-party libraries created above by adding a libraries argument when calling methods like execute or persist. We take the map method as an example; the same procedure applies to the apply and map_reduce methods.
First, create a package for scipy with the command below.
pyodps-pack -o scipy-bundle.tar.gz scipy
Assume that the table is named test_float_col and contains only one column, col1, with float values. Write the code below to compute psi(col1).
from odps import ODPS, options
def psi(v):
from scipy.special import psi
return float(psi(v))
# If isolation is enabled for the project, the option below is not required.
options.sql.settings = {"odps.isolation.session.enable": True}
o = ODPS("<access_id>", "<secret_access_key>", "<project_name>", "<endpoint>")
df = o.get_table("test_float_col").to_df()
# Execute directly and fetch result
df.col1.map(psi).execute(libraries=["scipy-bundle.tar.gz"])
# Store to another table
df.col1.map(psi).persist("result_table", libraries=["scipy-bundle.tar.gz"])
If you want to use the same third-party packages repeatedly, you can configure them globally:
from odps import options
options.df.libraries = ["scipy-bundle.tar.gz"]
After that, you can use these third-party libraries when DataFrames are being executed.
Upload and use third-party libraries manually¶
Note
The documentation below is kept only as a reference for maintaining legacy projects or projects in legacy environments. For new projects, please use pyodps-pack directly.
Some legacy projects might use the old-style method of deploying and using third-party libraries, that is, manually uploading all dependent wheel packages and referencing them in code. Other projects are deployed in legacy MaxCompute environments that do not support binary wheel packages. This section is written for these scenarios. We take the python-dateutil package as an example.
First, use the pip download command to download the package and its dependencies to a specific path. Two packages are downloaded: six-1.10.0-py2.py3-none-any.whl and python_dateutil-2.5.3-py2.py3-none-any.whl. Note that the packages must support the Linux environment, so it is recommended to run this command under Linux.
pip download python-dateutil -d /to/path/
Then upload the files to MaxCompute as resources.
>>> # make sure that file extensions are correct
>>> odps.create_resource('six.whl', 'file', file_obj=open('six-1.10.0-py2.py3-none-any.whl', 'rb'))
>>> odps.create_resource('python_dateutil.whl', 'file', file_obj=open('python_dateutil-2.5.3-py2.py3-none-any.whl', 'rb'))
Now you have a DataFrame object that only contains a string field.
>>> df
datestr
0 2016-08-26 14:03:29
1 2015-08-26 14:03:29
Set the third-party library as global:
>>> from odps import options
>>>
>>> def get_year(t):
>>> from dateutil.parser import parse
>>> return parse(t).strftime('%Y')
>>>
>>> options.df.libraries = ['six.whl', 'python_dateutil.whl']
>>> df.datestr.map(get_year)
datestr
0 2016
1 2015
Or use the libraries argument of an action to specify the packages:
>>> def get_year(t):
>>> from dateutil.parser import parse
>>> return parse(t).strftime('%Y')
>>>
>>> df.datestr.map(get_year).execute(libraries=['six.whl', 'python_dateutil.whl'])
datestr
0 2016
1 2015
By default, PyODPS supports third-party libraries that contain pure Python code and no file operations. In newer versions of MaxCompute, PyODPS also supports Python libraries that contain binary code or file operations. These libraries must have file names ending with certain suffixes, which are listed in the table below.
Platform      | Python version | Suffixes available
RHEL 5 x86_64 | Python 2.7     | cp27-cp27m-manylinux1_x86_64
RHEL 5 x86_64 | Python 3.7     | cp37-cp37m-manylinux1_x86_64
RHEL 7 x86_64 | Python 2.7     | cp27-cp27m-manylinux1_x86_64, cp27-cp27m-manylinux2010_x86_64, cp27-cp27m-manylinux2014_x86_64
RHEL 7 x86_64 | Python 3.7     | cp37-cp37m-manylinux1_x86_64, cp37-cp37m-manylinux2010_x86_64, cp37-cp37m-manylinux2014_x86_64
RHEL 7 Arm64  | Python 3.7     | cp37-cp37m-manylinux2014_aarch64
All .whl packages need to be uploaded as archive resources, and the .whl files must be renamed to .zip first. You also need to enable the odps.isolation.session.enable option or enable isolation for your project. The following example demonstrates how to upload and use the special functions in scipy:
>>> # packages containing binaries should be uploaded with archive method,
>>> # replacing extension .whl with .zip.
>>> odps.create_resource('scipy.zip', 'archive', file_obj=open('scipy-0.19.0-cp27-cp27m-manylinux1_x86_64.whl', 'rb'))
>>>
>>> # if your project has already been configured with isolation, the line below can be omitted
>>> options.sql.settings = { 'odps.isolation.session.enable': True }
>>>
>>> def psi(value):
>>> # it is recommended to import third-party libraries inside your function
>>> # in case that structures of the same package differ between different systems.
>>> from scipy.special import psi
>>> return float(psi(value))
>>>
>>> df.float_col.map(psi).execute(libraries=['scipy.zip'])
For binary packages that are distributed only as source code, you can build them into .whl files in a Linux shell and then upload them. .whl files generated on macOS or Windows are not usable in MaxCompute:
python setup.py bdist_wheel