Frequently asked questions
How to look for the version of PyODPS you are using
import odps
print(odps.__version__)
Installation failure/error
For more information, see PyODPS installation FAQ (Chinese version only) .
Project not found error
This error is caused by an error in the configuration of Endpoint. For more information, see MaxCompute activation and service connections by region . Check to see if the ODPS object parameter position is correct.
How to specify tunnel endpoint manually
You can create your MaxCompute (ODPS) entrance object with an extra `tunnel_endpoint` parameter, as shown in the following code. Asterisks should be removed.
import os
from odps import ODPS
# Make sure environment variable ALIBABA_CLOUD_ACCESS_KEY_ID already set to Access Key ID of user
# while environment variable ALIBABA_CLOUD_ACCESS_KEY_SECRET set to Access Key Secret of user.
# Not recommended to hardcode Access Key ID or Access Key Secret in your code.
o = ODPS(
os.getenv('ALIBABA_CLOUD_ACCESS_KEY_ID'),
os.getenv('ALIBABA_CLOUD_ACCESS_KEY_SECRET'),
project='**your-project**',
endpoint='**your-endpoint**',
tunnel_endpoint='**your-tunnel-endpoint**',
)
How to configure execution options in SQL or DataFrame
You can find a list of options for MaxCompute SQL here .These settings can be configured at options.sql.settings. For instance,
from odps import options
# replace <option_name> and <option_value> with true option names and values
options.sql.settings = {'<option_name>': '<option_value>'}
You may also configure these options at every execution, which will override the global ones.
When you are using
odps.execute_sql, you can configure these options viafrom odps import options # replace <option_name> and <option_value> with true option names and values o.execute_sql('<sql_statement>', hints={'<option_name>': '<option_value>'})
When using
DataFrame.executeorDataFrame.persist, you can configure these options viafrom odps import options # replace <option_name> and <option_value> with true option names and values df.persist('<table_name>', hints={'<option_name>': '<option_value>'})
An error occurred while reading data: “project is protected”. How can I deal with this error?
The project security policy disables reading data from tables. To retrieve all the data, you can apply the following solutions:
Contact the Project Owner to add exceptions.
Use DataWorks or other masking tool to mask the data and export the data as an unprotected project before reading.
To retrieve part of the data, you can apply the following solutions:
Use
o.execute_sql('select * from <table_name>').open_reader()Use DataFrame,
o.get_table('<table_name>').to_df()
An error occurred while using IPython and Jupyter: ImportError. How can I deal with this error?
If running from odps import errors does not fix the error, you need to execute pip install -U jupyter to install the ipython component.
I can only retrieve a maximum of 10,000 items of data by executing SQL command open_reader. How can I retrieve more than 10,000 items of data?
Use create table as select ... to save the SQL execution result to a table, and use table.open_reader to read data.
How can I diagnose the cause when SQL execution is slow?
PyODPS does not perform heavy operation before submitting a SQL task. Hence in most of the cases, the reason tasks are becoming slow is not related to PyODPS itself. You can consider investigating the issue in aspects below.
If the proxy server or network chain has long latency.
If the task is queued or delayed at the server end.
If your code includes data downloading, whether the data size is too large or the data is split into too many chunks.
If your cude is executed with DataWorks, make sure that if there are tasks submitted with
run_sqlorexecute_sqlbut without printted logview addresses expecially when PyODPS < 0.11.6.
如果你需要确认提交任务变慢是否由本地环境造成,可以尝试开启调试日志。PyODPS 将会把每个请求及返回都打印出来,可以根据请求和返回的日志确定延迟发生的位置。
import datetime
import logging
from odps import ODPS
logging.basicConfig(level=logging.DEBUG, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
o = ODPS(...) # fill your account here. skip if MaxCompute Entry is already provided by the environment.
# print local time to determine the initial time of operation
print("Check time:", datetime.datetime.now())
# submit job
inst = o.run_sql("select * from your_table")
Your standard output should look like
Check time: 2025-01-24 15:34:21.531330
2025-01-24 15:34:21,532 - odps.rest - DEBUG - Start request.
2025-01-24 15:34:21,532 - odps.rest - DEBUG - POST: http://service.<region>.maxcompute.aliyun.com/api/projects/<project>/instances
2025-01-24 15:34:21,532 - odps.rest - DEBUG - data: b'<?xml version="1.0" encoding="utf-8"?>\n<Instance>\n <Job>\n <Priority>9</Priority>\n <Tasks>\n <SQL>\n ....
2025-01-24 15:34:21,532 - odps.rest - DEBUG - headers: {'Content-Type': 'application/xml'}
2025-01-24 15:34:21,533 - odps.rest - DEBUG - request url + params /api/projects/<project>/instances?curr_project=<project>
2025-01-24 15:34:21,533 - odps.accounts - DEBUG - headers before signing: {'Content-Type': 'application/xml', 'User-Agent': 'pyodps/0.12.2 CPython/3.7.12', 'Content-Length': '736'}
2025-01-24 15:34:21,533 - odps.accounts - DEBUG - headers to sign: OrderedDict([('content-md5', ''), ('content-type', 'application/xml'), ('date', 'Fri, 24 Jan 2025 07:34:21 GMT')])
2025-01-24 15:34:21,533 - odps.accounts - DEBUG - canonical string: POST
application/xml
Fri, 24 Jan 2025 07:34:21 GMT
/projects/maxframe_ci_cd/instances?curr_project=maxframe_ci_cd
2025-01-24 15:34:21,533 - odps.accounts - DEBUG - headers after signing: {'Content-Type': 'application/xml', 'User-Agent': 'pyodps/0.12.2 CPython/3.7.12', 'Content-Length': '736', ....
2025-01-24 15:34:21,533 - urllib3.connectionpool - DEBUG - Resetting dropped connection: service.<region>.maxcompute.aliyun.com
2025-01-24 15:34:22,027 - urllib3.connectionpool - DEBUG - http://service.<region>.maxcompute.aliyun.com:80 "POST /api/projects/<project>/instances?curr_project=<project> HTTP/1.1" 201 0
2025-01-24 15:34:22,027 - odps.rest - DEBUG - response.status_code 201
2025-01-24 15:34:22,027 - odps.rest - DEBUG - response.headers:
{'Server': '<Server>', 'Date': 'Fri, 24 Jan 2025 07:34:22 GMT', 'Content-Type': 'text/plain;charset=utf-8', 'Content-Length': '0', 'Connection': 'close', 'Location': ....
2025-01-24 15:34:22,027 - odps.rest - DEBUG - response.content: b''
从上面的输出中,可以知道代码启动任务的时间(2025-01-24 15:34:21,531)、请求发起时间(2025-01-24 15:34:21,533)以及服务端返回的时间(2025-01-24 15:34:22,027)获知各个阶段的时间开销。
如果你需要确认执行变慢是否是由拉取数据造成,可以先尝试将提交执行与拉取数据拆开,也就是说,使用 run_sql 提交任务,并使用 instance.wait_for_success 等待任务结束,再使用 instance.open_reader 读取数据。也就是说,将
with o.execute_sql('select * from your_table').open_reader() as reader:
for row in reader:
print(row)
into
inst = o.run_sql('select * from your_table')
inst.wait_for_success()
with inst.open_reader() as reader:
for row in reader:
print(row)
and then check the latency of every statement above.
An error occurred while uploading pandas DataFrame to MaxCompute ODPS: ODPSError: ODPS entrance should be provided. How can I deal with this error?
You need to set the ODPS object to global in one of the three following ways:
When using room mechanism ,
%enter, configure the global ODPS object.Call the
to_globalmethod when using the ODPS object.Use the MaxCompute parameter
DataFrame(pd_df).persist('your_table', odps=odps).
How can I use max_pt in DataFrame?
Use the odps.df.func module to call the built-in functions of MaxCompute.
from odps.df import func
df = o.get_table('your_table').to_df()
df[df.ds == func.max_pt('your_project.your_table')] # ds is a partition column
Error “table lifecycle is not specified in mandatory mode” occurred when persisting DataFrame to table
Your project requires that every table should be created with a lifecycle. Thus you should run the code below every time you run your own code.
from odps import options
options.lifecycle = 7 # or your expected lifecycle in days
Error “Please add put { “odps.sql.submit.mode” : “script”} for multi-statement query in settings” occurred when executing SQL scripts
Please read set runtime parameters for more information.
How to enumerate rows in PyODPS DataFrame
We do not support enumerating over every row in PyODPS DataFrame. As PyODPS DataFrame mainly focuses on handling huge amount of data, enumerating over every row means low efficiency and is discouraged. We recommend using `apply` or `map_reduce` methods of DataFrame to parallelize your enumerations. Details can be found in this article . If you are sure that your code cannot be parallelized using methods listed above, and the cost of enumeration is tolerable, you may use `to_pandas` to convert your DataFrame into Pandas, or persist your DataFrame into a MaxCompute table and read it via `read_table` method or table tunnel.
Why memory usage after calling to_pandas is significantly larger than the size of the table?
Two possible reasons might cause this issue. First, MaxCompute compresses table data, and the size you see is the size after compression. Second, variables are stored in Python with extra overhead. For instance, for every Python string, an overhead of approximately 40 bytes will be taken even if the string is empty. You may get the size by calling sys.getsizeof("").
Note that when using info or memory_usage of Pandas to calculate the size of your DataFrame might not be accurate, as these methods does not take string types or objects into account by default. To get sizes of DataFrames with more accuracy, df.memory_usage(deep=True).sum() might be used. Details can be seen in this Pandas document.
To reduce memory usage when reading data, you might try Arrow format. Details can be found here.