Frequently asked questions
How to check the version of PyODPS you are using
import odps
print(odps.__version__)
Installation failure/error
For more information, see PyODPS installation FAQ (Chinese version only).
Project not found error
This error is caused by a misconfigured endpoint. For more information, see MaxCompute activation and service connections by region. Also check that the arguments of the ODPS object are passed in the correct order.
How to manually specify the Tunnel endpoint
You can create your MaxCompute (ODPS) entrance object with an extra `tunnel_endpoint`
parameter, as shown in the following code. Replace the values wrapped in asterisks with your own, and remove the asterisks.
import os
from odps import ODPS
# Make sure the environment variable ALIBABA_CLOUD_ACCESS_KEY_ID is set to the user's
# Access Key ID, and ALIBABA_CLOUD_ACCESS_KEY_SECRET to the user's Access Key Secret.
# It is not recommended to hardcode the Access Key ID or Access Key Secret in your code.
o = ODPS(
os.getenv('ALIBABA_CLOUD_ACCESS_KEY_ID'),
os.getenv('ALIBABA_CLOUD_ACCESS_KEY_SECRET'),
project='**your-project**',
endpoint='**your-endpoint**',
tunnel_endpoint='**your-tunnel-endpoint**',
)
How to configure execution options in SQL or DataFrame
You can find a list of options for MaxCompute SQL here. These options can be configured globally via options.sql.settings. For instance,
from odps import options
# replace <option_name> and <option_value> with true option names and values
options.sql.settings = {'<option_name>': '<option_value>'}
You may also configure these options at every execution, which will override the global ones.
When you are using execute_sql, you can configure these options via the hints argument:
# replace <option_name> and <option_value> with true option names and values
o.execute_sql('<sql_statement>', hints={'<option_name>': '<option_value>'})
When using DataFrame.execute or DataFrame.persist, you can configure these options via the hints argument:
# replace <option_name> and <option_value> with true option names and values
df.persist('<table_name>', hints={'<option_name>': '<option_value>'})
An error occurred while reading data: “project is protected”. How can I deal with this error?
The project security policy disables reading data from tables. To retrieve all the data, you can apply the following solutions:
Contact the Project Owner to add exceptions.
Use DataWorks or another masking tool to mask the data, then export it to an unprotected project before reading.
To retrieve part of the data, you can apply the following solutions:
Use o.execute_sql('select * from <table_name>').open_reader()
Use DataFrame: o.get_table('<table_name>').to_df()
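For instance, a minimal sketch of the first approach; <table_name> is a placeholder and o is an existing ODPS entrance object:
# read part of the data through a SQL instance reader
with o.execute_sql('select * from <table_name>').open_reader() as reader:
    for record in reader:
        print(record)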
An error occurred while using IPython and Jupyter: ImportError. How can I deal with this error?
If running from odps import errors also fails, execute pip install -U jupyter to install the missing ipython components.
I can only retrieve a maximum of 10,000 records when calling open_reader on SQL execution results. How can I retrieve more than 10,000 records?
Use create table as select ... to save the SQL execution result to a table, then use table.open_reader to read the data.
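For instance, a minimal sketch of this workaround; table names are placeholders, and the lifecycle clause is only needed if your project mandates one:
# save the full result into a table, then read it without the record limit
o.execute_sql('create table <result_table> lifecycle 7 as select * from <source_table>')
with o.get_table('<result_table>').open_reader() as reader:
    for record in reader:
        pass  # process each record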
An error occurred while uploading pandas DataFrame to MaxCompute ODPS: ODPSError: ODPS entrance should be provided. How can I deal with this error?
You need to make the ODPS entrance object global in one of the three following ways:
Use the room mechanism (%enter) to configure the global ODPS object.
Call the to_global method of the ODPS entrance object.
Pass the entrance object via the odps parameter, as in DataFrame(pd_df).persist('your_table', odps=odps).
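A minimal sketch of the last two approaches, assuming o is an existing ODPS entrance object and pd_df is an existing pandas DataFrame:
from odps.df import DataFrame

o.to_global()  # make this entrance object the global default
DataFrame(pd_df).persist('your_table')

# alternatively, pass the entrance object explicitly
DataFrame(pd_df).persist('your_table', odps=o)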
How can I use max_pt in DataFrame?
Use the odps.df.func module to call the built-in functions of MaxCompute.
from odps.df import func
df = o.get_table('your_table').to_df()
df[df.ds == func.max_pt('your_project.your_table')] # ds is a partition column
Error “table lifecycle is not specified in mandatory mode” occurred when persisting DataFrame to table
Your project requires that every table be created with a lifecycle. Thus you need to run the code below each time before running your own code.
from odps import options
options.lifecycle = 7 # or your expected lifecycle in days
Error “Please add put { “odps.sql.submit.mode” : “script”} for multi-statement query in settings” occurred when executing SQL scripts
For more information, please read set runtime parameters.
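As the error message suggests, the setting can be passed via the hints argument. A minimal sketch, with the SQL script as a placeholder:
# enable script mode for a multi-statement query
o.execute_sql('<multi_statement_sql>', hints={'odps.sql.submit.mode': 'script'})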
How to enumerate rows in PyODPS DataFrame
We do not support enumerating over every row in a PyODPS DataFrame. As PyODPS DataFrame mainly focuses on handling huge amounts of data, enumerating over every row is inefficient and thus discouraged. We recommend using the `apply`
or `map_reduce`
methods of DataFrame to parallelize your enumeration. Details can be found in this article. If you are sure that your code cannot be parallelized using the methods listed above, and the cost of enumeration is tolerable, you may use `to_pandas`
to convert your DataFrame into a pandas DataFrame, or persist your DataFrame into a MaxCompute table and read it via the `read_table`
method or a table tunnel.
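A minimal sketch of the fallback approaches; 'your_table' is a placeholder and o is an existing ODPS entrance object:
# fallback 1: convert the DataFrame to pandas and enumerate locally
# (the full data must fit into memory)
pd_df = o.get_table('your_table').to_df().to_pandas()
for row in pd_df.itertuples():
    pass  # process each row locally

# fallback 2: stream records from a persisted table via read_table
for record in o.read_table('your_table'):
    pass  # process each record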
Why memory usage after calling to_pandas is significantly larger than the size of the table?
Two possible reasons might cause this issue. First, MaxCompute compresses table data, and the size you see is the size after compression. Second, Python stores objects with extra overhead. For instance, every Python string carries an overhead of approximately 40 bytes, even if the string is empty. You may check the size by calling sys.getsizeof("").
Note that using info
or memory_usage
of pandas to calculate the size of your DataFrame might not be accurate, as these methods do not take string types or objects into account by default. To measure DataFrame sizes more accurately, df.memory_usage(deep=True).sum()
might be used. Details can be found in this pandas document.
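For instance, a small sketch illustrating both measurements; pd_df here is a hypothetical pandas DataFrame:
import sys
import pandas as pd

# an empty Python string still carries interpreter overhead
print(sys.getsizeof(''))

pd_df = pd.DataFrame({'col': ['a', 'b', 'c']})
# deep=True also counts the memory held by object (string) columns
print(pd_df.memory_usage(deep=True).sum())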
To reduce memory usage when reading data, you might try Arrow format. Details can be found here.
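For example, a minimal sketch of reading in Arrow format, assuming a PyODPS version whose open_reader supports the arrow argument; the table name is a placeholder:
# read table data as Arrow batches and convert to pandas
with o.get_table('<table_name>').open_reader(arrow=True) as reader:
    pd_df = reader.to_pandas()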