Frequently asked questions

How to check the version of PyODPS you are using

import odps
print(odps.__version__)

Installation failure/error

For more information, see the PyODPS installation FAQ (Chinese version only).

Project not found error

This error is usually caused by an incorrect Endpoint configuration. For more information, see MaxCompute activation and service connections by region. Also check that the arguments passed to the ODPS object are in the correct positions.

How to manually specify the Tunnel Endpoint

You can create your MaxCompute (ODPS) entry object with an extra `tunnel_endpoint` parameter, as shown in the following code. Remove the asterisks when you fill in your own values.

from odps import ODPS

o = ODPS('**your-access-id**', '**your-secret-access-key**', '**your-default-project**',
         endpoint='**your-end-point**', tunnel_endpoint='**your-tunnel-endpoint**')

How to configure execution options in SQL or DataFrame

You can find a list of options for MaxCompute SQL here. These settings can be configured in options.sql.settings. For instance,

from odps import options
# replace <option_name> and <option_value> with actual option names and values
options.sql.settings = {'<option_name>': '<option_value>'}

You may also configure these options at every execution, which will override the global ones.

  • When you are using o.execute_sql, you can configure these options via the hints argument:

    # replace <option_name> and <option_value> with actual option names and values
    o.execute_sql('<sql_statement>', hints={'<option_name>': '<option_value>'})
  • When using DataFrame.execute or DataFrame.persist, you can configure these options via the hints argument:

    # replace <option_name> and <option_value> with actual option names and values
    df.persist('<table_name>', hints={'<option_name>': '<option_value>'})

An error occurred while reading data: “project is protected”. How can I deal with this error?

The project security policy disables reading data from tables. To retrieve all the data, you can apply the following solutions:

  • Contact the Project Owner to add exceptions.
  • Use DataWorks or another masking tool to mask the data and export it to an unprotected project before reading.

To retrieve part of the data, you can apply the following solutions:

  • Use o.execute_sql('select * from <table_name>').open_reader()
  • Use DataFrame, o.get_table('<table_name>').to_df()

An error occurred while using IPython and Jupyter: ImportError. How can I deal with this error?

If running from odps import errors also fails, the ipython component is missing. Execute pip install -U jupyter to install it.

I can only retrieve a maximum of 10,000 records when calling open_reader on the result of an SQL statement. How can I retrieve more than 10,000 records?

Use create table as select ... to save the SQL execution result to a table, and use table.open_reader to read data.
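The approach above can be sketched as a small helper. This is only an illustration: the helper name read_full_result and the table name passed as tmp_table are hypothetical, and o is assumed to be an already configured ODPS entry object. Only execute_sql, get_table, and open_reader are PyODPS calls.

```python
def read_full_result(o, query, tmp_table):
    """Save the result of `query` into `tmp_table`, then yield all its records.

    `o` is assumed to be an odps.ODPS entry object. The helper name and
    `tmp_table` are illustrative, not part of PyODPS.
    """
    # Materialize the query result into a table instead of reading it
    # directly from the SQL instance.
    o.execute_sql("CREATE TABLE {0} AS {1}".format(tmp_table, query))
    # Read the records back through the table reader.
    with o.get_table(tmp_table).open_reader() as reader:
        for record in reader:
            yield record
```

A call might look like `for record in read_full_result(o, 'SELECT * FROM src_table', 'tmp_result'): ...`. If your project enforces table lifecycles, the CREATE TABLE statement would also need a lifecycle clause.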

An error occurred while uploading a pandas DataFrame to MaxCompute (ODPS): ODPSError: ODPS entrance should be provided. How can I deal with this error?

You need to set the ODPS object to global in one of the three following ways:

  • Use the room mechanism, %enter, which configures the global ODPS object.
  • Call the to_global method on the ODPS object.
  • Pass the ODPS object through the odps parameter: DataFrame(pd_df).persist('your_table', odps=odps).

How can I use max_pt in DataFrame?

Use the odps.df.func module to call the built-in functions of MaxCompute.

from odps.df import func
df = o.get_table('your_table').to_df()
df[df.ds == func.max_pt('your_project.your_table')]  # ds is a partition column

Error “table lifecycle is not specified in mandatory mode” occurred when persisting DataFrame to table

Your project requires every table to be created with a lifecycle. Set the option below each time before your own code runs.

from odps import options
options.lifecycle = 7  # or your expected lifecycle in days

Error “Please add put { “odps.sql.submit.mode” : “script”} for multi-statement query in settings” occurred when executing SQL scripts

Please read set runtime parameters for more information.
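As a rough sketch, the setting named in the error message can be passed through the hints argument of execute_sql. The helper name run_sql_script is hypothetical, and o is assumed to be an already configured ODPS entry object.

```python
def run_sql_script(o, statements):
    """Submit several SQL statements as one script.

    `o` is assumed to be an odps.ODPS entry object. The helper name is
    illustrative, not part of PyODPS.
    """
    # Join the statements with semicolons, dropping any trailing ones first
    # so that each statement ends with exactly one semicolon.
    script = ";\n".join(s.strip().rstrip(";") for s in statements) + ";"
    # The hint below is the setting the error message asks for.
    return o.execute_sql(script, hints={"odps.sql.submit.mode": "script"})
```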

How to enumerate rows in PyODPS DataFrame

We do not support enumerating over every row in PyODPS DataFrame. As PyODPS DataFrame mainly focuses on handling huge amounts of data, enumerating over every row is inefficient and is discouraged. We recommend using the `apply` or `map_reduce` methods of DataFrame to parallelize your enumerations. Details can be found in this article.

If you are sure that your code cannot be parallelized using the methods listed above, and the cost of enumeration is tolerable, you may use `to_pandas` to convert your DataFrame into a Pandas DataFrame, or persist your DataFrame into a MaxCompute table and read it via the `read_table` method or table tunnel.
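For the fallback case only, a minimal sketch of local enumeration might look like the following. The helper name iter_rows_locally is hypothetical, and df is assumed to be a PyODPS DataFrame small enough to fit in local memory.

```python
def iter_rows_locally(df):
    """Download a PyODPS DataFrame and yield its rows one by one.

    `df` is assumed to be a PyODPS DataFrame. The helper name is
    illustrative. All data is loaded into local memory first, so this
    only suits data that is small enough to hold on one machine.
    """
    # to_pandas executes the DataFrame and returns a Pandas DataFrame.
    local_df = df.to_pandas()
    # itertuples yields one named tuple per row; index=False drops the index.
    for row in local_df.itertuples(index=False):
        yield row
```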

Why is memory usage after calling to_pandas significantly larger than the size of the table?

Two possible reasons might cause this issue. First, MaxCompute compresses table data, and the size you see is the size after compression. Second, variables are stored in Python with extra overhead. For instance, for every Python string, an overhead of approximately 40 bytes will be taken even if the string is empty. You may get the size by calling sys.getsizeof("").
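The per-object overhead can be observed with the standard library alone. The sketch below compares the encoded size of a few strings with the memory Python reports for them. The helper name string_overhead and the sample values are illustrative, and exact numbers vary by Python version and platform.

```python
import sys


def string_overhead(values):
    """Return (payload_bytes, in_memory_bytes) for a list of strings.

    payload_bytes is the UTF-8 encoded size of the text itself.
    in_memory_bytes sums sys.getsizeof over the string objects and does
    not include the list that holds them.
    """
    payload = sum(len(v.encode("utf-8")) for v in values)
    in_memory = sum(sys.getsizeof(v) for v in values)
    return payload, in_memory


payload, in_memory = string_overhead(["", "a", "2024-01-01"])
print("payload bytes:", payload)
print("in-memory bytes:", in_memory)
```

For short strings, the in-memory figure is several times the payload, which is one reason a table holding many small string values grows noticeably once loaded.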

Note that using info or memory_usage of Pandas to calculate the size of your DataFrame might not be accurate, as these methods do not take string types or objects into account by default. To get the size of a DataFrame more accurately, use df.memory_usage(deep=True).sum(). Details can be seen in this Pandas document.

To reduce memory usage when reading data, you might try Arrow format. Details can be found here.