Debugging instructions

Visualized DataFrame

Python on MaxCompute (PyODPS) DataFrame optimizes the whole chain of operations before executing it. To see what will actually run, you can call the visualize method, which displays the entire computation process as a graph.

Note that this visualization requires both the Graphviz software and the graphviz Python package.
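
A quick way to check both dependencies before calling visualize is sketched below. It uses only the Python standard library; the helper name graphviz_available is hypothetical, and the sketch assumes that the Graphviz installation exposes the dot executable on your PATH.

```python
import importlib.util
import shutil


def graphviz_available():
    """Return (has_binary, has_package) for the two visualization dependencies."""
    # A Graphviz system installation normally provides the `dot` executable.
    has_binary = shutil.which("dot") is not None
    # The Python binding is importable under the name `graphviz`.
    has_package = importlib.util.find_spec("graphviz") is not None
    return has_binary, has_package


has_binary, has_package = graphviz_available()
print("dot executable found:", has_binary)
print("graphviz Python package found:", has_package)
```

If either value is False, install the missing piece before you call visualize.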

>>> df = iris.groupby('name').agg(id=iris.sepalwidth.sum())
>>> df = df[df.name, df.id + 3]
>>> df.visualize()
_images/df-steps-visualize.svg

As the graph shows, PyODPS DataFrame merges the GroupBy operation and the column selection into a single step.

>>> df = iris.groupby('name').agg(id=iris.sepalwidth.sum()).cache()
>>> df2 = df[df.name, df.id + 3]
>>> df2.visualize()
_images/df-op-merge-visualize.svg

Because of the cache operation, the computation now runs in two steps.

View compilation results of the MaxCompute SQL backend

Use the compile method to view the SQL that the MaxCompute SQL backend generates for a DataFrame.

>>> df = iris.groupby('name').agg(sepalwidth=iris.sepalwidth.max())
>>> df.compile()
Stage 1:

SQL compiled:

SELECT
  t1.`name`,
  MAX(t1.`sepalwidth`) AS `sepalwidth`
FROM test_pyodps_dev.`pyodps_iris` t1
GROUP BY
  t1.`name`

Debug locally with the pandas computation backend

For a DataFrame created from a MaxCompute table, some operations are not compiled to MaxCompute SQL for execution. Instead, DataFrame uses the Tunnel API to download the data, which is fast and does not wait for MaxCompute SQL task scheduling. With this feature, you can quickly download a small amount of MaxCompute data to your local machine and use the pandas computation backend to write and debug code.

The following operations use the Tunnel API:

  • On a non-partitioned table: selecting all rows or a limited number of rows, selecting columns (without any computation on the columns), and counting the rows.

  • On a partitioned table, with no partition filter or with a filter on the first several partition fields: selecting all rows or a limited number of rows, selecting columns, and counting the rows.

If the iris DataFrame object uses a non-partitioned MaxCompute table as its source, the following operations use the Tunnel API to download data:

>>> iris.count()
>>> iris['name', 'sepalwidth'][:10]

If the DataFrame uses a partitioned table with the three partition fields ds, hh, and mm, the following operations use the Tunnel API to download data:

>>> df[:10]
>>> df[df.ds == '20160808']['f0', 'f1']
>>> df[(df.ds == '20160808') & (df.hh == 3)][:10]
>>> df[(df.ds == '20160808') & (df.hh == 3) & (df.mm == 15)]
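
The partition rule behind these examples can be sketched as a simple prefix check. This is only an illustration of the rule as stated on this page, not the logic PyODPS uses internally. The function name filters_partition_prefix is hypothetical, and the sketch assumes that "the first several partition fields" means a leading run of fields in partition order.

```python
def filters_partition_prefix(partition_fields, filtered_fields):
    """Return True if the filtered fields are exactly the first k partition fields.

    k may be 0, which means no partition filter at all.
    """
    filtered = set(filtered_fields)
    # Take as many leading partition fields as there are filtered fields.
    prefix = partition_fields[:len(filtered)]
    return filtered == set(prefix)


partition_fields = ["ds", "hh", "mm"]

print(filters_partition_prefix(partition_fields, []))                  # True
print(filters_partition_prefix(partition_fields, ["ds"]))              # True
print(filters_partition_prefix(partition_fields, ["ds", "hh"]))        # True
print(filters_partition_prefix(partition_fields, ["ds", "hh", "mm"]))  # True
print(filters_partition_prefix(partition_fields, ["hh"]))              # False
```

Under this reading, a filter on hh alone skips the leading ds field, so it would not qualify for a Tunnel download.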

You can use the to_pandas method to download part of the data to your local machine for debugging. For example:

>>> DEBUG = True
>>> if DEBUG:
...     df = iris[:100].to_pandas(wrap=True)
... else:
...     df = iris

When you finish writing and debugging the code, set DEBUG to False to run the complete computation on MaxCompute.
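
The same switch can be wrapped in a small helper so that the flag comes from the environment instead of being edited in the code. This is a minimal sketch: the function name pick_source and the environment variable PYODPS_DEBUG are hypothetical names chosen for illustration, not part of PyODPS. Only the slicing and the to_pandas(wrap=True) call come from the example above.

```python
import os


def pick_source(remote_df, debug, sample_rows=100):
    """Return a local pandas-backed sample when debugging, else the remote DataFrame."""
    if debug:
        # Slice first so that only `sample_rows` rows are downloaded, then
        # wrap the pandas result so the DataFrame API stays the same.
        return remote_df[:sample_rows].to_pandas(wrap=True)
    return remote_df


# PYODPS_DEBUG is a hypothetical environment variable chosen for this sketch.
DEBUG = os.environ.get("PYODPS_DEBUG", "0") == "1"
```

With an existing iris DataFrame, the call would be df = pick_source(iris, DEBUG). Running the script with PYODPS_DEBUG=1 then uses the local sample, and running it without the variable uses the full table on MaxCompute.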

Note

Because of sandbox restrictions, some programs that pass local debugging may still fail to run on MaxCompute.