from odps.df import DataFrame
iris = DataFrame(o.get_table('pyodps_iris'))

Window functions

The DataFrame application program interface (API) supports window functions:

>>> grouped = iris.groupby('name')
>>> grouped.mutate(grouped.sepallength.cumsum(), grouped.sort('sepallength').row_number()).head(10)
          name  sepallength_sum  row_number
0  Iris-setosa            250.3           1
1  Iris-setosa            250.3           2
2  Iris-setosa            250.3           3
3  Iris-setosa            250.3           4
4  Iris-setosa            250.3           5
5  Iris-setosa            250.3           6
6  Iris-setosa            250.3           7
7  Iris-setosa            250.3           8
8  Iris-setosa            250.3           9
9  Iris-setosa            250.3          10

You can use Window functions to select columns:

>>> iris['name', 'sepallength', iris.groupby('name').sort('sepallength').sepallength.cumcount()].head(5)
          name  sepallength  sepallength_count
0  Iris-setosa          4.3                  1
1  Iris-setosa          4.4                  2
2  Iris-setosa          4.4                  3
3  Iris-setosa          4.4                  4
4  Iris-setosa          4.5                  5

To use window functions to aggregate by scalar, follow the processing method of grouping and aggregation.

>>> from odps.df import Scalar
>>> iris.groupby(Scalar(1)).sort('sepallength').sepallength.cumcount()

The DataFrame API supports the following window functions:

Window functions

Description

cumsum

Calculates the cumulative sum.

cummean

Calculates the cumulative mean.

cummedian

Calculates the cumulative median.

cumstd

Calculates the cumulative standard deviation.

cummax

Calculates the cumulative maximum.

cummin

Calculates the cumulative minimum.

cumcount

Calculates the cumulative sum.

lag

Retrieves the value in the row with the offset before the current row. For example, you can retrieve the value in the row rn-offset when the current row is rn.

lead

Retrieves the value in the row with the offset following the current row. For example, you can retrieve the value in the row rn+offset when the current row is rn.

rank

Calculates the rank.

dense_rank

Calculates the dense rank.

percent_rank

Calculates the relative rank.

row_number

Calculates the row number, starting from 1.

qcut

Cuts data in the group into n slices in order and returns the number of the cut containing the current data. If data are not distributed evenly in cuts, extra data will be put in the first cut.

nth_value

Returns the nth value in the group

cume_dist

Calculates the ratio of lines whose line numbers are less than the current line

In the preceding window functions, rank, dense_rank, percent_rank, and row_number support the following parameters:

Parameter

Description

sort

This is the sorting keyword, which is null by default.

ascending

You can specify this parameter to enable or disable the ascending order in sorting. This parameter is set to True by default.

In addition to the rank parameter, the lag and lead window functions support the following parameters:

Parameter

Description

offset

Indicates the number of rows that are between the current row and the row where you retrieve data.

default

Returns this value when the row that has been specified in the offset does not exist.

In addition to the rank parameter, the cumsum, cummax, cummin, cummean, cummedian, cumcount, and cumstd window functions support the following parameters:

Parameter

Description

unique

Enables or disables deduplication of data. This parameter is set to False by default.

preceding

Specifies the start of windowing.

following

Specifies the end of windowing.