from odps.df import DataFrame

iris = DataFrame(o.get_table('pyodps_iris'))

Window functions

The DataFrame application program interface (API) supports window functions:

>>> grouped = iris.groupby('name')
>>> grouped.mutate(grouped.sepallength.cumsum(), grouped.sort('sepallength').row_number()).head(10)
          name  sepallength_sum  row_number
Iris-setosa            250.3           1
Iris-setosa            250.3           2
Iris-setosa            250.3           3
Iris-setosa            250.3           4
Iris-setosa            250.3           5
Iris-setosa            250.3           6
Iris-setosa            250.3           7
Iris-setosa            250.3           8
Iris-setosa            250.3           9
Iris-setosa            250.3          10

You can use Window functions to select columns:

>>> iris['name', 'sepallength', iris.groupby('name').sort('sepallength').sepallength.cumcount()].head(5)
          name  sepallength  sepallength_count
Iris-setosa          4.3                  1
Iris-setosa          4.4                  2
Iris-setosa          4.4                  3
Iris-setosa          4.4                  4
Iris-setosa          4.5                  5

To use window functions to aggregate by scalar, follow the processing method of grouping and aggregation.

>>> from odps.df import Scalar
>>> iris.groupby(Scalar(1)).sort('sepallength').sepallength.cumcount()

The DataFrame API supports the following window functions:

Window functions	Description
cumsum	Calculates the cumulative sum.
cummean	Calculates the cumulative mean.
cummedian	Calculates the cumulative median.
cumstd	Calculates the cumulative standard deviation.
cummax	Calculates the cumulative maximum.
cummin	Calculates the cumulative minimum.
cumcount	Calculates the cumulative sum.
lag	Retrieves the value in the row with the offset before the current row. For example, you can retrieve the value in the row rn-offset when the current row is rn.
lead	Retrieves the value in the row with the offset following the current row. For example, you can retrieve the value in the row rn+offset when the current row is rn.
rank	Calculates the rank.
dense_rank	Calculates the dense rank.
percent_rank	Calculates the relative rank.
row_number	Calculates the row number, starting from 1.
qcut	Cuts data in the group into n slices in order and returns the number of the cut containing the current data. If data are not distributed evenly in cuts, extra data will be put in the first cut.
nth_value	Returns the nth value in the group
cume_dist	Calculates the ratio of lines whose line numbers are less than the current line

In the preceding window functions, rank, dense_rank, percent_rank, and row_number support the following parameters:

Parameter	Description
sort	This is the sorting keyword, which is null by default.
ascending	You can specify this parameter to enable or disable the ascending order in sorting. This parameter is set to True by default.

In addition to the rank parameter, the lag and lead window functions support the following parameters:

Parameter	Description
offset	Indicates the number of rows that are between the current row and the row where you retrieve data.
default	Returns this value when the row that has been specified in the offset does not exist.

In addition to the rank parameter, the cumsum, cummax, cummin, cummean, cummedian, cumcount, and cumstd window functions support the following parameters:

Parameter	Description
unique	Enables or disables deduplication of data. This parameter is set to False by default.
preceding	Specifies the start of windowing.
following	Specifies the end of windowing.