from odps.df import DataFrame
iris = DataFrame(o.get_table('pyodps_iris'))

Window functionsΒΆ

The DataFrame application program interface (API) supports window functions:

>>> grouped = iris.groupby('name')
>>> grouped.mutate(grouped.sepallength.cumsum(), grouped.sort('sepallength').row_number()).head(10)
          name  sepallength_sum  row_number
0  Iris-setosa            250.3           1
1  Iris-setosa            250.3           2
2  Iris-setosa            250.3           3
3  Iris-setosa            250.3           4
4  Iris-setosa            250.3           5
5  Iris-setosa            250.3           6
6  Iris-setosa            250.3           7
7  Iris-setosa            250.3           8
8  Iris-setosa            250.3           9
9  Iris-setosa            250.3          10

You can use Window functions to select columns:

>>> iris['name', 'sepallength', iris.groupby('name').sort('sepallength').sepallength.cumcount()].head(5)
          name  sepallength  sepallength_count
0  Iris-setosa          4.3                  1
1  Iris-setosa          4.4                  2
2  Iris-setosa          4.4                  3
3  Iris-setosa          4.4                  4
4  Iris-setosa          4.5                  5

To use window functions to aggregate by scalar, follow the processing method of grouping and aggregation.

>>> from odps.df import Scalar
>>> iris.groupby(Scalar(1)).sort('sepallength').sepallength.cumcount()

The DataFrame API supports the following window functions:

Window functions Description
cumsum Calculates the cumulative sum.
cummean Calculates the cumulative mean.
cummedian Calculates the cumulative median.
cumstd Calculates the cumulative standard deviation.
cummax Calculates the cumulative maximum.
cummin Calculates the cumulative minimum.
cumcount Calculates the cumulative sum.
lag Retrieves the value in the row with the offset before the current row. For example, you can retrieve the value in the row rn-offset when the current row is rn.
lead Retrieves the value in the row with the offset following the current row. For example, you can retrieve the value in the row rn+offset when the current row is rn.
rank Calculates the rank.
dense_rank Calculates the dense rank.
percent_rank Calculates the relative rank.
row_number Calculates the row number, starting from 1.
qcut Cuts data in the group into n slices in order and returns the number of the cut containing the current data. If data are not distributed evenly in cuts, extra data will be put in the first cut.
nth_value Returns the nth value in the group
cume_dist Calculates the ratio of lines whose line numbers are less than the current line

In the preceding window functions, rank, dense_rank, percent_rank, and row_number support the following parameters:

Parameter Description
sort This is the sorting keyword, which is null by default.
ascending You can specify this parameter to enable or disable the ascending order in sorting. This parameter is set to True by default.

In addition to the rank parameter, the lag and lead window functions support the following parameters:

Parameter Description
offset Indicates the number of rows that are between the current row and the row where you retrieve data.
default Returns this value when the row that has been specified in the offset does not exist.

In addition to the rank parameter, the cumsum, cummax, cummin, cummean, cummedian, cumcount, and cumstd window functions support the following parameters:

Parameter Description
unique Enables or disables deduplication of data. This parameter is set to False by default.
preceding Specifies the start of windowing.
following Specifies the end of windowing.