from odps.df import DataFrame
iris = DataFrame(o.get_table('pyodps_iris'))
Window functions
The DataFrame application program interface (API) supports window functions:
>>> grouped = iris.groupby('name')
>>> grouped.mutate(grouped.sepallength.cumsum(), grouped.sort('sepallength').row_number()).head(10)
name sepallength_sum row_number
0 Iris-setosa 250.3 1
1 Iris-setosa 250.3 2
2 Iris-setosa 250.3 3
3 Iris-setosa 250.3 4
4 Iris-setosa 250.3 5
5 Iris-setosa 250.3 6
6 Iris-setosa 250.3 7
7 Iris-setosa 250.3 8
8 Iris-setosa 250.3 9
9 Iris-setosa 250.3 10
You can use Window functions to select columns:
>>> iris['name', 'sepallength', iris.groupby('name').sort('sepallength').sepallength.cumcount()].head(5)
name sepallength sepallength_count
0 Iris-setosa 4.3 1
1 Iris-setosa 4.4 2
2 Iris-setosa 4.4 3
3 Iris-setosa 4.4 4
4 Iris-setosa 4.5 5
To use window functions to aggregate by scalar, follow the processing method of grouping and aggregation.
>>> from odps.df import Scalar
>>> iris.groupby(Scalar(1)).sort('sepallength').sepallength.cumcount()
The DataFrame API supports the following window functions:
Window functions |
Description |
---|---|
cumsum |
Calculates the cumulative sum. |
cummean |
Calculates the cumulative mean. |
cummedian |
Calculates the cumulative median. |
cumstd |
Calculates the cumulative standard deviation. |
cummax |
Calculates the cumulative maximum. |
cummin |
Calculates the cumulative minimum. |
cumcount |
Calculates the cumulative sum. |
lag |
Retrieves the value in the row with the offset before the current row. For example, you can retrieve the value in the row rn-offset when the current row is rn. |
lead |
Retrieves the value in the row with the offset following the current row. For example, you can retrieve the value in the row rn+offset when the current row is rn. |
rank |
Calculates the rank. |
dense_rank |
Calculates the dense rank. |
percent_rank |
Calculates the relative rank. |
row_number |
Calculates the row number, starting from 1. |
qcut |
Cuts data in the group into n slices in order and returns the number of the cut containing the current data. If data are not distributed evenly in cuts, extra data will be put in the first cut. |
nth_value |
Returns the nth value in the group |
cume_dist |
Calculates the ratio of lines whose line numbers are less than the current line |
In the preceding window functions, rank, dense_rank, percent_rank, and row_number support the following parameters:
Parameter |
Description |
---|---|
sort |
This is the sorting keyword, which is null by default. |
ascending |
You can specify this parameter to enable or disable the ascending order in sorting. This parameter is set to True by default. |
In addition to the rank parameter, the lag and lead window functions support the following parameters:
Parameter |
Description |
---|---|
offset |
Indicates the number of rows that are between the current row and the row where you retrieve data. |
default |
Returns this value when the row that has been specified in the offset does not exist. |
In addition to the rank parameter, the cumsum, cummax, cummin, cummean, cummedian, cumcount, and cumstd window functions support the following parameters:
Parameter |
Description |
---|---|
unique |
Enables or disables deduplication of data. This parameter is set to False by default. |
preceding |
Specifies the start of windowing. |
following |
Specifies the end of windowing. |