DataFrame Reference

class odps.df.DataFrame(*args, **kwargs)[source]

Main entry point of PyODPS DataFrame.

Users can initialize a DataFrame from an odps.models.Table.

Parameters:

data (odps.models.Table or pandas DataFrame) – ODPS table or pandas DataFrame

Example:

>>> df = DataFrame(o.get_table('my_example_table'))
>>> df.dtypes
odps.Schema {
  movie_id                            int64
  title                               string
  release_date                        string
  video_release_date                  string
  imdb_url                            string
  user_id                             int64
  rating                              int64
  unix_timestamp                      int64
  age                                 int64
  sex                                 string
  occupation                          string
  zip_code                            string
}
>>> df.count()
100000
>>>
>>> # Do the `groupby`, aggregate the `movie_id` by count, then sort the count in a reversed order
>>> # Finally we get the top 25 results
>>> df.groupby('title').agg(count=df.movie_id.count()).sort('count', ascending=False)[:25]
>>>
>>> # We can use the `value_counts` to reach the same goal
>>> df.movie_id.value_counts()[:25]
static batch_persist(dfs, tables, *args, **kwargs)[source]

Persist multiple DataFrames into ODPS.

Parameters:
  • dfs – DataFrames to persist.

  • tables – Table names to persist to. Use (table, partition) tuple to store to a table partition.

  • args – args for Expr.persist

  • kwargs – kwargs for Expr.persist

Examples:

>>> DataFrame.batch_persist([df1, df2], ['table_name1', ('table_name2', 'partition_name2')], lifecycle=1)
view()[source]

Clone the collection. Useful for self-joins.

Returns:

class odps.df.CollectionExpr(*args, **kwargs)[source]

Collection represents two-dimensional data.

Example:

>>> # projection
>>> df = DataFrame(o.get_table('my_table')) # DataFrame is actually a CollectionExpr
>>> df['name', 'id']  # project some columns
>>> df[[df.name, df.id]]  # projection
>>> df[df]  # no-op projection that keeps all the columns
>>> df[df, df.name.lower().rename('name2')]  # project a new column `name2` besides all the original columns
>>> df.select(df, name2=df.name.lower())  # projection by `select`
>>> df.exclude('name')  # project all columns but `name`
>>> df[df.exclude('name'), df.name.lower()]  # `name` will not conflict any more
>>>
>>> # filter
>>> df[(df.id < 3) & (df.name != 'test')]
>>> df.filter(df.id < 3, df.name != 'test')
>>>
>>> # slice
>>> df[: 10]
>>> df.limit(10)
>>>
>>> # Sequence
>>> df.name # an instance of :class:`odps.df.expr.expressions.SequenceExpr`
>>>
>>> # schema or dtypes
>>> df.dtypes
odps.Schema {
  name    string
  id      int64
}
>>> df.schema
odps.Schema {
  name    string
  id      int64
}
all()

Return whether all elements are True.

Parameters:

expr

Returns:

any()

Return whether any element is True.

Parameters:

expr

Returns:

append_id(id_col='append_id')

Append an ID column to the current DataFrame to form a new DataFrame.

Parameters:

id_col (str) – name of appended ID field.

Returns:

DataFrame with ID field

Return type:

DataFrame
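
Example (a minimal sketch; ``record_id`` is an arbitrary column name chosen for illustration):

>>> # append an auto-generated ID column to every row
>>> new_df = df.append_id(id_col='record_id')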

apply(func, axis=0, names=None, types=None, reduce=False, resources=None, keep_nulls=False, args=(), **kwargs)

Apply a function to a row when axis=1 or column when axis=0.

Parameters:
  • expr

  • func – function to apply

  • axis – row when axis=1 else column

  • names – output names

  • types – output types

  • reduce – if True will return a sequence else return a collection

  • resources – resources to read

  • keep_nulls – if True, keep rows producing empty results; only works in lateral views

  • args – args for function

  • kwargs – kwargs for function

Returns:

Example:

Apply a function to a row:

>>> from odps.df import output
>>>
>>> @output(['iris_add', 'iris_sub'], ['float', 'float'])
>>> def handle(row):
>>>     yield row.sepallength - row.sepalwidth, row.sepallength + row.sepalwidth
>>>     yield row.petallength - row.petalwidth, row.petallength + row.petalwidth
>>>
>>> iris.apply(handle, axis=1).count()

Apply a function to a column:

>>> class Agg(object):
>>>
>>>     def buffer(self):
>>>         return [0.0, 0]
>>>
>>>     def __call__(self, buffer, val):
>>>         buffer[0] += val
>>>         buffer[1] += 1
>>>
>>>     def merge(self, buffer, pbuffer):
>>>         buffer[0] += pbuffer[0]
>>>         buffer[1] += pbuffer[1]
>>>
>>>     def getvalue(self, buffer):
>>>         if buffer[1] == 0:
>>>             return 0.0
>>>         return buffer[0] / buffer[1]
>>>
>>> iris.exclude('name').apply(Agg)
applymap(func, rtype=None, resources=None, columns=None, excludes=None, args=(), **kwargs)

Call func on each element of this collection.

Parameters:
  • func – lambda, function, odps.models.Function, or str which is the name of odps.models.Function

  • rtype – if not provided, will be the dtype of this sequence

  • columns – columns to apply this function on

  • excludes – columns to skip when applying the function

Returns:

a new collection

Example:

>>> df.applymap(lambda x: x + 1)
ast()

Return the AST string.

Returns:

AST tree

Return type:

str

bfill(subset=None)

Fill NA/NaN values with the backward method. Equivalent to fillna(method=’bfill’).

Parameters:
  • expr (DataFrame) – input DataFrame.

  • subset – Labels along other axis to consider.

Returns:

DataFrame

bloom_filter(on, column, capacity=3000, error_rate=0.01)

Filter the collection on the on sequence with a BloomFilter built from column

Parameters:
  • collection

  • on – sequence or column name

  • column – instance of Column

  • capacity (int) – capacity of the BloomFilter

  • error_rate (float) – error rate

Returns:

collection

Example:

>>> df1 = DataFrame(pd.DataFrame({'a': ['name1', 'name2', 'name3', 'name1'], 'b': [1, 2, 3, 4]}))
>>> df2 = DataFrame(pd.DataFrame({'a': ['name1']}))
>>> df1.bloom_filter('a', df2.a)
       a  b
0  name1  1
1  name1  4
property columns

Returns:

the columns of this collection

Return type:

list, in which each element is a Column

compile()

Compile this expression into an ODPS SQL

Returns:

compiled DAG

Return type:

str

concat(rights, distinct=False, axis=0)

Concat collections.

Parameters:
  • left – left collection

  • rights – right collections, can be a DataFrame object or a list of DataFrames

  • distinct – whether to remove duplicate entries. only available when axis == 0

  • axis – when axis == 0, the DataFrames are merged vertically, otherwise horizontally.

Returns:

collection

Note that axis==1 can only be used under Pandas DataFrames or XFlow.

Example:

>>> df['name', 'id'].concat(df2['score'], axis=1)
continuous(*args)

Set fields to be continuous.

Return type:

DataFrame

Example:

>>> # Table schema is create table test(f1 double, f2 string)
>>> # Original continuity: f1=DISCRETE, f2=DISCRETE
>>> # Now we want to set ``f1`` and ``f2`` into continuous
>>> new_ds = df.continuous('f1 f2')
count()

Number of records.

Parameters:

expr

Returns:

discrete(*args)

Set fields to be discrete.

Return type:

DataFrame

Example:

>>> # Table schema is create table test(f1 double, f2 string)
>>> # Original continuity: f1=CONTINUOUS, f2=CONTINUOUS
>>> # Now we want to set ``f1`` and ``f2`` to discrete
>>> new_ds = df.discrete('f1 f2')
distinct(on=None, *ons)

Get collection with duplicate rows removed, optionally only considering certain columns

Parameters:
  • expr – collection

  • on – sequence or sequences

Returns:

distinct collection

Example:

>>> df.distinct(['name', 'id'])
>>> df['name', 'id'].distinct()
drop(data, axis=0, columns=None)

Drop data from a DataFrame.

Parameters:
  • expr – collection to drop data from

  • data – data to be removed

  • axis – 0 for deleting rows, 1 for columns.

  • columns – columns of data to select, only useful when axis == 0

Returns:

collection

Example:

>>> import pandas as pd
>>> df1 = DataFrame(pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]}))
>>> df2 = DataFrame(pd.DataFrame({'a': [2, 3], 'b': [5, 7]}))
>>> df1.drop(df2)
   a  b  c
0  1  4  7
1  3  6  9
>>> df1.drop(df2, columns='a')
   a  b  c
0  1  4  7
>>> df1.drop(['a'], axis=1)
   b  c
0  4  7
1  5  8
2  6  9
>>> df1.drop(df2, axis=1)
   c
0  7
1  8
2  9
dropna(how='any', thresh=None, subset=None)

Return object with labels on given axis omitted where alternately any or all of the data are missing

Parameters:
  • expr (DataFrame) – input DataFrame

  • how – can be ‘any’ or ‘all’. If ‘any’ is specified, drop the label if any NA values are present. If ‘all’ is specified, drop the label only if all values are NA.

  • thresh – require that many non-NA values

  • subset – Labels along other axis to consider, e.g. if you are dropping rows these would be a list of columns to include

Returns:

DataFrame
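
Example (a minimal sketch; the column names are illustrative):

>>> df.dropna()  # drop rows containing any NA value
>>> df.dropna(how='all')  # drop rows only when every value is NA
>>> df.dropna(thresh=2, subset=['name', 'id'])  # keep rows with at least 2 non-NA values among the given columns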

erase_key_value(*args)

Erase key-value represented fields.

Return type:

DataFrame

Example:

>>> new_ds = df.erase_key_value('f1 f2')
except_(*rights, **kwargs)

Exclude data from a collection, like except clause in SQL. All collections involved should have same schema.

Parameters:
  • left – collection to drop data from

  • rights – collection or list of collections

  • distinct – if True, duplicate entries are removed from the result

Returns:

collection

Examples:

>>> import pandas as pd
>>> df1 = DataFrame(pd.DataFrame({'a': [1, 2, 3, 3, 3], 'b': [1, 2, 3, 3, 3]}))
>>> df2 = DataFrame(pd.DataFrame({'a': [1, 3], 'b': [1, 3]}))
>>> df1.setdiff(df2)
   a  b
0  2  2
1  3  3
2  3  3
>>> df1.setdiff(df2, distinct=True)
   a  b
0  2  2
exclude(*fields)[source]

Project columns that are not included in the given fields

Parameters:

fields – field names

Returns:

new collection

Return type:

odps.df.expr.expressions.CollectionExpr

exclude_fields(*args)

Exclude one or more fields from feature fields.

Return type:

DataFrame

execute(**kwargs)
Parameters:
  • hints (dict) – settings for SQL, e.g. odps.sql.mapper.split.size

  • priority (int) – instance priority, 9 as default

  • running_cluster – cluster to run this instance

Returns:

execution result

Return type:

odps.df.backends.frame.ResultFrame

extract_kv(columns=None, kv_delim=':', item_delim=',', dtype='float', fill_value=None)

Extract values in key-value represented columns into standalone columns. New column names will be the name of the key-value column followed by an underscore and the key.

Parameters:
  • expr (DataFrame) – input DataFrame

  • columns – the key-value columns to be extracted.

  • kv_delim (str) – delimiter between key and value.

  • item_delim (str) – delimiter between key-value pairs.

  • dtype (str) – type of value columns to generate.

  • fill_value – default value for missing key-value pairs.

Returns:

extracted data frame

Return type:

DataFrame

Example:

>>> df
    name   kv
0  name1  k1=1.0,k2=3.0,k5=10.0
1  name2  k2=3.0,k3=5.1
2  name3  k1=7.1,k7=8.2
3  name4  k2=1.2,k3=1.5
4  name5  k2=1.0,k9=1.1
>>> table = df.extract_kv(columns=['kv'], kv_delim='=')
>>> table
    name   kv_k1   kv_k2   kv_k3   kv_k5   kv_k7   kv_k9
0  name1  1.0     3.0     Nan     10.0    Nan     Nan
1  name2  Nan     3.0     5.1     Nan     Nan     Nan
2  name3  7.1     Nan     Nan     Nan     8.2     Nan
3  name4  Nan     1.2     1.5     Nan     Nan     Nan
4  name5  Nan     1.0     Nan     Nan     Nan     1.1
ffill(subset=None)

Fill NA/NaN values with the forward method. Equivalent to fillna(method=’ffill’).

Parameters:
  • expr (DataFrame) – input DataFrame.

  • subset – Labels along other axis to consider.

Returns:

DataFrame

fillna(value=None, method=None, subset=None)

Fill NA/NaN values using the specified method

Parameters:
  • expr (DataFrame) – input DataFrame

  • method – can be ‘backfill’, ‘bfill’, ‘pad’, ‘ffill’ or None

  • value – value to fill into

  • subset – Labels along other axis to consider.

Returns:

DataFrame
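
Example (a minimal sketch; the column names are illustrative):

>>> df.fillna(0)  # fill NA values with a constant
>>> df.fillna(method='ffill', subset=['name', 'id'])  # forward-fill only the given columns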

filter(*predicates)[source]

Filter the data by predicates

Parameters:

predicates – the conditions to filter

Returns:

new collection

Return type:

odps.df.expr.expressions.CollectionExpr

filter_parts(predicate='', exclude=True)[source]

Filter the data by partition string. A partition string looks like pt1=1,pt2=2/pt1=2,pt2=1, where comma (,) denotes ‘and’, while (/) denotes ‘or’.

Parameters:
  • predicate (str|Partition) – predicate string of partition filter

  • exclude (bool) – True if you want to exclude partition fields, otherwise False. True by default.

Returns:

new collection

Return type:

odps.df.expr.expressions.CollectionExpr
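
Example (a minimal sketch; the partition fields ``pt1`` and ``pt2`` are assumed to exist on the source table):

>>> df.filter_parts('pt1=1,pt2=2/pt1=2,pt2=1')  # (pt1=1 and pt2=2) or (pt1=2 and pt2=1)
>>> df.filter_parts('pt1=1', exclude=False)  # keep the partition fields in the result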

groupby(by, *bys)

Group collection by a series of sequences.

Parameters:
  • expr – collection

  • by – columns to group

  • bys – columns to group

Returns:

GroupBy instance

Return type:

odps.df.expr.groupby.GroupBy
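
Example (a minimal sketch reusing the ``name`` and ``id`` columns shown above):

>>> df.groupby('name').agg(id_count=df.id.count())
>>> df.groupby([df.name]).agg(max_id=df.id.max())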

head(n=None, **kwargs)[source]

Return the first n rows. Execute at once.

Parameters:

n

Returns:

result frame

Return type:

odps.df.backends.frame.ResultFrame

inner_join(right, on=None, suffixes=('_x', '_y'), mapjoin=False, skewjoin=False)

Inner join two collections.

If on is not specified, the common fields of the left and right collections are used. suffixes means that if column names conflict, the suffixes will be added automatically. For example, if both left and right have a field named col, the joined collection will contain col_x and col_y.

Parameters:
  • left – left collection

  • right – right collection

  • on – fields to join on

  • suffixes – when column names conflict, the suffixes will be added to both columns.

Returns:

collection

Example:

>>> df.dtypes.names
['name', 'id']
>>> df2.dtypes.names
['name', 'id1']
>>> df.inner_join(df2)
>>> df.inner_join(df2, on='name')
>>> df.inner_join(df2, on=('id', 'id1'))
>>> df.inner_join(df2, on=['name', ('id', 'id1')])
>>> df.inner_join(df2, on=[df.name == df2.name, df.id == df2.id1])
intersect(*rights, **kwargs)

Calculate the intersection among datasets.

Parameters:
  • left – collection

  • rights – collection or list of collections

  • distinct – if True, duplicate entries are removed from the result

Returns:

collection

Examples:

>>> import pandas as pd
>>> df1 = DataFrame(pd.DataFrame({'a': [1, 2, 3, 3, 3], 'b': [1, 2, 3, 3, 3]}))
>>> df2 = DataFrame(pd.DataFrame({'a': [1, 3, 3], 'b': [1, 3, 3]}))
>>> df1.intersect(df2)
   a  b
0  1  1
1  3  3
2  3  3
>>> df1.intersect(df2, distinct=True)
   a  b
0  1  1
1  3  3
join(right, on=None, how='inner', suffixes=('_x', '_y'), mapjoin=False, skewjoin=False)

Join two collections.

If on is not specified, the common fields of the left and right collections are used. suffixes means that if column names conflict, the suffixes will be added automatically. For example, if both left and right have a field named col, the joined collection will contain col_x and col_y.

Parameters:
  • left – left collection

  • right – right collection

  • on – fields to join on

  • how – ‘inner’, ‘left’, ‘right’, or ‘outer’

  • suffixes – when column names conflict, the suffixes will be added to both columns.

  • mapjoin – set use mapjoin or not, default value False.

  • skewjoin – set use of skewjoin or not, default value False. Can specify True if the collection is skew, or a list specifying columns with skew values, or a list of dicts specifying skew combinations.

Returns:

collection

Example:

>>> df.dtypes.names
['name', 'id']
>>> df2.dtypes.names
['name', 'id1']
>>> df.join(df2)
>>> df.join(df2, on='name')
>>> df.join(df2, on=('id', 'id1'))
>>> df.join(df2, on=['name', ('id', 'id1')])
>>> df.join(df2, on=[df.name == df2.name, df.id == df2.id1])
>>> df.join(df2, mapjoin=False)
>>> df.join(df2, skewjoin=True)
>>> df.join(df2, skewjoin=["c0", "c1"])
>>> df.join(df2, skewjoin=[{"c0": 1, "c1": "2"}, {"c0": 3, "c1": "4"}])
key_value(*args, **kwargs)

Set fields to be key-value represented.

Return type:

DataFrame

Example:

>>> new_ds = df.key_value('f1 f2', kv=':', item=',')
kurt()

Calculate kurtosis of the sequence

Parameters:

expr

Returns:

kurtosis()

Calculate kurtosis of the sequence

Parameters:

expr

Returns:

label_field(f)

Select one field as the label field.

Note that this field will be excluded from feature fields.

Parameters:

f (str) – Selected label field

Return type:

DataFrame

left_join(right, on=None, suffixes=('_x', '_y'), mapjoin=False, merge_columns=None, skewjoin=False)

Left join two collections.

If on is not specified, the common fields of the left and right collections are used. suffixes means that if column names conflict, the suffixes will be added automatically. For example, if both left and right have a field named col, the joined collection will contain col_x and col_y.

Parameters:
  • left – left collection

  • right – right collection

  • on – fields to join on

  • suffixes – when column names conflict, the suffixes will be added to both columns.

  • mapjoin – set use mapjoin or not, default value False.

  • merge_columns – whether to merge columns with the same name into one column without suffix. If the value is True, columns in the predicate with same names will be merged, with non-null value. If the value is ‘left’ or ‘right’, the values of predicates on the left / right collection will be taken. You can also pass a dictionary to describe the behavior of each column, such as { ‘a’: ‘auto’, ‘b’: ‘left’ }.

Returns:

collection

Example:

>>> df.dtypes.names
['name', 'id']
>>> df2.dtypes.names
['name', 'id1']
>>> df.left_join(df2)
>>> df.left_join(df2, on='name')
>>> df.left_join(df2, on=('id', 'id1'))
>>> df.left_join(df2, on=['name', ('id', 'id1')])
>>> df.left_join(df2, on=[df.name == df2.name, df.id == df2.id1])
limit(n)[source]

limit n records

Parameters:

n – n records

Returns:

map_reduce(mapper=None, reducer=None, group=None, sort=None, ascending=True, combiner=None, combiner_buffer_size=1024, mapper_output_names=None, mapper_output_types=None, mapper_resources=None, reducer_output_names=None, reducer_output_types=None, reducer_resources=None)

MapReduce API, mapper or reducer should be provided.

Parameters:
  • expr

  • mapper – mapper function or class

  • reducer – reducer function or class

  • group – the keys to group after mapper

  • sort – the keys to sort after mapper

  • ascending – True if ascending else False

  • combiner – combiner function or class, combiner’s output should be equal to mapper

  • combiner_buffer_size – combiner’s buffer size, 1024 as default

  • mapper_output_names – mapper’s output names

  • mapper_output_types – mapper’s output types

  • mapper_resources – the resources for mapper

  • reducer_output_names – reducer’s output names

  • reducer_output_types – reducer’s output types

  • reducer_resources – the resources for reducer

Returns:

Example:

>>> from odps.df import output
>>>
>>> @output(['word', 'cnt'], ['string', 'int'])
>>> def mapper(row):
>>>     for word in row[0].split():
>>>         yield word.lower(), 1
>>>
>>> @output(['word', 'cnt'], ['string', 'int'])
>>> def reducer(keys):
>>>     cnt = [0]
>>>     def h(row, done):  # done illustrates that all the rows of the keys are processed
>>>         cnt[0] += row.cnt
>>>         if done:
>>>             yield keys.word, cnt[0]
>>>     return h
>>>
>>> words_df.map_reduce(mapper, reducer, group='word')
max()

Max value

Parameters:

expr

Returns:

mean()

Arithmetic mean.

Parameters:

expr

Returns:

median()

Median value.

Parameters:

expr

Returns:

melt(id_vars=None, value_vars=None, var_name='variable', value_name='value', ignore_nan=False)

“Unpivots” a DataFrame from wide format to long format, optionally leaving identifier variables set.

This function is useful to massage a DataFrame into a format where one or more columns are identifier variables (id_vars), while all other columns, considered measured variables (value_vars), are “unpivoted” to the row axis, leaving just two non-identifier columns, ‘variable’ and ‘value’.

Parameters:
  • expr – collection

  • id_vars – column(s) to use as identifier variables.

  • value_vars – column(s) to unpivot. If not specified, uses all columns that are not set as id_vars.

  • var_name – name to use for the ‘variable’ column. If None it uses frame.columns.name or ‘variable’.

  • value_name – name to use for the ‘value’ column.

  • ignore_nan – whether to ignore NaN values in data.

Returns:

collection

Example:

>>> df.melt(id_vars='id', value_vars=['col1', 'col2'])
>>> df.melt(id_vars=['id', 'id2'], value_vars=['col1', 'col2'], var_name='variable')
min()

Min value

Parameters:

expr

Returns:

min_max_scale(columns=None, feature_range=(0, 1), preserve=False, suffix='_scaled', group=None)

Resize a data frame by max / min values, i.e., (X - min(X)) / (max(X) - min(X))

Parameters:
  • expr (DataFrame) – input DataFrame

  • feature_range – the target range to resize the value into, i.e., v * (b - a) + a

  • preserve (bool) – determine whether input data should be kept. If True, scaled input data will be appended to the data frame with suffix

  • columns – column names to resize. If set to None, float or int-typed columns will be normalized if the column is not specified as a group column.

  • group – determine scale groups. Scaling will be done in each group separately.

  • suffix (str) – column suffix to be appended to the scaled columns.

Returns:

resized data frame

Return type:

DataFrame
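
Example (a minimal sketch; ``fid`` is an assumed numeric column):

>>> df.min_max_scale(columns=['fid'])  # scale fid into [0, 1]
>>> df.min_max_scale(columns=['fid'], feature_range=(-1, 1), preserve=True)  # keep fid and append fid_scaled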

moment(order, central=False)

Calculate the n-th order moment of the sequence

Parameters:
  • expr

  • order – moment order, must be an integer

  • central – if central moments are to be computed.

Returns:

nunique()

The distinct count.

Parameters:

expr

Returns:

outer_join(right, on=None, suffixes=('_x', '_y'), mapjoin=False, merge_columns=None, skewjoin=False)

Outer join two collections.

If on is not specified, the common fields of the left and right collections are used. suffixes means that if column names conflict, the suffixes will be added automatically. For example, if both left and right have a field named col, the joined collection will contain col_x and col_y.

Parameters:
  • left – left collection

  • right – right collection

  • on – fields to join on

  • suffixes – when column names conflict, the suffixes will be added to both columns.

  • mapjoin – set use mapjoin or not, default value False.

  • merge_columns – whether to merge columns with the same name into one column without suffix. If the value is True, columns in the predicate with same names will be merged, with non-null value. If the value is ‘left’ or ‘right’, the values of predicates on the left / right collection will be taken. You can also pass a dictionary to describe the behavior of each column, such as { ‘a’: ‘auto’, ‘b’: ‘left’ }.

Returns:

collection

Example:

>>> df.dtypes.names
['name', 'id']
>>> df2.dtypes.names
['name', 'id1']
>>> df.outer_join(df2)
>>> df.outer_join(df2, on='name')
>>> df.outer_join(df2, on=('id', 'id1'))
>>> df.outer_join(df2, on=['name', ('id', 'id1')])
>>> df.outer_join(df2, on=[df.name == df2.name, df.id == df2.id1])
persist(name, partitions=None, partition=None, lifecycle=None, project=None, **kwargs)

Persist the execution into a new table. If partitions not specified, will create a new table without partitions if the table does not exist, and insert the SQL result into it. If partitions are specified, they will be the partition fields of the new table. If partition is specified, the data will be inserted into the exact partition of the table.

Parameters:
  • name – table name

  • partitions (list) – list of string, the partition fields

  • partition (string or PartitionSpec) – persist to a specified partition

  • lifecycle (int) – table lifecycle. If absent, options.lifecycle will be used.

  • project – project name, if not provided, will be the default project

  • hints (dict) – settings for SQL, e.g. odps.sql.mapper.split.size

  • priority (int) – instance priority, 9 as default

  • running_cluster – cluster to run this instance

  • overwrite (bool) – overwrite the table, True as default

  • drop_table (bool) – drop table if exists, False as default

  • create_table (bool) – create table first if not exists, True as default

  • drop_partition (bool) – drop partition if exists, False as default

  • create_partition (bool) – create partition if not exists, None as default

  • cast (bool) – cast all columns’ types to those of the existing table, False as default

Returns:

odps.df.DataFrame

Example:

>>> df = df['name', 'id', 'ds']
>>> df.persist('odps_new_table')
>>> df.persist('odps_new_table', partition='pt=test')
>>> df.persist('odps_new_table', partitions=['ds'])
pivot(rows, columns, values=None)

Produce ‘pivot’ table based on 3 columns of this DataFrame. Uses unique values from rows / columns and fills with values.

Parameters:
  • expr – collection

  • rows – use to make new collection’s grouped rows

  • columns – use to make new collection’s columns

  • values – values to use for populating new collection’s values

Returns:

collection

Example:

>>> df.pivot(rows='id', columns='category')
>>> df.pivot(rows='id', columns='category', values='sale')
>>> df.pivot(rows=['id', 'id2'], columns='category', values='sale')
pivot_table(values=None, rows=None, columns=None, aggfunc='mean', fill_value=None)

Create a spreadsheet-style pivot table as a DataFrame.

Parameters:
  • expr – collection

  • values (optional) – column to aggregate

  • rows – rows to group

  • columns – keys to group by on the pivot table column

  • aggfunc – aggregate function or functions

  • fill_value (optional) – value to replace missing value with, default None

Returns:

collection

Example:

>>> df
    A    B      C   D
0  foo  one  small  1
1  foo  one  large  2
2  foo  one  large  2
3  foo  two  small  3
4  foo  two  small  3
5  bar  one  large  4
6  bar  one  small  5
7  bar  two  small  6
8  bar  two  large  7
>>> table = df.pivot_table(values='D', rows=['A', 'B'], columns='C', aggfunc='sum')
>>> table
     A    B  large_D_sum   small_D_sum
0  bar  one          4.0           5.0
1  bar  two          7.0           6.0
2  foo  one          4.0           1.0
3  foo  two          NaN           6.0
quantile(prob=None, **kw)

Percentile value.

Parameters:
  • expr

  • prob – probability or list of probabilities, in [0, 1]

Returns:
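
Example (a minimal sketch on an assumed numeric column ``id``):

>>> df.id.quantile(0.5)  # the median
>>> df.id.quantile([0.25, 0.5, 0.75])  # several probabilities at once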

query(expr)[source]

Query the data with a boolean expression.

Parameters:

expr – the query string; you can use the ‘@’ character to refer to environment variables.

Returns:

new collection

Return type:

odps.df.expr.expressions.CollectionExpr
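
Example (a minimal sketch; ``limit`` is a local variable referenced with ‘@’):

>>> limit = 3
>>> df.query('id < @limit')  # equivalent to df[df.id < 3]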

reshuffle(by=None, sort=None, ascending=True)

Reshuffle data.

Parameters:
  • expr

  • by – the sequence or scalar to shuffle by. RandomScalar as default

  • sort – the sequence or scalar to sort.

  • ascending – True if ascending else False

Returns:

collection
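
Example (a minimal sketch on the assumed columns ``name`` and ``id``):

>>> df.reshuffle()  # shuffle randomly
>>> df.reshuffle('name')  # shuffle by the name column
>>> df.reshuffle(df.name, sort='id', ascending=False)  # shuffle by name, sort by id in descending order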

right_join(right, on=None, suffixes=('_x', '_y'), mapjoin=False, merge_columns=None, skewjoin=False)

Right join two collections.

If on is not specified, the common fields of the left and right collections are used. suffixes means that if column names conflict, the suffixes will be added automatically. For example, if both left and right have a field named col, the joined collection will contain col_x and col_y.

Parameters:
  • left – left collection

  • right – right collection

  • on – fields to join on

  • suffixes – when column names conflict, the suffixes will be added to both columns.

  • mapjoin – set use mapjoin or not, default value False.

  • merge_columns – whether to merge columns with the same name into one column without suffix. If the value is True, columns in the predicate with same names will be merged, with non-null value. If the value is ‘left’ or ‘right’, the values of predicates on the left / right collection will be taken. You can also pass a dictionary to describe the behavior of each column, such as { ‘a’: ‘auto’, ‘b’: ‘left’ }.

Returns:

collection

Example:

>>> df.dtypes.names
['name', 'id']
>>> df2.dtypes.names
['name', 'id1']
>>> df.right_join(df2)
>>> df.right_join(df2, on='name')
>>> df.right_join(df2, on=('id', 'id1'))
>>> df.right_join(df2, on=['name', ('id', 'id1')])
>>> df.right_join(df2, on=[df.name == df2.name, df.id == df2.id1])
roles(clear_features=True, **field_roles)

Set roles of fields

Parameters:
  • clear_features – Clear feature roles on fields

  • field_roles

Returns:

sample(parts=None, columns=None, i=None, n=None, frac=None, replace=False, weights=None, strata=None, random_state=None)

Sample collection.

Parameters:
  • expr – collection

  • parts – how many parts to hash

  • columns – the columns to sample

  • i – the part to sample out, can be a list of parts, must be from 0 to parts-1

  • n – how many rows to sample. If strata is specified, n should be a dict with values in the strata column as dictionary keys and corresponding sample size as values

  • frac – fraction of rows to sample. If strata is specified, frac should be a dict with values in the strata column as dictionary keys and corresponding sample fractions as values

  • replace – whether to perform replace sampling

  • weights – the column name of weights

  • strata – the name of strata column

  • random_state – the random seed when performing sampling

Returns:

collection

Note that n, frac, replace, weights, strata and random_state can only be used under Pandas DataFrames or XFlow.

Example:

Sampling with parts:

>>> df.sample(parts=1)
>>> df.sample(parts=5, i=0)
>>> df.sample(parts=10, columns=['name'])

Sampling with fraction or weights, replacement option can be specified:

>>> df.sample(n=100)
>>> df.sample(frac=0.1)
>>> df.sample(frac=0.1, replace=True)

Sampling with weight column:

>>> df.sample(n=100, weights='weight_col')
>>> df.sample(n=100, weights='weight_col', replace=True)

Stratified sampling. Note that currently we do not support stratified sampling with replacement.

>>> df.sample(strata='category', frac={'Iris Setosa': 0.5, 'Iris Versicolour': 0.4})
select(*fields, **kw)[source]

Project columns. Remember to avoid column name conflicts.

Parameters:
  • fields – columns to project

  • kw – columns and their names to project

Returns:

new collection

Return type:

odps.df.expr.expressions.CollectionExpr

select_features(*args, **kwargs)

Select one or more fields as feature fields.

Return type:

DataFrame

setdiff(*rights, **kwargs)

Exclude data from a collection, like except clause in SQL. All collections involved should have same schema.

Parameters:
  • left – collection to drop data from

  • rights – collection or list of collections

  • distinct – if True, duplicate entries are removed from the result

Returns:

collection

Examples:

>>> import pandas as pd
>>> df1 = DataFrame(pd.DataFrame({'a': [1, 2, 3, 3, 3], 'b': [1, 2, 3, 3, 3]}))
>>> df2 = DataFrame(pd.DataFrame({'a': [1, 3], 'b': [1, 3]}))
>>> df1.setdiff(df2)
   a  b
0  2  2
1  3  3
2  3  3
>>> df1.setdiff(df2, distinct=True)
   a  b
0  2  2
size()

Number of records.

Parameters:

expr

Returns:

skew()

Calculate skewness of the sequence

Parameters:

expr

Returns:

sort(by, ascending=True)

Sort the collection by values. sort is an alias name for sort_values

Parameters:
  • expr – collection

  • by – the sequence or sequences to sort

  • ascending – Sort ascending vs. descending. Specify a list for multiple sort orders. If this is a list of bools, it must match the length of by

Returns:

Sorted collection

Example:

>>> df.sort_values(['name', 'id'])  # 1
>>> df.sort(['name', 'id'], ascending=False)  # 2
>>> df.sort(['name', 'id'], ascending=[False, True])  # 3
>>> df.sort([-df.name, df.id])  # 4, equal to #3
sort_values(by, ascending=True)

Sort the collection by values. sort is an alias name for sort_values

Parameters:
  • expr – collection

  • by – the sequence or sequences to sort

  • ascending – Sort ascending vs. descending. Specify a list for multiple sort orders. If this is a list of bools, it must match the length of by

Returns:

Sorted collection

Example:

>>> df.sort_values(['name', 'id'])  # 1
>>> df.sort(['name', 'id'], ascending=False)  # 2
>>> df.sort(['name', 'id'], ascending=[False, True])  # 3
>>> df.sort([-df.name, df.id])  # 4, equal to #3
split(frac, seed=None)

Split the current DataFrame into two DataFrame objects with a certain ratio.

Parameters:

frac (float) – Split ratio

Returns:

two split DataFrame objects
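
Example (a minimal sketch; the 0.6/0.4 ratio and the seed are arbitrary):

>>> train, test = df.split(0.6, seed=42)  # roughly 60% of rows go to train, the rest to test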

std(**kw)

Standard deviation.

Parameters:
  • expr

  • kw

Returns:

std_scale(columns=None, with_means=True, with_std=True, preserve=False, suffix='_scaled', group=None)

Resize a data frame by mean and standard deviation.

Parameters:
  • expr (DataFrame) – Input DataFrame

  • with_means (bool) – Determine whether the output will be subtracted by means

  • with_std (bool) – Determine whether the output will be divided by standard deviations

  • preserve (bool) – Determine whether input data should be kept. If True, scaled input data will be appended to the data frame with suffix

  • columns – Column names to resize. If set to None, float or int-typed columns will be normalized if the column is not specified as a group column.

  • group – determine scale groups. Scaling will be done in each group separately.

  • suffix (str) – column suffix to be appended to the scaled columns.

Returns:

resized data frame

Return type:

DataFrame
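
Example (a minimal sketch; ``fid`` is an assumed numeric column):

>>> df.std_scale(columns=['fid'])  # subtract the mean and divide by the standard deviation
>>> df.std_scale(columns=['fid'], with_means=False)  # only divide by the standard deviation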

sum()

Sum value

Parameters:

expr

Returns:

switch(*args, **kw)

Similar to the case-when in SQL. Refer to the example below

Parameters:
  • expr

  • args

  • kw

Returns:

sequence or scalar

Example:

>>> # if df.id == 3 then df.name
>>> # elif df.id == df.fid.abs() then df.name + 'test'
>>> # default: 'test'
>>> df.id.switch(3, df.name, df.fid.abs(), df.name + 'test', default='test')
tail(n=None, **kwargs)[source]

Return the last n rows. Execute at once.

Parameters:

n

Returns:

result frame

Return type:

odps.df.backends.frame.ResultFrame

to_kv(columns=None, kv_delim=':', item_delim=',', kv_name='kv_col')

Merge values in specified columns into a key-value represented column.

Parameters:
  • expr (DataFrame) – input DataFrame

  • columns – the columns to be merged.

  • kv_delim (str) – delimiter between key and value.

  • item_delim (str) – delimiter between key-value pairs.

  • kv_col (str) – name of the new key-value column

Returns:

converted data frame

Return type:

DataFrame

Example:

>>> df
    name   k1   k2   k3   k5    k7   k9
0  name1  1.0  3.0  Nan  10.0  Nan  Nan
1  name2  Nan  3.0  5.1  Nan   Nan  Nan
2  name3  7.1  Nan  Nan  Nan   8.2  Nan
3  name4  Nan  1.2  1.5  Nan   Nan  Nan
4  name5  Nan  1.0  Nan  Nan   Nan  1.1
>>> table = df.to_kv(columns=['k1', 'k2', 'k3', 'k5', 'k7', 'k9'], kv_delim='=')
>>> table
    name   kv_col
0  name1  k1=1.0,k2=3.0,k5=10.0
1  name2  k2=3.0,k3=5.1
2  name3  k1=7.1,k7=8.2
3  name4  k2=1.2,k3=1.5
4  name5  k2=1.0,k9=1.1
to_pandas(wrap=False, **kwargs)[source]

Convert to pandas DataFrame. Execute at once.

Parameters:

wrap – if True, wrap the pandas DataFrame into a PyODPS DataFrame

Returns:

pandas DataFrame
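
Example (a minimal sketch):

>>> pd_df = df.to_pandas()  # a native pandas DataFrame
>>> wrapped = df.to_pandas(wrap=True)  # a PyODPS DataFrame backed by pandas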

tolist(**kwargs)

Pack all data in the sequence into a list.

Parameters:
  • expr

  • unique – make every element in the sequence unique

Returns:

union(right, distinct=False)

Union two collections.

Parameters:
  • left – left collection

  • right – right collection

  • distinct

Returns:

collection

Example:

>>> df['name', 'id'].union(df2['id', 'name'])
var(**kw)

Variance

Parameters:
  • expr

  • ddof – degree of freedom

  • kw

Returns:

verify()

Verify if this expression can be compiled into ODPS SQL.

Returns:

True if compilation succeed else False

Return type:

bool

view()[source]

Clone a same collection. useful for self-join.

Returns:

weight_field(f)

Select one field as the weight field.

Note that this field will be excluded from feature fields.

Parameters:

f (str) – Selected weight field

Return type:

DataFrame

class odps.df.SequenceExpr(*args, **kwargs)[source]

Sequence represents one-dimensional data.

ast()

Return the AST string.

Returns:

AST tree

Return type:

str

astype(data_type)[source]

Cast to a new data type.

Parameters:

data_type – the new data type

Returns:

casted sequence

Example:

>>> df.id.astype('float')
between(left, right, inclusive=True)

Return a boolean sequence or scalar showing whether each element is between left and right.

Parameters:
  • expr – sequence or scalar

  • left – left value

  • right – right value

  • inclusive – if true, will be left <= expr <= right, else will be left < expr < right

Returns:

boolean sequence or scalar
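
Example (a minimal sketch on an assumed numeric column ``id``):

>>> df[df.id.between(1, 3)]  # 1 <= id <= 3
>>> df[df.id.between(1, 3, inclusive=False)]  # 1 < id < 3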

compile()

Compile this expression into an ODPS SQL

Returns:

compiled DAG

Return type:

str

concat(rights, distinct=False, axis=0)

Concat collections.

Parameters:
  • left – left collection

  • rights – right collections, can be a DataFrame object or a list of DataFrames

  • distinct – whether to remove duplicate entries. only available when axis == 0

  • axis – when axis == 0, the DataFrames are merged vertically, otherwise horizontally.

Returns:

collection

Note that axis==1 can only be used under Pandas DataFrames or XFlow.

Example:

>>> df['name', 'id'].concat(df2['score'], axis=1)
continuous()

Set sequence to be continuous.

Return type:

Column

Example:

>>> # Table schema is create table test(f1 double, f2 string)
>>> # Original continuity: f1=DISCRETE, f2=DISCRETE
>>> # Now we want to set ``f1`` and ``f2`` into continuous
>>> new_ds = df.continuous('f1 f2')
count()

Number of values.

Parameters:

expr

Returns:

cut(bins, right=True, labels=None, include_lowest=False, include_under=False, include_over=False)

Return indices of half-open bins to which each value of expr belongs.

Parameters:
  • expr – sequence or scalar

  • bins – list of scalars

  • right – indicates whether the bins include the rightmost edge or not. If right == True (the default), then the bins [1, 2, 3, 4] indicate (1, 2], (2, 3], (3, 4]

  • labels – Used as labels for the resulting bins. Must be of the same length as the resulting bins.

  • include_lowest – Whether the first interval should be left-inclusive or not.

  • include_under – include the bin below the leftmost edge or not

  • include_over – include the bin above the rightmost edge or not

Returns:

sequence or scalar
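
Example (a minimal sketch; the bin edges and labels are arbitrary):

>>> df.id.cut([0, 10, 20, 30], labels=['small', 'medium', 'large'])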

discrete()

Set sequence to be discrete.

Return type:

Column

Example:

>>> # Table schema is create table test(f1 double, f2 string)
>>> # Original continuity: f1=CONTINUOUS, f2=CONTINUOUS
>>> # Now we want to set ``f1`` and ``f2`` to discrete
>>> new_ds = df.discrete('f1 f2')
drop(data, axis=0, columns=None)

Drop data from a DataFrame.

Parameters:
  • expr – collection to drop data from

  • data – data to be removed

  • axis – 0 for deleting rows, 1 for columns.

  • columns – columns of data to select, only useful when axis == 0

Returns:

collection

Example:

>>> import pandas as pd
>>> df1 = DataFrame(pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]}))
>>> df2 = DataFrame(pd.DataFrame({'a': [2, 3], 'b': [5, 7]}))
>>> df1.drop(df2)
   a  b  c
0  1  4  7
1  3  6  9
>>> df1.drop(df2, columns='a')
   a  b  c
0  1  4  7
>>> df1.drop(['a'], axis=1)
   b  c
0  4  7
1  5  8
2  6  9
>>> df1.drop(df2, axis=1)
   c
0  7
1  8
2  9
property dtype

Return the data type. Available types: int8, int16, int32, int64, float32, float64, boolean, string, decimal, datetime

Returns:

the data type

erase_key_value()

Erase key-value represented fields.

Return type:

Column

Example:

>>> new_ds = df.erase_key_value('f1 f2')
except_(*rights, **kwargs)

Exclude data from a collection, like except clause in SQL. All collections involved should have same schema.

Parameters:
  • left – collection to drop data from

  • rights – collection or list of collections

  • distinct – if True, duplicate entries are removed from the result

Returns:

collection

Examples:

>>> import pandas as pd
>>> df1 = DataFrame(pd.DataFrame({'a': [1, 2, 3, 3, 3], 'b': [1, 2, 3, 3, 3]}))
>>> df2 = DataFrame(pd.DataFrame({'a': [1, 3], 'b': [1, 3]}))
>>> df1.setdiff(df2)
   a  b
0  2  2
1  3  3
2  3  3
>>> df1.setdiff(df2, distinct=True)
   a  b
0  2  2
execute(**kwargs)
Parameters:
  • hints (dict) – settings for SQL, e.g. odps.sql.mapper.split.size

  • priority (int) – instance priority, 9 as default

  • running_cluster – cluster to run this instance

Returns:

execution result

Return type:

odps.df.backends.frame.ResultFrame

fillna(value)

Fill null with value.

Parameters:
  • expr – sequence or scalar

  • value – value to fill into

Returns:

sequence or scalar

hash(func=None)

Calculate the hash value.

Parameters:
  • expr

  • func – hash function

Returns:

head(n=None, **kwargs)[source]

Return first n rows. Execute at once.

Parameters:

n

Returns:

result frame

Return type:

odps.df.expr.expressions.CollectionExpr

hll_count(error_rate=0.01, splitter=None)

Calculate HyperLogLog count

Parameters:
  • expr

  • error_rate (float) – error rate

  • splitter – the splitter to split the column value

Returns:

sequence or scalar

Example:

>>> df = DataFrame(pd.DataFrame({'a': np.random.randint(100000, size=100000)}))
>>> df.a.hll_count()
63270
>>> df.a.nunique()
63250
inner_join(right, on=None, suffixes=('_x', '_y'), mapjoin=False, skewjoin=False)

Inner join two collections.

If on is not specified, the common fields of the left and right collections are used. suffixes means that if column names conflict, the suffixes will be added automatically. For example, if both left and right have a field named col, the joined collection will contain col_x and col_y.

Parameters:
  • left – left collection

  • right – right collection

  • on – fields to join on

  • suffixes – when column names conflict, the suffixes will be added to both columns.

Returns:

collection

Example:

>>> df.dtypes.names
['name', 'id']
>>> df2.dtypes.names
['name', 'id1']
>>> df.inner_join(df2)
>>> df.inner_join(df2, on='name')
>>> df.inner_join(df2, on=('id', 'id1'))
>>> df.inner_join(df2, on=['name', ('id', 'id1')])
>>> df.inner_join(df2, on=[df.name == df2.name, df.id == df2.id1])
intersect(*rights, **kwargs)

Calculate the intersection among datasets.

Parameters:
  • left – collection

  • rights – collection or list of collections

  • distinct – if True, duplicate entries are removed from the result

Returns:

collection

Examples:

>>> import pandas as pd
>>> df1 = DataFrame(pd.DataFrame({'a': [1, 2, 3, 3, 3], 'b': [1, 2, 3, 3, 3]}))
>>> df2 = DataFrame(pd.DataFrame({'a': [1, 3, 3], 'b': [1, 3, 3]}))
>>> df1.intersect(df2)
   a  b
0  1  1
1  3  3
2  3  3
>>> df1.intersect(df2, distinct=True)
   a  b
0  1  1
1  3  3
isin(values)

Return a boolean sequence or scalar showing whether each element is exactly contained in the passed values.

Parameters:
  • expr – sequence or scalar

  • values – list object or sequence

Returns:

boolean sequence or scalar
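
Example (a minimal sketch on an assumed string column ``name``):

>>> df[df.name.isin(['name1', 'name2'])]
>>> df[df.name.isin(df2.name)]  # values may also be another sequence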

isna()

Return a sequence or scalar according to the input indicating if the values are null.

Parameters:

expr – sequence or scalar

Returns:

sequence or scalar

isnull()

Return a sequence or scalar according to the input indicating if the values are null.

Parameters:

expr – sequence or scalar

Returns:

sequence or scalar

join(right, on=None, how='inner', suffixes=('_x', '_y'), mapjoin=False, skewjoin=False)

Join two collections.

If on is not specified, the common fields of the left and right collections are used. suffixes means that if column names conflict, the suffixes will be added automatically. For example, if both left and right have a field named col, the joined collection will contain col_x and col_y.

Parameters:
  • left – left collection

  • right – right collection

  • on – fields to join on

  • how – ‘inner’, ‘left’, ‘right’, or ‘outer’

  • suffixes – when column names conflict, the suffixes will be added to both columns.

  • mapjoin – set use mapjoin or not, default value False.

  • skewjoin – set use of skewjoin or not, default value False. Can specify True if the collection is skew, or a list specifying columns with skew values, or a list of dicts specifying skew combinations.

Returns:

collection

Example:

>>> df.dtypes.names
['name', 'id']
>>> df2.dtypes.names
['name', 'id1']
>>> df.join(df2)
>>> df.join(df2, on='name')
>>> df.join(df2, on=('id', 'id1'))
>>> df.join(df2, on=['name', ('id', 'id1')])
>>> df.join(df2, on=[df.name == df2.name, df.id == df2.id1])
>>> df.join(df2, mapjoin=False)
>>> df.join(df2, skewjoin=True)
>>> df.join(df2, skewjoin=["c0", "c1"])
>>> df.join(df2, skewjoin=[{"c0": 1, "c1": "2"}, {"c0": 3, "c1": "4"}])
key_value(**kwargs)

Set fields to be key-value represented.

Return type:

Column

Example:

>>> new_ds = df.key_value('f1 f2', kv=':', item=',')
left_join(right, on=None, suffixes=('_x', '_y'), mapjoin=False, merge_columns=None, skewjoin=False)

Left join two collections.

If on is not specified, the common fields of the left and right collections are used. suffixes means that if column names conflict, the suffixes will be added automatically. For example, if both left and right have a field named col, the joined collection will contain col_x and col_y.

Parameters:
  • left – left collection

  • right – right collection

  • on – fields to join on

  • suffixes – when column names conflict, the suffixes will be added to both columns.

  • mapjoin – set use mapjoin or not, default value False.

  • merge_columns – whether to merge columns with the same name into one column without suffix. If the value is True, columns in the predicate with same names will be merged, with non-null value. If the value is ‘left’ or ‘right’, the values of predicates on the left / right collection will be taken. You can also pass a dictionary to describe the behavior of each column, such as { ‘a’: ‘auto’, ‘b’: ‘left’ }.

Returns:

collection

Example:

>>> df.dtypes.names
['name', 'id']
>>> df2.dtypes.names
['name', 'id1']
>>> df.left_join(df2)
>>> df.left_join(df2, on='name')
>>> df.left_join(df2, on=('id', 'id1'))
>>> df.left_join(df2, on=['name', ('id', 'id1')])
>>> df.left_join(df2, on=[df.name == df2.name, df.id == df2.id1])
map(func, rtype=None, resources=None, args=(), **kwargs)

Call func on each element of this sequence.

Parameters:
  • func – lambda, function, odps.models.Function, or str which is the name of odps.models.Function

  • rtype – if not provided, will be the dtype of this sequence

Returns:

a new sequence

Example:

>>> df.id.map(lambda x: x + 1)
max()

Max value

Parameters:

expr

Returns:

min()

Min value

Parameters:

expr

Returns:

notin(values)

Return a boolean sequence or scalar showing whether each element is not contained in the passed values.

Parameters:
  • expr – sequence or scalar

  • values – list object or sequence

Returns:

boolean sequence or scalar

notna()

Return a sequence or scalar according to the input indicating if the values are not null.

Parameters:

expr – sequence or scalar

Returns:

sequence or scalar

notnull()

Return a sequence or scalar according to the input indicating if the values are not null.

Parameters:

expr – sequence or scalar

Returns:

sequence or scalar

nunique()

The distinct count.

Parameters:

expr

Returns:

outer_join(right, on=None, suffixes=('_x', '_y'), mapjoin=False, merge_columns=None, skewjoin=False)

Outer join two collections.

If on is not specified, the common fields of the left and right collections are used. suffixes means that if column names conflict, the suffixes will be added automatically. For example, if both left and right have a field named col, the joined collection will contain col_x and col_y.

Parameters:
  • left – left collection

  • right – right collection

  • on – fields to join on

  • suffixes – when column names conflict, the suffixes will be added to both columns.

  • mapjoin – set use mapjoin or not, default value False.

  • merge_columns – whether to merge columns with the same name into one column without suffix. If the value is True, columns in the predicate with same names will be merged, with non-null value. If the value is ‘left’ or ‘right’, the values of predicates on the left / right collection will be taken. You can also pass a dictionary to describe the behavior of each column, such as { ‘a’: ‘auto’, ‘b’: ‘left’ }.

Returns:

collection

Example:

>>> df.dtypes.names
['name', 'id']
>>> df2.dtypes.names
['name', 'id1']
>>> df.outer_join(df2)
>>> df.outer_join(df2, on='name')
>>> df.outer_join(df2, on=('id', 'id1'))
>>> df.outer_join(df2, on=['name', ('id', 'id1')])
>>> df.outer_join(df2, on=[df.name == df2.name, df.id == df2.id1])
persist(name, partitions=None, partition=None, lifecycle=None, project=None, **kwargs)

Persist the execution into a new table. If partitions not specified, will create a new table without partitions if the table does not exist, and insert the SQL result into it. If partitions are specified, they will be the partition fields of the new table. If partition is specified, the data will be inserted into the exact partition of the table.

Parameters:
  • name – table name

  • partitions (list) – list of string, the partition fields

  • partition (string or PartitionSpec) – persist to a specified partition

  • lifecycle (int) – table lifecycle. If absent, options.lifecycle will be used.

  • project – project name, if not provided, will be the default project

  • hints (dict) – settings for SQL, e.g. odps.sql.mapper.split.size

  • priority (int) – instance priority, 9 as default

  • running_cluster – cluster to run this instance

  • overwrite (bool) – overwrite the table, True as default

  • drop_table (bool) – drop table if exists, False as default

  • create_table (bool) – create table first if not exists, True as default

  • drop_partition (bool) – drop partition if exists, False as default

  • create_partition (bool) – create partition if not exists, None as default

  • cast (bool) – cast all columns’ types to those of the existing table, False as default

Returns:

odps.df.DataFrame

Example:

>>> df = df['name', 'id', 'ds']
>>> df.persist('odps_new_table')
>>> df.persist('odps_new_table', partition='pt=test')
>>> df.persist('odps_new_table', partitions=['ds'])
right_join(right, on=None, suffixes=('_x', '_y'), mapjoin=False, merge_columns=None, skewjoin=False)

Right join two collections.

If on is not specified, the common fields of the left and right collections are used. suffixes means that if column names conflict, the suffixes will be added automatically. For example, if both left and right have a field named col, the joined collection will contain col_x and col_y.

Parameters:
  • left – left collection

  • right – right collection

  • on – fields to join on

  • suffixes – when column names conflict, the suffixes will be added to both columns.

  • mapjoin – set use mapjoin or not, default value False.

  • merge_columns – whether to merge columns with the same name into one column without suffix. If the value is True, columns in the predicate with same names will be merged, with non-null value. If the value is ‘left’ or ‘right’, the values of predicates on the left / right collection will be taken. You can also pass a dictionary to describe the behavior of each column, such as { ‘a’: ‘auto’, ‘b’: ‘left’ }.

Returns:

collection

Example:

>>> df.dtypes.names
['name', 'id']
>>> df2.dtypes.names
['name', 'id1']
>>> df.right_join(df2)
>>> df.right_join(df2, on='name')
>>> df.right_join(df2, on=('id', 'id1'))
>>> df.right_join(df2, on=['name', ('id', 'id1')])
>>> df.right_join(df2, on=[df.name == df2.name, df.id == df2.id1])
role(role_name)

Set role of current column

Parameters:

role_name – name of the role to be selected.

Returns:

setdiff(*rights, **kwargs)

Exclude data from a collection, like except clause in SQL. All collections involved should have same schema.

Parameters:
  • left – collection to drop data from

  • rights – collection or list of collections

  • distinct – if True, duplicate entries are removed from the result

Returns:

collection

Examples:

>>> import pandas as pd
>>> df1 = DataFrame(pd.DataFrame({'a': [1, 2, 3, 3, 3], 'b': [1, 2, 3, 3, 3]}))
>>> df2 = DataFrame(pd.DataFrame({'a': [1, 3], 'b': [1, 3]}))
>>> df1.setdiff(df2)
   a  b
0  2  2
1  3  3
2  3  3
>>> df1.setdiff(df2, distinct=True)
   a  b
0  2  2
size()

Number of values.

Parameters:

expr

Returns:

switch(*args, **kw)

Similar to the case-when in SQL. Refer to the example below

Parameters:
  • expr

  • args

  • kw

Returns:

sequence or scalar

Example:

>>> # if df.id == 3 then df.name
>>> # elif df.id == df.fid.abs() then df.name + 'test'
>>> # default: 'test'
>>> df.id.switch(3, df.name, df.fid.abs(), df.name + 'test', default='test')
tail(n=None, **kwargs)[source]

Return the last n rows. Execute at once.

Parameters:

n

Returns:

to_pandas(wrap=False, **kwargs)[source]

Convert to pandas Series. Execute at once.

Parameters:

wrap – if True, wrap the pandas DataFrame into a PyODPS DataFrame

Returns:

pandas Series

tolist(**kwargs)

Pack all data in the sequence into a list.

Parameters:
  • expr

  • unique – make every element in the sequence unique

Returns:

union(right, distinct=False)

Union two collections.

Parameters:
  • left – left collection

  • right – right collection

  • distinct

Returns:

collection

Example:

>>> df['name', 'id'].union(df2['id', 'name'])
value_counts(sort=True, ascending=False, dropna=False)

Return object containing counts of unique values.

The resulting object will be in descending order so that the first element is the most frequently occurring element. Exclude NA values by default

Parameters:
  • expr – sequence

  • sort (bool) – whether to sort the result

  • dropna – Don’t include counts of None, default False

Returns:

collection with two columns

Return type:

odps.df.expr.expressions.CollectionExpr
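
Example (a minimal sketch on an assumed string column ``name``):

>>> df.name.value_counts()[:10]  # the 10 most frequent names and their counts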

verify()

Verify if this expression can be compiled into ODPS SQL.

Returns:

True if compilation succeed else False

Return type:

bool

class odps.df.expr.expressions.Int64SequenceExpr(*args, **kwargs)[source]
kurt()

Calculate kurtosis of the sequence

Parameters:

expr

Returns:

kurtosis()

Calculate kurtosis of the sequence

Parameters:

expr

Returns:

mean()

Arithmetic mean.

Parameters:

expr

Returns:

median()

Median value.

Parameters:

expr

Returns:

moment(order, central=False)

Calculate the n-th order moment of the sequence

Parameters:
  • expr

  • order – moment order, must be an integer

  • central – if central moments are to be computed.

Returns:

quantile(prob=None, **kw)

Percentile value.

Parameters:
  • expr

  • prob – probability or list of probabilities, in [0, 1]

Returns:

skew()

Calculate skewness of the sequence

Parameters:

expr

Returns:

std(**kw)

Standard deviation.

Parameters:
  • expr

  • kw

Returns:

sum()

Sum value

Parameters:

expr

Returns:

to_datetime()

Return a sequence or scalar that is the datetime value of the current numeric sequence or scalar.

Parameters:

expr – sequence or scalar

Returns:

sequence or scalar

var(**kw)

Variance

Parameters:
  • expr

  • ddof – degree of freedom

  • kw

Returns:

class odps.df.expr.expressions.StringSequenceExpr(*args, **kwargs)[source]
capitalize()

Convert strings in the Sequence or string scalar to be capitalized. Equivalent to str.capitalize().

Parameters:

expr

Returns:

sequence or scalar

cat(others=None, sep=None, na_rep=None)

Concatenate strings in sequence with given separator

Parameters:
  • expr

  • others – other sequences

  • sep – string or None, default None

  • na_rep – string or None, default None; if None, NA values in the sequence are ignored

Returns:

contains(pat, case=True, flags=0, regex=True)

Return a boolean sequence indicating whether the given pattern/regex is contained in each string in the sequence

Parameters:
  • expr – sequence or scalar

  • pat – Character sequence or regular expression

  • case (bool) – If True, case sensitive

  • flags – re module flags, e.g. re.IGNORECASE

  • regex – If True use regex, otherwise use string finder

Returns:

sequence or scalar
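
Example (a minimal sketch on an assumed string column ``name``):

>>> df[df.name.contains('name1')]  # regex match by default
>>> df[df.name.contains('NAME1', case=False, regex=False)]  # plain, case-insensitive substring search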

count(*args, **kwargs)

Number of values.

Parameters:

expr

Returns:

endswith(pat)

Return boolean sequence or scalar indicating whether each string in the sequence or scalar ends with passed pattern. Equivalent to str.endswith().

Parameters:
  • expr

  • pat – Character sequence

Returns:

sequence or scalar

extract(pat, flags=0, group=0)

Find the group in each string in the sequence using the passed regular expression.

Parameters:
  • expr

  • pat – Pattern or regular expression

  • flags – re module, e.g. re.IGNORECASE

  • group – if None, group 0 is used

Returns:

sequence or scalar

find(sub, start=0, end=None)

Return the lowest index in each string in the sequence or scalar where the substring is fully contained between [start:end]. Return -1 on failure. Equivalent to standard str.find().

Parameters:
  • expr

  • sub – substring being searched

  • start – left edge index

  • end – right edge index

Returns:

sequence or scalar

get(index)

Extract element from lists, tuples, or strings in each element in the sequence or scalar

Parameters:
  • expr

  • index – Integer index(location)

Returns:

sequence or scalar

isalnum()

Check whether all characters in each string in the sequence or scalar are alphanumeric. Equivalent to str.isalnum().

Parameters:

expr

Returns:

boolean sequence or scalar

isalpha()

Check whether all characters in each string in the sequence or scalar are alphabetic. Equivalent to str.isalpha().

Parameters:

expr

Returns:

boolean sequence or scalar

isdecimal()

Check whether all characters in each string in the sequence or scalar are decimal. Equivalent to str.isdecimal().

Parameters:

expr

Returns:

boolean sequence or scalar

isdigit()

Check whether all characters in each string in the sequence or scalar are digits. Equivalent to str.isdigit().

Parameters:

expr

Returns:

boolean sequence or scalar

islower()

Check whether all characters in each string in the sequence or scalar are lowercase. Equivalent to str.islower().

Parameters:

expr

Returns:

boolean sequence or scalar

isnumeric()

Check whether all characters in each string in the sequence or scalar are numeric. Equivalent to str.isnumeric().

Parameters:

expr

Returns:

boolean sequence or scalar

isspace()

Check whether all characters in each string in the sequence or scalar are whitespace. Equivalent to str.isspace().

Parameters:

expr

Returns:

boolean sequence or scalar

istitle()

Check whether all characters in each string in the sequence or scalar are titlecase. Equivalent to str.istitle().

Parameters:

expr

Returns:

boolean sequence or scalar

isupper()

Check whether all characters in each string in the sequence or scalar are uppercase. Equivalent to str.isupper().

Parameters:

expr

Returns:

boolean sequence or scalar

len()

Compute length of each string in the sequence or scalar

Parameters:

expr

Returns:

lengths

ljust(width, fillchar=' ')

Fill the right side of strings in the sequence or scalar with an additional character. Equivalent to str.ljust().

Parameters:
  • expr

  • width – Minimum width of resulting string; additional characters will be filled with fillchar

  • fillchar – Additional character for filling, default is whitespace.

Returns:

sequence or scalar

lower()

Convert strings in the sequence or scalar to lowercase. Equivalent to str.lower().

Parameters:

expr

Returns:

sequence or scalar

lstrip(to_strip=None)

Strip whitespace (including newlines) from the left side of each string in the sequence or scalar. Equivalent to str.lstrip().

Parameters:
  • expr

  • to_strip

Returns:

sequence or scalar

pad(width, side='left', fillchar=' ')

Pad strings in the sequence or scalar with an additional character to specified side.

Parameters:
  • expr

  • width – Minimum width of resulting string; additional characters will be filled with spaces

  • side – {‘left’, ‘right’, ‘both’}, default ‘left’

  • fillchar – Additional character for filling, default is whitespace

Returns:

sequence or scalar

repeat(repeats)

Duplicate each string in the sequence or scalar by indicated number of times.

Parameters:
  • expr

  • repeats – times

Returns:

sequence or scalar

replace(pat, repl, n=-1, case=True, flags=0, regex=True)

Replace occurrence of pattern/regex in the sequence or scalar with some other string. Equivalent to str.replace()

Parameters:
  • expr

  • pat – Character sequence or regular expression

  • repl – Replacement

  • n – Number of replacements to make from start

  • case – if True, case sensitive

  • flags – re module flag, e.g. re.IGNORECASE

Returns:

sequence or scalar
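
Example (a sketch, assuming df has a string column name):

>>> df.name.replace('foo', 'bar', regex=False)      # literal replacement
>>> df.name.replace(r'\s+', ' ', regex=True)        # collapse runs of whitespace via regex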

rfind(sub, start=0, end=None)

Return the highest index in each string in the sequence or scalar where the substring is fully contained within [start:end]. Return -1 on failure. Equivalent to standard str.rfind().

Parameters:
  • expr

  • sub

  • start

  • end

Returns:

sequence or scalar

rjust(width, fillchar=' ')

Fill the left side of strings in the sequence or scalar with an additional character. Equivalent to str.rjust().

Parameters:
  • expr

  • width – Minimum width of resulting string; additional characters will be filled with fillchar

  • fillchar – Additional character for filling, default is whitespace.

Returns:

sequence or scalar

rstrip(to_strip=None)

Strip whitespace (including newlines) from the right side of each string in the sequence or scalar. Equivalent to str.rstrip().

Parameters:
  • expr

  • to_strip

Returns:

sequence or scalar

slice(start=None, stop=None, step=None)

Slice substrings from each element in the sequence or scalar

Parameters:
  • expr

  • start – int or None

  • stop – int or None

  • step – int or None

Returns:

sliced

split(pat=None, n=-1)

Split each string (a la re.split) in the sequence or scalar by the given pattern, propagating NA values. Equivalent to str.split().

Parameters:
  • expr

  • pat – Separator to split on. If None, splits on whitespace

  • n – not supported right now

Returns:

list sequence or scalar
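
Example (illustrative only; the column tags is an assumption, holding strings like 'a,b,c'):

>>> df.tags.split(',')  # list-typed sequence, e.g. ['a', 'b', 'c'] per row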

startswith(pat)

Return boolean sequence or scalar indicating whether each string in the sequence or scalar starts with passed pattern. Equivalent to str.startswith().

Parameters:
  • expr

  • pat – Character sequence

Returns:

sequence or scalar

strip(to_strip=None)

Strip whitespace (including newlines) from each string in the sequence or scalar from left and right sides. Equivalent to str.strip().

Parameters:
  • expr

  • to_strip

Returns:

sequence or scalar

strptime(date_format)

Return datetimes parsed according to date_format, which supports the same format strings as the Python standard library. Details of the format can be found in the Python strftime/strptime documentation.

Parameters:
  • expr

  • date_format (str) – date format string (e.g. “%Y-%m-%d”)

Returns:
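
Example (a sketch; assumes df.ds holds date strings like '2023-01-31'):

>>> df.ds.strptime('%Y-%m-%d')  # parse into a datetime sequence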

sum()

Sum value

Parameters:

expr

Returns:

swapcase()

Convert strings in the sequence or scalar to be swapcased. Equivalent to str.swapcase().

Parameters:

expr

Returns:

converted

title()

Convert strings in the sequence or scalar to titlecase. Equivalent to str.title().

Parameters:

expr

Returns:

converted

todict(item_delim=',', kv_delim='=')

Convert the string sequence / expr into a string dict given item and key-value delimiters.

Parameters:
  • expr

  • item_delim – delimiter between data items

  • kv_delim – delimiter between keys and values

Returns:

dict sequence or scalar
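
Example (illustrative only; the column kv is an assumption, holding strings like 'a=1,b=2'):

>>> df.kv.todict(item_delim=',', kv_delim='=')  # dict-typed sequence, e.g. {'a': '1', 'b': '2'} per row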

upper()

Convert strings in the sequence or scalar to uppercase. Equivalent to str.upper().

Parameters:

expr

Returns:

sequence or scalar

zfill(width)

Fill the left side of strings in the sequence or scalar with 0. Equivalent to str.zfill().

Parameters:
  • expr

  • width – Minimum width of resulting string; additional characters will be filled with 0

Returns:

filled
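
Example (a sketch; the string column code is an assumption):

>>> df.code.zfill(6)  # '42' becomes '000042'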

class odps.df.Scalar(*args, **kwargs)[source]

Represents the scalar type.

Parameters:
  • _value – value of the scalar

  • _value_type – value type of the scalar

Example:

>>> df[df, Scalar(4).rename('append_const')]
ast()

Return the AST string.

Returns:

AST tree

Return type:

str

between(left, right, inclusive=True)

Return a boolean sequence or scalar showing whether each element is between left and right.

Parameters:
  • expr – sequence or scalar

  • left – left value

  • right – right value

  • inclusive – if True, the condition is left <= expr <= right; otherwise left < expr < right

Returns:

boolean sequence or scalar
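
Example (illustrative, assuming df has a numeric column id):

>>> df[df.id.between(10, 20)]                   # 10 <= id <= 20
>>> df[df.id.between(10, 20, inclusive=False)]  # 10 < id < 20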

compile()

Compile this expression into an ODPS SQL

Returns:

compiled DAG

Return type:

str

cut(bins, right=True, labels=None, include_lowest=False, include_under=False, include_over=False)

Return indices of half-open bins to which each value of expr belongs.

Parameters:
  • expr – sequence or scalar

  • bins – list of scalars

  • right – indicates whether the bins include the rightmost edge or not. If right == True (the default), then the bins [1, 2, 3, 4] indicate (1, 2], (2, 3], (3, 4]

  • labels – Used as labels for the resulting bins. Must be of the same length as the resulting bins.

  • include_lowest – Whether the first interval should be left-inclusive or not.

  • include_under – include the bin below the leftmost edge or not

  • include_over – include the bin above the rightmost edge or not

Returns:

sequence or scalar
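
Example (a sketch, assuming df has an integer column age):

>>> df.age.cut([0, 18, 35, 60], labels=['minor', 'young', 'middle'])  # three right-closed bins
>>> df.age.cut([0, 18, 35, 60], include_over=True, labels=['minor', 'young', 'middle', 'senior'])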

execute(**kwargs)
Parameters:
  • hints (dict) – settings for SQL, e.g. odps.sql.mapper.split.size

  • priority (int) – instance priority, 9 as default

  • running_cluster – cluster to run this instance

Returns:

execution result

Return type:

odps.df.backends.frame.ResultFrame

fillna(value)

Fill null with value.

Parameters:
  • expr – sequence or scalar

  • value – value to fill into

Returns:

sequence or scalar
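
Example (illustrative, assuming df has a numeric column fid that may contain nulls):

>>> df.fid.fillna(0)  # replace nulls with 0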

hash(func=None)

Calculate the hash value.

Parameters:
  • expr

  • func – hash function

Returns:

inner_join(right, on=None, suffixes=('_x', '_y'), mapjoin=False, skewjoin=False)

Inner join two collections.

If on is not specified, the common fields of the left and right collections are used. If column names conflict, suffixes are appended automatically; for example, if both left and right have a field named col, the joined collection will contain col_x and col_y.

Parameters:
  • left – left collection

  • right – right collection

  • on – fields to join on

  • suffixes – when names conflict, the suffixes will be added to both columns.

Returns:

collection

Example:

>>> df.dtypes.names
['name', 'id']
>>> df2.dtypes.names
['name', 'id1']
>>> df.inner_join(df2)
>>> df.inner_join(df2, on='name')
>>> df.inner_join(df2, on=('id', 'id1'))
>>> df.inner_join(df2, on=['name', ('id', 'id1')])
>>> df.inner_join(df2, on=[df.name == df2.name, df.id == df2.id1])
isin(values)

Return a boolean sequence or scalar showing whether each element is exactly contained in the passed values.

Parameters:
  • expr – sequence or scalar

  • values – list object or sequence

Returns:

boolean sequence or scalar
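
Example (a sketch, assuming df has a string column name):

>>> df[df.name.isin(['Alice', 'Bob'])]  # keep rows whose name is in the given list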

isna()

Return a sequence or scalar according to the input indicating if the values are null.

Parameters:

expr – sequence or scalar

Returns:

sequence or scalar

isnull()

Return a sequence or scalar according to the input indicating if the values are null.

Parameters:

expr – sequence or scalar

Returns:

sequence or scalar

join(right, on=None, how='inner', suffixes=('_x', '_y'), mapjoin=False, skewjoin=False)

Join two collections.

If on is not specified, the common fields of the left and right collections are used. If column names conflict, suffixes are appended automatically; for example, if both left and right have a field named col, the joined collection will contain col_x and col_y.

Parameters:
  • left – left collection

  • right – right collection

  • on – fields to join on

  • how – ‘inner’, ‘left’, ‘right’, or ‘outer’

  • suffixes – when names conflict, the suffixes will be added to both columns.

  • mapjoin – whether to use mapjoin, default value False.

  • skewjoin – whether to use skewjoin, default value False. Can be True if the collection is skewed, a list specifying columns with skew values, or a list of dicts specifying skew combinations.

Returns:

collection

Example:

>>> df.dtypes.names
['name', 'id']
>>> df2.dtypes.names
['name', 'id1']
>>> df.join(df2)
>>> df.join(df2, on='name')
>>> df.join(df2, on=('id', 'id1'))
>>> df.join(df2, on=['name', ('id', 'id1')])
>>> df.join(df2, on=[df.name == df2.name, df.id == df2.id1])
>>> df.join(df2, mapjoin=False)
>>> df.join(df2, skewjoin=True)
>>> df.join(df2, skewjoin=["c0", "c1"])
>>> df.join(df2, skewjoin=[{"c0": 1, "c1": "2"}, {"c0": 3, "c1": "4"}])
left_join(right, on=None, suffixes=('_x', '_y'), mapjoin=False, merge_columns=None, skewjoin=False)

Left join two collections.

If on is not specified, the common fields of the left and right collections are used. If column names conflict, suffixes are appended automatically; for example, if both left and right have a field named col, the joined collection will contain col_x and col_y.

Parameters:
  • left – left collection

  • right – right collection

  • on – fields to join on

  • suffixes – when names conflict, the suffixes will be added to both columns.

  • mapjoin – whether to use mapjoin, default value False.

  • merge_columns – whether to merge columns with the same name into one column without suffix. If the value is True, columns in the predicate with the same names will be merged, taking the non-null value. If the value is ‘left’ or ‘right’, the values of predicates on the left / right collection will be taken. You can also pass a dictionary to describe the behavior of each column, such as { ‘a’: ‘auto’, ‘b’: ‘left’ }.

Returns:

collection

Example:

>>> df.dtypes.names
['name', 'id']
>>> df2.dtypes.names
['name', 'id1']
>>> df.left_join(df2)
>>> df.left_join(df2, on='name')
>>> df.left_join(df2, on=('id', 'id1'))
>>> df.left_join(df2, on=['name', ('id', 'id1')])
>>> df.left_join(df2, on=[df.name == df2.name, df.id == df2.id1])
map(func, rtype=None, resources=None, args=(), **kwargs)

Call func on each element of this sequence.

Parameters:
  • func – lambda, function, odps.models.Function, or str which is the name of an odps.models.Function

  • rtype – if not provided, will be the dtype of this sequence

Returns:

a new sequence

Example:

>>> df.id.map(lambda x: x + 1)
notin(values)

Return a boolean sequence or scalar showing whether each element is not contained in the passed values.

Parameters:
  • expr – sequence or scalar

  • values – list object or sequence

Returns:

boolean sequence or scalar

notna()

Return a sequence or scalar according to the input indicating if the values are not null.

Parameters:

expr – sequence or scalar

Returns:

sequence or scalar

notnull()

Return a sequence or scalar according to the input indicating if the values are not null.

Parameters:

expr – sequence or scalar

Returns:

sequence or scalar

outer_join(right, on=None, suffixes=('_x', '_y'), mapjoin=False, merge_columns=None, skewjoin=False)

Outer join two collections.

If on is not specified, the common fields of the left and right collections are used. If column names conflict, suffixes are appended automatically; for example, if both left and right have a field named col, the joined collection will contain col_x and col_y.

Parameters:
  • left – left collection

  • right – right collection

  • on – fields to join on

  • suffixes – when names conflict, the suffixes will be added to both columns.

  • mapjoin – whether to use mapjoin, default value False.

  • merge_columns – whether to merge columns with the same name into one column without suffix. If the value is True, columns in the predicate with the same names will be merged, taking the non-null value. If the value is ‘left’ or ‘right’, the values of predicates on the left / right collection will be taken. You can also pass a dictionary to describe the behavior of each column, such as { ‘a’: ‘auto’, ‘b’: ‘left’ }.

Returns:

collection

Example:

>>> df.dtypes.names
['name', 'id']
>>> df2.dtypes.names
['name', 'id1']
>>> df.outer_join(df2)
>>> df.outer_join(df2, on='name')
>>> df.outer_join(df2, on=('id', 'id1'))
>>> df.outer_join(df2, on=['name', ('id', 'id1')])
>>> df.outer_join(df2, on=[df.name == df2.name, df.id == df2.id1])
persist(name, partitions=None, partition=None, lifecycle=None, project=None, **kwargs)

Persist the execution result into a new table. If partitions is not specified, a new table without partitions will be created if it does not exist, and the SQL result will be inserted into it. If partitions is specified, those fields will be the partition fields of the new table. If partition is specified, the data will be inserted into that exact partition of the table.

Parameters:
  • name – table name

  • partitions (list) – list of string, the partition fields

  • partition (string or PartitionSpec) – persist to a specified partition

  • lifecycle (int) – table lifecycle. If absent, options.lifecycle will be used.

  • project – project name, if not provided, will be the default project

  • hints (dict) – settings for SQL, e.g. odps.sql.mapper.split.size

  • priority (int) – instance priority, 9 as default

  • running_cluster – cluster to run this instance

  • overwrite (bool) – overwrite the table, True as default

  • drop_table (bool) – drop table if exists, False as default

  • create_table (bool) – create the table first if it does not exist, True as default

  • drop_partition (bool) – drop partition if exists, False as default

  • create_partition (bool) – create partition if not exists, None as default

  • cast (bool) – cast all columns’ types to those of the existing table, False as default

Returns:

odps.df.DataFrame

Example:

>>> df = df['name', 'id', 'ds']
>>> df.persist('odps_new_table')
>>> df.persist('odps_new_table', partition='pt=test')
>>> df.persist('odps_new_table', partitions=['ds'])
right_join(right, on=None, suffixes=('_x', '_y'), mapjoin=False, merge_columns=None, skewjoin=False)

Right join two collections.

If on is not specified, the common fields of the left and right collections are used. If column names conflict, suffixes are appended automatically; for example, if both left and right have a field named col, the joined collection will contain col_x and col_y.

Parameters:
  • left – left collection

  • right – right collection

  • on – fields to join on

  • suffixes – when names conflict, the suffixes will be added to both columns.

  • mapjoin – whether to use mapjoin, default value False.

  • merge_columns – whether to merge columns with the same name into one column without suffix. If the value is True, columns in the predicate with the same names will be merged, taking the non-null value. If the value is ‘left’ or ‘right’, the values of predicates on the left / right collection will be taken. You can also pass a dictionary to describe the behavior of each column, such as { ‘a’: ‘auto’, ‘b’: ‘left’ }.

Returns:

collection

Example:

>>> df.dtypes.names
['name', 'id']
>>> df2.dtypes.names
['name', 'id1']
>>> df.right_join(df2)
>>> df.right_join(df2, on='name')
>>> df.right_join(df2, on=('id', 'id1'))
>>> df.right_join(df2, on=['name', ('id', 'id1')])
>>> df.right_join(df2, on=[df.name == df2.name, df.id == df2.id1])
switch(*args, **kw)

Similar to the case-when statement in SQL. Refer to the example below.

Parameters:
  • expr

  • args

  • kw

Returns:

sequence or scalar

Example:

>>> # if df.id == 3 then df.name
>>> # elif df.id == df.fid.abs() then df.name + 'test'
>>> # default: 'test'
>>> df.id.switch(3, df.name, df.fid.abs(), df.name + 'test', default='test')
verify()

Verify if this expression can be compiled into ODPS SQL.

Returns:

True if compilation succeed else False

Return type:

bool

odps.df.NullScalar(tp)[source]

Creates a Scalar representing typed None values.

Parameters:

tp – type of the scalar

Returns:

Scalar with None value
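
Example (an illustrative sketch; the column name null_col is hypothetical):

>>> df[df, NullScalar('string').rename('null_col')]  # append a typed NULL column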

class odps.df.RandomScalar(seed=None, **kw)[source]

Represents the random scalar type.

Parameters:

seed – random seed, None by default

Example:

>>> df[df, RandomScalar().rename('append_random')]
ast()

Return the AST string.

Returns:

AST tree

Return type:

str

between(left, right, inclusive=True)

Return a boolean sequence or scalar showing whether each element is between left and right.

Parameters:
  • expr – sequence or scalar

  • left – left value

  • right – right value

  • inclusive – if True, the condition is left <= expr <= right; otherwise left < expr < right

Returns:

boolean sequence or scalar

compile()

Compile this expression into an ODPS SQL

Returns:

compiled DAG

Return type:

str

cut(bins, right=True, labels=None, include_lowest=False, include_under=False, include_over=False)

Return indices of half-open bins to which each value of expr belongs.

Parameters:
  • expr – sequence or scalar

  • bins – list of scalars

  • right – indicates whether the bins include the rightmost edge or not. If right == True (the default), then the bins [1, 2, 3, 4] indicate (1, 2], (2, 3], (3, 4]

  • labels – Used as labels for the resulting bins. Must be of the same length as the resulting bins.

  • include_lowest – Whether the first interval should be left-inclusive or not.

  • include_under – include the bin below the leftmost edge or not

  • include_over – include the bin above the rightmost edge or not

Returns:

sequence or scalar

execute(**kwargs)
Parameters:
  • hints (dict) – settings for SQL, e.g. odps.sql.mapper.split.size

  • priority (int) – instance priority, 9 as default

  • running_cluster – cluster to run this instance

Returns:

execution result

Return type:

odps.df.backends.frame.ResultFrame

fillna(value)

Fill null with value.

Parameters:
  • expr – sequence or scalar

  • value – value to fill into

Returns:

sequence or scalar

hash(func=None)

Calculate the hash value.

Parameters:
  • expr

  • func – hash function

Returns:

inner_join(right, on=None, suffixes=('_x', '_y'), mapjoin=False, skewjoin=False)

Inner join two collections.

If on is not specified, the common fields of the left and right collections are used. If column names conflict, suffixes are appended automatically; for example, if both left and right have a field named col, the joined collection will contain col_x and col_y.

Parameters:
  • left – left collection

  • right – right collection

  • on – fields to join on

  • suffixes – when names conflict, the suffixes will be added to both columns.

Returns:

collection

Example:

>>> df.dtypes.names
['name', 'id']
>>> df2.dtypes.names
['name', 'id1']
>>> df.inner_join(df2)
>>> df.inner_join(df2, on='name')
>>> df.inner_join(df2, on=('id', 'id1'))
>>> df.inner_join(df2, on=['name', ('id', 'id1')])
>>> df.inner_join(df2, on=[df.name == df2.name, df.id == df2.id1])
isin(values)

Return a boolean sequence or scalar showing whether each element is exactly contained in the passed values.

Parameters:
  • expr – sequence or scalar

  • values – list object or sequence

Returns:

boolean sequence or scalar

isna()

Return a sequence or scalar according to the input indicating if the values are null.

Parameters:

expr – sequence or scalar

Returns:

sequence or scalar

isnull()

Return a sequence or scalar according to the input indicating if the values are null.

Parameters:

expr – sequence or scalar

Returns:

sequence or scalar

join(right, on=None, how='inner', suffixes=('_x', '_y'), mapjoin=False, skewjoin=False)

Join two collections.

If on is not specified, the common fields of the left and right collections are used. If column names conflict, suffixes are appended automatically; for example, if both left and right have a field named col, the joined collection will contain col_x and col_y.

Parameters:
  • left – left collection

  • right – right collection

  • on – fields to join on

  • how – ‘inner’, ‘left’, ‘right’, or ‘outer’

  • suffixes – when names conflict, the suffixes will be added to both columns.

  • mapjoin – whether to use mapjoin, default value False.

  • skewjoin – whether to use skewjoin, default value False. Can be True if the collection is skewed, a list specifying columns with skew values, or a list of dicts specifying skew combinations.

Returns:

collection

Example:

>>> df.dtypes.names
['name', 'id']
>>> df2.dtypes.names
['name', 'id1']
>>> df.join(df2)
>>> df.join(df2, on='name')
>>> df.join(df2, on=('id', 'id1'))
>>> df.join(df2, on=['name', ('id', 'id1')])
>>> df.join(df2, on=[df.name == df2.name, df.id == df2.id1])
>>> df.join(df2, mapjoin=False)
>>> df.join(df2, skewjoin=True)
>>> df.join(df2, skewjoin=["c0", "c1"])
>>> df.join(df2, skewjoin=[{"c0": 1, "c1": "2"}, {"c0": 3, "c1": "4"}])
left_join(right, on=None, suffixes=('_x', '_y'), mapjoin=False, merge_columns=None, skewjoin=False)

Left join two collections.

If on is not specified, the common fields of the left and right collections are used. If column names conflict, suffixes are appended automatically; for example, if both left and right have a field named col, the joined collection will contain col_x and col_y.

Parameters:
  • left – left collection

  • right – right collection

  • on – fields to join on

  • suffixes – when names conflict, the suffixes will be added to both columns.

  • mapjoin – whether to use mapjoin, default value False.

  • merge_columns – whether to merge columns with the same name into one column without suffix. If the value is True, columns in the predicate with the same names will be merged, taking the non-null value. If the value is ‘left’ or ‘right’, the values of predicates on the left / right collection will be taken. You can also pass a dictionary to describe the behavior of each column, such as { ‘a’: ‘auto’, ‘b’: ‘left’ }.

Returns:

collection

Example:

>>> df.dtypes.names
['name', 'id']
>>> df2.dtypes.names
['name', 'id1']
>>> df.left_join(df2)
>>> df.left_join(df2, on='name')
>>> df.left_join(df2, on=('id', 'id1'))
>>> df.left_join(df2, on=['name', ('id', 'id1')])
>>> df.left_join(df2, on=[df.name == df2.name, df.id == df2.id1])
map(func, rtype=None, resources=None, args=(), **kwargs)

Call func on each element of this sequence.

Parameters:
  • func – lambda, function, odps.models.Function, or str which is the name of an odps.models.Function

  • rtype – if not provided, will be the dtype of this sequence

Returns:

a new sequence

Example:

>>> df.id.map(lambda x: x + 1)
notin(values)

Return a boolean sequence or scalar showing whether each element is not contained in the passed values.

Parameters:
  • expr – sequence or scalar

  • values – list object or sequence

Returns:

boolean sequence or scalar

notna()

Return a sequence or scalar according to the input indicating if the values are not null.

Parameters:

expr – sequence or scalar

Returns:

sequence or scalar

notnull()

Return a sequence or scalar according to the input indicating if the values are not null.

Parameters:

expr – sequence or scalar

Returns:

sequence or scalar

outer_join(right, on=None, suffixes=('_x', '_y'), mapjoin=False, merge_columns=None, skewjoin=False)

Outer join two collections.

If on is not specified, the common fields of the left and right collections are used. If column names conflict, suffixes are appended automatically; for example, if both left and right have a field named col, the joined collection will contain col_x and col_y.

Parameters:
  • left – left collection

  • right – right collection

  • on – fields to join on

  • suffixes – when names conflict, the suffixes will be added to both columns.

  • mapjoin – whether to use mapjoin, default value False.

  • merge_columns – whether to merge columns with the same name into one column without suffix. If the value is True, columns in the predicate with the same names will be merged, taking the non-null value. If the value is ‘left’ or ‘right’, the values of predicates on the left / right collection will be taken. You can also pass a dictionary to describe the behavior of each column, such as { ‘a’: ‘auto’, ‘b’: ‘left’ }.

Returns:

collection

Example:

>>> df.dtypes.names
['name', 'id']
>>> df2.dtypes.names
['name', 'id1']
>>> df.outer_join(df2)
>>> df.outer_join(df2, on='name')
>>> df.outer_join(df2, on=('id', 'id1'))
>>> df.outer_join(df2, on=['name', ('id', 'id1')])
>>> df.outer_join(df2, on=[df.name == df2.name, df.id == df2.id1])
persist(name, partitions=None, partition=None, lifecycle=None, project=None, **kwargs)

Persist the execution result into a new table. If partitions is not specified, a new table without partitions will be created if it does not exist, and the SQL result will be inserted into it. If partitions is specified, those fields will be the partition fields of the new table. If partition is specified, the data will be inserted into that exact partition of the table.

Parameters:
  • name – table name

  • partitions (list) – list of string, the partition fields

  • partition (string or PartitionSpec) – persist to a specified partition

  • lifecycle (int) – table lifecycle. If absent, options.lifecycle will be used.

  • project – project name, if not provided, will be the default project

  • hints (dict) – settings for SQL, e.g. odps.sql.mapper.split.size

  • priority (int) – instance priority, 9 as default

  • running_cluster – cluster to run this instance

  • overwrite (bool) – overwrite the table, True as default

  • drop_table (bool) – drop table if exists, False as default

  • create_table (bool) – create the table first if it does not exist, True as default

  • drop_partition (bool) – drop partition if exists, False as default

  • create_partition (bool) – create partition if not exists, None as default

  • cast (bool) – cast all columns’ types to those of the existing table, False as default

Returns:

odps.df.DataFrame

Example:

>>> df = df['name', 'id', 'ds']
>>> df.persist('odps_new_table')
>>> df.persist('odps_new_table', partition='pt=test')
>>> df.persist('odps_new_table', partitions=['ds'])
right_join(right, on=None, suffixes=('_x', '_y'), mapjoin=False, merge_columns=None, skewjoin=False)

Right join two collections.

If on is not specified, the common fields of the left and right collections are used. If column names conflict, suffixes are appended automatically; for example, if both left and right have a field named col, the joined collection will contain col_x and col_y.

Parameters:
  • left – left collection

  • right – right collection

  • on – fields to join on

  • suffixes – when names conflict, the suffixes will be added to both columns.

  • mapjoin – whether to use mapjoin, default value False.

  • merge_columns – whether to merge columns with the same name into one column without suffix. If the value is True, columns in the predicate with the same names will be merged, taking the non-null value. If the value is ‘left’ or ‘right’, the values of predicates on the left / right collection will be taken. You can also pass a dictionary to describe the behavior of each column, such as { ‘a’: ‘auto’, ‘b’: ‘left’ }.

Returns:

collection

Example:

>>> df.dtypes.names
['name', 'id']
>>> df2.dtypes.names
['name', 'id1']
>>> df.right_join(df2)
>>> df.right_join(df2, on='name')
>>> df.right_join(df2, on=('id', 'id1'))
>>> df.right_join(df2, on=['name', ('id', 'id1')])
>>> df.right_join(df2, on=[df.name == df2.name, df.id == df2.id1])
switch(*args, **kw)

Similar to the case-when statement in SQL. Refer to the example below.

Parameters:
  • expr

  • args

  • kw

Returns:

sequence or scalar

Example:

>>> # if df.id == 3 then df.name
>>> # elif df.id == df.fid.abs() then df.name + 'test'
>>> # default: 'test'
>>> df.id.switch(3, df.name, df.fid.abs(), df.name + 'test', default='test')
verify()

Verify if this expression can be compiled into ODPS SQL.

Returns:

True if compilation succeed else False

Return type:

bool

class odps.df.expr.groupby.GroupBy(*args, **kwargs)[source]
all()

All is True.

Parameters:

expr

Returns:

any()

Any is True.

Parameters:

expr

Returns:

ast()

Return the AST string.

Returns:

AST tree

Return type:

str

compile()

Compile this expression into an ODPS SQL

Returns:

compiled DAG

Return type:

str

count()

Value counts

Parameters:

expr

Returns:

cume_dist(sort=None, ascending=True)

Calculate cumulative ratio of a sequence expression.

Parameters:
  • expr – expression for calculation

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

Returns:

calculated column

dense_rank(sort=None, ascending=True)

Calculate dense rank of a sequence expression.

Parameters:
  • expr – expression for calculation

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

Returns:

calculated column

execute(**kwargs)
Parameters:
  • hints (dict) – settings for SQL, e.g. odps.sql.mapper.split.size

  • priority (int) – instance priority, 9 as default

  • running_cluster – cluster to run this instance

Returns:

execution result

Return type:

odps.df.backends.frame.ResultFrame

kurt()

Calculate kurtosis of the sequence

Parameters:

expr

Returns:

kurtosis()

Calculate kurtosis of the sequence

Parameters:

expr

Returns:

max()

Max value

Parameters:

expr

Returns:

mean()

Arithmetic mean.

Parameters:

expr

Returns:

median()

Median value.

Parameters:

expr

Returns:

min()

Min value

Parameters:

expr

Returns:

min_rank(sort=None, ascending=True)

Calculate rank of a sequence expression.

Parameters:
  • expr – expression for calculation

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

Returns:

calculated column

moment(order, central=False)

Calculate the n-th order moment of the sequence

Parameters:
  • expr

  • order – moment order, must be an integer

  • central – if central moments are to be computed.

Returns:

nth_value(nth, skip_nulls=False, sort=None, ascending=True)

Get nth value of a grouped and sorted expression.

Parameters:
  • expr – expression for calculation

  • nth – integer position

  • skip_nulls – whether to skip null values, False by default

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

Returns:

calculated column

nunique()

The distinct count.

Parameters:

expr

Returns:

percent_rank(sort=None, ascending=True)

Calculate percentage rank of a sequence expression.

Parameters:
  • expr – expression for calculation

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

Returns:

calculated column

persist(name, partitions=None, partition=None, lifecycle=None, project=None, **kwargs)

Persist the execution result into a new table. If partitions is not specified, a new table without partitions will be created if it does not exist, and the SQL result will be inserted into it. If partitions is specified, those fields will be the partition fields of the new table. If partition is specified, the data will be inserted into that exact partition of the table.

Parameters:
  • name – table name

  • partitions (list) – list of string, the partition fields

  • partition (string or PartitionSpec) – persist to a specified partition

  • lifecycle (int) – table lifecycle. If absent, options.lifecycle will be used.

  • project – project name, if not provided, will be the default project

  • hints (dict) – settings for SQL, e.g. odps.sql.mapper.split.size

  • priority (int) – instance priority, 9 as default

  • running_cluster – cluster to run this instance

  • overwrite (bool) – overwrite the table, True as default

  • drop_table (bool) – drop table if exists, False as default

  • create_table (bool) – create the table first if it does not exist, True as default

  • drop_partition (bool) – drop partition if exists, False as default

  • create_partition (bool) – create partition if not exists, None as default

  • cast (bool) – cast all columns’ types to those of the existing table, False as default

Returns:

odps.df.DataFrame

Example:

>>> df = df['name', 'id', 'ds']
>>> df.persist('odps_new_table')
>>> df.persist('odps_new_table', partition='pt=test')
>>> df.persist('odps_new_table', partitions=['ds'])
qcut(bins, labels=False, sort=None, ascending=True)

Get quantile-based bin indices of every element of a grouped and sorted expression. The indices of bins start from 0. If cuts are not of equal sizes, extra items will be appended into the first group.

Parameters:
  • expr – expression for calculation

  • bins – number of bins

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

Returns:

calculated column

quantile(prob=None, **kw)

Percentile value.

Parameters:
  • expr

  • prob – probability or list of probabilities, in [0, 1]

Returns:

rank(sort=None, ascending=True)

Calculate rank of a sequence expression.

Parameters:
  • expr – expression for calculation

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

Returns:

calculated column

row_number(sort=None, ascending=True)

Calculate row number of a sequence expression.

Parameters:
  • expr – expression for calculation

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

Returns:

calculated column

size()

Value counts

Parameters:

expr

Returns:

skew()

Calculate skewness of the sequence

Parameters:

expr

Returns:

std(**kw)

Standard deviation.

Parameters:
  • expr

  • kw

Returns:

sum()

Sum value

Parameters:

expr

Returns:

tolist(**kwargs)

Pack all data in the sequence into a list.

Parameters:
  • expr

  • unique – make every element in the sequence unique

Returns:

var(**kw)

Variance

Parameters:
  • expr

  • ddof – degree of freedom

  • kw

Returns:

verify()

Verify if this expression can be compiled into ODPS SQL.

Returns:

True if compilation succeed else False

Return type:

bool

class odps.df.expr.groupby.SequenceGroupBy(*args, **kwargs)[source]
ast()

Return the AST string.

Returns:

AST tree

Return type:

str

compile()

Compile this expression into an ODPS SQL

Returns:

compiled DAG

Return type:

str

count()

Value counts

Parameters:

expr

Returns:

cumcount(sort=None, ascending=True, unique=False, preceding=None, following=None)

Calculate cumulative count of a sequence expression.

Parameters:
  • expr – expression for calculation

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

  • unique – whether to eliminate duplicate entries

  • preceding – the start point of a window

  • following – the end point of a window

Returns:

calculated column

cume_dist(sort=None, ascending=True)

Calculate cumulative ratio of a sequence expression.

Parameters:
  • expr – expression for calculation

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

Returns:

calculated column

cummax(sort=None, ascending=True, unique=False, preceding=None, following=None)

Calculate cumulative maximum of a sequence expression.

Parameters:
  • expr – expression for calculation

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

  • unique – whether to eliminate duplicate entries

  • preceding – the start point of a window

  • following – the end point of a window

Returns:

calculated column

cummin(sort=None, ascending=True, unique=False, preceding=None, following=None)

Calculate cumulative minimum of a sequence expression.

Parameters:
  • expr – expression for calculation

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

  • unique – whether to eliminate duplicate entries

  • preceding – the start point of a window

  • following – the end point of a window

Returns:

calculated column

dense_rank(sort=None, ascending=True)

Calculate dense rank of a sequence expression.

Parameters:
  • expr – expression for calculation

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

Returns:

calculated column

execute(**kwargs)
Parameters:
  • hints (dict) – settings for SQL, e.g. odps.sql.mapper.split.size

  • priority (int) – instance priority, 9 as default

  • running_cluster – cluster to run this instance

Returns:

execution result

Return type:

odps.df.backends.frame.ResultFrame

hll_count(error_rate=0.01, splitter=None)

Calculate HyperLogLog count

Parameters:
  • expr

  • error_rate (float) – error rate

  • splitter – the splitter to split the column value

Returns:

sequence or scalar

Example:

>>> df = DataFrame(pd.DataFrame({'a': np.random.randint(100000, size=100000)}))
>>> df.a.hll_count()
63270
>>> df.a.nunique()
63250
lag(offset, default=None, sort=None, ascending=True)

Get value in the row offset rows prior to the current row.

Parameters:
  • offset – the offset value

  • default – default value for the function, when there are no rows satisfying the offset

  • expr – expression for calculation

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

Returns:

calculated column
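
Example (a sketch only; assumes df has columns name, id and a sort key ds, and that the grouped window expression can be projected directly):

>>> df['name', 'id', df.groupby('name').id.lag(1, sort='ds').rename('prev_id')]  # previous id within each name group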

lead(offset, default=None, sort=None, ascending=True)

Get value in the row offset rows after to the current row.

Parameters:
  • offset – the offset value

  • default – default value for the function, when there are no rows satisfying the offset

  • expr – expression for calculation

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

Returns:

calculated column

max()

Max value

Parameters:

expr

Returns:

min()

Min value

Parameters:

expr

Returns:

min_rank(sort=None, ascending=True)

Calculate rank of a sequence expression.

Parameters:
  • expr – expression for calculation

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

Returns:

calculated column

nth_value(nth, skip_nulls=False, sort=None, ascending=True)

Get nth value of a grouped and sorted expression.

Parameters:
  • expr – expression for calculation

  • nth – integer position

  • skip_nulls – whether to skip null values, False by default

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

Returns:

calculated column

nunique()

The distinct count.

Parameters:

expr

Returns:

percent_rank(sort=None, ascending=True)

Calculate percentage rank of a sequence expression.

Parameters:
  • expr – expression for calculation

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

Returns:

calculated column

persist(name, partitions=None, partition=None, lifecycle=None, project=None, **kwargs)

Persist the execution result into a new table. If partitions is not specified, a new table without partitions will be created if it does not exist, and the SQL result will be inserted into it. If partitions is specified, those fields will be the partition fields of the new table. If partition is specified, the data will be inserted into that exact partition of the table.

Parameters:
  • name – table name

  • partitions (list) – list of string, the partition fields

  • partition (string or PartitionSpec) – persist to a specified partition

  • lifecycle (int) – table lifecycle. If absent, options.lifecycle will be used.

  • project – project name, if not provided, will be the default project

  • hints (dict) – settings for SQL, e.g. odps.sql.mapper.split.size

  • priority (int) – instance priority, 9 as default

  • running_cluster – cluster to run this instance

  • overwrite (bool) – overwrite the table, True as default

  • drop_table (bool) – drop table if exists, False as default

  • create_table (bool) – create the table first if it does not exist, True as default

  • drop_partition (bool) – drop partition if exists, False as default

  • create_partition (bool) – create partition if not exists, None as default

  • cast (bool) – cast all columns’ types to those of the existing table, False as default

Returns:

odps.df.DataFrame

Example:

>>> df = df['name', 'id', 'ds']
>>> df.persist('odps_new_table')
>>> df.persist('odps_new_table', partition='pt=test')
>>> df.persist('odps_new_table', partitions=['ds'])
qcut(bins, labels=False, sort=None, ascending=True)

Get quantile-based bin indices of every element of a grouped and sorted expression. The indices of bins start from 0. If cuts are not of equal sizes, extra items will be appended into the first group.

Parameters:
  • expr – expression for calculation

  • bins – number of bins

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

Returns:

calculated column
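
Example (a sketch only; the projection pattern and the column names name and id are assumptions, following the signature above):

>>> df['name', 'id', df.groupby('name').id.qcut(4, sort='id').rename('id_bucket')]  # per-group quartile index, starting from 0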

rank(sort=None, ascending=True)

Calculate rank of a sequence expression.

Parameters:
  • expr – expression for calculation

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

Returns:

calculated column

row_number(sort=None, ascending=True)

Calculate row number of a sequence expression.

Parameters:
  • expr – expression for calculation

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

Returns:

calculated column

size()

Value counts

Parameters:

expr

Returns:

tolist(**kwargs)

Pack all data in the sequence into a list.

Parameters:
  • expr

  • unique – make every element in the sequence unique

Returns:

verify()

Verify if this expression can be compiled into ODPS SQL.

Returns:

True if compilation succeed else False

Return type:

bool

class odps.df.expr.groupby.Int64SequenceGroupBy(*args, **kwargs)[source]
ast()

Return the AST string.

Returns:

AST tree

Return type:

str

compile()

Compile this expression into an ODPS SQL

Returns:

compiled DAG

Return type:

str

count()

Value counts

Parameters:

expr

Returns:

cumcount(sort=None, ascending=True, unique=False, preceding=None, following=None)

Calculate cumulative count of a sequence expression.

Parameters:
  • expr – expression for calculation

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

  • unique – whether to eliminate duplicate entries

  • preceding – the start point of a window

  • following – the end point of a window

Returns:

calculated column

cume_dist(sort=None, ascending=True)

Calculate cumulative ratio of a sequence expression.

Parameters:
  • expr – expression for calculation

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

Returns:

calculated column

cummax(sort=None, ascending=True, unique=False, preceding=None, following=None)

Calculate cumulative maximum of a sequence expression.

Parameters:
  • expr – expression for calculation

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

  • unique – whether to eliminate duplicate entries

  • preceding – the start point of a window

  • following – the end point of a window

Returns:

calculated column

cummean(sort=None, ascending=True, unique=False, preceding=None, following=None)

Calculate cumulative mean of a sequence expression.

Parameters:
  • expr – expression for calculation

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

  • unique – whether to eliminate duplicate entries

  • preceding – the start point of a window

  • following – the end point of a window

Returns:

calculated column

cummedian(sort=None, ascending=True, unique=False, preceding=None, following=None)

Calculate cumulative median of a sequence expression.

Parameters:
  • expr – expression for calculation

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

  • unique – whether to eliminate duplicate entries

  • preceding – the start point of a window

  • following – the end point of a window

Returns:

calculated column

cummin(sort=None, ascending=True, unique=False, preceding=None, following=None)

Calculate cumulative minimum of a sequence expression.

Parameters:
  • expr – expression for calculation

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

  • unique – whether to eliminate duplicate entries

  • preceding – the start point of a window

  • following – the end point of a window

Returns:

calculated column

cumstd(sort=None, ascending=True, unique=False, preceding=None, following=None)

Calculate cumulative standard deviation of a sequence expression.

Parameters:
  • expr – expression for calculation

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

  • unique – whether to eliminate duplicate entries

  • preceding – the start point of a window

  • following – the end point of a window

Returns:

calculated column

cumsum(sort=None, ascending=True, unique=False, preceding=None, following=None)

Calculate cumulative summation of a sequence expression.

Parameters:
  • expr – expression for calculation

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

  • unique – whether to eliminate duplicate entries

  • preceding – the start point of a window

  • following – the end point of a window

Returns:

calculated column

dense_rank(sort=None, ascending=True)

Calculate dense rank of a sequence expression.

Parameters:
  • expr – expression for calculation

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

Returns:

calculated column

execute(**kwargs)
Parameters:
  • hints (dict) – settings for SQL, e.g. odps.sql.mapper.split.size

  • priority (int) – instance priority, 9 as default

  • running_cluster – cluster to run this instance

Returns:

execution result

Return type:

odps.df.backends.frame.ResultFrame

hll_count(error_rate=0.01, splitter=None)

Calculate HyperLogLog count

Parameters:
  • expr

  • error_rate (float) – error rate

  • splitter – the splitter to split the column value

Returns:

sequence or scalar

Example:

>>> df = DataFrame(pd.DataFrame({'a': np.random.randint(100000, size=100000)}))
>>> df.a.hll_count()
63270
>>> df.a.nunique()
63250
kurt()

Calculate kurtosis of the sequence

Parameters:

expr

Returns:

kurtosis()

Calculate kurtosis of the sequence

Parameters:

expr

Returns:

lag(offset, default=None, sort=None, ascending=True)

Get value in the row offset rows prior to the current row.

Parameters:
  • offset – the offset value

  • default – default value for the function, when there are no rows satisfying the offset

  • expr – expression for calculation

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

Returns:

calculated column

lead(offset, default=None, sort=None, ascending=True)

Get value in the row offset rows after to the current row.

Parameters:
  • offset – the offset value

  • default – default value for the function, when there are no rows satisfying the offset

  • expr – expression for calculation

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

Returns:

calculated column

max()

Max value

Parameters:

expr

Returns:

mean()

Arithmetic mean.

Parameters:

expr

Returns:

median()

Median value.

Parameters:

expr

Returns:

min()

Min value

Parameters:

expr

Returns:

min_rank(sort=None, ascending=True)

Calculate rank of a sequence expression.

Parameters:
  • expr – expression for calculation

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

Returns:

calculated column

moment(order, central=False)

Calculate the n-th order moment of the sequence

Parameters:
  • expr

  • order – moment order, must be an integer

  • central – if central moments are to be computed.

Returns:

nth_value(nth, skip_nulls=False, sort=None, ascending=True)

Get nth value of a grouped and sorted expression.

Parameters:
  • expr – expression for calculation

  • nth – integer position

  • skip_nulls – whether to skip null values, False by default

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

Returns:

calculated column

nunique()

The distinct count.

Parameters:

expr

Returns:

percent_rank(sort=None, ascending=True)

Calculate percentage rank of a sequence expression.

Parameters:
  • expr – expression for calculation

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

Returns:

calculated column

persist(name, partitions=None, partition=None, lifecycle=None, project=None, **kwargs)

Persist the execution result into a new table. If partitions is not specified, a new table without partitions will be created if it does not exist, and the SQL result will be inserted into it. If partitions is specified, those fields will be the partition fields of the new table. If partition is specified, the data will be inserted into that exact partition of the table.

Parameters:
  • name – table name

  • partitions (list) – list of string, the partition fields

  • partition (string or PartitionSpec) – persist to a specified partition

  • lifecycle (int) – table lifecycle. If absent, options.lifecycle will be used.

  • project – project name, if not provided, will be the default project

  • hints (dict) – settings for SQL, e.g. odps.sql.mapper.split.size

  • priority (int) – instance priority, 9 as default

  • running_cluster – cluster to run this instance

  • overwrite (bool) – overwrite the table, True as default

  • drop_table (bool) – drop table if exists, False as default

  • create_table (bool) – create the table first if it does not exist, True as default

  • drop_partition (bool) – drop partition if exists, False as default

  • create_partition (bool) – create partition if not exists, None as default

  • cast (bool) – cast all columns’ types to those of the existing table, False as default

Returns:

odps.df.DataFrame

Example:

>>> df = df['name', 'id', 'ds']
>>> df.persist('odps_new_table')
>>> df.persist('odps_new_table', partition='pt=test')
>>> df.persist('odps_new_table', partitions=['ds'])
qcut(bins, labels=False, sort=None, ascending=True)

Get quantile-based bin indices of every element of a grouped and sorted expression. The indices of bins start from 0. If cuts are not of equal sizes, extra items will be appended into the first group.

Parameters:
  • expr – expression for calculation

  • bins – number of bins

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

Returns:

calculated column

quantile(prob=None, **kw)

Percentile value.

Parameters:
  • expr

  • prob – probability or list of probabilities, in [0, 1]

Returns:

rank(sort=None, ascending=True)

Calculate rank of a sequence expression.

Parameters:
  • expr – expression for calculation

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

Returns:

calculated column

row_number(sort=None, ascending=True)

Calculate row number of a sequence expression.

Parameters:
  • expr – expression for calculation

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

Returns:

calculated column

size()

Value counts

Parameters:

expr

Returns:

skew()

Calculate skewness of the sequence

Parameters:

expr

Returns:

std(**kw)

Standard deviation.

Parameters:
  • expr

  • kw

Returns:

sum()

Sum value

Parameters:

expr

Returns:

tolist(**kwargs)

Pack all data in the sequence into a list.

Parameters:
  • expr

  • unique – make every element in the sequence unique

Returns:

var(**kw)

Variance

Parameters:
  • expr

  • ddof – degree of freedom

  • kw

Returns:

verify()

Verify if this expression can be compiled into ODPS SQL.

Returns:

True if compilation succeed else False

Return type:

bool