DataFrame Reference

class odps.df.DataFrame(*args, **kwargs)[source]

Main entry point of PyODPS DataFrame.

Users can initialize a DataFrame from an odps.models.Table.

Parameters:

data (odps.models.Table or pandas DataFrame) – ODPS table or pandas DataFrame

Example:

>>> df = DataFrame(o.get_table('my_example_table'))
>>> df.dtypes
odps.Schema {
  movie_id                            int64
  title                               string
  release_date                        string
  video_release_date                  string
  imdb_url                            string
  user_id                             int64
  rating                              int64
  unix_timestamp                      int64
  age                                 int64
  sex                                 string
  occupation                          string
  zip_code                            string
}
>>> df.count()
100000
>>>
>>> # Do the `groupby`, aggregate the `movie_id` by count, then sort the count in a reversed order
>>> # Finally we get the top 25 results
>>> df.groupby('title').agg(count=df.movie_id.count()).sort('count', ascending=False)[:25]
>>>
>>> # We can use the `value_counts` to reach the same goal
>>> df.movie_id.value_counts()[:25]
static batch_persist(dfs, tables, *args, **kwargs)[source]

Persist multiple DataFrames into ODPS.

Parameters:
  • dfs – DataFrames to persist.

  • tables – Table names to persist to. Use (table, partition) tuple to store to a table partition.

  • args – args for Expr.persist

  • kwargs – kwargs for Expr.persist

Examples:

>>> DataFrame.batch_persist([df1, df2], ['table_name1', ('table_name2', 'partition_name2')], lifecycle=1)
view()[source]

Clone the collection. Useful for self-joins.

Returns:

class odps.df.CollectionExpr(*args, **kwargs)[source]

Collection represents two-dimensional data.

Example:

>>> # projection
>>> df = DataFrame(o.get_table('my_table')) # DataFrame is actually a CollectionExpr
>>> df['name', 'id']  # project some columns
>>> df[[df.name, df.id]]  # projection
>>> df[df]  # no-op projection that keeps all the columns
>>> df[df, df.name.lower().rename('name2')]  # project a new column `name2` besides all the original columns
>>> df.select(df, name2=df.name.lower())  # projection by `select`
>>> df.exclude('name')  # project all columns but `name`
>>> df[df.exclude('name'), df.name.lower()]  # `name` will not conflict any more
>>>
>>> # filter
>>> df[(df.id < 3) & (df.name != 'test')]
>>> df.filter(df.id < 3, df.name != 'test')
>>>
>>> # slice
>>> df[: 10]
>>> df.limit(10)
>>>
>>> # Sequence
>>> df.name # an instance of :class:`odps.df.expr.expressions.SequenceExpr`
>>>
>>> # schema or dtypes
>>> df.dtypes
odps.Schema {
  name    string
  id      int64
}
>>> df.schema
odps.Schema {
  name    string
  id      int64
}
all()

Return whether all elements are True.

Parameters:

expr

Returns:

any()

Return whether any element is True.

Parameters:

expr

Returns:

append_id(id_col='append_id')

Append an ID column to the current DataFrame to form a new DataFrame.

Parameters:

id_col (str) – name of appended ID field.

Returns:

DataFrame with ID field

Return type:

DataFrame
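
Example (a minimal sketch; ``record_id`` is an arbitrary column name chosen for illustration):

>>> # append an auto-generated ID column to every row
>>> new_df = df.append_id(id_col='record_id')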

apply(func, axis=0, names=None, types=None, reduce=False, resources=None, keep_nulls=False, args=(), **kwargs)

Apply a function to a row when axis=1 or column when axis=0.

Parameters:
  • expr

  • func – function to apply

  • axis – row when axis=1 else column

  • names – output names

  • types – output types

  • reduce – if True will return a sequence else return a collection

  • resources – resources to read

  • keep_nulls – if True, keep rows producing empty results; only works in lateral views

  • args – args for function

  • kwargs – kwargs for function

Returns:

Example:

Apply a function to a row:

>>> from odps.df import output
>>>
>>> @output(['iris_add', 'iris_sub'], ['float', 'float'])
>>> def handle(row):
>>>     yield row.sepallength - row.sepalwidth, row.sepallength + row.sepalwidth
>>>     yield row.petallength - row.petalwidth, row.petallength + row.petalwidth
>>>
>>> iris.apply(handle, axis=1).count()

Apply a function to a column:

>>> class Agg(object):
>>>
>>>     def buffer(self):
>>>         return [0.0, 0]
>>>
>>>     def __call__(self, buffer, val):
>>>         buffer[0] += val
>>>         buffer[1] += 1
>>>
>>>     def merge(self, buffer, pbuffer):
>>>         buffer[0] += pbuffer[0]
>>>         buffer[1] += pbuffer[1]
>>>
>>>     def getvalue(self, buffer):
>>>         if buffer[1] == 0:
>>>             return 0.0
>>>         return buffer[0] / buffer[1]
>>>
>>> iris.exclude('name').apply(Agg)
applymap(func, rtype=None, resources=None, columns=None, excludes=None, args=(), **kwargs)

Call func on each element of this collection.

Parameters:
  • func – lambda, function, odps.models.Function, or str which is the name of odps.models.Function

  • rtype – if not provided, will be the dtype of this sequence

  • columns – columns to apply this function on

  • excludes – columns to skip when applying the function

Returns:

a new collection

Example:

>>> df.applymap(lambda x: x + 1)
ast()

Return the AST string.

Returns:

AST tree

Return type:

str

bfill(subset=None)

Fill NA/NaN values with the backward method. Equivalent to fillna(method=’bfill’).

Parameters:
  • expr (DataFrame) – input DataFrame.

  • subset – Labels along other axis to consider.

Returns:

DataFrame

bloom_filter(on, column, capacity=3000, error_rate=0.01)

Filter the collection on the on sequence with a BloomFilter built from column

Parameters:
  • collection

  • on – sequence or column name

  • column – instance of Column

  • capacity (int) – capacity of the BloomFilter

  • error_rate (float) – error rate

Returns:

collection

Example:

>>> df1 = DataFrame(pd.DataFrame({'a': ['name1', 'name2', 'name3', 'name1'], 'b': [1, 2, 3, 4]}))
>>> df2 = DataFrame(pd.DataFrame({'a': ['name1']}))
>>> df1.bloom_filter('a', df2.a)
       a  b
0  name1  1
1  name1  4
property columns

Returns:

the columns of this collection

Return type:

list, in which each element is a Column

compile()

Compile this expression into an ODPS SQL

Returns:

compiled DAG

Return type:

str

concat(rights, distinct=False, axis=0)

Concat collections.

Parameters:
  • left – left collection

  • rights – right collections, can be a DataFrame object or a list of DataFrames

  • distinct – whether to remove duplicate entries. only available when axis == 0

  • axis – when axis == 0, the DataFrames are merged vertically, otherwise horizontally.

Returns:

collection

Note that axis==1 can only be used under Pandas DataFrames or XFlow.

Example:

>>> df['name', 'id'].concat(df2['score'], axis=1)
continuous(*args)

Set fields to be continuous.

Return type:

DataFrame

Example:

>>> # Table schema is create table test(f1 double, f2 string)
>>> # Original continuity: f1=DISCRETE, f2=DISCRETE
>>> # Now we want to set ``f1`` and ``f2`` into continuous
>>> new_ds = df.continuous('f1 f2')
count()

Number of records.

Parameters:

expr

Returns:

discrete(*args)

Set fields to be discrete.

Return type:

DataFrame

Example:

>>> # Table schema is create table test(f1 double, f2 string)
>>> # Original continuity: f1=CONTINUOUS, f2=CONTINUOUS
>>> # Now we want to set ``f1`` and ``f2`` to discrete
>>> new_ds = df.discrete('f1 f2')
distinct(on=None, *ons)

Get collection with duplicate rows removed, optionally only considering certain columns

Parameters:
  • expr – collection

  • on – sequence or sequences

Returns:

distinct collection

Example:

>>> df.distinct(['name', 'id'])
>>> df['name', 'id'].distinct()
drop(data, axis=0, columns=None)

Drop data from a DataFrame.

Parameters:
  • expr – collection to drop data from

  • data – data to be removed

  • axis – 0 for deleting rows, 1 for columns.

  • columns – columns of data to select, only useful when axis == 0

Returns:

collection

Example:

>>> import pandas as pd
>>> df1 = DataFrame(pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]}))
>>> df2 = DataFrame(pd.DataFrame({'a': [2, 3], 'b': [5, 7]}))
>>> df1.drop(df2)
   a  b  c
0  1  4  7
1  3  6  9
>>> df1.drop(df2, columns='a')
   a  b  c
0  1  4  7
>>> df1.drop(['a'], axis=1)
   b  c
0  4  7
1  5  8
2  6  9
>>> df1.drop(df2, axis=1)
   c
0  7
1  8
2  9
dropna(how='any', thresh=None, subset=None)

Return object with labels on given axis omitted where alternately any or all of the data are missing

Parameters:
  • expr (DataFrame) – input DataFrame

  • how – can be ‘any’ or ‘all’. If ‘any’ is specified, drop the label if any NA values are present. If ‘all’ is specified, drop the label only if all values are NA.

  • thresh – require that many non-NA values

  • subset – Labels along other axis to consider, e.g. if you are dropping rows these would be a list of columns to include

Returns:

DataFrame
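
Example (a minimal sketch; the column names are illustrative):

>>> df.dropna()  # drop rows containing any NA value
>>> df.dropna(how='all')  # drop rows only when every value is NA
>>> df.dropna(thresh=2, subset=['name', 'id'])  # keep rows with at least 2 non-NA values among the given columns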

erase_key_value(*args)

Erase key-value represented fields.

Return type:

DataFrame

Example:

>>> new_ds = df.erase_key_value('f1 f2')
except_(*rights, **kwargs)

Exclude data from a collection, like except clause in SQL. All collections involved should have same schema.

Parameters:
  • left – collection to drop data from

  • rights – collection or list of collections

  • distinct – if True, duplicate entries are removed from the result

Returns:

collection

Examples:

>>> import pandas as pd
>>> df1 = DataFrame(pd.DataFrame({'a': [1, 2, 3, 3, 3], 'b': [1, 2, 3, 3, 3]}))
>>> df2 = DataFrame(pd.DataFrame({'a': [1, 3], 'b': [1, 3]}))
>>> df1.setdiff(df2)
   a  b
0  2  2
1  3  3
2  3  3
>>> df1.setdiff(df2, distinct=True)
   a  b
0  2  2
exclude(*fields)[source]

Project columns that are not included in the given fields

Parameters:

fields – field names

Returns:

new collection

Return type:

odps.df.expr.expressions.CollectionExpr

exclude_fields(*args)

Exclude one or more fields from feature fields.

Return type:

DataFrame

execute(**kwargs)
Parameters:
  • hints (dict) – settings for SQL, e.g. odps.sql.mapper.split.size

  • priority (int) – instance priority, 9 as default

  • running_cluster – cluster to run this instance

Returns:

execution result

Return type:

odps.df.backends.frame.ResultFrame

extract_kv(columns=None, kv_delim=':', item_delim=',', dtype='float', fill_value=None)

Extract values in key-value represented columns into standalone columns. New column names will be the name of the key-value column followed by an underscore and the key.

Parameters:
  • expr (DataFrame) – input DataFrame

  • columns – the key-value columns to be extracted.

  • kv_delim (str) – delimiter between key and value.

  • item_delim (str) – delimiter between key-value pairs.

  • dtype (str) – type of value columns to generate.

  • fill_value – default value for missing key-value pairs.

Returns:

extracted data frame

Return type:

DataFrame

Example:

>>> df
    name   kv
0  name1  k1=1.0,k2=3.0,k5=10.0
1  name2  k2=3.0,k3=5.1
2  name3  k1=7.1,k7=8.2
3  name4  k2=1.2,k3=1.5
4  name5  k2=1.0,k9=1.1
>>> table = df.extract_kv(columns=['kv'], kv_delim='=')
>>> table
    name   kv_k1   kv_k2   kv_k3   kv_k5   kv_k7   kv_k9
0  name1  1.0     3.0     Nan     10.0    Nan     Nan
1  name2  Nan     3.0     5.1     Nan     Nan     Nan
2  name3  7.1     Nan     Nan     Nan     8.2     Nan
3  name4  Nan     1.2     1.5     Nan     Nan     Nan
4  name5  Nan     1.0     Nan     Nan     Nan     1.1
ffill(subset=None)

Fill NA/NaN values with the forward method. Equivalent to fillna(method=’ffill’).

Parameters:
  • expr (DataFrame) – input DataFrame.

  • subset – Labels along other axis to consider.

Returns:

DataFrame

fillna(value=None, method=None, subset=None)

Fill NA/NaN values using the specified method

Parameters:
  • expr (DataFrame) – input DataFrame

  • method – can be ‘backfill’, ‘bfill’, ‘pad’, ‘ffill’ or None

  • value – value to fill into

  • subset – Labels along other axis to consider.

Returns:

DataFrame
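
Example (a minimal sketch; the column names are illustrative):

>>> df.fillna(0)  # fill NA values with a constant
>>> df.fillna(method='ffill', subset=['name', 'id'])  # forward-fill only the given columns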

filter(*predicates)[source]

Filter the data by predicates

Parameters:

predicates – the conditions to filter

Returns:

new collection

Return type:

odps.df.expr.expressions.CollectionExpr

filter_parts(predicate='', exclude=True)[source]

Filter the data by partition string. A partition string looks like pt1=1,pt2=2/pt1=2,pt2=1, where comma (,) denotes ‘and’, while (/) denotes ‘or’.

Parameters:
  • predicate (str|Partition) – predicate string of partition filter

  • exclude (bool) – True if you want to exclude partition fields, otherwise False. True by default.

Returns:

new collection

Return type:

odps.df.expr.expressions.CollectionExpr
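
Example (a minimal sketch; the partition fields ``pt1`` and ``pt2`` are assumed to exist on the source table):

>>> df.filter_parts('pt1=1,pt2=2/pt1=2,pt2=1')  # (pt1=1 and pt2=2) or (pt1=2 and pt2=1)
>>> df.filter_parts('pt1=1', exclude=False)  # keep the partition fields in the result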

groupby(by, *bys)

Group collection by a series of sequences.

Parameters:
  • expr – collection

  • by – columns to group

  • bys – columns to group

Returns:

GroupBy instance

Return type:

odps.df.expr.groupby.GroupBy
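
Example (a minimal sketch reusing the ``name`` and ``id`` columns shown above):

>>> df.groupby('name').agg(id_count=df.id.count())
>>> df.groupby([df.name]).agg(max_id=df.id.max())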

head(n=None, **kwargs)[source]

Return the first n rows. Execute at once.

Parameters:

n

Returns:

result frame

Return type:

odps.df.backends.frame.ResultFrame

inner_join(right, on=None, suffixes=('_x', '_y'), mapjoin=False, skewjoin=False)

Inner join two collections.

If on is not specified, the common fields of the left and right collections are used. suffixes means that if column names conflict, the suffixes will be added automatically. For example, if both left and right have a field named col, the joined collection will contain col_x and col_y.

Parameters:
  • left – left collection

  • right – right collection

  • on – fields to join on

  • suffixes – when column names conflict, the suffixes will be added to both columns.

Returns:

collection

Example:

>>> df.dtypes.names
['name', 'id']
>>> df2.dtypes.names
['name', 'id1']
>>> df.inner_join(df2)
>>> df.inner_join(df2, on='name')
>>> df.inner_join(df2, on=('id', 'id1'))
>>> df.inner_join(df2, on=['name', ('id', 'id1')])
>>> df.inner_join(df2, on=[df.name == df2.name, df.id == df2.id1])
intersect(*rights, **kwargs)

Calculate the intersection among datasets.

Parameters:
  • left – collection

  • rights – collection or list of collections

  • distinct – if True, duplicate entries are removed from the result

Returns:

collection

Examples:

>>> import pandas as pd
>>> df1 = DataFrame(pd.DataFrame({'a': [1, 2, 3, 3, 3], 'b': [1, 2, 3, 3, 3]}))
>>> df2 = DataFrame(pd.DataFrame({'a': [1, 3, 3], 'b': [1, 3, 3]}))
>>> df1.intersect(df2)
   a  b
0  1  1
1  3  3
2  3  3
>>> df1.intersect(df2, distinct=True)
   a  b
0  1  1
1  3  3
join(right, on=None, how='inner', suffixes=('_x', '_y'), mapjoin=False, skewjoin=False)

Join two collections.

If on is not specified, the common fields of the left and right collections are used. suffixes means that if column names conflict, the suffixes will be added automatically. For example, if both left and right have a field named col, the joined collection will contain col_x and col_y.

Parameters:
  • left – left collection

  • right – right collection

  • on – fields to join on

  • how – ‘inner’, ‘left’, ‘right’, or ‘outer’

  • suffixes – when column names conflict, the suffixes will be added to both columns.

  • mapjoin – set use mapjoin or not, default value False.

  • skewjoin – set use of skewjoin or not, default value False. Can specify True if the collection is skew, or a list specifying columns with skew values, or a list of dicts specifying skew combinations.

Returns:

collection

Example:

>>> df.dtypes.names
['name', 'id']
>>> df2.dtypes.names
['name', 'id1']
>>> df.join(df2)
>>> df.join(df2, on='name')
>>> df.join(df2, on=('id', 'id1'))
>>> df.join(df2, on=['name', ('id', 'id1')])
>>> df.join(df2, on=[df.name == df2.name, df.id == df2.id1])
>>> df.join(df2, mapjoin=False)
>>> df.join(df2, skewjoin=True)
>>> df.join(df2, skewjoin=["c0", "c1"])
>>> df.join(df2, skewjoin=[{"c0": 1, "c1": "2"}, {"c0": 3, "c1": "4"}])
key_value(*args, **kwargs)

Set fields to be key-value represented.

Return type:

DataFrame

Example:

>>> new_ds = df.key_value('f1 f2', kv=':', item=',')
kurt()

Calculate kurtosis of the sequence

Parameters:

expr

Returns:

kurtosis()

Calculate kurtosis of the sequence

Parameters:

expr

Returns:

label_field(f)

Select one field as the label field.

Note that this field will be excluded from feature fields.

Parameters:

f (str) – Selected label field

Return type:

DataFrame

left_join(right, on=None, suffixes=('_x', '_y'), mapjoin=False, merge_columns=None, skewjoin=False)

Left join two collections.

If on is not specified, the common fields of the left and right collections are used. suffixes means that if column names conflict, the suffixes will be added automatically. For example, if both left and right have a field named col, the joined collection will contain col_x and col_y.

Parameters:
  • left – left collection

  • right – right collection

  • on – fields to join on

  • suffixes – when column names conflict, the suffixes will be added to both columns.

  • mapjoin – set use mapjoin or not, default value False.

  • merge_columns – whether to merge columns with the same name into one column without suffix. If the value is True, columns in the predicate with same names will be merged, with non-null value. If the value is ‘left’ or ‘right’, the values of predicates on the left / right collection will be taken. You can also pass a dictionary to describe the behavior of each column, such as { ‘a’: ‘auto’, ‘b’: ‘left’ }.

Returns:

collection

Example:

>>> df.dtypes.names
['name', 'id']
>>> df2.dtypes.names
['name', 'id1']
>>> df.left_join(df2)
>>> df.left_join(df2, on='name')
>>> df.left_join(df2, on=('id', 'id1'))
>>> df.left_join(df2, on=['name', ('id', 'id1')])
>>> df.left_join(df2, on=[df.name == df2.name, df.id == df2.id1])
limit(n)[source]

limit n records

Parameters:

n – n records

Returns:

map_reduce(mapper=None, reducer=None, group=None, sort=None, ascending=True, combiner=None, combiner_buffer_size=1024, mapper_output_names=None, mapper_output_types=None, mapper_resources=None, reducer_output_names=None, reducer_output_types=None, reducer_resources=None)

MapReduce API, mapper or reducer should be provided.

Parameters:
  • expr

  • mapper – mapper function or class

  • reducer – reducer function or class

  • group – the keys to group after mapper

  • sort – the keys to sort after mapper

  • ascending – True if ascending else False

  • combiner – combiner function or class, combiner’s output should be equal to mapper

  • combiner_buffer_size – combiner’s buffer size, 1024 as default

  • mapper_output_names – mapper’s output names

  • mapper_output_types – mapper’s output types

  • mapper_resources – the resources for mapper

  • reducer_output_names – reducer’s output names

  • reducer_output_types – reducer’s output types

  • reducer_resources – the resources for reducer

Returns:

Example:

>>> from odps.df import output
>>>
>>> @output(['word', 'cnt'], ['string', 'int'])
>>> def mapper(row):
>>>     for word in row[0].split():
>>>         yield word.lower(), 1
>>>
>>> @output(['word', 'cnt'], ['string', 'int'])
>>> def reducer(keys):
>>>     cnt = [0]
>>>     def h(row, done):  # done illustrates that all the rows of the keys are processed
>>>         cnt[0] += row.cnt
>>>         if done:
>>>             yield keys.word, cnt[0]
>>>     return h
>>>
>>> words_df.map_reduce(mapper, reducer, group='word')
max()

Max value

Parameters:

expr

Returns:

mean()

Arithmetic mean.

Parameters:

expr

Returns:

median()

Median value.

Parameters:

expr

Returns:

melt(id_vars=None, value_vars=None, var_name='variable', value_name='value', ignore_nan=False)

“Unpivots” a DataFrame from wide format to long format, optionally leaving identifier variables set.

This function is useful to massage a DataFrame into a format where one or more columns are identifier variables (id_vars), while all other columns, considered measured variables (value_vars), are “unpivoted” to the row axis, leaving just two non-identifier columns, ‘variable’ and ‘value’.

Parameters:
  • expr – collection

  • id_vars – column(s) to use as identifier variables.

  • value_vars – column(s) to unpivot. If not specified, uses all columns that are not set as id_vars.

  • var_name – name to use for the ‘variable’ column. If None it uses frame.columns.name or ‘variable’.

  • value_name – name to use for the ‘value’ column.

  • ignore_nan – whether to ignore NaN values in data.

Returns:

collection

Example:

>>> df.melt(id_vars='id', value_vars=['col1', 'col2'])
>>> df.melt(id_vars=['id', 'id2'], value_vars=['col1', 'col2'], var_name='variable')
min()

Min value

Parameters:

expr

Returns:

min_max_scale(columns=None, feature_range=(0, 1), preserve=False, suffix='_scaled', group=None)

Resize a data frame by max / min values, i.e., (X - min(X)) / (max(X) - min(X))

Parameters:
  • expr (DataFrame) – input DataFrame

  • feature_range – the target range to resize the value into, i.e., v * (b - a) + a

  • preserve (bool) – determine whether input data should be kept. If True, scaled input data will be appended to the data frame with suffix

  • columns – column names to resize. If set to None, float or int-typed columns will be normalized if the column is not specified as a group column.

  • group – determine scale groups. Scaling will be done in each group separately.

  • suffix (str) – column suffix to be appended to the scaled columns.

Returns:

resized data frame

Return type:

DataFrame
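
Example (a minimal sketch; ``fid`` is an assumed numeric column):

>>> df.min_max_scale(columns=['fid'])  # scale fid into [0, 1]
>>> df.min_max_scale(columns=['fid'], feature_range=(-1, 1), preserve=True)  # keep fid and append fid_scaled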

moment(order, central=False)

Calculate the n-th order moment of the sequence

Parameters:
  • expr

  • order – moment order, must be an integer

  • central – if central moments are to be computed.

Returns:

nunique()

The distinct count.

Parameters:

expr

Returns:

outer_join(right, on=None, suffixes=('_x', '_y'), mapjoin=False, merge_columns=None, skewjoin=False)

Outer join two collections.

If on is not specified, the common fields of the left and right collections are used. suffixes means that if column names conflict, the suffixes will be added automatically. For example, if both left and right have a field named col, the joined collection will contain col_x and col_y.

Parameters:
  • left – left collection

  • right – right collection

  • on – fields to join on

  • suffixes – when column names conflict, the suffixes will be added to both columns.

  • mapjoin – set use mapjoin or not, default value False.

  • merge_columns – whether to merge columns with the same name into one column without suffix. If the value is True, columns in the predicate with same names will be merged, with non-null value. If the value is ‘left’ or ‘right’, the values of predicates on the left / right collection will be taken. You can also pass a dictionary to describe the behavior of each column, such as { ‘a’: ‘auto’, ‘b’: ‘left’ }.

Returns:

collection

Example:

>>> df.dtypes.names
['name', 'id']
>>> df2.dtypes.names
['name', 'id1']
>>> df.outer_join(df2)
>>> df.outer_join(df2, on='name')
>>> df.outer_join(df2, on=('id', 'id1'))
>>> df.outer_join(df2, on=['name', ('id', 'id1')])
>>> df.outer_join(df2, on=[df.name == df2.name, df.id == df2.id1])
persist(name, partitions=None, partition=None, lifecycle=None, project=None, **kwargs)

Persist the execution into a new table. If partitions not specified, will create a new table without partitions if the table does not exist, and insert the SQL result into it. If partitions are specified, they will be the partition fields of the new table. If partition is specified, the data will be inserted into the exact partition of the table.

Parameters:
  • name – table name

  • partitions (list) – list of string, the partition fields

  • partition (string or PartitionSpec) – persist to a specified partition

  • lifecycle (int) – table lifecycle. If absent, options.lifecycle will be used.

  • project – project name, if not provided, will be the default project

  • hints (dict) – settings for SQL, e.g. odps.sql.mapper.split.size

  • priority (int) – instance priority, 9 as default

  • running_cluster – cluster to run this instance

  • overwrite (bool) – overwrite the table, True as default

  • drop_table (bool) – drop table if exists, False as default

  • create_table (bool) – create table first if not exists, True as default

  • drop_partition (bool) – drop partition if exists, False as default

  • create_partition (bool) – create partition if not exists, None as default

  • cast (bool) – cast all columns’ types to those of the existing table, False as default

Returns:

odps.df.DataFrame

Example:

>>> df = df['name', 'id', 'ds']
>>> df.persist('odps_new_table')
>>> df.persist('odps_new_table', partition='pt=test')
>>> df.persist('odps_new_table', partitions=['ds'])
pivot(rows, columns, values=None)

Produce ‘pivot’ table based on 3 columns of this DataFrame. Uses unique values from rows / columns and fills with values.

Parameters:
  • expr – collection

  • rows – use to make new collection’s grouped rows

  • columns – use to make new collection’s columns

  • values – values to use for populating new collection’s values

Returns:

collection

Example:

>>> df.pivot(rows='id', columns='category')
>>> df.pivot(rows='id', columns='category', values='sale')
>>> df.pivot(rows=['id', 'id2'], columns='category', values='sale')
pivot_table(values=None, rows=None, columns=None, aggfunc='mean', fill_value=None)

Create a spreadsheet-style pivot table as a DataFrame.

Parameters:
  • expr – collection

  • values (optional) – column to aggregate

  • rows – rows to group

  • columns – keys to group by on the pivot table column

  • aggfunc – aggregate function or functions

  • fill_value (optional) – value to replace missing value with, default None

Returns:

collection

Example:

>>> df
    A    B      C   D
0  foo  one  small  1
1  foo  one  large  2
2  foo  one  large  2
3  foo  two  small  3
4  foo  two  small  3
5  bar  one  large  4
6  bar  one  small  5
7  bar  two  small  6
8  bar  two  large  7
>>> table = df.pivot_table(values='D', rows=['A', 'B'], columns='C', aggfunc='sum')
>>> table
     A    B  large_D_sum   small_D_sum
0  bar  one          4.0           5.0
1  bar  two          7.0           6.0
2  foo  one          4.0           1.0
3  foo  two          NaN           6.0
quantile(prob=None, **kw)

Percentile value.

Parameters:
  • expr

  • prob – probability or list of probabilities, in [0, 1]

Returns:
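
Example (a minimal sketch on an assumed numeric column ``id``):

>>> df.id.quantile(0.5)  # the median
>>> df.id.quantile([0.25, 0.5, 0.75])  # several probabilities at once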

query(expr)[source]

Query the data with a boolean expression.

Parameters:

expr – the query string; you can use the ‘@’ character to refer to environment variables.

Returns:

new collection

Return type:

odps.df.expr.expressions.CollectionExpr
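
Example (a minimal sketch; ``limit`` is a local variable referenced with ‘@’):

>>> limit = 3
>>> df.query('id < @limit')  # equivalent to df[df.id < 3]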

reshuffle(by=None, sort=None, ascending=True)

Reshuffle data.

Parameters:
  • expr

  • by – the sequence or scalar to shuffle by. RandomScalar as default

  • sort – the sequence or scalar to sort.

  • ascending – True if ascending else False

Returns:

collection
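
Example (a minimal sketch on the assumed columns ``name`` and ``id``):

>>> df.reshuffle()  # shuffle randomly
>>> df.reshuffle('name')  # shuffle by the name column
>>> df.reshuffle(df.name, sort='id', ascending=False)  # shuffle by name, sort by id in descending order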

right_join(right, on=None, suffixes=('_x', '_y'), mapjoin=False, merge_columns=None, skewjoin=False)

Right join two collections.

If on is not specified, the common fields of the left and right collections are used. suffixes means that if column names conflict, the suffixes will be added automatically. For example, if both left and right have a field named col, the joined collection will contain col_x and col_y.

Parameters:
  • left – left collection

  • right – right collection

  • on – fields to join on

  • suffixes – when column names conflict, the suffixes will be added to both columns.

  • mapjoin – set use mapjoin or not, default value False.

  • merge_columns – whether to merge columns with the same name into one column without suffix. If the value is True, columns in the predicate with same names will be merged, with non-null value. If the value is ‘left’ or ‘right’, the values of predicates on the left / right collection will be taken. You can also pass a dictionary to describe the behavior of each column, such as { ‘a’: ‘auto’, ‘b’: ‘left’ }.

Returns:

collection

Example:

>>> df.dtypes.names
['name', 'id']
>>> df2.dtypes.names
['name', 'id1']
>>> df.right_join(df2)
>>> df.right_join(df2, on='name')
>>> df.right_join(df2, on=('id', 'id1'))
>>> df.right_join(df2, on=['name', ('id', 'id1')])
>>> df.right_join(df2, on=[df.name == df2.name, df.id == df2.id1])
roles(clear_features=True, **field_roles)

Set roles of fields

Parameters:
  • clear_features – Clear feature roles on fields

  • field_roles

Returns:

sample(parts=None, columns=None, i=None, n=None, frac=None, replace=False, weights=None, strata=None, random_state=None)

Sample collection.

Parameters:
  • expr – collection

  • parts – how many parts to hash

  • columns – the columns to sample

  • i – the part to sample out, can be a list of parts, must be from 0 to parts-1

  • n – how many rows to sample. If strata is specified, n should be a dict with values in the strata column as dictionary keys and corresponding sample size as values

  • frac – fraction of rows to sample. If strata is specified, frac should be a dict with values in the strata column as dictionary keys and corresponding sample fractions as values

  • replace – whether to perform replace sampling

  • weights – the column name of weights

  • strata – the name of strata column

  • random_state – the random seed when performing sampling

Returns:

collection

Note that n, frac, replace, weights, strata and random_state can only be used under Pandas DataFrames or XFlow.

Example:

Sampling with parts:

>>> df.sample(parts=1)
>>> df.sample(parts=5, i=0)
>>> df.sample(parts=10, columns=['name'])

Sampling with fraction or weights, replacement option can be specified:

>>> df.sample(n=100)
>>> df.sample(frac=0.1)
>>> df.sample(frac=0.1, replace=True)

Sampling with weight column:

>>> df.sample(n=100, weights='weight_col')
>>> df.sample(n=100, weights='weight_col', replace=True)

Stratified sampling. Note that currently we do not support stratified sampling with replacement.

>>> df.sample(strata='category', frac={'Iris Setosa': 0.5, 'Iris Versicolour': 0.4})
select(*fields, **kw)[source]

Project columns. Remember to avoid column name conflicts.

Parameters:
  • fields – columns to project

  • kw – columns and their names to project

Returns:

new collection

Return type:

odps.df.expr.expressions.CollectionExpr

select_features(*args, **kwargs)

Select one or more fields as feature fields.

Return type:

DataFrame

setdiff(*rights, **kwargs)

Exclude data from a collection, like except clause in SQL. All collections involved should have same schema.

Parameters:
  • left – collection to drop data from

  • rights – collection or list of collections

  • distinct – if True, duplicate entries are removed from the result

Returns:

collection

Examples:

>>> import pandas as pd
>>> df1 = DataFrame(pd.DataFrame({'a': [1, 2, 3, 3, 3], 'b': [1, 2, 3, 3, 3]}))
>>> df2 = DataFrame(pd.DataFrame({'a': [1, 3], 'b': [1, 3]}))
>>> df1.setdiff(df2)
   a  b
0  2  2
1  3  3
2  3  3
>>> df1.setdiff(df2, distinct=True)
   a  b
0  2  2
size()

Number of records.

Parameters:

expr

Returns:

skew()

Calculate skewness of the sequence

Parameters:

expr

Returns:

sort(by, ascending=True)

Sort the collection by values. sort is an alias name for sort_values

Parameters:
  • expr – collection

  • by – the sequence or sequences to sort

  • ascending – Sort ascending vs. descending. Specify a list for multiple sort orders. If this is a list of bools, it must match the length of by

Returns:

Sorted collection

Example:

>>> df.sort_values(['name', 'id'])  # 1
>>> df.sort(['name', 'id'], ascending=False)  # 2
>>> df.sort(['name', 'id'], ascending=[False, True])  # 3
>>> df.sort([-df.name, df.id])  # 4, equal to #3
sort_values(by, ascending=True)

Sort the collection by values. sort is an alias name for sort_values

Parameters:
  • expr – collection

  • by – the sequence or sequences to sort

  • ascending – Sort ascending vs. descending. Specify a list for multiple sort orders. If this is a list of bools, it must match the length of by

Returns:

Sorted collection

Example:

>>> df.sort_values(['name', 'id'])  # 1
>>> df.sort(['name', 'id'], ascending=False)  # 2
>>> df.sort(['name', 'id'], ascending=[False, True])  # 3
>>> df.sort([-df.name, df.id])  # 4, equal to #3
split(frac, seed=None)

Split the current DataFrame into two DataFrame objects with a certain ratio.

Parameters:

frac (float) – Split ratio

Returns:

two split DataFrame objects
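
Example (a minimal sketch; the 0.6/0.4 ratio and the seed are arbitrary):

>>> train, test = df.split(0.6, seed=42)  # roughly 60% of rows go to train, the rest to test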

std(**kw)

Standard deviation.

Parameters:
  • expr

  • kw

Returns:

std_scale(columns=None, with_means=True, with_std=True, preserve=False, suffix='_scaled', group=None)

Resize a data frame by mean and standard deviation.

Parameters:
  • expr (DataFrame) – Input DataFrame

  • with_means (bool) – Determine whether the output will be subtracted by means

  • with_std (bool) – Determine whether the output will be divided by standard deviations

  • preserve (bool) – Determine whether input data should be kept. If True, scaled input data will be appended to the data frame with suffix

  • columns – Column names to resize. If set to None, float or int-typed columns will be normalized if the column is not specified as a group column.

  • group – determine scale groups. Scaling will be done in each group separately.

  • suffix (str) – column suffix to be appended to the scaled columns.

Returns:

resized data frame

Return type:

DataFrame
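
Example (a minimal sketch; ``fid`` is an assumed numeric column):

>>> df.std_scale(columns=['fid'])  # subtract the mean and divide by the standard deviation
>>> df.std_scale(columns=['fid'], with_means=False)  # only divide by the standard deviation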

sum()

Sum value

Parameters:

expr

Returns:

switch(*args, **kw)

Similar to the case-when in SQL. Refer to the example below

Parameters:
  • expr

  • args

  • kw

Returns:

sequence or scalar

Example:

>>> # if df.id == 3 then df.name
>>> # elif df.id == df.fid.abs() then df.name + 'test'
>>> # default: 'test'
>>> df.id.switch(3, df.name, df.fid.abs(), df.name + 'test', default='test')
tail(n=None, **kwargs)[source]

Return the last n rows. Execute at once.

Parameters:

n

Returns:

result frame

Return type:

odps.df.backends.frame.ResultFrame

to_kv(columns=None, kv_delim=':', item_delim=',', kv_name='kv_col')

Merge values in specified columns into a key-value represented column.

Parameters:
  • expr (DataFrame) – input DataFrame

  • columns – the columns to be merged.

  • kv_delim (str) – delimiter between key and value.

  • item_delim (str) – delimiter between key-value pairs.

  • kv_col (str) – name of the new key-value column

Returns:

converted data frame

Return type:

DataFrame

Example:

>>> df
    name   k1   k2   k3   k5    k7   k9
0  name1  1.0  3.0  Nan  10.0  Nan  Nan
1  name2  Nan  3.0  5.1  Nan   Nan  Nan
2  name3  7.1  Nan  Nan  Nan   8.2  Nan
3  name4  Nan  1.2  1.5  Nan   Nan  Nan
4  name5  Nan  1.0  Nan  Nan   Nan  1.1
>>> table = df.to_kv(columns=['k1', 'k2', 'k3', 'k5', 'k7', 'k9'], kv_delim='=')
>>> table
    name   kv_col
0  name1  k1=1.0,k2=3.0,k5=10.0
1  name2  k2=3.0,k3=5.1
2  name3  k1=7.1,k7=8.2
3  name4  k2=1.2,k3=1.5
4  name5  k2=1.0,k9=1.1
to_pandas(wrap=False, **kwargs)[source]

Convert to pandas DataFrame. Execute at once.

Parameters:

wrap – if True, wrap the pandas DataFrame into a PyODPS DataFrame

Returns:

pandas DataFrame
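
Example (a minimal sketch):

>>> pd_df = df.to_pandas()  # a native pandas DataFrame
>>> wrapped = df.to_pandas(wrap=True)  # a PyODPS DataFrame backed by pandas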

tolist(**kwargs)

Pack all data in the sequence into a list.

Parameters:
  • expr

  • unique – make every element in the sequence unique

Returns:

union(right, distinct=False)

Union two collections.

Parameters:
  • left – left collection

  • right – right collection

  • distinct

Returns:

collection

Example:

>>> df['name', 'id'].union(df2['id', 'name'])
var(**kw)

Variance

Parameters:
  • expr

  • ddof – degree of freedom

  • kw

Returns:

verify()

Verify if this expression can be compiled into ODPS SQL.

Returns:

True if compilation succeed else False

Return type:

bool

view()[source]

Clone a same collection. useful for self-join.

Returns:

weight_field(f)

Select one field as the weight field.

Note that this field will be excluded from feature fields.

Parameters:

f (str) – Selected weight field

Return type:

DataFrame

class odps.df.SequenceExpr(*args, **kwargs)[source]

Sequence represents one-dimensional data.

ast()

Return the AST string.

Returns:

AST tree

Return type:

str

astype(data_type)[source]

Cast to a new data type.

Parameters:

data_type – the new data type

Returns:

casted sequence

Example:

>>> df.id.astype('float')
between(left, right, inclusive=True)

Return a boolean sequence or scalar showing whether each element is between left and right.

Parameters:
  • expr – sequence or scalar

  • left – left value

  • right – right value

  • inclusive – if true, will be left <= expr <= right, else will be left < expr < right

Returns:

boolean sequence or scalar
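
Example (a minimal sketch on an assumed numeric column ``id``):

>>> df[df.id.between(1, 3)]  # 1 <= id <= 3
>>> df[df.id.between(1, 3, inclusive=False)]  # 1 < id < 3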

compile()

Compile this expression into an ODPS SQL

Returns:

compiled DAG

Return type:

str

concat(rights, distinct=False, axis=0)

Concat collections.

Parameters:
  • left – left collection

  • rights – right collections, can be a DataFrame object or a list of DataFrames

  • distinct – whether to remove duplicate entries. only available when axis == 0

  • axis – when axis == 0, the DataFrames are merged vertically, otherwise horizontally.

Returns:

collection

Note that axis==1 can only be used under Pandas DataFrames or XFlow.

Example:

>>> df['name', 'id'].concat(df2['score'], axis=1)
continuous()

Set sequence to be continuous.

Return type:

Column

Example:

>>> # Table schema is create table test(f1 double, f2 string)
>>> # Original continuity: f1=DISCRETE, f2=DISCRETE
>>> # Now we want to set ``f1`` and ``f2`` into continuous
>>> new_ds = df.continuous('f1 f2')
count()

Number of values.

Parameters:

expr

Returns:

cut(bins, right=True, labels=None, include_lowest=False, include_under=False, include_over=False)

Return indices of half-open bins to which each value of expr belongs.

Parameters:
  • expr – sequence or scalar

  • bins – list of scalars

  • right – indicates whether the bins include the rightmost edge or not. If right == True (the default), then the bins [1, 2, 3, 4] indicate (1, 2], (2, 3], (3, 4]

  • labels – Used as labels for the resulting bins. Must be of the same length as the resulting bins.

  • include_lowest – Whether the first interval should be left-inclusive or not.

  • include_under – include the bin below the leftmost edge or not

  • include_over – include the bin above the rightmost edge or not

Returns:

sequence or scalar
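
Example (a minimal sketch; the bin edges and labels are arbitrary):

>>> df.id.cut([0, 10, 20, 30], labels=['small', 'medium', 'large'])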

discrete()

Set sequence to be discrete.

Return type:

Column

Example:

>>> # Table schema is create table test(f1 double, f2 string)
>>> # Original continuity: f1=CONTINUOUS, f2=CONTINUOUS
>>> # Now we want to set ``f1`` and ``f2`` to discrete
>>> new_ds = df.discrete('f1 f2')
drop(data, axis=0, columns=None)

Drop data from a DataFrame.

Parameters:
  • expr – collection to drop data from

  • data – data to be removed

  • axis – 0 for deleting rows, 1 for columns.

  • columns – columns of data to select, only useful when axis == 0

Returns:

collection

Example:

>>> import pandas as pd
>>> df1 = DataFrame(pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]}))
>>> df2 = DataFrame(pd.DataFrame({'a': [2, 3], 'b': [5, 7]}))
>>> df1.drop(df2)
   a  b  c
0  1  4  7
1  3  6  9
>>> df1.drop(df2, columns='a')
   a  b  c
0  1  4  7
>>> df1.drop(['a'], axis=1)
   b  c
0  4  7
1  5  8
2  6  9
>>> df1.drop(df2, axis=1)
   c
0  7
1  8
2  9
property dtype

Return the data type. Available types: int8, int16, int32, int64, float32, float64, boolean, string, decimal, datetime

Returns:

the data type

erase_key_value()

Erase key-value represented fields.

Return type:

Column

Example:

>>> new_ds = df.erase_key_value('f1 f2')
except_(*rights, **kwargs)

Exclude data from a collection, like except clause in SQL. All collections involved should have same schema.

Parameters:
  • left – collection to drop data from

  • rights – collection or list of collections

  • distinct – if True, duplicate entries are removed from the result

Returns:

collection

Examples:

>>> import pandas as pd
>>> df1 = DataFrame(pd.DataFrame({'a': [1, 2, 3, 3, 3], 'b': [1, 2, 3, 3, 3]}))
>>> df2 = DataFrame(pd.DataFrame({'a': [1, 3], 'b': [1, 3]}))
>>> df1.setdiff(df2)
   a  b
0  2  2
1  3  3
2  3  3
>>> df1.setdiff(df2, distinct=True)
   a  b
0  2  2
execute(**kwargs)
Parameters:
  • hints (dict) – settings for SQL, e.g. odps.sql.mapper.split.size

  • priority (int) – instance priority, 9 as default

  • running_cluster – cluster to run this instance

Returns:

execution result

Return type:

odps.df.backends.frame.ResultFrame

fillna(value)

Fill null with value.

Parameters:
  • expr – sequence or scalar

  • value – value to fill into

Returns:

sequence or scalar

hash(func=None)

Calculate the hash value.

Parameters:
  • expr

  • func – hash function

Returns:

head(n=None, **kwargs)[source]

Return first n rows. Execute at once.

Parameters:

n

Returns:

result frame

Return type:

odps.df.expr.expressions.CollectionExpr

hll_count(error_rate=0.01, splitter=None)

Calculate HyperLogLog count

Parameters:
  • expr

  • error_rate (float) – error rate

  • splitter – the splitter to split the column value

Returns:

sequence or scalar

Example:

>>> df = DataFrame(pd.DataFrame({'a': np.random.randint(100000, size=100000)}))
>>> df.a.hll_count()
63270
>>> df.a.nunique()
63250
inner_join(right, on=None, suffixes=('_x', '_y'), mapjoin=False, skewjoin=False)

Inner join two collections.

If on is not specified, the common fields of the left and right collections are used. suffixes means that if column names conflict, the suffixes will be added automatically. For example, if both left and right have a field named col, the joined collection will contain col_x and col_y.

Parameters:
  • left – left collection

  • right – right collection

  • on – fields to join on

  • suffixes – when column names conflict, the suffixes will be added to both columns.

Returns:

collection

Example:

>>> df.dtypes.names
['name', 'id']
>>> df2.dtypes.names
['name', 'id1']
>>> df.inner_join(df2)
>>> df.inner_join(df2, on='name')
>>> df.inner_join(df2, on=('id', 'id1'))
>>> df.inner_join(df2, on=['name', ('id', 'id1')])
>>> df.inner_join(df2, on=[df.name == df2.name, df.id == df2.id1])
intersect(*rights, **kwargs)

Calculate the intersection among datasets.

Parameters:
  • left – collection

  • rights – collection or list of collections

  • distinct – if True, duplicate entries are removed from the result

Returns:

collection

Examples:

>>> import pandas as pd
>>> df1 = DataFrame(pd.DataFrame({'a': [1, 2, 3, 3, 3], 'b': [1, 2, 3, 3, 3]}))
>>> df2 = DataFrame(pd.DataFrame({'a': [1, 3, 3], 'b': [1, 3, 3]}))
>>> df1.intersect(df2)
   a  b
0  1  1
1  3  3
2  3  3
>>> df1.intersect(df2, distinct=True)
   a  b
0  1  1
1  3  3
isin(values)

Return a boolean sequence or scalar showing whether each element is exactly contained in the passed values.

Parameters:
  • expr – sequence or scalar

  • values – list object or sequence

Returns:

boolean sequence or scalar
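
Example (a minimal sketch on an assumed string column ``name``):

>>> df[df.name.isin(['name1', 'name2'])]
>>> df[df.name.isin(df2.name)]  # values may also be another sequence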

isna()

Return a sequence or scalar according to the input indicating if the values are null.

Parameters:

expr – sequence or scalar

Returns:

sequence or scalar

isnull()

Return a sequence or scalar according to the input indicating if the values are null.

Parameters:

expr – sequence or scalar

Returns:

sequence or scalar

join(right, on=None, how='inner', suffixes=('_x', '_y'), mapjoin=False, skewjoin=False)

Join two collections.

If on is not specified, the common fields of the left and right collections are used. suffixes means that if column names conflict, the suffixes will be added automatically. For example, if both left and right have a field named col, the joined collection will contain col_x and col_y.

Parameters:
  • left – left collection

  • right – right collection

  • on – fields to join on

  • how – ‘inner’, ‘left’, ‘right’, or ‘outer’

  • suffixes – when column names conflict, the suffixes will be added to both columns.

  • mapjoin – set use mapjoin or not, default value False.

  • skewjoin – set use of skewjoin or not, default value False. Can specify True if the collection is skew, or a list specifying columns with skew values, or a list of dicts specifying skew combinations.

Returns:

collection

Example:

>>> df.dtypes.names
['name', 'id']
>>> df2.dtypes.names
['name', 'id1']
>>> df.join(df2)
>>> df.join(df2, on='name')
>>> df.join(df2, on=('id', 'id1'))
>>> df.join(df2, on=['name', ('id', 'id1')])
>>> df.join(df2, on=[df.name == df2.name, df.id == df2.id1])
>>> df.join(df2, mapjoin=False)
>>> df.join(df2, skewjoin=True)
>>> df.join(df2, skewjoin=["c0", "c1"])
>>> df.join(df2, skewjoin=[{"c0": 1, "c1": "2"}, {"c0": 3, "c1": "4"}])
key_value(**kwargs)

Set fields to be key-value represented.

Return type:

Column

Example:

>>> new_ds = df.key_value('f1 f2', kv=':', item=',')
left_join(right, on=None, suffixes=('_x', '_y'), mapjoin=False, merge_columns=None, skewjoin=False)

Left join two collections.

If on is not specified, the common fields of the left and right collections are used. suffixes means that if column names conflict, the suffixes will be added automatically. For example, if both left and right have a field named col, the joined collection will contain col_x and col_y.

Parameters:
  • left – left collection

  • right – right collection

  • on – fields to join on

  • suffixes – when column names conflict, the suffixes will be added to both columns.

  • mapjoin – set use mapjoin or not, default value False.

  • merge_columns – whether to merge columns with the same name into one column without suffix. If the value is True, columns in the predicate with same names will be merged, with non-null value. If the value is ‘left’ or ‘right’, the values of predicates on the left / right collection will be taken. You can also pass a dictionary to describe the behavior of each column, such as { ‘a’: ‘auto’, ‘b’: ‘left’ }.

Returns:

collection

Example:

>>> df.dtypes.names
['name', 'id']
>>> df2.dtypes.names
['name', 'id1']
>>> df.left_join(df2)
>>> df.left_join(df2, on='name')
>>> df.left_join(df2, on=('id', 'id1'))
>>> df.left_join(df2, on=['name', ('id', 'id1')])
>>> df.left_join(df2, on=[df.name == df2.name, df.id == df2.id1])
map(func, rtype=None, resources=None, args=(), **kwargs)

Call func on each element of this sequence.

Parameters:
  • func – lambda, function, odps.models.Function, or str which is the name of odps.models.Function

  • rtype – if not provided, will be the dtype of this sequence

Returns:

a new sequence

Example:

>>> df.id.map(lambda x: x + 1)
max()

Max value

Parameters:

expr

Returns:

min()

Min value

Parameters:

expr

Returns:

notin(values)

Return a boolean sequence or scalar showing whether each element is not contained in the passed values.

Parameters:
  • expr – sequence or scalar

  • values – list object or sequence

Returns:

boolean sequence or scalar

notna()

Return a sequence or scalar according to the input indicating if the values are not null.

Parameters:

expr – sequence or scalar

Returns:

sequence or scalar

notnull()

Return a sequence or scalar according to the input indicating if the values are not null.

Parameters:

expr – sequence or scalar

Returns:

sequence or scalar

nunique()

The distinct count.

Parameters:

expr

Returns:

outer_join(right, on=None, suffixes=('_x', '_y'), mapjoin=False, merge_columns=None, skewjoin=False)

Outer join two collections.

If on is not specified, the common fields of the left and right collections are used. suffixes means that if column names conflict, the suffixes will be added automatically. For example, if both left and right have a field named col, the joined collection will contain col_x and col_y.

Parameters:
  • left – left collection

  • right – right collection

  • on – fields to join on

  • suffixes – when column names conflict, the suffixes will be added to both columns.

  • mapjoin – set use mapjoin or not, default value False.

  • merge_columns – whether to merge columns with the same name into one column without suffix. If the value is True, columns in the predicate with same names will be merged, with non-null value. If the value is ‘left’ or ‘right’, the values of predicates on the left / right collection will be taken. You can also pass a dictionary to describe the behavior of each column, such as { ‘a’: ‘auto’, ‘b’: ‘left’ }.

Returns:

collection

Example:

>>> df.dtypes.names
['name', 'id']
>>> df2.dtypes.names
['name', 'id1']
>>> df.outer_join(df2)
>>> df.outer_join(df2, on='name')
>>> df.outer_join(df2, on=('id', 'id1'))
>>> df.outer_join(df2, on=['name', ('id', 'id1')])
>>> df.outer_join(df2, on=[df.name == df2.name, df.id == df2.id1])
persist(name, partitions=None, partition=None, lifecycle=None, project=None, **kwargs)

Persist the execution into a new table. If partitions not specified, will create a new table without partitions if the table does not exist, and insert the SQL result into it. If partitions are specified, they will be the partition fields of the new table. If partition is specified, the data will be inserted into the exact partition of the table.

Parameters:
  • name – table name

  • partitions (list) – list of string, the partition fields

  • partition (string or PartitionSpec) – persist to a specified partition

  • lifecycle (int) – table lifecycle. If absent, options.lifecycle will be used.

  • project – project name, if not provided, will be the default project

  • hints (dict) – settings for SQL, e.g. odps.sql.mapper.split.size

  • priority (int) – instance priority, 9 as default

  • running_cluster – cluster to run this instance

  • overwrite (bool) – overwrite the table, True as default

  • drop_table (bool) – drop table if exists, False as default

  • create_table (bool) – create table first if not exists, True as default

  • drop_partition (bool) – drop partition if exists, False as default

  • create_partition (bool) – create partition if not exists, None as default

  • cast (bool) – cast all columns’ types to those of the existing table, False as default

Returns:

odps.df.DataFrame

Example:

>>> df = df['name', 'id', 'ds']
>>> df.persist('odps_new_table')
>>> df.persist('odps_new_table', partition='pt=test')
>>> df.persist('odps_new_table', partitions=['ds'])
right_join(right, on=None, suffixes=('_x', '_y'), mapjoin=False, merge_columns=None, skewjoin=False)

Right join two collections.

If on is not specified, the common fields of the left and right collections are used. suffixes means that if column names conflict, the suffixes will be added automatically. For example, if both left and right have a field named col, the joined collection will contain col_x and col_y.

Parameters:
  • left – left collection

  • right – right collection

  • on – fields to join on

  • suffixes – when column names conflict, the suffixes will be added to both columns.

  • mapjoin – set use mapjoin or not, default value False.

  • merge_columns – whether to merge columns with the same name into one column without suffix. If the value is True, columns in the predicate with same names will be merged, with non-null value. If the value is ‘left’ or ‘right’, the values of predicates on the left / right collection will be taken. You can also pass a dictionary to describe the behavior of each column, such as { ‘a’: ‘auto’, ‘b’: ‘left’ }.

Returns:

collection

Example:

>>> df.dtypes.names
['name', 'id']
>>> df2.dtypes.names
['name', 'id1']
>>> df.right_join(df2)
>>> df.right_join(df2, on='name')
>>> df.right_join(df2, on=('id', 'id1'))
>>> df.right_join(df2, on=['name', ('id', 'id1')])
>>> df.right_join(df2, on=[df.name == df2.name, df.id == df2.id1])
role(role_name)

Set role of current column

Parameters:

role_name – name of the role to be selected.

Returns:

setdiff(*rights, **kwargs)

Exclude data from a collection, like except clause in SQL. All collections involved should have same schema.

Parameters:
  • left – collection to drop data from

  • rights – collection or list of collections

  • distinct – if True, duplicate entries are removed from the result

Returns:

collection

Examples:

>>> import pandas as pd
>>> df1 = DataFrame(pd.DataFrame({'a': [1, 2, 3, 3, 3], 'b': [1, 2, 3, 3, 3]}))
>>> df2 = DataFrame(pd.DataFrame({'a': [1, 3], 'b': [1, 3]}))
>>> df1.setdiff(df2)
   a  b
0  2  2
1  3  3
2  3  3
>>> df1.setdiff(df2, distinct=True)
   a  b
0  2  2
size()

Number of values.

Parameters:

expr

Returns:

switch(*args, **kw)

Similar to the case-when in SQL. Refer to the example below

Parameters:
  • expr

  • args

  • kw

Returns:

sequence or scalar

Example:

>>> # if df.id == 3 then df.name
>>> # elif df.id == df.fid.abs() then df.name + 'test'
>>> # default: 'test'
>>> df.id.switch(3, df.name, df.fid.abs(), df.name + 'test', default='test')
tail(n=None, **kwargs)[source]

Return the last n rows. Execute at once.

Parameters:

n

Returns:

to_pandas(wrap=False, **kwargs)[source]

Convert to pandas Series. Execute at once.

Parameters:

wrap – if True, wrap the pandas DataFrame into a PyODPS DataFrame

Returns:

pandas Series

tolist(**kwargs)

Pack all data in the sequence into a list.

Parameters:
  • expr

  • unique – make every element in the sequence unique

Returns:

union(right, distinct=False)

Union two collections.

Parameters:
  • left – left collection

  • right – right collection

  • distinct

Returns:

collection

Example:

>>> df['name', 'id'].union(df2['id', 'name'])
value_counts(sort=True, ascending=False, dropna=False)

Return object containing counts of unique values.

The resulting object will be in descending order so that the first element is the most frequently occurring element. Exclude NA values by default

Parameters:
  • expr – sequence

  • sort (bool) – whether to sort the result

  • dropna – Don’t include counts of None, default False

Returns:

collection with two columns

Return type:

odps.df.expr.expressions.CollectionExpr
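
Example (a minimal sketch on an assumed string column ``name``):

>>> df.name.value_counts()[:10]  # the 10 most frequent names and their counts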

verify()

Verify if this expression can be compiled into ODPS SQL.

Returns:

True if compilation succeed else False

Return type:

bool

class odps.df.expr.expressions.Int64SequenceExpr(*args, **kwargs)[source]
kurt()

Calculate kurtosis of the sequence

Parameters:

expr

Returns:

kurtosis()

Calculate kurtosis of the sequence

Parameters:

expr

Returns:

mean()

Arithmetic mean.

Parameters:

expr

Returns:

median()

Median value.

Parameters:

expr

Returns:

moment(order, central=False)

Calculate the n-th order moment of the sequence

Parameters:
  • expr

  • order – moment order, must be an integer

  • central – if central moments are to be computed.

Returns:

quantile(prob=None, **kw)

Percentile value.

Parameters:
  • expr

  • prob – probability or list of probabilities, in [0, 1]

Returns:

skew()

Calculate skewness of the sequence

Parameters:

expr

Returns:

std(**kw)

Standard deviation.

Parameters:
  • expr

  • kw

Returns:

sum()

Sum value

Parameters:

expr

Returns:

to_datetime()

Return a sequence or scalar that is the datetime value of the current numeric sequence or scalar.

Parameters:

expr – sequence or scalar

Returns:

sequence or scalar

var(**kw)

Variance

Parameters:
  • expr

  • ddof – degree of freedom

  • kw

Returns:

class odps.df.expr.expressions.StringSequenceExpr(*args, **kwargs)[source]
capitalize()

Convert strings in the Sequence or string scalar to be capitalized. Equivalent to str.capitalize().

Parameters:

expr

Returns:

sequence or scalar

cat(others=None, sep=None, na_rep=None)

Concatenate strings in sequence with given separator

Parameters:
  • expr

  • others – other sequences

  • sep – string or None, default None

  • na_rep – string or None, default None; if None, NA values in the sequence are ignored

Returns:

contains(pat, case=True, flags=0, regex=True)

Return a boolean sequence indicating whether the given pattern/regex is contained in each string in the sequence

Parameters:
  • expr – sequence or scalar

  • pat – Character sequence or regular expression

  • case (bool) – If True, case sensitive

  • flags – re module flags, e.g. re.IGNORECASE

  • regex – If True use regex, otherwise use string finder

Returns:

sequence or scalar
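
Example (a minimal sketch on an assumed string column ``name``):

>>> df[df.name.contains('name1')]  # regex match by default
>>> df[df.name.contains('NAME1', case=False, regex=False)]  # plain, case-insensitive substring search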

count(*args, **kwargs)

Number of values.

Parameters:

expr

Returns:

endswith(pat)

Return boolean sequence or scalar indicating whether each string in the sequence or scalar ends with passed pattern. Equivalent to str.endswith().

Parameters:
  • expr

  • pat – Character sequence

Returns:

sequence or scalar

extract(pat, flags=0, group=0)

Find the group in each string in the sequence using the passed regular expression.

Parameters:
  • expr

  • pat – Pattern or regular expression

  • flags – re module, e.g. re.IGNORECASE

  • group – if None, group 0 is used

Returns:

sequence or scalar

find(sub, start=0, end=None)

Return the lowest index in each string in the sequence or scalar where the substring is fully contained between [start:end]. Return -1 on failure. Equivalent to standard str.find().

Parameters:
  • expr

  • sub – substring being searched

  • start – left edge index

  • end – right edge index

Returns:

sequence or scalar

get(index)

Extract element from lists, tuples, or strings in each element in the sequence or scalar

Parameters:
  • expr

  • index – Integer index(location)

Returns:

sequence or scalar

isalnum()

Check whether all characters in each string in the sequence or scalar are alphanumeric. Equivalent to str.isalnum().

Parameters:

expr

Returns:

boolean sequence or scalar

isalpha()

Check whether all characters in each string in the sequence or scalar are alphabetic. Equivalent to str.isalpha().

Parameters:

expr

Returns:

boolean sequence or scalar

isdecimal()

Check whether all characters in each string in the sequence or scalar are decimal. Equivalent to str.isdecimal().

Parameters:

expr

Returns:

boolean sequence or scalar

isdigit()

Check whether all characters in each string in the sequence or scalar are digits. Equivalent to str.isdigit().

Parameters:

expr

Returns:

boolean sequence or scalar

islower()

Check whether all characters in each string in the sequence or scalar are lowercase. Equivalent to str.islower().

Parameters:

expr

Returns:

boolean sequence or scalar

isnumeric()

Check whether all characters in each string in the sequence or scalar are numeric. Equivalent to str.isnumeric().

Parameters:

expr

Returns:

boolean sequence or scalar

isspace()

Check whether all characters in each string in the sequence or scalar are whitespace. Equivalent to str.isspace().

Parameters:

expr

Returns:

boolean sequence or scalar

istitle()

Check whether all characters in each string in the sequence or scalar are titlecase. Equivalent to str.istitle().

Parameters:

expr

Returns:

boolean sequence or scalar

isupper()

Check whether all characters in each string in the sequence or scalar are uppercase. Equivalent to str.isupper().

Parameters:

expr

Returns:

boolean sequence or scalar

len()

Compute length of each string in the sequence or scalar

Parameters:

expr

Returns:

lengths

ljust(width, fillchar=' ')

Fill the right side of strings in the sequence or scalar with an additional character. Equivalent to str.ljust().

Parameters:
  • expr

  • width – Minimum width of resulting string; additional characters will be filled with fillchar

  • fillchar – Additional character for filling, default is whitespace.

Returns:

sequence or scalar

lower()

Convert strings in the sequence or scalar to lowercase. Equivalent to str.lower().

Parameters:

expr

Returns:

sequence or scalar

lstrip(to_strip=None)

Strip whitespace (including newlines) from the left side of each string in the sequence or scalar. Equivalent to str.lstrip().

Parameters:
  • expr

  • to_strip

Returns:

sequence or scalar

pad(width, side='left', fillchar=' ')

Pad strings in the sequence or scalar with an additional character to specified side.

Parameters:
  • expr

  • width – Minimum width of resulting string; additional characters will be filled with spaces

  • side – {‘left’, ‘right’, ‘both’}, default ‘left’

  • fillchar – Additional character for filling, default is whitespace

Returns:

sequence or scalar

repeat(repeats)

Duplicate each string in the sequence or scalar by indicated number of times.

Parameters:
  • expr

  • repeats – times

Returns:

sequence or scalar

replace(pat, repl, n=-1, case=True, flags=0, regex=True)

Replace occurrence of pattern/regex in the sequence or scalar with some other string. Equivalent to str.replace()

Parameters:
  • expr

  • pat – Character sequence or regular expression

  • repl – Replacement

  • n – Number of replacements to make from start

  • case – if True, case sensitive

  • flags – re module flag, e.g. re.IGNORECASE

Returns:

sequence or scalar
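
Example (a sketch, assuming df has a string column name):

>>> df.name.replace('foo', 'bar', regex=False)      # literal replacement
>>> df.name.replace(r'\s+', ' ', regex=True)        # collapse runs of whitespace via regex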

rfind(sub, start=0, end=None)

Return the highest index in each string in the sequence or scalar where the substring is fully contained within [start:end]. Return -1 on failure. Equivalent to standard str.rfind().

Parameters:
  • expr

  • sub

  • start

  • end

Returns:

sequence or scalar

rjust(width, fillchar=' ')

Fill the left side of strings in the sequence or scalar with an additional character. Equivalent to str.rjust().

Parameters:
  • expr

  • width – Minimum width of resulting string; additional characters will be filled with fillchar

  • fillchar – Additional character for filling, default is whitespace.

Returns:

sequence or scalar

rstrip(to_strip=None)

Strip whitespace (including newlines) from the right side of each string in the sequence or scalar. Equivalent to str.rstrip().

Parameters:
  • expr

  • to_strip

Returns:

sequence or scalar

slice(start=None, stop=None, step=None)

Slice substrings from each element in the sequence or scalar

Parameters:
  • expr

  • start – int or None

  • stop – int or None

  • step – int or None

Returns:

sliced

split(pat=None, n=-1)

Split each string (a la re.split) in the sequence or scalar by the given pattern, propagating NA values. Equivalent to str.split().

Parameters:
  • expr

  • pat – Separator to split on. If None, splits on whitespace

  • n – not supported right now

Returns:

list sequence or scalar
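
Example (illustrative only; the column tags is an assumption, holding strings like 'a,b,c'):

>>> df.tags.split(',')  # list-typed sequence, e.g. ['a', 'b', 'c'] per row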

startswith(pat)

Return boolean sequence or scalar indicating whether each string in the sequence or scalar starts with passed pattern. Equivalent to str.startswith().

Parameters:
  • expr

  • pat – Character sequence

Returns:

sequence or scalar

strip(to_strip=None)

Strip whitespace (including newlines) from each string in the sequence or scalar from left and right sides. Equivalent to str.strip().

Parameters:
  • expr

  • to_strip

Returns:

sequence or scalar

strptime(date_format)

Return datetimes parsed according to date_format, which supports the same format strings as the Python standard library. Details of the format can be found in the Python strftime/strptime documentation.

Parameters:
  • expr

  • date_format (str) – date format string (e.g. “%Y-%m-%d”)

Returns:
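
Example (a sketch; assumes df.ds holds date strings like '2023-01-31'):

>>> df.ds.strptime('%Y-%m-%d')  # parse into a datetime sequence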

sum()

Sum value

Parameters:

expr

Returns:

swapcase()

Convert strings in the sequence or scalar to be swapcased. Equivalent to str.swapcase().

Parameters:

expr

Returns:

converted

title()

Convert strings in the sequence or scalar to titlecase. Equivalent to str.title().

Parameters:

expr

Returns:

converted

todict(item_delim=',', kv_delim='=')

Convert the string sequence / expr into a string dict given item and key-value delimiters.

Parameters:
  • expr

  • item_delim – delimiter between data items

  • kv_delim – delimiter between keys and values

Returns:

dict sequence or scalar
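
Example (illustrative only; the column kv is an assumption, holding strings like 'a=1,b=2'):

>>> df.kv.todict(item_delim=',', kv_delim='=')  # dict-typed sequence, e.g. {'a': '1', 'b': '2'} per row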

upper()

Convert strings in the sequence or scalar to uppercase. Equivalent to str.upper().

Parameters:

expr

Returns:

sequence or scalar

zfill(width)

Fill the left side of strings in the sequence or scalar with 0. Equivalent to str.zfill().

Parameters:
  • expr

  • width – Minimum width of resulting string; additional characters will be filled with 0

Returns:

filled
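
Example (a sketch; the string column code is an assumption):

>>> df.code.zfill(6)  # '42' becomes '000042'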

class odps.df.Scalar(*args, **kwargs)[source]

Represents the scalar type.

Parameters:
  • _value – value of the scalar

  • _value_type – value type of the scalar

Example:

>>> df[df, Scalar(4).rename('append_const')]
ast()

Return the AST string.

Returns:

AST tree

Return type:

str

between(left, right, inclusive=True)

Return a boolean sequence or scalar showing whether each element is between left and right.

Parameters:
  • expr – sequence or scalar

  • left – left value

  • right – right value

  • inclusive – if True, the condition is left <= expr <= right; otherwise left < expr < right

Returns:

boolean sequence or scalar
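
Example (illustrative, assuming df has a numeric column id):

>>> df[df.id.between(10, 20)]                   # 10 <= id <= 20
>>> df[df.id.between(10, 20, inclusive=False)]  # 10 < id < 20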

compile()

Compile this expression into an ODPS SQL

Returns:

compiled DAG

Return type:

str

cut(bins, right=True, labels=None, include_lowest=False, include_under=False, include_over=False)

Return indices of half-open bins to which each value of expr belongs.

Parameters:
  • expr – sequence or scalar

  • bins – list of scalars

  • right – indicates whether the bins include the rightmost edge or not. If right == True (the default), then the bins [1, 2, 3, 4] indicate (1, 2], (2, 3], (3, 4]

  • labels – Used as labels for the resulting bins. Must be of the same length as the resulting bins.

  • include_lowest – Whether the first interval should be left-inclusive or not.

  • include_under – include the bin below the leftmost edge or not

  • include_over – include the bin above the rightmost edge or not

Returns:

sequence or scalar
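
Example (a sketch, assuming df has an integer column age):

>>> df.age.cut([0, 18, 35, 60], labels=['minor', 'young', 'middle'])  # three right-closed bins
>>> df.age.cut([0, 18, 35, 60], include_over=True, labels=['minor', 'young', 'middle', 'senior'])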

execute(**kwargs)
Parameters:
  • hints (dict) – settings for SQL, e.g. odps.sql.mapper.split.size

  • priority (int) – instance priority, 9 as default

  • running_cluster – cluster to run this instance

Returns:

execution result

Return type:

odps.df.backends.frame.ResultFrame

fillna(value)

Fill null with value.

Parameters:
  • expr – sequence or scalar

  • value – value to fill into

Returns:

sequence or scalar
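
Example (illustrative, assuming df has a numeric column fid that may contain nulls):

>>> df.fid.fillna(0)  # replace nulls with 0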

hash(func=None)

Calculate the hash value.

Parameters:
  • expr

  • func – hash function

Returns:

inner_join(right, on=None, suffixes=('_x', '_y'), mapjoin=False, skewjoin=False)

Inner join two collections.

If on is not specified, the common fields of the left and right collections are used. If column names conflict, suffixes are appended automatically; for example, if both left and right have a field named col, the joined collection will contain col_x and col_y.

Parameters:
  • left – left collection

  • right – right collection

  • on – fields to join on

  • suffixes – when names conflict, the suffixes will be added to both columns.

Returns:

collection

Example:

>>> df.dtypes.names
['name', 'id']
>>> df2.dtypes.names
['name', 'id1']
>>> df.inner_join(df2)
>>> df.inner_join(df2, on='name')
>>> df.inner_join(df2, on=('id', 'id1'))
>>> df.inner_join(df2, on=['name', ('id', 'id1')])
>>> df.inner_join(df2, on=[df.name == df2.name, df.id == df2.id1])
isin(values)

Return a boolean sequence or scalar showing whether each element is exactly contained in the passed values.

Parameters:
  • expr – sequence or scalar

  • values – list object or sequence

Returns:

boolean sequence or scalar
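
Example (a sketch, assuming df has a string column name):

>>> df[df.name.isin(['Alice', 'Bob'])]  # keep rows whose name is in the given list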

isna()

Return a sequence or scalar according to the input indicating if the values are null.

Parameters:

expr – sequence or scalar

Returns:

sequence or scalar

isnull()

Return a sequence or scalar according to the input indicating if the values are null.

Parameters:

expr – sequence or scalar

Returns:

sequence or scalar

join(right, on=None, how='inner', suffixes=('_x', '_y'), mapjoin=False, skewjoin=False)

Join two collections.

If on is not specified, the common fields of the left and right collections are used. If column names conflict, suffixes are appended automatically; for example, if both left and right have a field named col, the joined collection will contain col_x and col_y.

Parameters:
  • left – left collection

  • right – right collection

  • on – fields to join on

  • how – ‘inner’, ‘left’, ‘right’, or ‘outer’

  • suffixes – when names conflict, the suffixes will be added to both columns.

  • mapjoin – whether to use mapjoin, default value False.

  • skewjoin – whether to use skewjoin, default value False. Can be True if the collection is skewed, a list specifying columns with skew values, or a list of dicts specifying skew combinations.

Returns:

collection

Example:

>>> df.dtypes.names
['name', 'id']
>>> df2.dtypes.names
['name', 'id1']
>>> df.join(df2)
>>> df.join(df2, on='name')
>>> df.join(df2, on=('id', 'id1'))
>>> df.join(df2, on=['name', ('id', 'id1')])
>>> df.join(df2, on=[df.name == df2.name, df.id == df2.id1])
>>> df.join(df2, mapjoin=False)
>>> df.join(df2, skewjoin=True)
>>> df.join(df2, skewjoin=["c0", "c1"])
>>> df.join(df2, skewjoin=[{"c0": 1, "c1": "2"}, {"c0": 3, "c1": "4"}])
left_join(right, on=None, suffixes=('_x', '_y'), mapjoin=False, merge_columns=None, skewjoin=False)

Left join two collections.

If on is not specified, the common fields of the left and right collections are used. If column names conflict, suffixes are appended automatically; for example, if both left and right have a field named col, the joined collection will contain col_x and col_y.

Parameters:
  • left – left collection

  • right – right collection

  • on – fields to join on

  • suffixes – when names conflict, the suffixes will be added to both columns.

  • mapjoin – whether to use mapjoin, default value False.

  • merge_columns – whether to merge columns with the same name into one column without suffix. If the value is True, columns in the predicate with the same names will be merged, taking the non-null value. If the value is ‘left’ or ‘right’, the values of predicates on the left / right collection will be taken. You can also pass a dictionary to describe the behavior of each column, such as { ‘a’: ‘auto’, ‘b’: ‘left’ }.

Returns:

collection

Example:

>>> df.dtypes.names
['name', 'id']
>>> df2.dtypes.names
['name', 'id1']
>>> df.left_join(df2)
>>> df.left_join(df2, on='name')
>>> df.left_join(df2, on=('id', 'id1'))
>>> df.left_join(df2, on=['name', ('id', 'id1')])
>>> df.left_join(df2, on=[df.name == df2.name, df.id == df2.id1])
map(func, rtype=None, resources=None, args=(), **kwargs)

Call func on each element of this sequence.

Parameters:
  • func – lambda, function, odps.models.Function, or str which is the name of an odps.models.Function

  • rtype – if not provided, will be the dtype of this sequence

Returns:

a new sequence

Example:

>>> df.id.map(lambda x: x + 1)
notin(values)

Return a boolean sequence or scalar showing whether each element is not contained in the passed values.

Parameters:
  • expr – sequence or scalar

  • values – list object or sequence

Returns:

boolean sequence or scalar

notna()

Return a sequence or scalar according to the input indicating if the values are not null.

Parameters:

expr – sequence or scalar

Returns:

sequence or scalar

notnull()

Return a sequence or scalar according to the input indicating if the values are not null.

Parameters:

expr – sequence or scalar

Returns:

sequence or scalar

outer_join(right, on=None, suffixes=('_x', '_y'), mapjoin=False, merge_columns=None, skewjoin=False)

Outer join two collections.

If on is not specified, the common fields of the left and right collections are used. If column names conflict, suffixes are appended automatically; for example, if both left and right have a field named col, the joined collection will contain col_x and col_y.

Parameters:
  • left – left collection

  • right – right collection

  • on – fields to join on

  • suffixes – when names conflict, the suffixes will be added to both columns.

  • mapjoin – whether to use mapjoin, default value False.

  • merge_columns – whether to merge columns with the same name into one column without suffix. If the value is True, columns in the predicate with the same names will be merged, taking the non-null value. If the value is ‘left’ or ‘right’, the values of predicates on the left / right collection will be taken. You can also pass a dictionary to describe the behavior of each column, such as { ‘a’: ‘auto’, ‘b’: ‘left’ }.

Returns:

collection

Example:

>>> df.dtypes.names
['name', 'id']
>>> df2.dtypes.names
['name', 'id1']
>>> df.outer_join(df2)
>>> df.outer_join(df2, on='name')
>>> df.outer_join(df2, on=('id', 'id1'))
>>> df.outer_join(df2, on=['name', ('id', 'id1')])
>>> df.outer_join(df2, on=[df.name == df2.name, df.id == df2.id1])
persist(name, partitions=None, partition=None, lifecycle=None, project=None, **kwargs)

Persist the execution result into a new table. If partitions is not specified, a new table without partitions will be created if it does not exist, and the SQL result will be inserted into it. If partitions is specified, those fields will be the partition fields of the new table. If partition is specified, the data will be inserted into that exact partition of the table.

Parameters:
  • name – table name

  • partitions (list) – list of string, the partition fields

  • partition (string or PartitionSpec) – persist to a specified partition

  • lifecycle (int) – table lifecycle. If absent, options.lifecycle will be used.

  • project – project name, if not provided, will be the default project

  • hints (dict) – settings for SQL, e.g. odps.sql.mapper.split.size

  • priority (int) – instance priority, 9 as default

  • running_cluster – cluster to run this instance

  • overwrite (bool) – overwrite the table, True as default

  • drop_table (bool) – drop table if exists, False as default

  • create_table (bool) – create the table first if it does not exist, True as default

  • drop_partition (bool) – drop partition if exists, False as default

  • create_partition (bool) – create partition if not exists, None as default

  • cast (bool) – cast all columns’ types to those of the existing table, False as default

Returns:

odps.df.DataFrame

Example:

>>> df = df['name', 'id', 'ds']
>>> df.persist('odps_new_table')
>>> df.persist('odps_new_table', partition='pt=test')
>>> df.persist('odps_new_table', partitions=['ds'])
right_join(right, on=None, suffixes=('_x', '_y'), mapjoin=False, merge_columns=None, skewjoin=False)

Right join two collections.

If on is not specified, the common fields of the left and right collections are used. If column names conflict, suffixes are appended automatically; for example, if both left and right have a field named col, the joined collection will contain col_x and col_y.

Parameters:
  • left – left collection

  • right – right collection

  • on – fields to join on

  • suffixes – when names conflict, the suffixes will be added to both columns.

  • mapjoin – whether to use mapjoin, default value False.

  • merge_columns – whether to merge columns with the same name into one column without suffix. If the value is True, columns in the predicate with the same names will be merged, taking the non-null value. If the value is ‘left’ or ‘right’, the values of predicates on the left / right collection will be taken. You can also pass a dictionary to describe the behavior of each column, such as { ‘a’: ‘auto’, ‘b’: ‘left’ }.

Returns:

collection

Example:

>>> df.dtypes.names
['name', 'id']
>>> df2.dtypes.names
['name', 'id1']
>>> df.right_join(df2)
>>> df.right_join(df2, on='name')
>>> df.right_join(df2, on=('id', 'id1'))
>>> df.right_join(df2, on=['name', ('id', 'id1')])
>>> df.right_join(df2, on=[df.name == df2.name, df.id == df2.id1])
switch(*args, **kw)

Similar to the case-when statement in SQL. Refer to the example below.

Parameters:
  • expr

  • args

  • kw

Returns:

sequence or scalar

Example:

>>> # if df.id == 3 then df.name
>>> # elif df.id == df.fid.abs() then df.name + 'test'
>>> # default: 'test'
>>> df.id.switch(3, df.name, df.fid.abs(), df.name + 'test', default='test')
verify()

Verify if this expression can be compiled into ODPS SQL.

Returns:

True if compilation succeed else False

Return type:

bool

odps.df.NullScalar(tp)[source]

Creates a Scalar representing typed None values.

Parameters:

tp – type of the scalar

Returns:

Scalar with None value
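
Example (an illustrative sketch; the column name null_col is hypothetical):

>>> df[df, NullScalar('string').rename('null_col')]  # append a typed NULL column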

class odps.df.RandomScalar(seed=None, **kw)[source]

Represents the random scalar type.

Parameters:

seed – random seed, None by default

Example:

>>> df[df, RandomScalar().rename('append_random')]
ast()

Return the AST string.

Returns:

AST tree

Return type:

str

between(left, right, inclusive=True)

Return a boolean sequence or scalar showing whether each element is between left and right.

Parameters:
  • expr – sequence or scalar

  • left – left value

  • right – right value

  • inclusive – if True, the condition is left <= expr <= right; otherwise left < expr < right

Returns:

boolean sequence or scalar

compile()

Compile this expression into an ODPS SQL

Returns:

compiled DAG

Return type:

str

cut(bins, right=True, labels=None, include_lowest=False, include_under=False, include_over=False)

Return indices of half-open bins to which each value of expr belongs.

Parameters:
  • expr – sequence or scalar

  • bins – list of scalars

  • right – indicates whether the bins include the rightmost edge or not. If right == True (the default), then the bins [1, 2, 3, 4] indicate (1, 2], (2, 3], (3, 4]

  • labels – Used as labels for the resulting bins. Must be of the same length as the resulting bins.

  • include_lowest – Whether the first interval should be left-inclusive or not.

  • include_under – include the bin below the leftmost edge or not

  • include_over – include the bin above the rightmost edge or not

Returns:

sequence or scalar

execute(**kwargs)
Parameters:
  • hints (dict) – settings for SQL, e.g. odps.sql.mapper.split.size

  • priority (int) – instance priority, 9 as default

  • running_cluster – cluster to run this instance

Returns:

execution result

Return type:

odps.df.backends.frame.ResultFrame

fillna(value)

Fill null with value.

Parameters:
  • expr – sequence or scalar

  • value – value to fill into

Returns:

sequence or scalar

hash(func=None)

Calculate the hash value.

Parameters:
  • expr

  • func – hash function

Returns:

inner_join(right, on=None, suffixes=('_x', '_y'), mapjoin=False, skewjoin=False)

Inner join two collections.

If on is not specified, the common fields of the left and right collections are used. If column names conflict, suffixes are appended automatically; for example, if both left and right have a field named col, the joined collection will contain col_x and col_y.

Parameters:
  • left – left collection

  • right – right collection

  • on – fields to join on

  • suffixes – when names conflict, the suffixes will be added to both columns.

Returns:

collection

Example:

>>> df.dtypes.names
['name', 'id']
>>> df2.dtypes.names
['name', 'id1']
>>> df.inner_join(df2)
>>> df.inner_join(df2, on='name')
>>> df.inner_join(df2, on=('id', 'id1'))
>>> df.inner_join(df2, on=['name', ('id', 'id1')])
>>> df.inner_join(df2, on=[df.name == df2.name, df.id == df2.id1])
isin(values)

Return a boolean sequence or scalar showing whether each element is exactly contained in the passed values.

Parameters:
  • expr – sequence or scalar

  • values – list object or sequence

Returns:

boolean sequence or scalar

isna()

Return a sequence or scalar according to the input indicating if the values are null.

Parameters:

expr – sequence or scalar

Returns:

sequence or scalar

isnull()

Return a sequence or scalar according to the input indicating if the values are null.

Parameters:

expr – sequence or scalar

Returns:

sequence or scalar

join(right, on=None, how='inner', suffixes=('_x', '_y'), mapjoin=False, skewjoin=False)

Join two collections.

If on is not specified, the common fields of the left and right collections are used. If column names conflict, suffixes are appended automatically; for example, if both left and right have a field named col, the joined collection will contain col_x and col_y.

Parameters:
  • left – left collection

  • right – right collection

  • on – fields to join on

  • how – ‘inner’, ‘left’, ‘right’, or ‘outer’

  • suffixes – when names conflict, the suffixes will be added to both columns.

  • mapjoin – whether to use mapjoin, default value False.

  • skewjoin – whether to use skewjoin, default value False. Can be True if the collection is skewed, a list specifying columns with skew values, or a list of dicts specifying skew combinations.

Returns:

collection

Example:

>>> df.dtypes.names
['name', 'id']
>>> df2.dtypes.names
['name', 'id1']
>>> df.join(df2)
>>> df.join(df2, on='name')
>>> df.join(df2, on=('id', 'id1'))
>>> df.join(df2, on=['name', ('id', 'id1')])
>>> df.join(df2, on=[df.name == df2.name, df.id == df2.id1])
>>> df.join(df2, mapjoin=False)
>>> df.join(df2, skewjoin=True)
>>> df.join(df2, skewjoin=["c0", "c1"])
>>> df.join(df2, skewjoin=[{"c0": 1, "c1": "2"}, {"c0": 3, "c1": "4"}])
left_join(right, on=None, suffixes=('_x', '_y'), mapjoin=False, merge_columns=None, skewjoin=False)

Left join two collections.

If on is not specified, the common fields of the left and right collections are used. If column names conflict, suffixes are appended automatically; for example, if both left and right have a field named col, the joined collection will contain col_x and col_y.

Parameters:
  • left – left collection

  • right – right collection

  • on – fields to join on

  • suffixes – when names conflict, the suffixes will be added to both columns.

  • mapjoin – whether to use mapjoin, default value False.

  • merge_columns – whether to merge columns with the same name into one column without suffix. If the value is True, columns in the predicate with the same names will be merged, taking the non-null value. If the value is ‘left’ or ‘right’, the values of predicates on the left / right collection will be taken. You can also pass a dictionary to describe the behavior of each column, such as { ‘a’: ‘auto’, ‘b’: ‘left’ }.

Returns:

collection

Example:

>>> df.dtypes.names
['name', 'id']
>>> df2.dtypes.names
['name', 'id1']
>>> df.left_join(df2)
>>> df.left_join(df2, on='name')
>>> df.left_join(df2, on=('id', 'id1'))
>>> df.left_join(df2, on=['name', ('id', 'id1')])
>>> df.left_join(df2, on=[df.name == df2.name, df.id == df2.id1])
map(func, rtype=None, resources=None, args=(), **kwargs)

Call func on each element of this sequence.

Parameters:
  • func – lambda, function, odps.models.Function, or str which is the name of an odps.models.Function

  • rtype – if not provided, will be the dtype of this sequence

Returns:

a new sequence

Example:

>>> df.id.map(lambda x: x + 1)
notin(values)

Return a boolean sequence or scalar showing whether each element is not contained in the passed values.

Parameters:
  • expr – sequence or scalar

  • values – list object or sequence

Returns:

boolean sequence or scalar

notna()

Return a sequence or scalar according to the input indicating if the values are not null.

Parameters:

expr – sequence or scalar

Returns:

sequence or scalar

notnull()

Return a sequence or scalar according to the input indicating if the values are not null.

Parameters:

expr – sequence or scalar

Returns:

sequence or scalar

outer_join(right, on=None, suffixes=('_x', '_y'), mapjoin=False, merge_columns=None, skewjoin=False)

Outer join two collections.

If on is not specified, the common fields of the left and right collections are used. If column names conflict, suffixes are appended automatically; for example, if both left and right have a field named col, the joined collection will contain col_x and col_y.

Parameters:
  • left – left collection

  • right – right collection

  • on – fields to join on

  • suffixes – when names conflict, the suffixes will be added to both columns.

  • mapjoin – whether to use mapjoin, default value False.

  • merge_columns – whether to merge columns with the same name into one column without suffix. If the value is True, columns in the predicate with the same names will be merged, taking the non-null value. If the value is ‘left’ or ‘right’, the values of predicates on the left / right collection will be taken. You can also pass a dictionary to describe the behavior of each column, such as { ‘a’: ‘auto’, ‘b’: ‘left’ }.

Returns:

collection

Example:

>>> df.dtypes.names
['name', 'id']
>>> df2.dtypes.names
['name', 'id1']
>>> df.outer_join(df2)
>>> df.outer_join(df2, on='name')
>>> df.outer_join(df2, on=('id', 'id1'))
>>> df.outer_join(df2, on=['name', ('id', 'id1')])
>>> df.outer_join(df2, on=[df.name == df2.name, df.id == df2.id1])
persist(name, partitions=None, partition=None, lifecycle=None, project=None, **kwargs)

Persist the execution result into a new table. If partitions is not specified, a new table without partitions will be created if it does not exist, and the SQL result will be inserted into it. If partitions is specified, those fields will be the partition fields of the new table. If partition is specified, the data will be inserted into that exact partition of the table.

Parameters:
  • name – table name

  • partitions (list) – list of string, the partition fields

  • partition (string or PartitionSpec) – persist to a specified partition

  • lifecycle (int) – table lifecycle. If absent, options.lifecycle will be used.

  • project – project name, if not provided, will be the default project

  • hints (dict) – settings for SQL, e.g. odps.sql.mapper.split.size

  • priority (int) – instance priority, 9 as default

  • running_cluster – cluster to run this instance

  • overwrite (bool) – overwrite the table, True as default

  • drop_table (bool) – drop table if exists, False as default

  • create_table (bool) – create the table first if it does not exist, True as default

  • drop_partition (bool) – drop partition if exists, False as default

  • create_partition (bool) – create partition if not exists, None as default

  • cast (bool) – cast all columns’ types to those of the existing table, False as default

Returns:

odps.df.DataFrame

Example:

>>> df = df['name', 'id', 'ds']
>>> df.persist('odps_new_table')
>>> df.persist('odps_new_table', partition='pt=test')
>>> df.persist('odps_new_table', partitions=['ds'])
right_join(right, on=None, suffixes=('_x', '_y'), mapjoin=False, merge_columns=None, skewjoin=False)

Right join two collections.

If on is not specified, the common fields of the left and right collections are used. If column names conflict, suffixes are appended automatically; for example, if both left and right have a field named col, the joined collection will contain col_x and col_y.

Parameters:
  • left – left collection

  • right – right collection

  • on – fields to join on

  • suffixes – when names conflict, the suffixes will be added to both columns.

  • mapjoin – whether to use mapjoin, default value False.

  • merge_columns – whether to merge columns with the same name into one column without suffix. If the value is True, columns in the predicate with the same names will be merged, taking the non-null value. If the value is ‘left’ or ‘right’, the values of predicates on the left / right collection will be taken. You can also pass a dictionary to describe the behavior of each column, such as { ‘a’: ‘auto’, ‘b’: ‘left’ }.

Returns:

collection

Example:

>>> df.dtypes.names
['name', 'id']
>>> df2.dtypes.names
['name', 'id1']
>>> df.right_join(df2)
>>> df.right_join(df2, on='name')
>>> df.right_join(df2, on=('id', 'id1'))
>>> df.right_join(df2, on=['name', ('id', 'id1')])
>>> df.right_join(df2, on=[df.name == df2.name, df.id == df2.id1])
switch(*args, **kw)

Similar to the case-when statement in SQL. Refer to the example below.

Parameters:
  • expr

  • args

  • kw

Returns:

sequence or scalar

Example:

>>> # if df.id == 3 then df.name
>>> # elif df.id == df.fid.abs() then df.name + 'test'
>>> # default: 'test'
>>> df.id.switch(3, df.name, df.fid.abs(), df.name + 'test', default='test')
verify()

Verify if this expression can be compiled into ODPS SQL.

Returns:

True if compilation succeed else False

Return type:

bool

class odps.df.expr.groupby.GroupBy(*args, **kwargs)[source]
all()

All is True.

Parameters:

expr

Returns:

any()

Any is True.

Parameters:

expr

Returns:

ast()

Return the AST string.

Returns:

AST tree

Return type:

str

compile()

Compile this expression into an ODPS SQL

Returns:

compiled DAG

Return type:

str

count()

Value counts

Parameters:

expr

Returns:

cume_dist(sort=None, ascending=True)

Calculate cumulative ratio of a sequence expression.

Parameters:
  • expr – expression for calculation

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

Returns:

calculated column

dense_rank(sort=None, ascending=True)

Calculate dense rank of a sequence expression.

Parameters:
  • expr – expression for calculation

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

Returns:

calculated column

execute(**kwargs)
Parameters:
  • hints (dict) – settings for SQL, e.g. odps.sql.mapper.split.size

  • priority (int) – instance priority, 9 as default

  • running_cluster – cluster to run this instance

Returns:

execution result

Return type:

odps.df.backends.frame.ResultFrame

kurt()

Calculate kurtosis of the sequence

Parameters:

expr

Returns:

kurtosis()

Calculate kurtosis of the sequence

Parameters:

expr

Returns:

max()

Max value

Parameters:

expr

Returns:

mean()

Arithmetic mean.

Parameters:

expr

Returns:

median()

Median value.

Parameters:

expr

Returns:

min()

Min value

Parameters:

expr

Returns:

min_rank(sort=None, ascending=True)

Calculate rank of a sequence expression.

Parameters:
  • expr – expression for calculation

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

Returns:

calculated column

moment(order, central=False)

Calculate the n-th order moment of the sequence

Parameters:
  • expr

  • order – moment order, must be an integer

  • central – if central moments are to be computed.

Returns:

nth_value(nth, skip_nulls=False, sort=None, ascending=True)

Get nth value of a grouped and sorted expression.

Parameters:
  • expr – expression for calculation

  • nth – integer position

  • skip_nulls – whether to skip null values, False by default

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

Returns:

calculated column

nunique()

The distinct count.

Parameters:

expr

Returns:

percent_rank(sort=None, ascending=True)

Calculate percentage rank of a sequence expression.

Parameters:
  • expr – expression for calculation

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

Returns:

calculated column

persist(name, partitions=None, partition=None, lifecycle=None, project=None, **kwargs)

Persist the execution result into a new table. If partitions is not specified, a new table without partitions will be created if it does not exist, and the SQL result will be inserted into it. If partitions is specified, those fields will be the partition fields of the new table. If partition is specified, the data will be inserted into that exact partition of the table.

Parameters:
  • name – table name

  • partitions (list) – list of string, the partition fields

  • partition (string or PartitionSpec) – persist to a specified partition

  • lifecycle (int) – table lifecycle. If absent, options.lifecycle will be used.

  • project – project name, if not provided, will be the default project

  • hints (dict) – settings for SQL, e.g. odps.sql.mapper.split.size

  • priority (int) – instance priority, 9 as default

  • running_cluster – cluster to run this instance

  • overwrite (bool) – overwrite the table, True as default

  • drop_table (bool) – drop table if exists, False as default

  • create_table (bool) – create the table first if it does not exist, True as default

  • drop_partition (bool) – drop partition if exists, False as default

  • create_partition (bool) – create partition if not exists, None as default

  • cast (bool) – cast all columns’ types to those of the existing table, False as default

Returns:

odps.df.DataFrame

Example:

>>> df = df['name', 'id', 'ds']
>>> df.persist('odps_new_table')
>>> df.persist('odps_new_table', partition='pt=test')
>>> df.persist('odps_new_table', partitions=['ds'])
qcut(bins, labels=False, sort=None, ascending=True)

Get quantile-based bin indices of every element of a grouped and sorted expression. The indices of bins start from 0. If cuts are not of equal sizes, extra items will be appended into the first group.

Parameters:
  • expr – expression for calculation

  • bins – number of bins

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

Returns:

calculated column

quantile(prob=None, **kw)

Percentile value.

Parameters:
  • expr

  • prob – probability or list of probabilities, in [0, 1]

Returns:

rank(sort=None, ascending=True)

Calculate rank of a sequence expression.

Parameters:
  • expr – expression for calculation

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

Returns:

calculated column

row_number(sort=None, ascending=True)

Calculate row number of a sequence expression.

Parameters:
  • expr – expression for calculation

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

Returns:

calculated column

size()

Value counts

Parameters:

expr

Returns:

skew()

Calculate skewness of the sequence

Parameters:

expr

Returns:

std(**kw)

Standard deviation.

Parameters:
  • expr

  • kw

Returns:

sum()

Sum value

Parameters:

expr

Returns:

tolist(**kwargs)

Pack all data in the sequence into a list.

Parameters:
  • expr

  • unique – make every element in the sequence unique

Returns:

var(**kw)

Variance

Parameters:
  • expr

  • ddof – degree of freedom

  • kw

Returns:

verify()

Verify if this expression can be compiled into ODPS SQL.

Returns:

True if compilation succeed else False

Return type:

bool

class odps.df.expr.groupby.SequenceGroupBy(*args, **kwargs)[source]
ast()

Return the AST string.

Returns:

AST tree

Return type:

str

compile()

Compile this expression into an ODPS SQL

Returns:

compiled DAG

Return type:

str

count()

Value counts

Parameters:

expr

Returns:

cumcount(sort=None, ascending=True, unique=False, preceding=None, following=None)

Calculate cumulative count of a sequence expression.

Parameters:
  • expr – expression for calculation

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

  • unique – whether to eliminate duplicate entries

  • preceding – the start point of a window

  • following – the end point of a window

Returns:

calculated column

cume_dist(sort=None, ascending=True)

Calculate cumulative ratio of a sequence expression.

Parameters:
  • expr – expression for calculation

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

Returns:

calculated column

cummax(sort=None, ascending=True, unique=False, preceding=None, following=None)

Calculate cumulative maximum of a sequence expression.

Parameters:
  • expr – expression for calculation

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

  • unique – whether to eliminate duplicate entries

  • preceding – the start point of a window

  • following – the end point of a window

Returns:

calculated column

cummin(sort=None, ascending=True, unique=False, preceding=None, following=None)

Calculate cumulative minimum of a sequence expression.

Parameters:
  • expr – expression for calculation

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

  • unique – whether to eliminate duplicate entries

  • preceding – the start point of a window

  • following – the end point of a window

Returns:

calculated column

dense_rank(sort=None, ascending=True)

Calculate dense rank of a sequence expression.

Parameters:
  • expr – expression for calculation

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

Returns:

calculated column

execute(**kwargs)
Parameters:
  • hints (dict) – settings for SQL, e.g. odps.sql.mapper.split.size

  • priority (int) – instance priority, 9 as default

  • running_cluster – cluster to run this instance

Returns:

execution result

Return type:

odps.df.backends.frame.ResultFrame

hll_count(error_rate=0.01, splitter=None)

Calculate HyperLogLog count

Parameters:
  • expr

  • error_rate (float) – error rate

  • splitter – the splitter to split the column value

Returns:

sequence or scalar

Example:

>>> df = DataFrame(pd.DataFrame({'a': np.random.randint(100000, size=100000)}))
>>> df.a.hll_count()
63270
>>> df.a.nunique()
63250
lag(offset, default=None, sort=None, ascending=True)

Get value in the row offset rows prior to the current row.

Parameters:
  • offset – the offset value

  • default – default value for the function, when there are no rows satisfying the offset

  • expr – expression for calculation

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

Returns:

calculated column
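
Example (a sketch only; assumes df has columns name, id and a sort key ds, and that the grouped window expression can be projected directly):

>>> df['name', 'id', df.groupby('name').id.lag(1, sort='ds').rename('prev_id')]  # previous id within each name group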

lead(offset, default=None, sort=None, ascending=True)

Get value in the row offset rows after to the current row.

Parameters:
  • offset – the offset value

  • default – default value for the function, when there are no rows satisfying the offset

  • expr – expression for calculation

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

Returns:

calculated column

max()

Max value

Parameters:

expr

Returns:

min()

Min value

Parameters:

expr

Returns:

min_rank(sort=None, ascending=True)

Calculate rank of a sequence expression.

Parameters:
  • expr – expression for calculation

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

Returns:

calculated column

nth_value(nth, skip_nulls=False, sort=None, ascending=True)

Get nth value of a grouped and sorted expression.

Parameters:
  • expr – expression for calculation

  • nth – integer position

  • skip_nulls – whether to skip null values, False by default

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

Returns:

calculated column

nunique()

The distinct count.

Parameters:

expr

Returns:

percent_rank(sort=None, ascending=True)

Calculate percentage rank of a sequence expression.

Parameters:
  • expr – expression for calculation

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

Returns:

calculated column

persist(name, partitions=None, partition=None, lifecycle=None, project=None, **kwargs)

Persist the execution result into a new table. If partitions is not specified, a new table without partitions will be created if it does not exist, and the SQL result will be inserted into it. If partitions is specified, those fields will be the partition fields of the new table. If partition is specified, the data will be inserted into that exact partition of the table.

Parameters:
  • name – table name

  • partitions (list) – list of string, the partition fields

  • partition (string or PartitionSpec) – persist to a specified partition

  • lifecycle (int) – table lifecycle. If absent, options.lifecycle will be used.

  • project – project name, if not provided, will be the default project

  • hints (dict) – settings for SQL, e.g. odps.sql.mapper.split.size

  • priority (int) – instance priority, 9 as default

  • running_cluster – cluster to run this instance

  • overwrite (bool) – overwrite the table, True as default

  • drop_table (bool) – drop table if exists, False as default

  • create_table (bool) – create the table first if it does not exist, True as default

  • drop_partition (bool) – drop partition if exists, False as default

  • create_partition (bool) – create partition if not exists, None as default

  • cast (bool) – cast all columns’ types to those of the existing table, False as default

Returns:

odps.df.DataFrame

Example:

>>> df = df['name', 'id', 'ds']
>>> df.persist('odps_new_table')
>>> df.persist('odps_new_table', partition='pt=test')
>>> df.persist('odps_new_table', partitions=['ds'])
qcut(bins, labels=False, sort=None, ascending=True)

Get quantile-based bin indices of every element of a grouped and sorted expression. The indices of bins start from 0. If cuts are not of equal sizes, extra items will be appended into the first group.

Parameters:
  • expr – expression for calculation

  • bins – number of bins

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

Returns:

calculated column
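
Example (a sketch only; the projection pattern and the column names name and id are assumptions, following the signature above):

>>> df['name', 'id', df.groupby('name').id.qcut(4, sort='id').rename('id_bucket')]  # per-group quartile index, starting from 0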

rank(sort=None, ascending=True)

Calculate rank of a sequence expression.

Parameters:
  • expr – expression for calculation

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

Returns:

calculated column

row_number(sort=None, ascending=True)

Calculate row number of a sequence expression.

Parameters:
  • expr – expression for calculation

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

Returns:

calculated column

size()

Value counts

Parameters:

expr

Returns:

tolist(**kwargs)

Pack all data in the sequence into a list.

Parameters:
  • expr

  • unique – make every element in the sequence unique

Returns:

verify()

Verify if this expression can be compiled into ODPS SQL.

Returns:

True if compilation succeed else False

Return type:

bool

class odps.df.expr.groupby.Int64SequenceGroupBy(*args, **kwargs)[source]
ast()

Return the AST string.

Returns:

AST tree

Return type:

str

compile()

Compile this expression into an ODPS SQL

Returns:

compiled DAG

Return type:

str

count()

Value counts

Parameters:

expr

Returns:

cumcount(sort=None, ascending=True, unique=False, preceding=None, following=None)

Calculate cumulative count of a sequence expression.

Parameters:
  • expr – expression for calculation

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

  • unique – whether to eliminate duplicate entries

  • preceding – the start point of a window

  • following – the end point of a window

Returns:

calculated column

cume_dist(sort=None, ascending=True)

Calculate cumulative ratio of a sequence expression.

Parameters:
  • expr – expression for calculation

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

Returns:

calculated column

cummax(sort=None, ascending=True, unique=False, preceding=None, following=None)

Calculate cumulative maximum of a sequence expression.

Parameters:
  • expr – expression for calculation

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

  • unique – whether to eliminate duplicate entries

  • preceding – the start point of a window

  • following – the end point of a window

Returns:

calculated column

cummean(sort=None, ascending=True, unique=False, preceding=None, following=None)

Calculate cumulative mean of a sequence expression.

Parameters:
  • expr – expression for calculation

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

  • unique – whether to eliminate duplicate entries

  • preceding – the start point of a window

  • following – the end point of a window

Returns:

calculated column

cummedian(sort=None, ascending=True, unique=False, preceding=None, following=None)

Calculate cumulative median of a sequence expression.

Parameters:
  • expr – expression for calculation

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

  • unique – whether to eliminate duplicate entries

  • preceding – the start point of a window

  • following – the end point of a window

Returns:

calculated column

cummin(sort=None, ascending=True, unique=False, preceding=None, following=None)

Calculate cumulative minimum of a sequence expression.

Parameters:
  • expr – expression for calculation

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

  • unique – whether to eliminate duplicate entries

  • preceding – the start point of a window

  • following – the end point of a window

Returns:

calculated column

cumstd(sort=None, ascending=True, unique=False, preceding=None, following=None)

Calculate cumulative standard deviation of a sequence expression.

Parameters:
  • expr – expression for calculation

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

  • unique – whether to eliminate duplicate entries

  • preceding – the start point of a window

  • following – the end point of a window

Returns:

calculated column

cumsum(sort=None, ascending=True, unique=False, preceding=None, following=None)

Calculate cumulative summation of a sequence expression.

Parameters:
  • expr – expression for calculation

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

  • unique – whether to eliminate duplicate entries

  • preceding – the start point of a window

  • following – the end point of a window

Returns:

calculated column

dense_rank(sort=None, ascending=True)

Calculate dense rank of a sequence expression.

Parameters:
  • expr – expression for calculation

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

Returns:

calculated column

execute(**kwargs)
Parameters:
  • hints (dict) – settings for SQL, e.g. odps.sql.mapper.split.size

  • priority (int) – instance priority, 9 as default

  • running_cluster – cluster to run this instance

Returns:

execution result

Return type:

odps.df.backends.frame.ResultFrame

hll_count(error_rate=0.01, splitter=None)

Calculate HyperLogLog count

Parameters:
  • expr

  • error_rate (float) – error rate

  • splitter – the splitter to split the column value

Returns:

sequence or scalar

Example:

>>> df = DataFrame(pd.DataFrame({'a': np.random.randint(100000, size=100000)}))
>>> df.a.hll_count()
63270
>>> df.a.nunique()
63250
kurt()

Calculate kurtosis of the sequence

Parameters:

expr

Returns:

kurtosis()

Calculate kurtosis of the sequence

Parameters:

expr

Returns:

lag(offset, default=None, sort=None, ascending=True)

Get value in the row offset rows prior to the current row.

Parameters:
  • offset – the offset value

  • default – default value for the function, when there are no rows satisfying the offset

  • expr – expression for calculation

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

Returns:

calculated column

lead(offset, default=None, sort=None, ascending=True)

Get value in the row offset rows after to the current row.

Parameters:
  • offset – the offset value

  • default – default value for the function, when there are no rows satisfying the offset

  • expr – expression for calculation

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

Returns:

calculated column

max()

Max value

Parameters:

expr

Returns:

mean()

Arithmetic mean.

Parameters:

expr

Returns:

median()

Median value.

Parameters:

expr

Returns:

min()

Min value

Parameters:

expr

Returns:

min_rank(sort=None, ascending=True)

Calculate rank of a sequence expression.

Parameters:
  • expr – expression for calculation

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

Returns:

calculated column

moment(order, central=False)

Calculate the n-th order moment of the sequence

Parameters:
  • expr

  • order – moment order, must be an integer

  • central – if central moments are to be computed.

Returns:

nth_value(nth, skip_nulls=False, sort=None, ascending=True)

Get nth value of a grouped and sorted expression.

Parameters:
  • expr – expression for calculation

  • nth – integer position

  • skip_nulls – whether to skip null values, False by default

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

Returns:

calculated column

nunique()

The distinct count.

Parameters:

expr

Returns:

percent_rank(sort=None, ascending=True)

Calculate percentage rank of a sequence expression.

Parameters:
  • expr – expression for calculation

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

Returns:

calculated column

persist(name, partitions=None, partition=None, lifecycle=None, project=None, **kwargs)

Persist the execution result into a new table. If partitions is not specified, a new table without partitions will be created if it does not exist, and the SQL result will be inserted into it. If partitions is specified, those fields will be the partition fields of the new table. If partition is specified, the data will be inserted into that exact partition of the table.

Parameters:
  • name – table name

  • partitions (list) – list of string, the partition fields

  • partition (string or PartitionSpec) – persist to a specified partition

  • lifecycle (int) – table lifecycle. If absent, options.lifecycle will be used.

  • project – project name, if not provided, will be the default project

  • hints (dict) – settings for SQL, e.g. odps.sql.mapper.split.size

  • priority (int) – instance priority, 9 as default

  • running_cluster – cluster to run this instance

  • overwrite (bool) – overwrite the table, True as default

  • drop_table (bool) – drop table if exists, False as default

  • create_table (bool) – create the table first if it does not exist, True as default

  • drop_partition (bool) – drop partition if exists, False as default

  • create_partition (bool) – create partition if not exists, None as default

  • cast (bool) – cast all columns’ types to those of the existing table, False as default

Returns:

odps.df.DataFrame

Example:

>>> df = df['name', 'id', 'ds']
>>> df.persist('odps_new_table')
>>> df.persist('odps_new_table', partition='pt=test')
>>> df.persist('odps_new_table', partitions=['ds'])
qcut(bins, labels=False, sort=None, ascending=True)

Get quantile-based bin indices of every element of a grouped and sorted expression. The indices of bins start from 0. If cuts are not of equal sizes, extra items will be appended into the first group.

Parameters:
  • expr – expression for calculation

  • bins – number of bins

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

Returns:

calculated column

quantile(prob=None, **kw)

Percentile value.

Parameters:
  • expr

  • prob – probability or list of probabilities, in [0, 1]

Returns:

rank(sort=None, ascending=True)

Calculate rank of a sequence expression.

Parameters:
  • expr – expression for calculation

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

Returns:

calculated column

row_number(sort=None, ascending=True)

Calculate row number of a sequence expression.

Parameters:
  • expr – expression for calculation

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

Returns:

calculated column

size()

Value counts

Parameters:

expr

Returns:

skew()

Calculate skewness of the sequence

Parameters:

expr

Returns:

std(**kw)

Standard deviation.

Parameters:
  • expr

  • kw

Returns:

sum()

Sum value

Parameters:

expr

Returns:

tolist(**kwargs)

Pack all data in the sequence into a list.

Parameters:
  • expr

  • unique – make every element in the sequence unique

Returns:

var(**kw)

Variance

Parameters:
  • expr

  • ddof – degree of freedom

  • kw

Returns:

verify()

Verify if this expression can be compiled into ODPS SQL.

Returns:

True if compilation succeed else False

Return type:

bool