DataFrame Reference

class odps.df.DataFrame(*args, **kwargs)[source]

Main entry point of the PyODPS DataFrame.

Users can initialize a DataFrame from an odps.models.Table.

Parameters:

data (odps.models.Table or pandas DataFrame) – ODPS table or pandas DataFrame

Example:

>>> df = DataFrame(o.get_table('my_example_table'))
>>> df.dtypes
odps.Schema {
  movie_id                            int64
  title                               string
  release_date                        string
  video_release_date                  string
  imdb_url                            string
  user_id                             int64
  rating                              int64
  unix_timestamp                      int64
  age                                 int64
  sex                                 string
  occupation                          string
  zip_code                            string
}
>>> df.count()
100000
>>>
>>> # Group by `title`, aggregate `movie_id` by count, then sort the counts in descending order
>>> # Finally we take the top 25 results
>>> df.groupby('title').agg(count=df.movie_id.count()).sort('count', ascending=False)[:25]
>>>
>>> # We can use the `value_counts` to reach the same goal
>>> df.movie_id.value_counts()[:25]
static batch_persist(dfs, tables, *args, **kwargs)[source]

Persist multiple DataFrames into ODPS.

Parameters:
  • dfs – DataFrames to persist.

  • tables – table names to persist to. Use a (table, partition) tuple to store to a table partition.

  • args – args for Expr.persist

  • kwargs – kwargs for Expr.persist

Examples:

>>> DataFrame.batch_persist([df1, df2], ['table_name1', ('table_name2', 'partition_name2')], lifecycle=1)
view()[source]

Clone the collection; useful for self-joins.

Returns:

class odps.df.CollectionExpr(*args, **kwargs)[source]

Collection represents two-dimensional data.

Example:

>>> # projection
>>> df = DataFrame(o.get_table('my_table')) # DataFrame is actually a CollectionExpr
>>> df['name', 'id']  # project some columns
>>> df[[df.name, df.id]]  # projection
>>> df[df]  # a no-op projection that keeps all the columns
>>> df[df, df.name.lower().rename('name2')]  # project a new column `name2` besides all the original columns
>>> df.select(df, name2=df.name.lower())  # projection by `select`
>>> df.exclude('name')  # project all columns but `name`
>>> df[df.exclude('name'), df.name.lower()]  # `name` will not conflict any more
>>>
>>> # filter
>>> df[(df.id < 3) & (df.name != 'test')]
>>> df.filter(df.id < 3, df.name != 'test')
>>>
>>> # slice
>>> df[: 10]
>>> df.limit(10)
>>>
>>> # Sequence
>>> df.name # an instance of odps.df.expr.expressions.SequenceExpr
>>>
>>> # schema or dtypes
>>> df.dtypes
odps.Schema {
  name    string
  id      int64
}
>>> df.schema
odps.Schema {
  name    string
  id      int64
}
all()

Return whether all elements are True.

Parameters:

expr

Returns:

any()

Return whether any element is True.

Parameters:

expr

Returns:

append_id(id_col='append_id')

Append an ID column to the current DataFrame to form a new DataFrame.

Parameters:

id_col (str) – name of the appended ID field.

Returns:

DataFrame with the ID field

Return type:

DataFrame
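
A short usage sketch; the ID column name here is illustrative:

>>> df.append_id(id_col='rec_id')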

apply(func, axis=0, names=None, types=None, reduce=False, resources=None, keep_nulls=False, args=(), **kwargs)

Apply a function to a row when axis=1 or column when axis=0.

Parameters:
  • expr

  • func – function to apply

  • axis – row when axis=1, else column

  • names – output names

  • types – output types

  • reduce – if True, returns a sequence, else returns a collection

  • resources – resources to read

  • keep_nulls – if True, keep rows producing empty results; only works in lateral views

  • args – args for the function

  • kwargs – kwargs for the function

Returns:

Example:

Apply a function to a row:

>>> from odps.df import output
>>>
>>> @output(['iris_add', 'iris_sub'], ['float', 'float'])
>>> def handle(row):
>>>     yield row.sepallength - row.sepalwidth, row.sepallength + row.sepalwidth
>>>     yield row.petallength - row.petalwidth, row.petallength + row.petalwidth
>>>
>>> iris.apply(handle, axis=1).count()

Apply a function to a column:

>>> class Agg(object):
>>>
>>>     def buffer(self):
>>>         return [0.0, 0]
>>>
>>>     def __call__(self, buffer, val):
>>>         buffer[0] += val
>>>         buffer[1] += 1
>>>
>>>     def merge(self, buffer, pbuffer):
>>>         buffer[0] += pbuffer[0]
>>>         buffer[1] += pbuffer[1]
>>>
>>>     def getvalue(self, buffer):
>>>         if buffer[1] == 0:
>>>             return 0.0
>>>         return buffer[0] / buffer[1]
>>>
>>> iris.exclude('name').apply(Agg)
applymap(func, rtype=None, resources=None, columns=None, excludes=None, args=(), **kwargs)

Call func on each element of this collection.

Parameters:
  • func – lambda, function, odps.models.Function, or str which is the name of an odps.models.Function

  • rtype – if not provided, will be the dtype of this sequence

  • columns – columns to apply this function on

  • excludes – columns to skip when applying the function

Returns:

a new collection

Example:

>>> df.applymap(lambda x: x + 1)
ast()

Return the AST string.

Returns:

AST tree

Return type:

str

bfill(subset=None)

Fill NA/NaN values with the backward method. Equivalent to fillna(method=’bfill’).

Parameters:
  • expr (DataFrame) – input DataFrame.

  • subset – labels along the other axis to consider.

Returns:

DataFrame

bloom_filter(on, column, capacity=3000, error_rate=0.01)

Filter the collection on the on sequence with a Bloom filter built from column

Parameters:
  • collection

  • on – sequence or column name

  • column – instance of Column

  • capacity (int) – capacity of the Bloom filter

  • error_rate (float) – error rate

Returns:

collection

Example:

>>> df1 = DataFrame(pd.DataFrame({'a': ['name1', 'name2', 'name3', 'name1'], 'b': [1, 2, 3, 4]}))
>>> df2 = DataFrame(pd.DataFrame({'a': ['name1']}))
>>> df1.bloom_filter('a', df2.a)
       a  b
0  name1  1
1  name1  4
property columns

Columns of this collection.

Type:

list, in which each element is a Column

compile()

Compile this expression into ODPS SQL.

Returns:

compiled DAG

Return type:

str

concat(rights, distinct=False, axis=0)

Concat collections.

Parameters:
  • left – left collection

  • rights – right collections; can be a DataFrame object or a list of DataFrames

  • distinct – whether to remove duplicate entries. Only available when axis == 0

  • axis – when axis == 0, the DataFrames are merged vertically, otherwise horizontally

Returns:

collection

Note that axis == 1 can only be used under pandas DataFrames or XFlow.

Example:

>>> df['name', 'id'].concat(df2['score'], axis=1)
continuous(*args)

Set fields to be continuous.

Return type:

DataFrame

Example:

>>> # Table schema is create table test(f1 double, f2 string)
>>> # Original continuity: f1=DISCRETE, f2=DISCRETE
>>> # Now we want to set ``f1`` and ``f2`` into continuous
>>> new_ds = df.continuous('f1 f2')
count()

Count the number of records.

Parameters:

expr

Returns:

discrete(*args)

Set fields to be discrete.

Return type:

DataFrame

Example:

>>> # Table schema is create table test(f1 double, f2 string)
>>> # Original continuity: f1=CONTINUOUS, f2=CONTINUOUS
>>> # Now we want to set ``f1`` and ``f2`` to be discrete
>>> new_ds = df.discrete('f1 f2')
distinct(on=None, *ons)

Get collection with duplicate rows removed, optionally only considering certain columns

Parameters:
  • expr – collection

  • on – sequence or sequences

Returns:

distinct collection

Example:

>>> df.distinct(['name', 'id'])
>>> df['name', 'id'].distinct()
drop(data, axis=0, columns=None)

Drop data from a DataFrame.

Parameters:
  • expr – collection to drop data from

  • data – data to be removed

  • axis – 0 for deleting rows, 1 for columns

  • columns – columns of data to select, only useful when axis == 0

Returns:

collection

Example:

>>> import pandas as pd
>>> df1 = DataFrame(pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]}))
>>> df2 = DataFrame(pd.DataFrame({'a': [2, 3], 'b': [5, 7]}))
>>> df1.drop(df2)
   a  b  c
0  1  4  7
1  3  6  9
>>> df1.drop(df2, columns='a')
   a  b  c
0  1  4  7
>>> df1.drop(['a'], axis=1)
   b  c
0  4  7
1  5  8
2  6  9
>>> df1.drop(df2, axis=1)
   c
0  7
1  8
2  9
dropna(how='any', thresh=None, subset=None)

Return object with labels on given axis omitted where alternately any or all of the data are missing

Parameters:
  • expr (DataFrame) – input DataFrame

  • how – can be ‘any’ or ‘all’. If ‘any’ is specified, drop the label if any NA values are present. If ‘all’ is specified, drop the label only if all values are NA.

  • thresh – require that many non-NA values

  • subset – labels along the other axis to consider, e.g. if you are dropping rows these would be a list of columns to include

Returns:

DataFrame
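
A short sketch, assuming illustrative column names:

>>> df.dropna(subset=['col1', 'col2'])            # drop rows with any NA among the columns
>>> df.dropna(how='all', subset=['col1', 'col2']) # drop rows where all the columns are NA
>>> df.dropna(thresh=2, subset=['col1', 'col2', 'col3'])  # require at least 2 non-NA values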

erase_key_value(*args)

Erase key-value represented fields.

Return type:

DataFrame

Example:

>>> new_ds = df.erase_key_value('f1 f2')
except_(*rights, **kwargs)

Exclude data from a collection, like the EXCEPT clause in SQL. All collections involved should have the same schema.

Parameters:
  • left – collection to drop data from

  • rights – collection or list of collections

  • distinct – whether to remove duplicate entries in the result

Returns:

collection

Examples:

>>> import pandas as pd
>>> df1 = DataFrame(pd.DataFrame({'a': [1, 2, 3, 3, 3], 'b': [1, 2, 3, 3, 3]}))
>>> df2 = DataFrame(pd.DataFrame({'a': [1, 3], 'b': [1, 3]}))
>>> df1.setdiff(df2)
   a  b
0  2  2
1  3  3
2  3  3
>>> df1.setdiff(df2, distinct=True)
   a  b
0  2  2
exclude(*fields)[source]

Project columns not included in the given fields

Parameters:

fields – field names

Returns:

new collection

Return type:

odps.df.expr.expressions.CollectionExpr

exclude_fields(*args)

Exclude one or more fields from feature fields.

Return type:

DataFrame

execute(**kwargs)
Parameters:
  • hints (dict) – settings for SQL, e.g. odps.sql.mapper.split.size

  • priority (int) – instance priority, 9 by default

  • running_cluster – cluster to run this instance

Returns:

execution result

Return type:

odps.df.backends.frame.ResultFrame

extract_kv(columns=None, kv_delim=':', item_delim=',', dtype='float', fill_value=None)

Extract values in key-value represented columns into standalone columns. New column names will be the name of the key-value column followed by an underscore and the key.

Parameters:
  • expr (DataFrame) – input DataFrame

  • columns – the key-value columns to be extracted.

  • kv_delim (str) – delimiter between key and value.

  • item_delim (str) – delimiter between key-value pairs.

  • dtype (str) – type of the generated value columns.

  • fill_value – default value for missing key-value pairs.

Returns:

extracted data frame

Return type:

DataFrame

Example:

>>> df
    name   kv
0  name1  k1=1.0,k2=3.0,k5=10.0
1  name2  k2=3.0,k3=5.1
2  name3  k1=7.1,k7=8.2
3  name4  k2=1.2,k3=1.5
4  name5  k2=1.0,k9=1.1
>>> table = df.extract_kv(columns=['kv'], kv_delim='=')
>>> table
    name   kv_k1   kv_k2   kv_k3   kv_k5   kv_k7   kv_k9
0  name1  1.0     3.0     NaN     10.0    NaN     NaN
1  name2  NaN     3.0     5.1     NaN     NaN     NaN
2  name3  7.1     NaN     NaN     NaN     8.2     NaN
3  name4  NaN     1.2     1.5     NaN     NaN     NaN
4  name5  NaN     1.0     NaN     NaN     NaN     1.1
ffill(subset=None)

Fill NA/NaN values with the forward method. Equivalent to fillna(method=’ffill’).

Parameters:
  • expr (DataFrame) – input DataFrame.

  • subset – labels along the other axis to consider.

Returns:

DataFrame

fillna(value=None, method=None, subset=None)

Fill NA/NaN values using the specified method

Parameters:
  • expr (DataFrame) – input DataFrame

  • method – can be ‘backfill’, ‘bfill’, ‘pad’, ‘ffill’ or None

  • value – value to fill into

  • subset – labels along the other axis to consider.

Returns:

DataFrame
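
A short sketch, assuming illustrative column names:

>>> df.fillna(0, subset=['col1', 'col2'])  # fill NAs in the given columns with 0
>>> df.fillna(method='ffill')              # propagate the last valid value forward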

filter(*predicates)[source]

Filter the data by predicates

Parameters:

predicates – the conditions to filter on

Returns:

new collection

Return type:

odps.df.expr.expressions.CollectionExpr

filter_parts(predicate='', exclude=True)[source]

Filter the data by a partition string. A partition string looks like pt1=1,pt2=2/pt1=2,pt2=1, where the comma (,) denotes ‘and’, while the slash (/) denotes ‘or’.

Parameters:
  • predicate (str|Partition) – predicate string of the partition filter

  • exclude (bool) – True if you want to exclude partition fields, otherwise False. True by default.

Returns:

new collection

Return type:

odps.df.expr.expressions.CollectionExpr
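
A short sketch of the partition string syntax described above:

>>> df.filter_parts('pt1=1,pt2=2/pt1=2,pt2=1')  # (pt1=1 AND pt2=2) OR (pt1=2 AND pt2=1)
>>> df.filter_parts('pt1=1', exclude=False)     # keep the partition fields in the result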

groupby(by, *bys)

Group collection by a series of sequences.

Parameters:
  • expr – collection

  • by – columns to group by

  • bys – additional columns to group by

Returns:

GroupBy instance

Return type:

odps.df.expr.groupby.GroupBy
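
A short sketch, following the class-level example above:

>>> df.groupby('title').agg(count=df.movie_id.count())
>>> df.groupby(['user_id', 'title']).agg(avg_rating=df.rating.mean())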

head(n=None, **kwargs)[source]

Return the first n rows. Execute at once.

Parameters:

n

Returns:

result frame

Return type:

odps.df.backends.frame.ResultFrame

inner_join(right, on=None, suffixes=('_x', '_y'), mapjoin=False, skewjoin=False)

Inner join two collections.

If on is not specified, the common fields of the left and right collections are used. suffixes means that if column names conflict, the suffixes will be added automatically. For example, if both left and right have a field named col, the joined collection will contain col_x and col_y.

Parameters:
  • left – left collection

  • right – right collection

  • on – fields to join on

  • suffixes – when names conflict, the suffixes will be added to both columns.

Returns:

collection

Example:

>>> df.dtypes.names
['name', 'id']
>>> df2.dtypes.names
['name', 'id1']
>>> df.inner_join(df2)
>>> df.inner_join(df2, on='name')
>>> df.inner_join(df2, on=('id', 'id1'))
>>> df.inner_join(df2, on=['name', ('id', 'id1')])
>>> df.inner_join(df2, on=[df.name == df2.name, df.id == df2.id1])
intersect(*rights, **kwargs)

Calculate the intersection among datasets.

Parameters:
  • left – collection

  • rights – collection or list of collections

  • distinct – whether to remove duplicate entries in the result

Returns:

collection

Examples:

>>> import pandas as pd
>>> df1 = DataFrame(pd.DataFrame({'a': [1, 2, 3, 3, 3], 'b': [1, 2, 3, 3, 3]}))
>>> df2 = DataFrame(pd.DataFrame({'a': [1, 3, 3], 'b': [1, 3, 3]}))
>>> df1.intersect(df2)
   a  b
0  1  1
1  3  3
2  3  3
>>> df1.intersect(df2, distinct=True)
   a  b
0  1  1
1  3  3
join(right, on=None, how='inner', suffixes=('_x', '_y'), mapjoin=False, skewjoin=False)

Join two collections.

If on is not specified, the common fields of the left and right collections are used. suffixes means that if column names conflict, the suffixes will be added automatically. For example, if both left and right have a field named col, the joined collection will contain col_x and col_y.

Parameters:
  • left – left collection

  • right – right collection

  • on – fields to join on

  • how – ‘inner’, ‘left’, ‘right’, or ‘outer’

  • suffixes – when names conflict, the suffixes will be added to both columns.

  • mapjoin – whether to use mapjoin, False by default.

  • skewjoin – whether to use skewjoin, False by default. Can be True if the collection is skewed, or a list specifying columns with skewed values, or a list of dicts specifying skewed combinations.

Returns:

collection

Example:

>>> df.dtypes.names
['name', 'id']
>>> df2.dtypes.names
['name', 'id1']
>>> df.join(df2)
>>> df.join(df2, on='name')
>>> df.join(df2, on=('id', 'id1'))
>>> df.join(df2, on=['name', ('id', 'id1')])
>>> df.join(df2, on=[df.name == df2.name, df.id == df2.id1])
>>> df.join(df2, mapjoin=False)
>>> df.join(df2, skewjoin=True)
>>> df.join(df2, skewjoin=["c0", "c1"])
>>> df.join(df2, skewjoin=[{"c0": 1, "c1": "2"}, {"c0": 3, "c1": "4"}])
key_value(*args, **kwargs)

Set fields to be key-value represented.

Return type:

DataFrame

Example:

>>> new_ds = df.key_value('f1 f2', kv=':', item=',')
kurt()

Calculate kurtosis of the sequence

Parameters:

expr

Returns:

kurtosis()

Calculate kurtosis of the sequence

Parameters:

expr

Returns:

label_field(f)

Select one field as the label field.

Note that this field will be excluded from the feature fields.

Parameters:

f (str) – Selected label field

Return type:

DataFrame

left_join(right, on=None, suffixes=('_x', '_y'), mapjoin=False, merge_columns=None, skewjoin=False)

Left join two collections.

If on is not specified, the common fields of the left and right collections are used. suffixes means that if column names conflict, the suffixes will be added automatically. For example, if both left and right have a field named col, the joined collection will contain col_x and col_y.

Parameters:
  • left – left collection

  • right – right collection

  • on – fields to join on

  • suffixes – when names conflict, the suffixes will be added to both columns.

  • mapjoin – whether to use mapjoin, False by default.

  • merge_columns – whether to merge columns with the same name into one column without a suffix. If the value is True, columns in the predicate with the same names will be merged, taking the non-null value. If the value is ‘left’ or ‘right’, the values of predicates on the left / right collection will be taken. You can also pass a dictionary to describe the behavior of each column, such as { ‘a’: ‘auto’, ‘b’: ‘left’ }.

Returns:

collection

Example:

>>> df.dtypes.names
['name', 'id']
>>> df2.dtypes.names
['name', 'id1']
>>> df.left_join(df2)
>>> df.left_join(df2, on='name')
>>> df.left_join(df2, on=('id', 'id1'))
>>> df.left_join(df2, on=['name', ('id', 'id1')])
>>> df.left_join(df2, on=[df.name == df2.name, df.id == df2.id1])
limit(n)[source]

Limit the collection to n records

Parameters:

n – number of records

Returns:

map_reduce(mapper=None, reducer=None, group=None, sort=None, ascending=True, combiner=None, combiner_buffer_size=1024, mapper_output_names=None, mapper_output_types=None, mapper_resources=None, reducer_output_names=None, reducer_output_types=None, reducer_resources=None)

MapReduce API, mapper or reducer should be provided.

Parameters:
  • expr

  • mapper – mapper function or class

  • reducer – reducer function or class

  • group – the keys to group by after the mapper

  • sort – the keys to sort by after the mapper

  • ascending – True if ascending else False

  • combiner – combiner function or class; the combiner’s output schema should be the same as the mapper’s

  • combiner_buffer_size – combiner’s buffer size, 1024 by default

  • mapper_output_names – mapper’s output names

  • mapper_output_types – mapper’s output types

  • mapper_resources – the resources for the mapper

  • reducer_output_names – reducer’s output names

  • reducer_output_types – reducer’s output types

  • reducer_resources – the resources for the reducer

Returns:

Example:

>>> from odps.df import output
>>>
>>> @output(['word', 'cnt'], ['string', 'int'])
>>> def mapper(row):
>>>     for word in row[0].split():
>>>         yield word.lower(), 1
>>>
>>> @output(['word', 'cnt'], ['string', 'int'])
>>> def reducer(keys):
>>>     cnt = [0]
>>>     def h(row, done):  # done illustrates that all the rows of the keys are processed
>>>         cnt[0] += row.cnt
>>>         if done:
>>>             yield keys.word, cnt[0]
>>>     return h
>>>
>>> words_df.map_reduce(mapper, reducer, group='word')
max()

Max value

Parameters:

expr

Returns:

mean()

Arithmetic mean.

Parameters:

expr

Returns:

median()

Median value.

Parameters:

expr

Returns:

melt(id_vars=None, value_vars=None, var_name='variable', value_name='value', ignore_nan=False)

“Unpivots” a DataFrame from wide format to long format, optionally leaving identifier variables set.

This function is useful to massage a DataFrame into a format where one or more columns are identifier variables (id_vars), while all other columns, considered measured variables (value_vars), are “unpivoted” to the row axis, leaving just two non-identifier columns, ‘variable’ and ‘value’.

Parameters:
  • expr – collection

  • id_vars – column(s) to use as identifier variables.

  • value_vars – column(s) to unpivot. If not specified, uses all columns that are not set as id_vars.

  • var_name – name to use for the ‘variable’ column. If None, uses frame.columns.name or ‘variable’.

  • value_name – name to use for the ‘value’ column.

  • ignore_nan – whether to ignore NaN values in the data.

Returns:

collection

Example:

>>> df.melt(id_vars='id', value_vars=['col1', 'col2'])
>>> df.melt(id_vars=['id', 'id2'], value_vars=['col1', 'col2'], var_name='variable')
min()

Min value

Parameters:

expr

Returns:

min_max_scale(columns=None, feature_range=(0, 1), preserve=False, suffix='_scaled', group=None)

Resize a data frame by max / min values, i.e., (X - min(X)) / (max(X) - min(X))

Parameters:
  • expr (DataFrame) – input DataFrame

  • feature_range – the target range to resize the values into, i.e., v * (b - a) + a

  • preserve (bool) – determine whether input data should be kept. If True, scaled input data will be appended to the data frame with suffix

  • columns – column names to resize. If set to None, float or int-typed columns will be normalized if the column is not specified as a group column.

  • group – determine scale groups. Scaling will be done in each group separately.

  • suffix (str) – column suffix to be appended to the scaled columns.

Returns:

resized data frame

Return type:

DataFrame
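
A short sketch; the column name is illustrative:

>>> df.min_max_scale(columns=['fid'])                         # scale fid into [0, 1]
>>> df.min_max_scale(columns=['fid'], feature_range=(-1, 1))  # scale fid into [-1, 1]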

moment(order, central=False)

Calculate the n-th order moment of the sequence

Parameters:
  • expr

  • order – moment order, must be an integer

  • central – whether central moments are to be computed.

Returns:

nunique()

The distinct count.

Parameters:

expr

Returns:

outer_join(right, on=None, suffixes=('_x', '_y'), mapjoin=False, merge_columns=None, skewjoin=False)

Outer join two collections.

If on is not specified, the common fields of the left and right collections are used. suffixes means that if column names conflict, the suffixes will be added automatically. For example, if both left and right have a field named col, the joined collection will contain col_x and col_y.

Parameters:
  • left – left collection

  • right – right collection

  • on – fields to join on

  • suffixes – when names conflict, the suffixes will be added to both columns.

  • mapjoin – whether to use mapjoin, False by default.

  • merge_columns – whether to merge columns with the same name into one column without a suffix. If the value is True, columns in the predicate with the same names will be merged, taking the non-null value. If the value is ‘left’ or ‘right’, the values of predicates on the left / right collection will be taken. You can also pass a dictionary to describe the behavior of each column, such as { ‘a’: ‘auto’, ‘b’: ‘left’ }.

Returns:

collection

Example:

>>> df.dtypes.names
['name', 'id']
>>> df2.dtypes.names
['name', 'id1']
>>> df.outer_join(df2)
>>> df.outer_join(df2, on='name')
>>> df.outer_join(df2, on=('id', 'id1'))
>>> df.outer_join(df2, on=['name', ('id', 'id1')])
>>> df.outer_join(df2, on=[df.name == df2.name, df.id == df2.id1])
persist(name, partitions=None, partition=None, lifecycle=None, project=None, **kwargs)

Persist the execution into a new table. If partitions is not specified, a new table without partitions will be created if the table does not exist, and the SQL result will be inserted into it. If partitions is specified, the listed fields will be the partition fields of the new table. If partition is specified, the data will be inserted into the exact partition of the table.

Parameters:
  • name – table name

  • partitions (list) – list of strings, the partition fields

  • partition (string or PartitionSpec) – persist to a specified partition

  • lifecycle (int) – table lifecycle. If absent, options.lifecycle will be used.

  • project – project name; if not provided, the default project will be used

  • hints (dict) – settings for SQL, e.g. odps.sql.mapper.split.size

  • priority (int) – instance priority, 9 by default

  • running_cluster – cluster to run this instance

  • overwrite (bool) – overwrite the table, True by default

  • drop_table (bool) – drop the table if it exists, False by default

  • create_table (bool) – create the table first if it does not exist, True by default

  • drop_partition (bool) – drop the partition if it exists, False by default

  • create_partition (bool) – create the partition if it does not exist, None by default

  • cast (bool) – cast all columns’ types to those of the existing table, False by default

Returns:

odps.df.DataFrame

Example:

>>> df = df['name', 'id', 'ds']
>>> df.persist('odps_new_table')
>>> df.persist('odps_new_table', partition='pt=test')
>>> df.persist('odps_new_table', partitions=['ds'])
pivot(rows, columns, values=None)

Produce a ‘pivot’ table based on 3 columns of this DataFrame. Uses unique values from rows / columns and fills with values.

Parameters:
  • expr – collection

  • rows – used to make the new collection’s grouped rows

  • columns – used to make the new collection’s columns

  • values – values to use for populating the new collection’s values

Returns:

collection

Example:

>>> df.pivot(rows='id', columns='category')
>>> df.pivot(rows='id', columns='category', values='sale')
>>> df.pivot(rows=['id', 'id2'], columns='category', values='sale')
pivot_table(values=None, rows=None, columns=None, aggfunc='mean', fill_value=None)

Create a spreadsheet-style pivot table as a DataFrame.

Parameters:
  • expr – collection

  • values (optional) – column to aggregate

  • rows – rows to group by

  • columns – keys to group by on the pivot table columns

  • aggfunc – aggregate function or functions

  • fill_value (optional) – value to replace missing values with, default None

Returns:

collection

Example:

>>> df
    A    B      C   D
0  foo  one  small  1
1  foo  one  large  2
2  foo  one  large  2
3  foo  two  small  3
4  foo  two  small  3
5  bar  one  large  4
6  bar  one  small  5
7  bar  two  small  6
8  bar  two  large  7
>>> table = df.pivot_table(values='D', rows=['A', 'B'], columns='C', aggfunc='sum')
>>> table
     A    B  large_D_sum   small_D_sum
0  bar  one          4.0           5.0
1  bar  two          7.0           6.0
2  foo  one          4.0           1.0
3  foo  two          NaN           6.0
quantile(prob=None, **kw)

Percentile value.

Parameters:
  • expr

  • prob – probability or list of probabilities, in [0, 1]

Returns:

query(expr)[source]

Query the data with a boolean expression.

Parameters:

expr – the query string; you can use the ‘@’ character to refer to environment variables.

Returns:

new collection

Return type:

odps.df.expr.expressions.CollectionExpr
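
A short sketch of the query syntax; the values are illustrative:

>>> limit = 3
>>> df.query('id < @limit')  # refer to a variable in the environment with '@'
>>> df.query("id < 3 and name != 'test'")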

reshuffle(by=None, sort=None, ascending=True)

Reshuffle data.

Parameters:
  • expr

  • by – the sequence or scalar to shuffle by. RandomScalar by default

  • sort – the sequence or scalar to sort by.

  • ascending – True if ascending else False

Returns:

collection
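
A short sketch, assuming illustrative column names:

>>> df.reshuffle()       # shuffle randomly
>>> df.reshuffle('name') # shuffle by the name column
>>> df.reshuffle('name', sort='id', ascending=False)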

right_join(right, on=None, suffixes=('_x', '_y'), mapjoin=False, merge_columns=None, skewjoin=False)

Right join two collections.

If on is not specified, the common fields of the left and right collections are used. suffixes means that if column names conflict, the suffixes will be added automatically. For example, if both left and right have a field named col, the joined collection will contain col_x and col_y.

Parameters:
  • left – left collection

  • right – right collection

  • on – fields to join on

  • suffixes – when names conflict, the suffixes will be added to both columns.

  • mapjoin – whether to use mapjoin, False by default.

  • merge_columns – whether to merge columns with the same name into one column without a suffix. If the value is True, columns in the predicate with the same names will be merged, taking the non-null value. If the value is ‘left’ or ‘right’, the values of predicates on the left / right collection will be taken. You can also pass a dictionary to describe the behavior of each column, such as { ‘a’: ‘auto’, ‘b’: ‘left’ }.

Returns:

collection

Example:

>>> df.dtypes.names
['name', 'id']
>>> df2.dtypes.names
['name', 'id1']
>>> df.right_join(df2)
>>> df.right_join(df2, on='name')
>>> df.right_join(df2, on=('id', 'id1'))
>>> df.right_join(df2, on=['name', ('id', 'id1')])
>>> df.right_join(df2, on=[df.name == df2.name, df.id == df2.id1])
roles(clear_features=True, **field_roles)

Set roles of fields

Parameters:
  • clear_features – clear feature roles on fields

  • field_roles

Returns:

sample(parts=None, columns=None, i=None, n=None, frac=None, replace=False, weights=None, strata=None, random_state=None)

Sample collection.

Parameters:
  • expr – collection

  • parts – how many parts to hash into

  • columns – the columns to sample

  • i – the part(s) to sample out; can be a list of parts, each of which must be in the range 0 to parts-1

  • n – how many rows to sample. If strata is specified, n should be a dict with values in the strata column as dictionary keys and the corresponding sample sizes as values

  • frac – what fraction to sample. If strata is specified, frac should be a dict with values in the strata column as dictionary keys and the corresponding sample weights as values

  • replace – whether to sample with replacement

  • weights – the column name of weights

  • strata – the name of the strata column

  • random_state – the random seed to use when sampling

Returns:

collection

Note that n, frac, replace, weights, strata and random_state can only be used under pandas DataFrames or XFlow.

Example:

Sampling with parts:

>>> df.sample(parts=1)
>>> df.sample(parts=5, i=0)
>>> df.sample(parts=10, columns=['name'])

Sampling with fraction or weights, replacement option can be specified:

>>> df.sample(n=100)
>>> df.sample(frac=0.1)
>>> df.sample(frac=0.1, replace=True)

Sampling with weight column:

>>> df.sample(n=100, weights='weight_col')
>>> df.sample(n=100, weights='weight_col', replace=True)

Stratified sampling. Note that currently we do not support stratified sampling with replacement.

>>> df.sample(strata='category', frac={'Iris Setosa': 0.5, 'Iris Versicolour': 0.4})
select(*fields, **kw)[source]

Project columns. Remember to avoid column name conflicts.

Parameters:
  • fields – columns to project

  • kw – columns and their names to project

Returns:

new collection

Return type:

odps.df.expr.expressions.CollectionExpr

select_features(*args, **kwargs)

Select one or more fields as feature fields.

Return type:

DataFrame

setdiff(*rights, **kwargs)

Exclude data from a collection, like the EXCEPT clause in SQL. All collections involved should have the same schema.

Parameters:
  • left – collection to drop data from

  • rights – collection or list of collections

  • distinct – whether to remove duplicate entries in the result

Returns:

collection

Examples:

>>> import pandas as pd
>>> df1 = DataFrame(pd.DataFrame({'a': [1, 2, 3, 3, 3], 'b': [1, 2, 3, 3, 3]}))
>>> df2 = DataFrame(pd.DataFrame({'a': [1, 3], 'b': [1, 3]}))
>>> df1.setdiff(df2)
   a  b
0  2  2
1  3  3
2  3  3
>>> df1.setdiff(df2, distinct=True)
   a  b
0  2  2
size()

Count the number of records.

Parameters:

expr

Returns:

skew()

Calculate skewness of the sequence

Parameters:

expr

Returns:

sort(by, ascending=True)

Sort the collection by values. sort is an alias of sort_values

Parameters:
  • expr – collection

  • by – the sequence or sequences to sort by

  • ascending – sort ascending vs. descending. Specify a list for multiple sort orders. If this is a list of bools, it must match the length of by

Returns:

Sorted collection

Example:

>>> df.sort_values(['name', 'id'])  # 1
>>> df.sort(['name', 'id'], ascending=False)  # 2
>>> df.sort(['name', 'id'], ascending=[False, True])  # 3
>>> df.sort([-df.name, df.id])  # 4, equal to #3
sort_values(by, ascending=True)

Sort the collection by values. sort is an alias of sort_values

Parameters:
  • expr – collection

  • by – the sequence or sequences to sort by

  • ascending – sort ascending vs. descending. Specify a list for multiple sort orders. If this is a list of bools, it must match the length of by

Returns:

Sorted collection

Example:

>>> df.sort_values(['name', 'id'])  # 1
>>> df.sort(['name', 'id'], ascending=False)  # 2
>>> df.sort(['name', 'id'], ascending=[False, True])  # 3
>>> df.sort([-df.name, df.id])  # 4, equal to #3
split(frac, seed=None)

Split the current DataFrame into two DataFrame objects with a certain ratio.

Parameters:

frac (float) – split ratio

Returns:

two split DataFrame objects
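
A short sketch; the ratio and seed are illustrative:

>>> train, test = df.split(0.8, seed=42)  # 80% / 20% split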

std(**kw)

Standard deviation.

Parameters:
  • expr

  • kw

Returns:

std_scale(columns=None, with_means=True, with_std=True, preserve=False, suffix='_scaled', group=None)

Resize a data frame by mean and standard deviation.

Parameters:
  • expr (DataFrame) – input DataFrame

  • with_means (bool) – determine whether means will be subtracted from the output

  • with_std (bool) – determine whether the output will be divided by standard deviations

  • preserve (bool) – determine whether input data should be kept. If True, scaled input data will be appended to the data frame with suffix

  • columns – column names to resize. If set to None, float or int-typed columns will be normalized if the column is not specified as a group column.

  • group – determine scale groups. Scaling will be done in each group separately.

  • suffix (str) – column suffix to be appended to the scaled columns.

Returns:

resized data frame

Return type:

DataFrame

sum()

Sum value

Parameters:

expr

Returns:

switch(*args, **kw)

Similar to CASE WHEN in SQL. Refer to the example below.

Parameters:
  • expr

  • args

  • kw

Returns:

sequence or scalar

Example:

>>> # if df.id == 3 then df.name
>>> # elif df.id == df.fid.abs() then df.name + 'test'
>>> # default: 'test'
>>> df.id.switch(3, df.name, df.fid.abs(), df.name + 'test', default='test')
tail(n=None, **kwargs)[source]

Return the last n rows. Execute at once.

Parameters:

n

Returns:

result frame

Return type:

odps.df.backends.frame.ResultFrame

to_kv(columns=None, kv_delim=':', item_delim=',', kv_name='kv_col')

Merge values in specified columns into a key-value represented column.

Parameters:
  • expr (DataFrame) – input DataFrame

  • columns – the columns to be merged.

  • kv_delim (str) – delimiter between key and value.

  • item_delim (str) – delimiter between key-value pairs.

  • kv_name (str) – name of the new key-value column

Returns:

converted data frame

Return type:

DataFrame

Example:

>>> df
    name   k1   k2   k3   k5    k7   k9
0  name1  1.0  3.0  NaN  10.0  NaN  NaN
1  name2  NaN  3.0  5.1  NaN   NaN  NaN
2  name3  7.1  NaN  NaN  NaN   8.2  NaN
3  name4  NaN  1.2  1.5  NaN   NaN  NaN
4  name5  NaN  1.0  NaN  NaN   NaN  1.1
>>> table = df.to_kv(columns=['k1', 'k2', 'k3', 'k5', 'k7', 'k9'], kv_delim='=')
>>> table
    name   kv_col
0  name1  k1=1.0,k2=3.0,k5=10.0
1  name2  k2=3.0,k3=5.1
2  name3  k1=7.1,k7=8.2
3  name4  k2=1.2,k3=1.5
4  name5  k2=1.0,k9=1.1
to_pandas(wrap=False, **kwargs)[source]

Convert to a pandas DataFrame. Execute at once.

Parameters:

wrap – if True, wrap the pandas DataFrame into a PyODPS DataFrame

Returns:

pandas DataFrame

tolist(**kwargs)

Pack all data in the sequence into a list.

Parameters:
  • expr

  • unique – make every element in the sequence unique

Returns:

union(right, distinct=False)

Union two collections.

Parameters:
  • left – left collection

  • right – right collection

  • distinct

Returns:

collection

Example:

>>> df['name', 'id'].union(df2['id', 'name'])
var(**kw)

Variance

Parameters:
  • expr

  • ddof – degrees of freedom

  • kw

Returns:

verify()

Verify if this expression can be compiled into ODPS SQL.

Returns:

True if compilation succeeds else False

Return type:

bool

view()[source]

Clone the collection; useful for self-joins.

Returns:

weight_field(f)

Select one field as the weight field.

Note that this field will be excluded from the feature fields.

Parameters:

f (str) – Selected weight field

Return type:

DataFrame

class odps.df.SequenceExpr(*args, **kwargs)[source]

Sequence represents one-dimensional data.

ast()

Return the AST string.

Returns:

AST tree

Return type:

str

astype(data_type)[source]

Cast to a new data type.

Parameters:

data_type – the new data type

Returns:

casted sequence

Example:

>>> df.id.astype('float')
between(left, right, inclusive=True)

Return a boolean sequence or scalar showing whether each element is between left and right.

Parameters:
  • expr – sequence or scalar

  • left – left value

  • right – right value

  • inclusive – if True, equivalent to left <= expr <= right, else left < expr < right

Returns:

boolean sequence or scalar
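
A short sketch; the bounds are illustrative:

>>> df[df.id.between(1, 3)]                   # 1 <= id <= 3
>>> df[df.id.between(1, 3, inclusive=False)]  # 1 < id < 3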

compile()

Compile this expression into ODPS SQL.

Returns:

compiled DAG

Return type:

str

concat(rights, distinct=False, axis=0)

Concat collections.

Parameters:
  • left – left collection

  • rights – right collections; can be a DataFrame object or a list of DataFrames

  • distinct – whether to remove duplicate entries. Only available when axis == 0

  • axis – when axis == 0, the DataFrames are merged vertically, otherwise horizontally

Returns:

collection

Note that axis == 1 can only be used under pandas DataFrames or XFlow.

Example:

>>> df['name', 'id'].concat(df2['score'], axis=1)
continuous()

Set sequence to be continuous.

Return type:

Column

Example:

>>> # Table schema is create table test(f1 double, f2 string)
>>> # Original continuity: f1=DISCRETE
>>> # Now we want to set ``f1`` to be continuous
>>> new_f1 = df.f1.continuous()
count()

Count the number of elements.

Parameters:

expr

Returns:

cut(bins, right=True, labels=None, include_lowest=False, include_under=False, include_over=False)

Return indices of half-open bins to which each value of expr belongs.

Parameters:
  • expr – sequence or scalar

  • bins – list of scalars

  • right – indicates whether the bins include the rightmost edge or not. If right == True (the default), then the bins [1, 2, 3, 4] indicate (1, 2], (2, 3], (3, 4]

  • labels – used as labels for the resulting bins. Must be of the same length as the resulting bins.

  • include_lowest – whether the first interval should be left-inclusive or not.

  • include_under – whether to include the bin below the leftmost edge

  • include_over – whether to include the bin above the rightmost edge

Returns:

sequence or scalar
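
A short sketch; the bin edges and labels are illustrative:

>>> df.id.cut([0, 10, 100], labels=['small', 'large'])
>>> df.id.cut([0, 10], labels=['neg', 'small', 'big'], include_under=True, include_over=True)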

discrete()

Set sequence to be discrete.

Return type:

Column

Example:

>>> # Table schema is create table test(f1 double, f2 string)
>>> # Original continuity: f1=CONTINUOUS
>>> # Now we want to set ``f1`` to be discrete
>>> new_f1 = df.f1.discrete()
drop(data, axis=0, columns=None)

Drop data from a DataFrame.

Parameters:
  • expr – collection to drop data from

  • data – data to be removed

  • axis – 0 for deleting rows, 1 for columns

  • columns – columns of data to select, only useful when axis == 0

Returns:

collection

Example:

>>> import pandas as pd
>>> df1 = DataFrame(pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]}))
>>> df2 = DataFrame(pd.DataFrame({'a': [2, 3], 'b': [5, 7]}))
>>> df1.drop(df2)
   a  b  c
0  1  4  7
1  3  6  9
>>> df1.drop(df2, columns='a')
   a  b  c
0  1  4  7
>>> df1.drop(['a'], axis=1)
   b  c
0  4  7
1  5  8
2  6  9
>>> df1.drop(df2, axis=1)
   c
0  7
1  8
2  9
property dtype

Return the data type. Available types: int8, int16, int32, int64, float32, float64, boolean, string, decimal, datetime

Returns:

the data type

erase_key_value()

Erase key-value represented fields.

Return type:

Column

Example:

>>> new_f1 = df.f1.erase_key_value()
except_(*rights, **kwargs)

Exclude data from a collection, like the EXCEPT clause in SQL. All collections involved should have the same schema.

Parameters:
  • left – collection to drop data from

  • rights – collection or list of collections

  • distinct – whether to remove duplicate entries in the result

Returns:

collection

Examples:

>>> import pandas as pd
>>> df1 = DataFrame(pd.DataFrame({'a': [1, 2, 3, 3, 3], 'b': [1, 2, 3, 3, 3]}))
>>> df2 = DataFrame(pd.DataFrame({'a': [1, 3], 'b': [1, 3]}))
>>> df1.setdiff(df2)
   a  b
0  2  2
1  3  3
2  3  3
>>> df1.setdiff(df2, distinct=True)
   a  b
0  2  2
execute(**kwargs)
Parameters:
  • hints (dict) – settings for SQL, e.g. odps.sql.mapper.split.size

  • priority (int) – instance priority, 9 by default

  • running_cluster – cluster to run this instance

Returns:

execution result

Return type:

odps.df.backends.frame.ResultFrame

fillna(value)

Fill null with value.

Parameters:
  • expr – sequence or scalar

  • value – value to fill into

Returns:

sequence or scalar

hash(func=None)

Calculate the hash value.

Parameters:
  • expr

  • func – hash function

Returns:

head(n=None, **kwargs)[source]

Return the first n rows. Execute at once.

Parameters:

n

Returns:

result frame

Return type:

odps.df.expr.expressions.CollectionExpr

hll_count(error_rate=0.01, splitter=None)

Calculate HyperLogLog count

Parameters:
  • expr

  • error_rate (float) – error rate

  • splitter – the splitter used to split the column value

Returns:

sequence or scalar

Example:

>>> import numpy as np
>>> import pandas as pd
>>> df = DataFrame(pd.DataFrame({'a': np.random.randint(100000, size=100000)}))
>>> df.a.hll_count()
63270
>>> df.a.nunique()
63250
inner_join(right, on=None, suffixes=('_x', '_y'), mapjoin=False, skewjoin=False)

Inner join two collections.

If on is not specified, the common fields of the left and right collections are used. suffixes means that if column names conflict, the suffixes will be added automatically. For example, if both left and right have a field named col, the joined collection will contain col_x and col_y.

Parameters:
  • left – left collection

  • right – right collection

  • on – fields to join on

  • suffixes – when names conflict, the suffixes will be added to both columns.

Returns:

collection

Example:

>>> df.dtypes.names
['name', 'id']
>>> df2.dtypes.names
['name', 'id1']
>>> df.inner_join(df2)
>>> df.inner_join(df2, on='name')
>>> df.inner_join(df2, on=('id', 'id1'))
>>> df.inner_join(df2, on=['name', ('id', 'id1')])
>>> df.inner_join(df2, on=[df.name == df2.name, df.id == df2.id1])
intersect(*rights, **kwargs)

Calc intersection among datasets,

参数:
  • left – collection

  • rights – collection or list of collections

  • distinct – whether to preserve duolicate entries

返回:

collection

Examples:

>>> import pandas as pd
>>> df1 = DataFrame(pd.DataFrame({'a': [1, 2, 3, 3, 3], 'b': [1, 2, 3, 3, 3]}))
>>> df2 = DataFrame(pd.DataFrame({'a': [1, 3, 3], 'b': [1, 3, 3]}))
>>> df1.intersect(df2)
   a  b
0  1  1
1  3  3
2  3  3
>>> df1.intersect(df2, distinct=True)
   a  b
0  1  1
1  3  3
isin(values)

Return a boolean sequence or scalar showing whether each element is exactly contained in the passed values.

Parameters:
  • expr – sequence or scalar

  • values – list object or sequence

Returns:

boolean sequence or scalar
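
A short sketch; the values are illustrative:

>>> df[df.id.isin([1, 2, 3])]   # keep rows whose id is 1, 2 or 3
>>> df[df.id.notin([1, 2, 3])]  # the complement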

isna()

Return a sequence or scalar according to the input indicating if the values are null.

Parameters:

expr – sequence or scalar

Returns:

sequence or scalar

isnull()

Return a sequence or scalar according to the input indicating if the values are null.

Parameters:

expr – sequence or scalar

Returns:

sequence or scalar

join(right, on=None, how='inner', suffixes=('_x', '_y'), mapjoin=False, skewjoin=False)

Join two collections.

If on is not specified, the common fields of the left and right collections are used. suffixes means that if column names conflict, the suffixes will be added automatically. For example, if both left and right have a field named col, the joined collection will contain col_x and col_y.

Parameters:
  • left – left collection

  • right – right collection

  • on – fields to join on

  • how – ‘inner’, ‘left’, ‘right’, or ‘outer’

  • suffixes – when names conflict, the suffixes will be added to both columns.

  • mapjoin – whether to use mapjoin, False by default.

  • skewjoin – whether to use skewjoin, False by default. Can be True if the collection is skewed, or a list specifying columns with skewed values, or a list of dicts specifying skewed combinations.

Returns:

collection

Example:

>>> df.dtypes.names
['name', 'id']
>>> df2.dtypes.names
['name', 'id1']
>>> df.join(df2)
>>> df.join(df2, on='name')
>>> df.join(df2, on=('id', 'id1'))
>>> df.join(df2, on=['name', ('id', 'id1')])
>>> df.join(df2, on=[df.name == df2.name, df.id == df2.id1])
>>> df.join(df2, mapjoin=False)
>>> df.join(df2, skewjoin=True)
>>> df.join(df2, skewjoin=["c0", "c1"])
>>> df.join(df2, skewjoin=[{"c0": 1, "c1": "2"}, {"c0": 3, "c1": "4"}])
key_value(**kwargs)

Set fields to be key-value represented.

Return type:

Column

Example:

>>> new_f1 = df.f1.key_value(kv=':', item=',')
left_join(right, on=None, suffixes=('_x', '_y'), mapjoin=False, merge_columns=None, skewjoin=False)

Left join two collections.

If on is not specified, the common fields of the left and right collections are used. suffixes means that if column names conflict, the suffixes will be added automatically. For example, if both left and right have a field named col, the joined collection will contain col_x and col_y.

Parameters:
  • left – left collection

  • right – right collection

  • on – fields to join on

  • suffixes – when names conflict, the suffixes will be added to both columns.

  • mapjoin – whether to use mapjoin, False by default.

  • merge_columns – whether to merge columns with the same name into one column without a suffix. If the value is True, columns in the predicate with the same names will be merged, taking the non-null value. If the value is ‘left’ or ‘right’, the values of predicates on the left / right collection will be taken. You can also pass a dictionary to describe the behavior of each column, such as { ‘a’: ‘auto’, ‘b’: ‘left’ }.

Returns:

collection

Example:

>>> df.dtypes.names
['name', 'id']
>>> df2.dtypes.names
['name', 'id1']
>>> df.left_join(df2)
>>> df.left_join(df2, on='name')
>>> df.left_join(df2, on=('id', 'id1'))
>>> df.left_join(df2, on=['name', ('id', 'id1')])
>>> df.left_join(df2, on=[df.name == df2.name, df.id == df2.id1])
map(func, rtype=None, resources=None, args=(), **kwargs)

Call func on each element of this sequence.

Parameters:
  • func – lambda, function, odps.models.Function, or str which is the name of an odps.models.Function

  • rtype – if not provided, will be the dtype of this sequence

Returns:

a new sequence

Example:

>>> df.id.map(lambda x: x + 1)
max()

Max value

Parameters:

expr

Returns:

min()

Min value

Parameters:

expr

Returns:

notin(values)

Return a boolean sequence or scalar showing whether each element is not contained in the passed values.

Parameters:
  • expr – sequence or scalar

  • values – list object or sequence

Returns:

boolean sequence or scalar

notna()

Return a sequence or scalar according to the input indicating if the values are not null.

Parameters:

expr – sequence or scalar

Returns:

sequence or scalar

notnull()

Return a sequence or scalar according to the input indicating if the values are not null.

Parameters:

expr – sequence or scalar

Returns:

sequence or scalar

nunique()

The distinct count.

Parameters:

expr

Returns:

outer_join(right, on=None, suffixes=('_x', '_y'), mapjoin=False, merge_columns=None, skewjoin=False)

Outer join two collections.

If on is not specified, the common fields of the left and right collections are used. suffixes means that if column names conflict, the suffixes will be added automatically. For example, if both left and right have a field named col, the joined collection will contain col_x and col_y.

Parameters:
  • left – left collection

  • right – right collection

  • on – fields to join on

  • suffixes – when names conflict, the suffixes will be added to both columns.

  • mapjoin – whether to use mapjoin, False by default.

  • merge_columns – whether to merge columns with the same name into one column without a suffix. If the value is True, columns in the predicate with the same names will be merged, taking the non-null value. If the value is ‘left’ or ‘right’, the values of predicates on the left / right collection will be taken. You can also pass a dictionary to describe the behavior of each column, such as { ‘a’: ‘auto’, ‘b’: ‘left’ }.

Returns:

collection

Example:

>>> df.dtypes.names
['name', 'id']
>>> df2.dtypes.names
['name', 'id1']
>>> df.outer_join(df2)
>>> df.outer_join(df2, on='name')
>>> df.outer_join(df2, on=('id', 'id1'))
>>> df.outer_join(df2, on=['name', ('id', 'id1')])
>>> df.outer_join(df2, on=[df.name == df2.name, df.id == df2.id1])
persist(name, partitions=None, partition=None, lifecycle=None, project=None, **kwargs)

Persist the execution into a new table. If partitions is not specified, a new table without partitions will be created if the table does not exist, and the SQL result will be inserted into it. If partitions is specified, the listed fields will be the partition fields of the new table. If partition is specified, the data will be inserted into the exact partition of the table.

Parameters:
  • name – table name

  • partitions (list) – list of strings, the partition fields

  • partition (string or PartitionSpec) – persist to a specified partition

  • lifecycle (int) – table lifecycle. If absent, options.lifecycle will be used.

  • project – project name; if not provided, the default project will be used

  • hints (dict) – settings for SQL, e.g. odps.sql.mapper.split.size

  • priority (int) – instance priority, 9 by default

  • running_cluster – cluster to run this instance

  • overwrite (bool) – overwrite the table, True by default

  • drop_table (bool) – drop the table if it exists, False by default

  • create_table (bool) – create the table first if it does not exist, True by default

  • drop_partition (bool) – drop the partition if it exists, False by default

  • create_partition (bool) – create the partition if it does not exist, None by default

  • cast (bool) – cast all columns’ types to those of the existing table, False by default

Returns:

odps.df.DataFrame

Example:

>>> df = df['name', 'id', 'ds']
>>> df.persist('odps_new_table')
>>> df.persist('odps_new_table', partition='pt=test')
>>> df.persist('odps_new_table', partitions=['ds'])
right_join(right, on=None, suffixes=('_x', '_y'), mapjoin=False, merge_columns=None, skewjoin=False)

Right join two collections.

If on is not specified, the common fields of the left and right collections are used. suffixes means that if column names conflict, the suffixes will be added automatically. For example, if both left and right have a field named col, the joined collection will contain col_x and col_y.

Parameters:
  • left – left collection

  • right – right collection

  • on – fields to join on

  • suffixes – when names conflict, the suffixes will be added to both columns.

  • mapjoin – whether to use mapjoin, False by default.

  • merge_columns – whether to merge columns with the same name into one column without a suffix. If the value is True, columns in the predicate with the same names will be merged, taking the non-null value. If the value is ‘left’ or ‘right’, the values of predicates on the left / right collection will be taken. You can also pass a dictionary to describe the behavior of each column, such as { ‘a’: ‘auto’, ‘b’: ‘left’ }.

Returns:

collection

Example:

>>> df.dtypes.names
['name', 'id']
>>> df2.dtypes.names
['name', 'id1']
>>> df.right_join(df2)
>>> df.right_join(df2, on='name')
>>> df.right_join(df2, on=('id', 'id1'))
>>> df.right_join(df2, on=['name', ('id', 'id1')])
>>> df.right_join(df2, on=[df.name == df2.name, df.id == df2.id1])
role(role_name)

Set role of current column

Parameters:

role_name – name of the role to be selected.

Returns:

setdiff(*rights, **kwargs)

Exclude data from a collection, like the EXCEPT clause in SQL. All collections involved should have the same schema.

Parameters:
  • left – collection to drop data from

  • rights – collection or list of collections

  • distinct – whether to remove duplicate entries in the result

Returns:

collection

Examples:

>>> import pandas as pd
>>> df1 = DataFrame(pd.DataFrame({'a': [1, 2, 3, 3, 3], 'b': [1, 2, 3, 3, 3]}))
>>> df2 = DataFrame(pd.DataFrame({'a': [1, 3], 'b': [1, 3]}))
>>> df1.setdiff(df2)
   a  b
0  2  2
1  3  3
2  3  3
>>> df1.setdiff(df2, distinct=True)
   a  b
0  2  2
size()

Count the number of elements.

Parameters:

expr

Returns:

switch(*args, **kw)

Similar to CASE WHEN in SQL. Refer to the example below.

Parameters:
  • expr

  • args

  • kw

Returns:

sequence or scalar

Example:

>>> # if df.id == 3 then df.name
>>> # elif df.id == df.fid.abs() then df.name + 'test'
>>> # default: 'test'
>>> df.id.switch(3, df.name, df.fid.abs(), df.name + 'test', default='test')
tail(n=None, **kwargs)[source]

Return the last n rows. Execute at once.

Parameters:

n

Returns:

to_pandas(wrap=False, **kwargs)[source]

Convert to a pandas Series. Execute at once.

Parameters:

wrap – if True, wrap the pandas DataFrame into a PyODPS DataFrame

Returns:

pandas Series

tolist(**kwargs)

Pack all data in the sequence into a list.

Parameters:
  • expr

  • unique – make every element in the sequence unique

Returns:

union(right, distinct=False)

Union two collections.

Parameters:
  • left – left collection

  • right – right collection

  • distinct

Returns:

collection

Example:

>>> df['name', 'id'].union(df2['id', 'name'])
value_counts(sort=True, ascending=False, dropna=False)

Return object containing counts of unique values.

The resulting object will be in descending order so that the first element is the most frequently occurring element.

Parameters:
  • expr – sequence

  • sort (bool) – whether to sort

  • dropna – don’t include counts of None; default False

Returns:

collection with two columns

Return type:

odps.df.expr.expressions.CollectionExpr
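
A short sketch, following the class-level example above:

>>> df.movie_id.value_counts()[:25]  # top 25 movie ids by count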

verify()

Verify if this expression can be compiled into ODPS SQL.

Returns:

True if compilation succeeds else False

Return type:

bool

class odps.df.expr.expressions.Int64SequenceExpr(*args, **kwargs)[source]
kurt()

Calculate kurtosis of the sequence

Parameters:

expr

Returns:

kurtosis()

Calculate kurtosis of the sequence

Parameters:

expr

Returns:

mean()

Arithmetic mean.

Parameters:

expr

Returns:

median()

Median value.

Parameters:

expr

Returns:

moment(order, central=False)

Calculate the n-th order moment of the sequence

Parameters:
  • expr

  • order – moment order, must be an integer

  • central – whether central moments are to be computed.

Returns:

quantile(prob=None, **kw)

Percentile value.

Parameters:
  • expr

  • prob – probability or list of probabilities, in [0, 1]

Returns:

skew()

Calculate skewness of the sequence

Parameters:

expr

Returns:

std(**kw)

Standard deviation.

Parameters:
  • expr

  • kw

Returns:

sum()

Sum value

Parameters:

expr

Returns:

to_datetime()

Return a sequence or scalar that is the datetime value of the current numeric sequence or scalar.

Parameters:

expr – sequence or scalar

Returns:

sequence or scalar

var(**kw)

Variance

Parameters:
  • expr

  • ddof – degrees of freedom

  • kw

Returns:

class odps.df.expr.expressions.StringSequenceExpr(*args, **kwargs)[source]
capitalize()

Convert strings in the Sequence or string scalar to be capitalized. Equivalent to str.capitalize().

Parameters:

expr

Returns:

sequence or scalar

cat(others=None, sep=None, na_rep=None)

Concatenate strings in sequence with given separator

Parameters:
  • expr

  • others – other sequences

  • sep – string or None, default None

  • na_rep – string or None, default None; if None, NAs in the sequence are ignored

Returns:

contains(pat, case=True, flags=0, regex=True)

Return boolean sequence whether given pattern/regex is contained in each string in the sequence

Parameters:
  • expr – sequence or scalar

  • pat – character sequence or regular expression

  • case (bool) – if True, case sensitive

  • flags – re module flags, e.g. re.IGNORECASE

  • regex – if True use regex, otherwise use string finder

Returns:

sequence or scalar

count(*args, **kwargs)

Value counts

Parameters:

expr

Returns:

endswith(pat)

Return boolean sequence or scalar indicating whether each string in the sequence or scalar ends with passed pattern. Equivalent to str.endswith().

Parameters:
  • expr

  • pat – character sequence

Returns:

sequence or scalar

extract(pat, flags=0, group=0)

Find group in each string in the Series using passed regular expression.

Parameters:
  • expr

  • pat – pattern or regular expression

  • flags – re module flags, e.g. re.IGNORECASE

  • group – the group to extract; defaults to group 0

Returns:

sequence or scalar
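
A short sketch; the pattern is illustrative:

>>> df.name.extract(r'name(\d+)', group=1)  # pull the numeric suffix out of values like 'name1'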

find(sub, start=0, end=None)

Return the lowest index in each string of the sequence or scalar where the substring is fully contained between [start:end]. Return -1 on failure. Equivalent to standard str.find().

Parameters:
  • expr

  • sub – substring being searched for

  • start – left edge index

  • end – right edge index

Returns:

sequence or scalar

get(index)

Extract element from lists, tuples, or strings in each element in the sequence or scalar

Parameters:
  • expr

  • index – integer index (location)

Returns:

sequence or scalar

isalnum()

Check whether all characters in each string in the sequence or scalar are alphanumeric. Equivalent to str.isalnum().

Parameters:

expr

Returns:

boolean sequence or scalar

isalpha()

Check whether all characters in each string in the sequence or scalar are alphabetic. Equivalent to str.isalpha().

Parameters:

expr

Returns:

boolean sequence or scalar

isdecimal()

Check whether all characters in each string in the sequence or scalar are decimal. Equivalent to str.isdecimal().

Parameters:

expr

Returns:

boolean sequence or scalar

isdigit()

Check whether all characters in each string in the sequence or scalar are digits. Equivalent to str.isdigit().

Parameters:

expr

Returns:

boolean sequence or scalar

islower()

Check whether all characters in each string in the sequence or scalar are lowercase. Equivalent to str.islower().

Parameters:

expr

Returns:

boolean sequence or scalar

isnumeric()

Check whether all characters in each string in the sequence or scalar are numeric. Equivalent to str.isnumeric().

Parameters:

expr

Returns:

boolean sequence or scalar

isspace()

Check whether all characters in each string in the sequence or scalar are whitespace. Equivalent to str.isspace().

Parameters:

expr

Returns:

boolean sequence or scalar

istitle()

Check whether all characters in each string in the sequence or scalar are titlecase. Equivalent to str.istitle().

Parameters:

expr

Returns:

boolean sequence or scalar

isupper()

Check whether all characters in each string in the sequence or scalar are uppercase. Equivalent to str.isupper().

Parameters:

expr

Returns:

boolean sequence or scalar

len()

Compute length of each string in the sequence or scalar

Parameters:

expr

Returns:

lengths

ljust(width, fillchar=' ')

Fill the right side of strings in the sequence or scalar with an additional character. Equivalent to str.ljust().

Parameters:
  • expr

  • width – minimum width of the resulting string; additional characters will be filled with fillchar

  • fillchar – additional character for filling, default is whitespace.

Returns:

sequence or scalar

lower()

Convert strings in the sequence or scalar to lowercase. Equivalent to str.lower().

参数:

expr

返回:

sequence or scalar

lstrip(to_strip=None)

Strip whitespace (including newlines) from each string in the sequence or scalar from left side. Equivalent to str.lstrip().

参数:
  • expr

  • to_strip

返回:

sequence or scalar

pad(width, side='left', fillchar=' ')

Pad strings in the sequence or scalar with an additional character on the specified side.

参数:
  • expr

  • width – Minimum width of resulting string; additional characters will be filled with fillchar

  • side – {‘left’, ‘right’, ‘both’}, default ‘left’

  • fillchar – Additional character for filling, default is whitespace

返回:

sequence or scalar
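
A minimal sketch of the padding helpers (assuming a string column name):

>>> df.name.pad(10)                              # pad on the left with spaces
>>> df.name.pad(10, side='both', fillchar='*')   # pad on both sides
>>> df.name.ljust(10, fillchar='.')              # pad on the right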

repeat(repeats)

Duplicate each string in the sequence or scalar the indicated number of times.

参数:
  • expr

  • repeats – number of repetitions

返回:

sequence or scalar

replace(pat, repl, n=-1, case=True, flags=0, regex=True)

Replace occurrences of pattern/regex in the sequence or scalar with some other string. Equivalent to str.replace()

参数:
  • expr

  • pat – Character sequence or regular expression

  • repl – Replacement

  • n – Number of replacements to make from start

  • case – if True, case sensitive

  • flags – re module flag, e.g. re.IGNORECASE

返回:

sequence or scalar
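
A minimal sketch (assuming a string column name):

>>> df.name.replace('-', '_')                     # simple replacement
>>> df.name.replace(r'\s+', ' ')                  # collapse runs of whitespace via regex
>>> df.name.replace('ab', 'AB', n=1, case=False)  # first occurrence only, ignoring case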

rfind(sub, start=0, end=None)

Return the highest index in each string in the sequence or scalar where the substring is fully contained within [start:end]. Return -1 on failure. Equivalent to standard str.rfind().

参数:
  • expr

  • sub

  • start

  • end

返回:

sequence or scalar

rjust(width, fillchar=' ')

Fill the left side of strings in the sequence or scalar with an additional character. Equivalent to str.rjust().

参数:
  • expr

  • width – Minimum width of resulting string; additional characters will be filled with fillchar

  • fillchar – Additional character for filling, default is whitespace.

返回:

sequence or scalar

rstrip(to_strip=None)

Strip whitespace (including newlines) from each string in the sequence or scalar from right side. Equivalent to str.rstrip().

参数:
  • expr

  • to_strip

返回:

sequence or scalar

slice(start=None, stop=None, step=None)

Slice substrings from each element in the sequence or scalar

参数:
  • expr

  • start – int or None

  • stop – int or None

  • step – int or None

返回:

sliced
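
A minimal sketch (assuming a string column name); the arguments mirror Python's s[start:stop:step]:

>>> df.name.slice(0, 3)            # first three characters
>>> df.name.slice(1)               # drop the first character
>>> df.name.slice(None, None, 2)   # every other character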

split(pat=None, n=-1)

Split each string (as re.split does) in the sequence by the given pattern, propagating NA values. Equivalent to str.split().

参数:
  • expr

  • pat – Separator to split on. If None, splits on whitespace

  • n – not supported right now

返回:

list sequence or scalar
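
A minimal sketch (assuming name holds values like 'a,b,c'):

>>> df.name.split(',')   # yields a list sequence such as ['a', 'b', 'c']
>>> df.name.split()      # split on whitespace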

startswith(pat)

Return boolean sequence or scalar indicating whether each string in the sequence or scalar starts with passed pattern. Equivalent to str.startswith().

参数:
  • expr

  • pat – Character sequence

返回:

sequence or scalar

strip(to_strip=None)

Strip whitespace (including newlines) from each string in the sequence or scalar from left and right sides. Equivalent to str.strip().

参数:
  • expr

  • to_strip

返回:

sequence or scalar

strptime(date_format)

Return datetimes parsed according to date_format, which supports the same format codes as the Python standard library. Details can be found in the Python documentation for strftime/strptime.

参数:
  • expr

  • date_format (str) – date format string (e.g. “%Y-%m-%d”)

返回:
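
A minimal sketch (assuming ds holds strings like '2023-01-31'):

>>> df.ds.strptime('%Y-%m-%d')         # parse into a datetime sequence
>>> df.ds.strptime('%Y-%m-%d').year    # datetime properties can then be applied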

sum()

Sum value

参数:

expr

返回:

swapcase()

Convert strings in the sequence or scalar by swapping case. Equivalent to str.swapcase().

参数:

expr

返回:

converted

title()

Convert strings in the sequence or scalar to titlecase. Equivalent to str.title().

参数:

expr

返回:

converted

todict(item_delim=',', kv_delim='=')

Convert the string sequence / expr into a string dict given item and key-value delimiters.

参数:
  • expr

  • item_delim – delimiter between data items

  • kv_delim – delimiter between keys and values

返回:

dict sequence or scalar
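
A minimal sketch (assuming kv holds strings like 'a=1,b=2'):

>>> df.kv.todict()                               # -> {'a': '1', 'b': '2'}
>>> df.kv.todict(item_delim=';', kv_delim=':')   # for values like 'a:1;b:2'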

upper()

Convert strings in the sequence or scalar to uppercase. Equivalent to str.upper().

参数:

expr

返回:

sequence or scalar

zfill(width)

Fill the left side of strings in the sequence or scalar with 0. Equivalent to str.zfill().

参数:
  • expr

  • width – Minimum width of resulting string; additional characters will be filled with 0

返回:

filled

class odps.df.Scalar(*args, **kwargs)[源代码]

Represents a scalar value.

参数:
  • _value – value of the scalar

  • _value_type – value type of the scalar

Example:

>>> df[df, Scalar(4).rename('append_const')]
ast()

Return the AST string.

返回:

AST tree

返回类型:

str

between(left, right, inclusive=True)

Return a boolean sequence or scalar showing whether each element is between left and right.

参数:
  • expr – sequence or scalar

  • left – left value

  • right – right value

  • inclusive – if True, the test is left <= expr <= right; otherwise left < expr < right

返回:

boolean sequence or scalar
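
A minimal sketch (assuming a numeric column id):

>>> df[df.id.between(3, 10)]                   # 3 <= id <= 10
>>> df[df.id.between(3, 10, inclusive=False)]  # 3 < id < 10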

compile()

Compile this expression into an ODPS SQL statement

返回:

compiled DAG

返回类型:

str

cut(bins, right=True, labels=None, include_lowest=False, include_under=False, include_over=False)

Return indices of half-open bins to which each value of expr belongs.

参数:
  • expr – sequence or scalar

  • bins – list of scalars

  • right – indicates whether the bins include the rightmost edge or not. If right == True (the default), then the bins [1, 2, 3, 4] indicate (1, 2], (2, 3], (3, 4]

  • labels – Used as labels for the resulting bins. Must be of the same length as the resulting bins.

  • include_lowest – Whether the first interval should be left-inclusive or not.

  • include_under – include the bin below the leftmost edge or not

  • include_over – include the bin above the rightmost edge or not

返回:

sequence or scalar
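
A minimal sketch (assuming a numeric column id); three edges yield two bins, and the under/over buckets add two more labels:

>>> df.id.cut([0, 10, 50], labels=['low', 'high'])
>>> df.id.cut([0, 10, 50], include_under=True, include_over=True, labels=['under', 'low', 'high', 'over'])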

execute(**kwargs)
参数:
  • hints (dict) – settings for SQL, e.g. odps.sql.mapper.split.size

  • priority (int) – instance priority, 9 as default

  • running_cluster – cluster to run this instance

返回:

execution result

返回类型:

odps.df.backends.frame.ResultFrame

fillna(value)

Fill null with value.

参数:
  • expr – sequence or scalar

  • value – value to fill into

返回:

sequence or scalar

hash(func=None)

Calculate the hash value.

参数:
  • expr

  • func – hash function

返回:

inner_join(right, on=None, suffixes=('_x', '_y'), mapjoin=False, skewjoin=False)

Inner join two collections.

If on is not specified, the common fields of the left and right collections are used. If column names conflict, the given suffixes will be added automatically. For example, if both left and right have a field named col, there will be col_x and col_y in the joined collection.

参数:
  • left – left collection

  • right – right collection

  • on – fields to join on

  • suffixes – when names conflict, the suffixes will be added to both columns.

返回:

collection

Example:

>>> df.dtypes.names
['name', 'id']
>>> df2.dtypes.names
['name', 'id1']
>>> df.inner_join(df2)
>>> df.inner_join(df2, on='name')
>>> df.inner_join(df2, on=('id', 'id1'))
>>> df.inner_join(df2, on=['name', ('id', 'id1')])
>>> df.inner_join(df2, on=[df.name == df2.name, df.id == df2.id1])
isin(values)

Return a boolean sequence or scalar showing whether each element is exactly contained in the passed values.

参数:
  • expr – sequence or scalar

  • values – list object or sequence

返回:

boolean sequence or scalar
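
A minimal sketch (assuming columns name and id, and a second collection df2):

>>> df[df.name.isin(['Alice', 'Bob'])]
>>> df[df.id.isin(df2.id)]   # membership against another sequence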

isna()

Return a sequence or scalar according to the input indicating if the values are null.

参数:

expr – sequence or scalar

返回:

sequence or scalar

isnull()

Return a sequence or scalar according to the input indicating if the values are null.

参数:

expr – sequence or scalar

返回:

sequence or scalar

join(right, on=None, how='inner', suffixes=('_x', '_y'), mapjoin=False, skewjoin=False)

Join two collections.

If on is not specified, the common fields of the left and right collections are used. If column names conflict, the given suffixes will be added automatically. For example, if both left and right have a field named col, there will be col_x and col_y in the joined collection.

参数:
  • left – left collection

  • right – right collection

  • on – fields to join on

  • how – ‘inner’, ‘left’, ‘right’, or ‘outer’

  • suffixes – when names conflict, the suffixes will be added to both columns.

  • mapjoin – whether to use mapjoin, False by default.

  • skewjoin – whether to use skewjoin, False by default. Specify True if the collection is skewed, or a list of columns with skewed values, or a list of dicts specifying skewed value combinations.

返回:

collection

Example:

>>> df.dtypes.names
['name', 'id']
>>> df2.dtypes.names
['name', 'id1']
>>> df.join(df2)
>>> df.join(df2, on='name')
>>> df.join(df2, on=('id', 'id1'))
>>> df.join(df2, on=['name', ('id', 'id1')])
>>> df.join(df2, on=[df.name == df2.name, df.id == df2.id1])
>>> df.join(df2, mapjoin=False)
>>> df.join(df2, skewjoin=True)
>>> df.join(df2, skewjoin=["c0", "c1"])
>>> df.join(df2, skewjoin=[{"c0": 1, "c1": "2"}, {"c0": 3, "c1": "4"}])
left_join(right, on=None, suffixes=('_x', '_y'), mapjoin=False, merge_columns=None, skewjoin=False)

Left join two collections.

If on is not specified, the common fields of the left and right collections are used. If column names conflict, the given suffixes will be added automatically. For example, if both left and right have a field named col, there will be col_x and col_y in the joined collection.

参数:
  • left – left collection

  • right – right collection

  • on – fields to join on

  • suffixes – when names conflict, the suffixes will be added to both columns.

  • mapjoin – whether to use mapjoin, False by default.

  • merge_columns – whether to merge columns with the same name into one column without suffix. If the value is True, columns in the predicate with same names will be merged, with non-null value. If the value is ‘left’ or ‘right’, the values of predicates on the left / right collection will be taken. You can also pass a dictionary to describe the behavior of each column, such as { ‘a’: ‘auto’, ‘b’: ‘left’ }.

返回:

collection

Example:

>>> df.dtypes.names
['name', 'id']
>>> df2.dtypes.names
['name', 'id1']
>>> df.left_join(df2)
>>> df.left_join(df2, on='name')
>>> df.left_join(df2, on=('id', 'id1'))
>>> df.left_join(df2, on=['name', ('id', 'id1')])
>>> df.left_join(df2, on=[df.name == df2.name, df.id == df2.id1])
map(func, rtype=None, resources=None, args=(), **kwargs)

Call func on each element of this sequence.

参数:
  • func – lambda, function, odps.models.Function, or str which is the name of an odps.models.Function

  • rtype – if not provided, will be the dtype of this sequence

返回:

a new sequence

Example:

>>> df.id.map(lambda x: x + 1)
notin(values)

Return a boolean sequence or scalar showing whether each element is not contained in the passed values.

参数:
  • expr – sequence or scalar

  • values – list object or sequence

返回:

boolean sequence or scalar

notna()

Return a sequence or scalar according to the input indicating if the values are not null.

参数:

expr – sequence or scalar

返回:

sequence or scalar

notnull()

Return a sequence or scalar according to the input indicating if the values are not null.

参数:

expr – sequence or scalar

返回:

sequence or scalar

outer_join(right, on=None, suffixes=('_x', '_y'), mapjoin=False, merge_columns=None, skewjoin=False)

Outer join two collections.

If on is not specified, the common fields of the left and right collections are used. If column names conflict, the given suffixes will be added automatically. For example, if both left and right have a field named col, there will be col_x and col_y in the joined collection.

参数:
  • left – left collection

  • right – right collection

  • on – fields to join on

  • suffixes – when names conflict, the suffixes will be added to both columns.

  • mapjoin – whether to use mapjoin, False by default.

  • merge_columns – whether to merge columns with the same name into one column without suffix. If the value is True, columns in the predicate with same names will be merged, with non-null value. If the value is ‘left’ or ‘right’, the values of predicates on the left / right collection will be taken. You can also pass a dictionary to describe the behavior of each column, such as { ‘a’: ‘auto’, ‘b’: ‘left’ }.

返回:

collection

Example:

>>> df.dtypes.names
['name', 'id']
>>> df2.dtypes.names
['name', 'id1']
>>> df.outer_join(df2)
>>> df.outer_join(df2, on='name')
>>> df.outer_join(df2, on=('id', 'id1'))
>>> df.outer_join(df2, on=['name', ('id', 'id1')])
>>> df.outer_join(df2, on=[df.name == df2.name, df.id == df2.id1])
persist(name, partitions=None, partition=None, lifecycle=None, project=None, **kwargs)

Persist the execution result into a new table. If partitions is not specified, a new table without partitions will be created if the table does not exist, and the SQL result will be inserted into it. If partitions is specified, the listed fields become the partition fields of the new table. If partition is specified, the data will be inserted into the exact partition of the table.

参数:
  • name – table name

  • partitions (list) – list of string, the partition fields

  • partition (string or PartitionSpec) – persist to a specified partition

  • lifecycle (int) – table lifecycle. If absent, options.lifecycle will be used.

  • project – project name, if not provided, will be the default project

  • hints (dict) – settings for SQL, e.g. odps.sql.mapper.split.size

  • priority (int) – instance priority, 9 as default

  • running_cluster – cluster to run this instance

  • overwrite (bool) – overwrite the table, True as default

  • drop_table (bool) – drop table if exists, False as default

  • create_table (bool) – create table first if not exists, True as default

  • drop_partition (bool) – drop partition if exists, False as default

  • create_partition (bool) – create partition if not exists, None as default

  • cast (bool) – cast all columns’ types to those of the existing table, False as default

返回:

odps.df.DataFrame

Example:

>>> df = df['name', 'id', 'ds']
>>> df.persist('odps_new_table')
>>> df.persist('odps_new_table', partition='pt=test')
>>> df.persist('odps_new_table', partitions=['ds'])
right_join(right, on=None, suffixes=('_x', '_y'), mapjoin=False, merge_columns=None, skewjoin=False)

Right join two collections.

If on is not specified, the common fields of the left and right collections are used. If column names conflict, the given suffixes will be added automatically. For example, if both left and right have a field named col, there will be col_x and col_y in the joined collection.

参数:
  • left – left collection

  • right – right collection

  • on – fields to join on

  • suffixes – when names conflict, the suffixes will be added to both columns.

  • mapjoin – whether to use mapjoin, False by default.

  • merge_columns – whether to merge columns with the same name into one column without suffix. If the value is True, columns in the predicate with same names will be merged, with non-null value. If the value is ‘left’ or ‘right’, the values of predicates on the left / right collection will be taken. You can also pass a dictionary to describe the behavior of each column, such as { ‘a’: ‘auto’, ‘b’: ‘left’ }.

返回:

collection

Example:

>>> df.dtypes.names
['name', 'id']
>>> df2.dtypes.names
['name', 'id1']
>>> df.right_join(df2)
>>> df.right_join(df2, on='name')
>>> df.right_join(df2, on=('id', 'id1'))
>>> df.right_join(df2, on=['name', ('id', 'id1')])
>>> df.right_join(df2, on=[df.name == df2.name, df.id == df2.id1])
switch(*args, **kw)

Similar to CASE WHEN in SQL. Refer to the example below.

参数:
  • expr

  • args

  • kw

返回:

sequence or scalar

Example:

>>> # if df.id == 3 then df.name
>>> # elif df.id == df.fid.abs() then df.name + 'test'
>>> # default: 'test'
>>> df.id.switch(3, df.name, df.fid.abs(), df.name + 'test', default='test')
verify()

Verify if this expression can be compiled into ODPS SQL.

返回:

True if compilation succeeds, else False

返回类型:

bool

odps.df.NullScalar(tp)[源代码]

Creates a Scalar representing typed None values.

参数:

tp – type of the scalar

返回:

Scalar with None value
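
A minimal sketch: a typed null is useful as a placeholder column in a projection (the column name 'comment' is an assumed example):

>>> from odps.df import NullScalar
>>> df[df, NullScalar('string').rename('comment')]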

class odps.df.RandomScalar(seed=None, **kw)[源代码]

Represents a random scalar.

参数:

seed – random seed, None by default

Example:

>>> df[df, RandomScalar().rename('append_random')]
ast()

Return the AST string.

返回:

AST tree

返回类型:

str

between(left, right, inclusive=True)

Return a boolean sequence or scalar showing whether each element is between left and right.

参数:
  • expr – sequence or scalar

  • left – left value

  • right – right value

  • inclusive – if True, the test is left <= expr <= right; otherwise left < expr < right

返回:

boolean sequence or scalar

compile()

Compile this expression into an ODPS SQL statement

返回:

compiled DAG

返回类型:

str

cut(bins, right=True, labels=None, include_lowest=False, include_under=False, include_over=False)

Return indices of half-open bins to which each value of expr belongs.

参数:
  • expr – sequence or scalar

  • bins – list of scalars

  • right – indicates whether the bins include the rightmost edge or not. If right == True (the default), then the bins [1, 2, 3, 4] indicate (1, 2], (2, 3], (3, 4]

  • labels – Used as labels for the resulting bins. Must be of the same length as the resulting bins.

  • include_lowest – Whether the first interval should be left-inclusive or not.

  • include_under – include the bin below the leftmost edge or not

  • include_over – include the bin above the rightmost edge or not

返回:

sequence or scalar

execute(**kwargs)
参数:
  • hints (dict) – settings for SQL, e.g. odps.sql.mapper.split.size

  • priority (int) – instance priority, 9 as default

  • running_cluster – cluster to run this instance

返回:

execution result

返回类型:

odps.df.backends.frame.ResultFrame

fillna(value)

Fill null with value.

参数:
  • expr – sequence or scalar

  • value – value to fill into

返回:

sequence or scalar

hash(func=None)

Calculate the hash value.

参数:
  • expr

  • func – hash function

返回:

inner_join(right, on=None, suffixes=('_x', '_y'), mapjoin=False, skewjoin=False)

Inner join two collections.

If on is not specified, the common fields of the left and right collections are used. If column names conflict, the given suffixes will be added automatically. For example, if both left and right have a field named col, there will be col_x and col_y in the joined collection.

参数:
  • left – left collection

  • right – right collection

  • on – fields to join on

  • suffixes – when names conflict, the suffixes will be added to both columns.

返回:

collection

Example:

>>> df.dtypes.names
['name', 'id']
>>> df2.dtypes.names
['name', 'id1']
>>> df.inner_join(df2)
>>> df.inner_join(df2, on='name')
>>> df.inner_join(df2, on=('id', 'id1'))
>>> df.inner_join(df2, on=['name', ('id', 'id1')])
>>> df.inner_join(df2, on=[df.name == df2.name, df.id == df2.id1])
isin(values)

Return a boolean sequence or scalar showing whether each element is exactly contained in the passed values.

参数:
  • expr – sequence or scalar

  • values – list object or sequence

返回:

boolean sequence or scalar

isna()

Return a sequence or scalar according to the input indicating if the values are null.

参数:

expr – sequence or scalar

返回:

sequence or scalar

isnull()

Return a sequence or scalar according to the input indicating if the values are null.

参数:

expr – sequence or scalar

返回:

sequence or scalar

join(right, on=None, how='inner', suffixes=('_x', '_y'), mapjoin=False, skewjoin=False)

Join two collections.

If on is not specified, the common fields of the left and right collections are used. If column names conflict, the given suffixes will be added automatically. For example, if both left and right have a field named col, there will be col_x and col_y in the joined collection.

参数:
  • left – left collection

  • right – right collection

  • on – fields to join on

  • how – ‘inner’, ‘left’, ‘right’, or ‘outer’

  • suffixes – when names conflict, the suffixes will be added to both columns.

  • mapjoin – whether to use mapjoin, False by default.

  • skewjoin – whether to use skewjoin, False by default. Specify True if the collection is skewed, or a list of columns with skewed values, or a list of dicts specifying skewed value combinations.

返回:

collection

Example:

>>> df.dtypes.names
['name', 'id']
>>> df2.dtypes.names
['name', 'id1']
>>> df.join(df2)
>>> df.join(df2, on='name')
>>> df.join(df2, on=('id', 'id1'))
>>> df.join(df2, on=['name', ('id', 'id1')])
>>> df.join(df2, on=[df.name == df2.name, df.id == df2.id1])
>>> df.join(df2, mapjoin=False)
>>> df.join(df2, skewjoin=True)
>>> df.join(df2, skewjoin=["c0", "c1"])
>>> df.join(df2, skewjoin=[{"c0": 1, "c1": "2"}, {"c0": 3, "c1": "4"}])
left_join(right, on=None, suffixes=('_x', '_y'), mapjoin=False, merge_columns=None, skewjoin=False)

Left join two collections.

If on is not specified, the common fields of the left and right collections are used. If column names conflict, the given suffixes will be added automatically. For example, if both left and right have a field named col, there will be col_x and col_y in the joined collection.

参数:
  • left – left collection

  • right – right collection

  • on – fields to join on

  • suffixes – when names conflict, the suffixes will be added to both columns.

  • mapjoin – whether to use mapjoin, False by default.

  • merge_columns – whether to merge columns with the same name into one column without suffix. If the value is True, columns in the predicate with same names will be merged, with non-null value. If the value is ‘left’ or ‘right’, the values of predicates on the left / right collection will be taken. You can also pass a dictionary to describe the behavior of each column, such as { ‘a’: ‘auto’, ‘b’: ‘left’ }.

返回:

collection

Example:

>>> df.dtypes.names
['name', 'id']
>>> df2.dtypes.names
['name', 'id1']
>>> df.left_join(df2)
>>> df.left_join(df2, on='name')
>>> df.left_join(df2, on=('id', 'id1'))
>>> df.left_join(df2, on=['name', ('id', 'id1')])
>>> df.left_join(df2, on=[df.name == df2.name, df.id == df2.id1])
map(func, rtype=None, resources=None, args=(), **kwargs)

Call func on each element of this sequence.

参数:
  • func – lambda, function, odps.models.Function, or str which is the name of an odps.models.Function

  • rtype – if not provided, will be the dtype of this sequence

返回:

a new sequence

Example:

>>> df.id.map(lambda x: x + 1)
notin(values)

Return a boolean sequence or scalar showing whether each element is not contained in the passed values.

参数:
  • expr – sequence or scalar

  • values – list object or sequence

返回:

boolean sequence or scalar

notna()

Return a sequence or scalar according to the input indicating if the values are not null.

参数:

expr – sequence or scalar

返回:

sequence or scalar

notnull()

Return a sequence or scalar according to the input indicating if the values are not null.

参数:

expr – sequence or scalar

返回:

sequence or scalar

outer_join(right, on=None, suffixes=('_x', '_y'), mapjoin=False, merge_columns=None, skewjoin=False)

Outer join two collections.

If on is not specified, the common fields of the left and right collections are used. If column names conflict, the given suffixes will be added automatically. For example, if both left and right have a field named col, there will be col_x and col_y in the joined collection.

参数:
  • left – left collection

  • right – right collection

  • on – fields to join on

  • suffixes – when names conflict, the suffixes will be added to both columns.

  • mapjoin – whether to use mapjoin, False by default.

  • merge_columns – whether to merge columns with the same name into one column without suffix. If the value is True, columns in the predicate with same names will be merged, with non-null value. If the value is ‘left’ or ‘right’, the values of predicates on the left / right collection will be taken. You can also pass a dictionary to describe the behavior of each column, such as { ‘a’: ‘auto’, ‘b’: ‘left’ }.

返回:

collection

Example:

>>> df.dtypes.names
['name', 'id']
>>> df2.dtypes.names
['name', 'id1']
>>> df.outer_join(df2)
>>> df.outer_join(df2, on='name')
>>> df.outer_join(df2, on=('id', 'id1'))
>>> df.outer_join(df2, on=['name', ('id', 'id1')])
>>> df.outer_join(df2, on=[df.name == df2.name, df.id == df2.id1])
persist(name, partitions=None, partition=None, lifecycle=None, project=None, **kwargs)

Persist the execution result into a new table. If partitions is not specified, a new table without partitions will be created if the table does not exist, and the SQL result will be inserted into it. If partitions is specified, the listed fields become the partition fields of the new table. If partition is specified, the data will be inserted into the exact partition of the table.

参数:
  • name – table name

  • partitions (list) – list of string, the partition fields

  • partition (string or PartitionSpec) – persist to a specified partition

  • lifecycle (int) – table lifecycle. If absent, options.lifecycle will be used.

  • project – project name, if not provided, will be the default project

  • hints (dict) – settings for SQL, e.g. odps.sql.mapper.split.size

  • priority (int) – instance priority, 9 as default

  • running_cluster – cluster to run this instance

  • overwrite (bool) – overwrite the table, True as default

  • drop_table (bool) – drop table if exists, False as default

  • create_table (bool) – create table first if not exists, True as default

  • drop_partition (bool) – drop partition if exists, False as default

  • create_partition (bool) – create partition if not exists, None as default

  • cast (bool) – cast all columns’ types to those of the existing table, False as default

返回:

odps.df.DataFrame

Example:

>>> df = df['name', 'id', 'ds']
>>> df.persist('odps_new_table')
>>> df.persist('odps_new_table', partition='pt=test')
>>> df.persist('odps_new_table', partitions=['ds'])
right_join(right, on=None, suffixes=('_x', '_y'), mapjoin=False, merge_columns=None, skewjoin=False)

Right join two collections.

If on is not specified, the common fields of the left and right collections are used. If column names conflict, the given suffixes will be added automatically. For example, if both left and right have a field named col, there will be col_x and col_y in the joined collection.

参数:
  • left – left collection

  • right – right collection

  • on – fields to join on

  • suffixes – when names conflict, the suffixes will be added to both columns.

  • mapjoin – whether to use mapjoin, False by default.

  • merge_columns – whether to merge columns with the same name into one column without suffix. If the value is True, columns in the predicate with same names will be merged, with non-null value. If the value is ‘left’ or ‘right’, the values of predicates on the left / right collection will be taken. You can also pass a dictionary to describe the behavior of each column, such as { ‘a’: ‘auto’, ‘b’: ‘left’ }.

返回:

collection

Example:

>>> df.dtypes.names
['name', 'id']
>>> df2.dtypes.names
['name', 'id1']
>>> df.right_join(df2)
>>> df.right_join(df2, on='name')
>>> df.right_join(df2, on=('id', 'id1'))
>>> df.right_join(df2, on=['name', ('id', 'id1')])
>>> df.right_join(df2, on=[df.name == df2.name, df.id == df2.id1])
switch(*args, **kw)

Similar to CASE WHEN in SQL. Refer to the example below.

参数:
  • expr

  • args

  • kw

返回:

sequence or scalar

Example:

>>> # if df.id == 3 then df.name
>>> # elif df.id == df.fid.abs() then df.name + 'test'
>>> # default: 'test'
>>> df.id.switch(3, df.name, df.fid.abs(), df.name + 'test', default='test')
verify()

Verify if this expression can be compiled into ODPS SQL.

返回:

True if compilation succeeds, else False

返回类型:

bool

class odps.df.expr.groupby.GroupBy(*args, **kwargs)[源代码]
all()

All is True.

参数:

expr

返回:

any()

Any is True.

参数:

expr

返回:

ast()

Return the AST string.

返回:

AST tree

返回类型:

str

compile()

Compile this expression into an ODPS SQL statement

返回:

compiled DAG

返回类型:

str

count()

Value counts

参数:

expr

返回:

cume_dist(sort=None, ascending=True)

Calculate cumulative ratio of a sequence expression.

参数:
  • expr – expression for calculation

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

返回:

calculated column

dense_rank(sort=None, ascending=True)

Calculate dense rank of a sequence expression.

参数:
  • expr – expression for calculation

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

返回:

calculated column
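
A sketch based on the signature above (assuming columns name and id; ranks restart within each name group, and ties share a rank without gaps):

>>> df.groupby('name').dense_rank(sort='id')
>>> df.groupby('name').dense_rank(sort='id', ascending=False)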

execute(**kwargs)
参数:
  • hints (dict) – settings for SQL, e.g. odps.sql.mapper.split.size

  • priority (int) – instance priority, 9 as default

  • running_cluster – cluster to run this instance

返回:

execution result

返回类型:

odps.df.backends.frame.ResultFrame

kurt()

Calculate kurtosis of the sequence

参数:

expr

返回:

kurtosis()

Calculate kurtosis of the sequence

参数:

expr

返回:

max()

Max value

参数:

expr

返回:

mean()

Arithmetic mean.

参数:

expr

返回:

median()

Median value.

参数:

expr

返回:

min()

Min value

参数:

expr

返回:

min_rank(sort=None, ascending=True)

Calculate rank of a sequence expression.

参数:
  • expr – expression for calculation

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

返回:

calculated column

moment(order, central=False)

Calculate the n-th order moment of the sequence

参数:
  • expr

  • order – moment order, must be an integer

  • central – whether to compute central moments.

返回:

nth_value(nth, skip_nulls=False, sort=None, ascending=True)

Get nth value of a grouped and sorted expression.

参数:
  • expr – expression for calculation

  • nth – integer position

  • skip_nulls – whether to skip null values, False by default

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

返回:

calculated column

nunique()

The distinct count.

参数:

expr

返回:

percent_rank(sort=None, ascending=True)

Calculate percentage rank of a sequence expression.

参数:
  • expr – expression for calculation

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

返回:

calculated column

persist(name, partitions=None, partition=None, lifecycle=None, project=None, **kwargs)

Persist the execution result into a new table. If partitions is not specified, a new table without partitions will be created if the table does not exist, and the SQL result will be inserted into it. If partitions is specified, the listed fields become the partition fields of the new table. If partition is specified, the data will be inserted into the exact partition of the table.

参数:
  • name – table name

  • partitions (list) – list of string, the partition fields

  • partition (string or PartitionSpec) – persist to a specified partition

  • lifecycle (int) – table lifecycle. If absent, options.lifecycle will be used.

  • project – project name, if not provided, will be the default project

  • hints (dict) – settings for SQL, e.g. odps.sql.mapper.split.size

  • priority (int) – instance priority, 9 as default

  • running_cluster – cluster to run this instance

  • overwrite (bool) – overwrite the table, True as default

  • drop_table (bool) – drop table if exists, False as default

  • create_table (bool) – create table first if not exists, True as default

  • drop_partition (bool) – drop partition if exists, False as default

  • create_partition (bool) – create partition if not exists, None as default

  • cast (bool) – cast all columns’ types to those of the existing table, False as default

返回:

odps.df.DataFrame

Example:

>>> df = df['name', 'id', 'ds']
>>> df.persist('odps_new_table')
>>> df.persist('odps_new_table', partition='pt=test')
>>> df.persist('odps_new_table', partitions=['ds'])
qcut(bins, labels=False, sort=None, ascending=True)

Get quantile-based bin indices of every element of a grouped and sorted expression. The indices of bins start from 0. If cuts are not of equal sizes, extra items will be appended into the first group.

参数:
  • expr – expression for calculation

  • bins – number of bins

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

返回:

calculated column

quantile(prob=None, **kw)

Percentile value.

参数:
  • expr

  • prob – probability or list of probabilities, in [0, 1]

返回:

rank(sort=None, ascending=True)

Calculate rank of a sequence expression.

参数:
  • expr – expression for calculation

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

返回:

calculated column

row_number(sort=None, ascending=True)

Calculate row number of a sequence expression.

参数:
  • expr – expression for calculation

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

返回:

calculated column

size()

Value counts

参数:

expr

返回:

skew()

Calculate skewness of the sequence

参数:

expr

返回:

std(**kw)

Standard deviation.

参数:
  • expr

  • kw

返回:

sum()

Sum value

参数:

expr

返回:

tolist(**kwargs)

Pack all data in the sequence into a list.

参数:
  • expr

  • unique – whether to make the elements in the list unique

返回:

var(**kw)

Variance

参数:
  • expr

  • ddof – degrees of freedom

  • kw

返回:

verify()

Verify if this expression can be compiled into ODPS SQL.

返回:

True if compilation succeeds, else False

返回类型:

bool

class odps.df.expr.groupby.SequenceGroupBy(*args, **kwargs)[源代码]
ast()

Return the AST string.

返回:

AST tree

返回类型:

str

compile()

Compile this expression into an ODPS SQL statement

返回:

compiled DAG

返回类型:

str

count()

Value counts

参数:

expr

返回:

cumcount(sort=None, ascending=True, unique=False, preceding=None, following=None)

Calculate cumulative count of a sequence expression.

参数:
  • expr – expression for calculation

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

  • unique – whether to eliminate duplicate entries

  • preceding – the start point of a window

  • following – the end point of a window

返回:

calculated column

cume_dist(sort=None, ascending=True)

Calculate cumulative ratio of a sequence expression.

参数:
  • expr – expression for calculation

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

返回:

calculated column

cummax(sort=None, ascending=True, unique=False, preceding=None, following=None)

Calculate cumulative maximum of a sequence expression.

参数:
  • expr – expression for calculation

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

  • unique – whether to eliminate duplicate entries

  • preceding – the start point of a window

  • following – the end point of a window

返回:

calculated column

cummin(sort=None, ascending=True, unique=False, preceding=None, following=None)

Calculate cumulative minimum of a sequence expression.

参数:
  • expr – expression for calculation

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

  • unique – whether to eliminate duplicate entries

  • preceding – the start point of a window

  • following – the end point of a window

返回:

calculated column

dense_rank(sort=None, ascending=True)

Calculate dense rank of a sequence expression.

参数:
  • expr – expression for calculation

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

返回:

calculated column

execute(**kwargs)
参数:
  • hints (dict) – settings for SQL, e.g. odps.sql.mapper.split.size

  • priority (int) – instance priority, 9 as default

  • running_cluster – cluster to run this instance

返回:

execution result

返回类型:

odps.df.backends.frame.ResultFrame

hll_count(error_rate=0.01, splitter=None)

Calculate HyperLogLog count

参数:
  • expr

  • error_rate (float) – error rate

  • splitter – the splitter to split the column value

返回:

sequence or scalar

Example:

>>> df = DataFrame(pd.DataFrame({'a': np.random.randint(100000, size=100000)}))
>>> df.a.hll_count()
63270
>>> df.a.nunique()
63250
lag(offset, default=None, sort=None, ascending=True)

Get value in the row offset rows prior to the current row.

参数:
  • offset – the offset value

  • default – default value for the function, when there are no rows satisfying the offset

  • expr – expression for calculation

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

返回:

calculated column

lead(offset, default=None, sort=None, ascending=True)

Get value in the row offset rows after the current row.

参数:
  • offset – the offset value

  • default – default value for the function, when there are no rows satisfying the offset

  • expr – expression for calculation

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

返回:

calculated column
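
A sketch based on the signatures above (assuming columns name, id and ts; values are taken within each name group, ordered by ts):

>>> df.groupby('name').id.lag(1, sort='ts')               # previous row's id
>>> df.groupby('name').id.lead(1, default=-1, sort='ts')  # next row's id, -1 at the group tail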

max()

Max value

参数:

expr

返回:

min()

Min value

参数:

expr

返回:

min_rank(sort=None, ascending=True)

Calculate rank of a sequence expression.

参数:
  • expr – expression for calculation

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

返回:

calculated column

nth_value(nth, skip_nulls=False, sort=None, ascending=True)

Get nth value of a grouped and sorted expression.

参数:
  • expr – expression for calculation

  • nth – integer position

  • skip_nulls – whether to skip null values, False by default

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

返回:

calculated column

nunique()

The distinct count.

参数:

expr

返回:

percent_rank(sort=None, ascending=True)

Calculate percentage rank of a sequence expression.

参数:
  • expr – expression for calculation

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

返回:

calculated column

persist(name, partitions=None, partition=None, lifecycle=None, project=None, **kwargs)

Persist the execution result into a new table. If partitions is not specified, a new table without partitions will be created if the table does not exist, and the SQL result will be inserted into it. If partitions is specified, the listed fields become the partition fields of the new table. If partition is specified, the data will be inserted into the exact partition of the table.

参数:
  • name – table name

  • partitions (list) – list of string, the partition fields

  • partition (string or PartitionSpec) – persist to a specified partition

  • lifecycle (int) – table lifecycle. If absent, options.lifecycle will be used.

  • project – project name, if not provided, will be the default project

  • hints (dict) – settings for SQL, e.g. odps.sql.mapper.split.size

  • priority (int) – instance priority, 9 as default

  • running_cluster – cluster to run this instance

  • overwrite (bool) – overwrite the table, True as default

  • drop_table (bool) – drop table if exists, False as default

  • create_table (bool) – create table first if not exists, True as default

  • drop_partition (bool) – drop partition if exists, False as default

  • create_partition (bool) – create partition if not exists, None as default

  • cast (bool) – cast all columns’ types to those of the existing table, False as default

返回:

odps.df.DataFrame

Example:

>>> df = df['name', 'id', 'ds']
>>> df.persist('odps_new_table')
>>> df.persist('odps_new_table', partition='pt=test')
>>> df.persist('odps_new_table', partitions=['ds'])
qcut(bins, labels=False, sort=None, ascending=True)

Get quantile-based bin indices of every element of a grouped and sorted expression. The indices of bins start from 0. If cuts are not of equal sizes, extra items will be appended into the first group.

参数:
  • expr – expression for calculation

  • bins – number of bins

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

返回:

calculated column

rank(sort=None, ascending=True)

Calculate rank of a sequence expression.

参数:
  • expr – expression for calculation

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

返回:

calculated column

row_number(sort=None, ascending=True)

Calculate row number of a sequence expression.

参数:
  • expr – expression for calculation

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

返回:

calculated column

size()

Value counts

参数:

expr

返回:

tolist(**kwargs)

Pack all data in the sequence into a list.

参数:
  • expr

  • unique – whether to make the elements in the list unique

返回:

verify()

Verify if this expression can be compiled into ODPS SQL.

返回:

True if compilation succeeds, else False

返回类型:

bool

class odps.df.expr.groupby.Int64SequenceGroupBy(*args, **kwargs)[源代码]
ast()

Return the AST string.

返回:

AST tree

返回类型:

str

compile()

Compile this expression into an ODPS SQL statement

返回:

compiled DAG

返回类型:

str

count()

Value counts

参数:

expr

返回:

cumcount(sort=None, ascending=True, unique=False, preceding=None, following=None)

Calculate cumulative count of a sequence expression.

参数:
  • expr – expression for calculation

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

  • unique – whether to eliminate duplicate entries

  • preceding – the start point of a window

  • following – the end point of a window

返回:

calculated column

cume_dist(sort=None, ascending=True)

Calculate cumulative ratio of a sequence expression.

参数:
  • expr – expression for calculation

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

返回:

calculated column

cummax(sort=None, ascending=True, unique=False, preceding=None, following=None)

Calculate cumulative maximum of a sequence expression.

参数:
  • expr – expression for calculation

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

  • unique – whether to eliminate duplicate entries

  • preceding – the start point of a window

  • following – the end point of a window

返回:

calculated column

cummean(sort=None, ascending=True, unique=False, preceding=None, following=None)

Calculate cumulative mean of a sequence expression.

参数:
  • expr – expression for calculation

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

  • unique – whether to eliminate duplicate entries

  • preceding – the start point of a window

  • following – the end point of a window

返回:

calculated column

cummedian(sort=None, ascending=True, unique=False, preceding=None, following=None)

Calculate cumulative median of a sequence expression.

参数:
  • expr – expression for calculation

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

  • unique – whether to eliminate duplicate entries

  • preceding – the start point of a window

  • following – the end point of a window

返回:

calculated column

cummin(sort=None, ascending=True, unique=False, preceding=None, following=None)

Calculate cumulative minimum of a sequence expression.

参数:
  • expr – expression for calculation

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

  • unique – whether to eliminate duplicate entries

  • preceding – the start point of a window

  • following – the end point of a window

返回:

calculated column

cumstd(sort=None, ascending=True, unique=False, preceding=None, following=None)

Calculate cumulative standard deviation of a sequence expression.

参数:
  • expr – expression for calculation

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

  • unique – whether to eliminate duplicate entries

  • preceding – the start point of a window

  • following – the end point of a window

返回:

calculated column

cumsum(sort=None, ascending=True, unique=False, preceding=None, following=None)

Calculate cumulative summation of a sequence expression.

参数:
  • expr – expression for calculation

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

  • unique – whether to eliminate duplicate entries

  • preceding – the start point of a window

  • following – the end point of a window

返回:

calculated column

dense_rank(sort=None, ascending=True)

Calculate dense rank of a sequence expression.

参数:
  • expr – expression for calculation

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

返回:

calculated column

execute(**kwargs)
参数:
  • hints (dict) – settings for SQL, e.g. odps.sql.mapper.split.size

  • priority (int) – instance priority, 9 as default

  • running_cluster – cluster to run this instance

返回:

execution result

返回类型:

odps.df.backends.frame.ResultFrame

hll_count(error_rate=0.01, splitter=None)

Calculate HyperLogLog count

参数:
  • expr

  • error_rate (float) – error rate

  • splitter – the splitter to split the column value

返回:

sequence or scalar

Example:

>>> df = DataFrame(pd.DataFrame({'a': np.random.randint(100000, size=100000)}))
>>> df.a.hll_count()
63270
>>> df.a.nunique()
63250
kurt()

Calculate kurtosis of the sequence

参数:

expr

返回:

kurtosis()

Calculate kurtosis of the sequence

参数:

expr

返回:

lag(offset, default=None, sort=None, ascending=True)

Get value in the row offset rows prior to the current row.

参数:
  • offset – the offset value

  • default – default value for the function, when there are no rows satisfying the offset

  • expr – expression for calculation

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

返回:

calculated column

lead(offset, default=None, sort=None, ascending=True)

Get value in the row offset rows after the current row.

参数:
  • offset – the offset value

  • default – default value for the function, when there are no rows satisfying the offset

  • expr – expression for calculation

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

返回:

calculated column

max()

Max value

参数:

expr

返回:

mean()

Arithmetic mean.

参数:

expr

返回:

median()

Median value.

参数:

expr

返回:

min()

Min value

参数:

expr

返回:

min_rank(sort=None, ascending=True)

Calculate rank of a sequence expression.

参数:
  • expr – expression for calculation

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

返回:

calculated column

moment(order, central=False)

Calculate the n-th order moment of the sequence

参数:
  • expr

  • order – moment order, must be an integer

  • central – whether to compute central moments.

返回:

nth_value(nth, skip_nulls=False, sort=None, ascending=True)

Get nth value of a grouped and sorted expression.

参数:
  • expr – expression for calculation

  • nth – integer position

  • skip_nulls – whether to skip null values, False by default

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

返回:

calculated column

nunique()

The distinct count.

参数:

expr

返回:

percent_rank(sort=None, ascending=True)

Calculate percentage rank of a sequence expression.

参数:
  • expr – expression for calculation

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

返回:

calculated column

persist(name, partitions=None, partition=None, lifecycle=None, project=None, **kwargs)

Persist the execution result into a new table. If partitions is not specified, a new table without partitions will be created if the table does not exist, and the SQL result will be inserted into it. If partitions is specified, the listed fields become the partition fields of the new table. If partition is specified, the data will be inserted into the exact partition of the table.

参数:
  • name – table name

  • partitions (list) – list of string, the partition fields

  • partition (string or PartitionSpec) – persist to a specified partition

  • lifecycle (int) – table lifecycle. If absent, options.lifecycle will be used.

  • project – project name, if not provided, will be the default project

  • hints (dict) – settings for SQL, e.g. odps.sql.mapper.split.size

  • priority (int) – instance priority, 9 as default

  • running_cluster – cluster to run this instance

  • overwrite (bool) – overwrite the table, True as default

  • drop_table (bool) – drop table if exists, False as default

  • create_table (bool) – create table first if not exists, True as default

  • drop_partition (bool) – drop partition if exists, False as default

  • create_partition (bool) – create partition if not exists, None as default

  • cast (bool) – cast all columns’ types to those of the existing table, False as default

返回:

odps.df.DataFrame

Example:

>>> df = df['name', 'id', 'ds']
>>> df.persist('odps_new_table')
>>> df.persist('odps_new_table', partition='pt=test')
>>> df.persist('odps_new_table', partitions=['ds'])
qcut(bins, labels=False, sort=None, ascending=True)

Get quantile-based bin indices of every element of a grouped and sorted expression. The indices of bins start from 0. If cuts are not of equal sizes, extra items will be appended into the first group.

参数:
  • expr – expression for calculation

  • bins – number of bins

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

返回:

calculated column

quantile(prob=None, **kw)

Percentile value.

参数:
  • expr

  • prob – probability or list of probabilities, in [0, 1]

返回:

rank(sort=None, ascending=True)

Calculate rank of a sequence expression.

参数:
  • expr – expression for calculation

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

返回:

calculated column

row_number(sort=None, ascending=True)

Calculate row number of a sequence expression.

参数:
  • expr – expression for calculation

  • sort – name of the sort column

  • ascending – whether to sort in ascending order

返回:

calculated column

size()

Value counts

参数:

expr

返回:

skew()

Calculate skewness of the sequence

参数:

expr

返回:

std(**kw)

Standard deviation.

参数:
  • expr

  • kw

返回:

sum()

Sum value

参数:

expr

返回:

tolist(**kwargs)

Pack all data in the sequence into a list.

参数:
  • expr

  • unique – whether to make the elements in the list unique

返回:

var(**kw)

Variance

参数:
  • expr

  • ddof – degrees of freedom

  • kw

返回:

verify()

Verify if this expression can be compiled into ODPS SQL.

返回:

True if compilation succeeds, else False

返回类型:

bool