DataFrame Reference¶
-
class
odps.df.
DataFrame
(*args, **kwargs)[source]¶ Main entry point of the PyODPS DataFrame.
Users can initialize a DataFrame from an
odps.models.Table
or a pandas DataFrame.
Parameters: data (odps.models.Table or pandas DataFrame) – ODPS table or pandas DataFrame
Example: >>> df = DataFrame(o.get_table('my_example_table')) >>> df.dtypes odps.Schema { movie_id int64 title string release_date string video_release_date string imdb_url string user_id int64 rating int64 unix_timestamp int64 age int64 sex string occupation string zip_code string } >>> df.count() 100000 >>> >>> # Do the `groupby`, aggregate the `movie_id` by count, then sort the count in descending order >>> # Finally we get the top 25 results >>> df.groupby('title').agg(count=df.movie_id.count()).sort('count', ascending=False)[:25] >>> >>> # We can use the `value_counts` to reach the same goal >>> df.movie_id.value_counts()[:25]
-
static
batch_persist
(dfs, tables, *args, **kwargs)[source]¶ Persist multiple DataFrames into ODPS.
Parameters: - dfs – DataFrames to persist.
- tables – Table names to persist to. Use (table, partition) tuple to store to a table partition.
- args – args for Expr.persist
- kwargs – kwargs for Expr.persist
Examples: >>> DataFrame.batch_persist([df1, df2], ['table_name1', ('table_name2', 'partition_name2')], lifecycle=1)
-
-
class
odps.df.
CollectionExpr
(*args, **kwargs)[source]¶ Collection represents two-dimensional data.
Example: >>> # projection >>> df = DataFrame(o.get_table('my_table')) # DataFrame is actually a CollectionExpr >>> df['name', 'id'] # project some columns >>> df[[df.name, df.id]] # projection >>> df[df] # means nothing, but get all the columns >>> df[df, df.name.lower().rename('name2')] # project a new column `name2` besides all the original columns >>> df.select(df, name2=df.name.lower()) # projection via `select` >>> df.exclude('name') # project all columns but `name` >>> df[df.exclude('name'), df.name.lower()] # `name` will not conflict any more >>> >>> # filter >>> df[(df.id < 3) & (df.name != 'test')] >>> df.filter(df.id < 3, df.name != 'test') >>> >>> # slice >>> df[: 10] >>> df.limit(10) >>> >>> # Sequence >>> df.name # an instance of :class:`odps.df.expr.expressions.SequenceExpr` >>> >>> # schema or dtypes >>> df.dtypes odps.Schema { name string id int64 } >>> df.schema odps.Schema { name string id int64 }
-
all
()¶ Return whether all elements are True.
Parameters: expr – Returns:
-
any
()¶ Return whether any element is True.
Parameters: expr – Returns:
-
append_id
(id_col='append_id')¶ Append an ID column to the current DataFrame to form a new DataFrame.
Parameters: id_col (str) – name of appended ID field. Returns: DataFrame with ID field Return type: DataFrame
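A minimal usage sketch; the column names record_id and name below are hypothetical:
>>> df_with_id = df.append_id(id_col='record_id')
>>> df_with_id['record_id', 'name']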
-
apply
(func, axis=0, names=None, types=None, reduce=False, resources=None, keep_nulls=False, args=(), **kwargs)¶ Apply a function to each row when axis=1, or to each column when axis=0.
Parameters: - expr –
- func – function to apply
- axis – apply to rows when axis=1, otherwise to columns
- names – output names
- types – output types
- reduce – if True, return a sequence; otherwise return a collection
- resources – resources to read
- keep_nulls – if True, keep rows producing empty results; only works in lateral views
- args – args for function
- kwargs – kwargs for function
Returns: Example: Apply a function to a row:
>>> from odps.df import output >>> >>> @output(['iris_add', 'iris_sub'], ['float', 'float']) >>> def handle(row): >>> yield row.sepallength - row.sepalwidth, row.sepallength + row.sepalwidth >>> yield row.petallength - row.petalwidth, row.petallength + row.petalwidth >>> >>> iris.apply(handle, axis=1).count()
Apply a function to a column:
>>> class Agg(object): >>> >>> def buffer(self): >>> return [0.0, 0] >>> >>> def __call__(self, buffer, val): >>> buffer[0] += val >>> buffer[1] += 1 >>> >>> def merge(self, buffer, pbuffer): >>> buffer[0] += pbuffer[0] >>> buffer[1] += pbuffer[1] >>> >>> def getvalue(self, buffer): >>> if buffer[1] == 0: >>> return 0.0 >>> return buffer[0] / buffer[1] >>> >>> iris.exclude('name').apply(Agg)
-
applymap
(func, rtype=None, resources=None, columns=None, excludes=None, args=(), **kwargs)¶ Call func on each element of this collection.
Parameters: - func – lambda, function,
odps.models.Function
, or str which is the name of odps.models.Function
- rtype – if not provided, will be the dtype of this sequence
- columns – columns to apply this function on
- excludes – columns to skip when applying the function
Returns: a new collection
Example: >>> df.applymap(lambda x: x + 1)
-
ast
()¶ Return the AST string.
Returns: AST tree Return type: str
-
bfill
(subset=None)¶ Fill NA/NaN values with the backward method. Equivalent to fillna(method=’bfill’).
Parameters: - expr (DataFrame) – input DataFrame.
- subset – Labels along other axis to consider.
Returns: DataFrame
-
bloom_filter
(on, column, capacity=3000, error_rate=0.01)¶ Filter the collection on the on sequence using a Bloom filter built from column
Parameters: - collection –
- on – sequence or column name
- column – instance of Column
- capacity (int) – numbers of capacity
- error_rate (float) – error rate
Returns: collection
Example: >>> df1 = DataFrame(pd.DataFrame({'a': ['name1', 'name2', 'name3', 'name1'], 'b': [1, 2, 3, 4]})) >>> df2 = DataFrame(pd.DataFrame({'a': ['name1']})) >>> df1.bloom_filter('a', df2.a) a b 0 name1 1 1 name1 4
-
columns
¶ Return a list in which each element is a Column.
Return type: list
-
compile
()¶ Compile this expression into an ODPS SQL
Returns: compiled DAG Return type: str
-
concat
(rights, distinct=False, axis=0)¶ Concat collections.
Parameters: - left – left collection
- rights – right collections, can be a DataFrame object or a list of DataFrames
- distinct – whether to remove duplicate entries. Only available when axis == 0.
- axis – when axis == 0, the DataFrames are merged vertically, otherwise horizontally.
Returns: collection
Note that axis==1 can only be used under Pandas DataFrames or XFlow.
Example: >>> df['name', 'id'].concat(df2['score'], axis=1)
-
continuous
(*args)¶ Set fields to be continuous.
Return type: DataFrame Example: >>> # Table schema is create table test(f1 double, f2 string) >>> # Original continuity: f1=DISCRETE, f2=DISCRETE >>> # Now we want to set ``f1`` and ``f2`` into continuous >>> new_ds = df.continuous('f1 f2')
-
count
()¶ Value counts
Parameters: expr – Returns:
-
discrete
(*args)¶ Set fields to be discrete.
Return type: DataFrame Example: >>> # Table schema is create table test(f1 double, f2 string) >>> # Original continuity: f1=CONTINUOUS, f2=CONTINUOUS >>> # Now we want to set ``f1`` and ``f2`` to discrete >>> new_ds = df.discrete('f1 f2')
-
distinct
(on=None, *ons)¶ Get collection with duplicate rows removed, optionally only considering certain columns
Parameters: - expr – collection
- on – sequence or sequences
Returns: distinct collection
Example: >>> df.distinct(['name', 'id']) >>> df['name', 'id'].distinct()
-
drop
(data, axis=0, columns=None)¶ Drop data from a DataFrame.
Parameters: - expr – collection to drop data from
- data – data to be removed
- axis – 0 for deleting rows, 1 for columns.
- columns – columns of data to select, only useful when axis == 0
Returns: collection
Example: >>> import pandas as pd >>> df1 = DataFrame(pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})) >>> df2 = DataFrame(pd.DataFrame({'a': [2, 3], 'b': [5, 7]})) >>> df1.drop(df2) a b c 0 1 4 7 1 3 6 9 >>> df1.drop(df2, columns='a') a b c 0 1 4 7 >>> df1.drop(['a'], axis=1) b c 0 4 7 1 5 8 2 6 9 >>> df1.drop(df2, axis=1) c 0 7 1 8 2 9
-
dropna
(how='any', thresh=None, subset=None)¶ Return object with labels on given axis omitted where alternately any or all of the data are missing
Parameters: - expr (DataFrame) – input DataFrame
- how – can be ‘any’ or ‘all’. If ‘any’ is specified, drop the label if any NA values are present. If ‘all’ is specified, drop the label only if all values are NA.
- thresh – require that many non-NA values
- subset – Labels along other axis to consider, e.g. if you are dropping rows these would be a list of columns to include
Returns: DataFrame
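A brief illustrative sketch, assuming df has hypothetical columns sepallength and sepalwidth:
>>> df.dropna(subset=['sepallength', 'sepalwidth'])            # drop rows with any NA in the subset
>>> df.dropna(how='all', subset=['sepallength', 'sepalwidth']) # drop rows only when all subset values are NA
>>> df.dropna(thresh=1, subset=['sepallength', 'sepalwidth'])  # keep rows with at least one non-NA value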
-
erase_key_value
(*args)¶ Erase key-value represented fields.
Return type: DataFrame Example: >>> new_ds = df.erase_key_value('f1 f2')
-
except_
(*rights, **kwargs)¶ Exclude data from a collection, like except clause in SQL. All collections involved should have same schema.
Parameters: - left – collection to drop data from
- rights – collection or list of collections
- distinct – whether to remove duplicate entries in the result
Returns: collection
Examples: >>> import pandas as pd >>> df1 = DataFrame(pd.DataFrame({'a': [1, 2, 3, 3, 3], 'b': [1, 2, 3, 3, 3]})) >>> df2 = DataFrame(pd.DataFrame({'a': [1, 3], 'b': [1, 3]})) >>> df1.setdiff(df2) a b 0 2 2 1 3 3 2 3 3 >>> df1.setdiff(df2, distinct=True) a b 0 2 2
-
exclude
(*fields)[source]¶ Project columns that are not included in the given fields
Parameters: fields – field names Returns: new collection Return type: odps.df.expr.expressions.CollectionExpr
-
execute
(**kwargs)¶ Parameters: - hints (dict) – settings for SQL, e.g. odps.sql.mapper.split.size
- priority (int) – instance priority, 9 as default
- running_cluster – cluster to run this instance
Returns: execution result
Return type: odps.df.backends.frame.ResultFrame
-
extract_kv
(columns=None, kv_delim=':', item_delim=', ', dtype='float', fill_value=None)¶ Extract values in key-value represented columns into standalone columns. New column names will be the name of the key-value column followed by an underscore and the key.
Parameters: - expr (DataFrame) – input DataFrame
- columns – the key-value columns to be extracted.
- kv_delim (str) – delimiter between key and value.
- item_delim (str) – delimiter between key-value pairs.
- dtype (str) – type of value columns to generate.
- fill_value – default value for missing key-value pairs.
Returns: extracted data frame
Return type: Example: >>> df name kv 0 name1 k1=1.0,k2=3.0,k5=10.0 1 name2 k2=3.0,k3=5.1 2 name3 k1=7.1,k7=8.2 3 name4 k2=1.2,k3=1.5 4 name5 k2=1.0,k9=1.1 >>> table = df.extract_kv(columns=['kv'], kv_delim='=') >>> table name kv_k1 kv_k2 kv_k3 kv_k5 kv_k7 kv_k9 0 name1 1.0 3.0 NaN 10.0 NaN NaN 1 name2 NaN 3.0 5.1 NaN NaN NaN 2 name3 7.1 NaN NaN NaN 8.2 NaN 3 name4 NaN 1.2 1.5 NaN NaN NaN 4 name5 NaN 1.0 NaN NaN NaN 1.1
-
ffill
(subset=None)¶ Fill NA/NaN values with the forward method. Equivalent to fillna(method=’ffill’).
Parameters: - expr (DataFrame) – input DataFrame.
- subset – Labels along other axis to consider.
Returns: DataFrame
-
fillna
(value=None, method=None, subset=None)¶ Fill NA/NaN values using the specified method
Parameters: - expr (DataFrame) – input DataFrame
- method – can be ‘backfill’, ‘bfill’, ‘pad’, ‘ffill’ or None
- value – value to fill into
- subset – Labels along other axis to consider.
Returns: DataFrame
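A short sketch of both forms, with hypothetical column names:
>>> df.fillna(0, subset=['age', 'score'])        # fill NAs with a constant value
>>> df.fillna(method='ffill', subset=['score'])  # forward fill, equivalent to df.ffill(subset=['score'])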
-
filter
(*predicates)[source]¶ Filter the data by predicates
Parameters: predicates – the conditions to filter Returns: new collection Return type: odps.df.expr.expressions.CollectionExpr
-
filter_parts
(predicate='', exclude=True)[source]¶ Filter the data by partition string. A partition string looks like pt1=1,pt2=2/pt1=2,pt2=1, where comma (,) denotes ‘and’, while (/) denotes ‘or’.
Parameters: - predicate (str|Partition) – predicate string of partition filter
- exclude (bool) – True if you want to exclude partition fields, otherwise False. True by default.
Returns: new collection
Return type: odps.df.expr.expressions.CollectionExpr
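A hedged example using the partition-string syntax described above; the partition fields pt1 and pt2 are assumptions:
>>> df.filter_parts('pt1=1,pt2=2/pt1=2,pt2=1')  # (pt1=1 and pt2=2) or (pt1=2 and pt2=1)
>>> df.filter_parts('pt1=1', exclude=False)     # keep the partition fields in the result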
-
groupby
(by, *bys)¶ Group collection by a series of sequences.
Parameters: - expr – collection
- by – columns to group
- bys – columns to group
Returns: GroupBy instance
Return type:
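A minimal sketch, reusing the aggregation style from the DataFrame example above; the column names are assumptions:
>>> df.groupby('name').agg(cnt=df.id.count())
>>> df.groupby(['name', 'sex']).agg(avg_age=df.age.mean())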
-
head
(n=None, **kwargs)[source]¶ Return the first n rows. Execute at once.
Parameters: n – Returns: result frame Return type: odps.df.backends.frame.ResultFrame
-
inner_join
(right, on=None, suffixes=('_x', '_y'), mapjoin=False)¶ Inner join two collections.
If on is not specified, we will find the common fields of the left and right collections. suffixes means that if column names conflict, the suffixes will be added automatically. For example, if both left and right have a field named col, there will be col_x and col_y in the joined collection.
Parameters: - left – left collection
- right – right collection
- on – fields to join on
- suffixes – when name conflict, the suffixes will be added to both columns.
Returns: collection
Example: >>> df.dtypes.names ['name', 'id'] >>> df2.dtypes.names ['name', 'id1'] >>> df.inner_join(df2) >>> df.inner_join(df2, on='name') >>> df.inner_join(df2, on=('id', 'id1')) >>> df.inner_join(df2, on=['name', ('id', 'id1')]) >>> df.inner_join(df2, on=[df.name == df2.name, df.id == df2.id1])
-
intersect
(*rights, **kwargs)¶ Calculate the intersection among datasets.
Parameters: - left – collection
- rights – collection or list of collections
- distinct – whether to remove duplicate entries in the result
Returns: collection
Examples: >>> import pandas as pd >>> df1 = DataFrame(pd.DataFrame({'a': [1, 2, 3, 3, 3], 'b': [1, 2, 3, 3, 3]})) >>> df2 = DataFrame(pd.DataFrame({'a': [1, 3, 3], 'b': [1, 3, 3]})) >>> df1.intersect(df2) a b 0 1 1 1 3 3 2 3 3 >>> df1.intersect(df2, distinct=True) a b 0 1 1 1 3 3
-
join
(right, on=None, how='inner', suffixes=('_x', '_y'), mapjoin=False)¶ Join two collections.
If on is not specified, we will find the common fields of the left and right collections. suffixes means that if column names conflict, the suffixes will be added automatically. For example, if both left and right have a field named col, there will be col_x and col_y in the joined collection.
Parameters: - left – left collection
- right – right collection
- on – fields to join on
- how – ‘inner’, ‘left’, ‘right’, or ‘outer’
- suffixes – when name conflict, the suffix will be added to both columns.
- mapjoin – set use mapjoin or not, default value False.
Returns: collection
Example: >>> df.dtypes.names ['name', 'id'] >>> df2.dtypes.names ['name', 'id1'] >>> df.join(df2) >>> df.join(df2, on='name') >>> df.join(df2, on=('id', 'id1')) >>> df.join(df2, on=['name', ('id', 'id1')]) >>> df.join(df2, on=[df.name == df2.name, df.id == df2.id1]) >>> df.join(df2, mapjoin=False)
-
key_value
(*args, **kwargs)¶ Set fields to be key-value represented.
Return type: DataFrame Example: >>> new_ds = df.key_value('f1 f2', kv=':', item=',')
-
kurt
()¶ Calculate kurtosis of the sequence
Parameters: expr – Returns:
-
kurtosis
()¶ Calculate kurtosis of the sequence
Parameters: expr – Returns:
-
label_field
(f)¶ Select one field as the label field.
Note that this field will be excluded from the feature fields.
Parameters: f (str) – Selected label field Return type: DataFrame
-
left_join
(right, on=None, suffixes=('_x', '_y'), mapjoin=False, merge_columns=None)¶ Left join two collections.
If on is not specified, we will find the common fields of the left and right collections. suffixes means that if column names conflict, the suffixes will be added automatically. For example, if both left and right have a field named col, there will be col_x and col_y in the joined collection.
Parameters: - left – left collection
- right – right collection
- on – fields to join on
- suffixes – when name conflict, the suffixes will be added to both columns.
- mapjoin – set use mapjoin or not, default value False.
- merge_columns – whether to merge columns with the same name into one column without suffix. If the value is True, columns in the predicate with same names will be merged, with non-null value. If the value is ‘left’ or ‘right’, the values of predicates on the left / right collection will be taken. You can also pass a dictionary to describe the behavior of each column, such as { ‘a’: ‘auto’, ‘b’: ‘left’ }.
Returns: collection
Example: >>> df.dtypes.names ['name', 'id'] >>> df2.dtypes.names ['name', 'id1'] >>> df.left_join(df2) >>> df.left_join(df2, on='name') >>> df.left_join(df2, on=('id', 'id1')) >>> df.left_join(df2, on=['name', ('id', 'id1')]) >>> df.left_join(df2, on=[df.name == df2.name, df.id == df2.id1])
-
map_reduce
(mapper=None, reducer=None, group=None, sort=None, ascending=True, combiner=None, combiner_buffer_size=1024, mapper_output_names=None, mapper_output_types=None, mapper_resources=None, reducer_output_names=None, reducer_output_types=None, reducer_resources=None)¶ MapReduce API, mapper or reducer should be provided.
Parameters: - expr –
- mapper – mapper function or class
- reducer – reducer function or class
- group – the keys to group after mapper
- sort – the keys to sort after mapper
- ascending – True if ascending else False
- combiner – combiner function or class; the combiner’s output should match the mapper’s output
- combiner_buffer_size – combiner’s buffer size, 1024 as default
- mapper_output_names – mapper’s output names
- mapper_output_types – mapper’s output types
- mapper_resources – the resources for mapper
- reducer_output_names – reducer’s output names
- reducer_output_types – reducer’s output types
- reducer_resources – the resources for reducer
Returns: Example: >>> from odps.df import output >>> >>> @output(['word', 'cnt'], ['string', 'int']) >>> def mapper(row): >>> for word in row[0].split(): >>> yield word.lower(), 1 >>> >>> @output(['word', 'cnt'], ['string', 'int']) >>> def reducer(keys): >>> cnt = [0] >>> def h(row, done): # done illustrates that all the rows of the keys are processed >>> cnt[0] += row.cnt >>> if done: >>> yield keys.word, cnt[0] >>> return h >>> >>> words_df.map_reduce(mapper, reducer, group='word')
-
max
()¶ Max value
Parameters: expr – Returns:
-
mean
()¶ Arithmetic mean.
Parameters: expr – Returns:
-
median
()¶ Median value.
Parameters: expr – Returns:
-
melt
(id_vars=None, value_vars=None, var_name='variable', value_name='value', ignore_nan=False)¶ “Unpivots” a DataFrame from wide format to long format, optionally leaving identifier variables set.
This function is useful to massage a DataFrame into a format where one or more columns are identifier variables (id_vars), while all other columns, considered measured variables (value_vars), are “unpivoted” to the row axis, leaving just two non-identifier columns, ‘variable’ and ‘value’.
Parameters: - expr – collection
- id_vars – column(s) to use as identifier variables.
- value_vars – column(s) to unpivot. If not specified, uses all columns that are not set as id_vars.
- var_name – name to use for the ‘variable’ column. If None it uses frame.columns.name or ‘variable’.
- value_name – name to use for the ‘value’ column.
- ignore_nan – whether to ignore NaN values in data.
Returns: collection
Example: >>> df.melt(id_vars='id', value_vars=['col1', 'col2']) >>> df.melt(id_vars=['id', 'id2'], value_vars=['col1', 'col2'], var_name='variable')
-
min
()¶ Min value
Parameters: expr – Returns:
-
min_max_scale
(columns=None, feature_range=(0, 1), preserve=False, suffix='_scaled', group=None)¶ Resize a data frame by max / min values, i.e., (X - min(X)) / (max(X) - min(X))
Parameters: - expr (DataFrame) – input DataFrame
- feature_range – the target range to resize the value into, i.e., v * (b - a) + a
- preserve (bool) – determine whether input data should be kept. If True, scaled input data will be appended to the data frame with suffix
- columns – columns names to resize. If set to None, float or int-typed columns will be normalized if the column is not specified as a group column.
- group – determine scale groups. Scaling will be done in each group separately.
- suffix (str) – column suffix to be appended to the scaled columns.
Returns: resized data frame
Return type:
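An illustrative call; the column name fid is an assumption:
>>> df.min_max_scale(columns=['fid'])                                        # scale into [0, 1]
>>> df.min_max_scale(columns=['fid'], feature_range=(-1, 1), preserve=True)  # keep the original column and append a scaled one with the suffix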
-
moment
(order, central=False)¶ Calculate the n-th order moment of the sequence
Parameters: - expr –
- order – moment order, must be an integer
- central – if central moments are to be computed.
Returns:
-
nunique
()¶ The distinct count.
Parameters: expr – Returns:
-
outer_join
(right, on=None, suffixes=('_x', '_y'), mapjoin=False, merge_columns=None)¶ Outer join two collections.
If on is not specified, we will find the common fields of the left and right collections. suffixes means that if column names conflict, the suffixes will be added automatically. For example, if both left and right have a field named col, there will be col_x and col_y in the joined collection.
Parameters: - left – left collection
- right – right collection
- on – fields to join on
- suffixes – when name conflict, the suffixes will be added to both columns.
- mapjoin – set use mapjoin or not, default value False.
- merge_columns – whether to merge columns with the same name into one column without suffix. If the value is True, columns in the predicate with same names will be merged, with non-null value. If the value is ‘left’ or ‘right’, the values of predicates on the left / right collection will be taken. You can also pass a dictionary to describe the behavior of each column, such as { ‘a’: ‘auto’, ‘b’: ‘left’ }.
Returns: collection
Example: >>> df.dtypes.names ['name', 'id'] >>> df2.dtypes.names ['name', 'id1'] >>> df.outer_join(df2) >>> df.outer_join(df2, on='name') >>> df.outer_join(df2, on=('id', 'id1')) >>> df.outer_join(df2, on=['name', ('id', 'id1')]) >>> df.outer_join(df2, on=[df.name == df2.name, df.id == df2.id1])
-
persist
(name, partitions=None, partition=None, lifecycle=None, project=None, **kwargs)¶ Persist the execution into a new table. If partitions not specified, will create a new table without partitions if the table does not exist, and insert the SQL result into it. If partitions are specified, they will be the partition fields of the new table. If partition is specified, the data will be inserted into the exact partition of the table.
Parameters: - name – table name
- partitions (list) – list of string, the partition fields
- partition (string or PartitionSpec) – persist to a specified partition
- lifecycle (int) – table lifecycle. If absent, options.lifecycle will be used.
- project – project name, if not provided, will be the default project
- hints (dict) – settings for SQL, e.g. odps.sql.mapper.split.size
- priority (int) – instance priority, 9 as default
- running_cluster – cluster to run this instance
- overwrite (bool) – overwrite the table, True as default
- drop_table (bool) – drop table if exists, False as default
- create_table (bool) – create table first if not exists, True as default
- drop_partition (bool) – drop partition if exists, False as default
- create_partition (bool) – create partition if not exists, None as default
- cast (bool) – cast all columns’ types to those of the existing table, False as default
Returns: Example: >>> df = df['name', 'id', 'ds'] >>> df.persist('odps_new_table') >>> df.persist('odps_new_table', partition='pt=test') >>> df.persist('odps_new_table', partitions=['ds'])
-
pivot
(rows, columns, values=None)¶ Produce ‘pivot’ table based on 3 columns of this DataFrame. Uses unique values from rows / columns and fills with values.
Parameters: - expr – collection
- rows – use to make new collection’s grouped rows
- columns – use to make new collection’s columns
- values – values to use for populating new collection’s values
Returns: collection
Example: >>> df.pivot(rows='id', columns='category') >>> df.pivot(rows='id', columns='category', values='sale') >>> df.pivot(rows=['id', 'id2'], columns='category', values='sale')
-
pivot_table
(values=None, rows=None, columns=None, aggfunc='mean', fill_value=None)¶ Create a spreadsheet-style pivot table as a DataFrame.
Parameters: - expr – collection
- values (optional) – column to aggregate
- rows – rows to group
- columns – keys to group by on the pivot table column
- aggfunc – aggregate function or functions
- fill_value (optional) – value to replace missing values with, default None
Returns: collection
Example: >>> df A B C D 0 foo one small 1 1 foo one large 2 2 foo one large 2 3 foo two small 3 4 foo two small 3 5 bar one large 4 6 bar one small 5 7 bar two small 6 8 bar two large 7 >>> table = df.pivot_table(values='D', rows=['A', 'B'], columns='C', aggfunc='sum') >>> table A B large_D_sum small_D_sum 0 bar one 4.0 5.0 1 bar two 7.0 6.0 2 foo one 4.0 1.0 3 foo two NaN 6.0
-
quantile
(prob=None, **kw)¶ Percentile value.
Parameters: - expr –
- prob – probability or list of probabilities, in [0, 1]
Returns:
-
query
(expr)[source]¶ Query the data with a boolean expression.
Parameters: expr – the query string; you can use the ‘@’ character to refer to variables in the environment. Returns: new collection Return type: odps.df.expr.expressions.CollectionExpr
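A brief sketch; the second call shows the ‘@’ form referring to a variable in the environment (column names are assumptions):
>>> df.query("id < 3 and name == 'test'")
>>> id_limit = 3
>>> df.query('id < @id_limit')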
-
reshuffle
(by=None, sort=None, ascending=True)¶ Reshuffle data.
Parameters: - expr –
- by – the sequence or scalar to shuffle by. RandomScalar as default
- sort – the sequence or scalar to sort.
- ascending – True if ascending else False
Returns: collection
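A short usage sketch with assumed column names:
>>> df.reshuffle()                                     # shuffle randomly
>>> df.reshuffle('name')                               # shuffle by the name column
>>> df.reshuffle(df.name, sort='id', ascending=False)  # shuffle by name, sorting by id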
-
right_join
(right, on=None, suffixes=('_x', '_y'), mapjoin=False, merge_columns=None)¶ Right join two collections.
If on is not specified, we will find the common fields of the left and right collections. suffixes means that if column names conflict, the suffixes will be added automatically. For example, if both left and right have a field named col, there will be col_x and col_y in the joined collection.
Parameters: - left – left collection
- right – right collection
- on – fields to join on
- suffixes – when name conflict, the suffixes will be added to both columns.
- mapjoin – set use mapjoin or not, default value False.
- merge_columns – whether to merge columns with the same name into one column without suffix. If the value is True, columns in the predicate with same names will be merged, with non-null value. If the value is ‘left’ or ‘right’, the values of predicates on the left / right collection will be taken. You can also pass a dictionary to describe the behavior of each column, such as { ‘a’: ‘auto’, ‘b’: ‘left’ }.
Returns: collection
Example: >>> df.dtypes.names ['name', 'id'] >>> df2.dtypes.names ['name', 'id1'] >>> df.right_join(df2) >>> df.right_join(df2, on='name') >>> df.right_join(df2, on=('id', 'id1')) >>> df.right_join(df2, on=['name', ('id', 'id1')]) >>> df.right_join(df2, on=[df.name == df2.name, df.id == df2.id1])
-
roles
(clear_features=True, **field_roles)¶ Set roles of fields
Parameters: - clear_features – Clear feature roles on fields
- field_roles –
Returns:
-
sample
(parts=None, columns=None, i=None, n=None, frac=None, replace=False, weights=None, strata=None, random_state=None)¶ Sample collection.
Parameters: - expr – collection
- parts – how many parts to hash
- columns – the columns to sample
- i – the part to sample out, can be a list of parts, must be from 0 to parts-1
- n – how many rows to sample. If strata is specified, n should be a dict with values in the strata column as dictionary keys and corresponding sample size as values
- frac – fraction of rows to sample. If strata is specified, frac should be a dict with values in the strata column as dictionary keys and corresponding sample fractions as values
- replace – whether to perform replace sampling
- weights – the column name of weights
- strata – the name of strata column
- random_state – the random seed when performing sampling
Returns: collection
Note that n, frac, replace, weights, strata and random_state can only be used under Pandas DataFrames or XFlow.
Example: Sampling with parts:
>>> df.sample(parts=1) >>> df.sample(parts=5, i=0) >>> df.sample(parts=10, columns=['name'])
Sampling with fraction or weights, replacement option can be specified:
>>> df.sample(n=100) >>> df.sample(frac=100) >>> df.sample(frac=100, replace=True)
Sampling with weight column:
>>> df.sample(n=100, weights='weight_col') >>> df.sample(n=100, weights='weight_col', replace=True)
Stratified sampling. Note that currently we do not support stratified sampling with replacement.
>>> df.sample(strata='category', frac={'Iris Setosa': 0.5, 'Iris Versicolour': 0.4})
-
select
(*fields, **kw)[source]¶ Project columns. Remember to avoid column name conflicts.
Parameters: - fields – columns to project
- kw – columns and their names to project
Returns: new collection
Return type: odps.df.expr.expressions.CollectionExpr
-
select_features
(*args, **kwargs)¶ Select one or more fields as feature fields.
Return type: DataFrame
-
setdiff
(*rights, **kwargs)¶ Exclude data from a collection, like except clause in SQL. All collections involved should have same schema.
Parameters: - left – collection to drop data from
- rights – collection or list of collections
- distinct – whether to remove duplicate entries in the result
Returns: collection
Examples: >>> import pandas as pd >>> df1 = DataFrame(pd.DataFrame({'a': [1, 2, 3, 3, 3], 'b': [1, 2, 3, 3, 3]})) >>> df2 = DataFrame(pd.DataFrame({'a': [1, 3], 'b': [1, 3]})) >>> df1.setdiff(df2) a b 0 2 2 1 3 3 2 3 3 >>> df1.setdiff(df2, distinct=True) a b 0 2 2
-
size
()¶ Value counts
Parameters: expr – Returns:
-
skew
()¶ Calculate skewness of the sequence
Parameters: expr – Returns:
-
sort
(by, ascending=True)¶ Sort the collection by values. sort is an alias name for sort_values
Parameters: - expr – collection
- by – the sequence or sequences to sort
- ascending – Sort ascending vs. descending. Specify a list for multiple sort orders. If this is a list of bools, it must match the length of by
Returns: Sorted collection
Example: >>> df.sort_values(['name', 'id']) # 1 >>> df.sort(['name', 'id'], ascending=False) # 2 >>> df.sort(['name', 'id'], ascending=[False, True]) # 3 >>> df.sort([-df.name, df.id]) # 4, equal to #3
-
sort_values
(by, ascending=True)¶ Sort the collection by values. sort is an alias name for sort_values
Parameters: - expr – collection
- by – the sequence or sequences to sort
- ascending – Sort ascending vs. descending. Specify a list for multiple sort orders. If this is a list of bools, it must match the length of by
Returns: Sorted collection
Example: >>> df.sort_values(['name', 'id']) # 1 >>> df.sort(['name', 'id'], ascending=False) # 2 >>> df.sort(['name', 'id'], ascending=[False, True]) # 3 >>> df.sort([-df.name, df.id]) # 4, equal to #3
-
split
(frac, seed=None)¶ Split the current DataFrame into two DataFrame objects with a certain ratio.
Parameters: frac (float) – Split ratio Returns: two split DataFrame objects
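A minimal sketch of a train/test split; the 80/20 ratio and seed value are arbitrary:
>>> train, test = df.split(0.8, seed=1)
>>> train.count()
>>> test.count()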
-
std
(**kw)¶ Standard deviation.
Parameters: - expr –
- kw –
Returns:
-
std_scale
(columns=None, with_means=True, with_std=True, preserve=False, suffix='_scaled', group=None)¶ Resize a data frame by mean and standard deviation.
Parameters: - expr (DataFrame) – Input DataFrame
- with_means (bool) – Determine whether the output will be subtracted by means
- with_std (bool) – Determine whether the output will be divided by standard deviations
- preserve (bool) – Determine whether input data should be kept. If True, scaled input data will be appended to the data frame with suffix
- columns – Columns names to resize. If set to None, float or int-typed columns will be normalized if the column is not specified as a group column.
- group – determine scale groups. Scaling will be done in each group separately.
- suffix (str) – column suffix to be appended to the scaled columns.
Returns: resized data frame
Return type:
-
sum
()¶ Sum value
Parameters: expr – Returns:
-
switch
(*args, **kw)¶ Similar to the case-when in SQL. Refer to the example below
Parameters: - expr –
- args –
- kw –
Returns: sequence or scalar
Example: >>> # if df.id == 3 then df.name >>> # elif df.id == df.fid.abs() then df.name + 'test' >>> # default: 'test' >>> df.id.switch(3, df.name, df.fid.abs(), df.name + 'test', default='test')
-
tail
(n=None, **kwargs)[source]¶ Return the last n rows. Execute at once.
Parameters: n – Returns: result frame Return type: odps.df.backends.frame.ResultFrame
-
to_kv
(columns=None, kv_delim=':', item_delim=', ', kv_name='kv_col')¶ Merge values in specified columns into a key-value represented column.
Parameters: - expr (DataFrame) – input DataFrame
- columns – the columns to be merged.
- kv_delim (str) – delimiter between key and value.
- item_delim (str) – delimiter between key-value pairs.
- kv_col (str) – name of the new key-value column
Returns: converted data frame
Return type: Example: >>> df name k1 k2 k3 k5 k7 k9 0 name1 1.0 3.0 NaN 10.0 NaN NaN 1 name2 NaN 3.0 5.1 NaN NaN NaN 2 name3 7.1 NaN NaN NaN 8.2 NaN 3 name4 NaN 1.2 1.5 NaN NaN NaN 4 name5 NaN 1.0 NaN NaN NaN 1.1 >>> table = df.to_kv(columns=['k1', 'k2', 'k3', 'k5', 'k7', 'k9'], kv_delim='=') >>> table name kv_col 0 name1 k1=1.0,k2=3.0,k5=10.0 1 name2 k2=3.0,k3=5.1 2 name3 k1=7.1,k7=8.2 3 name4 k2=1.2,k3=1.5 4 name5 k2=1.0,k9=1.1
-
to_pandas
(wrap=False, **kwargs)[source]¶ Convert to pandas DataFrame. Execute at once.
Parameters: wrap – if True, wrap the pandas DataFrame into a PyODPS DataFrame Returns: pandas DataFrame
-
tolist
(**kwargs)¶ Pack all data in the sequence into a list.
Parameters: - expr –
- unique – make every element in the sequence unique
Returns:
-
union
(right, distinct=False)¶ Union two collections.
Parameters: - left – left collection
- right – right collection
- distinct –
Returns: collection
Example: >>> df['name', 'id'].union(df2['id', 'name'])
-
var
(**kw)¶ Variance
Parameters: - expr –
- ddof – degree of freedom
- kw –
Returns:
-
verify
()¶ Verify if this expression can be compiled into ODPS SQL.
Returns: True if compilation succeeds else False Return type: bool
-
-
class
odps.df.
SequenceExpr
(*args, **kwargs)[source]¶ Sequence represents one-dimensional data.
-
ast
()¶ Return the AST string.
Returns: AST tree Return type: str
-
astype
(data_type)[source]¶ Cast to a new data type.
Parameters: data_type – the new data type Returns: casted sequence Example: >>> df.id.astype('float')
-
between
(left, right, inclusive=True)¶ Return a boolean sequence or scalar showing whether each element is between left and right.
Parameters: - expr – sequence or scalar
- left – left value
- right – right value
- inclusive – if true, will be left <= expr <= right, else will be left < expr < right
Returns: boolean sequence or scalar
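A brief example; the column id is an assumption:
>>> df[df.id.between(1, 3)]                   # inclusive: 1 <= id <= 3
>>> df[df.id.between(1, 3, inclusive=False)]  # exclusive: 1 < id < 3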
-
compile
()¶ Compile this expression into an ODPS SQL
Returns: compiled DAG Return type: str
-
concat
(rights, distinct=False, axis=0)¶ Concat collections.
Parameters: - left – left collection
- rights – right collections, can be a DataFrame object or a list of DataFrames
- distinct – whether to remove duplicate entries. Only available when axis == 0.
- axis – when axis == 0, the DataFrames are merged vertically, otherwise horizontally.
Returns: collection
Note that axis==1 can only be used under Pandas DataFrames or XFlow.
Example: >>> df['name', 'id'].concat(df2['score'], axis=1)
-
continuous
()¶ Set sequence to be continuous.
Return type: Column Example: >>> # Table schema is create table test(f1 double, f2 string) >>> # Original continuity: f1=DISCRETE, f2=DISCRETE >>> # Now we want to set ``f1`` and ``f2`` into continuous >>> new_ds = df.continuous('f1 f2')
-
count
()¶ Value counts
Parameters: expr – Returns:
-
cut
(bins, right=True, labels=None, include_lowest=False, include_under=False, include_over=False)¶ Return indices of half-open bins to which each value of expr belongs.
Parameters: - expr – sequence or scalar
- bins – list of scalars
- right – indicates whether the bins include the rightmost edge or not. If right == True (the default), then the bins [1, 2, 3, 4] indicate (1, 2], (2, 3], (3, 4]
- labels – Used as labels for the resulting bins. Must be of the same length as the resulting bins.
- include_lowest – Whether the first interval should be left-inclusive or not.
- include_under – include the bin below the leftmost edge or not
- include_over – include the bin above the rightmost edge or not
Returns: sequence or scalar
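An illustrative binning call; the bin edges and labels are arbitrary:
>>> df.id.cut([0, 10, 20, 30], labels=['small', 'medium', 'large'])
>>> df.id.cut([0, 10, 20, 30], include_under=True, include_over=True,
...           labels=['tiny', 'small', 'medium', 'large', 'huge'])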
-
discrete
()¶ Set sequence to be discrete.
Return type: Column Example: >>> # Table schema is create table test(f1 double, f2 string) >>> # Original continuity: f1=CONTINUOUS, f2=CONTINUOUS >>> # Now we want to set ``f1`` and ``f2`` to discrete >>> new_ds = df.discrete('f1 f2')
-
drop
(data, axis=0, columns=None)¶ Drop data from a DataFrame.
Parameters: - expr – collection to drop data from
- data – data to be removed
- axis – 0 for deleting rows, 1 for columns.
- columns – columns of data to select, only useful when axis == 0
Returns: collection
Example: >>> import pandas as pd >>> df1 = DataFrame(pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})) >>> df2 = DataFrame(pd.DataFrame({'a': [2, 3], 'b': [5, 7]})) >>> df1.drop(df2) a b c 0 1 4 7 1 3 6 9 >>> df1.drop(df2, columns='a') a b c 0 1 4 7 >>> df1.drop(['a'], axis=1) b c 0 4 7 1 5 8 2 6 9 >>> df1.drop(df2, axis=1) c 0 7 1 8 2 9
-
dtype
¶ Return the data type. Available types:
int8, int16, int32, int64, float32, float64, boolean, string, decimal, datetime
Returns: the data type
-
erase_key_value
()¶ Erase key-value represented fields.
Return type: Column Example: >>> new_ds = df.erase_key_value('f1 f2')
-
except_
(*rights, **kwargs)¶ Exclude data from a collection, like except clause in SQL. All collections involved should have same schema.
Parameters: - left – collection to drop data from
- rights – collection or list of collections
- distinct – whether to remove duplicate entries in the result
Returns: collection
Examples: >>> import pandas as pd >>> df1 = DataFrame(pd.DataFrame({'a': [1, 2, 3, 3, 3], 'b': [1, 2, 3, 3, 3]})) >>> df2 = DataFrame(pd.DataFrame({'a': [1, 3], 'b': [1, 3]})) >>> df1.setdiff(df2) a b 0 2 2 1 3 3 2 3 3 >>> df1.setdiff(df2, distinct=True) a b 0 2 2
-
execute
(**kwargs)¶ Parameters: - hints (dict) – settings for SQL, e.g. odps.sql.mapper.split.size
- priority (int) – instance priority, 9 as default
- running_cluster – cluster to run this instance
Returns: execution result
Return type: odps.df.backends.frame.ResultFrame
-
fillna
(value)¶ Fill null with value.
Parameters: - expr – sequence or scalar
- value – value to fill into
Returns: sequence or scalar
-
hash
(func=None)¶ Calculate the hash value.
Parameters: - expr –
- func – hash function
Returns:
-
head
(n=None, **kwargs)[source]¶ Return first n rows. Execute at once.
Parameters: n – Returns: result frame Return type: odps.df.expr.expressions.CollectionExpr
-
hll_count
(error_rate=0.01, splitter=None)¶ Calculate HyperLogLog count
Parameters: - expr –
- error_rate (float) – error rate
- splitter – the splitter to split the column value
Returns: sequence or scalar
Example: >>> df = DataFrame(pd.DataFrame({'a': np.random.randint(100000, size=100000)})) >>> df.a.hll_count() 63270 >>> df.a.nunique() 63250
-
intersect
(*rights, **kwargs)¶ Calculate the intersection among datasets.
Parameters: - left – collection
- rights – collection or list of collections
- distinct – whether to remove duplicate entries in the result
Returns: collection
Examples: >>> import pandas as pd >>> df1 = DataFrame(pd.DataFrame({'a': [1, 2, 3, 3, 3], 'b': [1, 2, 3, 3, 3]})) >>> df2 = DataFrame(pd.DataFrame({'a': [1, 3, 3], 'b': [1, 3, 3]})) >>> df1.intersect(df2) a b 0 1 1 1 3 3 2 3 3 >>> df1.intersect(df2, distinct=True) a b 0 1 1 1 3 3
-
isin
(values)¶ Return a boolean sequence or scalar showing whether each element is exactly contained in the passed values.
Parameters: - expr – sequence or scalar
- values – list object or sequence
Returns: boolean sequence or scalar
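A brief example with assumed values and columns:
>>> df[df.name.isin(['name1', 'name2'])]  # membership against a list
>>> df[df.id.isin(df2.id)]                # membership against another sequence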
-
isnull
()¶ Return a sequence or scalar according to the input indicating if the values are null.
Parameters: expr – sequence or scalar Returns: sequence or scalar
-
key_value
(**kwargs)¶ Set fields to be key-value represented.
Return type: Column Example: >>> new_ds = df.key_value('f1 f2', kv=':', item=',')
-
map
(func, rtype=None, resources=None, args=(), **kwargs)¶ Call func on each element of this sequence.
Parameters: - func – lambda, function,
odps.models.Function
, or str which is the name of odps.models.Function
- rtype – if not provided, will be the dtype of this sequence
Returns: a new sequence
Example: >>> df.id.map(lambda x: x + 1)
-
max
()¶ Max value
Parameters: expr – Returns:
-
min
()¶ Min value
Parameters: expr – Returns:
-
notin
(values)¶ Return a boolean sequence or scalar showing whether each element is not contained in the passed values.
Parameters: - expr – sequence or scalar
- values – list object or sequence
Returns: boolean sequence or scalar
-
notnull
()¶ Return a sequence or scalar according to the input indicating if the values are not null.
Parameters: expr – sequence or scalar Returns: sequence or scalar
-
nunique
()¶ The distinct count.
Parameters: expr – Returns:
-
persist
(name, partitions=None, partition=None, lifecycle=None, project=None, **kwargs)¶ Persist the execution into a new table. If partitions not specified, will create a new table without partitions if the table does not exist, and insert the SQL result into it. If partitions are specified, they will be the partition fields of the new table. If partition is specified, the data will be inserted into the exact partition of the table.
Parameters: - name – table name
- partitions (list) – list of string, the partition fields
- partition (string or PartitionSpec) – persist to a specified partition
- lifecycle (int) – table lifecycle. If absent, options.lifecycle will be used.
- project – project name, if not provided, will be the default project
- hints (dict) – settings for SQL, e.g. odps.sql.mapper.split.size
- priority (int) – instance priority, 9 as default
- running_cluster – cluster to run this instance
- overwrite (bool) – overwrite the table, True as default
- drop_table (bool) – drop table if exists, False as default
- create_table (bool) – create table first if not exists, True as default
- drop_partition (bool) – drop partition if exists, False as default
- create_partition (bool) – create partition if not exists, None as default
- cast (bool) – cast all columns’ types to those of the existing table, False as default
Returns: Example: >>> df = df['name', 'id', 'ds'] >>> df.persist('odps_new_table') >>> df.persist('odps_new_table', partition='pt=test') >>> df.persist('odps_new_table', partitions=['ds'])
-
role
(role_name)¶ Set role of current column
Parameters: role_name – name of the role to be selected. Returns:
-
setdiff
(*rights, **kwargs)¶ Exclude data from a collection, like except clause in SQL. All collections involved should have same schema.
Parameters: - left – collection to drop data from
- rights – collection or list of collections
- distinct – whether to remove duplicate entries in the result
Returns: collection
Examples: >>> import pandas as pd >>> df1 = DataFrame(pd.DataFrame({'a': [1, 2, 3, 3, 3], 'b': [1, 2, 3, 3, 3]})) >>> df2 = DataFrame(pd.DataFrame({'a': [1, 3], 'b': [1, 3]})) >>> df1.setdiff(df2) a b 0 2 2 1 3 3 2 3 3 >>> df1.setdiff(df2, distinct=True) a b 0 2 2
-
size
()¶ Value counts
Parameters: expr – Returns:
-
switch
(*args, **kw)¶ Similar to the case-when in SQL. Refer to the example below
Parameters: - expr –
- args –
- kw –
Returns: sequence or scalar
Example: >>> # if df.id == 3 then df.name >>> # elif df.id == df.fid.abs() then df.name + 'test' >>> # default: 'test' >>> df.id.switch(3, df.name, df.fid.abs(), df.name + 'test', default='test')
-
to_pandas
(wrap=False, **kwargs)[source]¶ Convert to pandas Series. Execute at once.
Parameters: wrap – if True, wrap the pandas DataFrame into a PyODPS DataFrame Returns: pandas Series
-
tolist
(**kwargs)¶ Pack all data in the sequence into a list.
Parameters: - expr –
- unique – make every element in the sequence unique
Returns:
-
union
(right, distinct=False)¶ Union two collections.
Parameters: - left – left collection
- right – right collection
- distinct –
Returns: collection
Example: >>> df['name', 'id'].union(df2['id', 'name'])
-
value_counts
(sort=True, ascending=False, dropna=False)¶ Return object containing counts of unique values.
The resulting object will be in descending order so that the first element is the most frequently occurring element. Exclude NA values by default
Parameters: - expr – sequence
- sort (bool) – if sort
- dropna – Don’t include counts of None, default False
Returns: collection with two columns
Return type: odps.df.expr.expressions.CollectionExpr
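A short sketch; the name column is an assumption:
>>> df.name.value_counts()[:10]           # ten most frequent values with their counts
>>> df.name.value_counts(ascending=True)  # least frequent first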
-
verify
()¶ Verify if this expression can be compiled into ODPS SQL.
Returns: True if compilation succeeds else False Return type: bool
-
-
class
odps.df.expr.expressions.
Int64SequenceExpr
(*args, **kwargs)[source]¶ -
kurt
()¶ Calculate kurtosis of the sequence
Parameters: expr – Returns:
-
kurtosis
()¶ Calculate kurtosis of the sequence
Parameters: expr – Returns:
-
mean
()¶ Arithmetic mean.
Parameters: expr – Returns:
-
median
()¶ Median value.
Parameters: expr – Returns:
-
moment
(order, central=False)¶ Calculate the n-th order moment of the sequence
Parameters: - expr –
- order – moment order, must be an integer
- central – if central moments are to be computed.
Returns:
-
quantile
(prob=None, **kw)¶ Percentile value.
Parameters: - expr –
- prob – probability or list of probabilities, in [0, 1]
Returns:
-
skew
()¶ Calculate skewness of the sequence
Parameters: expr – Returns:
-
std
(**kw)¶ Standard deviation.
Parameters: - expr –
- kw –
Returns:
-
sum
()¶ Sum value
Parameters: expr – Returns:
-
to_datetime
()¶ Return a sequence or scalar that is the datetime value of the current numeric sequence or scalar.
Parameters: expr – sequence or scalar Returns: sequence or scalar
-
var
(**kw)¶ Variance
Parameters: - expr –
- ddof – degree of freedom
- kw –
Returns:
-
-
class
odps.df.expr.expressions.
StringSequenceExpr
(*args, **kwargs)[source]¶ -
capitalize
()¶ Convert strings in the Sequence or string scalar to be capitalized. Equivalent to str.capitalize().
Parameters: expr – Returns: sequence or scalar
-
cat
(others=None, sep=None, na_rep=None)¶ Concatenate strings in sequence with given separator
Parameters: - expr –
- others – other sequences
- sep – string or None, default None
- na_rep – string or None, default None; if None, NA values in the sequence are ignored
Returns:
-
contains
(pat, case=True, flags=0, regex=True)¶ Return boolean sequence whether given pattern/regex is contained in each string in the sequence
Parameters: - expr – sequence or scalar
- pat – Character sequence or regular expression
- case (bool) – If True, case sensitive
- flags – re module flags, e.g. re.IGNORECASE
- regex – If True use regex, otherwise use string finder
Returns: sequence or scalar
-
endswith
(pat)¶ Return boolean sequence or scalar indicating whether each string in the sequence or scalar ends with passed pattern. Equivalent to str.endswith().
Parameters: - expr –
- pat – Character sequence
Returns: sequence or scalar
-
extract
(pat, flags=0, group=0)¶ Find group in each string in the Series using passed regular expression.
Parameters: - expr –
- pat – Pattern or regular expression
- flags – re module, e.g. re.IGNORECASE
- group – if None, use group 0
Returns: sequence or scalar
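An illustrative call with a hypothetical pattern; group=1 selects the first capture group:
>>> df.name.extract(r'name(\d+)', group=1)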
-
find
(sub, start=0, end=None)¶ Return lowest indexes in each strings in the sequence or scalar where the substring is fully contained between [start:end]. Return -1 on failure. Equivalent to standard str.find().
Parameters: - expr –
- sub – substring being searched
- start – left edge index
- end – right edge index
Returns: sequence or scalar
-
get
(index)¶ Extract element from lists, tuples, or strings in each element in the sequence or scalar
Parameters: - expr –
- index – Integer index(location)
Returns: sequence or scalar
-
isalnum
()¶ Check whether all characters in each string in the sequence or scalar are alphanumeric. Equivalent to str.isalnum().
Parameters: expr – Returns: boolean sequence or scalar
-
isalpha
()¶ Check whether all characters in each string in the sequence or scalar are alphabetic. Equivalent to str.isalpha().
Parameters: expr – Returns: boolean sequence or scalar
-
isdecimal
()¶ Check whether all characters in each string in the sequence or scalar are decimal. Equivalent to str.isdecimal().
Parameters: expr – Returns: boolean sequence or scalar
-
isdigit
()¶ Check whether all characters in each string in the sequence or scalar are digits. Equivalent to str.isdigit().
Parameters: expr – Returns: boolean sequence or scalar
-
islower
()¶ Check whether all characters in each string in the sequence or scalar are lowercase. Equivalent to str.islower().
Parameters: expr – Returns: boolean sequence or scalar
-
isnumeric
()¶ Check whether all characters in each string in the sequence or scalar are numeric. Equivalent to str.isnumeric().
Parameters: expr – Returns: boolean sequence or scalar
-
isspace
()¶ Check whether all characters in each string in the sequence or scalar are whitespace. Equivalent to str.isspace().
Parameters: expr – Returns: boolean sequence or scalar
-
istitle
()¶ Check whether all characters in each string in the sequence or scalar are titlecase. Equivalent to str.istitle().
Parameters: expr – Returns: boolean sequence or scalar
-
isupper
()¶ Check whether all characters in each string in the sequence or scalar are uppercase. Equivalent to str.isupper().
Parameters: expr – Returns: boolean sequence or scalar
-
len
()¶ Compute length of each string in the sequence or scalar
Parameters: expr – Returns: lengths
-
ljust
(width, fillchar=' ')¶ Filling right side of strings in the sequence or scalar with an additional character. Equivalent to str.ljust().
Parameters: - expr –
- width – Minimum width of resulting string; additional characters will be filled with fillchar
- fillchar – Additional character for filling, default is whitespace.
Returns: sequence or scalar
-
lower
()¶ Convert strings in the sequence or scalar to lowercase. Equivalent to str.lower().
Parameters: expr – Returns: sequence or scalar
-
lstrip
(to_strip=None)¶ Strip whitespace (including newlines) from each string in the sequence or scalar from left side. Equivalent to str.lstrip().
Parameters: - expr –
- to_strip –
Returns: sequence or scalar
-
pad
(width, side='left', fillchar=' ')¶ Pad strings in the sequence or scalar with an additional character to specified side.
Parameters: - expr –
- width – Minimum width of resulting string; additional characters will be filled with spaces
- side – {‘left’, ‘right’, ‘both’}, default ‘left’
- fillchar – Additional character for filling, default is whitespace
Returns: sequence or scalar
-
repeat
(repeats)¶ Duplicate each string in the sequence or scalar by indicated number of times.
Parameters: - expr –
- repeats – times
Returns: sequence or scalar
-
replace
(pat, repl, n=-1, case=True, flags=0, regex=True)¶ Replace occurrence of pattern/regex in the sequence or scalar with some other string. Equivalent to str.replace()
Parameters: - expr –
- pat – Character sequence or regular expression
- repl – Replacement
- n – Number of replacements to make from start
- case – if True, case sensitive
- flags – re module flag, e.g. re.IGNORECASE
Returns: sequence or scalar
-
rfind
(sub, start=0, end=None)¶ Return highest indexes in each strings in the sequence or scalar where the substring is fully contained between [start:end]. Return -1 on failure. Equivalent to standard str.rfind().
Parameters: - expr –
- sub –
- start –
- end –
Returns: sequence or scalar
-
rjust
(width, fillchar=' ')¶ Filling left side of strings in the sequence or scalar with an additional character. Equivalent to str.rjust().
Parameters: - expr –
- width – Minimum width of resulting string; additional characters will be filled with fillchar
- fillchar – Additional character for filling, default is whitespace.
Returns: sequence or scalar
-
rstrip
(to_strip=None)¶ Strip whitespace (including newlines) from each string in the sequence or scalar from right side. Equivalent to str.rstrip().
Parameters: - expr –
- to_strip –
Returns: sequence or scalar
-
slice
(start=None, stop=None, step=None)¶ Slice substrings from each element in the sequence or scalar
Parameters: - expr –
- start – int or None
- stop – int or None
- step – int or None
Returns: sliced
-
split
(pat=None, n=-1)¶ Split each string (a la re.split) in the Series/Index by given pattern, propagating NA values. Equivalent to str.split().
Parameters: - expr –
- pat – Separator to split on. If None, splits on whitespace
- n – not supported right now
Returns: list sequence or scalar
-
startswith
(pat)¶ Return boolean sequence or scalar indicating whether each string in the sequence or scalar starts with passed pattern. Equivalent to str.startswith().
Parameters: - expr –
- pat – Character sequence
Returns: sequence or scalar
-
strip
(to_strip=None)¶ Strip whitespace (including newlines) from each string in the sequence or scalar from left and right sides. Equivalent to str.strip().
Parameters: - expr –
- to_strip –
Returns: sequence or scalar
-
strptime
(date_format)¶ Return datetimes specified by date_format, which supports the same string format as the python standard library. Details of the string format can be found in python string format doc
Parameters: - expr –
- date_format (str) – date format string (e.g. “%Y-%m-%d”)
Returns:
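A brief sketch assuming a hypothetical string column birth holding dates such as '2015-08-01':
>>> df.birth.strptime('%Y-%m-%d')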
-
sum
()¶ Sum value
Parameters: expr – Returns:
-
swapcase
()¶ Convert strings in the sequence or scalar to be swapcased. Equivalent to str.swapcase().
Parameters: expr – Returns: converted
-
title
()¶ Convert strings in the sequence or scalar to titlecase. Equivalent to str.title().
Parameters: expr – Returns: converted
-
todict
(item_delim=', ', kv_delim='=')¶ Convert the string sequence / expr into a string dict given item and key-value delimiters.
Parameters: - expr –
- item_delim – delimiter between data items
- kv_delim – delimiter between keys and values
Returns: dict sequence or scalar
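A short example assuming a hypothetical string column kv holding values such as 'k1=1,k2=3':
>>> df.kv.todict(item_delim=',', kv_delim='=')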
-
upper
()¶ Convert strings in the sequence or scalar to uppercase. Equivalent to str.upper().
Parameters: expr – Returns: sequence or scalar
-
zfill
(width)¶ Filling left side of strings in the sequence or scalar with 0. Equivalent to str.zfill().
Parameters: - expr –
- width – Minimum width of resulting string; additional characters will be filled with 0
Returns: filled
-
-
class
odps.df.
Scalar
(*args, **kwargs)[source]¶ Represents a scalar value.
Parameters: - _value – value of the scalar
- _value_type – value type of the scalar
Example: >>> df[df, Scalar(4).rename('append_const')]
-
ast
()¶ Return the AST string.
Returns: AST tree Return type: str
-
between
(left, right, inclusive=True)¶ Return a boolean sequence or scalar showing whether each element is between left and right.
Parameters: - expr – sequence or scalar
- left – left value
- right – right value
- inclusive – if true, will be left <= expr <= right, else will be left < expr < right
Returns: boolean sequence or scalar
-
compile
()¶ Compile this expression into an ODPS SQL
Returns: compiled DAG Return type: str
-
cut
(bins, right=True, labels=None, include_lowest=False, include_under=False, include_over=False)¶ Return indices of half-open bins to which each value of expr belongs.
Parameters: - expr – sequence or scalar
- bins – list of scalars
- right – indicates whether the bins include the rightmost edge or not. If right == True (the default), then the bins [1, 2, 3, 4] indicate (1, 2], (2, 3], (3, 4]
- labels – Used as labels for the resulting bins. Must be of the same length as the resulting bins.
- include_lowest – Whether the first interval should be left-inclusive or not.
- include_under – include the bin below the leftmost edge or not
- include_over – include the bin above the rightmost edge or not
Returns: sequence or scalar
-
execute
(**kwargs)¶ Parameters: - hints (dict) – settings for SQL, e.g. odps.sql.mapper.split.size
- priority (int) – instance priority, 9 as default
- running_cluster – cluster to run this instance
Returns: execution result
Return type: odps.df.backends.frame.ResultFrame
-
fillna
(value)¶ Fill null with value.
Parameters: - expr – sequence or scalar
- value – value to fill into
Returns: sequence or scalar
-
hash
(func=None)¶ Calculate the hash value.
Parameters: - expr –
- func – hash function
Returns:
-
isin
(values)¶ Return a boolean sequence or scalar showing whether each element is exactly contained in the passed values.
Parameters: - expr – sequence or scalar
- values – list object or sequence
Returns: boolean sequence or scalar
-
isnull
()¶ Return a sequence or scalar according to the input indicating if the values are null.
Parameters: expr – sequence or scalar Returns: sequence or scalar
-
map
(func, rtype=None, resources=None, args=(), **kwargs)¶ Call func on each element of this sequence.
Parameters: - func – lambda, function,
odps.models.Function
, or str which is the name ofodps.models.Funtion
- rtype – if not provided, will be the dtype of this sequence
Returns: a new sequence
Example: >>> df.id.map(lambda x: x + 1)
- func – lambda, function,
-
notin
(values)¶ Return a boolean sequence or scalar showing whether each element is not contained in the passed values.
Parameters: - expr – sequence or scalar
- values – list object or sequence
Returns: boolean sequence or scalar
-
notnull
()¶ Return a sequence or scalar according to the input indicating if the values are not null.
Parameters: expr – sequence or scalar Returns: sequence or scalar
-
persist
(name, partitions=None, partition=None, lifecycle=None, project=None, **kwargs)¶ Persist the execution into a new table. If partitions not specified, will create a new table without partitions if the table does not exist, and insert the SQL result into it. If partitions are specified, they will be the partition fields of the new table. If partition is specified, the data will be inserted into the exact partition of the table.
Parameters: - name – table name
- partitions (list) – list of string, the partition fields
- partition (string or PartitionSpec) – persist to a specified partition
- lifecycle (int) – table lifecycle. If absent, options.lifecycle will be used.
- project – project name, if not provided, will be the default project
- hints (dict) – settings for SQL, e.g. odps.sql.mapper.split.size
- priority (int) – instance priority, 9 as default
- running_cluster – cluster to run this instance
- overwrite (bool) – overwrite the table, True as default
- drop_table (bool) – drop table if exists, False as default
- create_table (bool) – create table first if not exists, True as default
- drop_partition (bool) – drop partition if exists, False as default
- create_partition (bool) – create partition if not exists, None as default
- cast (bool) – cast all columns' types to those of the existing table, False as default
Returns: Example: >>> df = df['name', 'id', 'ds'] >>> df.persist('odps_new_table') >>> df.persist('odps_new_table', partition='pt=test') >>> df.persist('odps_new_table', partitions=['ds'])
-
switch
(*args, **kw)¶ Similar to the case-when in SQL. Refer to the example below
Parameters: - expr –
- args –
- kw –
Returns: sequence or scalar
Example: >>> # if df.id == 3 then df.name >>> # elif df.id == df.fid.abs() then df.name + 'test' >>> # default: 'test' >>> df.id.switch(3, df.name, df.fid.abs(), df.name + 'test', default='test')
-
verify
()¶ Verify if this expression can be compiled into ODPS SQL.
Returns: True if compilation succeeds else False Return type: bool
-
odps.df.
NullScalar
(tp)[source]¶ Creates a Scalar representing typed None values.
Parameters: tp – type of the scalar Returns: Scalar with None value
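Example (a minimal sketch mirroring the Scalar usage above; the column name is hypothetical):
>>> df[df, NullScalar('float').rename('placeholder')]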
-
class
odps.df.
RandomScalar
(seed=None, **kw)[source]¶ Represents the random scalar type.
Parameters: seed – random seed, None by default Example: >>> df[df, RandomScalar().rename('append_random')]
-
ast
()¶ Return the AST string.
Returns: AST tree Return type: str
-
between
(left, right, inclusive=True)¶ Return a boolean sequence or scalar showing whether each element is between left and right.
Parameters: - expr – sequence or scalar
- left – left value
- right – right value
- inclusive – if True, the condition is left <= expr <= right; otherwise left < expr < right
Returns: boolean sequence or scalar
-
compile
()¶ Compile this expression into an ODPS SQL statement.
Returns: compiled DAG Return type: str
-
cut
(bins, right=True, labels=None, include_lowest=False, include_under=False, include_over=False)¶ Return indices of half-open bins to which each value of expr belongs.
Parameters: - expr – sequence or scalar
- bins – list of scalars
- right – indicates whether the bins include the rightmost edge or not. If right == True (the default), then the bins [1, 2, 3, 4] indicate (1, 2], (2, 3], (3, 4]
- labels – used as labels for the resulting bins. Must be of the same length as the resulting bins.
- include_lowest – Whether the first interval should be left-inclusive or not.
- include_under – include the bin below the leftmost edge or not
- include_over – include the bin above the rightmost edge or not
Returns: sequence or scalar
-
execute
(**kwargs)¶ Parameters: - hints (dict) – settings for SQL, e.g. odps.sql.mapper.split.size
- priority (int) – instance priority, 9 as default
- running_cluster – cluster to run this instance
Returns: execution result
Return type: odps.df.backends.frame.ResultFrame
-
fillna
(value)¶ Fill null with value.
Parameters: - expr – sequence or scalar
- value – value to fill into
Returns: sequence or scalar
-
hash
(func=None)¶ Calculate the hash value.
Parameters: - expr –
- func – hash function
Returns:
-
isin
(values)¶ Return a boolean sequence or scalar showing whether each element is exactly contained in the passed values.
Parameters: - expr – sequence or scalar
- values – list object or sequence
Returns: boolean sequence or scalar
-
isnull
()¶ Return a boolean sequence or scalar indicating whether each value of the input is null.
Parameters: expr – sequence or scalar Returns: sequence or scalar
-
map
(func, rtype=None, resources=None, args=(), **kwargs)¶ Call func on each element of this sequence.
Parameters: - func – lambda, function, odps.models.Function, or str which is the name of an odps.models.Function
- rtype – if not provided, will be the dtype of this sequence
Returns: a new sequence
Example: >>> df.id.map(lambda x: x + 1)
-
notin
(values)¶ Return a boolean sequence or scalar showing whether each element is not contained in the passed values.
Parameters: - expr – sequence or scalar
- values – list object or sequence
Returns: boolean sequence or scalar
-
notnull
()¶ Return a boolean sequence or scalar indicating whether each value of the input is not null.
Parameters: expr – sequence or scalar Returns: sequence or scalar
-
persist
(name, partitions=None, partition=None, lifecycle=None, project=None, **kwargs)¶ Persist the execution result into a table. If partitions is not specified, a new table without partitions will be created if it does not exist, and the SQL result will be inserted into it. If partitions is specified, those fields will be the partition fields of the new table. If partition is specified, the data will be inserted into that exact partition of the table.
Parameters: - name – table name
- partitions (list) – list of string, the partition fields
- partition (string or PartitionSpec) – persist to a specified partition
- lifecycle (int) – table lifecycle. If absent, options.lifecycle will be used.
- project – project name, if not provided, will be the default project
- hints (dict) – settings for SQL, e.g. odps.sql.mapper.split.size
- priority (int) – instance priority, 9 as default
- running_cluster – cluster to run this instance
- overwrite (bool) – overwrite the table, True as default
- drop_table (bool) – drop table if exists, False as default
- create_table (bool) – create table first if not exists, True as default
- drop_partition (bool) – drop partition if exists, False as default
- create_partition (bool) – create partition if not exists, None as default
- cast (bool) – cast all columns' types to those of the existing table, False as default
Returns: Example: >>> df = df['name', 'id', 'ds'] >>> df.persist('odps_new_table') >>> df.persist('odps_new_table', partition='pt=test') >>> df.persist('odps_new_table', partitions=['ds'])
-
switch
(*args, **kw)¶ Similar to the case-when in SQL. Refer to the example below
Parameters: - expr –
- args –
- kw –
Returns: sequence or scalar
Example: >>> # if df.id == 3 then df.name >>> # elif df.id == df.fid.abs() then df.name + 'test' >>> # default: 'test' >>> df.id.switch(3, df.name, df.fid.abs(), df.name + 'test', default='test')
-
verify
()¶ Verify if this expression can be compiled into ODPS SQL.
Returns: True if compilation succeeds else False Return type: bool
-
-
class
odps.df.expr.groupby.
GroupBy
(*args, **kwargs)[source]¶ -
all
()¶ Whether all values are True.
Parameters: expr – Returns:
-
any
()¶ Whether any value is True.
Parameters: expr – Returns:
-
ast
()¶ Return the AST string.
Returns: AST tree Return type: str
-
compile
()¶ Compile this expression into an ODPS SQL statement.
Returns: compiled DAG Return type: str
-
count
()¶ Number of values.
Parameters: expr – Returns:
-
cume_dist
(sort=None, ascending=True)¶ Calculate cumulative ratio of a sequence expression.
Parameters: - expr – expression for calculation
- sort – name of the sort column
- ascending – whether to sort in ascending order
Returns: calculated column
-
dense_rank
(sort=None, ascending=True)¶ Calculate dense rank of a sequence expression.
Parameters: - expr – expression for calculation
- sort – name of the sort column
- ascending – whether to sort in ascending order
Returns: calculated column
-
execute
(**kwargs)¶ Parameters: - hints (dict) – settings for SQL, e.g. odps.sql.mapper.split.size
- priority (int) – instance priority, 9 as default
- running_cluster – cluster to run this instance
Returns: execution result
Return type: odps.df.backends.frame.ResultFrame
-
kurt
()¶ Calculate kurtosis of the sequence
Parameters: expr – Returns:
-
kurtosis
()¶ Calculate kurtosis of the sequence
Parameters: expr – Returns:
-
max
()¶ Max value
Parameters: expr – Returns:
-
mean
()¶ Arithmetic mean.
Parameters: expr – Returns:
-
median
()¶ Median value.
Parameters: expr – Returns:
-
min
()¶ Min value
Parameters: expr – Returns:
-
min_rank
(sort=None, ascending=True)¶ Calculate rank of a sequence expression.
Parameters: - expr – expression for calculation
- sort – name of the sort column
- ascending – whether to sort in ascending order
Returns: calculated column
-
moment
(order, central=False)¶ Calculate the n-th order moment of the sequence
Parameters: - expr –
- order – moment order, must be an integer
- central – whether central moments are to be computed.
Returns:
-
nth_value
(nth, skip_nulls=False, sort=None, ascending=True)¶ Get nth value of a grouped and sorted expression.
Parameters: - expr – expression for calculation
- nth – integer position
- skip_nulls – whether to skip null values, False by default
- sort – name of the sort column
- ascending – whether to sort in ascending order
Returns: calculated column
-
nunique
()¶ The distinct count.
Parameters: expr – Returns:
-
percent_rank
(sort=None, ascending=True)¶ Calculate percentage rank of a sequence expression.
Parameters: - expr – expression for calculation
- sort – name of the sort column
- ascending – whether to sort in ascending order
Returns: calculated column
-
persist
(name, partitions=None, partition=None, lifecycle=None, project=None, **kwargs)¶ Persist the execution result into a table. If partitions is not specified, a new table without partitions will be created if it does not exist, and the SQL result will be inserted into it. If partitions is specified, those fields will be the partition fields of the new table. If partition is specified, the data will be inserted into that exact partition of the table.
Parameters: - name – table name
- partitions (list) – list of string, the partition fields
- partition (string or PartitionSpec) – persist to a specified partition
- lifecycle (int) – table lifecycle. If absent, options.lifecycle will be used.
- project – project name, if not provided, will be the default project
- hints (dict) – settings for SQL, e.g. odps.sql.mapper.split.size
- priority (int) – instance priority, 9 as default
- running_cluster – cluster to run this instance
- overwrite (bool) – overwrite the table, True as default
- drop_table (bool) – drop table if exists, False as default
- create_table (bool) – create table first if not exists, True as default
- drop_partition (bool) – drop partition if exists, False as default
- create_partition (bool) – create partition if not exists, None as default
- cast (bool) – cast all columns' types to those of the existing table, False as default
Returns: Example: >>> df = df['name', 'id', 'ds'] >>> df.persist('odps_new_table') >>> df.persist('odps_new_table', partition='pt=test') >>> df.persist('odps_new_table', partitions=['ds'])
-
qcut
(bins, labels=False, sort=None, ascending=True)¶ Get quantile-based bin indices of every element of a grouped and sorted expression. The indices of bins start from 0. If cuts are not of equal sizes, extra items will be appended into the first group.
Parameters: - expr – expression for calculation
- bins – number of bins
- sort – name of the sort column
- ascending – whether to sort in ascending order
Returns: calculated column
-
quantile
(prob=None, **kw)¶ Percentile value.
Parameters: - expr –
- prob – probability or list of probabilities, in [0, 1]
Returns:
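Example (a minimal sketch using the agg pattern; the dept and score columns are hypothetical):
>>> df.groupby('dept').agg(p50=df.score.quantile(0.5), p90=df.score.quantile(0.9))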
-
rank
(sort=None, ascending=True)¶ Calculate rank of a sequence expression.
Parameters: - expr – expression for calculation
- sort – name of the sort column
- ascending – whether to sort in ascending order
Returns: calculated column
-
row_number
(sort=None, ascending=True)¶ Calculate row number of a sequence expression.
Parameters: - expr – expression for calculation
- sort – name of the sort column
- ascending – whether to sort in ascending order
Returns: calculated column
-
size
()¶ Number of values.
Parameters: expr – Returns:
-
skew
()¶ Calculate skewness of the sequence
Parameters: expr – Returns:
-
std
(**kw)¶ Standard deviation.
Parameters: - expr –
- kw –
Returns:
-
sum
()¶ Sum value
Parameters: expr – Returns:
-
tolist
(**kwargs)¶ Pack all data in the sequence into a list.
Parameters: - expr –
- unique – whether to make the elements in the list unique
Returns:
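Example (a minimal sketch using the agg pattern; column names are hypothetical):
>>> df.groupby('dept').agg(ids=df.id.tolist(unique=True))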
-
var
(**kw)¶ Variance
Parameters: - expr –
- ddof – degrees of freedom
- kw –
Returns:
-
verify
()¶ Verify if this expression can be compiled into ODPS SQL.
Returns: True if compilation succeeds else False Return type: bool
-
-
class
odps.df.expr.groupby.
SequenceGroupBy
(*args, **kwargs)[source]¶ -
ast
()¶ Return the AST string.
Returns: AST tree Return type: str
-
compile
()¶ Compile this expression into an ODPS SQL statement.
Returns: compiled DAG Return type: str
-
count
()¶ Number of values.
Parameters: expr – Returns:
-
cumcount
(sort=None, ascending=True, unique=False, preceding=None, following=None)¶ Calculate cumulative count of a sequence expression.
Parameters: - expr – expression for calculation
- sort – name of the sort column
- ascending – whether to sort in ascending order
- unique – whether to eliminate duplicate entries
- preceding – the start point of a window
- following – the end point of a window
Returns: calculated column
-
cume_dist
(sort=None, ascending=True)¶ Calculate cumulative ratio of a sequence expression.
Parameters: - expr – expression for calculation
- sort – name of the sort column
- ascending – whether to sort in ascending order
Returns: calculated column
-
cummax
(sort=None, ascending=True, unique=False, preceding=None, following=None)¶ Calculate cumulative maximum of a sequence expression.
Parameters: - expr – expression for calculation
- sort – name of the sort column
- ascending – whether to sort in ascending order
- unique – whether to eliminate duplicate entries
- preceding – the start point of a window
- following – the end point of a window
Returns: calculated column
-
cummin
(sort=None, ascending=True, unique=False, preceding=None, following=None)¶ Calculate cumulative minimum of a sequence expression.
Parameters: - expr – expression for calculation
- sort – name of the sort column
- ascending – whether to sort in ascending order
- unique – whether to eliminate duplicate entries
- preceding – the start point of a window
- following – the end point of a window
Returns: calculated column
-
dense_rank
(sort=None, ascending=True)¶ Calculate dense rank of a sequence expression.
Parameters: - expr – expression for calculation
- sort – name of the sort column
- ascending – whether to sort in ascending order
Returns: calculated column
-
execute
(**kwargs)¶ Parameters: - hints (dict) – settings for SQL, e.g. odps.sql.mapper.split.size
- priority (int) – instance priority, 9 as default
- running_cluster – cluster to run this instance
Returns: execution result
Return type: odps.df.backends.frame.ResultFrame
-
hll_count
(error_rate=0.01, splitter=None)¶ Calculate HyperLogLog count
Parameters: - expr –
- error_rate (float) – error rate
- splitter – the splitter to split the column value
Returns: sequence or scalar
Example: >>> df = DataFrame(pd.DataFrame({'a': np.random.randint(100000, size=100000)})) >>> df.a.hll_count() 63270 >>> df.a.nunique() 63250
-
lag
(offset, default=None, sort=None, ascending=True)¶ Get value in the row offset rows prior to the current row.
Parameters: - offset – the offset value
- default – default value for the function, when there are no rows satisfying the offset
- expr – expression for calculation
- sort – name of the sort column
- ascending – whether to sort in ascending order
Returns: calculated column
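Example (a minimal sketch of a windowed call on a grouped sequence via mutate; column names are hypothetical):
>>> grouped = df.groupby('dept')
>>> grouped.mutate(grouped.score.lag(1, sort='ts').rename('prev_score'))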
-
lead
(offset, default=None, sort=None, ascending=True)¶ Get value in the row offset rows after the current row.
Parameters: - offset – the offset value
- default – default value for the function, when there are no rows satisfying the offset
- expr – expression for calculation
- sort – name of the sort column
- ascending – whether to sort in ascending order
Returns: calculated column
-
max
()¶ Max value
Parameters: expr – Returns:
-
min
()¶ Min value
Parameters: expr – Returns:
-
min_rank
(sort=None, ascending=True)¶ Calculate rank of a sequence expression.
Parameters: - expr – expression for calculation
- sort – name of the sort column
- ascending – whether to sort in ascending order
Returns: calculated column
-
nth_value
(nth, skip_nulls=False, sort=None, ascending=True)¶ Get nth value of a grouped and sorted expression.
Parameters: - expr – expression for calculation
- nth – integer position
- skip_nulls – whether to skip null values, False by default
- sort – name of the sort column
- ascending – whether to sort in ascending order
Returns: calculated column
-
nunique
()¶ The distinct count.
Parameters: expr – Returns:
-
percent_rank
(sort=None, ascending=True)¶ Calculate percentage rank of a sequence expression.
Parameters: - expr – expression for calculation
- sort – name of the sort column
- ascending – whether to sort in ascending order
Returns: calculated column
-
persist
(name, partitions=None, partition=None, lifecycle=None, project=None, **kwargs)¶ Persist the execution result into a table. If partitions is not specified, a new table without partitions will be created if it does not exist, and the SQL result will be inserted into it. If partitions is specified, those fields will be the partition fields of the new table. If partition is specified, the data will be inserted into that exact partition of the table.
Parameters: - name – table name
- partitions (list) – list of string, the partition fields
- partition (string or PartitionSpec) – persist to a specified partition
- lifecycle (int) – table lifecycle. If absent, options.lifecycle will be used.
- project – project name, if not provided, will be the default project
- hints (dict) – settings for SQL, e.g. odps.sql.mapper.split.size
- priority (int) – instance priority, 9 as default
- running_cluster – cluster to run this instance
- overwrite (bool) – overwrite the table, True as default
- drop_table (bool) – drop table if exists, False as default
- create_table (bool) – create table first if not exists, True as default
- drop_partition (bool) – drop partition if exists, False as default
- create_partition (bool) – create partition if not exists, None as default
- cast (bool) – cast all columns' types to those of the existing table, False as default
Returns: Example: >>> df = df['name', 'id', 'ds'] >>> df.persist('odps_new_table') >>> df.persist('odps_new_table', partition='pt=test') >>> df.persist('odps_new_table', partitions=['ds'])
-
qcut
(bins, labels=False, sort=None, ascending=True)¶ Get quantile-based bin indices of every element of a grouped and sorted expression. The indices of bins start from 0. If cuts are not of equal sizes, extra items will be appended into the first group.
Parameters: - expr – expression for calculation
- bins – number of bins
- sort – name of the sort column
- ascending – whether to sort in ascending order
Returns: calculated column
-
rank
(sort=None, ascending=True)¶ Calculate rank of a sequence expression.
Parameters: - expr – expression for calculation
- sort – name of the sort column
- ascending – whether to sort in ascending order
Returns: calculated column
-
row_number
(sort=None, ascending=True)¶ Calculate row number of a sequence expression.
Parameters: - expr – expression for calculation
- sort – name of the sort column
- ascending – whether to sort in ascending order
Returns: calculated column
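Example (a minimal sketch; column names are hypothetical):
>>> grouped = df.groupby('dept')
>>> grouped.mutate(grouped.score.row_number(sort='score', ascending=False).rename('score_rank'))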
-
size
()¶ Number of values.
Parameters: expr – Returns:
-
tolist
(**kwargs)¶ Pack all data in the sequence into a list.
Parameters: - expr –
- unique – whether to make the elements in the list unique
Returns:
-
verify
()¶ Verify if this expression can be compiled into ODPS SQL.
Returns: True if compilation succeeds else False Return type: bool
-
-
class
odps.df.expr.groupby.
Int64SequenceGroupBy
(*args, **kwargs)[source]¶ -
ast
()¶ Return the AST string.
Returns: AST tree Return type: str
-
compile
()¶ Compile this expression into an ODPS SQL statement.
Returns: compiled DAG Return type: str
-
count
()¶ Number of values.
Parameters: expr – Returns:
-
cumcount
(sort=None, ascending=True, unique=False, preceding=None, following=None)¶ Calculate cumulative count of a sequence expression.
Parameters: - expr – expression for calculation
- sort – name of the sort column
- ascending – whether to sort in ascending order
- unique – whether to eliminate duplicate entries
- preceding – the start point of a window
- following – the end point of a window
Returns: calculated column
-
cume_dist
(sort=None, ascending=True)¶ Calculate cumulative ratio of a sequence expression.
Parameters: - expr – expression for calculation
- sort – name of the sort column
- ascending – whether to sort in ascending order
Returns: calculated column
-
cummax
(sort=None, ascending=True, unique=False, preceding=None, following=None)¶ Calculate cumulative maximum of a sequence expression.
Parameters: - expr – expression for calculation
- sort – name of the sort column
- ascending – whether to sort in ascending order
- unique – whether to eliminate duplicate entries
- preceding – the start point of a window
- following – the end point of a window
Returns: calculated column
-
cummean
(sort=None, ascending=True, unique=False, preceding=None, following=None)¶ Calculate cumulative mean of a sequence expression.
Parameters: - expr – expression for calculation
- sort – name of the sort column
- ascending – whether to sort in ascending order
- unique – whether to eliminate duplicate entries
- preceding – the start point of a window
- following – the end point of a window
Returns: calculated column
-
cummedian
(sort=None, ascending=True, unique=False, preceding=None, following=None)¶ Calculate cumulative median of a sequence expression.
Parameters: - expr – expression for calculation
- sort – name of the sort column
- ascending – whether to sort in ascending order
- unique – whether to eliminate duplicate entries
- preceding – the start point of a window
- following – the end point of a window
Returns: calculated column
-
cummin
(sort=None, ascending=True, unique=False, preceding=None, following=None)¶ Calculate cumulative minimum of a sequence expression.
Parameters: - expr – expression for calculation
- sort – name of the sort column
- ascending – whether to sort in ascending order
- unique – whether to eliminate duplicate entries
- preceding – the start point of a window
- following – the end point of a window
Returns: calculated column
-
cumstd
(sort=None, ascending=True, unique=False, preceding=None, following=None)¶ Calculate cumulative standard deviation of a sequence expression.
Parameters: - expr – expression for calculation
- sort – name of the sort column
- ascending – whether to sort in ascending order
- unique – whether to eliminate duplicate entries
- preceding – the start point of a window
- following – the end point of a window
Returns: calculated column
-
cumsum
(sort=None, ascending=True, unique=False, preceding=None, following=None)¶ Calculate cumulative summation of a sequence expression.
Parameters: - expr – expression for calculation
- sort – name of the sort column
- ascending – whether to sort in ascending order
- unique – whether to eliminate duplicate entries
- preceding – the start point of a window
- following – the end point of a window
Returns: calculated column
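Example (a minimal sketch; column names are hypothetical):
>>> grouped = df.groupby('dept')
>>> grouped.mutate(grouped.id.cumsum(sort='id').rename('running_total'))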
-
dense_rank
(sort=None, ascending=True)¶ Calculate dense rank of a sequence expression.
Parameters: - expr – expression for calculation
- sort – name of the sort column
- ascending – whether to sort in ascending order
Returns: calculated column
-
execute
(**kwargs)¶ Parameters: - hints (dict) – settings for SQL, e.g. odps.sql.mapper.split.size
- priority (int) – instance priority, 9 as default
- running_cluster – cluster to run this instance
Returns: execution result
Return type: odps.df.backends.frame.ResultFrame
-
hll_count
(error_rate=0.01, splitter=None)¶ Calculate HyperLogLog count
Parameters: - expr –
- error_rate (float) – error rate
- splitter – the splitter to split the column value
Returns: sequence or scalar
Example: >>> df = DataFrame(pd.DataFrame({'a': np.random.randint(100000, size=100000)})) >>> df.a.hll_count() 63270 >>> df.a.nunique() 63250
-
kurt
()¶ Calculate kurtosis of the sequence
Parameters: expr – Returns:
-
kurtosis
()¶ Calculate kurtosis of the sequence
Parameters: expr – Returns:
-
lag
(offset, default=None, sort=None, ascending=True)¶ Get value in the row offset rows prior to the current row.
Parameters: - offset – the offset value
- default – default value for the function, when there are no rows satisfying the offset
- expr – expression for calculation
- sort – name of the sort column
- ascending – whether to sort in ascending order
Returns: calculated column
-
lead
(offset, default=None, sort=None, ascending=True)¶ Get value in the row offset rows after the current row.
Parameters: - offset – the offset value
- default – default value for the function, when there are no rows satisfying the offset
- expr – expression for calculation
- sort – name of the sort column
- ascending – whether to sort in ascending order
Returns: calculated column
-
max
()¶ Max value
Parameters: expr – Returns:
-
mean
()¶ Arithmetic mean.
Parameters: expr – Returns:
-
median
()¶ Median value.
Parameters: expr – Returns:
-
min
()¶ Min value
Parameters: expr – Returns:
-
min_rank
(sort=None, ascending=True)¶ Calculate rank of a sequence expression.
Parameters: - expr – expression for calculation
- sort – name of the sort column
- ascending – whether to sort in ascending order
Returns: calculated column
-
moment
(order, central=False)¶ Calculate the n-th order moment of the sequence
Parameters: - expr –
- order – moment order, must be an integer
- central – whether central moments are to be computed.
Returns:
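Example (a minimal sketch using the agg pattern; with central=True the second-order moment corresponds to the population variance of each group; column names are hypothetical):
>>> df.groupby('dept').agg(m2=df.id.moment(2, central=True))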
-
nth_value
(nth, skip_nulls=False, sort=None, ascending=True)¶ Get nth value of a grouped and sorted expression.
Parameters: - expr – expression for calculation
- nth – integer position
- skip_nulls – whether to skip null values, False by default
- sort – name of the sort column
- ascending – whether to sort in ascending order
Returns: calculated column
-
nunique
()¶ The distinct count.
Parameters: expr – Returns:
-
percent_rank
(sort=None, ascending=True)¶ Calculate percentage rank of a sequence expression.
Parameters: - expr – expression for calculation
- sort – name of the sort column
- ascending – whether to sort in ascending order
Returns: calculated column
-
persist
(name, partitions=None, partition=None, lifecycle=None, project=None, **kwargs)¶ Persist the execution result into a table. If partitions is not specified, a new table without partitions will be created if it does not exist, and the SQL result will be inserted into it. If partitions is specified, those fields will be the partition fields of the new table. If partition is specified, the data will be inserted into that exact partition of the table.
Parameters: - name – table name
- partitions (list) – list of string, the partition fields
- partition (string or PartitionSpec) – persist to a specified partition
- lifecycle (int) – table lifecycle. If absent, options.lifecycle will be used.
- project – project name, if not provided, will be the default project
- hints (dict) – settings for SQL, e.g. odps.sql.mapper.split.size
- priority (int) – instance priority, 9 as default
- running_cluster – cluster to run this instance
- overwrite (bool) – overwrite the table, True as default
- drop_table (bool) – drop table if exists, False as default
- create_table (bool) – create table first if not exists, True as default
- drop_partition (bool) – drop partition if exists, False as default
- create_partition (bool) – create partition if not exists, None as default
- cast (bool) – cast all columns' types to those of the existing table, False as default
Returns: Example: >>> df = df['name', 'id', 'ds'] >>> df.persist('odps_new_table') >>> df.persist('odps_new_table', partition='pt=test') >>> df.persist('odps_new_table', partitions=['ds'])
-
qcut
(bins, labels=False, sort=None, ascending=True)¶ Get quantile-based bin indices of every element of a grouped and sorted expression. The indices of bins start from 0. If cuts are not of equal sizes, extra items will be appended into the first group.
Parameters: - expr – expression for calculation
- bins – number of bins
- sort – name of the sort column
- ascending – whether to sort in ascending order
Returns: calculated column
-
quantile
(prob=None, **kw)¶ Percentile value.
Parameters: - expr –
- prob – probability or list of probabilities, in [0, 1]
Returns:
-
rank
(sort=None, ascending=True)¶ Calculate rank of a sequence expression.
Parameters: - expr – expression for calculation
- sort – name of the sort column
- ascending – whether to sort in ascending order
Returns: calculated column
-
row_number
(sort=None, ascending=True)¶ Calculate row number of a sequence expression.
Parameters: - expr – expression for calculation
- sort – name of the sort column
- ascending – whether to sort in ascending order
Returns: calculated column
-
size
()¶ Number of values.
Parameters: expr – Returns:
-
skew
()¶ Calculate skewness of the sequence
Parameters: expr – Returns:
-
std
(**kw)¶ Standard deviation.
Parameters: - expr –
- kw –
Returns:
-
sum
()¶ Sum value
Parameters: expr – Returns:
-
tolist
(**kwargs)¶ Pack all data in the sequence into a list.
Parameters: - expr –
- unique – whether to make the elements in the list unique
Returns:
-
var
(**kw)¶ Variance
Parameters: - expr –
- ddof – degrees of freedom
- kw –
Returns:
-
verify
()¶ Verify if this expression can be compiled into ODPS SQL.
Returns: True if compilation succeeds else False Return type: bool
-