Column operations
>>> from odps.df import DataFrame
>>> iris = DataFrame(o.get_table('pyodps_iris'))
>>> lens = DataFrame(o.get_table('pyodps_ml_100k_lens'))
For a Sequence object, operations such as adding a constant or applying the sin function are performed on all elements of the object. The available operations are described below.
NULL-related functions (isnull, notnull, fillna)
The DataFrame API provides several NULL-related built-in functions. You can use isnull or notnull to determine whether a field is NULL, and fillna to replace NULL fields with specified values.
>>> iris.sepallength.isnull().head(5)
sepallength
0 False
1 False
2 False
3 False
4 False
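The fillna semantics can be sketched in plain Python (a hypothetical fillna helper for illustration, not the PyODPS implementation, which runs in MaxCompute):

```python
def fillna(values, default):
    # Replace NULL (None) fields with the given default value.
    return [default if v is None else v for v in values]

col = [5.1, None, 4.7]
print([v is None for v in col])   # isnull-style check per element
print(fillna(col, 0.0))
```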
Logic functions (ifelse, switch)
The ifelse function acts on boolean fields and returns the first parameter if the condition is true, or the second parameter if the condition is false.
>>> (iris.sepallength > 5).ifelse('gt5', 'lte5').rename('cmp5').head(5)
cmp5
0 gt5
1 lte5
2 lte5
3 lte5
4 lte5
The switch function is used to handle multiple conditions.
>>> iris.sepallength.switch(4.9, 'eq4.9', 5.0, 'eq5.0', default='noeq').rename('equalness').head(5)
equalness
0 noeq
1 eq4.9
2 noeq
3 noeq
4 eq5.0
>>> from odps.df import switch
>>> switch(iris.sepallength == 4.9, 'eq4.9', iris.sepallength == 5.0, 'eq5.0', default='noeq').rename('equalness').head(5)
equalness
0 noeq
1 eq4.9
2 noeq
3 noeq
4 eq5.0
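The semantics of switch can be sketched in plain Python (a hypothetical helper mirroring the call pattern above, not the PyODPS implementation):

```python
def switch(x, *cases, default=None):
    # cases alternate between a value to compare against and its result:
    # value1, result1, value2, result2, ...
    for value, result in zip(cases[::2], cases[1::2]):
        if x == value:
            return result
    return default

print(switch(4.9, 4.9, 'eq4.9', 5.0, 'eq5.0', default='noeq'))
print(switch(5.1, 4.9, 'eq4.9', 5.0, 'eq5.0', default='noeq'))
```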
In PyODPS 0.7.8 and later, you can modify the column values of a dataset under given conditions:
>>> iris[iris.sepallength > 5, 'cmp5'] = 'gt5'
>>> iris[iris.sepallength <= 5, 'cmp5'] = 'lte5'
>>> iris.head(5)
cmp5
0 gt5
1 lte5
2 lte5
3 lte5
4 lte5
Mathematical operations
For numeric fields, arithmetic operations such as addition (+), subtraction (-), multiplication (*), and division (/) are supported. Mathematical functions such as log and sin are also supported.
>>> (iris.sepallength * 10).log().head(5)
sepallength
0 3.931826
1 3.891820
2 3.850148
3 3.828641
4 3.912023
>>> fields = [iris.sepallength,
>>> (iris.sepallength / 2).rename('sepallength_div_2'),
>>> (iris.sepallength ** 2).rename('sepallength_square')]
>>> iris[fields].head(5)
sepallength sepallength_div_2 sepallength_square
0 5.1 2.55 26.01
1 4.9 2.45 24.01
2 4.7 2.35 22.09
3 4.6 2.30 21.16
4 5.0 2.50 25.00
The supported arithmetic operations include:
Arithmetic operation | Description
---|---
abs | Returns the absolute value of the given number.
sqrt | Returns the square root of the given number.
sin | Returns the sine of the given number.
sinh | Returns the hyperbolic sine of the given number.
cos | Returns the cosine of the given number.
cosh | Returns the hyperbolic cosine of the given number.
tan | Returns the tangent of the given number.
tanh | Returns the hyperbolic tangent of the given number.
arccos | Returns the arc cosine of the given number.
arccosh | Returns the inverse hyperbolic cosine of the given number.
arcsin | Returns the arc sine of the given number.
arcsinh | Returns the inverse hyperbolic sine of the given number.
arctan | Returns the arc tangent of the given number.
arctanh | Returns the inverse hyperbolic tangent of the given number.
exp | Returns e^x for the given number x.
expm1 | Returns e^x - 1 for the given number x.
log | A parameter is required as the base.
log2 | Returns the base-2 logarithm of the given number.
log10 | Returns the base-10 logarithm of the given number.
log1p | Returns log(1 + x).
radians | Converts degrees to radians.
degrees | Converts radians to degrees.
ceil | Returns the smallest integer that is not less than the given number.
floor | Returns the largest integer that is not greater than the given number.
trunc | Returns the number truncated to the specified decimal place.
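For instance, the behavior of trunc, ceil, and floor can be sketched with the Python standard library (a hypothetical trunc helper for illustration; the actual computation runs in MaxCompute):

```python
import math

def trunc(x, decimals=0):
    # Truncate x to the given number of decimal places without rounding.
    factor = 10 ** decimals
    return math.trunc(x * factor) / factor

print(trunc(5.178, 2))
print(math.ceil(4.2), math.floor(4.8))
```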
You can compare a Sequence object with another Sequence or Scalar object.
>>> (iris.sepallength < 5).head(5)
sepallength
0 False
1 True
2 True
3 True
4 False
Note that the DataFrame API does not support chained comparisons such as 3 <= iris.sepallength <= 5. Instead, you can use the between function to determine whether a field value falls within a certain range.
>>> (iris.sepallength.between(3, 5)).head(5)
sepallength
0 False
1 True
2 True
3 True
4 True
By default, the between function specifies an interval that includes endpoints. To specify an open interval, set inclusive to False.
>>> (iris.sepallength.between(3, 5, inclusive=False)).head(5)
sepallength
0 False
1 True
2 True
3 True
4 False
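The between semantics correspond to a chained comparison under the hood; a plain-Python sketch (hypothetical helper, not the PyODPS API itself):

```python
def between(x, low, high, inclusive=True):
    # Closed interval by default; open interval when inclusive=False.
    if inclusive:
        return low <= x <= high
    return low < x < high

print(between(5.0, 3, 5))
print(between(5.0, 3, 5, inclusive=False))
```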
String-related operations
The DataFrame API provides a number of string-related operations for Sequence and Scalar objects.
>>> fields = [
>>> iris.name.upper().rename('upper_name'),
>>> iris.name.extract('Iris(.*)', group=1)
>>> ]
>>> iris[fields].head(5)
upper_name name
0 IRIS-SETOSA -setosa
1 IRIS-SETOSA -setosa
2 IRIS-SETOSA -setosa
3 IRIS-SETOSA -setosa
4 IRIS-SETOSA -setosa
The operations are:
Operation | Description
---|---
capitalize | Converts the first character of the string to uppercase and the rest to lowercase.
contains | Returns whether the given string contains a substring. The substring is treated as a regular expression if the regex parameter is set to True, which is the default.
count | Counts the number of occurrences of the specified string.
endswith | Returns whether the given string ends with the specified string.
startswith | Returns whether the given string starts with the specified string.
extract | Extracts substrings with a regular expression. If group is not specified, returns the substrings that match the pattern; otherwise, returns the specified group.
find | Searches from left to right and returns the position of the first occurrence of the specified substring, or -1 if no match is found.
rfind | Searches from right to left and returns the position of the first occurrence of the specified substring, or -1 if no match is found.
replace | Replaces the substrings that match the pattern with another string. The pattern is treated as a regular expression if the regex parameter is set to True.
get | Returns the string at the specified position.
len | Returns the length of the string.
ljust | Pads the right side of the string with the given character (a space by default) until it reaches the specified length.
rjust | Pads the left side of the string with the given character (a space by default) until it reaches the specified length.
lower | Converts the string to lowercase.
upper | Converts the string to uppercase.
lstrip | Removes whitespace (including line breaks) from the left side of the string.
rstrip | Removes whitespace (including line breaks) from the right side of the string.
strip | Removes whitespace (including line breaks) from both sides of the string.
split | Splits the string at the specified separator and returns a value of list<string> type.
pad | Pads the string with the given character on the specified side until it reaches the specified length.
repeat | Repeats the string n times.
slice | Performs a slice operation.
swapcase | Converts all uppercase characters in the string to lowercase and all lowercase characters to uppercase.
title | The same as str.title.
zfill | Pads the left side of the string with the character '0' until it reaches the specified width.
isalnum | The same as str.isalnum.
isalpha | The same as str.isalpha.
isdigit | Returns True if all the characters in the string are digits. The same as str.isdigit.
isspace | Returns True if all the characters in the string are spaces. The same as str.isspace.
islower | Returns True if all the cased characters in the string are lowercase. The same as str.islower.
isupper | Returns True if all the cased characters in the string are uppercase. The same as str.isupper.
istitle | The same as str.istitle.
isnumeric | The same as str.isnumeric.
isdecimal | The same as str.isdecimal.
todict | Splits the string into a dict using the specified separators. Two parameters are required: the item separator and the key-value separator. Returns a value of dict<string, string> type.
strptime | Converts a string representing a time to a datetime value according to the specified format. The time format is the same as in the standard Python library; see Python time formats for details.
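The behavior of todict and strptime can be sketched with plain Python (the todict helper below is a hypothetical illustration; strptime follows the standard Python time formats):

```python
from datetime import datetime

def todict(s, item_sep=',', kv_sep='='):
    # Split into items, then split each item into a key-value pair.
    return dict(item.split(kv_sep, 1) for item in s.split(item_sep))

print(todict('a=1,b=2'))
print(datetime.strptime('1998-04-08', '%Y-%m-%d').year)
```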
Time-related operations
For datetime type Sequence or Scalar objects, the following time-related built-in functions are provided.
>>> df = lens[[lens.unix_timestamp.astype('datetime').rename('dt')]]
>>> df[df.dt,
>>> df.dt.year.rename('year'),
>>> df.dt.month.rename('month'),
>>> df.dt.day.rename('day'),
>>> df.dt.hour.rename('hour')].head(5)
dt year month day hour
0 1998-04-08 11:02:00 1998 4 8 11
1 1998-04-08 10:57:55 1998 4 8 10
2 1998-04-08 10:45:26 1998 4 8 10
3 1998-04-08 10:25:52 1998 4 8 10
4 1998-04-08 10:44:19 1998 4 8 10
Time-related attributes are:
Time-related attribute | Description
---|---
year | Returns the year of the datetime value.
month | Returns the month of the datetime value.
day | Returns the day of the datetime value.
hour | Returns the hour of the datetime value.
minute | Returns the minute of the datetime value.
second | Returns the second of the datetime value.
weekofyear | Returns the week number of the year for the given date. Monday is the first day of a week.
weekday | Returns the day of the week for the given date.
dayofweek | The same as weekday.
strftime | Converts a time to a string according to the specified format. The time format is the same as in the standard Python library; see Python time formats for details.
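These attributes mirror the Python standard library; for example, using the timestamp from the output above (plain Python, not a PyODPS expression):

```python
from datetime import datetime

dt = datetime(1998, 4, 8, 11, 2, 0)
print(dt.weekday())          # Monday is 0
print(dt.isocalendar()[1])   # week of the year
print(dt.strftime('%Y-%m-%d %H:%M'))
```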
PyODPS also supports the addition and subtraction of time. For example, you can retrieve the date 3 days before the current date. Subtracting one date column from another returns the difference in milliseconds.
>>> df
a b
0 2016-12-06 16:43:12.460001 2016-12-06 17:43:12.460018
1 2016-12-06 16:43:12.460012 2016-12-06 17:43:12.460021
2 2016-12-06 16:43:12.460015 2016-12-06 17:43:12.460022
>>> from odps.df import day
>>> df.a - day(3)
a
0 2016-12-03 16:43:12.460001
1 2016-12-03 16:43:12.460012
2 2016-12-03 16:43:12.460015
>>> (df.b - df.a).dtype
int64
>>> (df.b - df.a).rename('a')
a
0 3600000
1 3600000
2 3600000
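In plain Python terms, this corresponds to datetime arithmetic with offsets, where a difference expressed in milliseconds matches the int64 result above:

```python
from datetime import datetime, timedelta

a = datetime(2016, 12, 6, 16, 43, 12)
b = a + timedelta(hours=1)
print(a - timedelta(days=3))                # the date 3 days earlier
print(int((b - a).total_seconds() * 1000))  # difference in milliseconds
```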
The supported time offset types are:
Offset | Description
---|---
year | An offset of the given number of years.
month | An offset of the given number of months.
day | An offset of the given number of days.
hour | An offset of the given number of hours.
minute | An offset of the given number of minutes.
second | An offset of the given number of seconds.
millisecond | An offset of the given number of milliseconds.
Collection type related operations
PyODPS Sequence objects support the List and Dict types. You can use subscripts to retrieve an item from either type, and the len method to retrieve the number of items in each element.
Additionally, the List and Dict types support the explode method, which expands the contents of the collection. For List, explode returns one column by default; when the pos parameter is specified, it returns two columns, one of which is the position of the value within the array, similar to Python's enumerate function. For Dict, explode returns two columns, representing the keys and values respectively. You can also pass column names to explode to name the result columns.
The following are some examples:
>>> df
id a b
0 1 [a1, b1] {'a2': 0, 'b2': 1, 'c2': 2}
1 2 [c1] {'d2': 3, 'e2': 4}
>>> df[df.id, df.a[0], df.b['b2']]
id a b
0 1 a1 1
1 2 c1 NaN
>>> df[df.id, df.a.len(), df.b.len()]
id a b
0 1 2 3
1 2 1 2
>>> df.a.explode()
a
0 a1
1 b1
2 c1
>>> df.a.explode(pos=True)
a_pos a
0 0 a1
1 1 b1
2 0 c1
>>> # designate column names
>>> df.a.explode(['pos', 'value'], pos=True)
pos value
0 0 a1
1 1 b1
2 0 c1
>>> df.b.explode()
b_key b_value
0 a2 0
1 b2 1
2 c2 2
3 d2 3
4 e2 4
>>> # designate column names
>>> df.b.explode(['key', 'value'])
key value
0 a2 0
1 b2 1
2 c2 2
3 d2 3
4 e2 4
You can also combine explode with the lateral view as follows:
>>> df[df.id, df.a.explode()]
id a
0 1 a1
1 1 b1
2 2 c1
>>> df[df.id, df.a.explode(), df.b.explode()]
id a b_key b_value
0 1 a1 a2 0
1 1 a1 b2 1
2 1 a1 c2 2
3 1 b1 a2 0
4 1 b1 b2 1
5 1 b1 c2 2
6 2 c1 d2 3
7 2 c1 e2 4
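The lateral-view expansion above behaves like nested loops over each row's collections; a plain-Python sketch using the same sample data:

```python
rows = [(1, ['a1', 'b1'], {'a2': 0, 'b2': 1, 'c2': 2}),
        (2, ['c1'], {'d2': 3, 'e2': 4})]

# Expanding both collections yields the per-row cross product.
exploded = [(rid, a, key, value)
            for rid, lst, dct in rows
            for a in lst
            for key, value in sorted(dct.items())]

print(len(exploded))   # 2*3 + 1*2 rows
print(exploded[0])
```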
In addition to subscripts, len, and explode, the list type also supports the following methods:
List operation | Description
---|---
contains(v) | Returns whether the list contains the specified element.
sort | Sorts the list and returns a value of list type.
Dict also supports the following methods:
Dict operation | Description
---|---
keys | Retrieves the Dict keys and returns a value of list type.
values | Retrieves the Dict values and returns a value of list type.
Other operations
The isin and notin methods return whether the elements in a Sequence object exist in a given collection of elements.
>>> iris.sepallength.isin([4.9, 5.1]).rename('sepallength').head(5)
sepallength
0 True
1 True
2 False
3 False
4 False
The cut method divides Sequence data into several segments.
>>> iris.sepallength.cut(range(6), labels=['0-1', '1-2', '2-3', '3-4', '4-5']).rename('sepallength_cut').head(5)
sepallength_cut
0 None
1 4-5
2 4-5
3 4-5
4 4-5
The include_under and include_over parameters specify whether to add an extra segment below the lowest bound and above the highest bound, respectively.
>>> labels = ['0-1', '1-2', '2-3', '3-4', '4-5', '5-']
>>> iris.sepallength.cut(range(6), labels=labels, include_over=True).rename('sepallength_cut').head(5)
sepallength_cut
0 5-
1 4-5
2 4-5
3 4-5
4 4-5
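The cut semantics can be sketched in plain Python (a hypothetical helper for illustration; bins form left-open, right-closed intervals):

```python
def cut(x, bins, labels, include_over=False):
    bins = list(bins)
    if x <= bins[0]:
        return None
    for lo, hi, label in zip(bins, bins[1:], labels):
        if lo < x <= hi:
            return label
    # Value lies above the last bin edge.
    return labels[len(bins) - 1] if include_over else None

labels = ['0-1', '1-2', '2-3', '3-4', '4-5', '5-']
print(cut(4.9, range(6), labels))
print(cut(5.1, range(6), labels, include_over=True))
```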
Use custom functions
DataFrame allows you to call the map method on a Sequence object to apply a custom function to each of its elements.
>>> iris.sepallength.map(lambda x: x + 1).head(5)
sepallength
0 6.1
1 5.9
2 5.7
3 5.6
4 6.0
Warning
Custom functions are currently not allowed to use lists or dicts as inputs or outputs because of Python UDF limitations.
If the type of Sequence has been changed after calling the map method, you need to explicitly specify the new type.
>>> iris.sepallength.map(lambda x: 't' + str(x), rtype='string').head(5)
sepallength
0 t5.1
1 t4.9
2 t4.7
3 t4.6
4 t5.0
If a function contains a closure, note that when the value of a closure variable changes outside the function, the value used inside the function changes as well. For example:
>>> dfs = []
>>> for i in range(10):
>>> dfs.append(df.sepal_length.map(lambda x: x + i))
Each SequenceExpr object in dfs is now df.sepal_length + 9. To solve this problem, you can return the mapper function from another function, or use functools.partial. For example:
>>> dfs = []
>>> def get_mapper(i):
>>> return lambda x: x + i
>>> for i in range(10):
>>> dfs.append(df.sepal_length.map(get_mapper(i)))
Or
>>> import functools
>>> dfs = []
>>> for i in range(10):
>>> dfs.append(df.sepal_length.map(functools.partial(lambda v, x: x + v, i)))
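The same pitfall can be reproduced in plain Python, together with the functools.partial fix:

```python
import functools

# All lambdas close over the same loop variable, so each sees its final value.
funcs = [lambda x: x + i for i in range(10)]
print(funcs[0](0))   # 9, not 0

# Binding the current value at creation time fixes the problem.
fixed = [functools.partial(lambda v, x: x + v, i) for i in range(10)]
print(fixed[0](0), fixed[3](0))
```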
The map method also supports existing UDFs. You can pass in str type parameters, which represent function names, or Function objects.
The implementation of map depends on MaxCompute Python UDF. If your project does not support Python UDF, you cannot use map. All Python UDF limitations apply.
The only built-in third-party library in MaxCompute is numpy. If you need to use other libraries, you need to upload these libraries yourself. For more information, see Use third-party Python libraries below.
Warning
Due to the differences in bytecode definitions, new features supported by Python 3, such as yield from, may cause errors when executed by MaxCompute Workers running Python 2.7. We recommend that you make sure your code executes normally before writing production code using the MapReduce API in Python 3.
Warning
PyODPS DataFrame recognizes all collections and sequences as distributed objects, and does not support referencing these objects inside user-defined functions. Please consider using methods like join to reference data in multiple DataFrames, or referencing collections as resources, which is stated in the next section.
Reference Resources
Custom functions can also read MaxCompute resources, such as table and file resources, or reference Collection objects as resources. To do that, you need to write your functions as a closure or callable class.
>>> file_resource = o.create_resource('pyodps_iris_file', 'file', file_obj='Iris-setosa')
>>>
>>> iris_names_collection = iris.distinct('name')[:2]
>>> iris_names_collection
name
0 Iris-setosa
1 Iris-versicolor
>>> def myfunc(resources): # resources passed by calling order
>>> names = set()
>>> fileobj = resources[0] # file resources are file-like objects
>>> for l in fileobj:
>>> names.add(l)
>>> collection = resources[1]
>>> for r in collection:
>>> names.add(r.name) # values can be obtained via column name or column offset
>>> def h(x):
>>> if x in names:
>>> return True
>>> else:
>>> return False
>>> return h
>>>
>>> df = iris.distinct('name')
>>> df = df[df.name,
>>> df.name.map(myfunc, resources=[file_resource, iris_names_collection], rtype='boolean').rename('isin')]
>>>
>>> df
name isin
0 Iris-setosa True
1 Iris-versicolor True
2 Iris-virginica False
Note: when reading partitioned tables, partition fields are not included.
Use third-party Python libraries
You can upload third-party wheel packages as resources to MaxCompute. Specify the package files globally, or in methods that execute DataFrames immediately. Note that you also need to add the dependencies of your third-party libraries, or the import may fail.
If your MaxCompute service supports specifying images when executing SQL statements, you can pass the image argument to execute, persist, or to_pandas to use these images. Similarly, the libraries argument of execute, persist, or to_pandas specifies resources to be used as third-party libraries. The PyODPS installation provides the pyodps-pack tool for packing third-party libraries; see its documentation for how to build and use them.
Warning
Due to the differences in bytecode definitions, new features supported by Python 3, such as yield from, may cause errors when executed by MaxCompute Workers running Python 2.7. We recommend that you make sure your code executes normally before writing production code using the MapReduce API in Python 3.
Use the counter
You can update MaxCompute counters inside custom functions through the execution context:
from odps.udf import get_execution_context
def h(x):
ctx = get_execution_context()
counters = ctx.get_counters()
counters.get_counter('df', 'add_one').increment(1)
return x + 1
df.field.map(h)
You can find the value of the counter in the JSONSummary of LogView.
Call MaxCompute built-in functions or UDFs
You can use the func
interface to call MaxCompute built-in functions or UDFs to generate columns. This interface returns a string type by default. You can use the rtype parameter to specify the return type.
>>> from odps.df import func
>>>
>>> iris[iris.name, func.rand(rtype='float').rename('rand')][:4]
>>> iris[iris.name, func.rand(10, rtype='float').rename('rand')][:4]
>>> # call UDFs defined on MaxCompute. you need to name your result sequence
>>> # explicitly if the name of returned sequence cannot be determined
>>> iris[iris.name, func.your_udf(iris.sepalwidth, iris.sepallength, rtype='float').rename('new_col')]
>>> # call a UDF from another project. sequence can also be renamed via name arg
>>> iris[iris.name, func.your_udf(iris.sepalwidth, iris.sepallength, rtype='float', project='udf_project',
>>> name='new_col')]
Note
When executing under the pandas backend, expressions containing func are not supported.