Column operations

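The examples in this section use the following objects:
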
from odps.df import DataFrame
iris = DataFrame(o.get_table('pyodps_iris'))
lens = DataFrame(o.get_table('pyodps_ml_100k_lens'))

Operations on a Sequence object, such as adding a constant or applying the sin function, are performed element-wise on all of its elements. The following sections describe these operations.

NULL-related functions (isnull, notnull, fillna)

The DataFrame API provides several NULL-related built-in functions. You can use isnull or notnull to determine whether a field is NULL, and fillna to replace NULL fields with specified values.

>>> iris.sepallength.isnull().head(5)
   sepallength
0        False
1        False
2        False
3        False
4        False
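
notnull and fillna follow the same pattern. As a minimal sketch, filling NULL values with 5.0 leaves this column unchanged, since none of its first five values is NULL:

>>> iris.sepallength.fillna(5.0).head(5)
   sepallength
0          5.1
1          4.9
2          4.7
3          4.6
4          5.0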

Logic functions (ifelse, switch)

The ifelse function acts on boolean fields and returns the first parameter if the condition is true, or the second parameter if it is false.

>>> (iris.sepallength > 5).ifelse('gt5', 'lte5').rename('cmp5').head(5)
   cmp5
0   gt5
1  lte5
2  lte5
3  lte5
4  lte5

The switch function is used to handle multiple conditions.

>>> iris.sepallength.switch(4.9, 'eq4.9', 5.0, 'eq5.0', default='noeq').rename('equalness').head(5)
   equalness
0       noeq
1      eq4.9
2       noeq
3       noeq
4      eq5.0
>>> from odps.df import switch
>>> switch(iris.sepallength == 4.9, 'eq4.9', iris.sepallength == 5.0, 'eq5.0', default='noeq').rename('equalness').head(5)
   equalness
0       noeq
1      eq4.9
2       noeq
3       noeq
4      eq5.0

In PyODPS versions 0.7.8 and higher, you can change column values of a dataset according to conditions, as follows:

>>> iris[iris.sepallength > 5, 'cmp5'] = 'gt5'
>>> iris[iris.sepallength <= 5, 'cmp5'] = 'lte5'
>>> iris.head(5)
   cmp5
0   gt5
1  lte5
2  lte5
3  lte5
4  lte5

Mathematical operations

For numeric fields, arithmetic operations such as addition (+), subtraction (-), multiplication (*), and division (/) are supported, as well as mathematical functions such as log and sin.

>>> (iris.sepallength * 10).log().head(5)
   sepallength
0     3.931826
1     3.891820
2     3.850148
3     3.828641
4     3.912023
>>> fields = [iris.sepallength,
>>>           (iris.sepallength / 2).rename('sepallength_div_2'),
>>>           (iris.sepallength ** 2).rename('sepallength_square')]
>>> iris[fields].head(5)
   sepallength  sepallength_div_2  sepallength_square
0          5.1               2.55               26.01
1          4.9               2.45               24.01
2          4.7               2.35               22.09
3          4.6               2.30               21.16
4          5.0               2.50               25.00

The supported arithmetic operations include:

abs: Returns the absolute value of the given number.
sqrt: Returns the square root of the given number.
sin, sinh, cos, cosh, tan, tanh: Trigonometric and hyperbolic functions.
arccos, arccosh, arcsin, arcsinh, arctan, arctanh: Inverse trigonometric and inverse hyperbolic functions.
exp: Returns e^x for the given number x.
expm1: Returns e^x - 1 for the given number x.
log: Returns the logarithm. The base can be passed as a parameter; the example above computes the natural logarithm.
log2: Returns the base-2 logarithm.
log10: Returns the base-10 logarithm.
log1p: Returns log(1 + x).
radians: Converts degrees to radians.
degrees: Converts radians to degrees.
ceil: Returns the smallest integer that is no less than the given number.
floor: Returns the largest integer that is no greater than the given number.
trunc: Truncates the given number at the specified decimal place.
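
For example, a quick sketch applying ceil, which rounds each value up to the nearest integer:

>>> iris.sepallength.ceil().head(5)
   sepallength
0            6
1            5
2            5
3            5
4            5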

You can compare a Sequence object with another Sequence or Scalar object.

>>> (iris.sepallength < 5).head(5)
   sepallength
0        False
1         True
2         True
3         True
4        False

Note that the DataFrame API does not support chained comparisons such as 3 <= iris.sepallength <= 5. Instead, you can use the between function to determine whether a field falls within a certain range.

>>> (iris.sepallength.between(3, 5)).head(5)
   sepallength
0        False
1         True
2         True
3         True
4         True

By default, the between function specifies a closed interval that includes both endpoints. To specify an open interval, set inclusive to False.

>>> (iris.sepallength.between(3, 5, inclusive=False)).head(5)
   sepallength
0        False
1         True
2         True
3         True
4        False

String-related operations

The DataFrame API provides a number of string-related operations for Sequence and Scalar objects.

>>> fields = [
>>>     iris.name.upper().rename('upper_name'),
>>>     iris.name.extract('Iris(.*)', group=1)
>>> ]
>>> iris[fields].head(5)
    upper_name     name
0  IRIS-SETOSA  -setosa
1  IRIS-SETOSA  -setosa
2  IRIS-SETOSA  -setosa
3  IRIS-SETOSA  -setosa
4  IRIS-SETOSA  -setosa

The operations are:

capitalize: Converts the first character to uppercase and the remaining characters to lowercase.
contains: Returns whether the given string contains a substring. The substring is treated as a regular expression if the regex parameter is True, which is the default.
count: Counts the number of occurrences of the specified string.
endswith: Returns whether the given string ends with the specified string.
startswith: Returns whether the given string starts with the specified string.
extract: Matches a regular expression against the string. If group is not specified, returns the substrings that satisfy the pattern; otherwise, returns the specified group.
find: Searches from left to right and returns the position of the first occurrence of the specified substring, or -1 if no match is found.
rfind: Searches from right to left and returns the position of the first occurrence of the specified substring, or -1 if no match is found.
replace: Replaces the substrings that match the pattern with another substring. If n is specified, at most n replacements are made.
get: Returns the character at the specified position.
len: Returns the length of the string.
ljust: Pads the string on the right with the fillchar character (a space by default) until it reaches the specified width.
rjust: Pads the string on the left with the fillchar character (a space by default) until it reaches the specified width.
lower: Converts the string to lowercase.
upper: Converts the string to uppercase.
lstrip: Removes whitespace (including blank lines) from the left of the string.
rstrip: Removes whitespace (including blank lines) from the right of the string.
strip: Removes whitespace (including blank lines) from both sides of the string.
split: Splits the string at the specified separator and returns a list<string>.
pad: Pads the string with the fillchar character (a space by default) at the specified position, which may be left, right, or both.
repeat: Repeats the string n times.
slice: Performs a slice operation.
swapcase: Converts uppercase characters to lowercase and lowercase characters to uppercase.
title: The same as str.title.
zfill: Pads the string on the left with the character 0 until it reaches the specified width.
isalnum: The same as str.isalnum.
isalpha: The same as str.isalpha.
isdigit: Returns True if all characters in the string are digits. The same as str.isdigit.
isspace: Returns True if all characters in the string are whitespace. The same as str.isspace.
islower: Returns True if all cased characters in the string are lowercase. The same as str.islower.
isupper: Returns True if all cased characters in the string are uppercase. The same as str.isupper.
istitle: The same as str.istitle.
isnumeric: The same as str.isnumeric.
isdecimal: The same as str.isdecimal.
todict: Splits the string into a dict using two separators, one between items and one between keys and values. Returns a dict<string, string>.
strptime: Parses the string into a datetime according to the specified format. The format is the same as in the standard Python library; for details, see Python time formats.
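
For instance, a short sketch applying a few of these methods to the name column, whose first rows are all 'Iris-setosa':

>>> fields = [
>>>     iris.name.len().rename('name_len'),
>>>     iris.name.contains('setosa').rename('is_setosa')
>>> ]
>>> iris[fields].head(5)
   name_len  is_setosa
0        11       True
1        11       True
2        11       True
3        11       True
4        11       True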

Time-related operations

For datetime type Sequence or Scalar objects, the following time-related built-in functions are provided.

>>> df = lens[[lens.unix_timestamp.astype('datetime').rename('dt')]]
>>> df[df.dt,
>>>    df.dt.year.rename('year'),
>>>    df.dt.month.rename('month'),
>>>    df.dt.day.rename('day'),
>>>    df.dt.hour.rename('hour')].head(5)
                    dt  year  month  day  hour
0  1998-04-08 11:02:00  1998      4    8    11
1  1998-04-08 10:57:55  1998      4    8    10
2  1998-04-08 10:45:26  1998      4    8    10
3  1998-04-08 10:25:52  1998      4    8    10
4  1998-04-08 10:44:19  1998      4    8    10

Time-related attributes are:

year: Returns the year of the given date.
month: Returns the month of the given date.
day: Returns the day of the given date.
hour: Returns the hour of the given time.
minute: Returns the minute of the given time.
second: Returns the second of the given time.
weekofyear: Returns the week of the year for the given date. Monday is the first day of a week.
weekday: Returns the day of the week for the given date.
dayofweek: The same as weekday.
strftime: Formats the time as a string according to the specified format. The format is the same as in the standard Python library; for more information, see Python time formats.
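
For instance, a sketch formatting the dt column from the example above as a date string:

>>> df.dt.strftime('%Y-%m-%d').rename('date_str').head(5)
     date_str
0  1998-04-08
1  1998-04-08
2  1998-04-08
3  1998-04-08
4  1998-04-08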

PyODPS also supports the addition and subtraction of time. For example, you can retrieve the date 3 days before the current date. Subtracting one date column from another returns the difference in milliseconds.

>>> df
                           a                          b
0 2016-12-06 16:43:12.460001 2016-12-06 17:43:12.460018
1 2016-12-06 16:43:12.460012 2016-12-06 17:43:12.460021
2 2016-12-06 16:43:12.460015 2016-12-06 17:43:12.460022
>>> from odps.df import day
>>> df.a - day(3)
                           a
0 2016-12-03 16:43:12.460001
1 2016-12-03 16:43:12.460012
2 2016-12-03 16:43:12.460015
>>> (df.b - df.a).dtype
int64
>>> (df.b - df.a).rename('a')
         a
0  3600000
1  3600000
2  3600000

The supported time offset types, all importable from odps.df as with day in the example above, are: year, month, day, hour, minute, second, and millisecond.
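
For example, a sketch that shifts column a forward by one hour using the hour offset:

>>> from odps.df import hour
>>> (df.a + hour(1)).rename('a1')
                          a1
0 2016-12-06 17:43:12.460001
1 2016-12-06 17:43:12.460012
2 2016-12-06 17:43:12.460015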

Collection type related operations

PyODPS Sequence objects support the List and Dict types. You can use subscripts to retrieve an item from either type, and the len method to retrieve the number of items in each element.

Additionally, both List and Dict support the explode method, which expands the contents of the collection. For List, explode returns one column by default; when the pos parameter is specified, it returns two columns, one of which gives the position of each value within the array, similar to Python's enumerate function. For Dict, explode returns two columns, representing the keys and values respectively. You can also pass column names to explode to name the result columns.

The following are some examples:

>>> df
   id         a                            b
0   1  [a1, b1]  {'a2': 0, 'b2': 1, 'c2': 2}
1   2      [c1]           {'d2': 3, 'e2': 4}
>>> df[df.id, df.a[0], df.b['b2']]
   id   a    b
0   1  a1    1
1   2  c1  NaN
>>> df[df.id, df.a.len(), df.b.len()]
   id  a  b
0   1  2  3
1   2  1  2
>>> df.a.explode()
    a
0  a1
1  b1
2  c1
>>> df.a.explode(pos=True)
   a_pos   a
0      0  a1
1      1  b1
2      0  c1
>>> # designate column names
>>> df.a.explode(['pos', 'value'], pos=True)
   pos value
0    0    a1
1    1    b1
2    0    c1
>>> df.b.explode()
  b_key  b_value
0    a2        0
1    b2        1
2    c2        2
3    d2        3
4    e2        4
>>> # designate column names
>>> df.b.explode(['key', 'value'])
  key  value
0  a2      0
1  b2      1
2  c2      2
3  d2      3
4  e2      4

You can also combine explode with lateral views as follows:

>>> df[df.id, df.a.explode()]
   id   a
0   1  a1
1   1  b1
2   2  c1
>>> df[df.id, df.a.explode(), df.b.explode()]
   id   a b_key  b_value
0   1  a1    a2        0
1   1  a1    b2        1
2   1  a1    c2        2
3   1  b1    a2        0
4   1  b1    b2        1
5   1  b1    c2        2
6   2  c1    d2        3
7   2  c1    e2        4

In addition to subscripts, len, and explode, the list type also supports the following methods:

contains(v): Returns whether the list contains the specified element.
sort: Sorts the list and returns a list type.
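
A brief sketch of these methods on the a column from the example above:

>>> df[df.id, df.a.contains('b1').rename('has_b1'), df.a.sort().rename('a_sorted')]
   id  has_b1  a_sorted
0   1    True  [a1, b1]
1   2   False      [c1]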

Dict also supports the following methods:

keys: Retrieves the dict keys and returns a list type.
values: Retrieves the dict values and returns a list type.
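
Similarly, a sketch applying keys and values to the b column:

>>> df[df.id, df.b.keys().rename('b_keys'), df.b.values().rename('b_values')]
   id        b_keys   b_values
0   1  [a2, b2, c2]  [0, 1, 2]
1   2      [d2, e2]     [3, 4]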

Other operations

isin or notin returns whether each element in a Sequence object exists, or does not exist, in the given collection of values.

>>> iris.sepallength.isin([4.9, 5.1]).rename('sepallength').head(5)
   sepallength
0         True
1         True
2        False
3        False
4        False

The cut method divides Sequence data into several segments.

>>> iris.sepallength.cut(range(6), labels=['0-1', '1-2', '2-3', '3-4', '4-5']).rename('sepallength_cut').head(5)
   sepallength_cut
0             None
1              4-5
2              4-5
3              4-5
4              4-5

include_under and include_over specify whether to add an interval below the lowest bound and above the highest bound, respectively.

>>> labels = ['0-1', '1-2', '2-3', '3-4', '4-5', '5-']
>>> iris.sepallength.cut(range(6), labels=labels, include_over=True).rename('sepallength_cut').head(5)
   sepallength_cut
0               5-
1              4-5
2              4-5
3              4-5
4              4-5

Use custom functions

DataFrame allows you to call the map method on a Sequence object to apply a custom function to each of its elements.

>>> iris.sepallength.map(lambda x: x + 1).head(5)
   sepallength
0          6.1
1          5.9
2          5.7
3          5.6
4          6.0

Warning

Custom functions are currently not allowed to use lists or dicts as inputs or outputs because of Python UDF limitations.

If the data type of the Sequence changes after calling map, you need to specify the new type explicitly.

>>> iris.sepallength.map(lambda x: 't' + str(x), rtype='string').head(5)
   sepallength
0         t5.1
1         t4.9
2         t4.7
3         t4.6
4         t5.0

Be careful when the mapped function is a closure: if the value of a captured variable changes outside the function, the value seen inside the function changes as well. For example:

>>> dfs = []
>>> for i in range(10):
>>>     dfs.append(df.sepal_length.map(lambda x: x + i))

Because every lambda captures the same variable i, each SequenceExpr object in dfs is now df.sepal_length + 9. To solve this problem, you can return the lambda from another function, or use functools.partial. For example:

>>> dfs = []
>>> def get_mapper(i):
>>>     return lambda x: x + i
>>> for i in range(10):
>>>     dfs.append(df.sepal_length.map(get_mapper(i)))

Or

>>> import functools
>>> dfs = []
>>> for i in range(10):
>>>     dfs.append(df.sepal_length.map(functools.partial(lambda v, x: x + v, i)))

The map method also supports existing UDFs. You can pass in a str parameter representing the function name, or a Function object.
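
For example, a sketch assuming a UDF named my_udf (a hypothetical name) is already registered in your project:

>>> # 'my_udf' is a hypothetical UDF already registered in the project
>>> iris.sepallength.map('my_udf', rtype='float')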

The implementation of map depends on MaxCompute Python UDF. If your project does not support Python UDF, you cannot use map. All Python UDF limitations apply.

The only built-in third-party library in MaxCompute is numpy. If you need to use other libraries, you need to upload them yourself. For more information, see Use third-party Python libraries below.

Warning

Due to differences in bytecode definitions, new features supported by Python 3, such as yield from, may cause errors when executed by MaxCompute workers running Python 2.7. We recommend that you make sure your code executes normally before writing production code in Python 3.

Warning

PyODPS DataFrame treats all Collection and Sequence objects as distributed objects, and does not support referencing these objects inside user-defined functions. Consider using methods like join to reference data in multiple DataFrames, or referencing Collection objects as resources, as described in the next section.

Reference Resources

Custom functions can also read MaxCompute resources, such as table and file resources, or reference Collection objects as resources. To do so, write your function as a closure or a callable class.

>>> file_resource = o.create_resource('pyodps_iris_file', 'file', file_obj='Iris-setosa')
>>>
>>> iris_names_collection = iris.distinct('name')[:2]
>>> iris_names_collection
              name
0      Iris-setosa
1  Iris-versicolor
>>> def myfunc(resources):  # resources passed by calling order
>>>     names = set()
>>>     fileobj = resources[0] # file resources are file-like objects
>>>     for l in fileobj:
>>>         names.add(l)
>>>     collection = resources[1]
>>>     for r in collection:
>>>         names.add(r.name)  # values can be obtained via column name or column offset
>>>     def h(x):
>>>         if x in names:
>>>             return True
>>>         else:
>>>             return False
>>>     return h
>>>
>>> df = iris.distinct('name')
>>> df = df[df.name,
>>>         df.name.map(myfunc, resources=[file_resource, iris_names_collection], rtype='boolean').rename('isin')]
>>>
>>> df
              name   isin
0      Iris-setosa   True
1  Iris-versicolor   True
2   Iris-virginica  False

Note: when a partitioned table is read as a resource, its partition fields are not included.

Use third-party Python libraries

You can upload third-party wheel packages as resources to MaxCompute. You need to specify the package files either globally or in the methods that execute the DataFrame immediately. Note that you also need to add the dependencies of your third-party libraries, or imports may fail.

If your MaxCompute service supports specifying images when executing SQL statements, you can pass the image argument to execute, persist, or to_pandas to use these images. Likewise, the libraries argument of execute, persist, or to_pandas specifies resources to be used as third-party libraries. The PyODPS installation provides the pyodps-pack tool for packing third-party libraries; see its documentation for how to build and use these packages.
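
As a minimal sketch, assuming a wheel package has been uploaded as a resource named my_pkg.whl (a hypothetical name):

>>> from odps import options
>>> # specify third-party packages globally for all DataFrame executions
>>> options.df.libraries = ['my_pkg.whl']
>>> # or specify packages for a single execution only
>>> df.to_pandas(libraries=['my_pkg.whl'])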

Warning

Due to differences in bytecode definitions, new features supported by Python 3, such as yield from, may cause errors when executed by MaxCompute workers running Python 2.7. We recommend that you make sure your code executes normally before writing production code in Python 3.

Use the counter
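
Within a custom function, you can obtain counters from the UDF execution context and increment them: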

from odps.udf import get_execution_context

def h(x):
    ctx = get_execution_context()
    counters = ctx.get_counters()
    counters.get_counter('df', 'add_one').increment(1)
    return x + 1

df.field.map(h)

You can find the value of the counter in the JSONSummary of LogView.

Call MaxCompute built-in functions or UDFs

You can use the func interface to call MaxCompute built-in functions or UDFs to generate columns. This interface returns a string type by default. You can use the rtype parameter to specify the return type.

>>> from odps.df import func
>>>
>>> iris[iris.name, func.rand(rtype='float').rename('rand')][:4]
>>> iris[iris.name, func.rand(10, rtype='float').rename('rand')][:4]
>>> # call UDFs defined on MaxCompute. you need to name your result sequence
>>> # explicitly if the name of returned sequence cannot be determined
>>> iris[iris.name, func.your_udf(iris.sepalwidth, iris.sepallength, rtype='float').rename('new_col')]
>>> # call a UDF from another project. the sequence can also be renamed via the name argument
>>> iris[iris.name, func.your_udf(iris.sepalwidth, iris.sepallength, rtype='float', project='udf_project',
>>>                               name='new_col')]

Note

When executing under the pandas backend, expressions that contain func are not supported.