Column operations

The examples in this section use the following DataFrame objects:

from odps.df import DataFrame
iris = DataFrame(o.get_table('pyodps_iris'))
lens = DataFrame(o.get_table('pyodps_ml_100k_lens'))

For a Sequence object, operations such as adding a constant or applying the sin function are performed on all of its elements. The available operations are described in the following sections.

NULL-related functions (isnull, notnull, fillna)

The DataFrame API provides several NULL-related built-in functions. You can use isnull or notnull to determine whether a field is NULL, and fillna to replace NULL fields with specified values.

>>> iris.sepallength.isnull().head(5)
   sepallength
0        False
1        False
2        False
3        False
4        False
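
The fillna method works similarly. The following is a minimal sketch (the replacement value 0.0 is an arbitrary choice; output is omitted):

>>> # replace NULL values in sepallength with 0.0
>>> iris.sepallength.fillna(0.0).head(5)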

Logic functions (ifelse, switch)

The ifelse function acts on boolean fields and returns the first parameter if the condition is true, or the second parameter if it is false.

>>> (iris.sepallength > 5).ifelse('gt5', 'lte5').rename('cmp5').head(5)
   cmp5
0   gt5
1  lte5
2  lte5
3  lte5
4  lte5

The switch function is used to handle multiple conditions.

>>> iris.sepallength.switch(4.9, 'eq4.9', 5.0, 'eq5.0', default='noeq').rename('equalness').head(5)
   equalness
0       noeq
1      eq4.9
2       noeq
3       noeq
4      eq5.0
>>> from odps.df import switch
>>> switch(iris.sepallength == 4.9, 'eq4.9', iris.sepallength == 5.0, 'eq5.0', default='noeq').rename('equalness').head(5)
   equalness
0       noeq
1      eq4.9
2       noeq
3       noeq
4      eq5.0

In PyODPS 0.7.8 and higher, you can also set the column values of a dataset based on conditions:

>>> iris[iris.sepallength > 5, 'cmp5'] = 'gt5'
>>> iris[iris.sepallength <= 5, 'cmp5'] = 'lte5'
>>> iris.head(5)
   cmp5
0   gt5
1  lte5
2  lte5
3  lte5
4  lte5

Mathematical operations

For numeric fields, arithmetic operations such as addition (+), subtraction (-), multiplication (*), and division (/) are supported. Functions such as log and sin are also supported.

>>> (iris.sepallength * 10).log().head(5)
   sepallength
0     3.931826
1     3.891820
2     3.850148
3     3.828641
4     3.912023
>>> fields = [iris.sepallength,
>>>           (iris.sepallength / 2).rename('sepallength_div_2'),
>>>           (iris.sepallength ** 2).rename('sepallength_square')]
>>> iris[fields].head(5)
   sepallength  sepallength_div_2  sepallength_square
0          5.1               2.55               26.01
1          4.9               2.45               24.01
2          4.7               2.35               22.09
3          4.6               2.30               21.16
4          5.0               2.50               25.00

The supported arithmetic operations include:

Arithmetic operation Description
abs Returns the absolute value of the given number.
sqrt Returns the square root of the given number.
sin  
sinh  
cos  
cosh  
tan  
tanh  
arccos  
arccosh  
arcsin  
arcsinh  
arctan  
arctanh  
exp Returns e^x of the given number x.
expm1 Returns e^x-1 of the given number x.
log Returns the logarithm of the given number. A base can be passed as a parameter; the natural logarithm is returned by default, as in the example above.
log2  
log10  
log1p log(1+x)
radians Converts degrees to radians.
degrees Converts radians to degrees.
ceil Returns the smallest integer that is no less than the given number.
floor Returns the largest integer that is no greater than the given number.
trunc Returns a number truncated to the specified decimal place.
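
These functions can be chained like any other Sequence operation; a quick sketch (outputs omitted):

>>> # element-wise square root, and absolute deviation from 5
>>> iris.sepallength.sqrt().head(5)
>>> (iris.sepallength - 5).abs().head(5)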

You can compare a Sequence object with another Sequence or Scalar object.

>>> (iris.sepallength < 5).head(5)
   sepallength
0        False
1         True
2         True
3         True
4        False

Note that the DataFrame API does not support chained comparisons such as 3 <= iris.sepallength <= 5. Instead, you can use the between function to determine whether a field falls within a certain range.

>>> (iris.sepallength.between(3, 5)).head(5)
   sepallength
0        False
1         True
2         True
3         True
4         True

By default, the between function specifies an interval that includes endpoints. To specify an open interval, set inclusive to False.

>>> (iris.sepallength.between(3, 5, inclusive=False)).head(5)
   sepallength
0        False
1         True
2         True
3         True
4        False

String-related operations

The DataFrame API provides a number of string-related operations for Sequence and Scalar objects.

>>> fields = [
>>>     iris.name.upper().rename('upper_name'),
>>>     iris.name.extract('Iris(.*)', group=1)
>>> ]
>>> iris[fields].head(5)
    upper_name     name
0  IRIS-SETOSA  -setosa
1  IRIS-SETOSA  -setosa
2  IRIS-SETOSA  -setosa
3  IRIS-SETOSA  -setosa
4  IRIS-SETOSA  -setosa

The supported string operations are:

Operation Description
capitalize  
contains Returns whether the string contains the given substring. The substring is treated as a regular expression if the regex parameter is True, which is the default.
count Counts the number of occurrences of the specified string.
endswith Returns whether the given string ends with the specified string.
startswith Returns whether the given string starts with the specified string.
extract Extracts substrings using a regular expression. If group is not specified, returns the substrings that match the pattern; otherwise, returns the specified group.
find Searches from left to right and returns the position of the first occurrence of the specified substring, or -1 if no match is found.
rfind Searches from right to left and returns the position of the first occurrence of the specified substring, or -1 if no match is found.
replace Replaces the substrings that match the pattern with another string. If n is specified, replaces at most n occurrences.
get Returns the character at the specified position.
len Returns the length of the string.
ljust Pads the string with the character fillchar on the right until it reaches the specified length width. The space character is used by default.
rjust Pads the string with the character fillchar on the left until it reaches the specified length width. The space character is used by default.
lower Converts the string to lowercase.
upper Converts the string to uppercase.
lstrip Removes spaces (including blank lines) on the left of the string.
rstrip Removes spaces (including blank lines) on the right of the string.
strip Removes spaces (including blank lines) on both sides of the string.
split Splits the string at the specified separator and returns a list<string> type.
pad Pads the string with the character fillchar on the specified position which may be left, right or both. The space character is used by default.
repeat Repeats the string n times.
slice Performs slice operations.
swapcase Converts all the uppercase characters to lowercase and all the lowercase characters to uppercase in the string.
title The same as str.title.
zfill Pads the string with the character 0 on the left until it reaches the specified length width.
isalnum The same as str.isalnum.
isalpha The same as str.isalpha.
isdigit Returns True if all the characters in the string are digits. This is the same as str.isdigit.
isspace Returns True if all the characters in the string are spaces. This is the same as str.isspace.
islower Returns True if all the cased characters in the string are lowercase. This is the same as str.islower.
isupper Returns True if all the cased characters in the string are uppercase. This is the same as str.isupper.
istitle This is the same as str.istitle.
isnumeric The same as str.isnumeric.
isdecimal This is the same as str.isdecimal.
todict Splits the string into a dict at the specified separators. Two parameters are required: the item separator and the key-value separator. Returns a dict<string, string> type.
strptime Converts the string representing a time to a time type according to the specified format. The time format is the same as specified in the standard Python library. For details, see Python time formats.
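
For example, strptime can parse a string column into a datetime Sequence. A minimal sketch, assuming a hypothetical string column datestr holding values such as '2016-08-26' (output omitted):

>>> # parse strings of the form YYYY-MM-DD into datetime values
>>> df.datestr.strptime('%Y-%m-%d')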

Time-related operations

For datetime type Sequence or Scalar objects, the following time-related built-in functions are provided.

>>> df = lens[[lens.unix_timestamp.astype('datetime').rename('dt')]]
>>> df[df.dt,
>>>    df.dt.year.rename('year'),
>>>    df.dt.month.rename('month'),
>>>    df.dt.day.rename('day'),
>>>    df.dt.hour.rename('hour')].head(5)
                    dt  year  month  day  hour
0  1998-04-08 11:02:00  1998      4    8    11
1  1998-04-08 10:57:55  1998      4    8    10
2  1998-04-08 10:45:26  1998      4    8    10
3  1998-04-08 10:25:52  1998      4    8    10
4  1998-04-08 10:44:19  1998      4    8    10

Time-related attributes are:

Time-related attribute Description
year  
month  
day  
hour  
minute  
second  
weekofyear Returns the week of the year for the specified date. Monday is the first day of a week.
weekday Returns the day of the week for the specified date.
dayofweek The same as weekday.
strftime Converts a time to a string type according to the specified format. The time format is the same as in the standard Python library. For more information, see Python time formats.
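
Conversely, strftime formats a datetime column back into strings. A sketch using the df.dt column from the example above (output omitted):

>>> df.dt.strftime('%Y-%m-%d').rename('dt_str').head(5)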

PyODPS also supports the addition and subtraction of time. For example, you can retrieve the date 3 days before the current date. Subtracting one date column from another returns the difference in milliseconds.

>>> df
                           a                          b
0 2016-12-06 16:43:12.460001 2016-12-06 17:43:12.460018
1 2016-12-06 16:43:12.460012 2016-12-06 17:43:12.460021
2 2016-12-06 16:43:12.460015 2016-12-06 17:43:12.460022
>>> from odps.df import day
>>> df.a - day(3)
                           a
0 2016-12-03 16:43:12.460001
1 2016-12-03 16:43:12.460012
2 2016-12-03 16:43:12.460015
>>> (df.b - df.a).dtype
int64
>>> (df.b - df.a).rename('a')
         a
0  3600000
1  3600000
2  3600000

The supported time offset types include:

Offset Description
year  
month  
day  
hour  
minute  
second  
millisecond  
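
Like day in the example above, these offsets can be imported from odps.df and added to or subtracted from a datetime column; a sketch (outputs omitted):

>>> from odps.df import hour, millisecond
>>> df.a + hour(1)           # shift forward by one hour
>>> df.b - millisecond(500)  # shift backward by 500 milliseconds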

Collection-type operations

PyODPS Sequence objects support List and Dict types. You can use subscripts to retrieve an item of either type, and use the len method to retrieve the number of items in each element.

Additionally, both List and Dict support the explode method, which expands the contents of the collection. For List, explode returns one column by default; when the pos parameter is specified, it returns two columns, one of which is the position of the value in the array, similar to Python's enumerate function. For Dict, explode returns two columns, representing the keys and values respectively. You can also pass column names to explode to name the resulting columns.

The following are some examples:

>>> df
   id         a                            b
0   1  [a1, b1]  {'a2': 0, 'b2': 1, 'c2': 2}
1   2      [c1]           {'d2': 3, 'e2': 4}
>>> df[df.id, df.a[0], df.b['b2']]
   id   a    b
0   1  a1    1
1   2  c1  NaN
>>> df[df.id, df.a.len(), df.b.len()]
   id  a  b
0   1  2  3
1   2  1  2
>>> df.a.explode()
    a
0  a1
1  b1
2  c1
>>> df.a.explode(pos=True)
   a_pos   a
0      0  a1
1      1  b1
2      0  c1
>>> # designate column names
>>> df.a.explode(['pos', 'value'], pos=True)
   pos value
0    0    a1
1    1    b1
2    0    c1
>>> df.b.explode()
  b_key  b_value
0    a2        0
1    b2        1
2    c2        2
3    d2        3
4    e2        4
>>> # designate column names
>>> df.b.explode(['key', 'value'])
  key  value
0  a2      0
1  b2      1
2  c2      2
3  d2      3
4  e2      4

You can also combine explode with other columns, which forms a lateral view:

>>> df[df.id, df.a.explode()]
   id   a
0   1  a1
1   1  b1
2   2  c1
>>> df[df.id, df.a.explode(), df.b.explode()]
   id   a b_key  b_value
0   1  a1    a2        0
1   1  a1    b2        1
2   1  a1    c2        2
3   1  b1    a2        0
4   1  b1    b2        1
5   1  b1    c2        2
6   2  c1    d2        3
7   2  c1    e2        4

In addition to subscripts, len, and explode, the list type also supports the following methods:

List operation Description
contains(v) Returns whether the list contains the specified element.
sort Sorts the list and returns a list type.

Dict also supports the following methods:

Dict operation Description
keys Retrieves Dict keys and returns a list type.
values Retrieves Dict values and returns a list type.
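
Using the df from the explode examples above, these methods might be applied as follows (a sketch; outputs omitted):

>>> df.a.contains('a1')  # whether each list contains 'a1'
>>> df.a.sort()          # each list, sorted
>>> df.b.keys()          # the keys of each dict, as a list
>>> df.b.values()        # the values of each dict, as a list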

Other operations

The isin and notin methods return whether each element in a Sequence object exists in a given collection of elements.

>>> iris.sepallength.isin([4.9, 5.1]).rename('sepallength').head(5)
   sepallength
0         True
1         True
2        False
3        False
4        False

The cut method divides Sequence data into several segments.

>>> iris.sepallength.cut(range(6), labels=['0-1', '1-2', '2-3', '3-4', '4-5']).rename('sepallength_cut').head(5)
   sepallength_cut
0             None
1              4-5
2              4-5
3              4-5
4              4-5

The include_under and include_over parameters specify whether to add an interval below the lowest bin and one above the highest bin, respectively.

>>> labels = ['0-1', '1-2', '2-3', '3-4', '4-5', '5-']
>>> iris.sepallength.cut(range(6), labels=labels, include_over=True).rename('sepallength_cut').head(5)
   sepallength_cut
0               5-
1              4-5
2              4-5
3              4-5
4              4-5

Use custom functions

DataFrame allows you to call the map method on a Sequence object to apply a custom function to each of its elements.

>>> iris.sepallength.map(lambda x: x + 1).head(5)
   sepallength
0          6.1
1          5.9
2          5.7
3          5.6
4          6.0

Warning

Custom functions are currently not allowed to use lists or dicts as inputs or outputs because of Python UDF limitations.

If the type of the Sequence changes after map is called, you need to specify the new type explicitly.

>>> iris.sepallength.map(lambda x: 't'+str(x), 'string').head(5)
   sepallength
0         t5.1
1         t4.9
2         t4.7
3         t4.6
4         t5.0

Note that if a function is a closure, a change to the value of a closure variable outside the function also changes the value used within the function. For example:

>>> dfs = []
>>> for i in range(10):
>>>     dfs.append(df.sepal_length.map(lambda x: x + i))

Each SequenceExpr object in dfs is now equivalent to df.sepal_length + 9. To solve this problem, you can return the function from another function, or use functools.partial. For example:

>>> dfs = []
>>> def get_mapper(i):
>>>     return lambda x: x + i
>>> for i in range(10):
>>>     dfs.append(df.sepal_length.map(get_mapper(i)))

Or

>>> import functools
>>> dfs = []
>>> for i in range(10):
>>>     dfs.append(df.sepal_length.map(functools.partial(lambda v, x: x + v, i)))

The map method also supports existing UDFs. You can pass in str type parameters, which represent function names, or Function objects.
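
A minimal sketch that references a hypothetical UDF named your_udf already registered in the project (output omitted):

>>> iris.sepallength.map('your_udf', rtype='float')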

The implementation of map depends on MaxCompute Python UDF. If your project does not support Python UDF, you cannot use map. All Python UDF limitations apply.

The only built-in third-party library in MaxCompute is numpy. If you need to use other libraries, you need to upload them yourself. For more information, see the section Use third-party Python libraries below.

In addition to custom functions, you can use many of the built-in functions that DataFrame provides. The implementation of some of these functions depends on map, which means your project must support Python UDF before you can use them. (Note: you may need to apply for Python UDF support.)

Warning

Due to differences in bytecode definitions, new language features supported by Python 3, such as yield from, may cause errors when the code is executed by a MaxCompute Worker running Python 2.7. We recommend that you make sure your code executes normally before writing production code with the MapReduce API in Python 3.

Reference resources

Custom functions can also read MaxCompute resources, such as table and file resources, or reference Collection objects as resources. To do that, you need to write your functions as a closure or callable class.

>>> file_resource = o.create_resource('pyodps_iris_file', 'file', file_obj='Iris-setosa')
>>>
>>> iris_names_collection = iris.distinct('name')[:2]
>>> iris_names_collection
              name
0      Iris-setosa
1  Iris-versicolor
>>> def myfunc(resources):  # resources passed by calling order
>>>     names = set()
>>>     fileobj = resources[0] # file resources are file-like objects
>>>     for l in fileobj:
>>>         names.add(l)
>>>     collection = resources[1]
>>>     for r in collection:
>>>         names.add(r.name)  # values can be obtained via column name or column offset
>>>     def h(x):
>>>         if x in names:
>>>             return True
>>>         else:
>>>             return False
>>>     return h
>>>
>>> df = iris.distinct('name')
>>> df = df[df.name,
>>>         df.name.map(myfunc, resources=[file_resource, iris_names_collection], rtype='boolean').rename('isin')]
>>>
>>> df
              name   isin
0      Iris-setosa   True
1  Iris-versicolor   True
2   Iris-virginica  False

Note that when a partitioned table is read as a resource, partition fields are not included.

Use third-party Python libraries

You can upload third-party Python packages as resources to MaxCompute. Supported formats include .whl, .egg, .zip, and .tar.gz. You need to specify the package files when using them globally or in individual actions. Custom functions can also use third-party libraries.

Note that the dependencies of the third-party libraries also need to be specified. Otherwise errors may occur when importing the files.

Take the following python-dateutil package as an example.

First, use the pip download command to download the package and its dependencies to a specific path. Two packages are downloaded: six-1.10.0-py2.py3-none-any.whl and python_dateutil-2.5.3-py2.py3-none-any.whl. Note that the packages must support the Linux environment.

$ pip download python-dateutil -d /to/path/

Then upload the files to MaxCompute as resources.

>>> # make sure that file extensions are correct
>>> odps.create_resource('six.whl', 'file', file_obj=open('six-1.10.0-py2.py3-none-any.whl', 'rb'))
>>> odps.create_resource('python_dateutil.whl', 'file', file_obj=open('python_dateutil-2.5.3-py2.py3-none-any.whl', 'rb'))

Suppose you have a DataFrame object that contains only a string field.

>>> df
               datestr
0  2016-08-26 14:03:29
1  2015-08-26 14:03:29

Specify the third-party libraries globally:

>>> from odps import options
>>>
>>> def get_year(t):
>>>     from dateutil.parser import parse
>>>     return parse(t).strftime('%Y')
>>>
>>> options.df.libraries = ['six.whl', 'python_dateutil.whl']
>>> df.datestr.map(get_year)
   datestr
0     2016
1     2015

Or use the libraries attribute of an action to specify the package:

>>> def get_year(t):
>>>     from dateutil.parser import parse
>>>     return parse(t).strftime('%Y')
>>>
>>> df.datestr.map(get_year).execute(libraries=['six.whl', 'python_dateutil.whl'])
   datestr
0     2016
1     2015

By default, PyODPS supports third-party libraries that contain pure Python code and no file operations. In newer versions of MaxCompute, PyODPS also supports Python libraries that contain binary code or file operations. These libraries must be suffixed with cp27-cp27m-manylinux1_x86_64 and uploaded in the archive format, which means .whl packages must be renamed to .zip files. You also need to enable the odps.isolation.session.enable option or enable Isolation in your project. The following example demonstrates how to upload and use the special functions in scipy:

>>> # packages containing binaries should be uploaded with archive method,
>>> # replacing extension .whl with .zip.
>>> odps.create_resource('scipy.zip', 'archive', file_obj=open('scipy-0.19.0-cp27-cp27m-manylinux1_x86_64.whl', 'rb'))
>>>
>>> # if your project has already been configured with isolation, the line below can be omitted
>>> options.sql.settings = { 'odps.isolation.session.enable': True }
>>>
>>> def psi(value):
>>>     # it is recommended to import third-party libraries inside your function
>>>     # in case that structures of the same package differ between different systems.
>>>     from scipy.special import psi
>>>     return float(psi(value))
>>>
>>> df.float_col.map(psi).execute(libraries=['scipy.zip'])

Binary packages that are distributed only as source code can be built into .whl files in a Linux shell and then uploaded. .whl files generated on Mac or Windows are not usable in MaxCompute:

python setup.py bdist_wheel

Use the counter

Custom functions can use counters through the MaxCompute UDF execution context:

from odps.udf import get_execution_context

def h(x):
    ctx = get_execution_context()
    counters = ctx.get_counters()
    counters.get_counter('df', 'add_one').increment(1)
    return x + 1

df.field.map(h)

You can find the value of the counter in the JSONSummary of LogView.

Call MaxCompute built-in functions or UDFs

You can use the func interface to call MaxCompute built-in functions or UDFs to generate columns. This interface returns a string type by default. You can use the rtype parameter to specify the return type.

>>> from odps.df import func
>>>
>>> iris[iris.name, func.rand(rtype='float').rename('rand')][:4]
>>> iris[iris.name, func.rand(10, rtype='float').rename('rand')][:4]
>>> # call UDFs defined on MaxCompute. you need to name your result sequence
>>> # explicitly if the name of returned sequence cannot be determined
>>> iris[iris.name, func.your_udf(iris.sepalwidth, iris.sepallength, rtype='float').rename('new_col')]
>>> # call a UDF from another project. the sequence can also be renamed via the name arg
>>> iris[iris.name, func.your_udf(iris.sepalwidth, iris.sepallength, rtype='float', project='udf_project',
>>>                               name='new_col')]

Note

When executing under the pandas backend, expressions that contain func are not supported.