Column operations

The examples in this section use the following DataFrame objects:

from odps.df import DataFrame
iris = DataFrame(o.get_table('pyodps_iris'))
lens = DataFrame(o.get_table('pyodps_ml_100k_lens'))

For a Sequence object, operations such as adding a constant or applying the sin function are performed on all of its elements. The available operations are described in the following sections.

NULL-related functions (isnull, notnull, fillna)

The DataFrame API provides several NULL-related built-in functions. You can use isnull or notnull to determine whether a field is NULL, and fillna to replace NULL fields with specified values.

>>> iris.sepallength.isnull().head(5)
   sepallength
0        False
1        False
2        False
3        False
4        False
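
The fillna method works similarly. The following is a minimal sketch (the replacement value 0.0 is an arbitrary choice; output is omitted):

>>> # replace NULL values in sepallength with 0.0
>>> iris.sepallength.fillna(0.0).head(5)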

Logic functions (ifelse, switch)

The ifelse function acts on boolean fields and returns the first parameter if the condition is true, or the second parameter if it is false.

>>> (iris.sepallength > 5).ifelse('gt5', 'lte5').rename('cmp5').head(5)
   cmp5
0   gt5
1  lte5
2  lte5
3  lte5
4  lte5

The switch function is used to handle multiple conditions.

>>> iris.sepallength.switch(4.9, 'eq4.9', 5.0, 'eq5.0', default='noeq').rename('equalness').head(5)
   equalness
0       noeq
1      eq4.9
2       noeq
3       noeq
4      eq5.0
>>> from odps.df import switch
>>> switch(iris.sepallength == 4.9, 'eq4.9', iris.sepallength == 5.0, 'eq5.0', default='noeq').rename('equalness').head(5)
   equalness
0       noeq
1      eq4.9
2       noeq
3       noeq
4      eq5.0

In PyODPS 0.7.8 and higher, you can also set the column values of a dataset based on conditions:

>>> iris[iris.sepallength > 5, 'cmp5'] = 'gt5'
>>> iris[iris.sepallength <= 5, 'cmp5'] = 'lte5'
>>> iris.head(5)
   cmp5
0   gt5
1  lte5
2  lte5
3  lte5
4  lte5

Mathematical operations

For numeric fields, arithmetic operations such as addition (+), subtraction (-), multiplication (*), and division (/) are supported. Functions such as log and sin are also supported.

>>> (iris.sepallength * 10).log().head(5)
   sepallength
0     3.931826
1     3.891820
2     3.850148
3     3.828641
4     3.912023
>>> fields = [iris.sepallength,
>>>           (iris.sepallength / 2).rename('sepallength_div_2'),
>>>           (iris.sepallength ** 2).rename('sepallength_square')]
>>> iris[fields].head(5)
   sepallength  sepallength_div_2  sepallength_square
0          5.1               2.55               26.01
1          4.9               2.45               24.01
2          4.7               2.35               22.09
3          4.6               2.30               21.16
4          5.0               2.50               25.00

The supported arithmetic operations include:

Arithmetic operation Description
abs Returns the absolute value of the given number.
sqrt Returns the square root of the given number.
sin  
sinh  
cos  
cosh  
tan  
tanh  
arccos  
arccosh  
arcsin  
arcsinh  
arctan  
arctanh  
exp Returns e^x of the given number x.
expm1 Returns e^x-1 of the given number x.
log Returns the logarithm of the given number. A base can be passed as a parameter; the natural logarithm is returned by default, as in the example above.
log2  
log10  
log1p log(1+x)
radians Converts degrees to radians.
degrees Converts radians to degrees.
ceil Returns the smallest integer that is no less than the given number.
floor Returns the largest integer that is no greater than the given number.
trunc Returns a number truncated to the specified decimal place.
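
These functions can be chained like any other Sequence operation; a quick sketch (outputs omitted):

>>> # element-wise square root, and absolute deviation from 5
>>> iris.sepallength.sqrt().head(5)
>>> (iris.sepallength - 5).abs().head(5)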

You can compare a Sequence object with another Sequence or Scalar object.

>>> (iris.sepallength < 5).head(5)
   sepallength
0        False
1         True
2         True
3         True
4        False

Note that the DataFrame API does not support chained comparisons such as 3 <= iris.sepallength <= 5. Instead, you can use the between function to determine whether a field falls within a certain range.

>>> (iris.sepallength.between(3, 5)).head(5)
   sepallength
0        False
1         True
2         True
3         True
4         True

By default, the between function specifies an interval that includes endpoints. To specify an open interval, set inclusive to False.

>>> (iris.sepallength.between(3, 5, inclusive=False)).head(5)
   sepallength
0        False
1         True
2         True
3         True
4        False

String-related operations

The DataFrame API provides a number of string-related operations for Sequence and Scalar objects.

>>> fields = [
>>>     iris.name.upper().rename('upper_name'),
>>>     iris.name.extract('Iris(.*)', group=1)
>>> ]
>>> iris[fields].head(5)
    upper_name     name
0  IRIS-SETOSA  -setosa
1  IRIS-SETOSA  -setosa
2  IRIS-SETOSA  -setosa
3  IRIS-SETOSA  -setosa
4  IRIS-SETOSA  -setosa

The supported string operations are:

Operation Description
capitalize  
contains Returns whether the string contains the given substring. The substring is treated as a regular expression if the regex parameter is True, which is the default.
count Counts the number of occurrences of the specified string.
endswith Returns whether the given string ends with the specified string.
startswith Returns whether the given string starts with the specified string.
extract Extracts substrings using a regular expression. If group is not specified, returns the substrings that match the pattern; otherwise, returns the specified group.
find Searches from left to right and returns the position of the first occurrence of the specified substring, or -1 if no match is found.
rfind Searches from right to left and returns the position of the first occurrence of the specified substring, or -1 if no match is found.
replace Replaces the substrings that match the pattern with another string. If n is specified, replaces at most n occurrences.
get Returns the character at the specified position.
len Returns the length of the string.
ljust Pads the string with the character fillchar on the right until it reaches the specified length width. The space character is used by default.
rjust Pads the string with the character fillchar on the left until it reaches the specified length width. The space character is used by default.
lower Converts the string to lowercase.
upper Converts the string to uppercase.
lstrip Removes spaces (including blank lines) on the left of the string.
rstrip Removes spaces (including blank lines) on the right of the string.
strip Removes spaces (including blank lines) on both sides of the string.
split Splits the string at the specified separator and returns a list<string> type.
pad Pads the string with the character fillchar on the specified position which may be left, right or both. The space character is used by default.
repeat Repeats the string n times.
slice Performs slice operations.
swapcase Converts all the uppercase characters to lowercase and all the lowercase characters to uppercase in the string.
title The same as str.title.
zfill Pads the string with the character 0 on the left until it reaches the specified length width.
isalnum The same as str.isalnum.
isalpha The same as str.isalpha.
isdigit Returns True if all the characters in the string are digits. This is the same as str.isdigit.
isspace Returns True if all the characters in the string are spaces. This is the same as str.isspace.
islower Returns True if all the cased characters in the string are lowercase. This is the same as str.islower.
isupper Returns True if all the cased characters in the string are uppercase. This is the same as str.isupper.
istitle This is the same as str.istitle.
isnumeric The same as str.isnumeric.
isdecimal This is the same as str.isdecimal.
todict Splits the string into a dict at the specified separators. Two parameters are required: the item separator and the key-value separator. Returns a dict<string, string> type.
strptime Converts the string representing a time to a time type according to the specified format. The time format is the same as specified in the standard Python library. For details, see Python time formats.
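
For example, strptime can parse a string column into a datetime Sequence. A minimal sketch, assuming a hypothetical string column datestr holding values such as '2016-08-26' (output omitted):

>>> # parse strings of the form YYYY-MM-DD into datetime values
>>> df.datestr.strptime('%Y-%m-%d')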

Time-related operations

For datetime type Sequence or Scalar objects, the following time-related built-in functions are provided.

>>> df = lens[[lens.unix_timestamp.astype('datetime').rename('dt')]]
>>> df[df.dt,
>>>    df.dt.year.rename('year'),
>>>    df.dt.month.rename('month'),
>>>    df.dt.day.rename('day'),
>>>    df.dt.hour.rename('hour')].head(5)
                    dt  year  month  day  hour
0  1998-04-08 11:02:00  1998      4    8    11
1  1998-04-08 10:57:55  1998      4    8    10
2  1998-04-08 10:45:26  1998      4    8    10
3  1998-04-08 10:25:52  1998      4    8    10
4  1998-04-08 10:44:19  1998      4    8    10

Time-related attributes are:

Time-related attribute Description
year  
month  
day  
hour  
minute  
second  
weekofyear Returns the week of the year for the specified date. Monday is the first day of a week.
weekday Returns the day of the week for the specified date.
dayofweek The same as weekday.
strftime Converts a time to a string type according to the specified format. The time format is the same as in the standard Python library. For more information, see Python time formats.
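
Conversely, strftime formats a datetime column back into strings. A sketch using the df.dt column from the example above (output omitted):

>>> df.dt.strftime('%Y-%m-%d').rename('dt_str').head(5)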

PyODPS also supports the addition and subtraction of time. For example, you can retrieve the date 3 days before the current date. Subtracting one date column from another returns the difference in milliseconds.

>>> df
                           a                          b
0 2016-12-06 16:43:12.460001 2016-12-06 17:43:12.460018
1 2016-12-06 16:43:12.460012 2016-12-06 17:43:12.460021
2 2016-12-06 16:43:12.460015 2016-12-06 17:43:12.460022
>>> from odps.df import day
>>> df.a - day(3)
                           a
0 2016-12-03 16:43:12.460001
1 2016-12-03 16:43:12.460012
2 2016-12-03 16:43:12.460015
>>> (df.b - df.a).dtype
int64
>>> (df.b - df.a).rename('a')
         a
0  3600000
1  3600000
2  3600000

The supported time offset types include:

Offset Description
year  
month  
day  
hour  
minute  
second  
millisecond  
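
Like day in the example above, these offsets can be imported from odps.df and added to or subtracted from a datetime column; a sketch (outputs omitted):

>>> from odps.df import hour, millisecond
>>> df.a + hour(1)           # shift forward by one hour
>>> df.b - millisecond(500)  # shift backward by 500 milliseconds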

Collection-type operations

PyODPS Sequence objects support List and Dict types. You can use subscripts to retrieve an item of either type, and use the len method to retrieve the number of items in each element.

Additionally, both List and Dict support the explode method, which expands the contents of the collection. For List, explode returns one column by default; when the pos parameter is specified, it returns two columns, one of which is the position of the value in the array, similar to Python's enumerate function. For Dict, explode returns two columns, representing the keys and values respectively. You can also pass column names to explode to name the resulting columns.

The following are some examples:

>>> df
   id         a                            b
0   1  [a1, b1]  {'a2': 0, 'b2': 1, 'c2': 2}
1   2      [c1]           {'d2': 3, 'e2': 4}
>>> df[df.id, df.a[0], df.b['b2']]
   id   a    b
0   1  a1    1
1   2  c1  NaN
>>> df[df.id, df.a.len(), df.b.len()]
   id  a  b
0   1  2  3
1   2  1  2
>>> df.a.explode()
    a
0  a1
1  b1
2  c1
>>> df.a.explode(pos=True)
   a_pos   a
0      0  a1
1      1  b1
2      0  c1
>>> # designate column names
>>> df.a.explode(['pos', 'value'], pos=True)
   pos value
0    0    a1
1    1    b1
2    0    c1
>>> df.b.explode()
  b_key  b_value
0    a2        0
1    b2        1
2    c2        2
3    d2        3
4    e2        4
>>> # designate column names
>>> df.b.explode(['key', 'value'])
  key  value
0  a2      0
1  b2      1
2  c2      2
3  d2      3
4  e2      4

You can also combine explode with other columns, which forms a lateral view:

>>> df[df.id, df.a.explode()]
   id   a
0   1  a1
1   1  b1
2   2  c1
>>> df[df.id, df.a.explode(), df.b.explode()]
   id   a b_key  b_value
0   1  a1    a2        0
1   1  a1    b2        1
2   1  a1    c2        2
3   1  b1    a2        0
4   1  b1    b2        1
5   1  b1    c2        2
6   2  c1    d2        3
7   2  c1    e2        4

In addition to subscripts, len, and explode, the list type also supports the following methods:

List operation Description
contains(v) Returns whether the list contains the specified element.
sort Sorts the list and returns a list type.

Dict also supports the following methods:

Dict operation Description
keys Retrieves Dict keys and returns a list type.
values Retrieves Dict values and returns a list type.
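
Using the df from the explode examples above, these methods might be applied as follows (a sketch; outputs omitted):

>>> df.a.contains('a1')  # whether each list contains 'a1'
>>> df.a.sort()          # each list, sorted
>>> df.b.keys()          # the keys of each dict, as a list
>>> df.b.values()        # the values of each dict, as a list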

Other operations

The isin and notin methods return whether each element in a Sequence object exists in a given collection of elements.

>>> iris.sepallength.isin([4.9, 5.1]).rename('sepallength').head(5)
   sepallength
0         True
1         True
2        False
3        False
4        False

The cut method divides Sequence data into several segments.

>>> iris.sepallength.cut(range(6), labels=['0-1', '1-2', '2-3', '3-4', '4-5']).rename('sepallength_cut').head(5)
   sepallength_cut
0             None
1              4-5
2              4-5
3              4-5
4              4-5

The include_under and include_over parameters specify whether to add an interval below the lowest bin and one above the highest bin, respectively.

>>> labels = ['0-1', '1-2', '2-3', '3-4', '4-5', '5-']
>>> iris.sepallength.cut(range(6), labels=labels, include_over=True).rename('sepallength_cut').head(5)
   sepallength_cut
0               5-
1              4-5
2              4-5
3              4-5
4              4-5

Use custom functions

DataFrame allows you to call the map method on a Sequence object to apply a custom function to each of its elements.

>>> iris.sepallength.map(lambda x: x + 1).head(5)
   sepallength
0          6.1
1          5.9
2          5.7
3          5.6
4          6.0

Warning

Custom functions are currently not allowed to use lists or dicts as inputs or outputs because of Python UDF limitations.

If the type of the Sequence changes after map is called, you need to specify the new type explicitly.

>>> iris.sepallength.map(lambda x: 't'+str(x), 'string').head(5)
   sepallength
0         t5.1
1         t4.9
2         t4.7
3         t4.6
4         t5.0

Note that if a function is a closure, a change to the value of a closure variable outside the function also changes the value used within the function. For example:

>>> dfs = []
>>> for i in range(10):
>>>     dfs.append(df.sepal_length.map(lambda x: x + i))

Each SequenceExpr object in dfs is now equivalent to df.sepal_length + 9. To solve this problem, you can return the function from another function, or use functools.partial. For example:

>>> dfs = []
>>> def get_mapper(i):
>>>     return lambda x: x + i
>>> for i in range(10):
>>>     dfs.append(df.sepal_length.map(get_mapper(i)))

Or

>>> import functools
>>> dfs = []
>>> for i in range(10):
>>>     dfs.append(df.sepal_length.map(functools.partial(lambda v, x: x + v, i)))

The map method also supports existing UDFs. You can pass in str type parameters, which represent function names, or Function objects.
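
A minimal sketch that references a hypothetical UDF named your_udf already registered in the project (output omitted):

>>> iris.sepallength.map('your_udf', rtype='float')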

The implementation of map depends on MaxCompute Python UDF. If your project does not support Python UDF, you cannot use map. All Python UDF limitations apply.

The only built-in third-party library in MaxCompute is numpy. If you need to use other libraries, you need to upload them yourself. For more information, see the section Use third-party Python libraries below.

In addition to custom functions, you can use many of the built-in functions that DataFrame provides. The implementation of some of these functions depends on map, which means your project must support Python UDF before you can use them. (Note: you may need to apply for Python UDF support.)

Warning

Due to differences in bytecode definitions, new language features supported by Python 3, such as yield from, may cause errors when the code is executed by a MaxCompute Worker running Python 2.7. We recommend that you make sure your code executes normally before writing production code with the MapReduce API in Python 3.

Reference resources

Custom functions can also read MaxCompute resources, such as table and file resources, or reference Collection objects as resources. To do that, you need to write your functions as a closure or callable class.

>>> file_resource = o.create_resource('pyodps_iris_file', 'file', file_obj='Iris-setosa')
>>>
>>> iris_names_collection = iris.distinct('name')[:2]
>>> iris_names_collection
              name
0      Iris-setosa
1  Iris-versicolor
>>> def myfunc(resources):  # resources passed by calling order
>>>     names = set()
>>>     fileobj = resources[0] # file resources are file-like objects
>>>     for l in fileobj:
>>>         names.add(l)
>>>     collection = resources[1]
>>>     for r in collection:
>>>         names.add(r.name)  # values can be obtained via column name or column offset
>>>     def h(x):
>>>         if x in names:
>>>             return True
>>>         else:
>>>             return False
>>>     return h
>>>
>>> df = iris.distinct('name')
>>> df = df[df.name,
>>>         df.name.map(myfunc, resources=[file_resource, iris_names_collection], rtype='boolean').rename('isin')]
>>>
>>> df
              name   isin
0      Iris-setosa   True
1  Iris-versicolor   True
2   Iris-virginica  False

Note that when a partitioned table is read as a resource, partition fields are not included.

Use third-party Python libraries

You can upload third-party Python packages as resources to MaxCompute. Supported formats include .whl, .egg, .zip, and .tar.gz. You need to specify the package files when using them globally or in individual actions. Custom functions can also use third-party libraries.

Note that the dependencies of the third-party libraries also need to be specified. Otherwise errors may occur when importing the files.

Take the following python-dateutil package as an example.

First, use the pip download command to download the package and its dependencies to a specific path. Two packages are downloaded: six-1.10.0-py2.py3-none-any.whl and python_dateutil-2.5.3-py2.py3-none-any.whl. Note that the packages must support the Linux environment.

$ pip download python-dateutil -d /to/path/

Then upload the files to MaxCompute as resources.

>>> # make sure that file extensions are correct
>>> odps.create_resource('six.whl', 'file', file_obj=open('six-1.10.0-py2.py3-none-any.whl', 'rb'))
>>> odps.create_resource('python_dateutil.whl', 'file', file_obj=open('python_dateutil-2.5.3-py2.py3-none-any.whl', 'rb'))

Suppose you have a DataFrame object that contains only a string field.

>>> df
               datestr
0  2016-08-26 14:03:29
1  2015-08-26 14:03:29

Specify the third-party libraries globally:

>>> from odps import options
>>>
>>> def get_year(t):
>>>     from dateutil.parser import parse
>>>     return parse(t).strftime('%Y')
>>>
>>> options.df.libraries = ['six.whl', 'python_dateutil.whl']
>>> df.datestr.map(get_year)
   datestr
0     2016
1     2015

Or use the libraries attribute of an action to specify the package:

>>> def get_year(t):
>>>     from dateutil.parser import parse
>>>     return parse(t).strftime('%Y')
>>>
>>> df.datestr.map(get_year).execute(libraries=['six.whl', 'python_dateutil.whl'])
   datestr
0     2016
1     2015

By default, PyODPS supports third-party libraries that contain pure Python code and no file operations. In newer versions of MaxCompute, PyODPS also supports Python libraries that contain binary code or file operations. These libraries must be suffixed with cp27-cp27m-manylinux1_x86_64 and uploaded in the archive format, which means .whl packages must be renamed to .zip files. You also need to enable the odps.isolation.session.enable option or enable Isolation in your project. The following example demonstrates how to upload and use the special functions in scipy:

>>> # packages containing binaries should be uploaded with archive method,
>>> # replacing extension .whl with .zip.
>>> odps.create_resource('scipy.zip', 'archive', file_obj=open('scipy-0.19.0-cp27-cp27m-manylinux1_x86_64.whl', 'rb'))
>>>
>>> # if your project has already been configured with isolation, the line below can be omitted
>>> options.sql.settings = { 'odps.isolation.session.enable': True }
>>>
>>> def psi(value):
>>>     # it is recommended to import third-party libraries inside your function
>>>     # in case that structures of the same package differ between different systems.
>>>     from scipy.special import psi
>>>     return float(psi(value))
>>>
>>> df.float_col.map(psi).execute(libraries=['scipy.zip'])

Binary packages that are distributed only as source code can be built into .whl files in a Linux shell and then uploaded. .whl files generated on Mac or Windows are not usable in MaxCompute:

python setup.py bdist_wheel

Use the counter

Custom functions can use counters through the MaxCompute UDF execution context:

from odps.udf import get_execution_context

def h(x):
    ctx = get_execution_context()
    counters = ctx.get_counters()
    counters.get_counter('df', 'add_one').increment(1)
    return x + 1

df.field.map(h)

You can find the value of the counter in the JSONSummary of LogView.

Call MaxCompute built-in functions or UDFs

You can use the func interface to call MaxCompute built-in functions or UDFs to generate columns. This interface returns a string type by default. You can use the rtype parameter to specify the return type.

>>> from odps.df import func
>>>
>>> iris[iris.name, func.rand(rtype='float').rename('rand')][:4]
>>> iris[iris.name, func.rand(10, rtype='float').rename('rand')][:4]
>>> # call UDFs defined on MaxCompute. you need to name your result sequence
>>> # explicitly if the name of returned sequence cannot be determined
>>> iris[iris.name, func.your_udf(iris.sepalwidth, iris.sepallength, rtype='float').rename('new_col')]
>>> # call a UDF from another project. the sequence can also be renamed via the name arg
>>> iris[iris.name, func.your_udf(iris.sepalwidth, iris.sepallength, rtype='float', project='udf_project',
>>>                               name='new_col')]

Note

When executing under the pandas backend, expressions that contain func are not supported.