Data merging

from odps.df import DataFrame
movies = DataFrame(o.get_table('pyodps_ml_100k_movies'))
ratings = DataFrame(o.get_table('pyodps_ml_100k_ratings'))
movies.dtypes
odps.Schema {
  movie_id                            int64
  title                               string
  release_date                        string
  video_release_date                  string
  imdb_url                            string
}
ratings.dtypes
odps.Schema {
  user_id                     int64
  movie_id                    int64
  rating                      int64
  unix_timestamp              int64
}

Join operation

DataFrame supports the join operation for two Collection objects. If you do not specify the join conditions, the DataFrame application program interface (API) uses columns of the same name to join them.

>>> movies.join(ratings).head(3)
   movie_id              title  release_date  video_release_date                                           imdb_url  user_id  rating  unix_timestamp
0         3  Four Rooms (1995)   01-Jan-1995                      http://us.imdb.com/M/title-exact?Four%20Rooms%...       49       3       888068877
1         3  Four Rooms (1995)   01-Jan-1995                      http://us.imdb.com/M/title-exact?Four%20Rooms%...      621       5       881444887
2         3  Four Rooms (1995)   01-Jan-1995                      http://us.imdb.com/M/title-exact?Four%20Rooms%...      291       3       874833936

You can also explicitly specify the join conditions in the following ways:

>>> movies.join(ratings, on='movie_id').head(3)
   movie_id              title  release_date  video_release_date                                           imdb_url  user_id  rating  unix_timestamp
0         3  Four Rooms (1995)   01-Jan-1995                      http://us.imdb.com/M/title-exact?Four%20Rooms%...       49       3       888068877
1         3  Four Rooms (1995)   01-Jan-1995                      http://us.imdb.com/M/title-exact?Four%20Rooms%...      621       5       881444887
2         3  Four Rooms (1995)   01-Jan-1995                      http://us.imdb.com/M/title-exact?Four%20Rooms%...      291       3       874833936

During the join operation, if the field name in the on condition for both Collection objects is the same, the system selects one field name. In other types of join operations such as left_join, right_join, and outer_join, one of the field names in the on condition is renamed.

>>> movies.left_join(ratings, on='movie_id').head(3)
   movie_id_x              title  release_date  video_release_date                                           imdb_url  user_id  movie_id_y  rating  unix_timestamp
0           3  Four Rooms (1995)   01-Jan-1995                      http://us.imdb.com/M/title-exact?Four%20Rooms%...       49           3       3       888068877
1           3  Four Rooms (1995)   01-Jan-1995                      http://us.imdb.com/M/title-exact?Four%20Rooms%...      621           3       5       881444887
2           3  Four Rooms (1995)   01-Jan-1995                      http://us.imdb.com/M/title-exact?Four%20Rooms%...      291           3       3       874833936

In this code, movie_idis renamed as movie_id_x and movie_id_y. This renaming rule depends on the suffixesparameter (('_x', '_y')as the default). The fields of the same name are renamed using the specified suffix.

>>> ratings2 = ratings[ratings.exclude('movie_id'), ratings.movie_id.rename('movie_id2')]
>>> ratings2.dtypes
odps.Schema {
  user_id                     int64
  rating                      int64
  unix_timestamp              int64
  movie_id2                   int64
}
>>> movies.join(ratings2, on=[('movie_id', 'movie_id2')]).head(3)
   movie_id              title  release_date  video_release_date                                           imdb_url  user_id  rating  unix_timestamp  movie_id2
0         3  Four Rooms (1995)   01-Jan-1995                      http://us.imdb.com/M/title-exact?Four%20Rooms%...       49       3       888068877          3
1         3  Four Rooms (1995)   01-Jan-1995                      http://us.imdb.com/M/title-exact?Four%20Rooms%...      621       5       881444887          3
2         3  Four Rooms (1995)   01-Jan-1995                      http://us.imdb.com/M/title-exact?Four%20Rooms%...      291       3       874833936          3

You can also write the equality operators to indicate renaming.

>>> movies.join(ratings2, on=[movies.movie_id == ratings2.movie_id2]).head(3)
   movie_id              title  release_date  video_release_date                                           imdb_url  user_id  rating  unix_timestamp  movie_id2
0         3  Four Rooms (1995)   01-Jan-1995                      http://us.imdb.com/M/title-exact?Four%20Rooms%...       49       3       888068877          3
1         3  Four Rooms (1995)   01-Jan-1995                      http://us.imdb.com/M/title-exact?Four%20Rooms%...      621       5       881444887          3
2         3  Four Rooms (1995)   01-Jan-1995                      http://us.imdb.com/M/title-exact?Four%20Rooms%...      291       3       874833936          3

During the self-join operation, you can use the viewmethod to retrieve fields respectively from the left and right DataFrame objects.

>>> movies2 = movies.view()
>>> movies.join(movies2, movies.movie_id == movies2.movie_id)[movies, movies2.movie_id.rename('movie_id2')].head(3)
   movie_id            title_x release_date_x video_release_date_x  \
0         2   GoldenEye (1995)    01-Jan-1995                 True
1         3  Four Rooms (1995)    01-Jan-1995                 True
2         4  Get Shorty (1995)    01-Jan-1995                 True

                                          imdb_url_x  movie_id2
0  http://us.imdb.com/M/title-exact?GoldenEye%20(...          2
1  http://us.imdb.com/M/title-exact?Four%20Rooms%...          3
2  http://us.imdb.com/M/title-exact?Get%20Shorty%...          4

In addition to join, DataFrame supports left_join, right_join, and outer_join. In these join operations, renamed columns are suffixed with _x and _y by default. You can use a 2‑tuple to define the suffix in the suffixes parameter.

During the outer_join operation, to avoid repeated columns of the same equal predicate, set the merge_columns option to True. Therefore, the system selects non-null values from these columns as the values in a new column:

>>> movies.left_join(ratings, on='movie_id', merge_columns=True)

To use the mapjoin operation, set mapjoin to True. Therefore, the system executes the mapjoin operation for the right DataFrame object.

>>> movies.left_join(ratings, on='movie_id', mapjoin=True)

To use the skewjoin operation, set skewjoin to True. Therefore, the system executes the skewjoin operation for the right DataFrame object.

>>> movies.left_join(ratings, on='movie_id', skewjoin=True)

To specify skewness on specific columns, you may set skewjoin with a list of names of these columns.

>>> movies.left_join(ratings, on=['user_id', 'movie_id'], skewjoin=['user_id', 'movie_id'])

When you are aware of the skew values, you may specify combination of these values by passing a list of dict of values to skewjoin.

>>> movies.left_join(
    ratings, on=['user_id', 'movie_id'],
    skewjoin=[{'user_id': 0, 'movie_id': 0}, {'user_id': 1, 'movie_id': 1}],
)

It is recommended by MaxCompute to set specific values for skew columns, or extra computation cost for column statistics might be introduced. When specifying skew values, make sure all columns are included for every pair of values.

You can also join Collection objects respectively from MaxCompute and pandas, or join those respectively from MaxCompute and a database. MaxCompute rather than pandas or the database executes the computation.

Union operation

For two tables of consistent fields and types, which are automatically ordered, you can use the union or concat operation to combine both tables into a single table.

>>> mov1 = movies[movies.movie_id < 3]['movie_id', 'title']
>>> mov2 = movies[(movies.movie_id > 3) & (movies.movie_id < 6)]['title', 'movie_id']
>>> mov1.union(mov2)
   movie_id              title
0         1   Toy Story (1995)
1         2   GoldenEye (1995)
2         4  Get Shorty (1995)
3         5     Copycat (1995)

You can execute the union operation for Collection objects respectively from MaxCompute and pandas, or for those respectively from MaxCompute and a database. MaxCompute rather than pandas or the database executes the computation.