Data merging
from odps.df import DataFrame
movies = DataFrame(o.get_table('pyodps_ml_100k_movies'))
ratings = DataFrame(o.get_table('pyodps_ml_100k_ratings'))
movies.dtypes
odps.Schema {
movie_id int64
title string
release_date string
video_release_date string
imdb_url string
}
ratings.dtypes
odps.Schema {
user_id int64
movie_id int64
rating int64
unix_timestamp int64
}
Join operation
DataFrame supports the join operation for two Collection objects. If you do not specify the join conditions, the DataFrame application program interface (API) uses columns of the same name to join them.
>>> movies.join(ratings).head(3)
movie_id title release_date video_release_date imdb_url user_id rating unix_timestamp
0 3 Four Rooms (1995) 01-Jan-1995 http://us.imdb.com/M/title-exact?Four%20Rooms%... 49 3 888068877
1 3 Four Rooms (1995) 01-Jan-1995 http://us.imdb.com/M/title-exact?Four%20Rooms%... 621 5 881444887
2 3 Four Rooms (1995) 01-Jan-1995 http://us.imdb.com/M/title-exact?Four%20Rooms%... 291 3 874833936
You can also explicitly specify the join conditions in the following ways:
>>> movies.join(ratings, on='movie_id').head(3)
movie_id title release_date video_release_date imdb_url user_id rating unix_timestamp
0 3 Four Rooms (1995) 01-Jan-1995 http://us.imdb.com/M/title-exact?Four%20Rooms%... 49 3 888068877
1 3 Four Rooms (1995) 01-Jan-1995 http://us.imdb.com/M/title-exact?Four%20Rooms%... 621 5 881444887
2 3 Four Rooms (1995) 01-Jan-1995 http://us.imdb.com/M/title-exact?Four%20Rooms%... 291 3 874833936
During the join operation, if the field name in the on condition for both Collection objects is the same, the system selects one field name. In other types of join operations such as left_join, right_join, and outer_join, one of the field names in the on condition is renamed.
>>> movies.left_join(ratings, on='movie_id').head(3)
movie_id_x title release_date video_release_date imdb_url user_id movie_id_y rating unix_timestamp
0 3 Four Rooms (1995) 01-Jan-1995 http://us.imdb.com/M/title-exact?Four%20Rooms%... 49 3 3 888068877
1 3 Four Rooms (1995) 01-Jan-1995 http://us.imdb.com/M/title-exact?Four%20Rooms%... 621 3 5 881444887
2 3 Four Rooms (1995) 01-Jan-1995 http://us.imdb.com/M/title-exact?Four%20Rooms%... 291 3 3 874833936
In this code, movie_id
is renamed as movie_id_x and movie_id_y. This renaming rule depends on the suffixes
parameter (('_x', '_y')
as the default). The fields of the same name are renamed using the specified suffix.
>>> ratings2 = ratings[ratings.exclude('movie_id'), ratings.movie_id.rename('movie_id2')]
>>> ratings2.dtypes
odps.Schema {
user_id int64
rating int64
unix_timestamp int64
movie_id2 int64
}
>>> movies.join(ratings2, on=[('movie_id', 'movie_id2')]).head(3)
movie_id title release_date video_release_date imdb_url user_id rating unix_timestamp movie_id2
0 3 Four Rooms (1995) 01-Jan-1995 http://us.imdb.com/M/title-exact?Four%20Rooms%... 49 3 888068877 3
1 3 Four Rooms (1995) 01-Jan-1995 http://us.imdb.com/M/title-exact?Four%20Rooms%... 621 5 881444887 3
2 3 Four Rooms (1995) 01-Jan-1995 http://us.imdb.com/M/title-exact?Four%20Rooms%... 291 3 874833936 3
You can also write the equality operators to indicate renaming.
>>> movies.join(ratings2, on=[movies.movie_id == ratings2.movie_id2]).head(3)
movie_id title release_date video_release_date imdb_url user_id rating unix_timestamp movie_id2
0 3 Four Rooms (1995) 01-Jan-1995 http://us.imdb.com/M/title-exact?Four%20Rooms%... 49 3 888068877 3
1 3 Four Rooms (1995) 01-Jan-1995 http://us.imdb.com/M/title-exact?Four%20Rooms%... 621 5 881444887 3
2 3 Four Rooms (1995) 01-Jan-1995 http://us.imdb.com/M/title-exact?Four%20Rooms%... 291 3 874833936 3
During the self-join operation, you can use the view
method to retrieve fields respectively from the left and right DataFrame objects.
>>> movies2 = movies.view()
>>> movies.join(movies2, movies.movie_id == movies2.movie_id)[movies, movies2.movie_id.rename('movie_id2')].head(3)
movie_id title_x release_date_x video_release_date_x \
0 2 GoldenEye (1995) 01-Jan-1995 True
1 3 Four Rooms (1995) 01-Jan-1995 True
2 4 Get Shorty (1995) 01-Jan-1995 True
imdb_url_x movie_id2
0 http://us.imdb.com/M/title-exact?GoldenEye%20(... 2
1 http://us.imdb.com/M/title-exact?Four%20Rooms%... 3
2 http://us.imdb.com/M/title-exact?Get%20Shorty%... 4
In addition to join
, DataFrame supports left_join
, right_join
, and outer_join
. In these join operations, renamed columns are suffixed with _x and _y by default. You can use a 2‑tuple to define the suffix in the suffixes parameter.
During the outer_join operation, to avoid repeated columns of the same equal predicate, set the merge_columns option to True. Therefore, the system selects non-null values from these columns as the values in a new column:
>>> movies.left_join(ratings, on='movie_id', merge_columns=True)
To use the mapjoin operation, set mapjoin to True. Therefore, the system executes the mapjoin operation for the right DataFrame object.
>>> movies.left_join(ratings, on='movie_id', mapjoin=True)
To use the skewjoin operation, set skewjoin to True. Therefore, the system executes the skewjoin operation for the right DataFrame object.
>>> movies.left_join(ratings, on='movie_id', skewjoin=True)
To specify skewness on specific columns, you may set skewjoin with a list of names of these columns.
>>> movies.left_join(ratings, on=['user_id', 'movie_id'], skewjoin=['user_id', 'movie_id'])
When you are aware of the skew values, you may specify combination of these values by passing a list of dict of values to skewjoin.
>>> movies.left_join(
ratings, on=['user_id', 'movie_id'],
skewjoin=[{'user_id': 0, 'movie_id': 0}, {'user_id': 1, 'movie_id': 1}],
)
It is recommended by MaxCompute to set specific values for skew columns, or extra computation cost for column statistics might be introduced. When specifying skew values, make sure all columns are included for every pair of values.
You can also join Collection objects respectively from MaxCompute and pandas, or join those respectively from MaxCompute and a database. MaxCompute rather than pandas or the database executes the computation.
Union operation
For two tables of consistent fields and types, which are automatically ordered, you can use the union or concat operation to combine both tables into a single table.
>>> mov1 = movies[movies.movie_id < 3]['movie_id', 'title']
>>> mov2 = movies[(movies.movie_id > 3) & (movies.movie_id < 6)]['title', 'movie_id']
>>> mov1.union(mov2)
movie_id title
0 1 Toy Story (1995)
1 2 GoldenEye (1995)
2 4 Get Shorty (1995)
3 5 Copycat (1995)
You can execute the union operation for Collection objects respectively from MaxCompute and pandas, or for those respectively from MaxCompute and a database. MaxCompute rather than pandas or the database executes the computation.