Quick start
Here, movielens 100K is used as an example. Assume that three tables already exist, which are pyodps_ml_100k_movies
(movie-related data), pyodps_ml_100k_users
(user-related data), and pyodps_ml_100k_ratings
(rating-related data).
Create a MaxCompute object before starting the following steps:
>>> import os
>>> from odps import ODPS
# Make sure environment variable ALIBABA_CLOUD_ACCESS_KEY_ID already set to Access Key ID of user
# while environment variable ALIBABA_CLOUD_ACCESS_KEY_SECRET set to Access Key Secret of user.
# Not recommended to hardcode Access Key ID or Access Key Secret in your code.
>>> o = ODPS(
... os.getenv('ALIBABA_CLOUD_ACCESS_KEY_ID'),
... os.getenv('ALIBABA_CLOUD_ACCESS_KEY_SECRET'),
... project='**your-project**',
... endpoint='**your-endpoint**',
... )
You only need to input a Table object to create a DataFrame object. For example:
>>> from odps.df import DataFrame
>>> users = DataFrame(o.get_table('pyodps_ml_100k_users'))
View fields of DataFrame and the types of the fields through the dtypes attribute, as shown in the following code:
>>> users.dtypes
odps.Schema {
user_id int64
age int64
sex string
occupation string
zip_code string
}
You can use the head method to obtain the first N data records for easy and quick data preview. For example:
>>> users.head(10)
user_id age sex occupation zip_code
0 1 24 M technician 85711
1 2 53 F other 94043
2 3 23 M writer 32067
3 4 24 M technician 43537
4 5 33 F other 15213
5 6 42 M executive 98101
6 7 57 M administrator 91344
7 8 36 M administrator 05201
8 9 29 M student 01002
9 10 53 M lawyer 90703
You can add a filter on the fields if you do not want to view all of them. For example:
>>> users[['user_id', 'age']].head(5)
user_id age
0 1 24
1 2 53
2 3 23
3 4 24
4 5 33
You can also exclude several fields. For example:
>>> users.exclude('zip_code', 'age').head(5)
user_id sex occupation
0 1 M technician
1 2 F other
2 3 M writer
3 4 M technician
4 5 F other
When excluding some fields, you may want to obtain new columns through computation. For example, add the sex_bool attribute and set it to True if sex is Male. Otherwise, set it to False. For example:
>>> users.select(users.exclude('zip_code', 'sex'), sex_bool=users.sex == 'M').head(5)
user_id age occupation sex_bool
0 1 24 technician True
1 2 53 other False
2 3 23 writer True
3 4 24 technician True
4 5 33 other False
Obtain the number of persons at age of 20 to 25, as shown in the following code:
>>> users[users.age.between(20, 25)].count()
195
Obtain the numbers of male and female users, as shown in the following code:
>>> users.groupby(users.sex).agg(count=users.count())
sex count
0 F 273
1 M 670
To divide users by job, obtain the first 10 jobs that have the largest population, and sort the jobs in the descending order of population. See the following:
>>> df = users.groupby('occupation').agg(count=users['occupation'].count())
>>> df.sort(df['count'], ascending=False)[:10]
occupation count
0 student 196
1 other 105
2 educator 95
3 administrator 79
4 engineer 67
5 programmer 66
6 librarian 51
7 writer 45
8 executive 32
9 scientist 31
DataFrame APIs provide the value_counts method to quickly achieve the same result. An example is shown below. Note that the number of records returned by this method is limited by options.df.odps.sort.limit
, whose default value is 10,000. More information can be found in configuration section.
>>> uses.occupation.value_counts()[:10]
occupation count
0 student 196
1 other 105
2 educator 95
3 administrator 79
4 engineer 67
5 programmer 66
6 librarian 51
7 writer 45
8 executive 32
9 scientist 31
Show data in a more intuitive graph, as shown in the following code:
>>> %matplotlib inline
Use a horizontal bar chart to visualize data, as shown in the following code:
>>> users['occupation'].value_counts().plot(kind='barh', x='occupation', ylabel='prefession')
<matplotlib.axes._subplots.AxesSubplot at 0x10653cfd0>
Divide ages into 30 groups and view the histogram of age distribution, as shown in the following code:
>>> users.age.hist(bins=30, title="Distribution of users' ages", xlabel='age', ylabel='count of users')
<matplotlib.axes._subplots.AxesSubplot at 0x10667a510>
Use join to join the three tables and save the joined tables as a new table. For example:
>>> movies = DataFrame(o.get_table('pyodps_ml_100k_movies'))
>>> ratings = DataFrame(o.get_table('pyodps_ml_100k_ratings'))
>>>
>>> o.delete_table('pyodps_ml_100k_lens', if_exists=True)
>>> lens = movies.join(ratings).join(users).persist('pyodps_ml_100k_lens')
>>>
>>> lens.dtypes
odps.Schema {
movie_id int64
title string
release_date string
video_release_date string
imdb_url string
user_id int64
rating int64
unix_timestamp int64
age int64
sex string
occupation string
zip_code string
}
Divide ages of 0 to 80 into eight groups, as shown in the following code:
>>> labels = ['0-9', '10-19', '20-29', '30-39', '40-49', '50-59', '60-69', '70-79']
>>> cut_lens = lens[lens, lens.age.cut(range(0, 81, 10), right=False, labels=labels).rename('age_group')]
View the first 10 data records of a single age in a group, as shown in the following code:
>>> cut_lens['age_group', 'age'].distinct()[:10]
age_group age
0 0-9 7
1 10-19 10
2 10-19 11
3 10-19 13
4 10-19 14
5 10-19 15
6 10-19 16
7 10-19 17
8 10-19 18
9 10-19 19
View users’ total rating and average rating of each age group, as shown in the following code:
>>> cut_lens.groupby('age_group').agg(cut_lens.rating.count().rename('total_rating'), cut_lens.rating.mean().rename('avg_rating'))
age_group avg_rating total_rating
0 0-9 3.767442 43
1 10-19 3.486126 8181
2 20-29 3.467333 39535
3 30-39 3.554444 25696
4 40-49 3.591772 15021
5 50-59 3.635800 8704
6 60-69 3.648875 2623
7 70-79 3.649746 197