Quick start

Here, movielens 100K is used as an example. Assume that three tables already exist, which are pyodps_ml_100k_movies (movie-related data), pyodps_ml_100k_users (user-related data), and pyodps_ml_100k_ratings (rating-related data).

Create a MaxCompute object before starting the following steps:

>>> import os
>>> from odps import ODPS
# Make sure environment variable ALIBABA_CLOUD_ACCESS_KEY_ID already set to Access Key ID of user
# while environment variable ALIBABA_CLOUD_ACCESS_KEY_SECRET set to Access Key Secret of user.
# Not recommended to hardcode Access Key ID or Access Key Secret in your code.
>>> o = ODPS(
...     os.getenv('ALIBABA_CLOUD_ACCESS_KEY_ID'),
...     os.getenv('ALIBABA_CLOUD_ACCESS_KEY_SECRET'),
...     project='**your-project**',
...     endpoint='**your-endpoint**',
... )

You only need to input a Table object to create a DataFrame object. For example:

>>> from odps.df import DataFrame
>>> users = DataFrame(o.get_table('pyodps_ml_100k_users'))

View fields of DataFrame and the types of the fields through the dtypes attribute, as shown in the following code:

>>> users.dtypes
odps.Schema {
  user_id             int64
  age                 int64
  sex                 string
  occupation          string
  zip_code            string
}

You can use the head method to obtain the first N data records for easy and quick data preview. For example:

>>> users.head(10)
   user_id  age  sex     occupation  zip_code
0        1   24    M     technician     85711
1        2   53    F          other     94043
2        3   23    M         writer     32067
3        4   24    M     technician     43537
4        5   33    F          other     15213
5        6   42    M      executive     98101
6        7   57    M  administrator     91344
7        8   36    M  administrator     05201
8        9   29    M        student     01002
9       10   53    M         lawyer     90703

You can add a filter on the fields if you do not want to view all of them. For example:

>>> users[['user_id', 'age']].head(5)
   user_id  age
0        1   24
1        2   53
2        3   23
3        4   24
4        5   33

You can also exclude several fields. For example:

>>> users.exclude('zip_code', 'age').head(5)
   user_id  sex  occupation
0        1    M  technician
1        2    F       other
2        3    M      writer
3        4    M  technician
4        5    F       other

When excluding some fields, you may want to obtain new columns through computation. For example, add the sex_bool attribute and set it to True if sex is Male. Otherwise, set it to False. For example:

>>> users.select(users.exclude('zip_code', 'sex'), sex_bool=users.sex == 'M').head(5)
   user_id  age  occupation  sex_bool
0        1   24  technician      True
1        2   53       other     False
2        3   23      writer      True
3        4   24  technician      True
4        5   33       other     False

Obtain the number of persons at age of 20 to 25, as shown in the following code:

>>> users[users.age.between(20, 25)].count()
195

Obtain the numbers of male and female users, as shown in the following code:

>>> users.groupby(users.sex).agg(count=users.count())
   sex  count
0    F    273
1    M    670

To divide users by job, obtain the first 10 jobs that have the largest population, and sort the jobs in the descending order of population. See the following:

>>> df = users.groupby('occupation').agg(count=users['occupation'].count())
>>> df.sort(df['count'], ascending=False)[:10]
      occupation  count
0        student    196
1          other    105
2       educator     95
3  administrator     79
4       engineer     67
5     programmer     66
6      librarian     51
7         writer     45
8      executive     32
9      scientist     31

DataFrame APIs provide the value_counts method to quickly achieve the same result. An example is shown below. Note that the number of records returned by this method is limited by options.df.odps.sort.limit, whose default value is 10,000. More information can be found in configuration section.

>>> uses.occupation.value_counts()[:10]
      occupation  count
0        student    196
1          other    105
2       educator     95
3  administrator     79
4       engineer     67
5     programmer     66
6      librarian     51
7         writer     45
8      executive     32
9      scientist     31

Show data in a more intuitive graph, as shown in the following code:

>>> %matplotlib inline

Use a horizontal bar chart to visualize data, as shown in the following code:

>>> users['occupation'].value_counts().plot(kind='barh', x='occupation', ylabel='prefession')
<matplotlib.axes._subplots.AxesSubplot at 0x10653cfd0>
_images/df-value-count-plot.png

Divide ages into 30 groups and view the histogram of age distribution, as shown in the following code:

>>> users.age.hist(bins=30, title="Distribution of users' ages", xlabel='age', ylabel='count of users')
<matplotlib.axes._subplots.AxesSubplot at 0x10667a510>
_images/df-age-hist.png

Use join to join the three tables and save the joined tables as a new table. For example:

>>> movies = DataFrame(o.get_table('pyodps_ml_100k_movies'))
>>> ratings = DataFrame(o.get_table('pyodps_ml_100k_ratings'))
>>>
>>> o.delete_table('pyodps_ml_100k_lens', if_exists=True)
>>> lens = movies.join(ratings).join(users).persist('pyodps_ml_100k_lens')
>>>
>>> lens.dtypes
odps.Schema {
  movie_id                            int64
  title                               string
  release_date                        string
  video_release_date                  string
  imdb_url                            string
  user_id                             int64
  rating                              int64
  unix_timestamp                      int64
  age                                 int64
  sex                                 string
  occupation                          string
  zip_code                            string
}

Divide ages of 0 to 80 into eight groups, as shown in the following code:

>>> labels = ['0-9', '10-19', '20-29', '30-39', '40-49', '50-59', '60-69', '70-79']
>>> cut_lens = lens[lens, lens.age.cut(range(0, 81, 10), right=False, labels=labels).rename('age_group')]

View the first 10 data records of a single age in a group, as shown in the following code:

>>> cut_lens['age_group', 'age'].distinct()[:10]
   age_group  age
0        0-9    7
1      10-19   10
2      10-19   11
3      10-19   13
4      10-19   14
5      10-19   15
6      10-19   16
7      10-19   17
8      10-19   18
9      10-19   19

View users’ total rating and average rating of each age group, as shown in the following code:

>>> cut_lens.groupby('age_group').agg(cut_lens.rating.count().rename('total_rating'), cut_lens.rating.mean().rename('avg_rating'))
     age_group  avg_rating  total_rating
0          0-9    3.767442            43
1        10-19    3.486126          8181
2        20-29    3.467333         39535
3        30-39    3.554444         25696
4        40-49    3.591772         15021
5        50-59    3.635800          8704
6        60-69    3.648875          2623
7        70-79    3.649746           197