Introduction
There are 30 questions to practice Python Pandas and MySQL on https://leetcode.cn/studyplan/30-days-of-pandas/ . These questions basically practice the fundamental skills. Pandas and MySQL are similar in some aspects, and one can write the same functional code in two languages. I’ll practice my skill with these questions, and write notes about the equivalent operations in two languages.
Accordingly, there are six sections in this section (so will use 6 articles in total):
- Condition Filter & Misc
- String Manipulation
- Data Manipulation
- Data Statistics
- Grouping
- Merging
1. Condition Filter & Join & Misc
The four related question in this area are:
Basic where
clause filtering, with OR
operator: https://leetcode.cn/problems/big-countries/
Basic where
clause filtering, with AND
operator: https://leetcode.cn/problems/recyclable-and-low-fat-products/
Joining two tables, with where
clause filtering null: https://leetcode.cn/problems/customers-who-never-order/
Basic where
clause filtering, with DISTINCT
and sort
: https://leetcode.cn/problems/article-views-i
1.1 Condition
Typical condition filter work is asking us to do like this:
Filter the big country in World scheme. A country is big if:
- it has an area of at least three million (i.e.,
3000000 km2
), or- it has a population of at least twenty-five million (i.e.,
25000000
).
We’ll use this example to illustrate the point.
1.1.a MySQL
A typical condition filter in SQL looks like this:
1 | select <col_names> |
SQL allows the use of the logical connectives and, or, and not.
The operands of the logical connectives can be expressions involving the comparison operators <, <=, >, >=, =, and <> (Note that, equal symbol is one =
, but not ==
!)
For example, in the codes above, the answer is
1 | SELECT name, population, area |
1.1.b Pandas
Filter operations in pandas is a little bit difference. We should know two basics things first:
- Condition result in Index Series
If we want to get all data that are greater than 5 in df
‘s col1
column, the code and result is as below:
1 | print(df) |
As you can see, the result of df['col1'] > 5
is a bool Series with len(df)
size. The entry is True only when the corresponding entry at col1
satisfies the condition.
- Indexing by the Index Series
We can pass this bool index series in to the df.loc[]
. Then, only the rows with the condition satisfied are displayed:
1 | a = df.loc[df['col1'] > 5] |
By such method, we can do the filtering in the pandas. Note that:
For the and
, or
and not
logic operators, the corresponding operator in pandas is &
, |
and ~
. For example:
1 | cond1 = df['col1'] > 5 |
Note that the ()
symbols are necessary because of the operator precedence. For example, in the codes above, the answer is:
1 | cond1 = world['area'] >= 300_0000 |
1.2 Misc
1.2.1 Rename output columns
1.2.1.a MySQL
In MySQL, rename a column is very easy. By default, if the command is
1 | SELECT name, population, area |
Then the output columns are named name
, population
, and area
. To rename, just use the as
symbol to give a new name as output:
1 | -- change name => Name, population => Population, area => Area |
1.2.1.b Pandas
In pandas, we can use the pd.DataFrame.rename
function to rename the columns. The input parameter columns
is a dict[str, str]
, where key
is the old name, and the value
is the new name. For example in the example below, we make all names capitalized:
1 | df = df.rename(columns={'name': 'Name', 'population': 'Population', 'area': 'Area'}) |
1.2.2 Swap output columns order
1.2.2.a MySQL
In SQL, this work is relatively easy. You only need to swap the column names in the select
command:
1 | -- show with A, B, C columns |
1.2.2.b Pandas
The method is relatively similar as the code in MySQL, by:
1 | # show with A, B, C columns |
1.2.3 Remove duplicate rows
1.2.3.a MySQL
It is quite easy to do that: you only need to add DISTINCT
keyword. For example:
1 | select DISTINCT student_id |
1.2.3.b Pandas
The DataFrame has a method called drop_duplicates()
. For example:
1 | selected_views = selected_views.drop_duplicates() |
More advanced usage is use parameter subset
to tell which columns for identifying duplicates, by default use all of the columns. For example, the following code drop the duplicate rows whenever they have same value in column teacher_id
, subject_id
.
1 | teacher = teacher.drop_duplicates(subset=['teacher_id', 'subject_id']) |
1.2.4 Sort
1.2.4.a MySQL
In the select
statement, add with order by <col> asc/desc
statement. <col>
indicates sort according to which column, asc
means ascending sorting, desc
means descending sorting.
1 | select v.author_id |
To sort according multiple columns (if first column is the same, sort by second column), do it by:
1 | select v.author_id |
1.2.4.b Pandas
Use the pd.DataFrame.sort_values()
to sort according to which column(s). If we want to sort from biggest to smallest, use ascending
parameter.
1 | # sort by one column, id |