0%

Pandas & MySQL Leetcode Practice Notes 1--Condition Filter & Misc

Introduction

There are 30 questions to practice Python Pandas and MySQL on https://leetcode.cn/studyplan/30-days-of-pandas/ . These questions basically practice the fundamental skills. Pandas and MySQL are similar in some aspects, and one can write the same functional code in two languages. I’ll practice my skill with these questions, and write notes about the equivalent operations in two languages.

Accordingly, there are six sections in this section (so will use 6 articles in total):

  1. Condition Filter & Misc
  2. String Manipulation
  3. Data Manipulation
  4. Data Statistics
  5. Grouping
  6. Merging

1. Condition Filter & Join & Misc

The four related question in this area are:

Basic where clause filtering, with OR operator: https://leetcode.cn/problems/big-countries/

Basic where clause filtering, with AND operator: https://leetcode.cn/problems/recyclable-and-low-fat-products/

Joining two tables, with where clause filtering null: https://leetcode.cn/problems/customers-who-never-order/

Basic where clause filtering, with DISTINCT and sort: https://leetcode.cn/problems/article-views-i

1.1 Condition

Typical condition filter work is asking us to do like this:

Filter the big country in World scheme. A country is big if:

  • it has an area of at least three million (i.e., 3000000 km2), or
  • it has a population of at least twenty-five million (i.e., 25000000).

We’ll use this example to illustrate the point.

1.1.a MySQL

A typical condition filter in SQL looks like this:

1
2
3
4
5
select <col_names>
from <table_name>
where (
<condition predicate>
)

SQL allows the use of the logical connectives and, or, and not.

The operands of the logical connectives can be expressions involving the comparison operators <, <=, >, >=, =, and <> (Note that, equal symbol is one =, but not ==!)

For example, in the codes above, the answer is

1
2
3
SELECT name, population, area
FROM World
WHERE area >= 3000000 OR population >= 25000000

1.1.b Pandas

Filter operations in pandas is a little bit difference. We should know two basics things first:

  • Condition result in Index Series

If we want to get all data that are greater than 5 in df‘s col1 column, the code and result is as below:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
print(df)
# OUTPUT:
'''
name col1 col2
0 a 1 10
1 b 2 9
2 c 3 8
3 d 4 7
4 e 5 6
5 f 6 5
6 g 7 4
7 h 8 3
8 i 9 2
9 j 10 1
'''

print(df['col1'] > 5)
# OUTPUT:
'''
0 False
1 False
2 False
3 False
4 False
5 True
6 True
7 True
8 True
9 True
Name: col1, dtype: bool
'''

As you can see, the result of df['col1'] > 5 is a bool Series with len(df) size. The entry is True only when the corresponding entry at col1 satisfies the condition.

  • Indexing by the Index Series

We can pass this bool index series in to the df.loc[]. Then, only the rows with the condition satisfied are displayed:

1
2
3
4
5
6
7
8
9
10
11
a = df.loc[df['col1'] > 5]
print(a)
# OUTPUT
'''
name col1 col2
5 f 6 5
6 g 7 4
7 h 8 3
8 i 9 2
9 j 10 1
'''

By such method, we can do the filtering in the pandas. Note that:

For the and, or and not logic operators, the corresponding operator in pandas is &, | and ~. For example:

1
2
3
4
5
6
cond1 = df['col1'] > 5
cond2 = df['col2'] > 2

b = df.loc[cond1 & cond2]

c = df.loc[ (df['col1'] > 5) & (df['col2'] > 2) ]

Note that the () symbols are necessary because of the operator precedence. For example, in the codes above, the answer is:

1
2
3
4
cond1 = world['area'] >= 300_0000
cond2 = world['population'] >= 2500_0000

return world.loc[cond1 | cond2, ['name', 'population', 'area']]

1.2 Misc

1.2.1 Rename output columns

1.2.1.a MySQL

In MySQL, rename a column is very easy. By default, if the command is

1
2
3
SELECT name, population, area
FROM World
WHERE area >= 3000000 OR population >= 25000000

Then the output columns are named name, population, and area. To rename, just use the as symbol to give a new name as output:

1
2
3
4
-- change name => Name, population => Population, area => Area
SELECT name as Name, population as Population, area as Area
FROM World
WHERE area >= 3000000 OR population >= 25000000

1.2.1.b Pandas

In pandas, we can use the pd.DataFrame.rename function to rename the columns. The input parameter columns is a dict[str, str], where key is the old name, and the value is the new name. For example in the example below, we make all names capitalized:

1
df = df.rename(columns={'name': 'Name', 'population': 'Population', 'area': 'Area'})

1.2.2 Swap output columns order

1.2.2.a MySQL

In SQL, this work is relatively easy. You only need to swap the column names in the select command:

1
2
3
4
5
-- show with A, B, C columns 
select A, B, C from table;

-- show with C, B, A columns
select C, B, A from table;

1.2.2.b Pandas

The method is relatively similar as the code in MySQL, by:

1
2
3
4
5
# show with A, B, C columns 
df.loc[:, ['A', 'B', 'C']]

# show with C, B, A columns
df.loc[:, ['C', 'B', 'A']]

1.2.3 Remove duplicate rows

1.2.3.a MySQL

It is quite easy to do that: you only need to add DISTINCT keyword. For example:

1
2
select DISTINCT student_id
from courses

1.2.3.b Pandas

The DataFrame has a method called drop_duplicates(). For example:

1
selected_views = selected_views.drop_duplicates()

More advanced usage is use parameter subset to tell which columns for identifying duplicates, by default use all of the columns. For example, the following code drop the duplicate rows whenever they have same value in column teacher_id, subject_id.

1
teacher = teacher.drop_duplicates(subset=['teacher_id', 'subject_id'])

1.2.4 Sort

1.2.4.a MySQL

In the select statement, add with order by <col> asc/desc statement. <col> indicates sort according to which column, asc means ascending sorting, desc means descending sorting.

1
2
3
select v.author_id 
from Views as v
order by id asc

To sort according multiple columns (if first column is the same, sort by second column), do it by:

1
2
3
select v.author_id 
from Views as v
order by id asc, col2 desc

1.2.4.b Pandas

Use the pd.DataFrame.sort_values() to sort according to which column(s). If we want to sort from biggest to smallest, use ascending parameter.

1
2
3
4
5
6
7
8
# sort by one column, id
selected_views = selected_views.sort_values(by=['id'])
selected_views = selected_views.sort_values(by=['id'], ascending=True)

# sort by two columns
# if first column is the same, sort by second column
selected_views = selected_views.sort_values(by=['id', 'view_date'])
selected_views = selected_views.sort_values(by=['id', 'view_date'], ascending=[True, False])