0%

Pandas & MySQL Leetcode Practice Notes 3--Data Manipulation

3. Data Manipulation

In this article, we will discuss four traditional data manipulation problems in SQL / Pandas. They are from

3.1: https://leetcode.cn/problems/nth-highest-salary/

3.2: https://leetcode.cn/problems/rank-scores/

3.3: https://leetcode.cn/problems/rearrange-products-table/

3.4: https://leetcode.cn/problems/calculate-special-bonus/

In our article, we won’t use the same data in these question. We’ll use a more simpler one to illustrate the functions.

3.1 TopK & n-th largest

TopK means selecting top k rows according to some rules from the table. n-th largest means selecting the just n-th largest row according to some rules from the table

3.1.1 MySQL

To select some first couple of rows from the table, you can use the LIMIT keyword. LIMIT M, N will select from M row (0-index), a total of N rows. See the following example and explanation.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
-- table is
+-------+-------+
| name | score |
+-------+-------+
| Alice | 80 |
| Bob | 90 |
| Cindy | 100 |
| David | 70 |
| Ella | 60 |
| Frank | 80 |
+-------+-------+

-- select from 1st row, a total of 2 rows:
select * from Scores LIMIT 1, 2;
+-------+-------+
| name | score |
+-------+-------+
| Bob | 90 |
| Cindy | 100 |
+-------+-------+

-- select total of 0 rows means nothing
select * from Scores LIMIT 1, 0;
+------+-------+
| name | score |
+------+-------+

-- if starting index > len(table), nothing would be returned
select * from Scores LIMIT 7, 1;
+------+-------+
| name | score |
+------+-------+

-- select from minux number means select from first row
select * from Scores LIMIT -17, 2;
+-------+-------+
| name | score |
+-------+-------+
| Alice | 80 |
| Bob | 90 |
+-------+-------+

Therefore, it would be quite easy to implement the TopK and n-th largest. For example, if we want to find top 2 grade students and the 2nd highest student in this table:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
-- find top 2 grade students
select * from Scores order by score desc LIMIT 0, 2;
+-------+-------+
| name | score |
+-------+-------+
| Cindy | 100 |
| Bob | 90 |
+-------+-------+

-- find the 2nd highest student (note that it is 0-index)
select * from Scores order by score desc LIMIT 1, 1;
+------+-------+
| name | score |
+------+-------+
| Bob | 90 |
+------+-------+

3.1.2 Pandas

In pandas, we have the method pd.DataFrame.nlargest(N, col_name) to find the top N rows. Note that, after executing this method, the result table is already sorted descending. Access the last one row will just be the n-th largest.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
# the df is
name score
0 Alice 80
1 Bob 90
2 Cindy 100
3 David 70
4 Ella 60
5 Frank 80

# find top 2 grade students
df.nlargest(N, 'score')
# result is:
name score
2 Cindy 100
1 Bob 90

# find the 2nd highest student (note that it is 0-index)
Scores.nlargest(N, 'score').iloc[-1]
# result is:
name Bob
score 90

3.2 rank, dense_rank

This question requires us to build a dense rank. That is, when there are rows that “score” are same, they should both appear and they should have the same score. Fortunately, we have built-in functions to help us achieve that.

3.2.1 MySQL

There is a function called DENSE_RANK(), which can be used to create a new column based on the ranking information.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
-- The table is:
+----+-------+
| id | score |
+----+-------+
| 1 | 3.5 |
| 2 | 3.65 |
| 3 | 4 |
| 4 | 3.85 |
| 5 | 4 |
| 6 | 3.65 |
+----+-------+

-- Do the ranking (the problems requires the "dense rank")
SELECT S.score, DENSE_RANK() OVER (
ORDER BY S.score DESC
) AS 'rank'
FROM
Scores S;
-- result is:
+-------+------+
| score | rank |
+-------+------+
| 4 | 1 |
| 4 | 1 |
| 3.85 | 2 |
| 3.65 | 3 |
| 3.65 | 3 |
| 3.5 | 4 |
+-------+------+

Usually, our ranking is not like this: if we have two 4.0 students, the rank of 3.85 student should be 3, not 2. The RANK() functions helps in this case:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
SELECT S.score, RANK() OVER (
ORDER BY S.score DESC
) AS 'rank'
FROM
Scores S;
-- result is:
+-------+------+
| score | rank |
+-------+------+
| 4 | 1 |
| 4 | 1 |
| 3.85 | 3 |
| 3.65 | 4 |
| 3.65 | 4 |
| 3.5 | 6 |
+-------+------+

3.2.2 Pandas

In pandas, such operation is quite easy: we have the pd.Series.rank() function. This function has multiple parameters. The most important two are method and ascending. When method is 'dense', it performs as dense_rank. When method is min, it is the common ranking (meaning that when we have multiple same score, we should take minimum rank for all of them). ascending controls whether 1 is assigned to highest or lowest.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
# scores is:
id score
0 1 3.5
1 2 3.65
2 3 4.0
3 4 3.85
4 5 4.0
5 6 3.65

# create a new column, assign the dense_rank:
scores['rank'] = scores['score'].rank(method='dense', ascending=False)
# result is:
id score rank
0 1 3.5 4.0
1 2 3.65 3.0
2 3 4.0 1.0
3 4 3.85 2.0
4 5 4.0 1.0
5 6 3.65 3.0

# create a new column, assign the common rank:
scores['rank'] = scores['score'].rank(method='min', ascending=False)
# result is:
id score rank
0 1 3.5 6.0
1 2 3.65 4.0
2 3 4.0 1.0
3 4 3.85 3.0
4 5 4.0 1.0
5 6 3.65 4.0

3.3 row to column conversion

3.3.1 MySQL

The idea is quite simple as illustrate below:

q3

First, we extract each column to form a new table corresponding to each store. Then, we concatenate (“Union”) three tables together. Note that when we are extracting the table, we should remove “null” rows.

Follow this idea, our implementation would be relatively easy:

1
2
3
4
5
select product_id, 'store1' as store, store1 as price from Products where store1 is not null
union all
select product_id, 'store2' as store, store2 as price from Products where store2 is not null
union all
select product_id, 'store3' as store, store3 as price from Products where store3 is not null

3.3.2 Pandas

In pandas, you can use the similar idea as listed in the above subsection. But Pandas provides more powerful tools, pd.melt, which helps us to convert wide (row) format into long (column) format:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
Input: 
Products table:
+------------+--------+--------+--------+
| product_id | store1 | store2 | store3 |
+------------+--------+--------+--------+
| 0 | 95 | 100 | 105 |
| 1 | 70 | null | 80 |
+------------+--------+--------+--------+
Output:
+------------+--------+-------+
| product_id | store | price |
+------------+--------+-------+
| 0 | store1 | 95 |
| 0 | store2 | 100 |
| 0 | store3 | 105 |
| 1 | store1 | 70 |
| 1 | store3 | 80 |
+------------+--------+-------+

The id_vars is the identifier column (will not be changed), value_vars is the columns to be converted. var_name is the new column name from converted columns.

1
2
3
4
5
6
df = products.melt(
id_vars='product_id',
value_vars=['store1', 'store2', 'store3'],
var_name='store',
value_name='price'
)

3.4 Conditional operator

The question requires us to set each value according to a predicate. Such basic requirement is well-supported by the MySQL and Pandas.

3.4.1 MySQL

In MySQL, if(cond, true_value, false_value) is a good helper. This should appear after SELECT.

1
2
3
4
5
SELECT 
employee_id,
IF(employee_id % 2 = 1 AND name NOT REGEXP '^M', salary, 0) AS bonus
FROM
employees

The second column, bonus, entry would be the same value of salary if the condition is satisfied, otherwise it is 0. For the condition, the second is the grammar of regular expression, asserting true only when name does not start with M. You may also use name NOT LIKE 'M%' for the second condition.

3.4.2 Pandas

In pandas, we can use the pd.DataFrame.apply function to achieve this. The apply function accepts a function, f(x), which x is the “current” row. Based on the information on that row, it returns a value. For example, in this question, the condition is “the employee is an odd number and the employee’s name does not start with the character 'M'. “

1
2
3
4
5
6
7
8
9
# judge function
def f(x):
if x['employee_id'] % 2 == 1 and not x['name'].startswith('M'):
return x['salary']
else:
return 0

# Apply it to the dataframe for the new column:
employees['bonus'] = employees.apply(f, axis=1)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
# Sample input:
| employee_id | name | salary |
| ----------- | ------- | ------ |
| 2 | Meir | 3000 |
| 3 | Michael | 3800 |
| 7 | Addilyn | 7400 |
| 8 | Juan | 6100 |
| 9 | Kannon | 7700 |

# Sample result of employees['bonus']:
0 0
1 0
2 7400
3 0
4 7700
Name: bonus, dtype: int64

Which satisfies the requirement.