0%

Pandas & MySQL Leetcode Practice Notes 2--String operations

2. String Operations

2.1 MySQL

Filtering string in SQL is generally same as the process for filtering numeric. There are many built-in functions to help you achieve some string operation. For example, UPPER(), LOWER(), CONCAT(), SUBSTRING() .

  • length() function returns length of that columns’ entries:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
-- table is
+----+---------+
| id | content |
+----+---------+
| 1 | apple |
| 2 | banana |
| 3 | orange |
+----+---------+

-- Get len of each entry in "content" column
select length(content) from fruits;
+-----------------+
| length(content) |
+-----------------+
| 5 |
| 6 |
| 6 |
+-----------------+

-- Assign the length column to table, named "content_len":
select *, length(content) as content_len from fruits;
+----+---------+-------------+
| id | content | content_len |
+----+---------+-------------+
| 1 | apple | 5 |
| 2 | banana | 6 |
| 3 | orange | 6 |
+----+---------+-------------+
  • LIKE "pattern" tells whether a column match this pattern, so that you can filter others out. The pattern uses _ to match any ONE character, and % to match any MULTIPLE (>1) character.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
-- table is
+----+---------+
| id | content |
+----+---------+
| 1 | apple |
| 2 | banana |
| 3 | orange |
+----+---------+

-- Get a boolean column where "1" entry means content starts with "a"
+-------------------+
| content like "a%" |
+-------------------+
| 1 |
| 0 |
| 0 |
+-------------------+

-- Combine with "where" to do filtering
select * from fruits where content like "a%";
+----+---------+
| id | content |
+----+---------+
| 1 | apple |
+----+---------+
  • Sometimes, the pattern above cannot handle some complex queries. We can use regular expression here. The grammar is: <col_name> REGEXP 'pattern'. Note that this by default behaves like “match”, checking from the beginning.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
-- table is
+----+---------+
| id | content |
+----+---------+
| 1 | apple |
| 2 | banana |
| 3 | orange |
+----+---------+

-- Match all entries, begining with "a"
select * from fruits where content REGEXP "a\w*";
+----+---------+
| id | content |
+----+---------+
| 1 | apple |
+----+---------+

-- Match all entries, containing "an"
select * from fruits where content REGEXP "\w*an\w*";
+----+---------+
| id | content |
+----+---------+
| 2 | banana |
| 3 | orange |
+----+---------+

2.2 Pandas

For numeric data types, we usually directly use grammar like df['col'] > 5 to get the indexing series. For string operations, we have some similar ways to get the indexing series.

String Accessor. For each column, we can use .str to get its pandas.core.strings.accessor.StringMethods. This object has various of string operation functions, such as .replace(), .find(), .len(), .startswith()……

  • Some function, such as .len(), returns a new integer series, containing length of string for each entry:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
# df is:
id content
0 1 apple
1 2 banana
2 3 orange

# Get len of each entry in "content" column
df['content'].str.len()
0 5
1 6
2 6
Name: content, dtype: int64

# Assign the len series to df, named "content_len":
df['content_len'] = df['content'].str.len()

# df now is:
id content content_len
0 1 apple 5
1 2 banana 6
2 3 orange 6
  • Some function, like .startwith(), returns a new boolean series, so you can do indexing:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
# df is:
id content
0 1 apple
1 2 banana
2 3 orange

# Get indexing series where "content" entry starts with "a"
cond = df['content'].str.startswith('a')

# cond is:
0 True
1 False
2 False
Name: content, dtype: bool

# Do filtering
df = df.loc[cond, :]

# df is:
id content
0 1 apple
  • Regular Expression

Sometimes, we want to utilize Regex to help us match entries. We have .match() function. Note that match will exactly match from the beginning. If you just want to match any substring, use .contains instead.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
# df is:
id content
0 1 apple
1 2 banana
2 3 orange

# Match all entries, begining with "a"
cond = df['content'].str.match(r'a[a-zA-Z]*')

# Result will be:
df.loc[cond, :]
id content
0 1 apple

# Match all entries, containing "an"
cond = df['content'].str.contains(r'an')

# Result will be:
df.loc[cond, :]
id content
1 2 banana
2 3 orange