Filtering string in SQL is generally same as the process for filtering numeric. There are many built-in functions to help you achieve some string operation. For example, UPPER(), LOWER(), CONCAT(), SUBSTRING() .
length() function returns length of that columns’ entries:
-- table is +----+---------+ | id | content | +----+---------+ |1| apple | |2| banana | |3| orange | +----+---------+
-- Get len of each entry in "content" column select length(content) from fruits; +-----------------+ | length(content) | +-----------------+ |5| |6| |6| +-----------------+
-- Assign the length column to table, named "content_len": select*, length(content) as content_len from fruits; +----+---------+-------------+ | id | content | content_len | +----+---------+-------------+ |1| apple |5| |2| banana |6| |3| orange |6| +----+---------+-------------+
LIKE "pattern" tells whether a column match this pattern, so that you can filter others out. The pattern uses _ to match any ONE character, and % to match any MULTIPLE (>1) character.
-- table is +----+---------+ | id | content | +----+---------+ |1| apple | |2| banana | |3| orange | +----+---------+
-- Get a boolean column where "1" entry means content starts with "a" +-------------------+ | content like "a%" | +-------------------+ |1| |0| |0| +-------------------+
-- Combine with "where" to do filtering select*from fruits where content like "a%"; +----+---------+ | id | content | +----+---------+ |1| apple | +----+---------+
Sometimes, the pattern above cannot handle some complex queries. We can use regular expression here. The grammar is: <col_name> REGEXP 'pattern'. Note that this by default behaves like “match”, checking from the beginning.
-- table is +----+---------+ | id | content | +----+---------+ |1| apple | |2| banana | |3| orange | +----+---------+
-- Match all entries, begining with "a" select*from fruits where content REGEXP "a\w*"; +----+---------+ | id | content | +----+---------+ |1| apple | +----+---------+
-- Match all entries, containing "an" select*from fruits where content REGEXP "\w*an\w*"; +----+---------+ | id | content | +----+---------+ |2| banana | |3| orange | +----+---------+
2.2 Pandas
For numeric data types, we usually directly use grammar like df['col'] > 5 to get the indexing series. For string operations, we have some similar ways to get the indexing series.
String Accessor. For each column, we can use .str to get its pandas.core.strings.accessor.StringMethods. This object has various of string operation functions, such as .replace(), .find(), .len(), .startswith()……
Some function, such as .len(), returns a new integer series, containing length of string for each entry:
# Get indexing series where "content" entry starts with "a" cond = df['content'].str.startswith('a')
# cond is: 0True 1False 2False Name: content, dtype: bool
# Do filtering df = df.loc[cond, :]
# df is: id content 01 apple
Regular Expression
Sometimes, we want to utilize Regex to help us match entries. We have .match() function. Note that match will exactly match from the beginning. If you just want to match any substring, use .contains instead.