I’m @I-am-Future, currently I am an undergraduate of computer science in CUHK-Shenzhen, interested in deep learning, robotics and computer vision. I love computer science, from low level microprocessors and assembly to high level applications, such as deep learning, computer graphics, etc.
Welcome to my blog! This blog is established on Mar. 21st, 2023. The aim is to build such a blog website which can be used to share my programming / other knowledges. There are few categories in my blog, you can access them from the button at the top. These include:
Series of knowledge (such as “Deep Learning”, “Pandas-SQL”)
Some problems I had met (“Problem Solving”)
Some package/tools Quick Reference Handbook (“QRH”), etc.
Also, you can use the search button at the top for the topics you like. All of them are very interesting.
The native screen recording in Windows/MacOS has very high quality, so that the video are usually too large to share with others. For example, a 2-hour screen recording on MacOS consumes 14 GB memory!
However, we noticed that the screen recording from zoom/tencent meeting is very light-weight. A 2-hour screen recording costs only hundreds of megabytes!
So, if we want to share the screen recording from native system to others, can we convert it to the one like the zoom? It has smaller file size and the video is quality is enough for watching.
Observation & Analysis
We get two videos, from Tencent Meeting (tencent.mp4) and Windows Native recording (windows.mp4) respectively:
tencent.mp4
windows.mp4
Duration
6.7s
11.6s
File Size
261KB
11675KB
We found that the size of tencent.mp4 is far smaller (~20x) than the windows.mp4! Let’s analysis some details by the ffprobe provided by ffmpeg:
1 2 3 4
# Description: Get the information of the video file using ffprobe ffprobe -v error -show_format -show_streams tencent.mp4
tencent.mp4 uses the H.264 High Profile, which is suitable for high resolution and high compression efficiency.
windows.mp4 uses the H.264 Main Profile, which typically has slightly lower compression efficiency compared to the High Profile.
Frame Rate:
tencent.mp4 has a relatively low frame rate (approximately 15 fps), which helps to reduce file size.
windows.mp4 has a higher frame rate (60 fps), which significantly increases the file size.
Bit Rate:
tencent.mp4 has a very low bit rate (184243 bps), which greatly reduces file size but may affect video quality.
windows.mp4 has a very high bit rate (7457076 bps), which ensures higher video quality but also results in a much larger file size.
Audio Settings:
tencent.mp4 has mono audio and a lower sampling rate (32000 Hz), which helps to reduce file size.
windows.mp4 has stereo audio and a higher sampling rate (48000 Hz), which increases file size.
Solution
From the analysis above, we come up the plan to reduce the video file size. We use ffmpeg to convert the videos. The following Python script helps making the ffmpeg commands:
To convert the file, we apply the high profile, a low crf, a low bitrate and framerate. You may export the video and check whether the video is over-compressed.
Apple Silicon Acceleration
For Apple Silicon users, we may apply the Apple’s own H264 encoding library for acceleration. The command may look like this:
This blog is used to record my experience in installing Isaac Gym environment on the Linux Server. I’ll also write down all problems I met with troubleshooting.
Prerequisite
The followings are requirements provided by the project pages:
Ubuntu 18.04, or 20.04. Python 3.6, 3.7, or 3.8 Minimum recommended NVIDIA driver version: 470.74 (470 or above required) Minimum required hardware: NVIDIA Pascal or later GPU with at least 8gb of VRAM
Possible Problems
Don’t use WSL (Windows Subsystem on Linux)!!! The Nvidia driver on WSL may not support Vulkan well thus causing failure!
Download & Install
Register and download from https://developer.nvidia.com/isaac-gym/. Press “join now” first, and then fill the form, and finally you are ready to download it.
Put the download file, probably named IsaacGym_Preview_4_Package.tar.gz on a proper folder. Then unzip it:
1
tar -xvf IsaacGym_Preview_4_Package.tar.gz
Go to the unzipped folder, and execute the create_conda_env_rlgpu.sh. This would automatically create a new environment, installed with all packages.
1
./create_conda_env_rlgpu.sh
Possible Problems
The shell may stuck at “Solving…” stage. In that case, please follow step 4 below.
[Alternative ways for step 3] If step 3 failed, consider the following method:
1 2 3 4 5 6 7 8
# Create an env manually. conda create -n rlgpu python=3.8
# Go the the subdirectory, cd python/
# Install manually. pip install -e .
The above code would also install the isaacgym, with necessary dependencies (pytorch, numpy, etc.)
As in my machine, installing with latest PyTorch (2.2.0) and CUDA (12.1) is fine. It doesn’t require a former version.
Test demo
To run the test demo, cd to python/examples, and run:
1
python joint_monkey.py
Possible Problem 1
When importing isaacgym, returns error ImportError: libpython3.8.so.1.0: cannot open shared object file: No such file or directory
Solution 1: This is because the module failed to locate the static link library. We can add the corresponding path to the LD_LIBRARY_PATH so that the python could locate the library.
1 2
# in ~/.bashrc export LD_LIBRARY_PATH=/<miniconda_root>/envs/<env_name>/lib:$LD_LIBRARY_PATH
Possible Problem 2
After executing python joint_monkey.py, it returns error [Error] [carb.windowing-glfw.plugin] GLFW initialization failed.
Solution 2: The test programs requires GUI. In the headless server, you may use X11 forwarding (e.g., MobaXterm) or xvfb virtual screen.
Possible Problem 3
After executing python joint_monkey.py, it returns error Segment Fault
Solution 3: There are no direct solutions on the Internet. After my experience, I can give you some possible reasons.
Newer PyTorch, CUDA version: It should not be a big problem. As for me, I am using PyTorch (2.2.0) and CUDA (12.1), the program stills works!
Don’t use WSL: By 2/1/2024, the Vulkan has not fully supported on WSL. Therefore, the program may occur such problems. Use Linux machine instead.
Vulkan Installation: Please make sure the vulkan works fine on the server. You may visit here for further information.
In this blog, we’ll learn pytorch-lightning framework with some simple example codes. I believe the code will significantly help you understand it. The official doc is at here.
1 2
# Before everything starts: import pytorch_lightning as pl
To train a deep learning model, we need to code the following components: data, model, training&testing procedures, and hyper-parameter configs. In pytorch-lightning, we have different classes handling them.
Data: we can define a subclass of pl.LightningDataModule to implement procedures that initialize the Dataset and DataLoader.
Model: implement the model just like what you did without pytorch-lightning – a subclass of nn.Module.
Training & Testing procedures: we can define a subclass of pl.LightningModule to implement the procedures.
Configs (pytorch-lightning have utilities for CLI arguments. But here we use another package called configargparse, which support both config file and CLI arguments)
1. Data
We can define a pl.LightningDataModule which implements procedures to initialize the Dataset and DataLoader.
Main components of pl.LightningDataModule:
__init__(): The constructor. You can save some configurations (hyper-parameters) to the class here.
setup(): Will be called before fit(), validate(), test(), and predict(). You can initialize the datasets as class attributes here. Note that it has an argument stage, to indicate what stage is it (Because we may create different datasets for different stages).
prepare_data(): Will be called only on Master process. In distribute training, setup() would be called on all processes. If you only want to only do one thing (e.g., downloading data), please implement it in the prepare_data().
train_dataloader(), val_dataloader(), test_dataloader(), predict_dataloader(). Will be called during the corresponding phase to get the DataLoader. It should return a DataLoader for that stage.
Here’s an Example. You can check detailed descriptions of each function at the function docstring:
# DataModule classMyDataModule(pl.LightningDataModule): def__init__(self, data_dir='./', batch_size=32): """ Initializes the data module with the some given config arguments. This constructor sets up the initial configurations like data directory, batch size, and transformations to be applied to the data. Called when an instance of MyDataModule is created. """ super().__init__() self.data_dir = data_dir self.batch_size = batch_size self.transform = transforms.Compose([ transforms.ToTensor(), transforms.Normalize((0.5,), (0.5,)) ]) defsetup(self, stage=None): """ Prepares the data for use. It is automatically called at the beginning of fit, validate, test, and predict, or when manually calling `trainer.setup`. This method is responsible for setting up internal datasets (e.g., splitting the dataset into training, validation, and testing sets, etc.) `stage` can be used to differentiate between fit, validate, and test stages. """ # Assign train/val datasets for use in dataloaders if stage == "fit": mnist_full = datasets.MNIST(self.data_dir, train=True, transform=self.transform) self.mnist_train, self.mnist_val = random_split( mnist_full, [55000, 5000], generator=torch.Generator().manual_seed(42) ) # Assign test dataset for use in dataloader(s) if stage == "test": self.mnist_test = datasets.MNIST(self.data_dir, train=False, transform=self.transform) if stage == "predict": self.mnist_predict = datasets.MNIST(self.data_dir, train=False, transform=self.transform) deftrain_dataloader(self): """ Creates and returns the DataLoader for the training dataset. This method is used to load the training data and is automatically called during the training phase. Returns a DataLoader instance for the training data. """ return DataLoader(self.mnist_train, batch_size=self.batch_size) defval_dataloader(self): """ Creates and returns the DataLoader for the validation dataset. This method is used to load the validation data and is automatically called during the validation phase. Returns a DataLoader instance for the validation data. """ return DataLoader(self.mnist_val, batch_size=self.batch_size, shuffle=False) deftest_dataloader(self): """ Creates and returns the DataLoader for the testing dataset. This method is used to load the test data and is automatically called during the testing phase. Returns a DataLoader instance for the test data. """ return DataLoader(self.mnist_test, batch_size=self.batch_size, shuffle=False)
2. Model
You may directly implement the model in pl.LightningModule, but for clarity, I recommend you to implement your model as a separate nn.Module subclass, and then create the instances in the pl.LightningModule.
1 2 3 4 5 6 7 8 9 10 11 12 13
# You can do like this... classMyLightningModule(pl.LightningModule): def__init__(self, hidden_dim=64): super().__init__() self.hidden_dim = hidden_dim self.layer1 = torch.nn.Linear(28 * 28, self.hidden_dim) self.layer2 = torch.nn.Linear(self.hidden_dim, 10)
defforward(self, x): x = x.view(x.size(0), -1) x = torch.relu(self.layer1(x)) x = self.layer2(x) return x
1 2 3 4 5 6 7 8 9 10 11
# But I recommend you to do like this: classModel(nn.Module): ......
We can define a pl.LightningModule which implements procedures for training, validating, testing and predicting.
Main components of pl.LightningModule:
__init__(): The constructor. You can save some configurations (hyper-parameters) to the class here.
forward(): Make it behave like a nn.Module. You can implement the forward procedures here (See Sec. 2 for the example).
configure_optimizers(): Defines the training optimizers. See here for some advanced usage.
train_step(), validation_step(), test_step(), predict_step(): Will be called during the corresponding phase, to perform operations for a batch. It has an argument batch_idx, which indicates the batch index of current epoch. The train_step() should return the loss (a scalar or a dict), and the module will automatically perform backward propagation.
on_train_epoch_start(), on_train_epoch_end(), on_XXX_epoch_start(), on_XXX_epoch_end(): Will be called at the corresponding start/end point. You can perform some epoch-level operations here. (See below example, where we collect all training batch losses to calculate the average loss on an epoch)
Use self.log(tag, value) to log the batch’s “tag” indicator with value “value” to the tensorboard / log file.
Here’s an Example. You can check detailed descriptions of each function at the function docstring:
classMyLightningModule(pl.LightningModule): def__init__(self, hidden_dim=64): """ Initializes the LightningModule with two linear layers. This constructor sets up the neural network architecture for the module. Called when an instance of MyLightningModule is created. """ super().__init__() self.hidden_dim = hidden_dim self.layer1 = torch.nn.Linear(28 * 28, self.hidden_dim) self.layer2 = torch.nn.Linear(self.hidden_dim, 10) self.training_results = [] # To store training loss in each epoch
defforward(self, x): """ Defines the forward pass of the model. It is automatically called when the model is used to make predictions. 'x' is the input data. The method reshapes the input, applies a ReLU activation after the first linear layer, and then passes it through the second linear layer. """ x = x.view(x.size(0), -1) x = torch.relu(self.layer1(x)) x = self.layer2(x) return x
deftraining_step(self, batch, batch_idx): """ Defines a single step in the training loop. It is automatically called for each batch of data during training. 'batch' contains the input data and labels, and 'batch_idx' is the index of the current batch. The method computes the model's loss using cross-entropy. """ x, y = batch logits = self(x) loss = F.cross_entropy(logits, y) self.log('train_loss', loss) self.training_results.append(loss) return loss
defvalidation_step(self, batch, batch_idx): ...
deftest_step(self, batch, batch_idx): ...
defconfigure_optimizers(self): """ Configures the optimizers used for training. This method is automatically called to configure the optimization algorithm. Returns an optimizer, in this case, Adam, with a set learning rate. See advanced usage in official docs. """ return torch.optim.Adam(self.parameters(), lr=0.001)
# Epoch-wise procedures defon_train_epoch_start(self): """ Called at the start of every training epoch. Can be used to perform actions specific to the beginning of each training epoch. """ print("Training Epoch Start")
defon_train_epoch_end(self): """ Called at the end of every training epoch. Can be used to perform actions specific to the end of each training epoch. """ print("Training Epoch End") print("Training Loss: ", sum(self.training_results) / len(self.training_results)) self.training_results.clear()
In some cases, when we import VM into the “Oracle VM VirtualBox Manager”, the network is not well configured, and we cannot connect it by SSH/other things on the host machine. The default network setting may looks like this:
The Adapter 1 (an NAT adapter):
The Adapter 2-4 (Not Enabled):
After typing ip a in the terminal of the VM (assume it is a Linux machine), the result looks like below:
1 2 3 4 5 6 7 8 9 10 11 12
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000 link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 inet 127.0.0.1/8 scope host lo valid_lft forever preferred_lft forever inet6 ::1/128 scope host valid_lft forever preferred_lft forever 2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000 link/ether 08:00:27:b3:8d:9b brd ff:ff:ff:ff:ff:ff inet 10.0.2.15/24 brd 10.0.2.255 scope global dynamic noprefixroute enp0s3 valid_lft 86376sec preferred_lft 86376sec inet6 fe80::6dea:8795:c7b1:a1a5/64 scope link noprefixroute valid_lft forever preferred_lft forever
There are no available network adapter to the host.
Method 1: By Host-only Adapter
Important!
Pros: It has a stable IP address, with all ports available. (e.g., setup different services on different ports, 22, 3389, ...)
Cons: The VM cannot be accessed from machine outside of the host (e.g., we cannot ssh host_name@host_ip to connect to the VM from other machines than host), unless you perform some port forwarding strategies on the host.
Steps:
To create a network connection from VM to host, we can set the Adapter 2 like below (Use a “Host-only Adapter”):
After typing ip a in the terminal of the VM (assume it is a Linux machine) NOW, the result looks like below:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000 link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 inet 127.0.0.1/8 scope host lo valid_lft forever preferred_lft forever inet6 ::1/128 scope host valid_lft forever preferred_lft forever 2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000 link/ether 08:00:27:b3:8d:9b brd ff:ff:ff:ff:ff:ff inet 10.0.2.15/24 brd 10.0.2.255 scope global dynamic noprefixroute enp0s3 valid_lft 86337sec preferred_lft 86337sec inet6 fe80::6dea:8795:c7b1:a1a5/64 scope link noprefixroute valid_lft forever preferred_lft forever 3: enp0s8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000 link/ether 08:00:27:85:54:b0 brd ff:ff:ff:ff:ff:ff inet 192.168.56.101/24 brd 192.168.56.255 scope global dynamic noprefixroute enp0s8 valid_lft 537sec preferred_lft 537sec inet6 fe80::a598:692:2336:67d9/64 scope link noprefixroute valid_lft forever preferred_lft forever
The third network adapter, with an ip address 192.168.56.101, is the ip address of the “VirtualBox Host-Only Ethernet Adapter”. On the host machine, we can connect it by SSH fluently:
1
ssh ubuntu@192.168.56.101
By so far, you can connect the machine via SSH well. You can also connect to other services by ip address 192.168.56.101 on the host machine.
Method 2: Forward Port to Host
Important!
Pros: You can access the VM outside of the host machine, from a different machine.
Cons: Only forwarded ports available. Ports may conflict with others.
Steps:
Go to the Adapter 1, click the “Advanced” button to expand the menu. Click the “Port Forwarding”.
In the “Port Forwarding Rules”, Set up new rules. The “Host Port” is the port number on the host machine. The “Guest Port” is the corresponding port number on the VM. In this example, I forwarded port 4016 on host to the port 22 on the VM.
After that, you can connect to the VM on the host machine. Type:
1
ssh -p 4016 ubuntu@127.0.0.1
You can also connect to the VM outside of the host machine. Assume in the local network, my ip address is 10.31.38.94. So from the computer in the same local network, type:
1
ssh -p 4016 ubuntu@10.31.38.94
And everything should go fine. If something goes run, please check whether we have firewall permissions. If not, you can set a firewall rule for port 4016 (The port number you forwarded).
Conclusion
You can setup both Method 1 & Method 2!! Then you can experience all of these benefits (connect from other machines, all ports available on local machines, etc.) and avoid the disadvantages.
We usually use WSL and docker container for development. In most cases, when we install the WSL and container, we may need setup the SSH (secure shell) configuration by ourselves. This article shares my experience in setting up SSH in these environments.
1. Install & Config
Step 1: Install (Optional): If your the system has no ssh installed, you may use the package manager (e.g., yum, apk, apt based on which operating system you are using). E.g.,
(Optional): If you haven’t set the password yet, set one first.
1
passwd <user_name>
set up configs in sshd_config
Use any text editor, like vim, to modify the /etc/ssh/sshd_config if needed. Some important configs are:
1 2 3 4 5 6 7 8
# indicates which port listening to Port 22 # Use public key authentication PubkeyAuthentication yes # Use password authentication PasswordAuthentication yes # Use X11 forwarding to forward a display in the remote shell X11Forwarding yes
Use ssh -p <port> <username>@<ip_address> to start a SSH connection.
2.1 WSL
<port>: Unless specified in the /etc/ssh/sshd_config, the default port number is 22.
username: The one in the WSL.
The WSL is an independently installed operating system running on your machine, so it has a different ip address. To get the ip address, you can use the following command in the host’s powershell to get the <ip_address>:
<port>: When configuring the container, we usually forward the container’s 22 port to one in the host machine. Use that port number at the host machine.
<username>: Unless create other user, the container has only one default user, root.
ip_address: Since we have port forwarding, the ip_address is the host machine, 127.0.0.1 (or equivalently, the ip address from command ipconfig (Windows), or ifconfig/iwconfig (Linux))
3 Remote Visiting
In some cases, we may want to visit the WSL/docker on this host machine from other machines in the same local network.
3.1 WSL
Host Port Forwarding. Since WSL behaves like a separate machine, it has its own ip address. We should perform port forwarding so that the remote connection to this host machine can be forwarded to the WSL. Run the following command to set port forwarding through Powershell in admin privilage.
Firewall rules. We should set firewall permission rules, so that the connection would not be refused by the host machine. You may search additional online materials for how to allow TCP connection to the <listenport> you defined above, e.g., here.
Connect. Then, we can access by ssh -p 2222 <wsl_username>@<host_ipaddr>. The <host_ipaddr can be obtained by command ipconfig (Windows), or ifconfig/iwconfig (Linux).
Additional note: if something goes run, use netsh interface portproxy reset to reset the port forwarding.
3.2 Container
Connect by ssh -p <port> <wsl_username>@<host_ipaddr>, where:
<port> is the forwarded port number, and
<host_ipaddr> is the host’s ip address in the local network, obtained by command ipconfig (Windows), or ifconfig/iwconfig (Linux).
Image: just like OS image, can be used to create a container.
Create image by:
.tar file, with docker load -i <tar_file> command.
Dockerfile script, with docker build -t command.
List images: Use docker images to show all available images on the device.
Container
Container: An instance running the OS. built from image.
Create from image by docker run command. You can pass in some configuration. (Note some config are fixed after creation. Be careful!) Some common parameters are shown below:
1
docker run -it -d -p <localport>:<dockerport> --gpus all --name <name> --volume </path/host/directory>:</path/container/directory> <IMAGE ID>
-d means “detach”, running the docker container in the backend.
-p <hostport>:<dockerport> means forward the docker’s port <dockerport> to the host machine’s port <hostport>. Usually, we forward 22 port from container to e.g., 2222 port in host for ssh connection.
--gpus all means all host’s GPU are visible to the container.
--name <name> indicates the name of the container. It will be shown in the docker ps and other places.
--volume </path/host/directory>:</path/container/directory> indicates which host directory are shared with container’s directory, therefore, you can access it in both the container and host.
<IMAGE ID> is the id of the image to be built.
List containers: Use docker ps -a to check status of all containers. Use docker ps to check all running containers.
Three ways to interact with the container:
Docker desktop’s shell
docker exec -it <container id> bash
SSH, if you have set so
Start the container: By docker start <CONTAINER ID>, or do it in docker desktop.
Stop the container: By docker stop <CONTAINER ID>, or do it in docker desktop.
More info: please see the official cheat sheet at here.
In our article, we won’t use the same data in these question. We’ll use a more simpler one to illustrate the functions.
3.1 TopK & n-th largest
TopK means selecting top k rows according to some rules from the table. n-th largest means selecting the just n-th largest row according to some rules from the table
3.1.1 MySQL
To select some first couple of rows from the table, you can use the LIMIT keyword. LIMIT M, N will select from M row (0-index), a total of N rows. See the following example and explanation.
-- table is +-------+-------+ | name | score | +-------+-------+ | Alice |80| | Bob |90| | Cindy |100| | David |70| | Ella |60| | Frank |80| +-------+-------+
-- select from 1st row, a total of 2 rows: select*from Scores LIMIT 1, 2; +-------+-------+ | name | score | +-------+-------+ | Bob |90| | Cindy |100| +-------+-------+
-- select total of 0 rows means nothing select*from Scores LIMIT 1, 0; +------+-------+ | name | score | +------+-------+
-- if starting index > len(table), nothing would be returned select*from Scores LIMIT 7, 1; +------+-------+ | name | score | +------+-------+
-- select from minux number means select from first row select*from Scores LIMIT -17, 2; +-------+-------+ | name | score | +-------+-------+ | Alice |80| | Bob |90| +-------+-------+
Therefore, it would be quite easy to implement the TopK and n-th largest. For example, if we want to find top 2 grade students and the 2nd highest student in this table:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
-- find top 2 grade students select*from Scores orderby score desc LIMIT 0, 2; +-------+-------+ | name | score | +-------+-------+ | Cindy |100| | Bob |90| +-------+-------+
-- find the 2nd highest student (note that it is 0-index) select*from Scores orderby score desc LIMIT 1, 1; +------+-------+ | name | score | +------+-------+ | Bob |90| +------+-------+
3.1.2 Pandas
In pandas, we have the method pd.DataFrame.nlargest(N, col_name) to find the top N rows. Note that, after executing this method, the result table is already sorted descending. Access the last one row will just be the n-th largest.
# the df is name score 0 Alice 80 1 Bob 90 2 Cindy 100 3 David 70 4 Ella 60 5 Frank 80
# find top 2 grade students df.nlargest(N, 'score') # result is: name score 2 Cindy 100 1 Bob 90
# find the 2nd highest student (note that it is 0-index) Scores.nlargest(N, 'score').iloc[-1] # result is: name Bob score 90
3.2 rank, dense_rank
This question requires us to build a dense rank. That is, when there are rows that “score” are same, they should both appear and they should have the same score. Fortunately, we have built-in functions to help us achieve that.
3.2.1 MySQL
There is a function called DENSE_RANK(), which can be used to create a new column based on the ranking information.
-- The table is: +----+-------+ | id | score | +----+-------+ |1|3.5| |2|3.65| |3|4| |4|3.85| |5|4| |6|3.65| +----+-------+
-- Do the ranking (the problems requires the "dense rank") SELECT S.score, DENSE_RANK() OVER ( ORDERBY S.score DESC ) AS'rank' FROM Scores S; -- result is: +-------+------+ | score | rank | +-------+------+ |4|1| |4|1| |3.85|2| |3.65|3| |3.65|3| |3.5|4| +-------+------+
Usually, our ranking is not like this: if we have two 4.0 students, the rank of 3.85 student should be 3, not 2. The RANK() functions helps in this case:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
SELECT S.score, RANK() OVER ( ORDERBY S.score DESC ) AS'rank' FROM Scores S; -- result is: +-------+------+ | score | rank | +-------+------+ |4|1| |4|1| |3.85|3| |3.65|4| |3.65|4| |3.5|6| +-------+------+
3.2.2 Pandas
In pandas, such operation is quite easy: we have the pd.Series.rank() function. This function has multiple parameters. The most important two are method and ascending. When method is 'dense', it performs as dense_rank. When method is min, it is the common ranking (meaning that when we have multiple same score, we should take minimum rank for all of them). ascending controls whether 1 is assigned to highest or lowest.
# scores is: id score 013.5 123.65 234.0 343.85 454.0 563.65
# create a new column, assign the dense_rank: scores['rank'] = scores['score'].rank(method='dense', ascending=False) # result is: id score rank 013.54.0 123.653.0 234.01.0 343.852.0 454.01.0 563.653.0
# create a new column, assign the common rank: scores['rank'] = scores['score'].rank(method='min', ascending=False) # result is: id score rank 013.56.0 123.654.0 234.01.0 343.853.0 454.01.0 563.654.0
3.3 row to column conversion
3.3.1 MySQL
The idea is quite simple as illustrate below:
First, we extract each column to form a new table corresponding to each store. Then, we concatenate (“Union”) three tables together. Note that when we are extracting the table, we should remove “null” rows.
Follow this idea, our implementation would be relatively easy:
1 2 3 4 5
select product_id, 'store1'as store, store1 as price from Products where store1 isnotnull unionall select product_id, 'store2'as store, store2 as price from Products where store2 isnotnull unionall select product_id, 'store3'as store, store3 as price from Products where store3 isnotnull
3.3.2 Pandas
In pandas, you can use the similar idea as listed in the above subsection. But Pandas provides more powerful tools, pd.melt, which helps us to convert wide (row) format into long (column) format:
The id_vars is the identifier column (will not be changed), value_vars is the columns to be converted. var_name is the new column name from converted columns.
The question requires us to set each value according to a predicate. Such basic requirement is well-supported by the MySQL and Pandas.
3.4.1 MySQL
In MySQL, if(cond, true_value, false_value) is a good helper. This should appear after SELECT.
1 2 3 4 5
SELECT employee_id, IF(employee_id %2=1AND name NOT REGEXP '^M', salary, 0) AS bonus FROM employees
The second column, bonus, entry would be the same value of salary if the condition is satisfied, otherwise it is 0. For the condition, the second is the grammar of regular expression, asserting true only when name does not start with M. You may also use name NOT LIKE 'M%' for the second condition.
3.4.2 Pandas
In pandas, we can use the pd.DataFrame.apply function to achieve this. The apply function accepts a function, f(x), which x is the “current” row. Based on the information on that row, it returns a value. For example, in this question, the condition is “the employee is an odd number and the employee’s name does not start with the character'M'. “
1 2 3 4 5 6 7 8 9
# judge function deff(x): if x['employee_id'] % 2 == 1andnot x['name'].startswith('M'): return x['salary'] else: return0 # Apply it to the dataframe for the new column: employees['bonus'] = employees.apply(f, axis=1)
Filtering string in SQL is generally same as the process for filtering numeric. There are many built-in functions to help you achieve some string operation. For example, UPPER(), LOWER(), CONCAT(), SUBSTRING() .
length() function returns length of that columns’ entries:
-- table is +----+---------+ | id | content | +----+---------+ |1| apple | |2| banana | |3| orange | +----+---------+
-- Get len of each entry in "content" column select length(content) from fruits; +-----------------+ | length(content) | +-----------------+ |5| |6| |6| +-----------------+
-- Assign the length column to table, named "content_len": select*, length(content) as content_len from fruits; +----+---------+-------------+ | id | content | content_len | +----+---------+-------------+ |1| apple |5| |2| banana |6| |3| orange |6| +----+---------+-------------+
LIKE "pattern" tells whether a column match this pattern, so that you can filter others out. The pattern uses _ to match any ONE character, and % to match any MULTIPLE (>1) character.
-- table is +----+---------+ | id | content | +----+---------+ |1| apple | |2| banana | |3| orange | +----+---------+
-- Get a boolean column where "1" entry means content starts with "a" +-------------------+ | content like "a%" | +-------------------+ |1| |0| |0| +-------------------+
-- Combine with "where" to do filtering select*from fruits where content like "a%"; +----+---------+ | id | content | +----+---------+ |1| apple | +----+---------+
Sometimes, the pattern above cannot handle some complex queries. We can use regular expression here. The grammar is: <col_name> REGEXP 'pattern'. Note that this by default behaves like “match”, checking from the beginning.
-- table is +----+---------+ | id | content | +----+---------+ |1| apple | |2| banana | |3| orange | +----+---------+
-- Match all entries, begining with "a" select*from fruits where content REGEXP "a\w*"; +----+---------+ | id | content | +----+---------+ |1| apple | +----+---------+
-- Match all entries, containing "an" select*from fruits where content REGEXP "\w*an\w*"; +----+---------+ | id | content | +----+---------+ |2| banana | |3| orange | +----+---------+
2.2 Pandas
For numeric data types, we usually directly use grammar like df['col'] > 5 to get the indexing series. For string operations, we have some similar ways to get the indexing series.
String Accessor. For each column, we can use .str to get its pandas.core.strings.accessor.StringMethods. This object has various of string operation functions, such as .replace(), .find(), .len(), .startswith()……
Some function, such as .len(), returns a new integer series, containing length of string for each entry:
# Get indexing series where "content" entry starts with "a" cond = df['content'].str.startswith('a')
# cond is: 0True 1False 2False Name: content, dtype: bool
# Do filtering df = df.loc[cond, :]
# df is: id content 01 apple
Regular Expression
Sometimes, we want to utilize Regex to help us match entries. We have .match() function. Note that match will exactly match from the beginning. If you just want to match any substring, use .contains instead.
There are 30 questions to practice Python Pandas and MySQL on https://leetcode.cn/studyplan/30-days-of-pandas/ . These questions basically practice the fundamental skills. Pandas and MySQL are similar in some aspects, and one can write the same functional code in two languages. I’ll practice my skill with these questions, and write notes about the equivalent operations in two languages.
Accordingly, there are six sections in this section (so will use 6 articles in total):
Typical condition filter work is asking us to do like this:
Filter the big country in World scheme. A country is big if:
it has an area of at least three million (i.e., 3000000 km2), or
it has a population of at least twenty-five million (i.e., 25000000).
We’ll use this example to illustrate the point.
1.1.a MySQL
A typical condition filter in SQL looks like this:
1 2 3 4 5
select<col_names> from<table_name> where ( <condition predicate> )
SQL allows the use of the logical connectives and, or, and not.
The operands of the logical connectives can be expressions involving the comparison operators <, <=, >, >=, =, and <> (Note that, equal symbol is one =, but not ==!)
For example, in the codes above, the answer is
1 2 3
SELECT name, population, area FROM World WHERE area >=3000000OR population >=25000000
1.1.b Pandas
Filter operations in pandas is a little bit difference. We should know two basics things first:
Condition result in Index Series
If we want to get all data that are greater than 5 in df‘s col1 column, the code and result is as below:
As you can see, the result of df['col1'] > 5 is a bool Series with len(df) size. The entry is True only when the corresponding entry at col1 satisfies the condition.
Indexing by the Index Series
We can pass this bool index series in to the df.loc[]. Then, only the rows with the condition satisfied are displayed:
1 2 3 4 5 6 7 8 9 10 11
a = df.loc[df['col1'] > 5] print(a) # OUTPUT ''' name col1 col2 5 f 6 5 6 g 7 4 7 h 8 3 8 i 9 2 9 j 10 1 '''
By such method, we can do the filtering in the pandas. Note that:
For the and, or and not logic operators, the corresponding operator in pandas is &, | and ~. For example:
1 2 3 4 5 6
cond1 = df['col1'] > 5 cond2 = df['col2'] > 2
b = df.loc[cond1 & cond2]
c = df.loc[ (df['col1'] > 5) & (df['col2'] > 2) ]
Note that the () symbols are necessary because of the operator precedence. For example, in the codes above, the answer is:
In MySQL, rename a column is very easy. By default, if the command is
1 2 3
SELECT name, population, area FROM World WHERE area >=3000000OR population >=25000000
Then the output columns are named name, population, and area. To rename, just use the as symbol to give a new name as output:
1 2 3 4
-- change name => Name, population => Population, area => Area SELECT name as Name, population as Population, area as Area FROM World WHERE area >=3000000OR population >=25000000
1.2.1.b Pandas
In pandas, we can use the pd.DataFrame.rename function to rename the columns. The input parameter columns is a dict[str, str], where key is the old name, and the value is the new name. For example in the example below, we make all names capitalized:
In SQL, this work is relatively easy. You only need to swap the column names in the select command:
1 2 3 4 5
-- show with A, B, C columns select A, B, C fromtable;
-- show with C, B, A columns select C, B, A fromtable;
1.2.2.b Pandas
The method is relatively similar as the code in MySQL, by:
1 2 3 4 5
# show with A, B, C columns df.loc[:, ['A', 'B', 'C']]
# show with C, B, A columns df.loc[:, ['C', 'B', 'A']]
1.2.3 Remove duplicate rows
1.2.3.a MySQL
It is quite easy to do that: you only need to add DISTINCT keyword. For example:
1 2
selectDISTINCT student_id from courses
1.2.3.b Pandas
The DataFrame has a method called drop_duplicates(). For example:
1
selected_views = selected_views.drop_duplicates()
More advanced usage is use parameter subset to tell which columns for identifying duplicates, by default use all of the columns. For example, the following code drop the duplicate rows whenever they have same value in column teacher_id, subject_id.
In the select statement, add with order by <col> asc/desc statement. <col> indicates sort according to which column, asc means ascending sorting, desc means descending sorting.
1 2 3
select v.author_id from Views as v orderby id asc
To sort according multiple columns (if first column is the same, sort by second column), do it by:
1 2 3
select v.author_id from Views as v orderby id asc, col2 desc
1.2.4.b Pandas
Use the pd.DataFrame.sort_values() to sort according to which column(s). If we want to sort from biggest to smallest, use ascending parameter.
1 2 3 4 5 6 7 8
# sort by one column, id selected_views = selected_views.sort_values(by=['id']) selected_views = selected_views.sort_values(by=['id'], ascending=True)
# sort by two columns # if first column is the same, sort by second column selected_views = selected_views.sort_values(by=['id', 'view_date']) selected_views = selected_views.sort_values(by=['id', 'view_date'], ascending=[True, False])