
Hello! 你好!

I’m @I-am-Future, an undergraduate studying computer science at CUHK-Shenzhen, interested in deep learning, robotics, and computer vision. I love computer science, from low-level microprocessors and assembly to high-level applications such as deep learning and computer graphics.

Welcome to my blog! It was established on Mar. 21st, 2023, with the aim of sharing my programming and other knowledge. There are a few categories, which you can access from the buttons at the top. These include:

  • Series of knowledge (such as “Deep Learning”, “Pandas-SQL”)
  • Problems I have met (“Problem Solving”)
  • Quick Reference Handbooks (“QRH”) for packages and tools, etc.

You can also use the search button at the top to find topics you like. All of them are very interesting.

Enjoy them!

Compress Native System Screen Recording

Problem Statement

The native screen recorders in Windows/macOS produce very high-quality videos, so the files are usually too large to share with others. For example, a 2-hour screen recording on macOS takes 14 GB of disk space!

However, we notice that screen recordings from Zoom/Tencent Meeting are very lightweight: a 2-hour recording takes only a few hundred megabytes!

So, when we want to share a native screen recording with others, can we convert it into something like the Zoom one, with a much smaller file size and quality that is still good enough for watching?

Observation & Analysis

We take two videos, from Tencent Meeting (tencent.mp4) and the Windows native recorder (windows.mp4) respectively:

            tencent.mp4    windows.mp4
Duration    6.7 s          11.6 s
File Size   261 KB         11675 KB

We find that the size of tencent.mp4 is far smaller (~20x) than that of windows.mp4! Let’s analyze the details with ffprobe, which is provided by ffmpeg:

# Description: Get the information of the video file using ffprobe
ffprobe -v error -show_format -show_streams tencent.mp4

ffprobe -v error -show_format -show_streams windows.mp4

Here is a comparison of the key observations:

  1. Codec Profile:
    • tencent.mp4 uses the H.264 High Profile, which is suitable for high resolution and high compression efficiency.
    • windows.mp4 uses the H.264 Main Profile, which typically has slightly lower compression efficiency compared to the High Profile.
  2. Frame Rate:
    • tencent.mp4 has a relatively low frame rate (approximately 15 fps), which helps to reduce file size.
    • windows.mp4 has a higher frame rate (60 fps), which significantly increases the file size.
  3. Bit Rate:
    • tencent.mp4 has a very low bit rate (184243 bps), which greatly reduces file size but may affect video quality.
    • windows.mp4 has a very high bit rate (7457076 bps), which ensures higher video quality but also results in a much larger file size.
  4. Audio Settings:
    • tencent.mp4 has mono audio and a lower sampling rate (32000 Hz), which helps to reduce file size.
    • windows.mp4 has stereo audio and a higher sampling rate (48000 Hz), which increases file size.
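
If you want to pull these fields out programmatically instead of reading the full ffprobe dump, a small sketch like the one below works. It is only a convenience helper I wrote for this comparison (the name probe_summary is mine, not from ffmpeg), and it assumes ffprobe is on PATH and each file has one video and one audio stream.

# Summarize the fields compared above from ffprobe's JSON output.
import json
import subprocess

def probe_summary(path):
    out = subprocess.run(
        ['ffprobe', '-v', 'error', '-print_format', 'json',
         '-show_format', '-show_streams', path],
        capture_output=True, text=True, check=True)
    info = json.loads(out.stdout)
    video = next(s for s in info['streams'] if s['codec_type'] == 'video')
    audio = next(s for s in info['streams'] if s['codec_type'] == 'audio')
    return {
        'profile': video.get('profile'),
        'frame_rate': video.get('avg_frame_rate'),
        'bit_rate': info['format'].get('bit_rate'),
        'audio_channels': audio.get('channels'),
        'audio_sample_rate': audio.get('sample_rate'),
    }

print(probe_summary('tencent.mp4'))
print(probe_summary('windows.mp4'))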

Solution

From the analysis above, we can come up with a plan to reduce the video file size. We use ffmpeg to convert the videos, and the following Python script helps construct the ffmpeg command:

# Description: This script compresses a video file using ffmpeg with some preset parameters. 
import os, pyperclip

input_file = 'input.mp4'
output_file = 'compressed_output.mp4'

# H264 (libx264) encoder
profile = 'high' # H.264 encoding profile (e.g., 'high', 'main', 'baseline')
crf = 23 # Constant Rate Factor, the lower the value, the higher the quality (range: 0-51)
bitrate = '184k' # Video bitrate (e.g., '500k', '1000k')
framerate = 15 # Frame rate (fps)
vcodec = 'libx264' # Video codec (e.g., 'libx264', 'libx265')

cmd = f'ffmpeg -i {input_file} -vcodec {vcodec} -profile:v {profile} -crf {crf} ' \
      f'-preset fast -b:v {bitrate} -r {framerate} -pix_fmt yuv420p -acodec aac ' \
      f'-b:a 128k -ar 32000 -ac 1 {output_file}'

pyperclip.copy(cmd)
print(cmd)
os.system(cmd)

An example of the generated command:

ffmpeg -i input.mp4 -vcodec libx264 -profile:v high -crf 23 -preset fast -b:v 184k -r 15 -pix_fmt yuv420p -acodec aac -b:a 128k -ar 32000 -ac 1 compressed_output.mp4

To compress the file, we apply the High profile, a moderate CRF, and a low bitrate and frame rate. You may export the video and check whether it is over-compressed.
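
As a quick sanity check after exporting, you can also compare the file sizes directly. A minimal sketch (the file names are the ones used in the script above):

# Compare the sizes of the original and compressed files.
import os

orig = os.path.getsize('input.mp4')
comp = os.path.getsize('compressed_output.mp4')
print(f'{orig / 1e6:.1f} MB -> {comp / 1e6:.1f} MB ({orig / comp:.1f}x smaller)')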

Apple Silicon Acceleration

For Apple Silicon users, we can use Apple’s own H.264 encoder (h264_videotoolbox) for hardware acceleration. The command may look like this:

# Apple's H264 encoder
framerate = 15 # Frame rate (fps)
quality = 10 # Quality
cmd = f'ffmpeg -i {input_file} -vf "scale=1920:-2,fps={framerate}" -c:v h264_videotoolbox ' \
      f'-q:v {quality} -acodec copy ' \
      f'{output_file}'

An example of the generated command:

ffmpeg -i input.mp4 -vf "scale=1920:-2,fps=15" -c:v h264_videotoolbox -q:v 10 -acodec copy compressed_output.mp4

This post records my experience installing the Isaac Gym environment on a Linux server. I’ll also write down all the problems I met, along with troubleshooting.

Prerequisite

The following requirements are provided by the project page:

Ubuntu 18.04, or 20.04.
Python 3.6, 3.7, or 3.8
Minimum recommended NVIDIA driver version: 470.74 (470 or above required)
Minimum required hardware: NVIDIA Pascal or later GPU with at least 8gb of VRAM

Possible Problems
Don’t use WSL (Windows Subsystem for Linux)! The NVIDIA driver on WSL may not support Vulkan well, which causes failures!

Download & Install

  1. Register and download from https://developer.nvidia.com/isaac-gym/. Press “Join now” first, then fill in the form, and finally you are ready to download it.

download

  2. Put the downloaded file, probably named IsaacGym_Preview_4_Package.tar.gz, in a proper folder. Then extract it:
tar -xvf IsaacGym_Preview_4_Package.tar.gz
  3. Go to the extracted folder and execute create_conda_env_rlgpu.sh. This automatically creates a new conda environment with all packages installed.
./create_conda_env_rlgpu.sh
Possible Problems
The shell may get stuck at the “Solving…” stage. In that case, please follow step 4 below.
  4. [Alternative to step 3] If step 3 fails, consider the following method:
# Create an env manually.
conda create -n rlgpu python=3.8

# Go to the python/ subdirectory,
cd python/

# Install manually.
pip install -e .

The commands above also install isaacgym, together with the necessary dependencies (PyTorch, NumPy, etc.).

On my machine, installing with the latest PyTorch (2.2.0) and CUDA (12.1) works fine; an older version is not required.
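
Before running the demos, a quick sanity check of the environment can save some time. The snippet below is my own suggestion rather than part of the official instructions; note that isaacgym is generally expected to be imported before torch in the same process.

# Run inside the rlgpu environment.
import isaacgym  # import isaacgym before torch, otherwise it may complain
import torch

print('torch:', torch.__version__)
print('CUDA available:', torch.cuda.is_available())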

Test demo

To run the test demo, cd to python/examples, and run:

python joint_monkey.py

Possible Problem 1
When importing isaacgym, it returns the error ImportError: libpython3.8.so.1.0: cannot open shared object file: No such file or directory

Solution 1: This happens because the loader fails to locate the shared library. We can add the corresponding path to LD_LIBRARY_PATH so that Python can locate the library.

# in ~/.bashrc
export LD_LIBRARY_PATH=/<miniconda_root>/envs/<env_name>/lib:$LD_LIBRARY_PATH

Possible Problem 2
After executing python joint_monkey.py, it returns error [Error] [carb.windowing-glfw.plugin] GLFW initialization failed.

Solution 2: The test programs require a GUI. On a headless server, you may use X11 forwarding (e.g., MobaXterm) or an Xvfb virtual screen.
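
A quick way to check whether the current session can reach a display at all (my own check, not from the Isaac Gym docs): the examples need the DISPLAY environment variable to point to an X server, whether it is a real screen, a forwarded one, or a virtual one from Xvfb.

# If this prints None or an empty string, GLFW has no X server to talk to.
import os
print(os.environ.get('DISPLAY'))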


Possible Problem 3
After executing python joint_monkey.py, it returns a Segmentation Fault.

Solution 3: There is no direct solution on the Internet. From my experience, I can give you some possible causes:

  • Newer PyTorch/CUDA versions: This should not be a big problem. As for me, I am using PyTorch (2.2.0) and CUDA (12.1), and the program still works!
  • Don’t use WSL: As of 2/1/2024, Vulkan is not fully supported on WSL, so the program may run into such problems. Use a native Linux machine instead.
  • Vulkan installation: Please make sure Vulkan works fine on the server. You may visit here for further information.

In this post, we’ll learn the pytorch-lightning framework with some simple example code. I believe the code will significantly help you understand it. The official docs are here.

# Before everything starts:
import pytorch_lightning as pl

To train a deep learning model, we need to code the following components: data, model, training & testing procedures, and hyper-parameter configs. In pytorch-lightning, different classes handle each of them.

  • Data: we can define a subclass of pl.LightningDataModule to implement procedures that initialize the Dataset and DataLoader.
  • Model: implement the model just like what you did without pytorch-lightning – a subclass of nn.Module.
  • Training & Testing procedures: we can define a subclass of pl.LightningModule to implement the procedures.
  • Configs: pytorch-lightning has utilities for CLI arguments, but here we use another package called configargparse, which supports both config files and CLI arguments (see the sketch below).
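
The config code itself is not shown in this post, so here is only a minimal sketch of how configargparse can be used; the option names (--config, --batch_size, --lr) and file names are placeholders I chose, not fixed by the library.

# Minimal configargparse sketch: the same option can come from a config file or the CLI.
import configargparse

parser = configargparse.ArgumentParser()
parser.add_argument('-c', '--config', is_config_file=True, help='path to a config file')
parser.add_argument('--batch_size', type=int, default=32)
parser.add_argument('--lr', type=float, default=1e-3)

args = parser.parse_args()
print(args.batch_size, args.lr)

For example, python train.py -c config.yaml --lr 3e-4 would read defaults from the config file and let the CLI override them.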

1. Data

We can define a pl.LightningDataModule which implements procedures to initialize the Dataset and DataLoader.

Main components of pl.LightningDataModule:

  • __init__(): The constructor. You can save some configurations (hyper-parameters) to the class here.
  • setup(): Will be called before fit(), validate(), test(), and predict(). You can initialize the datasets as class attributes here. Note that it has an argument stage, indicating which stage it is (because we may create different datasets for different stages).
  • prepare_data(): Will be called only on the master process. In distributed training, setup() is called on every process; if something should be done only once (e.g., downloading data), implement it in prepare_data().
  • train_dataloader(), val_dataloader(), test_dataloader(), predict_dataloader(). Will be called during the corresponding phase to get the DataLoader. It should return a DataLoader for that stage.

Here’s an example. You can check the detailed description of each function in its docstring:

# DataModule
# (imports needed by this example)
import torch
from torch.utils.data import DataLoader, random_split
from torchvision import datasets, transforms

class MyDataModule(pl.LightningDataModule):
    def __init__(self, data_dir='./', batch_size=32):
        """
        Initializes the data module with the given config arguments.
        This constructor sets up the initial configurations like data directory,
        batch size, and transformations to be applied to the data.

        Called when an instance of MyDataModule is created.
        """
        super().__init__()
        self.data_dir = data_dir
        self.batch_size = batch_size
        self.transform = transforms.Compose([
            transforms.ToTensor(),
            transforms.Normalize((0.5,), (0.5,))
        ])

    def setup(self, stage=None):
        """
        Prepares the data for use. It is automatically called at the beginning of fit,
        validate, test, and predict, or when manually calling `trainer.setup`. This method
        is responsible for setting up internal datasets (e.g., splitting the dataset into
        training, validation, and testing sets, etc.)

        `stage` can be used to differentiate between fit, validate, and test stages.
        """
        # Assign train/val datasets for use in dataloaders
        if stage == "fit":
            mnist_full = datasets.MNIST(self.data_dir, train=True, transform=self.transform)
            self.mnist_train, self.mnist_val = random_split(
                mnist_full, [55000, 5000], generator=torch.Generator().manual_seed(42)
            )

        # Assign test dataset for use in dataloader(s)
        if stage == "test":
            self.mnist_test = datasets.MNIST(self.data_dir, train=False, transform=self.transform)

        if stage == "predict":
            self.mnist_predict = datasets.MNIST(self.data_dir, train=False, transform=self.transform)

    def train_dataloader(self):
        """
        Creates and returns the DataLoader for the training dataset. This method is used to
        load the training data and is automatically called during the training phase.

        Returns a DataLoader instance for the training data.
        """
        return DataLoader(self.mnist_train, batch_size=self.batch_size)

    def val_dataloader(self):
        """
        Creates and returns the DataLoader for the validation dataset. This method is used to
        load the validation data and is automatically called during the validation phase.

        Returns a DataLoader instance for the validation data.
        """
        return DataLoader(self.mnist_val, batch_size=self.batch_size, shuffle=False)

    def test_dataloader(self):
        """
        Creates and returns the DataLoader for the testing dataset. This method is used to
        load the test data and is automatically called during the testing phase.

        Returns a DataLoader instance for the test data.
        """
        return DataLoader(self.mnist_test, batch_size=self.batch_size, shuffle=False)
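
The example above omits predict_dataloader(), even though setup() already prepares self.mnist_predict. A minimal version (my addition, simply mirroring the other hooks) would be:

    def predict_dataloader(self):
        """
        Creates and returns the DataLoader for the prediction dataset, mirroring
        the other *_dataloader hooks above.
        """
        return DataLoader(self.mnist_predict, batch_size=self.batch_size, shuffle=False)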

2. Model

You may directly implement the model in pl.LightningModule, but for clarity, I recommend implementing your model as a separate nn.Module subclass and then creating an instance of it in the pl.LightningModule.

# You can do like this...
class MyLightningModule(pl.LightningModule):
    def __init__(self, hidden_dim=64):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.layer1 = torch.nn.Linear(28 * 28, self.hidden_dim)
        self.layer2 = torch.nn.Linear(self.hidden_dim, 10)

    def forward(self, x):
        x = x.view(x.size(0), -1)
        x = torch.relu(self.layer1(x))
        x = self.layer2(x)
        return x
# But I recommend you to do like this:
class Model(nn.Module):
    ......

class MyLightningModule(pl.LightningModule):
    def __init__(self, hidden_dim=64):
        super().__init__()
        self.model = Model(hidden_dim)

    def forward(self, x):
        return self.model(x)

3. Training & Testing procedures

We can define a pl.LightningModule which implements procedures for training, validating, testing and predicting.

Main components of pl.LightningModule:

  • __init__(): The constructor. You can save some configurations (hyper-parameters) to the class here.
  • forward(): Make it behave like a nn.Module. You can implement the forward procedures here (See Sec. 2 for the example).
  • configure_optimizers(): Defines the training optimizers. See here for some advanced usage.
  • training_step(), validation_step(), test_step(), predict_step(): Will be called during the corresponding phase to perform the operations for a batch. Each takes an argument batch_idx, which indicates the batch index within the current epoch. training_step() should return the loss (a scalar or a dict), and the module will automatically perform back-propagation.
  • on_train_epoch_start(), on_train_epoch_end(), on_XXX_epoch_start(), on_XXX_epoch_end(): Will be called at the corresponding start/end points. You can perform epoch-level operations here. (See the example below, where we collect all training batch losses to calculate the average loss over an epoch.)
  • Use self.log(tag, value) to log a metric named tag with value value to TensorBoard / the log file.

Here’s an example. You can check the detailed description of each function in its docstring:

# (imports needed by this example)
import torch
import torch.nn.functional as F

class MyLightningModule(pl.LightningModule):
    def __init__(self, hidden_dim=64):
        """
        Initializes the LightningModule with two linear layers. This constructor
        sets up the neural network architecture for the module.

        Called when an instance of MyLightningModule is created.
        """
        super().__init__()
        self.hidden_dim = hidden_dim
        self.layer1 = torch.nn.Linear(28 * 28, self.hidden_dim)
        self.layer2 = torch.nn.Linear(self.hidden_dim, 10)
        self.training_results = []  # To store training losses in each epoch

    def forward(self, x):
        """
        Defines the forward pass of the model. It is automatically called when
        the model is used to make predictions.

        'x' is the input data. The method reshapes the input, applies a ReLU
        activation after the first linear layer, and then passes it through the
        second linear layer.
        """
        x = x.view(x.size(0), -1)
        x = torch.relu(self.layer1(x))
        x = self.layer2(x)
        return x

    def training_step(self, batch, batch_idx):
        """
        Defines a single step in the training loop. It is automatically called for each
        batch of data during training.

        'batch' contains the input data and labels, and 'batch_idx' is the index of
        the current batch. The method computes the model's loss using cross-entropy.
        """
        x, y = batch
        logits = self(x)
        loss = F.cross_entropy(logits, y)
        self.log('train_loss', loss)
        self.training_results.append(loss.detach())  # detach so the graph is not kept alive
        return loss

    def validation_step(self, batch, batch_idx):
        ...

    def test_step(self, batch, batch_idx):
        ...

    def configure_optimizers(self):
        """
        Configures the optimizers used for training. This method is automatically
        called to configure the optimization algorithm.

        Returns an optimizer, in this case, Adam, with a set learning rate.
        See advanced usage in official docs.
        """
        return torch.optim.Adam(self.parameters(), lr=0.001)

    # Epoch-wise procedures
    def on_train_epoch_start(self):
        """
        Called at the start of every training epoch. Can be used to perform
        actions specific to the beginning of each training epoch.
        """
        print("Training Epoch Start")

    def on_train_epoch_end(self):
        """
        Called at the end of every training epoch. Can be used to perform
        actions specific to the end of each training epoch.
        """
        print("Training Epoch End")
        print("Training Loss: ", sum(self.training_results) / len(self.training_results))
        self.training_results.clear()

    def on_validation_epoch_start(self):
        ...

    def on_validation_epoch_end(self):
        ...

    def on_test_epoch_start(self):
        ...

    def on_test_epoch_end(self):
        ...

    def on_predict_epoch_start(self):
        ...

    def on_predict_epoch_end(self):
        ...
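
To see how the pieces fit together, here is a minimal sketch of driving the DataModule and LightningModule with pl.Trainer; the hyper-parameter values are arbitrary placeholders.

# Wire everything together and train.
dm = MyDataModule(data_dir='./', batch_size=32)
model = MyLightningModule(hidden_dim=64)

trainer = pl.Trainer(max_epochs=3)
trainer.fit(model, datamodule=dm)
# trainer.test(model, datamodule=dm) works the same way once test_step() is implemented.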

In some cases, when we import a VM into the Oracle VM VirtualBox Manager, the network is not well configured, and we cannot connect to it via SSH or other services from the host machine. The default network settings may look like this:

  • Adapter 1 (a NAT adapter):
Network-setting1
  • Adapters 2-4 (not enabled):
Network-setting2
  • After typing ip a in the terminal of the VM (assuming it is a Linux machine), the result looks like this:
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether 08:00:27:b3:8d:9b brd ff:ff:ff:ff:ff:ff
inet 10.0.2.15/24 brd 10.0.2.255 scope global dynamic noprefixroute enp0s3
valid_lft 86376sec preferred_lft 86376sec
inet6 fe80::6dea:8795:c7b1:a1a5/64 scope link noprefixroute
valid_lft forever preferred_lft forever

There is no network interface through which the host can reach the VM.

Method 1: By Host-only Adapter

Important!

Pros: The VM gets a stable IP address, with all ports available (e.g., you can set up different services on different ports: 22, 3389, ...).

Cons: The VM cannot be accessed from machines outside the host (e.g., we cannot ssh host_name@host_ip to the VM from machines other than the host), unless you perform some port-forwarding strategy on the host.

Steps:

To create a network connection between the VM and the host, we can set Adapter 2 as below (use a “Host-only Adapter”):

Network-setting3

Now, typing ip a in the terminal of the VM again, the result looks like this:

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether 08:00:27:b3:8d:9b brd ff:ff:ff:ff:ff:ff
inet 10.0.2.15/24 brd 10.0.2.255 scope global dynamic noprefixroute enp0s3
valid_lft 86337sec preferred_lft 86337sec
inet6 fe80::6dea:8795:c7b1:a1a5/64 scope link noprefixroute
valid_lft forever preferred_lft forever
3: enp0s8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether 08:00:27:85:54:b0 brd ff:ff:ff:ff:ff:ff
inet 192.168.56.101/24 brd 192.168.56.255 scope global dynamic noprefixroute enp0s8
valid_lft 537sec preferred_lft 537sec
inet6 fe80::a598:692:2336:67d9/64 scope link noprefixroute
valid_lft forever preferred_lft forever

The third network interface, with the IP address 192.168.56.101, corresponds to the “VirtualBox Host-Only Ethernet Adapter”. On the host machine, we can now connect to it via SSH:

ssh ubuntu@192.168.56.101

So far, you can connect to the machine via SSH. You can also reach other services on the VM from the host machine through the IP address 192.168.56.101.

Method 2: Forward Port to Host

Important!

Pros: You can access the VM from outside the host machine, i.e., from a different machine.

Cons: Only the forwarded ports are available, and the ports may conflict with other services.

Steps:

Go to Adapter 1 and click the “Advanced” button to expand the menu. Then click “Port Forwarding”.

Network-setting4

In the “Port Forwarding Rules”, set up new rules. The “Host Port” is the port number on the host machine; the “Guest Port” is the corresponding port number on the VM. In this example, I forwarded port 4016 on the host to port 22 on the VM.

Network-setting5

After that, you can connect to the VM from the host machine. Type:

ssh -p 4016 ubuntu@127.0.0.1

You can also connect to the VM from outside the host machine. Assume my host’s IP address in the local network is 10.31.38.94. Then, from a computer in the same local network, type:

ssh -p 4016 ubuntu@10.31.38.94

And everything should work. If something goes wrong, please check the firewall permissions; if needed, add a firewall rule for port 4016 (the port number you forwarded).

Conclusion

You can set up both Method 1 and Method 2! Then you get all of the benefits (connecting from other machines, all ports available on the local machine, etc.) and avoid the disadvantages.

We often use WSL and Docker containers for development. In most cases, after installing WSL or a container, we need to set up the SSH (Secure Shell) configuration ourselves. This article shares my experience in setting up SSH in these environments.

1. Install & Config

Step 1: Install (optional): If your system has no SSH server installed, you can use the package manager (e.g., yum, apk, or apt, depending on which operating system you are using). E.g.,

# Ubuntu/Debian:
sudo apt update
sudo apt install openssh-server
# CentOS 7 and earlier:
sudo yum install openssh-server
# CentOS 8 and later:
sudo dnf install openssh-server
# Arch Linux:
sudo pacman -S openssh
# Alpine Linux:
sudo apk add openssh-server

Step 2: Configuring:

  1. Set up host keys for SSH. Run the following commands:
ssh-keygen -q -t rsa -b 2048 -f /etc/ssh/ssh_host_rsa_key -N ''
ssh-keygen -q -t ecdsa -f /etc/ssh/ssh_host_ecdsa_key -N ''
ssh-keygen -q -t ed25519 -f /etc/ssh/ssh_host_ed25519_key -N ''
  2. (Optional) If you haven’t set a password yet, set one first:
passwd <user_name>
  3. Set up the configs in sshd_config.

Use any text editor, like vim, to modify the /etc/ssh/sshd_config if needed. Some important configs are:

# indicates which port to listen on
Port 22
# Use public key authentication
PubkeyAuthentication yes
# Use password authentication
PasswordAuthentication yes
# Use X11 forwarding to forward a display in the remote shell
X11Forwarding yes
  4. Start the sshd service:
# Ubuntu/Debian:
sudo systemctl start ssh
# CentOS:
sudo systemctl start sshd
# Arch Linux:
sudo systemctl start sshd
# Alpine Linux:
sudo rc-service sshd start

2. Connect

Use ssh -p <port> <username>@<ip_address> to start an SSH connection.

2.1 WSL

<port>: Unless specified in the /etc/ssh/sshd_config, the default port number is 22.

<username>: The user name in the WSL.

WSL is a separately installed operating system running on your machine, so it has its own IP address. To get the <ip_address>, you can use the following command in the host’s PowerShell:

$wsl_ip = (wsl hostname -I).trim()
Write-Host "WSL Machine IP: ""$wsl_ip"""

2.2 Container

<port>: When configuring the container, we usually forward the container’s port 22 to a port on the host machine. Use that host port number.

<username>: Unless you create another user, the container has only one default user, root.

<ip_address>: Since we use port forwarding, the IP address is the host machine’s, i.e., 127.0.0.1 (or equivalently, the IP address shown by ipconfig (Windows) or ifconfig (Linux)).

3. Remote Access

In some cases, we may want to access the WSL/Docker environment on this host machine from other machines in the same local network.

3.1 WSL

Host port forwarding. Since WSL behaves like a separate machine, it has its own IP address. We should set up port forwarding so that a remote connection to the host machine can be forwarded to the WSL. Run the following commands in a PowerShell with admin privileges to set up port forwarding.

$wsl_ip = (wsl hostname -I).trim()

Write-Host "WSL Machine IP: ""$wsl_ip"""

netsh interface portproxy add v4tov4 listenport=2222 connectport=22 connectaddress=$wsl_ip

Firewall rules. We should set firewall permission rules so that the connection is not refused by the host machine. You may search online for how to allow TCP connections to the <listenport> you defined above, e.g., here.

Connect. Then, we can access it by ssh -p 2222 <wsl_username>@<host_ipaddr>. The <host_ipaddr> can be obtained with the command ipconfig (Windows) or ifconfig (Linux).

Additional note: if something goes wrong, use netsh interface portproxy reset to reset the port forwarding.

3.2 Container

Connect by ssh -p <port> <container_username>@<host_ipaddr>, where:

<port> is the forwarded port number, and

<host_ipaddr> is the host’s IP address in the local network, obtained with the command ipconfig (Windows) or ifconfig (Linux).

Images

Image: just like an OS image; it can be used to create a container.

Create an image from:

  1. a .tar file, with the docker load -i <tar_file> command.
  2. a Dockerfile script, with the docker build -t <image_name> <path> command.

List images: Use docker images to show all available images on the device.

Container

Container: an instance running the OS, built from an image.

Create one from an image with the docker run command. You can pass in some configuration (note that some settings are fixed after creation, so be careful!). Some common parameters are shown below:

docker run -it -d -p <localport>:<dockerport> --gpus all --name <name> --volume </path/host/directory>:</path/container/directory> <IMAGE ID>
  • -d means “detach”: run the container in the background.
  • -p <hostport>:<dockerport> forwards the container’s port <dockerport> to the host machine’s port <hostport>. Usually, we forward port 22 in the container to, e.g., port 2222 on the host for SSH connections.
  • --gpus all means all of the host’s GPUs are visible to the container.
  • --name <name> sets the name of the container. It will be shown in docker ps and other places.
  • --volume </path/host/directory>:</path/container/directory> mounts a host directory into the container, so you can access it from both the container and the host.
  • <IMAGE ID> is the ID of the image to run.

List containers: Use docker ps -a to check the status of all containers. Use docker ps to list only the running containers.

Three ways to interact with the container:

  1. Docker desktop’s shell
  2. docker exec -it <container id> bash
  3. SSH, if you have set it up

Start the container: By docker start <CONTAINER ID>, or do it in docker desktop.

Stop the container: By docker stop <CONTAINER ID>, or do it in docker desktop.

More info: please see the official cheat sheet here.

3. Data Manipulation

In this article, we will discuss four classic data manipulation problems in SQL / Pandas. They are from:

3.1: https://leetcode.cn/problems/nth-highest-salary/

3.2: https://leetcode.cn/problems/rank-scores/

3.3: https://leetcode.cn/problems/rearrange-products-table/

3.4: https://leetcode.cn/problems/calculate-special-bonus/

In this article, we won’t use the same data as in those questions; we’ll use simpler data to illustrate the functions.

3.1 TopK & n-th largest

TopK means selecting the top k rows from the table according to some rule. n-th largest means selecting just the n-th largest row according to some rule.

3.1.1 MySQL

To select the first few rows from a table, you can use the LIMIT keyword. LIMIT M, N selects N rows starting from row M (0-indexed). See the following examples and explanations.

-- table is
+-------+-------+
| name | score |
+-------+-------+
| Alice | 80 |
| Bob | 90 |
| Cindy | 100 |
| David | 70 |
| Ella | 60 |
| Frank | 80 |
+-------+-------+

-- select from 1st row, a total of 2 rows:
select * from Scores LIMIT 1, 2;
+-------+-------+
| name | score |
+-------+-------+
| Bob | 90 |
| Cindy | 100 |
+-------+-------+

-- selecting a total of 0 rows returns nothing
select * from Scores LIMIT 1, 0;
+------+-------+
| name | score |
+------+-------+

-- if starting index > len(table), nothing would be returned
select * from Scores LIMIT 7, 1;
+------+-------+
| name | score |
+------+-------+

-- selecting from a negative number means selecting from the first row
select * from Scores LIMIT -17, 2;
+-------+-------+
| name | score |
+-------+-------+
| Alice | 80 |
| Bob | 90 |
+-------+-------+

Therefore, it is quite easy to implement TopK and n-th largest. For example, if we want to find the top-2 students and the 2nd-highest student in this table:

-- find top 2 grade students
select * from Scores order by score desc LIMIT 0, 2;
+-------+-------+
| name | score |
+-------+-------+
| Cindy | 100 |
| Bob | 90 |
+-------+-------+

-- find the 2nd highest student (note that it is 0-index)
select * from Scores order by score desc LIMIT 1, 1;
+------+-------+
| name | score |
+------+-------+
| Bob | 90 |
+------+-------+

3.1.2 Pandas

In pandas, we have the method pd.DataFrame.nlargest(N, col_name) to find the top N rows. Note that after executing this method, the result is already sorted in descending order, so accessing its last row gives exactly the N-th largest.

# the df is
name score
0 Alice 80
1 Bob 90
2 Cindy 100
3 David 70
4 Ella 60
5 Frank 80

# find top 2 grade students
df.nlargest(2, 'score')
# result is:
name score
2 Cindy 100
1 Bob 90

# find the 2nd highest student (note that it is 0-index)
df.nlargest(2, 'score').iloc[-1]
# result is:
name Bob
score 90

3.2 rank, dense_rank

This question requires us to build a dense rank. That is, when several rows have the same score, they should all appear and they should share the same rank. Fortunately, there are built-in functions to help us achieve that.

3.2.1 MySQL

There is a function called DENSE_RANK(), which can be used to create a new column based on the ranking information.

-- The table is:
+----+-------+
| id | score |
+----+-------+
| 1 | 3.5 |
| 2 | 3.65 |
| 3 | 4 |
| 4 | 3.85 |
| 5 | 4 |
| 6 | 3.65 |
+----+-------+

-- Do the ranking (the problems requires the "dense rank")
SELECT S.score, DENSE_RANK() OVER (
ORDER BY S.score DESC
) AS 'rank'
FROM
Scores S;
-- result is:
+-------+------+
| score | rank |
+-------+------+
| 4 | 1 |
| 4 | 1 |
| 3.85 | 2 |
| 3.65 | 3 |
| 3.65 | 3 |
| 3.5 | 4 |
+-------+------+

Usually, ranking does not work like this: if we have two students with 4.0, the rank of the 3.85 student should be 3, not 2. The RANK() function helps in this case:

SELECT S.score, RANK() OVER (
ORDER BY S.score DESC
) AS 'rank'
FROM
Scores S;
-- result is:
+-------+------+
| score | rank |
+-------+------+
| 4 | 1 |
| 4 | 1 |
| 3.85 | 3 |
| 3.65 | 4 |
| 3.65 | 4 |
| 3.5 | 6 |
+-------+------+

3.2.2 Pandas

In pandas, this operation is quite easy: we have the pd.Series.rank() method. It has multiple parameters; the most important two are method and ascending. When method='dense', it behaves like DENSE_RANK(). When method='min', it gives the common ranking (when multiple rows have the same score, they all take the minimum rank of the group). ascending controls whether rank 1 is assigned to the highest or the lowest value.

# scores is:
id score
0 1 3.5
1 2 3.65
2 3 4.0
3 4 3.85
4 5 4.0
5 6 3.65

# create a new column, assign the dense_rank:
scores['rank'] = scores['score'].rank(method='dense', ascending=False)
# result is:
id score rank
0 1 3.5 4.0
1 2 3.65 3.0
2 3 4.0 1.0
3 4 3.85 2.0
4 5 4.0 1.0
5 6 3.65 3.0

# create a new column, assign the common rank:
scores['rank'] = scores['score'].rank(method='min', ascending=False)
# result is:
id score rank
0 1 3.5 6.0
1 2 3.65 4.0
2 3 4.0 1.0
3 4 3.85 3.0
4 5 4.0 1.0
5 6 3.65 4.0

3.3 row to column conversion

3.3.1 MySQL

The idea is quite simple, as illustrated below:

q3

First, we extract each column to form a new table corresponding to each store. Then, we concatenate (“union”) the three tables together. Note that when extracting each table, we should remove the null rows.

Following this idea, the implementation is relatively easy:

select product_id, 'store1' as store, store1 as price from Products where store1 is not null
union all
select product_id, 'store2' as store, store2 as price from Products where store2 is not null
union all
select product_id, 'store3' as store, store3 as price from Products where store3 is not null

3.3.2 Pandas

In pandas, you could use the same idea as in the subsection above. But pandas provides a more powerful tool, pd.melt, which helps us convert the wide format into the long format:

Input: 
Products table:
+------------+--------+--------+--------+
| product_id | store1 | store2 | store3 |
+------------+--------+--------+--------+
| 0 | 95 | 100 | 105 |
| 1 | 70 | null | 80 |
+------------+--------+--------+--------+
Output:
+------------+--------+-------+
| product_id | store | price |
+------------+--------+-------+
| 0 | store1 | 95 |
| 0 | store2 | 100 |
| 0 | store3 | 105 |
| 1 | store1 | 70 |
| 1 | store3 | 80 |
+------------+--------+-------+

id_vars is the identifier column (it will not be changed), value_vars lists the columns to be converted, var_name is the name of the new column that holds the original column names, and value_name is the name of the new column that holds the values.

df = products.melt(
    id_vars='product_id',
    value_vars=['store1', 'store2', 'store3'],
    var_name='store',
    value_name='price'
)

3.4 Conditional operator

The question requires us to set each value according to a predicate. Such a basic requirement is well supported by both MySQL and pandas.

3.4.1 MySQL

In MySQL, IF(cond, true_value, false_value) is a good helper. It appears in the SELECT clause.

SELECT
    employee_id,
    IF(employee_id % 2 = 1 AND name NOT REGEXP '^M', salary, 0) AS bonus
FROM
    employees

The second column, bonus, takes the same value as salary if the condition is satisfied, and 0 otherwise. The second part of the condition uses regular-expression syntax, which is true only when name does not start with 'M'. You may also use name NOT LIKE 'M%' for that condition.

3.4.2 Pandas

In pandas, we can use the pd.DataFrame.apply function to achieve this. With axis=1, apply accepts a function f(x), where x is the “current” row; based on the information in that row, it returns a value. For example, in this question, the condition is “the employee_id is an odd number and the employee’s name does not start with the character 'M'.”

# judge function
def f(x):
    if x['employee_id'] % 2 == 1 and not x['name'].startswith('M'):
        return x['salary']
    else:
        return 0

# Apply it to the dataframe for the new column:
employees['bonus'] = employees.apply(f, axis=1)
# Sample input:
| employee_id | name | salary |
| ----------- | ------- | ------ |
| 2 | Meir | 3000 |
| 3 | Michael | 3800 |
| 7 | Addilyn | 7400 |
| 8 | Juan | 6100 |
| 9 | Kannon | 7700 |

# Sample result of employees['bonus']:
0 0
1 0
2 7400
3 0
4 7700
Name: bonus, dtype: int64

This satisfies the requirement.
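
Alternatively, since apply with axis=1 calls a Python function for every row, the same logic can be written in a vectorized way. This variant is my own addition, not part of the original solution:

# Vectorized equivalent of the apply-based solution.
import numpy as np

cond = (employees['employee_id'] % 2 == 1) & (~employees['name'].str.startswith('M'))
employees['bonus'] = np.where(cond, employees['salary'], 0)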

2. String Operations

2.1 MySQL

Filtering strings in SQL is generally the same as filtering numeric values. There are many built-in functions to help you with string operations, for example UPPER(), LOWER(), CONCAT(), and SUBSTRING().

  • The length() function returns the length of each entry in a column:
-- table is
+----+---------+
| id | content |
+----+---------+
| 1 | apple |
| 2 | banana |
| 3 | orange |
+----+---------+

-- Get len of each entry in "content" column
select length(content) from fruits;
+-----------------+
| length(content) |
+-----------------+
| 5 |
| 6 |
| 6 |
+-----------------+

-- Assign the length column to table, named "content_len":
select *, length(content) as content_len from fruits;
+----+---------+-------------+
| id | content | content_len |
+----+---------+-------------+
| 1 | apple | 5 |
| 2 | banana | 6 |
| 3 | orange | 6 |
+----+---------+-------------+
  • LIKE "pattern" tells whether a column’s value matches the pattern, so you can filter out the others. In the pattern, _ matches exactly ONE character, and % matches ANY number of characters (including zero).
-- table is
+----+---------+
| id | content |
+----+---------+
| 1 | apple |
| 2 | banana |
| 3 | orange |
+----+---------+

-- Get a boolean column where "1" entry means content starts with "a"
select content like "a%" from fruits;
+-------------------+
| content like "a%" |
+-------------------+
| 1 |
| 0 |
| 0 |
+-------------------+

-- Combine with "where" to do filtering
select * from fruits where content like "a%";
+----+---------+
| id | content |
+----+---------+
| 1 | apple |
+----+---------+
  • Sometimes, the pattern above cannot handle complex queries. We can use regular expressions here. The syntax is: <col_name> REGEXP 'pattern'. Note that REGEXP is true if the pattern matches anywhere in the string; to require a match at the beginning, anchor the pattern with ^.
-- table is
+----+---------+
| id | content |
+----+---------+
| 1 | apple |
| 2 | banana |
| 3 | orange |
+----+---------+

-- Match all entries beginning with "a"
select * from fruits where content REGEXP "^a";
+----+---------+
| id | content |
+----+---------+
| 1 | apple |
+----+---------+

-- Match all entries containing "an"
select * from fruits where content REGEXP "an";
+----+---------+
| id | content |
+----+---------+
| 2 | banana |
| 3 | orange |
+----+---------+

2.2 Pandas

For numeric data types, we usually directly use syntax like df['col'] > 5 to get the indexing series. For string operations, there are similar ways to get the indexing series.

String accessor. For each column, we can use .str to get its pandas.core.strings.accessor.StringMethods. This object has various string operation functions, such as .replace(), .find(), .len(), .startswith(), etc.

  • Some functions, such as .len(), return a new integer series containing the length of the string in each entry:
# df is:
id content
0 1 apple
1 2 banana
2 3 orange

# Get len of each entry in "content" column
df['content'].str.len()
0 5
1 6
2 6
Name: content, dtype: int64

# Assign the len series to df, named "content_len":
df['content_len'] = df['content'].str.len()

# df now is:
id content content_len
0 1 apple 5
1 2 banana 6
2 3 orange 6
  • Some functions, like .startswith(), return a new boolean series, so you can use it for indexing:
# df is:
id content
0 1 apple
1 2 banana
2 3 orange

# Get indexing series where "content" entry starts with "a"
cond = df['content'].str.startswith('a')

# cond is:
0 True
1 False
2 False
Name: content, dtype: bool

# Do filtering
df = df.loc[cond, :]

# df is:
id content
0 1 apple
  • Regular Expression

Sometimes, we want to use a regex to help us match entries. We have the .match() function. Note that .match() matches from the beginning of the string. If you just want to match any substring, use .contains() instead.

# df is:
id content
0 1 apple
1 2 banana
2 3 orange

# Match all entries, beginning with "a"
cond = df['content'].str.match(r'a[a-zA-Z]*')

# Result will be:
df.loc[cond, :]
id content
0 1 apple

# Match all entries, containing "an"
cond = df['content'].str.contains(r'an')

# Result will be:
df.loc[cond, :]
id content
1 2 banana
2 3 orange

Introduction

There are 30 questions for practicing Python pandas and MySQL at https://leetcode.cn/studyplan/30-days-of-pandas/. These questions exercise the fundamental skills. Pandas and MySQL are similar in some aspects, and one can write functionally equivalent code in both languages. I’ll practice my skills with these questions and write notes about the equivalent operations in the two languages.

Accordingly, there are six sections in this series (so it will use 6 articles in total):

  1. Condition Filter & Misc
  2. String Manipulation
  3. Data Manipulation
  4. Data Statistics
  5. Grouping
  6. Merging

1. Condition Filter & Join & Misc

The four related questions in this area are:

Basic where clause filtering, with OR operator: https://leetcode.cn/problems/big-countries/

Basic where clause filtering, with AND operator: https://leetcode.cn/problems/recyclable-and-low-fat-products/

Joining two tables, with where clause filtering null: https://leetcode.cn/problems/customers-who-never-order/

Basic where clause filtering, with DISTINCT and sort: https://leetcode.cn/problems/article-views-i

1.1 Condition

A typical condition-filtering task asks us to do something like this:

Filter the big countries in the World schema. A country is big if:

  • it has an area of at least three million (i.e., 3000000 km2), or
  • it has a population of at least twenty-five million (i.e., 25000000).

We’ll use this example to illustrate the point.

1.1.a MySQL

A typical condition filter in SQL looks like this:

select <col_names>
from <table_name>
where (
<condition predicate>
)

SQL allows the use of the logical connectives and, or, and not.

The operands of the logical connectives can be expressions involving the comparison operators <, <=, >, >=, =, and <> (note that the equality operator is a single =, not ==!).

For example, for the problem above, the answer is:

SELECT name, population, area
FROM World
WHERE area >= 3000000 OR population >= 25000000

1.1.b Pandas

Filtering in pandas is a little bit different. We should know two basic things first:

  • Conditions produce a boolean index Series

If we want to get all rows whose value in df’s col1 column is greater than 5, the code and result are as below:

print(df)
# OUTPUT:
'''
name col1 col2
0 a 1 10
1 b 2 9
2 c 3 8
3 d 4 7
4 e 5 6
5 f 6 5
6 g 7 4
7 h 8 3
8 i 9 2
9 j 10 1
'''

print(df['col1'] > 5)
# OUTPUT:
'''
0 False
1 False
2 False
3 False
4 False
5 True
6 True
7 True
8 True
9 True
Name: col1, dtype: bool
'''

As you can see, the result of df['col1'] > 5 is a boolean Series of length len(df). An entry is True only when the corresponding entry in col1 satisfies the condition.

  • Indexing by the Index Series

We can pass this boolean index Series into df.loc[]. Then, only the rows satisfying the condition are kept:

a = df.loc[df['col1'] > 5]
print(a)
# OUTPUT
'''
name col1 col2
5 f 6 5
6 g 7 4
7 h 8 3
8 i 9 2
9 j 10 1
'''

With this method, we can do filtering in pandas. Note that:

For the and, or, and not logical operators, the corresponding operators in pandas are &, |, and ~. For example:

cond1 = df['col1'] > 5
cond2 = df['col2'] > 2

b = df.loc[cond1 & cond2]

c = df.loc[ (df['col1'] > 5) & (df['col2'] > 2) ]

Note that the parentheses are necessary because of operator precedence. For example, for the problem above, the answer is:

cond1 = world['area'] >= 300_0000
cond2 = world['population'] >= 2500_0000

return world.loc[cond1 | cond2, ['name', 'population', 'area']]

1.2 Misc

1.2.1 Rename output columns

1.2.1.a MySQL

In MySQL, renaming an output column is very easy. By default, if the command is

SELECT name, population, area
FROM World
WHERE area >= 3000000 OR population >= 25000000

Then the output columns are named name, population, and area. To rename them, just use the AS keyword to give new names to the output:

-- change name => Name, population => Population, area => Area
SELECT name as Name, population as Population, area as Area
FROM World
WHERE area >= 3000000 OR population >= 25000000

1.2.1.b Pandas

In pandas, we can use the pd.DataFrame.rename function to rename columns. The columns parameter is a dict[str, str], where the key is the old name and the value is the new name. In the example below, we capitalize all the column names:

df = df.rename(columns={'name': 'Name', 'population': 'Population', 'area': 'Area'})

1.2.2 Swap output columns order

1.2.2.a MySQL

In SQL, this work is relatively easy. You only need to swap the column names in the select command:

-- show with A, B, C columns 
select A, B, C from table;

-- show with C, B, A columns
select C, B, A from table;

1.2.2.b Pandas

The method is quite similar to the MySQL code:

# show with A, B, C columns 
df.loc[:, ['A', 'B', 'C']]

# show with C, B, A columns
df.loc[:, ['C', 'B', 'A']]

1.2.3 Remove duplicate rows

1.2.3.a MySQL

It is quite easy to do that: you only need to add the DISTINCT keyword. For example:

select DISTINCT student_id
from courses

1.2.3.b Pandas

The DataFrame has a method called drop_duplicates(). For example:

selected_views = selected_views.drop_duplicates()

A more advanced usage is the subset parameter, which specifies which columns to use for identifying duplicates (by default, all columns are used). For example, the following code drops duplicate rows whenever they have the same values in the columns teacher_id and subject_id.

teacher = teacher.drop_duplicates(subset=['teacher_id', 'subject_id'])

1.2.4 Sort

1.2.4.a MySQL

In the select statement, add an order by <col> asc/desc clause. <col> indicates which column to sort by; asc means ascending order, desc means descending order.

select v.author_id 
from Views as v
order by id asc

To sort by multiple columns (if the first column is the same, sort by the second column), do:

select v.author_id 
from Views as v
order by id asc, col2 desc

1.2.4.b Pandas

Use pd.DataFrame.sort_values() to sort by one or more columns. If we want to sort from biggest to smallest, use the ascending parameter.

# sort by one column, id
selected_views = selected_views.sort_values(by=['id'])
selected_views = selected_views.sort_values(by=['id'], ascending=True)

# sort by two columns
# if first column is the same, sort by second column
selected_views = selected_views.sort_values(by=['id', 'view_date'])
selected_views = selected_views.sort_values(by=['id', 'view_date'], ascending=[True, False])