Hello, I am Jaejin Song, DevOps Team Leader at WhaTap Labs. The DevOps team does a wide range of work, from managing the WhaTap service infrastructure to CI/CD, issue handling, customer support, and technical writing. You could say we do everything except service development.
The WhaTap monitoring service collects a huge amount of data, monitoring tens of thousands of servers around the world and more than 1 million transactions per second in real time. The data stored on the collection servers' disks amounts to about 6 TB/day, and the number keeps growing. In this post I would like to share how we have thought about storing this data, the core of the WhaTap service, efficiently from the infrastructure perspective rather than the code perspective.😄
The unit of monitoring in the WhaTap service is called a "project." A project can be a group of applications, a group of servers, or a single namespace; in Kubernetes, a single namespace is the smallest unit. Most people organize a few servers and a few applications into a project.
Data storage is also based on projects.
With about 6 TB/day of data across roughly 10,000 projects, the average works out to about 600 MB/day per project. When a new project is created, it is placed on the VM with the most free space among the dozens of storage VMs. The volumes allocated to each VM are managed by a monitoring daemon, which calls the cloud provider's API to increase the volume size when it anticipates running out of space.
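The daemon itself is internal, but conceptually the check is something like this minimal shell sketch, assuming an AWS EBS volume with ext4 on it; the volume ID, device, mount point, threshold, and target size are all hypothetical.

# Hypothetical sketch: grow the data volume when usage crosses a threshold
USAGE=$(df --output=pcent /data | tail -1 | tr -d ' %')
if [ "$USAGE" -gt 80 ]; then
    # Grow the EBS volume to an absolute size of 600 GiB...
    aws ec2 modify-volume --volume-id vol-0123456789abcdef0 --size 600
    # ...then grow the partition and the ext4 filesystem to match
    sudo growpart /dev/nvme1n1 1
    sudo resize2fs /dev/nvme1n1p1
fi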
We were having a peaceful time, with the service thriving and the systems running smoothly on the power of automation. Then one day we noticed a red flag in one of our large clients' Kubernetes monitoring projects: hundreds of pods were being created in a single namespace, and the project began accumulating more than 1 TB of data a day.
While it is natural to scale out as the number of projects grows, we did not expect to see this much data stored in a single project.
The WhaTap monitoring service guarantees one month of data retention, so we had homework to do: at this rate, the project would hit the volume size limit in just a few days.
After watching the usage trend for a few days, it was clear that the volume was going to exceed 16 TB. We moved the project to a new VM and configured its storage with LVM, tying two volumes together.
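Roughly, the linear setup on the new VM looked like the sketch below; the device names, volume group name, and mount point are illustrative rather than our actual configuration.

# Illustrative linear LVM setup spanning two volumes
sudo pvcreate /dev/nvme1n1 /dev/nvme2n1
sudo vgcreate datavg /dev/nvme1n1 /dev/nvme2n1
sudo lvcreate -l 100%FREE -n datalv datavg    # linear by default
sudo mkfs.ext4 /dev/datavg/datalv
sudo mount /dev/datavg/datalv /data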
However, when the customer eagerly came to query the accumulated data, performance issues arose. LVM by itself does not degrade performance, but because we had configured it as linear with future scaling in mind, all the hot writes were landing on just one of the two disks.
Linear vs Stripe
The I/O performance degradation got me thinking about striping in LVM. A striped LV in LVM is essentially RAID 0 and comes with the same drawbacks: every volume must be the same size and cannot be resized later. That greatly reduces the flexibility of LVM. Disk failures are not much of a worry in a cloud environment, but it is not uncommon for stripe metadata to be lost during configuration changes, which concerned me. Our storage constantly drives thousands of IOPS of random I/O and needs consistently low latency, so distributed file systems like HDFS and object storage were out of the question.
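For comparison, a striped LV over the same hypothetical volume group would look something like this; it illustrates the trade-off, not a configuration we kept.

# Striped LV: writes are spread across both disks,
# but both PVs must be the same size and growing it later
# requires adding disks in matching sets
sudo lvcreate -i 2 -I 64 -l 100%FREE -n datalv datavg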
Late that night, I fell asleep thinking, "Wouldn't it be great if we could get striping in LVM with the flexibility of a linear volume?"
On my way to work the next day, Google answered my wish: "ZFS or Btrfs, pick one." I had heard of both, but had never paid them much attention or thought to try them.
The introduction on the ZFS homepage was tremendous.
It reads like a filesystem panacea, and as I dug into the details I kept thinking, "Can it really do all of that?" It looked like it could.
I decided to see if it was true.
We mirrored some of the service data and created a deliberately harsh environment, accommodating a project with more than twice the data of our production storage.
This is what the environment, which we named "Canary," looks like.
In a Linux environment, btrfs has many advantages for management and monitoring, but after a brief pilot, we concluded that zfs is better suited for our service.
In terms of future potential, btrfs is the clear winner, but for now it leaves something to be desired. Btrfs is expected to gain support for zstd dictionary compression, which compresses more than five times better than regular zstd, as well as lz4, which is the fastest. If all goes according to plan, we could replace zfs with btrfs in two to three years.
btrfs.wiki.kernel.org/index.php
And zfs's lz4 support was a big factor: for random I/O, lz4 in zfs is dramatically faster than lzo and zstd in btrfs. The benchmark numbers below give a sense of why, and a short zfs example follows the table. However, if you do not need consistently low latency the way we do, btrfs + zstd is a great choice.
Algorithm | Compression ratio | Compression speed | Decompression speed
--- | --- | --- | ---
gzip | 2.743x | 90 MB/s | 400 MB/s
lzo | 2.106x | 690 MB/s | 820 MB/s
lz4 | 2.101x | 740 MB/s | 4530 MB/s
zstd | 2.884x | 500 MB/s | 1660 MB/s
Source: https://facebook.github.io/zstd/
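For reference, this is roughly how compression is enabled and checked on a zfs dataset. The pool name matches the one we use later in this post; the ratio you get back obviously depends on your data.

# Enable lz4 on a dataset and check the achieved ratio
sudo zfs set compression=lz4 yardbase
sudo zfs get compression,compressratio yardbase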
In a typical environment it is hard to notice a performance difference driven by filesystem characteristics, but our storage is extremely I/O-intensive.
Our storage is characterized by constant random I/O, thousands of IOPS at all times, and a requirement for consistently low latency.
Here are the results of our tests in the Canary environment
I was very surprised by the results: there is absolutely no reason not to use zfs. I was especially encouraged by the I/O performance gains from volume compression.
I was looking forward to the following benefits of switching from ext4 to zfs.
Easier volume management
Performance
Cost reduction compared to ext4
There are some downsides, but the upsides outweigh them all. The slight increase in memory and CPU usage is a trade-off, but the performance and cost savings are so great that it's a trade-off worth making 10 times over.
Cons
Since the "canary" environment showed great results, we immediately started the production deployment. The filesystem migration, which totaled over 150TB, was done sequentially over three months and was a smooth process, but there were a few missteps along the way that we will summarize.
There are many different versions of ZFS and many different ways the modules are distributed. We found that the modules distributed as OpenZFS 2.0.x + dkms have better performance and reliability than the ZFS shipped with the distribution.
Don't hesitate: use version 2.0.x. The difference is very significant.
# On Ubuntu
sudo add-apt-repository ppa:jonathonf/zfs
sudo apt-get install -y zfs-dkms
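After installing, it is worth confirming that the dkms-built 2.0.x module is the one actually loaded:

# Check the userland tools and the loaded kernel module
zfs version
cat /sys/module/zfs/version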
The default values are fine, but a few option tweaks can significantly increase performance. There is no one-size-fits-all answer, though; you need to make adjustments based on the characteristics of your workload, such as sequential vs. random I/O.
For example, a large recordsize can improve compression and sequential write performance, but at the expense of random I/O performance.
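As a hypothetical illustration, recordsize can be set per dataset so that different workloads get different values; the dataset names here are made up.

# Different recordsize per dataset for different access patterns
sudo zfs create -o recordsize=1M yardbase/archive   # large sequential writes
sudo zfs create -o recordsize=16k yardbase/index    # small random reads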
The guide is pretty good.
https://openzfs.github.io/openzfs-docs/Performance%20and%20Tuning/Workload%20Tuning.html
# Options applied to the WhaTap ingestion server
sudo zpool set ashift=12 yardbase
sudo zfs set compression=lz4 yardbase
sudo zfs set atime=off yardbase
sudo zfs set sync=disabled yardbase
sudo zfs set dnodesize=auto yardbase
sudo zfs set redundant_metadata=most yardbase
sudo zfs set xattr=sa yardbase
sudo zfs set recordsize=128k yardbase
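The applied values can be checked afterwards like this:

# Verify the pool and dataset properties
zpool get ashift yardbase
zfs get compression,atime,sync,dnodesize,redundant_metadata,xattr,recordsize yardbase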
Since everything was running so well in production, we tried lowering the disk specs.
In AWS regions, we changed gp2 to gp3, which has slightly worse latency and half the throughput but costs nearly 20% less than gp2. The performance differences are taken from the blog post below and are not far from what we measured ourselves on ext4.
https://silashansen.medium.com/looking-into-the-new-ebs-gp3-volumes-8eaaa8aff33e
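Conveniently, an existing gp2 volume can be converted to gp3 in place, without detaching it; something like the following, with a hypothetical volume ID (IOPS and throughput can be raised separately if needed).

aws ec2 modify-volume --volume-id vol-0123456789abcdef0 --volume-type gp3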
In Azure regions, we changed from Premium SSDs to Standard SSDs.
Standard SSDs in Azure have very low performance, really only slightly better than HDDs.
Wow! Even that turned out to be enough. I did not expect it to work so well, especially on Azure Standard SSDs.
Since a ZFS raidz stripe scales performance roughly linearly with the number of disks, the lower-spec disks were no problem at all.
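For illustration, a pool built from several lower-spec disks looks something like the sketch below (device names are hypothetical), and zpool iostat shows the I/O being spread across all of them.

# Hypothetical raidz pool striped across four disks
sudo zpool create yardbase raidz /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 /dev/nvme4n1
sudo zpool iostat -v yardbase 5    # watch I/O spread across the member disks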