December 29, 2016
5 min Read
Ceph, a free-software storage platform scalable to the exabyte level, became a hot topic this year. In fact, Ceph is now so stable that it is used by some of the largest companies and projects in the world, including Yahoo!, CERN and Bloomberg. Hostrangers also joined this league of Ceph users, but in our own way. We deployed our larger, better-performing cluster using neither RadosGW nor RBD, but the still very fresh Ceph file system (CephFS).
Pretty hard. But not impossible
Ceph is pretty hard to deploy and tune to make the most of your cluster. There are tools to make deploying Ceph easier: ceph-ansible (Ansible), DeepSea (SaltStack), ceph-chef (Chef) and, of course, ceph-deploy.
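For illustration, a minimal ceph-deploy bootstrap looks roughly like this. This is a sketch only: hostnames and device paths are made up, and a real deployment needs provisioned hosts, SSH access and far more care.

```shell
# Sketch: bootstrap a small cluster with ceph-deploy (hostnames illustrative).
ceph-deploy new mon1 mon2 mon3        # generate initial ceph.conf and monitor keys
ceph-deploy install mon1 mon2 mon3 osd1
ceph-deploy mon create-initial        # start the monitors and gather keys
# Create an OSD: data disk plus a separate NVMe journal partition.
ceph-deploy osd create osd1:/dev/sdb:/dev/nvme0n1p1
ceph-deploy mds create mon1           # a metadata server, needed for CephFS
```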
Our decision to use Ceph for the 000webhost.com project was thoroughly researched. At Hostrangers we already had experience running both Ceph and Gluster in production, and we are now confident that Ceph is the right choice.
Not as performant? The best or nothing
Before migrating all our storage needs to Ceph, Hostrangers used Gluster, which is no longer suitable for our present needs. First of all, Gluster is harder to maintain than Ceph. Second, it is not as performant. Unlike Ceph, which has a native kernel client, Gluster is exported either through NFS (or NFS-Ganesha) or through a FUSE client. Even though the Gluster documentation says that FUSE is meant to be used where high performance is required, FUSE can’t compete with kernel clients.
After mentioning that we use CephFS in production, we got reactions like “really!!!??” and some laughter. But as we try to remain an innovative company that keeps its technology stack on the edge, we continue to use CephFS. Another reason we chose CephFS is that the alternatives are simply not good enough. Here at Hostrangers we have experience running Ceph RBD images over an NFS share. That solution did provide us with shared storage, but under certain loads (a large number of clients sending a mix of both IO- and throughput-oriented read and write requests) NFS (and Ganesha) showed terrible performance with frequent lock-ups. Although it is often stated that there is no iSCSI support for Ceph, we managed to export RBD images over iSCSI (two years ago) and successfully ran a few production projects this way, with very decent performance that we were satisfied with. But this solution is also no longer suitable for our present needs, because it does not provide shared storage.
How we did it
For the initial launch of the project, we deployed a cluster consisting of:
- Three monitor nodes, which also act as metadata servers and cache tier. Each has one NVMe drive, totaling 800 GB of cache tier for the hottest data.
- Five OSD nodes, each with 12 OSDs plus one NVMe drive for journals – 60 OSDs in total and 120 TB of usable disk space (360 TB of raw space).
- Separate 10GE fiber networks: one public client network and one private cluster network.
Like all of our internal networks, the Ceph cluster runs on an IPv6-only network.
For those wondering what NVMe is – it is a fairly new technology, with the first specification released in 2011. Our NVMe drives can achieve up to 3 GB/s (24 Gbps) read throughput and up to 2 GB/s (16 Gbps) write throughput, providing exceptional performance compared with traditional HDDs and other SSDs.
Long road of benchmarking, tuning and lessons learned
As this cluster is built for shared hosting, most operations are IOPS-oriented.
It all boils down to latency, and latency is the key:
- Latency induced by network
- Latency induced by drives and their controllers
- Latency induced by CPU
- Latency induced by RAID controller
- Latency induced by other hardware components
- Latency induced by kernel
- Latency induced by Ceph code
- Latency induced by wrong kernel, Ceph, network, file system and CPU configuration
While you can’t do much about some of these parts, others are in your control.
- Tuning kernel configuration parameters
There are a lot of kernel parameters that can improve your cluster’s performance – or make it worse – so you should leave them at their defaults unless you really know what you are doing.
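As a hedged illustration (the values below are generic starting points, not our production settings), network-related sysctls are a common place to begin on a 10GE cluster:

```ini
# /etc/sysctl.d/90-ceph.conf -- illustrative values only; benchmark before
# and after every change, and keep defaults unless a change clearly helps.
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.core.netdev_max_backlog = 50000
# Keep OSD memory out of swap:
vm.swappiness = 1
```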
- Tuning Ceph configuration parameters
By default, Ceph is configured to meet the requirements of spindle disks. But many clusters are SSD-only, a mix of SSDs (for journals) backing spindle HDDs, or, as in our case, NVMe cache in front of spindle HDDs. So we had to tune many Ceph configuration parameters. Since we use NVMe journals, the most performance-impacting parameters for us were those related to writeback throttling. The problem was throttling kicking in too soon: the journals could have absorbed much more load, but were forcibly slowed down to the pace of the backing HDDs. You can, but should not, turn off wbthrottle altogether, as that might later introduce huge spikes and slowdowns when the HDDs can’t keep up with the NVMe. We changed these parameters based on our own calculations.
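For FileStore, the writeback throttle is controlled by the `filestore wbthrottle` options. A sketch of the kind of change involved (the numbers below are illustrative, not the values we calculated):

```ini
# ceph.conf -- raise the writeback-throttle trigger points so NVMe journals
# can absorb bursts instead of being paced by the backing HDDs.
# Values are examples only; derive your own from journal and HDD throughput.
[osd]
filestore wbthrottle enable = true
# Start flushing later than the defaults:
filestore wbthrottle xfs ios start flusher = 5000
filestore wbthrottle xfs bytes start flusher = 419430400
filestore wbthrottle xfs inodes start flusher = 5000
# Hard limits keep the backlog bounded so the HDDs can still catch up:
filestore wbthrottle xfs ios hard limit = 50000
filestore wbthrottle xfs bytes hard limit = 4194304000
filestore wbthrottle xfs inodes hard limit = 50000
```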
The rule of thumb for Ceph is: you need as many CPU cores as there are OSDs on the node; after that, only core frequency counts. And when it comes to CPU frequency, you should be aware of CPU sleep states, namely C-states. You can turn them off, trading higher power consumption for lower latency.
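On Intel hardware, deep C-states can be capped from the kernel command line; for example (illustrative – check your platform’s documentation before copying):

```ini
# /etc/default/grub -- keep cores in shallow sleep states for lower latency,
# at the cost of higher idle power draw. Run update-grub and reboot to apply.
GRUB_CMDLINE_LINUX="intel_idle.max_cstate=1 processor.max_cstate=1"
```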
- Choosing right IO scheduler
The Linux kernel ships more than one IO scheduler, and different schedulers perform differently under different workloads. For Ceph, you should use the deadline scheduler.
When you are building hardware for your Ceph cluster, you need to take your RAID controller into account if you are going to use one. Not all RAID controllers are made equal. Some controllers drain more CPU than others, which means fewer CPU cycles left for Ceph operations. Benchmarks done by the Ceph project depict this very well.
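One way to make that choice persistent is a udev rule; a sketch (the device match may need adjusting for your naming scheme):

```ini
# /etc/udev/rules.d/60-io-scheduler.rules -- use deadline on rotational disks.
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="1", ATTR{queue/scheduler}="deadline"
```

For a quick test without rebooting, you can also write the scheduler name directly to /sys/block/&lt;dev&gt;/queue/scheduler.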
- Tuning underlying filesystem
The filesystem should be configured accordingly, too.
Use XFS for the OSD filesystem with the following recommended mount options: noatime, nodiratime, logbsize=256k, logbufs=8, inode64.
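An illustrative fstab entry with those options (the device path and mount point are made up):

```ini
# /etc/fstab
/dev/sdb1  /var/lib/ceph/osd/ceph-0  xfs  noatime,nodiratime,logbsize=256k,logbufs=8,inode64  0 0
```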
- Other hardware components
Our first Ceph cluster was SSD-only, with nodes having 24 standalone OSDs. Benchmarks quickly revealed a bottleneck somewhere around 24 Gbps, even though the LSI HBA has a throughput of 48 Gbps. Digging into how the server is built and how the drives are connected made it clear that the bottleneck was… the SAS expander, because of the way the drives are attached to it.
- Kernel client vs. FUSE client
Our initial deployment used the FUSE Ceph client, because it supported filesystem quotas and lightning-fast file and directory size reporting: the metadata server tracks every file operation (unlike traditional filesystems, where you have to calculate directory sizes every time you need them). But FUSE was unacceptably slow.
Let’s take some simple benchmarks:
- WordPress extraction (as this storage is used for a shared hosting platform, we consider WordPress deployment speed a good benchmark). Using the FUSE client it took, on average, 30 seconds to extract and deploy WordPress on our CephFS filesystem. Using the kernel client takes up to 2 seconds.
- Drupal (for the same reasons as WordPress). The FUSE client takes, on average, 40 seconds to extract. The kernel client – up to 4 seconds.
- Another quick benchmark is to extract the Linux kernel. Using the FUSE client it took, on average, 4 minutes to extract linux-4.10.tar.xz on the CephFS filesystem. Using the kernel client takes up to 30 seconds.
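A hypothetical, self-contained version of this kind of benchmark (point workdir at a CephFS mount to reproduce the FUSE vs. kernel comparison; here it just uses a temporary directory):

```shell
# Build a tarball of many small files, then time its extraction on the
# filesystem under test -- a rough stand-in for a WordPress deployment.
workdir=$(mktemp -d)
mkdir -p "$workdir/src" "$workdir/out"
for i in $(seq 1 200); do echo "file $i" > "$workdir/src/f$i.txt"; done
tar -C "$workdir" -cf "$workdir/sample.tar" src

start=$(date +%s%N)
tar -xf "$workdir/sample.tar" -C "$workdir/out"
end=$(date +%s%N)
echo "extracted $(ls "$workdir/out/src" | wc -l) files in $(( (end - start) / 1000000 )) ms"
```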
The difference is huge. We believe same applies to Gluster FUSE client vs. Gluster kernel client. Oh, wait…
As the kernel client (version 4.10) still does not support quotas, we tried to use the FUSE client, and had to rewrite the CephFS mount wrapper mount.fuse.ceph to pass ceph-fuse and filesystem parameters correctly, at the same time solving issues that arose while using the old mount.fuse.ceph with systemd.
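The details of our wrapper are specific to our setup, but a hypothetical sketch of the idea – translating mount(8)-style options into a ceph-fuse invocation – looks like this (the option names handled below are examples only):

```shell
#!/bin/sh
# Hypothetical mount.fuse.ceph replacement. mount(8) invokes helpers as:
#   mount.fuse.ceph <device> <mountpoint> [-o opt1,opt2,...]
dev="$1"; mnt="$2"; shift 2
opts=""
[ "$1" = "-o" ] && opts="$2"
args=""
# Turn comma-separated mount options into ceph-fuse flags:
for o in $(echo "$opts" | tr ',' ' '); do
    case "$o" in
        id=*)   args="$args --id ${o#id=}" ;;
        conf=*) args="$args -c ${o#conf=}" ;;
        *)      ;;   # ignore options ceph-fuse does not understand
    esac
done
exec ceph-fuse $args "$mnt"
```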
- Latency is the key
- You have to inspect all the components
- Sometimes it’s the little things that make the big difference
- Defaults are not always best
- Stop using ceph-deploy; use ceph-ansible or DeepSea (SaltStack)
To sum up
CephFS is very resilient. Extending storage is simple, and with every Ceph software release, our storage becomes more stable and more performant.
Furthermore, we have bigger plans for how to use Ceph. But more on that – to come.