
Scaleway Cold Storage

Overview

Scaleway Cold Storage (or GLACIER) is a storage class on Scaleway S3 Object Storage, built for archiving data at a low cost. It is not a product in itself, but rather an extension of our object storage: it cannot be accessed without the S3 API.

In this document, we will go over how this project was born, along with some of its technical internals.

Why?

In 2016, Online (now Scaleway) launched an archiving product: C14. The product was very hardware-centric, mainly built around two major points:

- Cheap SMR disks
- In-house boards that can power disks on and off on demand

On the API side, it was built like a vault: one opens a vault, puts data inside it, then closes the vault (archiving it). There were several shortcomings with this design, and the main one was that the client needed to un-archive a whole archive in order to access a single file in it. In other words, to access a 1 GB file inside a 40 TB archive, one needed to unarchive the whole 40 TB. That can be fine for some use cases, but is a limiting point for others.

The main problem with this design was the 'data front ends', AKA the big servers that were keeping the unarchived data. As one can imagine, multiple clients unarchiving multi-terabyte vaults can fill up a server quite quickly, thus blocking other clients from the service.

In 2019, an internal Proof of Concept was made in order to demonstrate that we were capable of using the C14 hardware for the, at the time, brand new object storage. There were limitations, of course, but the PoC was working quite well.

This proof of concept came from a basic realization: the C14 hardware was under-used. At the time, we had around 40PB of C14 hardware in production, with only 10PB used by clients, and lots of spares in stock.

An effort was made to turn the PoC into a production-ready project, and around a year later, GLACIER was in beta on Scaleway.

How we approached the project

Integration within S3

First of all, the project was to be used through the S3 API, so we needed a pretty standard way to access it. Lots of patches were made on our S3 gateways in order to be more compliant with Amazon's S3 GLACIER storage class. We also had to begin work on a complete lifecycle engine, since the Lifecycle feature is very important with GLACIER.
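
To give an idea of what a client-facing rule looks like, here is a minimal example of a standard S3 lifecycle configuration that transitions objects under a prefix to GLACIER after 30 days (the rule name, prefix and delay are made up for the example):

<LifecycleConfiguration>
  <Rule>
    <ID>archive-logs</ID>
    <Filter>
      <Prefix>logs/</Prefix>
    </Filter>
    <Status>Enabled</Status>
    <Transition>
      <Days>30</Days>
      <StorageClass>GLACIER</StorageClass>
    </Transition>
  </Rule>
</LifecycleConfiguration>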

Learning from our C14 mistakes

After three years of running C14, we learned a thing or two about what works and what does not when archiving data. The main point was to build GLACIER around read optimization, rather than write optimization. We also decided to use file systems on the SMR disks, since in those three years, lots of patches had been made in the Linux kernel to optimize filesystem interaction with SMR disks.

Architecture Overview

[Diagram: GLACIER architecture overview]

Internals

Hardware

MrFreeze

The 'MrFreeze' is the name of the board that hosts the archiving solution. The board was made in-house, and we did not modify it for GLACIER. It comes with the following:

As one can see, the main feature of the board is to have lots of disks, but only two SATA buses to access them. So only two disks can be powered on at any given time.

The board itself exposes a power-line API through GPIO ports, and we keep a 'what-to-power' map on the software side in order to power on disk X or Y.

One funny quirk of this design is that we don't need to ventilate those boards that much: since 54 disks are powered down at all times, heat dissipation works quite well.

The main caveat, of course, is that a disk needs some time to power on: around 15 seconds from power-on to being mountable by userspace.
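
As a rough sketch of the idea (the GPIO numbers, sysfs paths and function names below are illustrative, not the actual Freezer code), powering on a disk boils down to toggling the right GPIO line from the 'what-to-power' map, then waiting for the kernel to expose the block device:

#include <stdio.h>
#include <unistd.h>

#define DISK_COUNT 56                       /* illustrative slot count */

/* 'what-to-power' map: disk slot -> GPIO line (made-up values) */
static const int disk_to_gpio[DISK_COUNT] = { 12, 13, 14 /* ... */ };

static int gpio_set(int line, int value)
{
    char path[64];
    FILE *f;

    snprintf(path, sizeof(path), "/sys/class/gpio/gpio%d/value", line);
    f = fopen(path, "w");
    if (!f)
        return -1;
    fprintf(f, "%d", value);
    fclose(f);
    return 0;
}

/* Power on a disk slot, then wait for its device node to show up;
 * in practice this takes around 15 seconds. */
static int disk_power_on(int slot, const char *devpath)
{
    if (gpio_set(disk_to_gpio[slot], 1) < 0)
        return -1;
    for (int i = 0; i < 30; i++) {          /* poll for up to 30 seconds */
        if (access(devpath, F_OK) == 0)
            return 0;                       /* ready to be mounted */
        sleep(1);
    }
    return -1;                              /* the disk never showed up */
}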

SMR Disks

The disks that we use to fill those boards are 8TB Seagate SMR disks. Those disks are built for storage density, and thus are perfect for data archiving. They can be quite slow, especially for writes, but that is something that comes with all SMR disks.

Location constraints

The C14 racks are located in our Parisian datacenters, DC2 and DC4 (AKA The Bunker). Since one rack is around one metric ton of disks, we cannot place them wherever we want; for example, a MrFreeze rack in our AMS datacenter is totally out of the question. So, we need to be able to transfer data from the AMS (or WAW) datacenter to the Parisian ones.

Software

Freezer

The Freezer is the software that operates the MrFreeze board. It is responsible for powering disks on and off, and for actually writing data to them. It's a very 'dumb' piece of software though, since all the database and intellect are in the worker, which runs on far more powerful machines.

The Freezer communicates with the worker over a TCP connection, and exposes a basic binary API:

typedef enum {
    ICE_PROBE_DISK = 0,     /*!< Probe a disk for used and total size */
    ICE_DISK_POWER_ON,      /*!< Power on a disk */
    ICE_DISK_POWER_OFF,     /*!< Power off a disk */
    ICE_SEND_DATA,          /*!< Send data */
    ICE_GET_DATA,           /*!< Get data */
    ICE_DEL_DATA,           /*!< Delete data */
    ICE_SATA_CLEAN,         /*!< Clean all the SATA buses, power off everything */
    ICE_FSCK_DISK,          /*!< Request for a filesystem integrity check on a disk */
    ICE_LINK_DATA,          /*!< Link data (zero copy) */
    ICE_ERROR,              /*!< Error reporting */
} packet_type_t;

For simplicity, we logically split the freezer in two parts, one per SATA bus. So in reality, two workers are speaking with one freezer, each one with a dedicated bus. I won't go into detail on the freezer internals, since it's mainly dumping data from a socket to an inode (or the opposite for a read), and making some integrity checks on it.
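
To give an idea of what that looks like, here is a hedged sketch of handling an ICE_SEND_DATA packet by dumping the socket payload into an inode; the header layout and field names are assumptions made for this sketch, not the actual wire format:

#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>

/* Hypothetical wire header for the Freezer protocol */
typedef struct {
    uint32_t type;      /* one of packet_type_t */
    uint32_t disk;      /* target disk slot */
    uint64_t size;      /* payload size in bytes */
    char     key[64];   /* chunk identifier */
} __attribute__((packed)) ice_header_t;

/* ICE_SEND_DATA: stream 'size' bytes from the worker socket into a file */
static int ice_send_data(int sock, const ice_header_t *hdr, const char *root)
{
    char path[256], buf[1 << 16];
    uint64_t left = hdr->size;

    snprintf(path, sizeof(path), "%s/%s", root, hdr->key);
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    while (left > 0) {
        ssize_t n = read(sock, buf, left < sizeof(buf) ? left : sizeof(buf));
        if (n <= 0)
            break;
        if (write(fd, buf, n) != n)
            break;
        left -= n;
    }
    close(fd);
    return left == 0 ? 0 : -1;
}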

Integration within OpenIO

Before we explain how the actual worker works, we need to explain a bit of OpenIO internals: once an object is created within OpenIO, it is split into erasure-coded chunks (6+3), and a rawx is elected to store each chunk. The rawx is a simple binary much like the freezer, without the disk-power bit: it simply dumps a socket into an inode, or the opposite. So for an object, a minimum of 9 rawxs are used to store the actual data.
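
To make the 6+3 scheme concrete, here is a small back-of-the-envelope helper (a sketch, not OpenIO code) showing how much data actually lands on the rawxs for a given object:

#include <stdio.h>

#define EC_DATA   6
#define EC_PARITY 3

int main(void)
{
    unsigned long long object = 600ULL * 1024 * 1024;   /* a 600 MiB object */
    unsigned long long chunk  = object / EC_DATA;       /* 100 MiB per chunk */
    unsigned long long stored = chunk * (EC_DATA + EC_PARITY);

    /* 9 chunks of 100 MiB = 900 MiB on disk, a 1.5x overhead */
    printf("chunk size: %llu MiB, stored: %llu MiB\n",
           chunk >> 20, stored >> 20);
    return 0;
}

Nine chunks for a 600 MiB object means a 1.5x storage overhead, and any 6 of the 9 chunks are enough to rebuild the object.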

For hot storage, we don't need to go further than that. We have a rawx per disk (mechanical or SSD), which means the chunk is accessible at all times. For cold storage though, we need to go one step further and actually send the data to the archiving disks.

In order to do that, I used OpenIO's golang rawx code, and added the following logic:

The other end of the IPC socket is the actual GLACIER worker, which talks with all the rawxs of a rack; there are 12 mechanical disks per rack, so one worker speaks with 12 different rawxs.

On a write, the worker transfers the data to the freezers with a low priority, actually emptying the inode on the rawx. On a read, it gets the data back from the freezer to the rawx with a high priority, in order for it to be accessible through the usual OpenIO APIs.
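
The scheduling can be pictured as two queues, with reads always served before archive writes; this is only a minimal sketch of the idea, not the worker's actual scheduler:

#include <stddef.h>

typedef enum { JOB_READ_HIGH, JOB_WRITE_LOW } job_prio_t;

typedef struct job {
    job_prio_t   prio;
    struct job  *next;
} job_t;

static job_t *high_queue, *low_queue;   /* two simple FIFO lists */

/* Pick the next transfer: drain high-priority reads first, and only
 * archive data to the Freezer when nothing is waiting to be read back. */
static job_t *next_job(void)
{
    if (high_queue) {
        job_t *j = high_queue;
        high_queue = j->next;
        return j;
    }
    if (low_queue) {
        job_t *j = low_queue;
        low_queue = j->next;
        return j;
    }
    return NULL;                        /* nothing to do */
}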

Worker

The worker is the piece of software that links the object storage with the archiving servers. It receives commands on an IPC socket, and executes them asynchronously.

There are 3 main types of command that the worker can receive:

In the case of a CREATE, the job is executed in a best-effort manner; we simply do it when we have the time to do so. That's not a problem, because the data is still available if the client needs it, and we have plenty of space on the hot-storage hard drives. When the job is executed, we elect a disk to store the chunk. Election is a straightforward process, as the goal is to use a minimal number of disks in total, in order not to lose too much time powering disks on and off. Once we find an appropriate disk, the data is sent to the freezer, and we empty the inode on the hot storage. The inode itself is not deleted, only its content is. We then store the disk number, along with some other information, in a database, in order to be able to retrieve the content later on.
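
A minimal sketch of such an election heuristic could look like the following (field names and the exact policy are assumptions for illustration): prefer a disk that is already powered on and has room, fall back to a powered-off disk only if needed.

#include <stdbool.h>
#include <stdint.h>

typedef struct {
    int      slot;
    bool     powered_on;
    uint64_t free_bytes;
} disk_t;

static int elect_disk(disk_t *disks, int count, uint64_t chunk_size)
{
    int candidate = -1;

    for (int i = 0; i < count; i++) {
        if (disks[i].free_bytes < chunk_size)
            continue;                    /* chunk does not fit */
        if (disks[i].powered_on)
            return disks[i].slot;        /* best case: no spin-up needed */
        if (candidate < 0)
            candidate = disks[i].slot;   /* remember a powered-off fallback */
    }
    return candidate;                    /* -1 means no disk can take it */
}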

In the case of a GET, the job is executed as soon as we can. It has top priority for scheduling, so the only thing keeping us from executing it is prior GET jobs. Once the job is executed, we simply get the data back from the archiving server, and fill the inode on the hot-storage hard drive. We then set a timer on that inode, usually around a day, in order to garbage collect it and not waste space on the hot storage.
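
The garbage collection of those restored copies can be sketched like this (the expiry tracking and names are assumptions; only the idea of emptying the inode after about a day comes from the design described above):

#include <stdio.h>
#include <time.h>

#define RESTORE_TTL (24 * 60 * 60)      /* about a day */

typedef struct {
    char   path[256];       /* inode holding the restored chunk */
    time_t restored_at;     /* when the GET job filled it */
} restored_chunk_t;

static void gc_restored(restored_chunk_t *chunks, int count)
{
    time_t now = time(NULL);

    for (int i = 0; i < count; i++) {
        if (now - chunks[i].restored_at < RESTORE_TTL)
            continue;                    /* still within its restore window */
        /* Empty the inode on hot storage; truncating keeps the inode
         * itself around, only its content goes away. The archived
         * copy on the MrFreeze disks is untouched. */
        FILE *f = fopen(chunks[i].path, "w");
        if (f)
            fclose(f);
    }
}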

It is important to note that this mechanism has nothing to do with the "restore time" of the restore route, as the restore API call does move the object from GLACIER to STANDARD (i.e., from archiving to hot storage).

As mentioned previously, each worker is linked with one SATA bus of a freezer, so a worker only handles 22 disks in total.

Lifecycle

The lifecycle engine sits between OpenIO's API, the rawx and the S3 gateway, creating and deleting bucket rules and scheduling the transition or expiration jobs on the objects matching those rules. It is a distributed system, with a lifecycle-master that communicates with the gateway, and schedules and dispatches jobs to lifecycle-workers, which are present on all storage pods.

The master keeps track of every job to execute, while the lifecycle-worker only executes the tasks it receives, using the capacity of the storage pod as a temporary storage solution while transferring the object to Cold Storage and back. One of the main assets of the engine is that it can wait for workers to put the data back from Cold Storage onto the rawx, which is why it is also used to execute the restore-object S3 call, which does not need any lifecycle rule on the bucket.
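
As a purely illustrative sketch (none of these names come from the actual engine), the job a lifecycle-master dispatches to a lifecycle-worker only needs to identify the object, the action, and the target storage class:

#include <stdint.h>

typedef enum {
    LC_TRANSITION,      /* move the object to another storage class */
    LC_EXPIRATION,      /* delete the object */
    LC_RESTORE,         /* bring the object back from Cold Storage */
} lc_action_t;

typedef struct {
    lc_action_t action;
    char        bucket[64];
    char        object_key[1024];
    char        storage_class[16];  /* e.g. "GLACIER" or "STANDARD" */
    uint64_t    version;            /* which version of the object */
} lc_job_t;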
