Channel: Algorithm – Randy Gaul's Game Programming Blog
Base64 Encoding and some Opinions

Recently I needed to do some base64 encoding in C. I was hoping to find two functions, one for encoding and one for decoding. Base64 encoding is useful for transmitting a byte array as text in a safe manner, where safe means common text editors or copy + paste mechanisms will not botch the underlying data. It is commonly used to transmit binary data inside JSON or over HTTP.

My initial search was on GitHub. I searched for “base64”, filtered the results to repositories under the C category, and sorted by “most stars”. Here are my thoughts on the first few results I found, written sort of like a code review of popular open source options.


The most popular C solution on GitHub is base64 by aklomp. The big thing about this solution is that it implements SIMD intrinsics for a wide variety of platforms. The repository has over 4k lines of code, meaning it’s a pretty beefy dependency. SIMD does sound nice, but for an operation as simple as base64 encoding it might not be worth it. Personally I will not be doing enough base64 encoding to justify such a large amount of code with such heavy optimizations. My preference would be something under 500 lines of plain cross-platform C. This solution may be good on servers that need to process incoming data as fast as possible, perhaps to avoid having to apply write backpressure in certain cases. For me this solution is overkill.

Despite the large amount of code the API actually looks quite well designed. The only header users need to include, libbase64.h, does not include any other headers and has only four functions declared. Two functions are used to initialize some kind of state object, and the other two do encoding/decoding.

The state is used to determine what kind of intrinsics to use. I think it is unnecessary to have dedicated functions for picking SIMD intrinsics, and would instead prefer to pick based on the compiled target, but these kinds of complexities are expected when dealing with SIMD. For maximum simplicity I would prefer only two functions and no SIMD at all. SIMD really clutters the API.
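As a rough sketch of what picking based on the compiled target could look like, the snippet below chooses a code path with preprocessor checks instead of a runtime state object. The function name `base64_backend_name` is made up for illustration and is not part of aklomp's library. `__AVX2__`, `__SSE2__`, and `__ARM_NEON` are macros that GCC and Clang predefine when the matching instruction set is enabled for the build.

```c
/* Hypothetical helper: reports which code path was selected at compile time. */
static const char* base64_backend_name(void)
{
#if defined(__AVX2__)
	return "avx2";
#elif defined(__SSE2__)
	return "sse2";
#elif defined(__ARM_NEON)
	return "neon";
#else
	return "scalar";
#endif
}
```

The trade-off is that a binary built this way only runs on CPUs that support whatever the compiler was told to target, which is presumably why a runtime check exists in the first place.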


Next on the list is fastbase64 by lemire. Fast as in SIMD. I decided to simply skip this one after seeing the previous SIMD solution.


Next is b64.c by littlestar. The example code is super simple, making for an attractive API. This example is taken straight from the readme.

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include "b64.h"

int
main (void) {
  unsigned char *str = "brian the monkey and bradley the kinkajou are friends";
  char *enc = b64_encode(str, strlen(str));

  printf("%s\n", enc); // YnJpYW4gdGhlIG1vbmtleSBhbmQgYnJhZGxleSB0aGUga2lua2Fqb3UgYXJlIGZyaWVuZHM=

  char *dec = b64_decode(enc, strlen(enc));

  printf("%s\n", dec); // brian the monkey and bradley the kinkajou are friends
  free(enc);
  free(dec);
  return 0;
}

It looks like the encode/decode functions call malloc underneath. This is forgivable, but it can incur a performance hit, since many malloc implementations take a mutex. The mutex cost can be pretty annoying when a single machine is running multiple decoders simultaneously, causing inadvertent synchronization between what should otherwise be isolated threads.

I was a bit impressed by the foresight of the header design. Here’s a snippet from the header.

/**
 *  Memory allocation functions to use. You can define b64_malloc and
 * b64_realloc to custom functions if you want.
 */

#ifndef b64_malloc
#  define b64_malloc(ptr) malloc(ptr)
#endif
#ifndef b64_realloc
#  define b64_realloc(ptr, size) realloc(ptr, size)
#endif

/**
 * Base64 index table.
 */

static const char b64_table[] = {
  'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H',
  'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P',
  'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X',
  'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f',
  'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n',
  'o', 'p', 'q', 'r', 's', 't', 'u', 'v',
  'w', 'x', 'y', 'z', '0', '1', '2', '3',
  '4', '5', '6', '7', '8', '9', '+', '/'
};

malloc and realloc are overridable, which makes the solution more future-proof. This is good. I also noted that an encoding table was exposed in the header. At first I thought this was completely unnecessary and should be hidden inside the implementation. But if we look into the base64 RFC there are actually two encoding tables presented. One uses + and /, while the other uses - and _. The latter is meant for URL and filename safe encodings, though the former seems more common. By placing the table in the header, users can change the source and recompile to suit their needs. More importantly, which table this library uses is shown right here in the header. A smart move as an API designer!
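To make the difference between the two tables concrete, here is a small sketch. The two alphabets are the ones listed in RFC 4648 sections 4 and 5 and differ only at indices 62 and 63. The helper `b64_to_url_safe` is a hypothetical name of my own and is not part of b64.c.

```c
#include <stddef.h>

/* Standard alphabet (RFC 4648 section 4). */
static const char b64_std_alphabet[] =
	"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/* URL and filename safe alphabet (RFC 4648 section 5). */
static const char b64_url_alphabet[] =
	"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_";

/* Hypothetical helper: rewrites standard-alphabet text in place so that it
   uses the URL safe alphabet. Padding characters are left untouched. */
static void b64_to_url_safe(char* s)
{
	for (size_t i = 0; s[i] != '\0'; ++i)
	{
		if (s[i] == '+') s[i] = '-';
		else if (s[i] == '/') s[i] = '_';
	}
}
```

Because only two characters differ, converting already-encoded text is a simple substitution and does not require decoding first.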

I decided to take a peek at the implementation. It looks small enough. The whole repository is just above 500 lines of code. I noted that the decode implementation does not seem to validate its input or return an error when invalid input is found. The RFC suggests rejecting the entire input if any bad data is found, in order to avoid exposing security vulnerabilities.
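A minimal sketch of the kind of up-front check I have in mind is below. It assumes the standard alphabet and mandatory `=` padding, and only looks at length, alphabet membership, and padding placement. The name `b64_is_well_formed` is hypothetical and does not come from any of the libraries reviewed here.

```c
#include <stddef.h>
#include <string.h>

static const char b64_valid_chars[] =
	"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/* Hypothetical helper: returns 1 if `s` looks like acceptable base64, else 0. */
static int b64_is_well_formed(const char* s, size_t len)
{
	/* Padded base64 always comes in groups of four characters. */
	if (len % 4 != 0) return 0;

	/* At most two `=` are allowed, and only at the very end. */
	size_t pads = 0;
	if (len > 0 && s[len - 1] == '=') ++pads;
	if (pads == 1 && s[len - 2] == '=') ++pads;

	/* Everything before the padding must be in the alphabet. Searching
	   exactly 64 bytes keeps the string terminator out of the match. */
	for (size_t i = 0; i < len - pads; ++i)
	{
		if (!memchr(b64_valid_chars, s[i], 64)) return 0;
	}
	return 1;
}
```

With this check `"Zg=="` passes, while `"f==="`, `"foo~"`, and the three-character `"foo"` are all rejected before any decoding work happens.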

If we peek at the decode operation it actually loops over the entire encoding table in order to find the reverse mapping. This for loop can be replaced by another table lookup by constructing a reverse lookup table. This optimization would also simplify the decoder implementation.
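Here is one way the reverse table could be built, as a sketch rather than a patch for b64.c. It assumes the standard alphabet, and the names `b64_reverse`, `b64_build_reverse_table`, and `b64_value_of` are invented for this example.

```c
#include <stdint.h>
#include <string.h>

static const char b64_alphabet[] =
	"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/* One entry per possible byte value. -1 marks bytes outside the alphabet. */
static int8_t b64_reverse[256];

/* Hypothetical helper: fills the table. Call once before decoding. */
static void b64_build_reverse_table(void)
{
	memset(b64_reverse, -1, sizeof(b64_reverse));
	for (int i = 0; i < 64; ++i)
	{
		b64_reverse[(unsigned char)b64_alphabet[i]] = (int8_t)i;
	}
}

/* Returns the 6-bit value for `c`, or -1 if `c` is not in the alphabet. */
static int b64_value_of(unsigned char c)
{
	return b64_reverse[c];
}
```

Indexing by the raw byte costs 256 entries, but it means no range check is needed before the lookup, and the -1 entries double as input validation.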

At this point I decided to look at a few more implementations to see if anyone is properly rejecting input data as the RFC suggests, and using a lookup table for decoding.

Other than these points the solution looks quite solid.


Next up is base64 by davidlazar. Curiously the repository description says “High-assurance base64”. Maybe this means input validation on decoding? After peeking into the repository, looks like there’s only an encoder! No decoder… Next!


The last one I will consider is base64 by zhicheng. The header is perfect. Macros for computing output sizes, along with just two functions (one for encoding, and one for decoding). Due to how the base64 algorithm works output sizes can be computed in constant-time with a simple formula. Two macros are used to compute these sizes.
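The size formulas are worth spelling out. Every group of 3 input bytes, rounded up, becomes 4 output characters, and every group of 4 characters decodes to at most 3 bytes. The sketch below writes them as functions; the names are my own for illustration and are not zhicheng's macros.

```c
#include <stddef.h>

/* Exact number of characters needed to encode `n` bytes, padding included. */
static size_t b64_encoded_size(size_t n)
{
	return ((n + 2) / 3) * 4;
}

/* Upper bound on the bytes produced by decoding `n` characters. The exact
   count is lower by one for each trailing `=` in the input. */
static size_t b64_decoded_size_max(size_t n)
{
	return ((n + 3) / 4) * 3;
}
```

For example, encoding 1, 2, or 3 bytes always yields 4 characters, and 4 bytes yields 8. Decoding the 8 characters of "Zm9vYg==" has a bound of 6 bytes, of which only 4 are real because of the two pads.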

The decoder uses a lookup table. Good. However, I don’t see any kind of input validation… Skip!


Finally I ended up just implementing my own solution. After looking at other implementations and reading the RFC I realized a good way to implement the encoder and decoder. Here is the header followed by the implementation. This code comes from my own personal codebase, and is under the zlib license.

/*
	Cute Framework
	Copyright (C) 2019 Randy Gaul https://randygaul.net

	This software is provided 'as-is', without any express or implied
	warranty.  In no event will the authors be held liable for any damages
	arising from the use of this software.

	Permission is granted to anyone to use this software for any purpose,
	including commercial applications, and to alter it and redistribute it
	freely, subject to the following restrictions:

	1. The origin of this software must not be misrepresented; you must not
	   claim that you wrote the original software. If you use this software
	   in a product, an acknowledgment in the product documentation would be
	   appreciated but is not required.
	2. Altered source versions must be plainly marked as such, and must not be
	   misrepresented as being the original software.
	3. This notice may not be removed or altered from any source distribution.
*/

#include <cute_defines.h>
#include <cute_error.h>

namespace cute
{

// Info about base 64 encoding: https://tools.ietf.org/html/rfc4648

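// Exact number of characters produced by encoding `size` bytes, padding included.
#define CUTE_BASE64_ENCODED_SIZE(size) ((((size) + 2) / 3) * 4)
// Upper bound on the bytes produced by decoding `size` characters. The exact count
// is lower by one for each trailing `=` in the input.
#define CUTE_BASE64_DECODED_SIZE(size) ((((size) + 3) / 4) * 3)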

CUTE_API error_t CUTE_CALL base64_encode(void* dst, int dst_size, const void* src, int src_size);
CUTE_API error_t CUTE_CALL base64_decode(void* dst, int dst_size, const void* src, int src_size);

}

/*
	Cute Framework
	Copyright (C) 2019 Randy Gaul https://randygaul.net

	This software is provided 'as-is', without any express or implied
	warranty.  In no event will the authors be held liable for any damages
	arising from the use of this software.

	Permission is granted to anyone to use this software for any purpose,
	including commercial applications, and to alter it and redistribute it
	freely, subject to the following restrictions:

	1. The origin of this software must not be misrepresented; you must not
	   claim that you wrote the original software. If you use this software
	   in a product, an acknowledgment in the product documentation would be
	   appreciated but is not required.
	2. Altered source versions must be plainly marked as such, and must not be
	   misrepresented as being the original software.
	3. This notice may not be removed or altered from any source distribution.
*/

#include <cute_base64.h>
#include <cute_c_runtime.h>

// Implementation referenced from: https://tools.ietf.org/html/rfc4648

namespace cute
{

// From: https://tools.ietf.org/html/rfc4648#section-3.2
static const uint8_t s_6bits_to_base64[64] = {
	'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z',
	'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z',
	'0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '+', '/'
};

/*
	Indexed by (character - 43). Covers ASCII '+' (43) through 'z' (122)
	inclusive, which is 80 entries.

	Generated by:

		int out_array[80];
		for (int i = 0; i < 80; ++i) out_array[i] = -1;
		for (int i = 0; i < 64; ++i)
		{
			int val = s_6bits_to_base64[i];
			int index = val - 43;
			out_array[index] = i;
		}
		for (int i = 0; i < 80; ++i) printf("%d,\n", out_array[i]);
*/
static const int s_base64_to_6bits[80] = {
	62, -1, -1, -1, 63, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, -1, -1, -1, -1, -1, -1, -1, 0, 1, 2, 3, 4,
	5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, -1, -1, -1, -1, -1, -1,
	26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51,
};

error_t base64_encode(void* dst, int dst_size, const void* src, int src_size)
{
	int out_size = CUTE_BASE64_ENCODED_SIZE(src_size);
	if (dst_size < out_size) return error_failure("`dst` buffer too small to place encoded output.");

	int triplets = (src_size) / 3;
	int pads = (src_size) % 3 ? 3 - (src_size) % 3 : 0;

	const uint8_t* in = (const uint8_t*)src;
	uint8_t* out = (uint8_t*)dst;

	while (triplets--)
	{
		uint32_t bits = ((uint32_t)in[0]) << 16 | ((uint32_t)in[1]) << 8 | ((uint32_t)in[2]);
		uint32_t a = (bits & 0xFC0000) >> 18;
		uint32_t b = (bits & 0x3F000) >> 12;
		uint32_t c = (bits & 0xFC0) >> 6;
		uint32_t d = (bits & 0x3F);
		in += 3;
		CUTE_ASSERT(a < 64);
		CUTE_ASSERT(b < 64);
		CUTE_ASSERT(c < 64);
		CUTE_ASSERT(d < 64);
		*out++ = s_6bits_to_base64[a];
		*out++ = s_6bits_to_base64[b];
		*out++ = s_6bits_to_base64[c];
		*out++ = s_6bits_to_base64[d];
	}

	switch (pads)
	{
	case 1:
	{
		uint32_t bits = ((uint32_t)in[0]) << 8 | ((uint32_t)in[1]);
		uint32_t a = (bits & 0xFC00) >> 10;
		uint32_t b = (bits & 0x3F0) >> 4;
		uint32_t c = (bits & 0xF) << 2;
		CUTE_ASSERT(a < 64);
		CUTE_ASSERT(b < 64);
		CUTE_ASSERT(c < 64);
		*out++ = s_6bits_to_base64[a];
		*out++ = s_6bits_to_base64[b];
		*out++ = s_6bits_to_base64[c];
		in += 2;
	}	break;

	case 2:
	{
		uint32_t bits = ((uint32_t)in[0]);
		uint32_t a = (bits & 0xFC) >> 2;
		uint32_t b = (bits & 0x3) << 4;
		CUTE_ASSERT(a < 64);
		CUTE_ASSERT(b < 64);
		*out++ = s_6bits_to_base64[a];
		*out++ = s_6bits_to_base64[b];
		in += 1;
	}	break;
	}

	while (pads--)
	{
		*out++ = '=';
	}

	CUTE_ASSERT((int)(out - (uint8_t*)dst) == out_size);

	return error_success();
}

error_t base64_decode(void* dst, int dst_size, const void* src, int src_size)
{
	if (!src_size) return error_success();
	if (src_size % 4) return error_failure("`src_size` is not a multiple of 4 (all base64 streams must be padded to a multiple of four with `=` characters).");
	int quadruplets = src_size / 4;

	const uint8_t* in = (const uint8_t*)src;
	uint8_t* out = (uint8_t*)dst;

	const uint8_t* end = in + src_size;
	int pads = 0;
	if (end[-1] == '=') pads++;
	if (end[-2] == '=') pads++;
	if (pads) quadruplets--;

	// Each `=` stands in for one byte that is absent from the decoded output.
	int out_size = CUTE_BASE64_DECODED_SIZE(src_size) - pads;
	if (dst_size < out_size) return error_failure("`dst` buffer too small to place decoded output.");

	// The table starts at '+' (ASCII 43). Characters below '+' wrap around to large
	// unsigned values after the subtraction, so one comparison rejects both ends.
	const uint32_t table_size = sizeof(s_base64_to_6bits) / sizeof(s_base64_to_6bits[0]);

	// RFC describes the best way to handle bad input is to reject the entire input.
	// https://tools.ietf.org/html/rfc4648#page-14

	while (quadruplets--)
	{
		uint32_t a = *in++ - 43;
		uint32_t b = *in++ - 43;
		uint32_t c = *in++ - 43;
		uint32_t d = *in++ - 43;
		if ((a >= table_size) | (b >= table_size) | (c >= table_size) | (d >= table_size)) return error_failure("Found illegal character in input stream.");
		a = s_base64_to_6bits[a];
		b = s_base64_to_6bits[b];
		c = s_base64_to_6bits[c];
		d = s_base64_to_6bits[d];
		if ((a == ~0) | (b == ~0) | (c == ~0) | (d == ~0)) return error_failure("Found illegal character in input stream.");
		uint32_t bits = (a << 26) | (b << 20) | (c << 14) | (d << 8);
		*out++ = (bits & 0xFF000000) >> 24;
		*out++ = (bits & 0x00FF0000) >> 16;
		*out++ = (bits & 0x0000FF00) >> 8;
	}

	switch (pads)
	{
	case 1:
	{
		uint32_t a = *in++ - 43;
		uint32_t b = *in++ - 43;
		uint32_t c = *in++ - 43;
		if ((a >= table_size) | (b >= table_size) | (c >= table_size)) return error_failure("Found illegal character in input stream.");
		a = s_base64_to_6bits[a];
		b = s_base64_to_6bits[b];
		c = s_base64_to_6bits[c];
		if ((a == ~0) | (b == ~0) | (c == ~0)) return error_failure("Found illegal character in input stream.");
		uint32_t bits = (a << 26) | (b << 20) | (c << 14);
		*out++ = (bits & 0xFF000000) >> 24;
		*out++ = (bits & 0x00FF0000) >> 16;
	}	break;

	case 2:
	{
		uint32_t a = *in++ - 43;
		uint32_t b = *in++ - 43;
		if ((a >= table_size) | (b >= table_size)) return error_failure("Found illegal character in input stream.");
		a = s_base64_to_6bits[a];
		b = s_base64_to_6bits[b];
		if ((a == ~0) | (b == ~0)) return error_failure("Found illegal character in input stream.");
		uint32_t bits = (a << 26) | (b << 20);
		*out++ = (bits & 0xFF000000) >> 24;
	}	break;
	}

	CUTE_ASSERT((int)(out - (uint8_t*)dst) == out_size);

	return error_success();
}

}

The RFC for base64 encoding actually comes with a few very useful test vectors. This made the RFC extremely useful! Whoever wrote this RFC did a great job. Here are my test cases.

/*
	Cute Framework
	Copyright (C) 2019 Randy Gaul https://randygaul.net

	This software is provided 'as-is', without any express or implied
	warranty.  In no event will the authors be held liable for any damages
	arising from the use of this software.

	Permission is granted to anyone to use this software for any purpose,
	including commercial applications, and to alter it and redistribute it
	freely, subject to the following restrictions:

	1. The origin of this software must not be misrepresented; you must not
	   claim that you wrote the original software. If you use this software
	   in a product, an acknowledgment in the product documentation would be
	   appreciated but is not required.
	2. Altered source versions must be plainly marked as such, and must not be
	   misrepresented as being the original software.
	3. This notice may not be removed or altered from any source distribution.
*/

#include <cute_base64.h>
using namespace cute;

CUTE_TEST_CASE(test_base64_encode, "Test vectors from RFC 4648.");
int test_base64_encode()
{
	uint8_t buffer[256];

	// Test vectors from: https://tools.ietf.org/html/rfc4648#section-10

	CUTE_TEST_CHECK(base64_encode(buffer, 256, "", 0).is_error());

	CUTE_TEST_CHECK(base64_encode(buffer, 256, "f", 1).is_error());
	CUTE_TEST_ASSERT(!CUTE_MEMCMP(buffer, "Zg==", 4));

	CUTE_TEST_CHECK(base64_encode(buffer, 256, "fo", 2).is_error());
	CUTE_TEST_ASSERT(!CUTE_MEMCMP(buffer, "Zm8=", 4));

	CUTE_TEST_CHECK(base64_encode(buffer, 256, "foo", 3).is_error());
	CUTE_TEST_ASSERT(!CUTE_MEMCMP(buffer, "Zm9v", 4));

	CUTE_TEST_CHECK(base64_encode(buffer, 256, "foob", 4).is_error());
	CUTE_TEST_ASSERT(!CUTE_MEMCMP(buffer, "Zm9vYg==", 8));

	CUTE_TEST_CHECK(base64_encode(buffer, 256, "fooba", 5).is_error());
	CUTE_TEST_ASSERT(!CUTE_MEMCMP(buffer, "Zm9vYmE=", 8));

	CUTE_TEST_CHECK(base64_encode(buffer, 256, "foobar", 6).is_error());
	CUTE_TEST_ASSERT(!CUTE_MEMCMP(buffer, "Zm9vYmFy", 8));
	
	CUTE_TEST_CHECK(base64_decode(buffer, 256, "", 0).is_error());

	CUTE_TEST_CHECK(base64_decode(buffer, 256, "Zg==", 4).is_error());
	CUTE_TEST_ASSERT(!CUTE_MEMCMP(buffer, "f", 1));

	CUTE_TEST_CHECK(base64_decode(buffer, 256, "Zm8=", 4).is_error());
	CUTE_TEST_ASSERT(!CUTE_MEMCMP(buffer, "fo", 2));

	CUTE_TEST_CHECK(base64_decode(buffer, 256, "Zm9v", 4).is_error());
	CUTE_TEST_ASSERT(!CUTE_MEMCMP(buffer, "foo", 3));

	CUTE_TEST_CHECK(base64_decode(buffer, 256, "Zm9vYg==", 8).is_error());
	CUTE_TEST_ASSERT(!CUTE_MEMCMP(buffer, "foob", 4));

	CUTE_TEST_CHECK(base64_decode(buffer, 256, "Zm9vYmE=", 8).is_error());
	CUTE_TEST_ASSERT(!CUTE_MEMCMP(buffer, "fooba", 5));

	CUTE_TEST_CHECK(base64_decode(buffer, 256, "Zm9vYmFy", 8).is_error());
	CUTE_TEST_ASSERT(!CUTE_MEMCMP(buffer, "foobar", 6));

	// Assert failure on some bad inputs.
	CUTE_TEST_ASSERT(base64_decode(buffer, 256, "f===", 4).is_error());
	CUTE_TEST_ASSERT(base64_decode(buffer, 256, "foo~", 4).is_error());
	CUTE_TEST_ASSERT(base64_decode(buffer, 256, "foo", 3).is_error());
	CUTE_TEST_ASSERT(base64_decode(buffer, 256, "\\!@$", 4).is_error());

	return 0;
}
