Preface

"base64 encode & decode" is assignment 3 of COS333. Following its hints and consulting RFC 2045 section 6.8 and Wikipedia, I wrote a base64 encoder and decoder.

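What is base64

base64 is an encoding format that represents raw binary data using 64 printable ASCII characters. Because the data is converted from base256 to base64, every 3 bytes become 4 characters, so the output is about one third larger than the input. In most cases that overhead is acceptable.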

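Why base64 is needed

In exchange for the larger size, base64 offers the following benefits:

  • Compatibility with the encoding standards of various legacy systems (1990s), where control characters and other non-printable characters might be misinterpreted
  • Compatibility with legacy text protocols that do not support base256, such as early SMTP, which only handles 7-bit ASCII characters
  • Protection on systems that are not 8-bit clean, where the eighth bit may be cleared to 0 and binary data would be corrupted

Beyond that, modern web development sometimes embeds binary resources such as images in data formats like JSON. The image then has to be converted to text, and base64 is one encoding format that meets this need.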

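Encoding (encode)

Put simply, each base64 character carries 6 bits where a base256 byte carries 8. The least common multiple of 8 and 6 is 24, so 24 bits is the conversion unit: every group of three bytes is converted into four base64 characters.
If the last group has fewer than three bytes, the missing bits are filled with 0 and the output is padded with = characters.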

#include <cassert>
#include <cstddef>
#include <string>

// Excerpt from base64_encoder.cc; the class declaration and the
// GetBase64* helpers are defined elsewhere in the library.
void Base64Encoder::Encode(void const *raw_data, size_t len, std::string &text, bool newline) {
    size_t written = 0;  // base64 characters emitted so far, newlines excluded
    unsigned char const *binary = static_cast<unsigned char const *>(raw_data);

    // Every started group of 3 bytes becomes 4 characters. With newline
    // set, each line of up to 64 characters adds one '\n'.
    size_t const final_size = (len + 2) / 3 * 4;
    text.reserve(final_size + (newline ? (final_size + 63) / 64 : 0));

    size_t i = 0;
    /* Avoid unsigned underflow:
     * do not write len - i >= 3 */
    for (; len >= 3 + i; ) {
        text.push_back(GetBase64Char(GetBase64Index1(binary)));
        text.push_back(GetBase64Char(GetBase64Index2(binary)));
        text.push_back(GetBase64Char(GetBase64Index3(binary)));
        text.push_back(GetBase64Char(GetBase64Index4(binary)));
        binary += 3;
        i += 3;
        written += 4;
        // Break the line after every 64 characters
        if (newline && ((written & 63) == 0))
            text.push_back('\n');
    }

    /* | x..x | xx  |
     * | 3n   i len |
     * padding = 3 - left = 3 - (len - i) */
    int const padding = static_cast<int>(3 - (len - i));
    assert(padding >= 1 && padding <= 3);  // 3 means no bytes are left over

    if (padding == 1) {
        // Two bytes left: three data characters and one '='
        text.push_back(GetBase64Char(GetBase64Index1(binary)));
        text.push_back(GetBase64Char(GetBase64Index2(binary)));
        text.push_back(GetBase64Char(GetBase64IndexPadding1(binary)));
        text.push_back('=');
    } else if (padding == 2) {
        // One byte left: two data characters and two '='
        text.push_back(GetBase64Char(GetBase64Index1(binary)));
        text.push_back(GetBase64Char(GetBase64IndexPadding2(binary)));
        text.push_back('=');
        text.push_back('=');
    }

    // End the output with a newline unless the data is empty or the last
    // line was already terminated inside the loop. Checking the last
    // character (rather than written) also covers a padded group that
    // follows a full 64-character line.
    if (newline && len > 0 && text.back() != '\n')
        text.push_back('\n');
}

For more details, see base64_encoder.cc.

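Decoding (decode)

Decoding is the inverse of encoding. Since encoding produces a stream of groups of four base64 characters, the size of the data we receive must be a multiple of 4. Input that is not can be rejected. Apart from the last group, which needs special handling when it contains = as padding, every group is processed the same way: extract the bits and restore the original base256 bytes.
For the last group:

  • The last character is =
    • If the second-to-last character is also =, there are two padding characters, so the original group held only one base256 byte. Combine the first and second base64 characters to recover it
    • Otherwise there is one padding character, so the original group held two base256 bytes. Combine the first, second, and third characters to recover them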

#include <cstdio>
#include <vector>

// Excerpt from the decoder source; the class declaration and the
// GetBase64 / GetBase256_* helpers are defined elsewhere in the library.
Base64ErrorCode Base64Decoder::Decode(char const *text, size_t len,
                                      std::vector<unsigned char> &raw, bool newline) {
    if (len == 0) return BE_OK;

    // With newline set, the text is expected in lines of at most 64
    // characters, each terminated by '\n' (65 bytes per full line).
    auto new_line_count = newline ? (len / 65 + (len % 65 == 0 ? 0 : 1)) : 0;

    if (((len - new_line_count) & 3) != 0)
        return BE_INVALID_LENGTH;

    // Upper bound on the decoded size (padding makes it up to 2 bytes smaller)
    raw.reserve((len - new_line_count) / 4 * 3);

    char ch1, ch2, ch3, ch4;

    // Drop the trailing newline character
    if (newline) len--;
    // Nothing left to decode (the input was a single newline)
    if (len == 0) return BE_OK;
    // The last four characters are handled separately if they are padded
    const size_t end = text[len-1] == '=' ? len - 4 : len;

    for (size_t i = 0; i < end;) {
        ch1 = GetBase64(text[i++]);
        ch2 = GetBase64(text[i++]);
        ch3 = GetBase64(text[i++]);
        ch4 = GetBase64(text[i++]);
        if (ch1 == -1 || ch2 == -1 || ch3 == -1 || ch4 == -1) {
            fprintf(stderr, "i = %zu, ch1 = %c, ch2 = %c, ch3 = %c, ch4 = %c\n", i, text[i-4], text[i-3], text[i-2], text[i-1]);
            return BE_INVALID_ENCODING;
        }

        raw.push_back(GetBase256_1(ch1, ch2));
        raw.push_back(GetBase256_2(ch2, ch3));
        raw.push_back(GetBase256_3(ch3, ch4));

        // Skip the newline character at the end of a line
        if (newline && text[i] == '\n') ++i;
    }

    if (text[len-1] == '=') {
        /* | 00xxxxxx | 00xxxxxx | = | = | */
        if (text[len-2] == '=') {
            ch1 = GetBase64(text[len-4]);
            ch2 = GetBase64(text[len-3]);
            // Validate the final group as well
            if (ch1 == -1 || ch2 == -1) return BE_INVALID_ENCODING;
            raw.push_back(GetBase256_1(ch1, ch2));
        } else {
            /* | 00xxxxxx | 00xxxxxx | 00xxxxxx | = | */
            ch1 = GetBase64(text[len-4]);
            ch2 = GetBase64(text[len-3]);
            ch3 = GetBase64(text[len-2]);
            if (ch1 == -1 || ch2 == -1 || ch3 == -1) return BE_INVALID_ENCODING;
            raw.push_back(GetBase256_1(ch1, ch2));
            raw.push_back(GetBase256_2(ch2, ch3));
        }
    }

    return BE_OK;
}

For more details, see base64_decoder.

Test

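The tests feed a variety of files to the program, write the output to other files, and use cmp(1) to compare it against the encoding and decoding results of openssl enc, which verifies the program's correctness.
For convenience and uniformity, the programs read from stdin and write to stdout. File handling is done through shell redirection.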

#!/bin/bash
g++ -o encode encode.cc ../base64_encoder.cc
g++ -o decode decode.cc ../base64_decoder.cc

for file in $(find raw-files -type f -print); do
    echo "file: $file"
    echo "*** encode test ***"
    ./encode <"$file" >x0
    openssl enc -e -base64 <"$file" >x1
    if ! cmp x0 x1; then
        exit 1
    fi
    echo "*** encode test end ***"

    echo "*** decode test ***"
    ./decode <x0 >dx0
    openssl enc -d -base64 <x1 >dx1
    # Compare the decoded outputs, not the encoded files again
    if ! cmp dx0 dx1; then
        exit 1
    fi
    echo "*** decode test end ***"
done

echo
echo "*** All tests pass ***"
echo

echo "Remove files used for tests"
rm x0 x1 dx0 dx1
echo "Remove successfully"

Github

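I packaged the encoder and decoder as a class library. The classes are empty, but I did not implement them as plain functions, to leave room for future extension.
I named the library uc, after Gundam Unicorn.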