Testing File System Multicore Scalability with FxMark

Parallel I/O is a common technique for improving application performance. With the emergence and development of NVMe devices, storage can sustain far more concurrent I/O, and CPU core counts keep growing, so many high-performance databases now rely on multiple cores to accelerate their core operations. However, file systems contain scalability bottlenecks such as lock contention, reference counting, and cache-line contention, so the throughput of file system operations does not grow linearly with the number of cores. FxMark is a microbenchmark suite open-sourced by the authors of the USENIX Annual Technical Conference (ATC'16) paper "Understanding Manycore Scalability of File Systems"; it measures the multicore scalability of file system operations such as create, delete, read, and write.

This post walks through how FxMark is used, from the perspective of its source code.

The source tree is structured as follows:

fxmark
|------bin        # Python scripts; the main driver
|------logs
|------script
|------src        # C source file of each microbench
|------Makefile
|------README.md  # usage documentation


Main program

The main program lives in the bin/ directory. run-fxmark.py is the command-line entry point, and the call relationships look like this:

run-fxmark.py
|------ run()
|--------mount()
|--------fxmark()
              |
              |--------------------|
                                   |
                                   |
                                 fxmark
                                 |------main()
                                 |--------alloc_bench()
                                 |--------init_bench()
                                 |--------run_bench()   # forks; ncpu processes in total, each calling worker_main()
                                 |          |---------worker_main()
                                 |                     |---------pre_work()
                                 |                     |---------main_work()
                                 |                     |---------post_work()
                                 |--------report_bench()

run-fxmark.py

if __name__ == "__main__":
    # config parameters
    # -----------------
    #
    # o testing core granularity
    #   - Runner.CORE_FINE_GRAIN
    #   - Runner.CORE_COARSE_GRAIN
    #
    # o profiling level
    #   - PerfMon.LEVEL_LOW
    #   - PerfMon.LEVEL_PERF_RECORD
    #   - PerfMon.LEVEL_PERF_PROBE_SLEEP_LOCK
    #   - PerfMon.LEVEL_PERF_PROBE_SLEEP_LOCK_D # do NOT use if you don't understand what it is
    #   - PerfMon.LEVEL_PERF_LOCK               # do NOT use if you don't understand what it is
    #   - PerfMon.LEVEL_PERF_STAT               # for cycles and instructions
    #
    # o testcase filter
    #   - (storage device, filesystem, test case, # core, directio | bufferedio)

    # TODO: make it scriptable
    run_config = [
        (Runner.CORE_FINE_GRAIN,
         PerfMon.LEVEL_LOW,
         ("mem", "*", "DWOL", "*", "directio")),
        #  ("mem", "tmpfs", "filebench_varmail", "32", "directio")),
        # (Runner.CORE_COARSE_GRAIN,
        #  PerfMon.LEVEL_PERF_RECORD,
        #  ("*", "*", "*", "*", "bufferedio")),
        #
        # (Runner.CORE_COARSE_GRAIN,
        #  PerfMon.LEVEL_PERF_RECORD,
        #  ("*", "*", "*", str(cpupol.PHYSICAL_CHIPS * cpupol.CORE_PER_CHIP), "*"))
    ]

    confirm_media_path()
    for c in run_config:
        runner = Runner(c[0], c[1], c[2])
        runner.run()

run_config holds the run configuration. In the code above, ("mem", "*", "DWOL", "*", "directio") is the key part; its format is ("backend device", "file system", "microbench", "number of cores", "I/O mode"). A * is a wildcard, meaning every available option for that field is tested.
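
For instance (an illustrative tuple, not one shipped in the script), ("ssd", "ext4", "MWCL", "*", "bufferedio") would run the MWCL microbench on ext4 created on the SSD device, at every supported core count, using buffered I/O. The valid values for each field are listed below.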

  1. Backend devices
self.MEDIA_TYPES   = ["ssd", "hdd", "nvme", "mem"]
  2. File systems
self.FS_TYPES = [
    # self.FS_TYPES = ["tmpfs",
    "ext4", "ext4_no_jnl",
    "xfs",
    "btrfs", "f2fs",
    # "jfs", "reiserfs", "ext2", "ext3",
]
  3. Microbenchmarks
self.BENCH_TYPES = [
    # write/write
    "DWAL",
    "DWOL",
    "DWOM",
    "DWSL",
    "MWRL",
    "MWRM",
    "MWCL",
    "MWCM",
    "MWUM",
    "MWUL",
    "DWTL",

    # filebench
    "filebench_varmail",
    "filebench_oltp",
    "filebench_fileserver",

    # dbench
    "dbench_client",

    # read/read
    "MRPL",
    "MRPM",
    "MRPH",
    "MRDM",
    "MRDL",
    "DRBH",
    "DRBM",
    "DRBL",

    # read/write
    # "MRPM_bg",
    # "DRBM_bg",
    # "MRDM_bg",
    # "DRBH_bg",
    # "DRBL_bg",
    # "MRDL_bg",
]

Each microbench is explained in more detail in the bench_table definition shown in the microbench section later in this post.

  4. I/O mode: direct I/O or buffered I/O; direct I/O bypasses the page cache.

After the configuration is set up, the script calls confirm_media_path():

def confirm_media_path():
    print("%" * 80)
    print("%% WARNING! WARNING! WARNING! WARNING! WARNING!")
    print("%" * 80)
    yn = input("All data in %s, %s, %s and %s will be deleted. Is it ok? [Y,N]: "
               % (Runner.HDDDEV, Runner.SSDDEV, Runner.NVMEDEV, Runner.LOOPDEV))
    if yn != "Y":
        print("Please, check Runner.LOOPDEV and Runner.NVMEDEV")
        exit(1)
    yn = input("Are you sure? [Y,N]: ")
    if yn != "Y":
        print("Please, check Runner.LOOPDEV and Runner.NVMEDEV")
        exit(1)
    print("%" * 80)
    print("\n\n")

This function warns the user that all data on the following devices will be deleted: every fxmark run creates files on these devices for testing, so before each group of tests the devices are reformatted.

LOOPDEV = "/dev/loopX"
NVMEDEV = "/dev/nvme0n1pX"
HDDDEV = "/dev/sdX"
SSDDEV = "/dev/sdY"
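
Before running anything, these placeholders (sdX, sdY, loopX, nvme0n1pX) have to be edited in run-fxmark.py so that they point at real devices or partitions on the test machine; for example, the HDD tests later in this post use /dev/sdb1 as HDDDEV.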

Next, for "each group" of configurations in run_config (because of the * wildcard, one "group" may expand into many individual configurations), the run() function is called to carry out the tests:

def run(self):
    try:
        cnt = -1
        self.log_start()  # create the log directory and log file; write the configuration to the log
        for (cnt, (media, fs, bench, ncore, dio)) in enumerate(self.gen_config()):
            (ncore, nbg) = self.add_bg_worker_if_needed(bench, ncore)
            nfg = ncore - nbg

            if self.DRYRUN:
                self.log("## %s:%s:%s:%s:%s" % (media, fs, bench, nfg, dio))
                continue

            self.prepre_work(ncore)  # clean up leftovers and set the number of online CPUs
            if not self.mount(media, fs, self.test_root):
                self.log("# Fail to mount %s on %s." % (fs, media))
                continue
            self.log("## %s:%s:%s:%s:%s" % (media, fs, bench, nfg, dio))
            self.pre_work()  # drop caches
            self.fxmark(media, fs, bench, ncore, nfg, nbg, dio)
            self.post_work()
        self.log("### NUM_TEST_CONF = %d" % (cnt + 1))
    finally:
        signal.signal(signal.SIGINT, catch_ctrl_C)
        self.log_end()
        self.fxmark_cleanup()
        self.umount(self.test_root)
        self.set_cpus(0)

run() first records this group's configuration in the log file, then the for loop expands the group into individual configurations. For each configuration it first calls mount() to prepare the backend storage device: for an HDD, for example, the device is formatted with the configured file system and then mounted on a root/ directory under the test directory. All files needed by the tests, and all test operations, therefore live under this root/ directory. When ext4 is used, mount() runs the following commands (note: /dev/sdb1 below is a partition on the HDD I used for testing; its value is set at the HDDDEV = "/dev/sdX" line mentioned above):

sudo umount bin/root
sudo mkdir -p bin/root

sudo mkfs.ext4 -F /dev/sdb1
sudo mount -t ext4 /dev/sdb1 bin/root
sudo chmod 777 bin/root

run() then calls pre_work() to do some preparation, fxmark() to run the test, and finally post_work() to clean up. The core testing work is therefore done by fxmark():

def fxmark(self, media, fs, bench, ncore, nfg, nbg, dio):
    # log file for perf data
    self.perfmon_log = os.path.normpath(
        os.path.join(self.log_dir,
                     '.'.join([media, fs, bench, str(nfg), "pm"])))
    (bin, type) = self.get_bin_type(bench)  # path of the benchmark binary
    directio = '1' if dio is "directio" else '0'

    if directio is '1':
        if fs is "tmpfs":
            print("# INFO: DirectIO under tmpfs disabled by default")
            directio = '0';
        else:
            print("# INFO: DirectIO Enabled")

    # run the benchmark
    cmd = ' '.join([self.fxmark_env(),
                    bin,
                    "--type", type,
                    "--ncore", str(ncore),
                    "--nbg", str(nbg),
                    "--duration", str(self.DURATION),
                    "--directio", directio,
                    "--root", self.test_root,
                    "--profbegin", "\"%s\"" % self.perfmon_start,
                    "--profend", "\"%s\"" % self.perfmon_stop,
                    "--proflog", self.perfmon_log])
    p = self.exec_cmd(cmd, self.redirect)
    if self.redirect:
        for l in p.stdout.readlines():
            self.log(l.decode("utf-8").strip())

As you can see, this function ultimately executes the command bin, which is chosen by get_bin_type():

def get_bin_type(self, bench):
    if bench.startswith("filebench_"):
        return (self.filebench_path, bench[len("filebench_"):])
    if bench.startswith("dbench_"):
        return (self.dbench_path, bench[len("dbench_"):])
    return (self.fxmark_path, bench)

Besides its own set of microbenchmarks, FxMark can also drive filebench and dbench. Since this post focuses on the microbenchmarks, the bin command mentioned above is determined by self.fxmark_path, and the microbench type is carried in the bench variable:

self.FXMARK_NAME    = "fxmark"
...
self.fxmark_path = os.path.normpath(
    os.path.join(CUR_DIR, self.FXMARK_NAME))

So inside fxmark(), the actual test work is handed off to the fxmark file that sits in the same directory as run-fxmark.py. This file is the binary executable built from fxmark.c in the src/ directory.
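
Putting the pieces together, the command line that fxmark() executes looks roughly like `fxmark --type MWCL --ncore 4 --nbg 0 --duration 30 --directio 0 --root bin/root --profbegin "..." --profend "..." --proflog ...` (an illustrative example assembled from the cmd string above, prefixed by the environment set up by fxmark_env(); the actual values come from the current configuration and from self.DURATION).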

fxmark

int main(int argc, char *argv[])
{
    struct cmd_opt opt = {NULL, 0, 0, 0, 0, NULL};
    struct bench *bench;

    /* parse command line options */
    if (parse_option(argc, argv, &opt) < 4) {
        usage(stderr);
        exit(1);
    }

    /* create, initialize, and run a bench */
    bench = alloc_bench(opt.ncore, opt.nbg);
    init_bench(bench, &opt);
    run_bench(bench);
    report_bench(bench, stdout);

    return 0;
}

For each configuration, the fxmark command first parses its command-line arguments and then calls alloc_bench() to allocate space for the bench structure and the worker structures needed by the test:

struct bench *alloc_bench(int ncpu, int nbg)
{
    struct bench *bench;
    struct worker *worker;
    void *shmem;
    int shmem_size = sizeof(*bench) + sizeof(*worker) * ncpu;
    int i;

    /* alloc shared memory using mmap */
    shmem = mmap(0, shmem_size, PROT_READ | PROT_WRITE,
                 MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (shmem == MAP_FAILED)
        return NULL;
    memset(shmem, 0, shmem_size);

    /* init. */
    bench = (struct bench *)shmem;
    bench->ncpu = ncpu;
    bench->nbg = nbg;
    bench->workers = (struct worker*)(shmem + sizeof(*bench));
    for (i = 0; i < ncpu; ++i) {
        worker = &bench->workers[i];
        worker->bench = bench;
        worker->id = seq_cores[i];
        worker->is_bg = i >= (ncpu - nbg);
    }

    return bench;
}

This function calls mmap() to allocate a region of shared memory (shared because every test process needs to access it; an fd of -1 together with MAP_ANONYMOUS means an anonymous mapping, so the fd is ignored). Its size is one bench structure plus ncpu worker structures.

  • The bench structure records the configuration: number of CPU cores, test duration, I/O mode, and so on. It is managed by the main process.
  • Each worker structure records the actual test time, the number of completed operations, the return value, and so on. During the test, each child process manages one worker.

alloc_bench() then fills in the basic fields of each structure instance. The relationship between the structures is shown in the sketch below.
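
The following is a simplified sketch of those structures, reconstructed from the fields used in the code quoted in this post; the real definitions in the fxmark sources contain more members and may differ in types and layout.

/* Simplified sketch of the structures that alloc_bench() places in the
 * shared mmap region (reconstructed; not the exact fxmark definitions). */
#include <stdint.h>
#include <stdio.h>

struct bench;
struct worker;

struct bench_operations {
    int  (*pre_work)(struct worker *w);    /* prepare test files */
    int  (*main_work)(struct worker *w);   /* the measured operation loop */
    int  (*post_work)(struct worker *w);   /* cleanup after the run */
    void (*report_bench)(struct bench *b, FILE *out); /* optional custom report */
};

struct worker {
    struct bench *bench;   /* back-pointer to the shared bench */
    unsigned int id;       /* CPU this worker is pinned to */
    int is_bg;             /* background worker? */
    volatile int ready;    /* set by the worker once pre_work() is done */
    int ret;               /* per-worker error code */
    double works;          /* number of operations completed */
    uint64_t usecs;        /* wall-clock time spent in main_work() */
    uint64_t clocks;       /* TSC cycles spent in main_work() */
};

struct bench {
    unsigned int ncpu;     /* total number of worker processes */
    unsigned int nbg;      /* number of background workers */
    unsigned int duration; /* run time in seconds (--duration) */
    int directio;          /* direct vs. buffered I/O (--directio) */
    volatile int start;    /* worker 0 sets this to release all workers */
    volatile int stop;     /* the SIGALRM handler sets this to end the run */
    struct bench_operations ops;  /* hooks of the selected microbench */
    struct worker *workers;       /* ncpu workers, placed right after the bench */
};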

Next, fxmark calls init_bench(), which fills in the remaining members of the bench instance from the parsed command-line arguments. Among them, ops records the actual microbench (the bench_operations hooks in the sketch above): for each microbench, the function pointed to by pre_work does the preparation before the test, mainly creating the test files; main_work points to the measured test function; and post_work points to the cleanup that runs afterwards.

fxmark then calls run_bench(). After all this groundwork, the test is finally about to start:

void run_bench(struct bench *bench)
{
    int i;
    for (i = 1; i < bench->ncpu; ++i) {
        /**
         * fork() is intentionally used instead of pthread
         * to avoid known scalability bottlenecks
         * of linux virtual memory subsystem.
         */
        pid_t p = fork();
        if (p < 0)
            bench->workers[i].ret = errno;
        else if (!p) {
            worker_main(&bench->workers[i]);
            exit(0);
        }
    }
    worker_main(&bench->workers[0]);
    wait(bench);
}

Based on the configured number of CPUs, this function forks child processes; together with the main process, all ncpu processes call worker_main() to run the test:

static void worker_main(void *arg)
{
    struct worker *worker = (struct worker*)arg;
    struct bench *bench = worker->bench;
    uint64_t s_clk = 1, s_us = 1;
    uint64_t e_clk = 0, e_us = 0;
    int err = 0;

    /* set affinity */
    setaffinity(worker->id);

    /* pre-work */
    if (bench->ops.pre_work) {
        err = bench->ops.pre_work(worker);
        if (err) goto err_out;
    }

    /* wait for start signal */
    worker->ready = 1;
    if (worker->id) {
        while (!bench->start)
            nop_pause();
    }
    else {
        /* are all workers ready? */
        int i;
        for (i = 1; i < bench->ncpu; i++) {
            struct worker *w = &bench->workers[i];
            while (!w->ready)
                nop_pause();
        }
        /* make things more deterministic */
        sync();

        /* start performance profiling */
        if (bench->profile_start_cmd[0])
            system(bench->profile_start_cmd);

        /* ok, before running, set timer */
        if (signal(SIGALRM, sighandler) == SIG_ERR) {
            err = errno;
            goto err_out;
        }
        running_bench = bench;
        alarm(bench->duration);
        bench->start = 1;
        wmb();
    }

    /* start time */
    s_clk = rdtsc_beg();
    s_us = usec();

    /* main work */
    if (bench->ops.main_work) {
        err = bench->ops.main_work(worker);
        if (err && err != ENOSPC)
            goto err_out;
    }

    /* end time */
    e_clk = rdtsc_end();
    e_us = usec();

    /* stop performance profiling */
    if (!worker->id && bench->profile_stop_cmd[0])
        system(bench->profile_stop_cmd);

    /* post-work */
    if (bench->ops.post_work)
        err = bench->ops.post_work(worker);
err_out:
    worker->ret = err;
    worker->usecs = e_us - s_us;
    wmb();
    worker->clocks = e_clk - s_clk;
}

All processes first call bench->ops.pre_work() to prepare their test files:

  • After a child process has prepared its files, it sets ready = 1 in its own worker instance to signal that it is ready to start, then checks the start member of the bench instance and spins until it is allowed to begin.
  • After the main process has prepared its files, it checks whether ready is set to 1 in every child's worker instance, i.e. whether every child is ready; once they all are, it sets the start member of the bench instance to 1, signalling that the test may begin.

This way, all processes are guaranteed to finish their preparation and then start the measured phase together, which avoids interference from other processes still creating their test files while the measurement is running.

Around the call to bench->ops.main_work(), each process records timestamps; the difference between them is that process's real test time. If bench->ops.post_work() is not NULL, each process also calls it to do some cleanup.

Finally, when the test is done, each child process has recorded the number of operations it completed in its own worker instance, and the fxmark command calls report_bench() to aggregate the results:

void report_bench(struct bench *bench, FILE *out)
{
    static char *empty_str = "";
    uint64_t total_usecs = 0;
    double total_works = 0.0;
    double avg_secs;
    char *profile_name, *profile_data;
    int i, n_fg_cpu;

    /* if report_bench is overloaded */
    if (bench->ops.report_bench) {
        bench->ops.report_bench(bench, out);
        return;
    }

    /* default report_bench impl. */
    for (i = 0; i < bench->ncpu; ++i) {
        struct worker *w = &bench->workers[i];
        if (w->is_bg) continue;
        total_usecs += w->usecs;
        total_works += w->works;
    }
    n_fg_cpu = bench->ncpu - bench->nbg;
    avg_secs = (double)total_usecs/(double)n_fg_cpu/1000000.0;

    /* get profiling result */
    profile_name = profile_data = empty_str;
    if (bench->profile_stat_file[0]) {
        FILE *fp = fopen(bench->profile_stat_file, "r");
        size_t len;

        if (fp) {
            profile_name = profile_data = NULL;
            getline(&profile_name, &len, fp);
            getline(&profile_data, &len, fp);
            fclose(fp);
        }
    }

    fprintf(out, "# ncpu secs works works/sec %s\n", profile_name);
    fprintf(out, "%d %f %f %f %s\n",
            n_fg_cpu, avg_secs, total_works, total_works/avg_secs, profile_data);

    if (profile_name != empty_str)
        free(profile_name);
    if (profile_data != empty_str)
        free(profile_data);
}
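
The reported throughput is total_works / avg_secs, where avg_secs is the average run time of the foreground workers. As a purely hypothetical example: with 4 foreground workers that each run for about 30 seconds and together complete 1,200,000 operations, avg_secs ≈ 120 s / 4 = 30 s and works/sec ≈ 1,200,000 / 30 = 40,000.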


microbench

As analyzed above, every process calls worker_main() to run the test, and worker_main() invokes the three functions in bench->ops: pre_work, main_work, and post_work. Each microbench defines its own versions of these functions, so which functions ultimately run is decided by the microbench. The code below maps each microbench to the address of its operations, i.e. the address of a bench_operations structure:

static struct bench_desc bench_table[] = {
    {"MWCL",
     "inode allocation: each process creates files at its private directory",
     &n_inode_alloc_ops},
    {"DWAL",
     "block allocation: each process appends pages to a private file",
     &n_blk_alloc_ops},
    {"DWOL",
     "block write: each process overwrite a pages to a private file",
     &n_blk_wrt_ops},
    {"MWRM",
     "directory insert: each process moves files from its private directory to a common direcotry",
     &n_dir_ins_ops},
    {"DWSL",
     "journal commit: each process fsync a private file",
     &n_jnl_cmt_ops},
    {"DWOM",
     "mtime update: each process updates a private page of the shared file",
     &n_mtime_upt_ops},
    {"MWRL",
     "rename a file: each process rename a file in its private directory",
     &n_file_rename_ops},
    {"DRBL",
     "file read: each process read a block of its private file",
     &n_file_rd_ops},
    {"DRBL_bg",
     "file read with a background writer",
     &n_file_rd_ops},
    {"DRBM",
     "shared file read: each process reads its private region of the shared file",
     &n_shfile_rd_ops},
    {"DRBM_bg",
     "shared file read with a background writer",
     &n_shfile_rd_bg_ops},
    {"DRBH",
     "shared blk read: each process reads the same page of the shared file",
     &n_shblk_rd_ops},
    {"DRBH_bg",
     "shared blk read with a background writer",
     &n_shblk_rd_bg_ops},
    {"MRDL",
     "directory read: each process reads entries of its private directory",
     &n_dir_rd_ops},
    {"MRDL_bg",
     "directory read with a background writer",
     &n_dir_rd_bg_ops},
    {"MRDM",
     "shared directory read: each process reads entries of the shared directory",
     &n_shdir_rd_ops},
    {"MRDM_bg",
     "shared directory read with a background writer",
     &n_shdir_rd_bg_ops},
    {"MRPL",
     "path resolution for a private file",
     &n_priv_path_rsl_ops},
    {"MRPM",
     "path resolution: each process does stat() at random files in 8-level directories with 8-branching-out factor",
     &n_path_rsl_ops},
    {"MRPM_bg",
     "path resolution with a background writer",
     &n_path_rsl_bg_ops},
    {"MRPH",
     "path resolution at the same level directory",
     &n_spath_rsl_ops},
    {"MWCM",
     "each process creates files in their private directory",
     &u_file_cr_ops},
    {"MWUM",
     "each process deletes files in their private directory",
     &u_file_rm_ops},
    {"MWUL",
     "each process deletes files at the test root directory",
     &u_sh_file_rm_ops},
    {"DWTL",
     "each process truncates its private file at the test root directory",
     &u_file_tr_ops},
    {NULL, NULL, NULL},
};

Take MWCL as an example: its operations structure is n_inode_alloc_ops, which is defined in MWCL.c:

struct bench_operations n_inode_alloc_ops = {
    .pre_work  = pre_work,
    .main_work = main_work,
};

Below are pre_work() and main_work():

static void set_test_root(struct worker *worker, char *test_root)
{
    struct fx_opt *fx_opt = fx_opt_worker(worker);
    sprintf(test_root, "%s/%d", fx_opt->root, worker->id);
}

static int pre_work(struct worker *worker)
{
    char test_root[PATH_MAX];
    set_test_root(worker, test_root);
    return mkdir_p(test_root);
}

static int main_work(struct worker *worker)
{
    char test_root[PATH_MAX];
    struct bench *bench = worker->bench;
    uint64_t iter;
    int rc = 0;

    set_test_root(worker, test_root);
    for (iter = 0; !bench->stop; ++iter) {
        char file[PATH_MAX];
        int fd;
        /* create and close */
        snprintf(file, PATH_MAX, "%s/n_inode_alloc-%" PRIu64 ".dat",
                 test_root, iter);
        if ((fd = open(file, O_CREAT | O_RDWR, S_IRWXU)) == -1)
            goto err_out;
        close(fd);
    }
out:
    worker->works = (double)iter;
    return rc;
err_out:
    bench->stop = 1;
    rc = errno;
    goto out;
}

pre_work() creates a per-worker test directory, named after the worker id, under the root/ directory mentioned earlier; the worker id corresponds to a CPU id (0, 1, 2, ...). main_work() then loops, creating and closing files in that directory. MWCL therefore measures how fast each process can create files in its own private directory, i.e. the multicore scalability of the create operation.
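
For instance, during an MWCL run the worker pinned to CPU 2 would fill root/2/ with files named n_inode_alloc-0.dat, n_inode_alloc-1.dat, and so on, until the alarm set in worker_main() fires and bench->stop is set.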


Analyzing the test results

The results are recorded in a log file under the logs/ directory. The README.md explains how to plot from the log file, mainly using gnuplot; you can also feed the result data from the log file to gnuplot yourself. The figure below shows one MWCL result, with the configuration hdd, ext4, buffered I/O, tested on 1, 2, 4, 10, and 20 cores:
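
As report_bench() shows, each configuration's header line ("## media:fs:bench:ncore:dio") in the log is followed by a result line in the format ncpu secs works works/sec, so the scalability curve is simply works/sec plotted against ncpu across the tested core counts.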

The figure shows that file creation does not scale well: the number of completed operations does not grow linearly with the core count. Creating a file allocates an inode and must modify a global inode list protected by a lock; in addition, ext4's journaling path has its own lock contention. Together, these factors keep file creation from scaling.

FxMark also records CPU utilization during the test, and a CPU-utilization plot can be produced in the same way.

Finally, in the configuration mentioned at the beginning, you can specify PerfMon.LEVEL_LOW, PerfMon.LEVEL_PERF_RECORD, PerfMon.LEVEL_PERF_PROBE_SLEEP_LOCK, and so on. With these settings (and perf installed), you can use the perf tool to find hot code paths during the test for a more detailed analysis. The figure below is a flame graph from the 10-core run:
