Dissecting Virtual Memory (1): Where Strings Live in Virtual Memory, and the /proc Virtual Filesystem

Original: Hack The Virtual Memory

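Introduction

This is the first in a series of short articles/tutorials built around virtual memory. The goal is to learn some computer science fundamentals, but in a different and more practical way.

In this first part, we will use the Linux virtual filesystem /proc to find, and then modify, variables (in this example, an ASCII string) in the virtual memory of a running process, and learn some cool things along the way.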

Environment

All scripts and programs in this article were tested in the following environment:

Ubuntu 20.04 LTS-WSL
- Linux DESKTOP-4U2GD5V 4.19.104-microsoft-standard #1 SMP Wed Feb 19 06:37:35 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
gcc
- gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
python 3
- Python 3.8.5 (default, Jul 28 2020, 12:59:40)
- [GCC 9.3.0] on linux

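Prerequisites

To fully understand this article, you need the following background:

The basics of the C programming language
Some Python
The very basics of the Linux filesystem and the shell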

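Virtual Memory

In computing, virtual memory is a memory management technique implemented using both hardware and software. It maps the memory addresses used by a program (called virtual addresses) to physical addresses in computer memory. Main memory, as seen by a program, appears as a contiguous address space or a collection of contiguous segments. The operating system manages virtual address spaces and the assignment of real memory to virtual memory. Address translation hardware in the CPU (often called a memory management unit, or MMU) automatically translates virtual addresses to physical addresses. Software within the operating system may extend these capabilities to provide a virtual address space that exceeds the capacity of real memory, so a program can reference more memory than is physically present in the computer.

The main benefits of virtual memory are that applications no longer have to manage a shared memory space, that memory isolation improves security, and that paging makes it possible to use more memory than is physically available.

Optional: read more about virtual memory on Wikipedia. (Access may require a VPN in some regions.)

In the second chapter we will go into more detail and do some hands-on inspection of what is in virtual memory and where. For now, here are a few key points you need to know before reading on:

Each process has its own, independent virtual memory
The amount of virtual memory depends on your system architecture
Each operating system handles virtual memory differently, but on most modern operating systems the virtual memory of a process is laid out roughly as follows: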

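The high memory addresses hold the following (this is not a complete list; much more is stored there, but it is outside the scope of this article):

The command-line arguments and environment variables of the program
The stack, which grows downward (from high addresses toward low addresses). This may seem counterintuitive, but it is how the stack is implemented in virtual memory.

The low memory addresses hold the following:

Your executable (it is actually a bit more complicated than that, but this is enough to understand the rest of this article)
The heap, which grows upward (from low addresses toward high addresses). The heap is the portion of memory that is dynamically allocated (for example, memory allocated with malloc).

Also, keep in mind that virtual memory is not the same thing as RAM.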
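The layout described above can be checked against the memory map of a real process. The following is a minimal sketch, not part of the original article: it assumes the /proc/[pid]/maps text format that is introduced later in this article, the helper name `region_start` is hypothetical, and the sample lines are taken from the man page excerpt quoted further down.

```python
def region_start(maps_text, label):
    """Return the start address of the first mapping whose pathname is `label`.

    `maps_text` is the content of a /proc/[pid]/maps file. Returns None when
    no mapping carries that label.
    """
    for line in maps_text.splitlines():
        fields = line.split()
        if fields and fields[-1] == label:
            start, _end = fields[0].split("-")
            return int(start, 16)
    return None


# Two lines from the man page excerpt shown later in this article.
SAMPLE = """\
00e03000-00e24000 rw-p 00000000 00:00 0           [heap]
7fffb2c0d000-7fffb2c2e000 rw-p 00000000 00:00 0   [stack]
"""

heap = region_start(SAMPLE, "[heap]")
stack = region_start(SAMPLE, "[stack]")
print("heap  starts at 0x{:x}".format(heap))
print("stack starts at 0x{:x}".format(stack))
print("stack is above heap:", stack > heap)
```

On a Linux machine, passing the content of /proc/self/maps instead of `SAMPLE` should show the same ordering for the Python interpreter itself.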

C program

Let's start with this simple C program:

#include <stdlib.h>
#include <stdio.h>
#include <string.h>

/**
 * main - uses strdup to create a new string, and prints the
 * address of the new duplicated string
 *
 * Return: EXIT_FAILURE if malloc failed. Otherwise EXIT_SUCCESS
 */
int main(void)
{
    char *s;

    s = strdup("Holberton");
    if (s == NULL)
    {
        fprintf(stderr, "Can't allocate mem with malloc\n");
        return (EXIT_FAILURE);
    }
    printf("%p\n", (void *)s);
    return (EXIT_SUCCESS);
}

strdup

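Before going further, take a moment to think: how does strdup create a copy of the string "Holberton"? How could you confirm it?

strdup has to create a new string, so it first needs to reserve space for it. strdup is probably using malloc. A quick look at its man page (man 3 strdup) confirms this: memory for the new string is obtained with malloc(3).

Before going further, think about it: based on what we said earlier about virtual memory, where do you expect the duplicated string to be located? At a high or a low address?

Probably in the low addresses (in the heap). Let's compile and run our small program to test the hypothesis.

The duplicated string is located at address 0x56501eba62a0. Great, but is that a low or a high address?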

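How big is the virtual memory of a process

The size of a process's virtual memory depends on your system architecture. In this example I am using a 64-bit machine, so in theory the virtual memory of each process is 2^64 bytes. In theory, the highest address is 0xffffffffffffffff and the lowest is 0x0.

0x56501eba62a0 is small compared with 0xffffffffffffffff, so the duplicated string is probably in the low addresses. We will confirm this when we look at the proc filesystem.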
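The comparison can be made concrete with a few lines of arithmetic. This is only an illustrative sketch: the address is the one printed by the run described above, and the constant names are made up for the example.

```python
ADDRESS = 0x56501EBA62A0      # address printed by the program above
SPACE_SIZE = 2 ** 64          # theoretical size of a 64-bit address space
HIGHEST = SPACE_SIZE - 1      # 0xffffffffffffffff

print("highest address : 0x{:x}".format(HIGHEST))
print("bits needed     : {}".format(ADDRESS.bit_length()))
print("fraction of max : {:.6%}".format(ADDRESS / HIGHEST))
```

The address needs only 47 of the 64 available bits, so it sits in the lower part of the theoretical address space.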

The proc filesystem

From the manual (man proc):

PROC(5)                                Linux Programmer's Manual                                PROC(5)

NAME
       proc - process information pseudo-filesystem

DESCRIPTION
       The  proc  filesystem  is  a pseudo-filesystem which provides an interface to kernel data struc‐
       tures.  It is commonly mounted at /proc.  Typically, it is mounted automatically by the  system,
       but it can also be mounted manually using a command such as:

           mount -t proc proc /proc

       Most  of  the  files in the proc filesystem are read-only, but some files are writable, allowing
       kernel variables to be changed.

If you list the contents of the /proc directory you will see many files. We will focus on two of them:

/proc/[pid]/maps
/proc/[pid]/mem

maps

From the manual (man proc):

/proc/[pid]/maps
         A  file containing the currently mapped memory regions and their access permissions.  See mmap(2) for some further information about memory mappings.
     
         The format of the file is:
        address           perms offset  dev   inode       pathname
        00400000-00452000 r-xp 00000000 08:02 173521      /usr/bin/dbus-daemon
        00651000-00652000 r--p 00051000 08:02 173521      /usr/bin/dbus-daemon
        00652000-00655000 rw-p 00052000 08:02 173521      /usr/bin/dbus-daemon
        00e03000-00e24000 rw-p 00000000 00:00 0           [heap]
        00e24000-011f7000 rw-p 00000000 00:00 0           [heap]
        ...
        35b1800000-35b1820000 r-xp 00000000 08:02 135522  /usr/lib64/ld-2.15.so
        35b1a1f000-35b1a20000 r--p 0001f000 08:02 135522  /usr/lib64/ld-2.15.so
        35b1a20000-35b1a21000 rw-p 00020000 08:02 135522  /usr/lib64/ld-2.15.so
        35b1a21000-35b1a22000 rw-p 00000000 00:00 0
        35b1c00000-35b1dac000 r-xp 00000000 08:02 135870  /usr/lib64/libc-2.15.so
        35b1dac000-35b1fac000 ---p 001ac000 08:02 135870  /usr/lib64/libc-2.15.so
        35b1fac000-35b1fb0000 r--p 001ac000 08:02 135870  /usr/lib64/libc-2.15.so
        35b1fb0000-35b1fb2000 rw-p 001b0000 08:02 135870  /usr/lib64/libc-2.15.so
        ...
        7f2c6ff8c000-7f2c7078c000 rw-p 00000000 00:00 0   [stack:986]
        ...
        7fffb2c0d000-7fffb2c2e000 rw-p 00000000 00:00 0   [stack]
        7fffb2d48000-7fffb2d49000 r-xp 00000000 00:00 0   [vdso]
       
          The  address  field  is  the address space in the process that the mapping occupies.  The
          perms field is a set of permissions:

              r = read
              w = write
              x = execute
              s = shared
              p = private (copy on write)
        
          The offset field is the offset into the file/whatever; dev is the  device  (major:minor);
          inode is the inode on that device.  0 indicates that no inode is associated with the mem‐
          ory region, as would be the case with BSS (uninitialized data).
        
          The pathname field will usually be the file that is backing the mapping.  For ELF  files,
          you can easily coordinate with the offset field by looking at the Offset field in the ELF
          program headers (readelf -l).
        
          There are additional helpful pseudo-paths:
        
               [stack]
                      The initial process's (also known as the main thread's) stack.
        
               [stack:<tid>] (from Linux 3.4 to 4.4)
                      A thread's stack (where the <tid> is a thread ID).   It  corresponds  to  the
                      /proc/[pid]/task/[tid]/  path.   This  field  was removed in Linux 4.5, since
                      providing this information for a process with large numbers of threads is ex‐
                      pensive.
        
               [vdso] The virtual dynamically linked shared object.  See vdso(7).
        
               [heap] The process's heap.
        
          If  the  pathname  field  is blank, this is an anonymous mapping as obtained via mmap(2).
          There is no easy way to coordinate this back to a process's source, short of  running  it
          through gdb(1), strace(1), or similar.
        
          pathname is shown unescaped except for newline characters, which are replaced with an
          octal escape sequence.  As a result, it is not possible to determine whether the original
          pathname contained a newline character or the literal \012 character sequence.
        
          If  the  mapping is file-backed and the file has been deleted, the string " (deleted)" is
          appended to the pathname.  Note that this is ambiguous too.
        
          Under Linux 2.0, there is no field giving pathname.
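
To make the format above concrete, here is a minimal sketch of parsing one line of /proc/[pid]/maps into its fields. It is an illustration that assumes the line follows the layout documented above; the `Mapping` type and the `parse_maps_line` helper are hypothetical names, not part of any library.

```python
from typing import NamedTuple


class Mapping(NamedTuple):
    start: int      # first address of the region
    end: int        # first address past the region
    perms: str      # for example "rw-p"
    offset: int     # offset into the backing file
    dev: str        # device as major:minor
    inode: int      # 0 when no inode is associated
    pathname: str   # "" for anonymous mappings


def parse_maps_line(line):
    """Split one maps line into typed fields."""
    # At most 5 splits, so a pathname containing spaces stays in one piece.
    parts = line.split(None, 5)
    start, end = (int(x, 16) for x in parts[0].split("-"))
    pathname = parts[5].strip() if len(parts) > 5 else ""
    return Mapping(start, end, parts[1], int(parts[2], 16),
                   parts[3], int(parts[4]), pathname)


heap = parse_maps_line(
    "00e03000-00e24000 rw-p 00000000 00:00 0           [heap]\n")
print(heap.pathname, hex(heap.start), hex(heap.end), heap.perms)
```

Keeping the fields typed makes later checks, such as testing for read and write permission or computing the size of a region, straightforward.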

mem

From the manual (man proc):

/proc/[pid]/mem
         This file can be used to access the pages of a process's memory through open(2), read(2), and lseek(2).

         Permission to access this file is governed by a ptrace access mode PTRACE_MODE_ATTACH_FSCREDS check; see ptrace(2).

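This means we can reach the heap of a running program through the /proc/[pid]/mem file. If we can read from the heap, we can locate the string we want to modify. If we can write to the heap, we can replace that string with whatever we want.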
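
The read side of this idea can be tried without a second process, by letting a program read its own memory. The following is a minimal sketch under a few assumptions: it runs on Linux with /proc mounted, `read_own_memory` is a hypothetical helper name, and ctypes is used only to obtain a buffer with a known virtual address.

```python
import ctypes
import os


def read_own_memory(address, length):
    """Read `length` bytes at virtual address `address` of this process."""
    # Unbuffered, so that only the requested range is read.
    with open("/proc/self/mem", "rb", buffering=0) as mem_file:
        mem_file.seek(address)
        return mem_file.read(length)


# A NUL-terminated C buffer owned by this process, and its virtual address.
buffer = ctypes.create_string_buffer(b"Holberton")
address = ctypes.addressof(buffer)

if os.path.exists("/proc/self/mem"):    # Linux only
    print("buffer is at 0x{:x}".format(address))
    print(read_own_memory(address, len(buffer.value)))
```

Reading another process works the same way with its pid in place of `self`, subject to the ptrace permission check mentioned in the manual excerpt above.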

pid

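A process is an instance of a program and has a unique ID. This process ID (pid) is used by many functions and system calls to interact with and manipulate the process.

C program

We now have everything we need to write a script that finds a string in the heap of a running program and replaces it with another string (of the same length or shorter; otherwise the write may run past the end of the original string and corrupt the program. In the example below the heap contains only this one string, so a longer replacement does not crash the program, but it is still not recommended because the result is unpredictable).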
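
The length constraint above can be enforced before anything is written. The sketch below is one possible approach, not the method used by the script later in this article: the helper name `fit_replacement` is hypothetical, and padding with NUL bytes is an assumption that suits C strings, where the first NUL byte ends the string.

```python
def fit_replacement(search, replacement):
    """Return `replacement` as bytes, padded with NULs to len(search).

    Raises ValueError if the replacement is longer than the original, since
    writing it would spill past the end of the original string.
    """
    old = search.encode("ascii")
    new = replacement.encode("ascii")
    if len(new) > len(old):
        raise ValueError("replacement is longer than the original string")
    # The NUL padding terminates the shorter C string and clears the
    # leftover characters of the old one.
    return new.ljust(len(old), b"\x00")


print(fit_replacement("Holberton", "Fun"))
```

Without the padding, overwriting "Holberton" with "Fun" would leave "Funberton" in memory, because only the first three bytes change.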

We will use the following simple program, which loops forever and prints a fixed string, "Holberton".

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <pthread.h>

pid_t getpid(void);
/**
 * main - uses strdup to create a new string, loops forever-ever
 *
 * Return: EXIT_FAILURE if malloc failed. Otherwise never returns
 */
int main(void)
{
    char *s;
    unsigned long int i;

    s = strdup("Holberton");
    if (NULL == s)
    {
        fprintf(stderr, "Can't allocate mem with malloc\n");
        return EXIT_FAILURE;
    }
    i = 0;
    while (s)
    {
        printf("[%lu] [pid: %d] %s (%p)\n", i, getpid(), s, (void *)s);
        sleep(5);
        i++;
    }
    return EXIT_SUCCESS;
}

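Compile and run the program above. Every 5 seconds it prints its loop counter, its pid, the string and the string's address, and it keeps looping until you kill the process manually.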

/proc/pid/maps

[heap]

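The maps file gives us everything we need to locate the target string.

The heap:

Starts at address 0x564a10b4e000 in virtual memory and ends at 0x564a10b6f000
Has rw permissions, so it is readable and writable
Comparing with the output of the running program confirms that the duplicated string's address lies inside the heap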
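
The last check in the list above is a simple range test. Here is a small sketch using the heap bounds quoted above; the string address is a made-up example value, since the real one depends on the run, and `in_heap` is a hypothetical helper name.

```python
HEAP_START = 0x564A10B4E000   # start of [heap], from the maps file above
HEAP_END = 0x564A10B6F000     # first address past the heap


def in_heap(address):
    """True if `address` falls inside the [heap] mapping."""
    return HEAP_START <= address < HEAP_END


example_address = 0x564A10B4E2A0   # hypothetical value, for illustration only
print("heap size: {} KiB".format((HEAP_END - HEAP_START) // 1024))
print("in heap:", in_heap(example_address))
```

The end address listed in maps is exclusive, which is why the comparison uses `<` on the upper bound.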

Overwriting the string in virtual memory

Let's write a Python 3 script (any other language would work too):

#!/usr/bin/env python3
'''
Locates and replaces the first occurrence of a string in the heap
of a process

Usage: ./read_write_heap.py PID search_string replace_by_string
Where:
- PID is the pid of the target process
- search_string is the ASCII string you are looking to overwrite
- replace_by_string is the ASCII string you want to replace
  search_string with
'''
import sys


def print_usage_and_exit():
    print('Usage: {} pid search write'.format(sys.argv[0]))
    sys.exit(1)


# check usage
if len(sys.argv) != 4:
    print_usage_and_exit()

# get the pid from args
pid = int(sys.argv[1])
if pid <= 0:
    print_usage_and_exit()
search_string = str(sys.argv[2])
if search_string == '':
    print_usage_and_exit()
write_string = str(sys.argv[3])
if write_string == '':
    print_usage_and_exit()

# open the maps and mem files of the process
maps_filename = "/proc/{}/maps".format(pid)
print("[*] maps: {}".format(maps_filename))
mem_filename = "/proc/{}/mem".format(pid)
print("[*] mem: {}".format(mem_filename))

# try opening the maps file
try:
    print("[*] Opening {}".format(maps_filename))
    maps_file = open(maps_filename, 'r')
except IOError as e:
    print("[ERROR] Can not open file {}:".format(maps_filename))
    print("        I/O error({}): {}".format(e.errno, e.strerror))
    sys.exit(1)

for line in maps_file:
    sline = line.split(' ')
    # check if we found the heap
    if "[heap]" not in sline[-1]:
        continue
    print("[*] Found [heap]:")

    # parse line
    addr = sline[0]
    perm = sline[1]
    offset = sline[2]
    device = sline[3]
    inode = sline[4]
    pathname = sline[-1][:-1]
    print("\tpathname = {}".format(pathname))
    print("\taddresses = {}".format(addr))
    print("\tpermissions = {}".format(perm))
    print("\toffset = {}".format(offset))
    print("\tinode = {}".format(inode))

    # check if there is read and write permission
    if perm[0] != 'r' or perm[1] != 'w':
        print("[*] {} does not have read/write permission".format(pathname))
        maps_file.close()
        exit(0)

    # get start and end of the heap in the virtual memory
    addr = addr.split("-")
    if len(addr) != 2:  # never trust anyone, not even your OS :)
        print("[*] Wrong addr format")
        maps_file.close()
        exit(1)
    addr_start = int(addr[0], 16)
    addr_end = int(addr[1], 16)
    print("\tAddr start [{:x}] | end [{:x}]".format(addr_start, addr_end))

    # open and read mem
    try:
        mem_file = open(mem_filename, 'rb+')
    except IOError as e:
        print("[ERROR] Can not open file {}:".format(mem_filename))
        print("        I/O error({}): {}".format(e.errno, e.strerror))
        maps_file.close()
        exit(1)

    # read heap
    mem_file.seek(addr_start)
    heap = mem_file.read(addr_end - addr_start)

    # find string
    try:
        i = heap.index(bytes(search_string, "ASCII"))
    except ValueError:
        # raised when the string is not in the heap or is not valid ASCII
        print("Can't find '{}'".format(search_string))
        maps_file.close()
        mem_file.close()
        exit(0)
    print("[*] Found '{}' at {:x}".format(search_string, i))

    # write the new string
    print("[*] Writing '{}' at {:x}".format(write_string, addr_start + i))
    mem_file.seek(addr_start + i)
    mem_file.write(bytes(write_string, "ASCII"))

    # close files
    maps_file.close()
    mem_file.close()

    # there is only one heap in our example
    break