On Hooking Kernel System Calls

Please credit the source when reposting [By selinux.com] SELinux+
Code: https://github.com/qfong/mkm
A common way to intercept system calls is to patch entries in sys_call_table, for example:
Installing the hook:

ref_sys_read = (void *)sys_call_table[__NR_read];
sys_call_table[__NR_read] = (unsigned long *)hooks_sys_read;

Removing the hook:

sys_call_table[__NR_read] = (unsigned long *)ref_sys_read;
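
The save, replace, and restore pattern can be tried out in user space before touching a real kernel. The following is only a sketch: demo_table, handler_fn, and the DEMO_ names are hypothetical stand-ins for sys_call_table and its entries, not kernel symbols.

```c
/* Hypothetical stand-in for a system call handler signature. */
typedef long (*handler_fn)(long arg);

enum { DEMO_NR_READ = 0, DEMO_NR_MAX = 1 };

/* The "original" handler: returns its argument unchanged. */
static long real_read(long arg) { return arg; }

static handler_fn demo_table[DEMO_NR_MAX] = { real_read };
static handler_fn ref_read;   /* saved original, like ref_sys_read */
static long hook_calls;       /* counts how many times the hook ran */

/* The hook does its own work, then forwards to the saved original. */
static long hooked_read(long arg)
{
    hook_calls++;
    return ref_read(arg);
}

/* Save the current entry, then point the slot at the hook. */
static void install_hook(void)
{
    ref_read = demo_table[DEMO_NR_READ];
    demo_table[DEMO_NR_READ] = hooked_read;
}

/* Put the saved entry back. */
static void remove_hook(void)
{
    demo_table[DEMO_NR_READ] = ref_read;
}
```

Callers that go through demo_table[DEMO_NR_READ] see the same result before, during, and after hooking; only the side effect in hooked_read differs.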

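Installing a hook this way requires the kernel to export the sys_call_table symbol. In 2.6 kernels and some 2.4 kernels the symbol is no longer exported, but the table can still be found in memory.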

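The method below searches memory for a specific pattern: it walks candidate addresses and checks whether the slot at index __NR_close holds the address of sys_close, then treats the first match as the address of sys_call_table. This is a well-known trick on x86 systems, but it has always been regarded as unstable and not very reliable. The code looks like this: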

static unsigned long **acquire_sys_call_table(void)
{
    unsigned long int offset = PAGE_OFFSET;
    unsigned long **sct;

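    /* Stop before offset wraps around; ULLONG_MAX can never bound an unsigned long walk. */
    while (offset < ULONG_MAX - sizeof(void *))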
    {
        sct = (unsigned long **)offset;
        if(sct[__NR_close] == (unsigned long *) sys_close)
        {
            return sct;
        }
        offset += sizeof(void *);
    }
    printk(KERN_INFO "getting syscall table failed.\n");
    return NULL;
}
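
The search logic itself can be exercised on an ordinary array. This is a minimal user-space sketch, assuming a buffer of pointer-sized words; find_table_start and its parameters are illustrative names, with slot playing the role of __NR_close and known the role of the sys_close address.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Scan `words` (length `n`) for the first position `i` such that
 * words[i + slot] == known. Returns i, or -1 if there is no candidate.
 */
static ptrdiff_t find_table_start(const uintptr_t *words, size_t n,
                                  size_t slot, uintptr_t known)
{
    size_t i;

    if (n <= slot)
        return -1;
    for (i = 0; i + slot < n; i++)
        if (words[i + slot] == known)
            return (ptrdiff_t)i;
    return -1;
}
```

The first match wins, so any unrelated word that happens to equal the known address yields a wrong base. That is one reason the approach is considered unreliable.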

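Other approaches rely on the interrupt descriptor table. They read the address of the system call interrupt handler from the IDT entry for trap vector 0x80, then scan that handler's code for the sys_call_table pointer it uses to dispatch to the specific system call.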

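ARM Linux systems have no IDT (Interrupt Descriptor Table). Instead, the software interrupt (SWI) handler is reached through an LDR instruction at address 0x00000008, or at 0xFFFF0008 on high-vector implementations such as Android. Following that address leads to the handler, and from there the system call table can be located and loaded.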

unsigned long* find_sys_call_table()
{
  // Address of the software interrupt (swi) handler in high-vector ARM systems like Android
  const void *swi_addr = (const void *)0xFFFF0008;
 
  unsigned long* sys_call_table = NULL;
  unsigned long* ptr = NULL;
  unsigned long vector_swi_offset = 0;
  unsigned long vector_swi_instruction = 0;
  unsigned long *vector_swi_addr_ptr = NULL;
 
  // Get the load pc instruction from the swi
  memcpy(&vector_swi_instruction, swi_addr, sizeof(vector_swi_instruction));
  printk(KERN_INFO "-> vector_swi instruction %lx\n", vector_swi_instruction);
 
  // Read the offset from the swi address where the handler pointer lives
  vector_swi_offset = vector_swi_instruction & (unsigned long)0x00000fff;
  printk (KERN_INFO "-> vector_swi offset %lx\n", vector_swi_offset);
 
  // Get the pointer to the swi handler (offset is from the load instruction location + 2 words, due to ARM quirks) 
  vector_swi_addr_ptr = (unsigned long *)((unsigned long)swi_addr + vector_swi_offset + 8);
  printk (KERN_INFO "-> vector_swi_addr_ptr %p, value %lx\n", vector_swi_addr_ptr, *vector_swi_addr_ptr);
 
  /************
   * Starting at the beginning of the handler, search for the sys_call_table address load
   * This code is the result of the /arch/arm/kernel/entry-common.S file, starting at the line
   * ENTRY(vector_swi).  You'll see that there is always a zero_fp after saving register state
   * before any function begins.  It's a good "lighthouse" to search for to make sure 
   * you've entered the stack-frame-proper before looking for the sys_call_table pointer load 
   * instruction
   ************/
 
  ptr = (unsigned long *)*vector_swi_addr_ptr;
  bool foundFirstLighthouse = false;
  unsigned long sys_call_table_offset = 0;
 
  printk (KERN_INFO "-> ptr %p, init_mm.end_code %lx\n", ptr, init_mm.end_code);
 
  // Don't search past the end of the code block.  This is a dumb bound, I should be searching till I hit the
  // equivalent of a ret in ARM, but I didn't feel like figuring it out since ARM doesn't have a ret instruction
  while ((unsigned long)ptr < init_mm.end_code && sys_call_table == NULL)
  {
    // Find the zero_fp invocation (which translates into a load of zero into R11)
    if ((*ptr & (unsigned long)0xffff0fff) == 0xe3a00000)
    {
      foundFirstLighthouse = true;
      printk (KERN_INFO "-> found first lighthouse at %p, value %lx\n", ptr, *ptr);
    }
 
    // Search for the loading of the sys_call_table (in entry-common.S, given as "adr  tbl, sys_call_table",
    // which translates to an add and a ldr.  The add loads the sys_call_table pointer) 
    if (foundFirstLighthouse && ((*ptr & (unsigned long)0xffff0000) == 0xe28f0000))
    {
      // Get the offset from the add that will contain the actual pointer
      sys_call_table_offset = *ptr & (unsigned long)0x00000fff;
      printk (KERN_INFO "-> sys_call_table reference found at  %p, value %lx, offset %lx\n", ptr, *ptr,
              sys_call_table_offset);
 
      // Grab that damn pointer and get on with it!
      sys_call_table = (unsigned long *)((unsigned long)ptr + 8 + sys_call_table_offset);
      printk (KERN_INFO "-> sys_call_table found at %p\n", sys_call_table);
      break;
    }
 
    ptr++;
  }
 
  return sys_call_table;
}
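
The "+ 8" in the listing above comes from ARM mode, where the PC reads as the address of the current instruction plus 8. A small sketch of the first decoding step follows. It assumes the vector holds an ldr pc, [pc, #imm12] with a positive offset (U bit set); ldr_pc_literal_addr is an illustrative name and the instruction word used below is only a sample value.

```c
#include <stdint.h>

/*
 * For an ARM-mode "ldr pc, [pc, #imm12]" located at insn_addr, return the
 * address of the literal word that holds the handler pointer.
 * Assumes the offset is added (U bit set), as in the listing above.
 */
static uint32_t ldr_pc_literal_addr(uint32_t insn_addr, uint32_t insn)
{
    uint32_t imm12 = insn & 0x00000fffu;   /* low 12 bits: byte offset */

    return insn_addr + 8u + imm12;         /* PC reads as insn_addr + 8 */
}
```

For example, the sample word 0xe59ff410 placed at 0xffff0008 decodes to an offset of 0x410, so the handler pointer would be read from 0xffff0420.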

Below are some other ways to obtain the system call table:
On x86

// Phrack #58 0x07; sd, devik
unsigned long *find_sys_call_table ( void )
{
    char **p;
    unsigned long sct_off = 0;
    unsigned char code[255];

    asm("sidt %0":"=m" (idtr)); // sidt stores the IDT register to memory, giving the base address of the interrupt descriptor table
    memcpy(&idt, (void *)(idtr.base + 8 * 0x80), sizeof(idt));
    sct_off = (idt.off2 << 16) | idt.off1;
    memcpy(code, (void *)sct_off, sizeof(code));

    p = (char **)memmem(code, sizeof(code), "\xff\x14\x85", 3);

    if ( p )
        return *(unsigned long **)((char *)p + 3);
    else
        return NULL;
}
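
The listing above uses idtr and idt without declaring them, and memmem() is not a kernel function. The sketch below fills those gaps under stated assumptions: the struct layouts are the usual 32-bit x86 ones, the demo_ names and find_bytes are hypothetical, and __attribute__((packed)) assumes GCC or Clang. The byte pattern ff 14 85 is the encoding of call *disp32(,%eax,4), so the four bytes after it are the table address.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed layout of the value stored by sidt on 32-bit x86. */
struct demo_idtr {
    uint16_t limit;
    uint32_t base;
} __attribute__((packed));

/* Assumed layout of one 8-byte gate descriptor on 32-bit x86. */
struct demo_idt_gate {
    uint16_t off1;    /* handler offset, bits 0..15  */
    uint16_t sel;
    uint8_t  none;
    uint8_t  flags;
    uint16_t off2;    /* handler offset, bits 16..31 */
} __attribute__((packed));

/* Reassemble the 32-bit handler address from the two halves. */
static uint32_t gate_offset(const struct demo_idt_gate *g)
{
    return ((uint32_t)g->off2 << 16) | g->off1;
}

/* Portable byte-pattern search, standing in for memmem(). */
static const unsigned char *find_bytes(const unsigned char *hay, size_t hay_len,
                                       const unsigned char *needle, size_t n_len)
{
    size_t i;

    if (n_len == 0 || hay_len < n_len)
        return NULL;
    for (i = 0; i + n_len <= hay_len; i++)
        if (memcmp(hay + i, needle, n_len) == 0)
            return hay + i;
    return NULL;
}

/* Return the little-endian 32-bit operand after "ff 14 85", or 0 if absent. */
static uint32_t call_table_operand(const unsigned char *code, size_t len)
{
    static const unsigned char pat[3] = { 0xff, 0x14, 0x85 };
    const unsigned char *p = find_bytes(code, len, pat, sizeof(pat));

    if (!p || (size_t)(p - code) + 7 > len)
        return 0;
    return (uint32_t)p[3] | ((uint32_t)p[4] << 8) |
           ((uint32_t)p[5] << 16) | ((uint32_t)p[6] << 24);
}
```

Reading the operand byte by byte avoids an unaligned pointer dereference and checks that all four operand bytes actually lie inside the copied buffer.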

On x86_64

unsigned long *find_sys_call_table ( void )
{
    char **p;
    unsigned long sct_off = 0;
    unsigned char code[512];

    rdmsrl(MSR_LSTAR, sct_off); // read MSR_LSTAR to get the address of system_call
    memcpy(code, (void *)sct_off, sizeof(code));

    p = (char **)memmem(code, sizeof(code), "\xff\x14\xc5", 3);

    if ( p )
    {
        unsigned long *sct = *(unsigned long **)((char *)p + 3);

        // Stupid compiler doesn't want to do bitwise math on pointers
        sct = (unsigned long *)(((unsigned long)sct & 0xffffffff) | 0xffffffff00000000);
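        // The matched instruction carries only a 32-bit displacement, so the value must be
        // ORed with 0xffffffff00000000 to form the sys_call_table address; otherwise the machine may crash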
        return sct;
    }
    else
        return NULL;
}
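
The OR with 0xffffffff00000000 in the listing above has the same effect as sign-extending the 32-bit displacement, provided bit 31 of the displacement is set (which is the case for addresses in the top 2 GiB of the 64-bit address space). A minimal sketch comparing the two; sext32 and or_high_ones are illustrative names, and the conversion to int32_t assumes the usual two's-complement behaviour.

```c
#include <stdint.h>

/* Sign-extend a 32-bit displacement to a 64-bit address. */
static uint64_t sext32(uint32_t disp)
{
    return (uint64_t)(int64_t)(int32_t)disp;
}

/* The masking used in the listing above: keep the low 32 bits, force the upper 32 to ones. */
static uint64_t or_high_ones(uint64_t raw)
{
    return (raw & 0xffffffffull) | 0xffffffff00000000ull;
}
```

The two agree for a displacement such as 0x81801400, but differ when bit 31 is clear, so the OR is only a shortcut that relies on where the kernel is mapped.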

On ia32

// Obtain sys_call_table on amd64; pouik
unsigned long *find_ia32_sys_call_table ( void )
{
    char **p;
    unsigned long sct_off = 0;
    unsigned char code[512];

    asm("sidt %0":"=m" (idtr)); // store the IDT register to memory
    memcpy(&idt, (void *)(idtr.base + 16 * 0x80), sizeof(idt));
    sct_off = (idt.off2 << 16) | idt.off1;
    memcpy(code, (void *)sct_off, sizeof(code));

    p = (char **)memmem(code, sizeof(code), "\xff\x14\xc5", 3);

    if ( p )
    {
        unsigned long *sct = *(unsigned long **)((char *)p + 3);

        // Stupid compiler doesn't want to do bitwise math on pointers
        sct = (unsigned long *)(((unsigned long)sct & 0xffffffff) | 0xffffffff00000000);
        // Must be ORed with 0xffffffff00000000 to restore the upper 32 bits of the address; otherwise the machine may crash
        return sct;
    }
    else
        return NULL;
}

ARM

// Phrack #68 0x06; dong-hoon you
unsigned long *find_sys_call_table ( void )
{
	void *swi_addr = (long *)0xffff0008;
	unsigned long offset, *vector_swi_addr;

	offset = ((*(long *)swi_addr) & 0xfff) + 8;
	vector_swi_addr = *(unsigned long **)(swi_addr + offset);

	while ( vector_swi_addr++ )
		if( ((*(unsigned long *)vector_swi_addr) & 0xfffff000) == 0xe28f8000 )
        {
			offset = ((*(unsigned long *)vector_swi_addr) & 0xfff) + 8;
			/* offset is in bytes, so do the arithmetic on a byte pointer */
			return (unsigned long *)((char *)vector_swi_addr + offset);
		}

	return NULL;
}
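
Both ARM listings take the low 12 bits of the add instruction as a plain byte offset. In the ARM encoding those 12 bits are an 8-bit value plus a 4-bit rotate field, so the plain mask is only correct when the rotate field is zero. The following sketch decodes the immediate in full; the function names are illustrative and the instruction words in the example are sample values, not taken from a real kernel image.

```c
#include <stdint.h>

/* Rotate a 32-bit value right by r bits (r is reduced modulo 32). */
static uint32_t ror32(uint32_t v, unsigned r)
{
    r &= 31u;
    return r ? (v >> r) | (v << (32u - r)) : v;
}

/* Decode the 12-bit immediate of an ARM data-processing instruction. */
static uint32_t arm_imm12_value(uint32_t insn)
{
    uint32_t imm8 = insn & 0xffu;          /* bits 7..0: base value       */
    unsigned rot  = (insn >> 8) & 0xfu;    /* bits 11..8: rotate / 2      */

    return ror32(imm8, 2u * rot);
}

/* Target of "add rd, pc, #imm" at insn_addr (ARM mode: PC reads as addr + 8). */
static uint32_t adr_target(uint32_t insn_addr, uint32_t insn)
{
    return insn_addr + 8u + arm_imm12_value(insn);
}
```

For the sample word 0xe28f8080 the rotate field is zero and both methods give 0x80. For 0xe28f8c01 the full decode gives 0x100, while masking with 0xfff would give 0xc01.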

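On write protection
Because the kernel pages are marked read-only, trying to write to this memory region from a function produces a kernel oops. The protection is easy to get around: clearing the WP bit of the cr0 register disables write protection on that CPU. The Wikipedia article on control registers confirms this property: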

Bit  Name  Full name      Description
16   WP    Write protect  Determines whether the CPU can write to pages marked read-only
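
Since WP is bit 16, the mask is 1 << 16 = 0x00010000. The bit arithmetic can be checked in user space on a plain integer. This is only a sketch: DEMO_CR0_WP and the helper names are illustrative, and the CR0 value used in the example is an arbitrary sample, not read from hardware.

```c
/* Mask for bit 16 of CR0 (the WP bit). */
#define DEMO_CR0_WP (1UL << 16)

/* Return the value with WP cleared (writes to read-only pages allowed). */
static unsigned long cr0_clear_wp(unsigned long cr0)
{
    return cr0 & ~DEMO_CR0_WP;
}

/* Return the value with WP set (read-only pages enforced). */
static unsigned long cr0_set_wp(unsigned long cr0)
{
    return cr0 | DEMO_CR0_WP;
}

/* Non-zero when WP is set in the given value. */
static int cr0_wp_is_set(unsigned long cr0)
{
    return (cr0 & DEMO_CR0_WP) != 0;
}
```

With the sample value 0x80050033, clearing WP gives 0x80040033 and setting it again restores 0x80050033; no other bit changes.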
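The WP bit has to be cleared and set again at several points in the code, so it makes sense to abstract the operation. The code below comes from the PaX project (http://pax.grsecurity.net/), specifically the native_pax_open_kernel() and native_pax_close_kernel() routines. Extra care is taken to prevent a potential race condition caused by unlucky scheduling on SMP systems, as Dan Rosenberg explains in a blog post (http://vulnfactory.org/blog/2011/08/12/wp-safe-or-not/):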
As described in the Intel Manuals (Volume 3A, Section 2.5):

WP        Write Protect (bit 16 of CR0) - When set, inhibits supervisor-level proce-
          dures from writing into read-only pages; when clear, allows supervisor-level
          procedures to write into read-only pages (regardless of the U/S bit setting;
          see Section 4.1.3 and Section 4.6). This flag facilitates implementation of the
          copy-on-write method of creating a new process (forking) used by operating
          systems such as UNIX.

The code is as follows:

inline unsigned long disable_wp ( void )
{
    unsigned long cr0;

    preempt_disable(); // disable preemption so the task stays on this CPU while WP is cleared
    barrier();

    cr0 = read_cr0();
    write_cr0(cr0 & ~X86_CR0_WP);
    return cr0;
}

inline void restore_wp ( unsigned long cr0 )
{
    write_cr0(cr0);

    barrier();
    preempt_enable_no_resched();
}

The same can also be done with the following code:

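/*
 * Check whether the CR0 write-protect bit is set. If it is not, page write
 * protection is already disabled; otherwise clear bit 16 with a bitwise AND
 * and write the result back to CR0.
 */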
static void disable_page_protection(void)
{
    unsigned long value;
    asm volatile("mov %%cr0, %0" : "=r"(value));
    if(!(value & 0x00010000))
    {
        return ;
    }

    asm volatile("mov %0, %%cr0" : : "r"(value & ~0x00010000));
}

static void enable_page_protection(void)
{
    unsigned long value;
    asm volatile("mov %%cr0, %0" : "=r"(value));

    if((value & 0x00010000))
    {
        return ;
    }
    asm volatile("mov %0, %%cr0" : : "r"(value|0x00010000));
}
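
The two styles above differ in one respect: disable_wp()/restore_wp() put back whatever value was saved, while enable_page_protection() always turns WP on. The difference shows up when protected sections nest. The sketch below models this with an ordinary variable instead of the real register; sim_cr0 and the sim_ functions are hypothetical, and the initial value is an arbitrary sample.

```c
#define SIM_WP 0x00010000UL

/* Simulated control register; not the real CR0. */
static unsigned long sim_cr0 = 0x80050033UL;

/* Save-and-restore style, modelled on disable_wp(). */
static unsigned long sim_disable_wp(void)
{
    unsigned long saved = sim_cr0;

    sim_cr0 = saved & ~SIM_WP;
    return saved;
}

/* Counterpart modelled on restore_wp(): write back the saved value. */
static void sim_restore_wp(unsigned long saved)
{
    sim_cr0 = saved;
}

/* Unconditional style, modelled on enable_page_protection(). */
static void sim_force_wp_on(void)
{
    sim_cr0 |= SIM_WP;
}
```

With save-and-restore, closing an inner section leaves WP cleared for the outer one. The unconditional variant re-enables protection even if an outer section is still open, so callers of that style have to avoid nesting.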

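On some ARM systems the concept of a WP bit does not exist, and special care must be taken with the data and instruction caching that the ARM architecture introduces. Although data and instruction caches also exist on the x86 and x86_64 architectures, they did not pose an obstacle during development. On Android, the memory protection mechanism did not need to be taken into account.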

Implementation: https://github.com/qfong/mkm
Appendix: Dan Rosenberg's blog post
WP: Safe or Not?

During the course of kernel exploitation (or some other form of runtime kernel modification), it is frequently desirable to be able to modify the contents of read-only memory. On x86, a classic trick is to leverage the WP (write-protect) bit in the CR0 register.

As described in the Intel Manuals (Volume 3A, Section 2.5):

WP Write Protect (bit 16 of CR0) – When set, inhibits supervisor-level proce-
dures from writing into read-only pages; when clear, allows supervisor-level
procedures to write into read-only pages (regardless of the U/S bit setting;
see Section 4.1.3 and Section 4.6). This flag facilitates implementation of the
copy-on-write method of creating a new process (forking) used by operating
systems such as UNIX.
In an exploit where code execution has been achieved and the attacker wishes to, for example, install hooks in a read-only data structure, a simple solution is to toggle this bit to 0, perform the write, and toggle it back. This technique has been well-known for years, and is not only used in rootkits but also in commercial anti-virus products (is there a difference?).

In practice, this approach works nearly all of the time. But there are some caveats to be aware of when using this trick in exploit development.

Scheduling Race

On SMP systems, there is a scheduling race that must be dealt with. In extremely unlucky circumstances, it’s possible that the current thread disables the WP bit, is scheduled out at a precise moment, is re-scheduled onto a CPU that still has the WP bit enabled, and faults when attempting to perform a write to read-only memory. Even though I’ve never seen this happen in practice, it’s easy enough to contend with. If this is being done via some mechanism where you have the capability to compile against the current kernel (e.g. a module), one correct way of addressing this is to disable preemption of the current thread while performing writes to read-only pages, as the PaX project does:

static inline unsigned long native_pax_open_kernel(void)
{
    unsigned long cr0;

    preempt_disable();
    barrier();
    cr0 = read_cr0() ^ X86_CR0_WP;
    BUG_ON(unlikely(cr0 & X86_CR0_WP));
    write_cr0(cr0);
    return cr0 ^ X86_CR0_WP;
}

static inline unsigned long native_pax_close_kernel(void)
{
    unsigned long cr0;

    cr0 = read_cr0() ^ X86_CR0_WP;
    BUG_ON(unlikely(!(cr0 & X86_CR0_WP)));
    write_cr0(cr0);
    barrier();
    preempt_enable_no_resched();
    return cr0 ^ X86_CR0_WP;
}
These two functions, pax_open_kernel() and pax_close_kernel(), are used to modify structures such as the IDT when the PAX_KERNEXEC feature is enabled.

If you’re performing these modifications in an exploit, a simpler solution is to leverage the cli (clear interrupt flag) and sti (set interrupt flag) instructions to disable interrupts entirely during the course of the writes, which prevents re-scheduling as a side effect:

.macro disable_wp
cli
mov eax,cr0
and eax,0xfffeffff
mov cr0,eax
.endm

.macro enable_wp
mov eax,cr0
or eax,0x10000
mov cr0,eax
sti
.endm
Xen

The scheduling issue can easily be worked around, but twiz mentioned to me that there may be problems doing this on Xen. When not using HAP (hardware-assisted paging), Xen handles paging by creating shadow page tables, mapping the guest’s page tables read-only, and relying on WP to cause writes to guest page tables to trap and be handled properly by the hypervisor. As a result, CR0.WP was forcibly enabled in the past. This did not apply when using HAP, where the guest has always been able to freely access CR0.

However, in 2007, Xen added support for emulating the behavior of CR0.WP in order to support software that relies on WP modification (e.g. anti-virus). When the guest faults on a non-user write to a resident page while CR0.WP=0, the faulting instruction is then emulated to allow the write to succeed. This feature is limited by the completeness of the x86 emulator, but everything except the most esoteric instructions should be emulated properly. As a result, leveraging the WP bit to write to read-only memory on Xen should not pose any problems for exploit developers.

Thanks

Thanks to twiz for suggesting looking at Xen, and Keir Fraser for helpful information.

One Reply to “On Hooking Kernel System Calls”

  1. kvm_intel code is pretty much the same in all 2.6.38 releases as far as cr0.wp is concerned: it is forced to 1 unless the host is EPT capable, in which case the guest is allowed to change it at its leisure regardless of other CR0 bits (IIRC this was implemented to allow QNX to run as a guest OS, since it fiddles with cr0.wp during boot pretty much in the same way as KERNEXEC does).
