Saturday, March 20, 2010

page fault and fork


fork() and mmap() only set up the data structures that describe a process's virtual memory; they do not allocate physical pages. Only when the program actually touches the memory does a page fault allocate the physical page and establish the virtual-to-physical mapping. The page fault handler is do_page_fault().
do_page_fault() implements demand paging;
do_page_fault() together with fork() implements the COW (copy-on-write) mechanism;
do_page_fault() together with do_mmap() or brk() implements anonymous mappings, file mappings and shared memory mappings,
for example the shared libc, heap allocation, stack expansion, IPC and so on;
do_page_fault() also handles some kernel-mode page faults.

fork() only creates the necessary data structures for the child and does not give the child any physical pages; the child simply shares the parent's pages. Only when the child or the parent later performs a write does a page fault allocate a physical page and establish the mapping.

mmap() maps a file into the process address space, e.g. libc or the heap. It only builds vm_area_struct and the related data structures and allocates no physical pages; a page fault allocates the page and builds the mapping on first use.
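A small user-space sketch of this (my own illustration, not taken from the kernel sources): map an anonymous region, then watch the minor-fault counter reported by getrusage() grow only when the pages are first touched.

/* Demand paging made visible: pages of an anonymous mmap() are only
 * faulted in (minor faults) when they are first touched. */
#include <stdio.h>
#include <sys/mman.h>
#include <sys/resource.h>

static long minor_faults(void)
{
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_minflt;
}

int main(void)
{
    const size_t npages = 64, psize = 4096;
    char *p = mmap(NULL, npages * psize, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    long before = minor_faults();
    for (size_t i = 0; i < npages; i++)
        p[i * psize] = 1;              /* first touch -> do_page_fault() */
    long after = minor_faults();

    printf("minor faults while touching %zu pages: %ld\n", npages, after - before);
    return 0;
}

Right after mmap() returns, none of those 64 pages exist physically; it is the loop that makes do_page_fault() allocate them one by one.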

Here I only discuss how page fault cooperates with fork(), focusing on the details of the COW implementation.

First, a few conversions and flags:
pte_t pte: the high 20 bits of a pte are the pfn, the low 12 bits are flags.
struct page *mem_map: mem_map is the array holding the struct page entries.
So pte -> pfn -> page yields the page address:

unsigned long pfn = pte_pfn(pte);
struct page *page = pfn_to_page(pfn);
static inline unsigned long pte_pfn(pte_t pte)
{
    return (pte_val(pte) & PTE_PFN_MASK) >> PAGE_SHIFT;
}
#define pfn_to_page(pfn) (mem_map + ((pfn) - ARCH_PFN_OFFSET))   /* ARCH_PFN_OFFSET is 0 here */

Going the other way, the related functions are just as easy to find:

#define mk_pte(page, pgprot)   pfn_pte(page_to_pfn(page), (pgprot))
#define page_to_pfn(page)      ((unsigned long)((page) - mem_map) + ARCH_PFN_OFFSET)
static inline pte_t pfn_pte(unsigned long page_nr, pgprot_t pgprot)
{
    return __pte(((phys_addr_t)page_nr << PAGE_SHIFT) | massage_pgprot(pgprot));
}
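As a stand-alone illustration of the same arithmetic (constants hard-coded for non-PAE x86-32 with 4 KB pages; this is not kernel code):

/* pte <-> pfn arithmetic: the high 20 bits are the pfn, the low 12 bits are flags. */
#include <stdio.h>

#define PAGE_SHIFT   12
#define PTE_PFN_MASK 0xfffff000UL     /* high 20 bits on non-PAE x86-32 */

int main(void)
{
    unsigned long pte   = 0x1a2b3067UL;                        /* pfn 0x1a2b3, flags 0x067 */
    unsigned long pfn   = (pte & PTE_PFN_MASK) >> PAGE_SHIFT;  /* what pte_pfn() computes  */
    unsigned long flags = pte & ~PTE_PFN_MASK;
    unsigned long again = (pfn << PAGE_SHIFT) | flags;         /* what pfn_pte() computes  */

    printf("pfn=%#lx flags=%#lx pte=%#lx\n", pfn, flags, again);
    return 0;
}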

Be clear about the different kinds of flags and protections in memory management (see the code comments for their exact meaning):
pte: the low 12 bits are flags: _PAGE_PRESENT, _PAGE_DIRTY, _PAGE_ACCESSED, _PAGE_RW, ...
vm_area_struct->vm_flags: VM_READ, VM_WRITE, ... (derived from the PROT_READ/PROT_WRITE/PROT_EXEC/... bits passed to mmap()/mprotect())
vm_area_struct->vm_page_prot: the pgprot_t whose _PAGE_* bits go into the ptes of this vma
struct page: PG_dirty, PG_swapcache, ...

vma->vm_page_prot is assembled from vma->vm_flags, and the pte is then assembled from vma->vm_page_prot.
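Roughly the following two-step model (a user-space sketch with made-up MODEL_* constants, not the kernel's real protection_map bit values):

/* Toy model of vm_flags -> vm_page_prot -> pte assembly.
 * All constants below are illustrative, not the kernel's real bit values. */
#include <stdint.h>
#include <stdio.h>

#define MODEL_PAGE_PRESENT 0x001ULL
#define MODEL_PAGE_RW      0x002ULL
#define MODEL_PAGE_USER    0x004ULL

#define MODEL_VM_READ  0x1UL
#define MODEL_VM_WRITE 0x2UL

/* step 1: vm_flags -> vm_page_prot (the role played by vm_get_page_prot()) */
static uint64_t model_page_prot(unsigned long vm_flags)
{
    uint64_t prot = MODEL_PAGE_PRESENT | MODEL_PAGE_USER;
    if (vm_flags & MODEL_VM_WRITE)
        prot |= MODEL_PAGE_RW;
    return prot;
}

/* step 2: pfn + vm_page_prot -> pte (the role played by mk_pte()/pfn_pte()) */
static uint64_t model_mk_pte(uint64_t pfn, uint64_t prot)
{
    return (pfn << 12) | prot;
}

int main(void)
{
    uint64_t prot = model_page_prot(MODEL_VM_READ | MODEL_VM_WRITE);
    printf("pte = %#llx\n", (unsigned long long)model_mk_pte(0x1234, prot));
    return 0;
}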

do_page_fault():

Case 1. The page fault was taken while the CPU was in kernel mode accessing kernel space. This is caused by some other process having called vmalloc(), because vmalloc() modifies ptes (in init_mm). The fix is to copy the top-level entries (pgd, pud, pmd) for the faulting address from init_mm, the kernel master PGD, into the current page table. IA-32 without PAE uses a 2-level page table, so only the init_mm pmd level is copied; the lowest-level ptes are not copied. A process pgd has 1024 entries: the low 768 are private and differ from process to process, while the high 256 are fetched from the kernel master PGD on demand. This is one place where the sharing of kernel space among processes shows up.

Case 2. The page fault was taken while the CPU was in kernel mode accessing user space, e.g. in copy_from_user(). This is handled via search_exception_tables(): the fixup handlers are registered in a table beforehand, and when the fault happens the table is searched and the matching fixup code is executed.

Case 3. The page fault was raised while the CPU was in user mode; this is the case we mainly care about. Here we must distinguish real errors from legitimate page faults.

If the user process accessed an address in kernel space, that is an error.
For a legitimate page fault the address must not only lie inside a memory region (vma); the vma's vm_flags must also match the access type. For example, if a write caused the fault, then vma->vm_flags & VM_WRITE must be set.
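A quick way to see the access-type check fail from user space (my own sketch): a write into a PROT_READ-only mapping is not demand-paged, it is rejected with SIGSEGV.

/* Writing into a read-only vma fails the vm_flags/access-type check
 * (access_error() in do_page_fault()) and raises SIGSEGV. */
#include <setjmp.h>
#include <signal.h>
#include <stdio.h>
#include <sys/mman.h>

static sigjmp_buf env;

static void on_segv(int sig)
{
    (void)sig;
    siglongjmp(env, 1);
}

int main(void)
{
    char *p = mmap(NULL, 4096, PROT_READ,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    signal(SIGSEGV, on_segv);

    if (sigsetjmp(env, 1) == 0) {
        p[0] = 'x';                    /* write into a vma without VM_WRITE */
        puts("write succeeded (unexpected)");
    } else {
        puts("SIGSEGV: access type does not match vm_flags");
    }
    return 0;
}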

Once these two checks pass, whenever the physical page is not currently present (pte & _PAGE_PRESENT is false) we do demand paging: allocate a new physical page and fill it with the appropriate contents. "Not currently present" can mean the page never existed, or that it did exist but is no longer in memory and must be read back from disk.

If the physical page is already present and the access is a write: when the page does not belong to any process (vm_normal_page() finds no struct page behind the pte), just set _PAGE_RW in the pte and write directly; when the page is present, already belongs to at least one process and its pte flags are read-only (pte & _PAGE_RW is false), allocate a new page and copy the old page's contents into it. The latter case is the COW handling.
941 /*
942 * This routine handles page faults. It determines the address,
943 * and the problem, and then passes it off to one of the appropriate
944 * routines.
945 */
946 dotraplinkage void __kprobes
947 do_page_fault(struct pt_regs *regs, unsigned long error_code)
948 {
949 struct vm_area_struct *vma;
950 struct task_struct *tsk;
951 unsigned long address;
952 struct mm_struct *mm;
953 int write;
954 int fault;
955
956 tsk = current;
957 mm = tsk->mm;
958
959 /* Get the faulting address: */
960 address = read_cr2();
961
962 /*
963 * Detect and handle instructions that would cause a page fault for
964 * both a tracked kernel page and a userspace page.
965 */
966 if (kmemcheck_active(regs))
967 kmemcheck_hide(regs);
968 prefetchw(&mm->mmap_sem);
969
970 if (unlikely(kmmio_fault(regs, address)))
971 return;
972
973 /*
974 * We fault-in kernel-space virtual memory on-demand. The
975 * 'reference' page table is init_mm.pgd.
976 *
977 * NOTE! We MUST NOT take any locks for this case. We may
978 * be in an interrupt or a critical region, and should
979 * only copy the information from the master page table,
980 * nothing more.
981 *
982 * This verifies that the fault happens in kernel space
983 * (error_code & 4) == 0, and that the fault was not a
984 * protection error (error_code & 9) == 0.
985 */
986 if (unlikely(fault_in_kernel_space(address))) { // case 1
987 if (!(error_code & (PF_RSVD | PF_USER | PF_PROT))) {
988 if (vmalloc_fault(address) >= 0)
989 return;
990
991 if (kmemcheck_fault(regs, address, error_code))
992 return;
993 }
994
995 /* Can handle a stale RO->RW TLB: */
996 if (spurious_fault(error_code, address))
997 return;
998
999 /* kprobes don't want to hook the spurious faults: */
1000 if (notify_page_fault(regs))
1001 return;
1002 /*
1003 * Don't take the mm semaphore here. If we fixup a prefetch
1004 * fault we could otherwise deadlock:
1005 */
1006 bad_area_nosemaphore(regs, error_code, address);
1007
1008 return;
1009 }
1010
1011 /* kprobes don't want to hook the spurious faults: */
1012 if (unlikely(notify_page_fault(regs)))
1013 return;
1014 /*
1015 * It's safe to allow irq's after cr2 has been saved and the
1016 * vmalloc fault has been handled.
1017 *
1018 * User-mode registers count as a user access even for any
1019 * potential system fault or CPU buglet:
1020 */
1021 if (user_mode_vm(regs)) {
1022 local_irq_enable();
1023 error_code |= PF_USER;
1024 } else {
1025 if (regs->flags & X86_EFLAGS_IF)
1026 local_irq_enable();
1027 }
1028
1029 if (unlikely(error_code & PF_RSVD))
1030 pgtable_bad(regs, error_code, address);
1031
1032 perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, 0, regs, address);
1033
1034 /*
1035 * If we're in an interrupt, have no user context or are running
1036 * in an atomic region then we must not take the fault:
1037 */
1038 if (unlikely(in_atomic() || !mm)) {
1039 bad_area_nosemaphore(regs, error_code, address);
1040 return;
1041 }
1042
1043 /*
1044 * When running in the kernel we expect faults to occur only to
1045 * addresses in user space. All other faults represent errors in
1046 * the kernel and should generate an OOPS. Unfortunately, in the
1047 * case of an erroneous fault occurring in a code path which already
1048 * holds mmap_sem we will deadlock attempting to validate the fault
1049 * against the address space. Luckily the kernel only validly
1050 * references user space from well defined areas of code, which are
1051 * listed in the exceptions table.
1052 *
1053 * As the vast majority of faults will be valid we will only perform
1054 * the source reference check when there is a possibility of a
1055 * deadlock. Attempt to lock the address space, if we cannot we then
1056 * validate the source. If this is invalid we can skip the address
1057 * space check, thus avoiding the deadlock:
1058 */
1059 if (unlikely(!down_read_trylock(&mm->mmap_sem))) {
1060 if ((error_code & PF_USER) == 0 &&
1061 !search_exception_tables(regs->ip)) { // case 2
1062 bad_area_nosemaphore(regs, error_code, address);
1063 return;
1064 }
1065 down_read(&mm->mmap_sem);
1066 } else {
1067 /*
1068 * The above down_read_trylock() might have succeeded in
1069 * which case we'll have missed the might_sleep() from
1070 * down_read():
1071 */
1072 might_sleep();
1073 }
1074
1075 vma = find_vma(mm, address);
1076 if (unlikely(!vma)) {
1077 bad_area(regs, error_code, address);
1078 return;
1079 }
1080 if (likely(vma->vm_start <= address))
1081 goto good_area;
1082 if (unlikely(!(vma->vm_flags & VM_GROWSDOWN))) {
1083 bad_area(regs, error_code, address);
1084 return;
1085 }
1086 if (error_code & PF_USER) {
1087 /*
1088 * Accessing the stack below %sp is always a bug.
1089 * The large cushion allows instructions like enter
1090 * and pusha to work. ("enter $65535, $31" pushes
1091 * 32 pointers and then decrements %sp by 65535.)
1092 */
1093 if (unlikely(address + 65536 + 32 * sizeof(unsigned long) < regs->sp)) {
1094 bad_area(regs, error_code, address);
1095 return;
1096 }
1097 }
1098 if (unlikely(expand_stack(vma, address))) { // stack expand
1099 bad_area(regs, error_code, address);
1100 return;
1101 }
1102
1103 /*
1104 * Ok, we have a good vm_area for this memory access, so
1105 * we can handle it..
1106 */
1107 good_area:
1108 write = error_code & PF_WRITE;
1109
1110 if (unlikely(access_error(error_code, write, vma))) { // access error
1111 bad_area_access_error(regs, error_code, address);
1112 return;
1113 }
1114
1115 /*
1116 * If for any reason at all we couldn't handle the fault,
1117 * make sure we exit gracefully rather than endlessly redo
1118 * the fault:
1119 */
1120 fault = handle_mm_fault(mm, vma, address, write ? FAULT_FLAG_WRITE : 0); // case 3
1121
1122 if (unlikely(fault & VM_FAULT_ERROR)) {
1123 mm_fault_error(regs, error_code, address, fault);
1124 return;
1125 }
1126
1127 if (fault & VM_FAULT_MAJOR) {
1128 tsk->maj_flt++;
1129 perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, 0,
1130 regs, address);
1131 } else {
1132 tsk->min_flt++;
1133 perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, 0,
1134 regs, address);
1135 }
1136
1137 check_v8086_mode(regs, address, tsk);
1138
1139 up_read(&mm->mmap_sem);
1140 }


Case 1:
do_page_fault() -> vmalloc_fault():
239 /*
240 * 32-bit:
241 *
242 * Handle a fault on the vmalloc or module mapping area
243 */
244 static noinline __kprobes int vmalloc_fault(unsigned long address)
245 {
246 unsigned long pgd_paddr;
247 pmd_t *pmd_k;
248 pte_t *pte_k;
249
250 /* Make sure we are in vmalloc area: */
251 if (!(address >= VMALLOC_START && address < VMALLOC_END))
252 return -1;
253
254 /*
255 * Synchronize this task's top level page-table // top level
256 * with the 'reference' page table.
257 *
258 * Do _not_ use "current" here. We might be inside
259 * an interrupt in the middle of a task switch..
260 */
261 pgd_paddr = read_cr3();
262 pmd_k = vmalloc_sync_one(__va(pgd_paddr), address);
263 if (!pmd_k)
264 return -1;
265
266 pte_k = pte_offset_kernel(pmd_k, address);
267 if (!pte_present(*pte_k))
268 return -1;
269
270 return 0;
271 }

As the comment says, only the top-level page table is synchronized, not the last-level ptes, and vmalloc_sync_one() indeed does exactly that.
The last-level pte is located by pte_k = pte_offset_kernel(pmd_k, address); and only checked for presence.
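For reference, the index arithmetic that pgd_index() and pte_offset_kernel() boil down to on a non-PAE 2-level walk can be reproduced stand-alone (hard-coded shifts, purely illustrative):

/* 2-level (non-PAE) x86-32 page-walk indices: 10 + 10 + 12 bits. */
#include <stdio.h>

int main(void)
{
    unsigned long addr = 0xf8a12345UL;                  /* some vmalloc-area address */
    unsigned long pgd_index = addr >> 22;               /* top 10 bits  */
    unsigned long pte_index = (addr >> 12) & 0x3ff;     /* next 10 bits */
    unsigned long offset    = addr & 0xfff;             /* low 12 bits  */

    printf("pgd_index=%lu pte_index=%lu offset=%#lx\n", pgd_index, pte_index, offset);
    return 0;
}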

179 #ifdef CONFIG_X86_32
180 static inline pmd_t *vmalloc_sync_one(pgd_t *pgd, unsigned long address)
181 {
182 unsigned index = pgd_index(address);
183 pgd_t *pgd_k;
184 pud_t *pud, *pud_k;
185 pmd_t *pmd, *pmd_k;
186
187 pgd += index;
188 pgd_k = init_mm.pgd + index;
189
190 if (!pgd_present(*pgd_k))
191 return NULL;
192
193 /*
194 * set_pgd(pgd, *pgd_k); here would be useless on PAE
195 * and redundant with the set_pmd() on non-PAE. As would
196 * set_pud.
197 */
198 pud = pud_offset(pgd, address);
199 pud_k = pud_offset(pgd_k, address);
200 if (!pud_present(*pud_k))
201 return NULL;
202
203 pmd = pmd_offset(pud, address);
204 pmd_k = pmd_offset(pud_k, address);
205 if (!pmd_present(*pmd_k))
206 return NULL;
207
208 if (!pmd_present(*pmd))
209 set_pmd(pmd, *pmd_k);
210 else
211 BUG_ON(pmd_page(*pmd) != pmd_page(*pmd_k));
212
213 return pmd_k;
214 }

Now look at how vmalloc() and vfree() handle the ptes.
vmalloc() modifies the pud, pmd and pte entries in init_mm; vfree() removes them from init_mm again.

vmalloc() -> __vmalloc_node() -> __vmalloc_area_node() -> map_vm_area() -> vmap_page_range():

/*
* Set up page tables in kva (addr, end). The ptes shall have prot "prot", and
* will have pfns corresponding to the "pages" array.
*
* Ie. pte at addr+N*PAGE_SIZE shall point to pfn corresponding to pages[N]
*/
static int vmap_page_range(unsigned long addr, unsigned long end,
                           pgprot_t prot, struct page **pages)
{
    pgd_t *pgd;
    unsigned long next;
    int err = 0;
    int nr = 0;

    BUG_ON(addr >= end);
    pgd = pgd_offset_k(addr);
    do {
        next = pgd_addr_end(addr, end);
        err = vmap_pud_range(pgd, addr, next, prot, pages, &nr);
        if (err)
            break;
    } while (pgd++, addr = next, addr != end);
    flush_cache_vmap(addr, end);

    if (unlikely(err))
        return err;
    return nr;
}

vmap_page_range() -> vmap_pud_range() -> vmap_pmd_range() -> vmap_pte_range():
91 static int vmap_pte_range(pmd_t *pmd, unsigned long addr,
92 unsigned long end, pgprot_t prot, struct page **pages, int *nr)
93 {
94 pte_t *pte;
95
96 /*
97 * nr is a running index into the array which helps higher level
98 * callers keep track of where we're up to.
99 */
100
101 pte = pte_alloc_kernel(pmd, addr);
102 if (!pte)
103 return -ENOMEM;
104 do {
105 struct page *page = pages[*nr];
106
107 if (WARN_ON(!pte_none(*pte)))
108 return -EBUSY;
109 if (WARN_ON(!page))
110 return -ENOMEM;
111 set_pte_at(&init_mm, addr, pte, mk_pte(page, prot)); // set pte
112 (*nr)++;
113 } while (pte++, addr += PAGE_SIZE, addr != end);
114 return 0;
115 }

vfree() -> __vunmap() -> remove_vm_area() -> ... -> vunmap_page_range():

static void vunmap_page_range(unsigned long addr, unsigned long end)
{
    pgd_t *pgd;
    unsigned long next;

    BUG_ON(addr >= end);
    pgd = pgd_offset_k(addr);
    do {
        next = pgd_addr_end(addr, end);
        if (pgd_none_or_clear_bad(pgd))
            continue;
        vunmap_pud_range(pgd, addr, next);
    } while (pgd++, addr = next, addr != end);
}

vunmap_pud_range() -> vunmap_pmd_range() -> vunmap_pte_range():
37 static void vunmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end)
38 {
39 pte_t *pte;
40
41 pte = pte_offset_kernel(pmd, addr);
42 do {
43 pte_t ptent = ptep_get_and_clear(&init_mm, addr, pte); // clear pte
44 WARN_ON(!pte_none(ptent) && !pte_present(ptent));
45 } while (pte++, addr += PAGE_SIZE, addr != end);
46 }

Note that fork() also copies the top 256 pgd entries of the swapper process, which means that as soon as a forked user process switches into kernel mode it can use kernel-space addresses, because the page-table entries are already there. The sharing of the kernel page tables among all processes shows up here as well.

copy_mm()->dup_mm()->mm_init()->mm_alloc_pgd()->pgd_alloc()-> pgd_ctor(pgd):

68 static void pgd_ctor(pgd_t *pgd)
69 {
70 /* If the pgd points to a shared pagetable level (either the
71 ptes in non-PAE, or shared PMD in PAE), then just copy the
72 references from swapper_pg_dir. */
73 if (PAGETABLE_LEVELS == 2 ||
74 (PAGETABLE_LEVELS == 3 && SHARED_KERNEL_PMD) ||
75 PAGETABLE_LEVELS == 4) {
76 clone_pgd_range(pgd + KERNEL_PGD_BOUNDARY,
77 swapper_pg_dir + KERNEL_PGD_BOUNDARY,
78 KERNEL_PGD_PTRS);
79 paravirt_alloc_pmd_clone(__pa(pgd) >> PAGE_SHIFT,
80 __pa(swapper_pg_dir) >> PAGE_SHIFT,
81 KERNEL_PGD_BOUNDARY,
82 KERNEL_PGD_PTRS);
83 }
84
85 /* list required to sync kernel mapping updates */
86 if (!SHARED_KERNEL_PMD)
87 pgd_list_add(pgd);
88 }

#define KERNEL_PGD_BOUNDARY pgd_index(PAGE_OFFSET) // 768
#define KERNEL_PGD_PTRS (PTRS_PER_PGD - KERNEL_PGD_BOUNDARY) // 1024-768=256
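The two numbers in the comments come from a simple calculation (PAGE_OFFSET = 0xC0000000, PGDIR_SHIFT = 22 and PTRS_PER_PGD = 1024 on non-PAE x86-32); a stand-alone check:

/* pgd_index(PAGE_OFFSET) on non-PAE x86-32: 0xC0000000 >> 22 == 768. */
#include <stdio.h>

int main(void)
{
    unsigned long page_offset = 0xC0000000UL;
    unsigned long boundary = page_offset >> 22;             /* KERNEL_PGD_BOUNDARY */
    printf("KERNEL_PGD_BOUNDARY = %lu\n", boundary);        /* 768 */
    printf("KERNEL_PGD_PTRS     = %lu\n", 1024 - boundary); /* 256 */
    return 0;
}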

Case 2: the comments in do_page_fault() already explain it, so I won't go into it.
Case 3: do_page_fault() -> handle_mm_fault() -> handle_pte_fault():
/*
* These routines also need to handle stuff like marking pages dirty
* and/or accessed for architectures that don't do it in hardware (most
* RISC architectures). The early dirtying is also good on the i386.
*
* There is also a hook called "update_mmu_cache()" that architectures
* with external mmu caches can use to update those (ie the Sparc or
* PowerPC hashed page tables that act as extended TLBs).
*
* We enter with non-exclusive mmap_sem (to exclude vma changes,
* but allow concurrent faults), and pte mapped but not yet locked.
* We return with mmap_sem still held, but pte unmapped and unlocked.
*/
static inline int handle_pte_fault(struct mm_struct *mm, struct vm_area_struct *vma,
                                   unsigned long address, pte_t *pte, pmd_t *pmd, unsigned int flags)
{
    pte_t entry;
    spinlock_t *ptl;

    entry = *pte;
    if (!pte_present(entry)) {          // demand paging
        if (pte_none(entry)) {
            if (vma->vm_ops) {
                if (likely(vma->vm_ops->fault))
                    return do_linear_fault(mm, vma, address, pte, pmd, flags, entry);
            }
            return do_anonymous_page(mm, vma, address, pte, pmd, flags);
        }
        if (pte_file(entry))
            return do_nonlinear_fault(mm, vma, address, pte, pmd, flags, entry);
        return do_swap_page(mm, vma, address, pte, pmd, flags, entry);
    }

    ptl = pte_lockptr(mm, pmd);
    spin_lock(ptl);
    if (unlikely(!pte_same(*pte, entry)))
        goto unlock;
    if (flags & FAULT_FLAG_WRITE) {
        if (!pte_write(entry))
            return do_wp_page(mm, vma, address, pte, pmd, ptl, entry);  // COW handling
        entry = pte_mkdirty(entry);
    }
    entry = pte_mkyoung(entry);
    if (ptep_set_access_flags(vma, address, pte, entry, flags & FAULT_FLAG_WRITE)) {
        update_mmu_cache(vma, address, entry);
    } else {
        /*
         * This is needed only for protection faults but the arch code
         * is not yet telling us if this is a protection fault or not.
         * This still avoids useless tlb flushes for .text page faults
         * with threads.
         */
        if (flags & FAULT_FLAG_WRITE)
            flush_tlb_page(vma, address);
    }
unlock:
    pte_unmap_unlock(pte, ptl);
    return 0;
}

As you can see, demand paging itself splits into several sub-cases; I'll leave those aside for now and look at COW first.
fork() copies the pte to the child but does not copy the page behind it; it only marks the pte read-only in both parent and child. If the child later wants to write, that triggers a page fault and the COW handling allocates the real physical page. Here is the relevant part of fork():

do_fork() -> copy_process() -> copy_mm() -> dup_mm() -> dup_mmap() -> copy_page_range() ->
copy_pud_range() -> copy_pmd_range() -> copy_pte_range() -> copy_one_pte():

569 /*
570 * copy one vm_area from one task to the other. Assumes the page tables
571 * already present in the new task to be cleared in the whole range
572 * covered by this vma.
573 */
574
575 static inline unsigned long
576 copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
577 pte_t *dst_pte, pte_t *src_pte, struct vm_area_struct *vma,
578 unsigned long addr, int *rss)
579 {
......

614 /*
615 * If it's a COW mapping, write protect it both
616 * in the parent and the child
617 */
618 if (is_cow_mapping(vm_flags)) { // COW
619 ptep_set_wrprotect(src_mm, addr, src_pte);
620 pte = pte_wrprotect(pte);
621 }
622
623 /*
624 * If it's a shared mapping, mark it clean in
625 * the child
626 */
627 if (vm_flags & VM_SHARED)
628 pte = pte_mkclean(pte);
629 pte = pte_mkold(pte);
630
631 page = vm_normal_page(vma, addr, pte);
632 if (page) {
633 get_page(page); // COW keypoint: page->_count++
634 page_dup_rmap(page);
635 rss[PageAnon(page)]++;
636 }
637
638 out_set_pte:
639 set_pte_at(dst_mm, addr, dst_pte, pte);
640 return 0;
641 }

static inline int is_cow_mapping(unsigned int flags)
{
    return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
}

static inline pte_t pte_wrprotect(pte_t pte)
{
    return pte_clear_flags(pte, _PAGE_RW);
}

pte_wrprotect() removes the _PAGE_RW permission from the pte, making it read-only.
vm_normal_page() goes from the pte to the struct page; get_page() just increments the reference count.
Note that a COW mapping must not have VM_SHARED set; the VM_SHARED flag is handled by a different path.
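The effect of that condition is easy to see from user space (my own demo, not from the kernel sources): after fork(), a child's write to a MAP_PRIVATE page goes through COW and stays invisible to the parent, while a write to a MAP_SHARED page does not.

/* is_cow_mapping() seen from user space: MAP_PRIVATE memory is COWed
 * across fork(), MAP_SHARED memory is not. */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    char *priv = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    char *shrd = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                      MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    strcpy(priv, "parent");
    strcpy(shrd, "parent");

    if (fork() == 0) {                 /* the child writes both pages */
        strcpy(priv, "child");         /* read-only pte -> do_wp_page() -> COW copy */
        strcpy(shrd, "child");         /* shared mapping: the parent sees this */
        _exit(0);
    }
    wait(NULL);
    printf("private: %s, shared: %s\n", priv, shrd);   /* prints "parent", "child" */
    return 0;
}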

Now let's see how page fault and fork() cooperate to implement the COW mechanism; the code comments even use "cow" as a verb:

1985 /*
1986 * This routine handles present pages, when users try to write
1987 * to a shared page. It is done by copying the page to a new address
1988 * and decrementing the shared-page counter for the old page.
1989 *
1990 * Note that this routine assumes that the protection checks have been
1991 * done by the caller (the low-level page fault routine in most cases).
1992 * Thus we can safely just mark it writable once we've done any necessary
1993 * COW.
1994 *
1995 * We also mark the page dirty at this point even though the page will
1996 * change only once the write actually happens. This avoids a few races,
1997 * and potentially makes it more efficient.
1998 *
1999 * We enter with non-exclusive mmap_sem (to exclude vma changes,
2000 * but allow concurrent faults), with pte both mapped and locked.
2001 * We return with mmap_sem still held, but pte unmapped and unlocked.
2002 */
2003 static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
2004 unsigned long address, pte_t *page_table, pmd_t *pmd,
2005 spinlock_t *ptl, pte_t orig_pte)
2006 {
2007 struct page *old_page, *new_page;
2008 pte_t entry;
2009 int reuse = 0, ret = 0;
2010 int page_mkwrite = 0;
2011 struct page *dirty_page = NULL;
2012
2013 old_page = vm_normal_page(vma, address, orig_pte);
2014 if (!old_page) {
2015 /*
2016 * VM_MIXEDMAP !pfn_valid() case
2017 *
2018 * We should not cow pages in a shared writeable mapping.
2019 * Just mark the pages writable as we can't do any dirty
2020 * accounting on raw pfn maps.
2021 */
2022 if ((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
2023 (VM_WRITE|VM_SHARED))
2024 goto reuse;
2025 goto gotten;
2026 }
2027
2028 /*
2029 * Take out anonymous pages first, anonymous shared vmas are
2030 * not dirty accountable.
2031 */
2032 if (PageAnon(old_page) && !PageKsm(old_page)) {
2033 if (!trylock_page(old_page)) {
2034 page_cache_get(old_page);
2035 pte_unmap_unlock(page_table, ptl);
2036 lock_page(old_page);
2037 page_table = pte_offset_map_lock(mm, pmd, address,
2038 &ptl);
2039 if (!pte_same(*page_table, orig_pte)) {
2040 unlock_page(old_page);
2041 page_cache_release(old_page);
2042 goto unlock;
2043 }
2044 page_cache_release(old_page);
2045 }
2046 reuse = reuse_swap_page(old_page);
2047 unlock_page(old_page);
2048 } else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
2049 (VM_WRITE|VM_SHARED))) {
2050 /*
2051 * Only catch write-faults on shared writable pages,
2052 * read-only shared pages can get COWed by
2053 * get_user_pages(.write=1, .force=1).
2054 */
2055 if (vma->vm_ops && vma->vm_ops->page_mkwrite) {
2056 struct vm_fault vmf;
2057 int tmp;
2058
2059 vmf.virtual_address = (void __user *)(address &
2060 PAGE_MASK);
2061 vmf.pgoff = old_page->index;
2062 vmf.flags = FAULT_FLAG_WRITE|FAULT_FLAG_MKWRITE;
2063 vmf.page = old_page;
2064
2065 /*
2066 * Notify the address space that the page is about to
2067 * become writable so that it can prohibit this or wait
2068 * for the page to get into an appropriate state.
2069 *
2070 * We do this without the lock held, so that it can
2071 * sleep if it needs to.
2072 */
2073 page_cache_get(old_page);
2074 pte_unmap_unlock(page_table, ptl);
2075
2076 tmp = vma->vm_ops->page_mkwrite(vma, &vmf);
2077 if (unlikely(tmp &
2078 (VM_FAULT_ERROR | VM_FAULT_NOPAGE))) {
2079 ret = tmp;
2080 goto unwritable_page;
2081 }
2082 if (unlikely(!(tmp & VM_FAULT_LOCKED))) {
2083 lock_page(old_page);
2084 if (!old_page->mapping) {
2085 ret = 0; /* retry the fault */
2086 unlock_page(old_page);
2087 goto unwritable_page;
2088 }
2089 } else
2090 VM_BUG_ON(!PageLocked(old_page));
2091
2092 /*
2093 * Since we dropped the lock we need to revalidate
2094 * the PTE as someone else may have changed it. If
2095 * they did, we just return, as we can count on the
2096 * MMU to tell us if they didn't also make it writable.
2097 */
2098 page_table = pte_offset_map_lock(mm, pmd, address,
2099 &ptl);
2100 if (!pte_same(*page_table, orig_pte)) {
2101 unlock_page(old_page);
2102 page_cache_release(old_page);
2103 goto unlock;
2104 }
2105
2106 page_mkwrite = 1;
2107 }
2108 dirty_page = old_page;
2109 get_page(dirty_page);
2110 reuse = 1;
2111 }
2112
2113 if (reuse) {
2114 reuse:
2115 flush_cache_page(vma, address, pte_pfn(orig_pte));
2116 entry = pte_mkyoung(orig_pte);
2117 entry = maybe_mkwrite(pte_mkdirty(entry), vma);
2118 if (ptep_set_access_flags(vma, address, page_table, entry,1))
2119 update_mmu_cache(vma, address, entry);
2120 ret |= VM_FAULT_WRITE;
2121 goto unlock;
2122 }
2123
2124 /*
2125 * Ok, we need to copy. Oh, well.. // start COW handling
2126 */
2127 page_cache_get(old_page); // old_page->_count++
2128 gotten:
2129 pte_unmap_unlock(page_table, ptl);
2130
2131 if (unlikely(anon_vma_prepare(vma)))
2132 goto oom;
2133
2134 if (is_zero_pfn(pte_pfn(orig_pte))) {
2135 new_page = alloc_zeroed_user_highpage_movable(vma, address);
2136 if (!new_page)
2137 goto oom;
2138 } else {
2139 new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
2140 if (!new_page)
2141 goto oom;
2142 cow_user_page(new_page, old_page, address, vma); // copy old_page contents to new_page
2143 }
2144 __SetPageUptodate(new_page);
2145
2146 /*
2147 * Don't let another task, with possibly unlocked vma,
2148 * keep the mlocked page.
2149 */
2150 if ((vma->vm_flags & VM_LOCKED) && old_page) {
2151 lock_page(old_page); /* for LRU manipulation */
2152 clear_page_mlock(old_page);
2153 unlock_page(old_page);
2154 }
2155
2156 if (mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))
2157 goto oom_free_new;
2158
2159 /*
2160 * Re-check the pte - we dropped the lock
2161 */
2162 page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
2163 if (likely(pte_same(*page_table, orig_pte))) {
2164 if (old_page) {
2165 if (!PageAnon(old_page)) {
2166 dec_mm_counter(mm, file_rss);
2167 inc_mm_counter(mm, anon_rss);
2168 }
2169 } else
2170 inc_mm_counter(mm, anon_rss);
2171 flush_cache_page(vma, address, pte_pfn(orig_pte));
2172 entry = mk_pte(new_page, vma->vm_page_prot); // create pte
2173 entry = maybe_mkwrite(pte_mkdirty(entry), vma);
2174 /*
2175 * Clear the pte entry and flush it first, before updating the
2176 * pte with the new entry. This will avoid a race condition
2177 * seen in the presence of one thread doing SMC and another
2178 * thread doing COW.
2179 */
2180 ptep_clear_flush(vma, address, page_table);
2181 page_add_new_anon_rmap(new_page, vma, address);
2182 /*
2183 * We call the notify macro here because, when using secondary
2184 * mmu page tables (such as kvm shadow page tables), we want the
2185 * new page to be mapped directly into the secondary page table.
2186 */
2187 set_pte_at_notify(mm, address, page_table, entry);
2188 update_mmu_cache(vma, address, entry);
2189 if (old_page) {
2190 /*
2191 * Only after switching the pte to the new page may
2192 * we remove the mapcount here. Otherwise another
2193 * process may come and find the rmap count decremented
2194 * before the pte is switched to the new page, and
2195 * "reuse" the old page writing into it while our pte
2196 * here still points into it and can be read by other
2197 * threads.
2198 *
2199 * The critical issue is to order this
2200 * page_remove_rmap with the ptp_clear_flush above.
2201 * Those stores are ordered by (if nothing else,)
2202 * the barrier present in the atomic_add_negative
2203 * in page_remove_rmap.
2204 *
2205 * Then the TLB flush in ptep_clear_flush ensures that
2206 * no process can access the old page before the
2207 * decremented mapcount is visible. And the old page
2208 * cannot be reused until after the decremented
2209 * mapcount is visible. So transitively, TLBs to
2210 * old page will be flushed before it can be reused.
2211 */
2212 page_remove_rmap(old_page);
2213 }
2214
2215 /* Free the old page.. */
2216 new_page = old_page; // attention!
2217 ret |= VM_FAULT_WRITE;
2218 } else
2219 mem_cgroup_uncharge_page(new_page);
2220
2221 if (new_page)
2222 page_cache_release(new_page);
2223 if (old_page)
2224 page_cache_release(old_page);
2225 unlock:
2226 pte_unmap_unlock(page_table, ptl);
2227 if (dirty_page) {
2228 /*
2229 * Yes, Virginia, this is actually required to prevent a race
2230 * with clear_page_dirty_for_io() from clearing the page dirty
2231 * bit after it clear all dirty ptes, but before a racing
2232 * do_wp_page installs a dirty pte.
2233 *
2234 * do_no_page is protected similarly.
2235 */
2236 if (!page_mkwrite) {
2237 wait_on_page_locked(dirty_page);
2238 set_page_dirty_balance(dirty_page, page_mkwrite);
2239 }
2240 put_page(dirty_page);
2241 if (page_mkwrite) {
2242 struct address_space *mapping = dirty_page->mapping;
2243
2244 set_page_dirty(dirty_page);
2245 unlock_page(dirty_page);
2246 page_cache_release(dirty_page);
2247 if (mapping) {
2248 /*
2249 * Some device drivers do not set page.mapping
2250 * but still dirty their pages
2251 */
2252 balance_dirty_pages_ratelimited(mapping);
2253 }
2254 }
2255
2256 /* file_update_time outside page_lock */
2257 if (vma->vm_file)
2258 file_update_time(vma->vm_file);
2259 }
2260 return ret;
2261 oom_free_new:
2262 page_cache_release(new_page);
2263 oom:
2264 if (old_page) {
2265 if (page_mkwrite) {
2266 unlock_page(old_page);
2267 page_cache_release(old_page);
2268 }
2269 page_cache_release(old_page);
2270 }
2271 return VM_FAULT_OOM;
2272
2273 unwritable_page:
2274 page_cache_release(old_page);
2275 return ret;
2276 }

#define page_cache_get(page) get_page(page)
#define page_cache_release(page) put_page(page)

As the comment says: a read-only shared page is handled by COW, whereas a page fault on a writable shared page is handled in a different way.

Setting aside the page-cache-related parts and skipping over PageAnon, go straight to the COW handling:
First page_cache_get(old_page) is executed.
If the old page is the zero page, it is enough to allocate a new page that is zero-initialized directly, which avoids polluting the cache.
Otherwise, after allocating a new page, cow_user_page() copies the old page's contents into it.

Then the pte corresponding to the new page is assembled. When execution reaches the lines below, keep in mind the earlier assignment at line 2216: new_page = old_page;
2221 if (new_page)
2222 page_cache_release(new_page);
2223 if (old_page)
2224 page_cache_release(old_page);

So old_page->_count is actually decremented twice here, while the matching page_cache_get(old_page) was executed only once. Even so, _count cannot go negative, because fork() has already incremented it: 633 get_page(page);

In other words, every fork() increments page->_count once, so after fork() the page carries one extra reference per child. When a child then performs a write, the resulting page fault is handled as COW: the child gets a new physical page with the contents of the parent's page copied in, and the child's extra reference on the old page is dropped. Once every child has gone through its COW copy, only one user of the old page remains; a later write to the old page then no longer needs a new physical page and can simply write to it directly (the reuse path).
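The child's COW fault itself can also be observed directly (again a user-space sketch of mine): the child's first write to an inherited MAP_PRIVATE page shows up as an extra minor fault.

/* Counting the COW fault: the child's first write to an inherited page
 * bumps its minor-fault counter. */
#include <stdio.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    p[0] = 1;                          /* the parent faults the page in now */

    if (fork() == 0) {
        struct rusage ru;
        long before;

        getrusage(RUSAGE_SELF, &ru);
        before = ru.ru_minflt;

        p[0] = 2;                      /* write-protected pte -> do_wp_page() */

        getrusage(RUSAGE_SELF, &ru);
        printf("child COW minor faults: %ld\n", ru.ru_minflt - before);  /* >= 1 */
        _exit(0);
    }
    wait(NULL);
    return 0;
}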

After the _count decrement, the old page's permissions are left alone: its pte stays read-only so that a later write can still go through the COW handling.
The _count field is what resolves the race conditions here; the key to COW is exactly this reference-count handling.
Not allocating physical memory for the child at fork() time saves memory and time, but it makes the _count handling tricky and the code hard to read.
For instance, if the only goal were to cancel out the get_page() done in fork(), a single page_cache_release() here would be enough;
taking the other cases into account as well, the code does one page_cache_get() and two page_cache_release() calls.
So the reference count is at the heart of COW: COW reduces memory use and speeds things up, but at the cost of more complex code.

In this path dirty_page == NULL, so the function simply finishes here.
