Saturday, March 20, 2010

page fault and fork


fork() and mmap() only set up the data structures that describe a process's virtual memory; they do not allocate physical pages. Only when the program actually touches the memory does a page fault allocate the physical page and establish the virtual-to-physical mapping. The page fault handler is do_page_fault().
do_page_fault() implements demand paging;
do_page_fault() together with fork() implements the COW (copy-on-write) mechanism;
do_page_fault() together with do_mmap() or brk() implements anonymous mappings, file mappings and shared memory mappings,
for example the shared libc, heap allocation, stack expansion, IPC and so on;
do_page_fault() also handles some kernel-mode page faults.

fork() only creates the necessary data structures for the child and does not give the child any physical pages; the child simply shares the parent's pages. Only when the child or the parent later performs a write does a page fault allocate a physical page and establish the mapping.

mmap() maps a file into the process address space, e.g. libc or the heap. It only builds vm_area_struct and the related data structures and allocates no physical pages; a page fault allocates the page and builds the mapping on first use.
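A small user-space sketch of this (my own illustration, not taken from the kernel sources): map an anonymous region, then watch the minor-fault counter reported by getrusage() grow only when the pages are first touched.

/* Demand paging made visible: pages of an anonymous mmap() are only
 * faulted in (minor faults) when they are first touched. */
#include <stdio.h>
#include <sys/mman.h>
#include <sys/resource.h>

static long minor_faults(void)
{
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_minflt;
}

int main(void)
{
    const size_t npages = 64, psize = 4096;
    char *p = mmap(NULL, npages * psize, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    long before = minor_faults();
    for (size_t i = 0; i < npages; i++)
        p[i * psize] = 1;              /* first touch -> do_page_fault() */
    long after = minor_faults();

    printf("minor faults while touching %zu pages: %ld\n", npages, after - before);
    return 0;
}

Right after mmap() returns, none of those 64 pages exist physically; it is the loop that makes do_page_fault() allocate them one by one.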

Here I only discuss how page fault cooperates with fork(), focusing on the details of the COW implementation.

First, a few conversions and flags:
pte_t pte: the high 20 bits of a pte are the pfn, the low 12 bits are flags.
struct page *mem_map: mem_map is the array holding the struct page entries.
So pte -> pfn -> page yields the page address:

unsigned long pfn = pte_pfn(pte);
struct page *page = pfn_to_page(pfn);
static inline unsigned long pte_pfn(pte_t pte)
{
    return (pte_val(pte) & PTE_PFN_MASK) >> PAGE_SHIFT;
}
#define pfn_to_page(pfn) (mem_map + ((pfn) - ARCH_PFN_OFFSET))   /* ARCH_PFN_OFFSET is 0 here */

Going the other way, the related functions are just as easy to find:

#define mk_pte(page, pgprot)   pfn_pte(page_to_pfn(page), (pgprot))
#define page_to_pfn(page)      ((unsigned long)((page) - mem_map) + ARCH_PFN_OFFSET)
static inline pte_t pfn_pte(unsigned long page_nr, pgprot_t pgprot)
{
    return __pte(((phys_addr_t)page_nr << PAGE_SHIFT) | massage_pgprot(pgprot));
}
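As a stand-alone illustration of the same arithmetic (constants hard-coded for non-PAE x86-32 with 4 KB pages; this is not kernel code):

/* pte <-> pfn arithmetic: the high 20 bits are the pfn, the low 12 bits are flags. */
#include <stdio.h>

#define PAGE_SHIFT   12
#define PTE_PFN_MASK 0xfffff000UL     /* high 20 bits on non-PAE x86-32 */

int main(void)
{
    unsigned long pte   = 0x1a2b3067UL;                        /* pfn 0x1a2b3, flags 0x067 */
    unsigned long pfn   = (pte & PTE_PFN_MASK) >> PAGE_SHIFT;  /* what pte_pfn() computes  */
    unsigned long flags = pte & ~PTE_PFN_MASK;
    unsigned long again = (pfn << PAGE_SHIFT) | flags;         /* what pfn_pte() computes  */

    printf("pfn=%#lx flags=%#lx pte=%#lx\n", pfn, flags, again);
    return 0;
}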

Be clear about the different kinds of flags and protections in memory management (see the code comments for their exact meaning):
pte: the low 12 bits are flags: _PAGE_PRESENT, _PAGE_DIRTY, _PAGE_ACCESSED, _PAGE_RW, ...
vm_area_struct->vm_flags: VM_READ, VM_WRITE, ... (derived from the PROT_READ/PROT_WRITE/PROT_EXEC/... bits passed to mmap()/mprotect())
vm_area_struct->vm_page_prot: the pgprot_t whose _PAGE_* bits go into the ptes of this vma
struct page: PG_dirty, PG_swapcache, ...

vma->vm_page_prot is assembled from vma->vm_flags, and the pte is then assembled from vma->vm_page_prot.
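Roughly the following two-step model (a user-space sketch with made-up MODEL_* constants, not the kernel's real protection_map bit values):

/* Toy model of vm_flags -> vm_page_prot -> pte assembly.
 * All constants below are illustrative, not the kernel's real bit values. */
#include <stdint.h>
#include <stdio.h>

#define MODEL_PAGE_PRESENT 0x001ULL
#define MODEL_PAGE_RW      0x002ULL
#define MODEL_PAGE_USER    0x004ULL

#define MODEL_VM_READ  0x1UL
#define MODEL_VM_WRITE 0x2UL

/* step 1: vm_flags -> vm_page_prot (the role played by vm_get_page_prot()) */
static uint64_t model_page_prot(unsigned long vm_flags)
{
    uint64_t prot = MODEL_PAGE_PRESENT | MODEL_PAGE_USER;
    if (vm_flags & MODEL_VM_WRITE)
        prot |= MODEL_PAGE_RW;
    return prot;
}

/* step 2: pfn + vm_page_prot -> pte (the role played by mk_pte()/pfn_pte()) */
static uint64_t model_mk_pte(uint64_t pfn, uint64_t prot)
{
    return (pfn << 12) | prot;
}

int main(void)
{
    uint64_t prot = model_page_prot(MODEL_VM_READ | MODEL_VM_WRITE);
    printf("pte = %#llx\n", (unsigned long long)model_mk_pte(0x1234, prot));
    return 0;
}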

do_page_fault():

Case 1. The page fault was taken while the CPU was in kernel mode accessing kernel space. This is caused by some other process having called vmalloc(), because vmalloc() modifies ptes (in init_mm). The fix is to copy the top-level entries (pgd, pud, pmd) for the faulting address from init_mm, the kernel master PGD, into the current page table. IA-32 without PAE uses a 2-level page table, so only the init_mm pmd level is copied; the lowest-level ptes are not copied. A process pgd has 1024 entries: the low 768 are private and differ from process to process, while the high 256 are fetched from the kernel master PGD on demand. This is one place where the sharing of kernel space among processes shows up.

Case 2. The page fault was taken while the CPU was in kernel mode accessing user space, e.g. in copy_from_user(). This is handled via search_exception_tables(): the fixup handlers are registered in a table beforehand, and when the fault happens the table is searched and the matching fixup code is executed.

Case 3. The page fault was raised while the CPU was in user mode; this is the case we mainly care about. Here we must distinguish real errors from legitimate page faults.

If the user process accessed an address in kernel space, that is an error.
For a legitimate page fault the address must not only lie inside a memory region (vma); the vma's vm_flags must also match the access type. For example, if a write caused the fault, then vma->vm_flags & VM_WRITE must be set.
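A quick way to see the access-type check fail from user space (my own sketch): a write into a PROT_READ-only mapping is not demand-paged, it is rejected with SIGSEGV.

/* Writing into a read-only vma fails the vm_flags/access-type check
 * (access_error() in do_page_fault()) and raises SIGSEGV. */
#include <setjmp.h>
#include <signal.h>
#include <stdio.h>
#include <sys/mman.h>

static sigjmp_buf env;

static void on_segv(int sig)
{
    (void)sig;
    siglongjmp(env, 1);
}

int main(void)
{
    char *p = mmap(NULL, 4096, PROT_READ,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    signal(SIGSEGV, on_segv);

    if (sigsetjmp(env, 1) == 0) {
        p[0] = 'x';                    /* write into a vma without VM_WRITE */
        puts("write succeeded (unexpected)");
    } else {
        puts("SIGSEGV: access type does not match vm_flags");
    }
    return 0;
}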

Once these two checks pass, whenever the physical page is not currently present (pte & _PAGE_PRESENT is false) we do demand paging: allocate a new physical page and fill it with the appropriate contents. "Not currently present" can mean the page never existed, or that it did exist but is no longer in memory and must be read back from disk.

If the physical page is already present and the access is a write: when the page does not belong to any process (vm_normal_page() finds no struct page behind the pte), just set _PAGE_RW in the pte and write directly; when the page is present, already belongs to at least one process and its pte flags are read-only (pte & _PAGE_RW is false), allocate a new page and copy the old page's contents into it. The latter case is the COW handling.
941 /*
942 * This routine handles page faults. It determines the address,
943 * and the problem, and then passes it off to one of the appropriate
944 * routines.
945 */
946 dotraplinkage void __kprobes
947 do_page_fault(struct pt_regs *regs, unsigned long error_code)
948 {
949 struct vm_area_struct *vma;
950 struct task_struct *tsk;
951 unsigned long address;
952 struct mm_struct *mm;
953 int write;
954 int fault;
955
956 tsk = current;
957 mm = tsk->mm;
958
959 /* Get the faulting address: */
960 address = read_cr2();
961
962 /*
963 * Detect and handle instructions that would cause a page fault for
964 * both a tracked kernel page and a userspace page.
965 */
966 if (kmemcheck_active(regs))
967 kmemcheck_hide(regs);
968 prefetchw(&mm->mmap_sem);
969
970 if (unlikely(kmmio_fault(regs, address)))
971 return;
972
973 /*
974 * We fault-in kernel-space virtual memory on-demand. The
975 * 'reference' page table is init_mm.pgd.
976 *
977 * NOTE! We MUST NOT take any locks for this case. We may
978 * be in an interrupt or a critical region, and should
979 * only copy the information from the master page table,
980 * nothing more.
981 *
982 * This verifies that the fault happens in kernel space
983 * (error_code & 4) == 0, and that the fault was not a
984 * protection error (error_code & 9) == 0.
985 */
986 if (unlikely(fault_in_kernel_space(address))) { // case 1
987 if (!(error_code & (PF_RSVD | PF_USER | PF_PROT))) {
988 if (vmalloc_fault(address) >= 0)
989 return;
990
991 if (kmemcheck_fault(regs, address, error_code))
992 return;
993 }
994
995 /* Can handle a stale RO->RW TLB: */
996 if (spurious_fault(error_code, address))
997 return;
998
999 /* kprobes don't want to hook the spurious faults: */
1000 if (notify_page_fault(regs))
1001 return;
1002 /*
1003 * Don't take the mm semaphore here. If we fixup a prefetch
1004 * fault we could otherwise deadlock:
1005 */
1006 bad_area_nosemaphore(regs, error_code, address);
1007
1008 return;
1009 }
1010
1011 /* kprobes don't want to hook the spurious faults: */
1012 if (unlikely(notify_page_fault(regs)))
1013 return;
1014 /*
1015 * It's safe to allow irq's after cr2 has been saved and the
1016 * vmalloc fault has been handled.
1017 *
1018 * User-mode registers count as a user access even for any
1019 * potential system fault or CPU buglet:
1020 */
1021 if (user_mode_vm(regs)) {
1022 local_irq_enable();
1023 error_code |= PF_USER;
1024 } else {
1025 if (regs->flags & X86_EFLAGS_IF)
1026 local_irq_enable();
1027 }
1028
1029 if (unlikely(error_code & PF_RSVD))
1030 pgtable_bad(regs, error_code, address);
1031
1032 perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, 0, regs, address);
1033
1034 /*
1035 * If we're in an interrupt, have no user context or are running
1036 * in an atomic region then we must not take the fault:
1037 */
1038 if (unlikely(in_atomic() || !mm)) {
1039 bad_area_nosemaphore(regs, error_code, address);
1040 return;
1041 }
1042
1043 /*
1044 * When running in the kernel we expect faults to occur only to
1045 * addresses in user space. All other faults represent errors in
1046 * the kernel and should generate an OOPS. Unfortunately, in the
1047 * case of an erroneous fault occurring in a code path which already
1048 * holds mmap_sem we will deadlock attempting to validate the fault
1049 * against the address space. Luckily the kernel only validly
1050 * references user space from well defined areas of code, which are
1051 * listed in the exceptions table.
1052 *
1053 * As the vast majority of faults will be valid we will only perform
1054 * the source reference check when there is a possibility of a
1055 * deadlock. Attempt to lock the address space, if we cannot we then
1056 * validate the source. If this is invalid we can skip the address
1057 * space check, thus avoiding the deadlock:
1058 */
1059 if (unlikely(!down_read_trylock(&mm->mmap_sem))) {
1060 if ((error_code & PF_USER) == 0 &&
1061 !search_exception_tables(regs->ip)) { // case 2
1062 bad_area_nosemaphore(regs, error_code, address);
1063 return;
1064 }
1065 down_read(&mm->mmap_sem);
1066 } else {
1067 /*
1068 * The above down_read_trylock() might have succeeded in
1069 * which case we'll have missed the might_sleep() from
1070 * down_read():
1071 */
1072 might_sleep();
1073 }
1074
1075 vma = find_vma(mm, address);
1076 if (unlikely(!vma)) {
1077 bad_area(regs, error_code, address);
1078 return;
1079 }
1080 if (likely(vma->vm_start <= address))
1081 goto good_area;
1082 if (unlikely(!(vma->vm_flags & VM_GROWSDOWN))) {
1083 bad_area(regs, error_code, address);
1084 return;
1085 }
1086 if (error_code & PF_USER) {
1087 /*
1088 * Accessing the stack below %sp is always a bug.
1089 * The large cushion allows instructions like enter
1090 * and pusha to work. ("enter $65535, $31" pushes
1091 * 32 pointers and then decrements %sp by 65535.)
1092 */
1093 if (unlikely(address + 65536 + 32 * sizeof(unsigned long) < regs->sp)) {
1094 bad_area(regs, error_code, address);
1095 return;
1096 }
1097 }
1098 if (unlikely(expand_stack(vma, address))) { // stack expand
1099 bad_area(regs, error_code, address);
1100 return;
1101 }
1102
1103 /*
1104 * Ok, we have a good vm_area for this memory access, so
1105 * we can handle it..
1106 */
1107 good_area:
1108 write = error_code & PF_WRITE;
1109
1110 if (unlikely(access_error(error_code, write, vma))) { // access error
1111 bad_area_access_error(regs, error_code, address);
1112 return;
1113 }
1114
1115 /*
1116 * If for any reason at all we couldn't handle the fault,
1117 * make sure we exit gracefully rather than endlessly redo
1118 * the fault:
1119 */
1120 fault = handle_mm_fault(mm, vma, address, write ? FAULT_FLAG_WRITE : 0); // case 3
1121
1122 if (unlikely(fault & VM_FAULT_ERROR)) {
1123 mm_fault_error(regs, error_code, address, fault);
1124 return;
1125 }
1126
1127 if (fault & VM_FAULT_MAJOR) {
1128 tsk->maj_flt++;
1129 perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, 0,
1130 regs, address);
1131 } else {
1132 tsk->min_flt++;
1133 perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, 0,
1134 regs, address);
1135 }
1136
1137 check_v8086_mode(regs, address, tsk);
1138
1139 up_read(&mm->mmap_sem);
1140 }


Case 1:
do_page_fault() -> vmalloc_fault():
239 /*
240 * 32-bit:
241 *
242 * Handle a fault on the vmalloc or module mapping area
243 */
244 static noinline __kprobes int vmalloc_fault(unsigned long address)
245 {
246 unsigned long pgd_paddr;
247 pmd_t *pmd_k;
248 pte_t *pte_k;
249
250 /* Make sure we are in vmalloc area: */
251 if (!(address >= VMALLOC_START && address < VMALLOC_END))
252 return -1;
253
254 /*
255 * Synchronize this task's top level page-table // top level
256 * with the 'reference' page table.
257 *
258 * Do _not_ use "current" here. We might be inside
259 * an interrupt in the middle of a task switch..
260 */
261 pgd_paddr = read_cr3();
262 pmd_k = vmalloc_sync_one(__va(pgd_paddr), address);
263 if (!pmd_k)
264 return -1;
265
266 pte_k = pte_offset_kernel(pmd_k, address);
267 if (!pte_present(*pte_k))
268 return -1;
269
270 return 0;
271 }

As the comment says, only the top-level page table is synchronized, not the last-level ptes, and vmalloc_sync_one() indeed does exactly that.
The last-level pte is located by pte_k = pte_offset_kernel(pmd_k, address); and only checked for presence.
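For reference, the index arithmetic that pgd_index() and pte_offset_kernel() boil down to on a non-PAE 2-level walk can be reproduced stand-alone (hard-coded shifts, purely illustrative):

/* 2-level (non-PAE) x86-32 page-walk indices: 10 + 10 + 12 bits. */
#include <stdio.h>

int main(void)
{
    unsigned long addr = 0xf8a12345UL;                  /* some vmalloc-area address */
    unsigned long pgd_index = addr >> 22;               /* top 10 bits  */
    unsigned long pte_index = (addr >> 12) & 0x3ff;     /* next 10 bits */
    unsigned long offset    = addr & 0xfff;             /* low 12 bits  */

    printf("pgd_index=%lu pte_index=%lu offset=%#lx\n", pgd_index, pte_index, offset);
    return 0;
}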

179 #ifdef CONFIG_X86_32
180 static inline pmd_t *vmalloc_sync_one(pgd_t *pgd, unsigned long address)
181 {
182 unsigned index = pgd_index(address);
183 pgd_t *pgd_k;
184 pud_t *pud, *pud_k;
185 pmd_t *pmd, *pmd_k;
186
187 pgd += index;
188 pgd_k = init_mm.pgd + index;
189
190 if (!pgd_present(*pgd_k))
191 return NULL;
192
193 /*
194 * set_pgd(pgd, *pgd_k); here would be useless on PAE
195 * and redundant with the set_pmd() on non-PAE. As would
196 * set_pud.
197 */
198 pud = pud_offset(pgd, address);
199 pud_k = pud_offset(pgd_k, address);
200 if (!pud_present(*pud_k))
201 return NULL;
202
203 pmd = pmd_offset(pud, address);
204 pmd_k = pmd_offset(pud_k, address);
205 if (!pmd_present(*pmd_k))
206 return NULL;
207
208 if (!pmd_present(*pmd))
209 set_pmd(pmd, *pmd_k);
210 else
211 BUG_ON(pmd_page(*pmd) != pmd_page(*pmd_k));
212
213 return pmd_k;
214 }

Now look at how vmalloc() and vfree() handle the ptes.
vmalloc() modifies the pud, pmd and pte entries in init_mm; vfree() removes them from init_mm again.

vmalloc() -> __vmalloc_node() -> __vmalloc_area_node() -> map_vm_area() -> vmap_page_range():

/*
* Set up page tables in kva (addr, end). The ptes shall have prot "prot", and
* will have pfns corresponding to the "pages" array.
*
* Ie. pte at addr+N*PAGE_SIZE shall point to pfn corresponding to pages[N]
*/
static int vmap_page_range(unsigned long addr, unsigned long end,
                           pgprot_t prot, struct page **pages)
{
    pgd_t *pgd;
    unsigned long next;
    int err = 0;
    int nr = 0;

    BUG_ON(addr >= end);
    pgd = pgd_offset_k(addr);
    do {
        next = pgd_addr_end(addr, end);
        err = vmap_pud_range(pgd, addr, next, prot, pages, &nr);
        if (err)
            break;
    } while (pgd++, addr = next, addr != end);
    flush_cache_vmap(addr, end);

    if (unlikely(err))
        return err;
    return nr;
}

vmap_page_range() -> vmap_pud_range() -> vmap_pmd_range() -> vmap_pte_range():
91 static int vmap_pte_range(pmd_t *pmd, unsigned long addr,
92 unsigned long end, pgprot_t prot, struct page **pages, int *nr)
93 {
94 pte_t *pte;
95
96 /*
97 * nr is a running index into the array which helps higher level
98 * callers keep track of where we're up to.
99 */
100
101 pte = pte_alloc_kernel(pmd, addr);
102 if (!pte)
103 return -ENOMEM;
104 do {
105 struct page *page = pages[*nr];
106
107 if (WARN_ON(!pte_none(*pte)))
108 return -EBUSY;
109 if (WARN_ON(!page))
110 return -ENOMEM;
111 set_pte_at(&init_mm, addr, pte, mk_pte(page, prot)); // set pte
112 (*nr)++;
113 } while (pte++, addr += PAGE_SIZE, addr != end);
114 return 0;
115 }

vfree() -> __vunmap() -> remove_vm_area() -> ... -> vunmap_page_range():

static void vunmap_page_range(unsigned long addr, unsigned long end)
{
    pgd_t *pgd;
    unsigned long next;

    BUG_ON(addr >= end);
    pgd = pgd_offset_k(addr);
    do {
        next = pgd_addr_end(addr, end);
        if (pgd_none_or_clear_bad(pgd))
            continue;
        vunmap_pud_range(pgd, addr, next);
    } while (pgd++, addr = next, addr != end);
}

vunmap_pud_range() -> vunmap_pmd_range() -> vunmap_pte_range():
37 static void vunmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end)
38 {
39 pte_t *pte;
40
41 pte = pte_offset_kernel(pmd, addr);
42 do {
43 pte_t ptent = ptep_get_and_clear(&init_mm, addr, pte); // clear pte
44 WARN_ON(!pte_none(ptent) && !pte_present(ptent));
45 } while (pte++, addr += PAGE_SIZE, addr != end);
46 }

Note that fork() also copies the top 256 pgd entries of the swapper process, which means that as soon as a forked user process switches into kernel mode it can use kernel-space addresses, because the page-table entries are already there. The sharing of the kernel page tables among all processes shows up here as well.

copy_mm()->dup_mm()->mm_init()->mm_alloc_pgd()->pgd_alloc()-> pgd_ctor(pgd):

68 static void pgd_ctor(pgd_t *pgd)
69 {
70 /* If the pgd points to a shared pagetable level (either the
71 ptes in non-PAE, or shared PMD in PAE), then just copy the
72 references from swapper_pg_dir. */
73 if (PAGETABLE_LEVELS == 2 ||
74 (PAGETABLE_LEVELS == 3 && SHARED_KERNEL_PMD) ||
75 PAGETABLE_LEVELS == 4) {
76 clone_pgd_range(pgd + KERNEL_PGD_BOUNDARY,
77 swapper_pg_dir + KERNEL_PGD_BOUNDARY,
78 KERNEL_PGD_PTRS);
79 paravirt_alloc_pmd_clone(__pa(pgd) >> PAGE_SHIFT,
80 __pa(swapper_pg_dir) >> PAGE_SHIFT,
81 KERNEL_PGD_BOUNDARY,
82 KERNEL_PGD_PTRS);
83 }
84
85 /* list required to sync kernel mapping updates */
86 if (!SHARED_KERNEL_PMD)
87 pgd_list_add(pgd);
88 }

#define KERNEL_PGD_BOUNDARY pgd_index(PAGE_OFFSET) // 768
#define KERNEL_PGD_PTRS (PTRS_PER_PGD - KERNEL_PGD_BOUNDARY) // 1024-768=256
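The two numbers in the comments come from a simple calculation (PAGE_OFFSET = 0xC0000000, PGDIR_SHIFT = 22 and PTRS_PER_PGD = 1024 on non-PAE x86-32); a stand-alone check:

/* pgd_index(PAGE_OFFSET) on non-PAE x86-32: 0xC0000000 >> 22 == 768. */
#include <stdio.h>

int main(void)
{
    unsigned long page_offset = 0xC0000000UL;
    unsigned long boundary = page_offset >> 22;             /* KERNEL_PGD_BOUNDARY */
    printf("KERNEL_PGD_BOUNDARY = %lu\n", boundary);        /* 768 */
    printf("KERNEL_PGD_PTRS     = %lu\n", 1024 - boundary); /* 256 */
    return 0;
}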

Case 2: the comments in do_page_fault() already explain it, so I won't go into it.
Case 3: do_page_fault() -> handle_mm_fault() -> handle_pte_fault():
/*
* These routines also need to handle stuff like marking pages dirty
* and/or accessed for architectures that don't do it in hardware (most
* RISC architectures). The early dirtying is also good on the i386.
*
* There is also a hook called "update_mmu_cache()" that architectures
* with external mmu caches can use to update those (ie the Sparc or
* PowerPC hashed page tables that act as extended TLBs).
*
* We enter with non-exclusive mmap_sem (to exclude vma changes,
* but allow concurrent faults), and pte mapped but not yet locked.
* We return with mmap_sem still held, but pte unmapped and unlocked.
*/
static inline int handle_pte_fault(struct mm_struct *mm, struct vm_area_struct *vma,
                                   unsigned long address, pte_t *pte, pmd_t *pmd, unsigned int flags)
{
    pte_t entry;
    spinlock_t *ptl;

    entry = *pte;
    if (!pte_present(entry)) {          // demand paging
        if (pte_none(entry)) {
            if (vma->vm_ops) {
                if (likely(vma->vm_ops->fault))
                    return do_linear_fault(mm, vma, address, pte, pmd, flags, entry);
            }
            return do_anonymous_page(mm, vma, address, pte, pmd, flags);
        }
        if (pte_file(entry))
            return do_nonlinear_fault(mm, vma, address, pte, pmd, flags, entry);
        return do_swap_page(mm, vma, address, pte, pmd, flags, entry);
    }

    ptl = pte_lockptr(mm, pmd);
    spin_lock(ptl);
    if (unlikely(!pte_same(*pte, entry)))
        goto unlock;
    if (flags & FAULT_FLAG_WRITE) {
        if (!pte_write(entry))
            return do_wp_page(mm, vma, address, pte, pmd, ptl, entry);  // COW handling
        entry = pte_mkdirty(entry);
    }
    entry = pte_mkyoung(entry);
    if (ptep_set_access_flags(vma, address, pte, entry, flags & FAULT_FLAG_WRITE)) {
        update_mmu_cache(vma, address, entry);
    } else {
        /*
         * This is needed only for protection faults but the arch code
         * is not yet telling us if this is a protection fault or not.
         * This still avoids useless tlb flushes for .text page faults
         * with threads.
         */
        if (flags & FAULT_FLAG_WRITE)
            flush_tlb_page(vma, address);
    }
unlock:
    pte_unmap_unlock(pte, ptl);
    return 0;
}

As you can see, demand paging itself splits into several sub-cases; I'll leave those aside for now and look at COW first.
fork() copies the pte to the child but does not copy the page behind it; it only marks the pte read-only in both parent and child. If the child later wants to write, that triggers a page fault and the COW handling allocates the real physical page. Here is the relevant part of fork():

do_fork() -> copy_process() -> copy_mm() -> dup_mm() -> dup_mmap() -> copy_page_range() ->
copy_pud_range() -> copy_pmd_range() -> copy_pte_range() -> copy_one_pte():

569 /*
570 * copy one vm_area from one task to the other. Assumes the page tables
571 * already present in the new task to be cleared in the whole range
572 * covered by this vma.
573 */
574
575 static inline unsigned long
576 copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
577 pte_t *dst_pte, pte_t *src_pte, struct vm_area_struct *vma,
578 unsigned long addr, int *rss)
579 {
......

614 /*
615 * If it's a COW mapping, write protect it both
616 * in the parent and the child
617 */
618 if (is_cow_mapping(vm_flags)) { // COW
619 ptep_set_wrprotect(src_mm, addr, src_pte);
620 pte = pte_wrprotect(pte);
621 }
622
623 /*
624 * If it's a shared mapping, mark it clean in
625 * the child
626 */
627 if (vm_flags & VM_SHARED)
628 pte = pte_mkclean(pte);
629 pte = pte_mkold(pte);
630
631 page = vm_normal_page(vma, addr, pte);
632 if (page) {
633 get_page(page); // COW keypoint: page->_count++
634 page_dup_rmap(page);
635 rss[PageAnon(page)]++;
636 }
637
638 out_set_pte:
639 set_pte_at(dst_mm, addr, dst_pte, pte);
640 return 0;
641 }

static inline int is_cow_mapping(unsigned int flags)
{
    return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
}

static inline pte_t pte_wrprotect(pte_t pte)
{
    return pte_clear_flags(pte, _PAGE_RW);
}

pte_wrprotect() removes the _PAGE_RW permission from the pte, making it read-only.
vm_normal_page() goes from the pte to the struct page; get_page() just increments the reference count.
Note that a COW mapping must not have VM_SHARED set; the VM_SHARED flag is handled by a different path.
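The effect of that condition is easy to see from user space (my own demo, not from the kernel sources): after fork(), a child's write to a MAP_PRIVATE page goes through COW and stays invisible to the parent, while a write to a MAP_SHARED page does not.

/* is_cow_mapping() seen from user space: MAP_PRIVATE memory is COWed
 * across fork(), MAP_SHARED memory is not. */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    char *priv = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    char *shrd = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                      MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    strcpy(priv, "parent");
    strcpy(shrd, "parent");

    if (fork() == 0) {                 /* the child writes both pages */
        strcpy(priv, "child");         /* read-only pte -> do_wp_page() -> COW copy */
        strcpy(shrd, "child");         /* shared mapping: the parent sees this */
        _exit(0);
    }
    wait(NULL);
    printf("private: %s, shared: %s\n", priv, shrd);   /* prints "parent", "child" */
    return 0;
}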

Now let's see how page fault and fork() cooperate to implement the COW mechanism; the code comments even use "cow" as a verb:

1985 /*
1986 * This routine handles present pages, when users try to write
1987 * to a shared page. It is done by copying the page to a new address
1988 * and decrementing the shared-page counter for the old page.
1989 *
1990 * Note that this routine assumes that the protection checks have been
1991 * done by the caller (the low-level page fault routine in most cases).
1992 * Thus we can safely just mark it writable once we've done any necessary
1993 * COW.
1994 *
1995 * We also mark the page dirty at this point even though the page will
1996 * change only once the write actually happens. This avoids a few races,
1997 * and potentially makes it more efficient.
1998 *
1999 * We enter with non-exclusive mmap_sem (to exclude vma changes,
2000 * but allow concurrent faults), with pte both mapped and locked.
2001 * We return with mmap_sem still held, but pte unmapped and unlocked.
2002 */
2003 static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
2004 unsigned long address, pte_t *page_table, pmd_t *pmd,
2005 spinlock_t *ptl, pte_t orig_pte)
2006 {
2007 struct page *old_page, *new_page;
2008 pte_t entry;
2009 int reuse = 0, ret = 0;
2010 int page_mkwrite = 0;
2011 struct page *dirty_page = NULL;
2012
2013 old_page = vm_normal_page(vma, address, orig_pte);
2014 if (!old_page) {
2015 /*
2016 * VM_MIXEDMAP !pfn_valid() case
2017 *
2018 * We should not cow pages in a shared writeable mapping.
2019 * Just mark the pages writable as we can't do any dirty
2020 * accounting on raw pfn maps.
2021 */
2022 if ((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
2023 (VM_WRITE|VM_SHARED))
2024 goto reuse;
2025 goto gotten;
2026 }
2027
2028 /*
2029 * Take out anonymous pages first, anonymous shared vmas are
2030 * not dirty accountable.
2031 */
2032 if (PageAnon(old_page) && !PageKsm(old_page)) {
2033 if (!trylock_page(old_page)) {
2034 page_cache_get(old_page);
2035 pte_unmap_unlock(page_table, ptl);
2036 lock_page(old_page);
2037 page_table = pte_offset_map_lock(mm, pmd, address,
2038 &ptl);
2039 if (!pte_same(*page_table, orig_pte)) {
2040 unlock_page(old_page);
2041 page_cache_release(old_page);
2042 goto unlock;
2043 }
2044 page_cache_release(old_page);
2045 }
2046 reuse = reuse_swap_page(old_page);
2047 unlock_page(old_page);
2048 } else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
2049 (VM_WRITE|VM_SHARED))) {
2050 /*
2051 * Only catch write-faults on shared writable pages,
2052 * read-only shared pages can get COWed by
2053 * get_user_pages(.write=1, .force=1).
2054 */
2055 if (vma->vm_ops && vma->vm_ops->page_mkwrite) {
2056 struct vm_fault vmf;
2057 int tmp;
2058
2059 vmf.virtual_address = (void __user *)(address &
2060 PAGE_MASK);
2061 vmf.pgoff = old_page->index;
2062 vmf.flags = FAULT_FLAG_WRITE|FAULT_FLAG_MKWRITE;
2063 vmf.page = old_page;
2064
2065 /*
2066 * Notify the address space that the page is about to
2067 * become writable so that it can prohibit this or wait
2068 * for the page to get into an appropriate state.
2069 *
2070 * We do this without the lock held, so that it can
2071 * sleep if it needs to.
2072 */
2073 page_cache_get(old_page);
2074 pte_unmap_unlock(page_table, ptl);
2075
2076 tmp = vma->vm_ops->page_mkwrite(vma, &vmf);
2077 if (unlikely(tmp &
2078 (VM_FAULT_ERROR | VM_FAULT_NOPAGE))) {
2079 ret = tmp;
2080 goto unwritable_page;
2081 }
2082 if (unlikely(!(tmp & VM_FAULT_LOCKED))) {
2083 lock_page(old_page);
2084 if (!old_page->mapping) {
2085 ret = 0; /* retry the fault */
2086 unlock_page(old_page);
2087 goto unwritable_page;
2088 }
2089 } else
2090 VM_BUG_ON(!PageLocked(old_page));
2091
2092 /*
2093 * Since we dropped the lock we need to revalidate
2094 * the PTE as someone else may have changed it. If
2095 * they did, we just return, as we can count on the
2096 * MMU to tell us if they didn't also make it writable.
2097 */
2098 page_table = pte_offset_map_lock(mm, pmd, address,
2099 &ptl);
2100 if (!pte_same(*page_table, orig_pte)) {
2101 unlock_page(old_page);
2102 page_cache_release(old_page);
2103 goto unlock;
2104 }
2105
2106 page_mkwrite = 1;
2107 }
2108 dirty_page = old_page;
2109 get_page(dirty_page);
2110 reuse = 1;
2111 }
2112
2113 if (reuse) {
2114 reuse:
2115 flush_cache_page(vma, address, pte_pfn(orig_pte));
2116 entry = pte_mkyoung(orig_pte);
2117 entry = maybe_mkwrite(pte_mkdirty(entry), vma);
2118 if (ptep_set_access_flags(vma, address, page_table, entry,1))
2119 update_mmu_cache(vma, address, entry);
2120 ret |= VM_FAULT_WRITE;
2121 goto unlock;
2122 }
2123
2124 /*
2125 * Ok, we need to copy. Oh, well.. // start COW handling
2126 */
2127 page_cache_get(old_page); // old_page->_count++
2128 gotten:
2129 pte_unmap_unlock(page_table, ptl);
2130
2131 if (unlikely(anon_vma_prepare(vma)))
2132 goto oom;
2133
2134 if (is_zero_pfn(pte_pfn(orig_pte))) {
2135 new_page = alloc_zeroed_user_highpage_movable(vma, address);
2136 if (!new_page)
2137 goto oom;
2138 } else {
2139 new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
2140 if (!new_page)
2141 goto oom;
2142 cow_user_page(new_page, old_page, address, vma); // copy old_page contents to new_page
2143 }
2144 __SetPageUptodate(new_page);
2145
2146 /*
2147 * Don't let another task, with possibly unlocked vma,
2148 * keep the mlocked page.
2149 */
2150 if ((vma->vm_flags & VM_LOCKED) && old_page) {
2151 lock_page(old_page); /* for LRU manipulation */
2152 clear_page_mlock(old_page);
2153 unlock_page(old_page);
2154 }
2155
2156 if (mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))
2157 goto oom_free_new;
2158
2159 /*
2160 * Re-check the pte - we dropped the lock
2161 */
2162 page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
2163 if (likely(pte_same(*page_table, orig_pte))) {
2164 if (old_page) {
2165 if (!PageAnon(old_page)) {
2166 dec_mm_counter(mm, file_rss);
2167 inc_mm_counter(mm, anon_rss);
2168 }
2169 } else
2170 inc_mm_counter(mm, anon_rss);
2171 flush_cache_page(vma, address, pte_pfn(orig_pte));
2172 entry = mk_pte(new_page, vma->vm_page_prot); // create pte
2173 entry = maybe_mkwrite(pte_mkdirty(entry), vma);
2174 /*
2175 * Clear the pte entry and flush it first, before updating the
2176 * pte with the new entry. This will avoid a race condition
2177 * seen in the presence of one thread doing SMC and another
2178 * thread doing COW.
2179 */
2180 ptep_clear_flush(vma, address, page_table);
2181 page_add_new_anon_rmap(new_page, vma, address);
2182 /*
2183 * We call the notify macro here because, when using secondary
2184 * mmu page tables (such as kvm shadow page tables), we want the
2185 * new page to be mapped directly into the secondary page table.
2186 */
2187 set_pte_at_notify(mm, address, page_table, entry);
2188 update_mmu_cache(vma, address, entry);
2189 if (old_page) {
2190 /*
2191 * Only after switching the pte to the new page may
2192 * we remove the mapcount here. Otherwise another
2193 * process may come and find the rmap count decremented
2194 * before the pte is switched to the new page, and
2195 * "reuse" the old page writing into it while our pte
2196 * here still points into it and can be read by other
2197 * threads.
2198 *
2199 * The critical issue is to order this
2200 * page_remove_rmap with the ptp_clear_flush above.
2201 * Those stores are ordered by (if nothing else,)
2202 * the barrier present in the atomic_add_negative
2203 * in page_remove_rmap.
2204 *
2205 * Then the TLB flush in ptep_clear_flush ensures that
2206 * no process can access the old page before the
2207 * decremented mapcount is visible. And the old page
2208 * cannot be reused until after the decremented
2209 * mapcount is visible. So transitively, TLBs to
2210 * old page will be flushed before it can be reused.
2211 */
2212 page_remove_rmap(old_page);
2213 }
2214
2215 /* Free the old page.. */
2216 new_page = old_page; // attention!
2217 ret |= VM_FAULT_WRITE;
2218 } else
2219 mem_cgroup_uncharge_page(new_page);
2220
2221 if (new_page)
2222 page_cache_release(new_page);
2223 if (old_page)
2224 page_cache_release(old_page);
2225 unlock:
2226 pte_unmap_unlock(page_table, ptl);
2227 if (dirty_page) {
2228 /*
2229 * Yes, Virginia, this is actually required to prevent a race
2230 * with clear_page_dirty_for_io() from clearing the page dirty
2231 * bit after it clear all dirty ptes, but before a racing
2232 * do_wp_page installs a dirty pte.
2233 *
2234 * do_no_page is protected similarly.
2235 */
2236 if (!page_mkwrite) {
2237 wait_on_page_locked(dirty_page);
2238 set_page_dirty_balance(dirty_page, page_mkwrite);
2239 }
2240 put_page(dirty_page);
2241 if (page_mkwrite) {
2242 struct address_space *mapping = dirty_page->mapping;
2243
2244 set_page_dirty(dirty_page);
2245 unlock_page(dirty_page);
2246 page_cache_release(dirty_page);
2247 if (mapping) {
2248 /*
2249 * Some device drivers do not set page.mapping
2250 * but still dirty their pages
2251 */
2252 balance_dirty_pages_ratelimited(mapping);
2253 }
2254 }
2255
2256 /* file_update_time outside page_lock */
2257 if (vma->vm_file)
2258 file_update_time(vma->vm_file);
2259 }
2260 return ret;
2261 oom_free_new:
2262 page_cache_release(new_page);
2263 oom:
2264 if (old_page) {
2265 if (page_mkwrite) {
2266 unlock_page(old_page);
2267 page_cache_release(old_page);
2268 }
2269 page_cache_release(old_page);
2270 }
2271 return VM_FAULT_OOM;
2272
2273 unwritable_page:
2274 page_cache_release(old_page);
2275 return ret;
2276 }

#define page_cache_get(page) get_page(page)
#define page_cache_release(page) put_page(page)

As the comment says: a read-only shared page is handled by COW, whereas a page fault on a writable shared page is handled in a different way.

Setting aside the page-cache-related parts and skipping over PageAnon, go straight to the COW handling:
First page_cache_get(old_page) is executed.
If the old page is the zero page, it is enough to allocate a new page that is zero-initialized directly, which avoids polluting the cache.
Otherwise, after allocating a new page, cow_user_page() copies the old page's contents into it.

Then the pte corresponding to the new page is assembled. When execution reaches the lines below, keep in mind the earlier assignment at line 2216: new_page = old_page;
2221 if (new_page)
2222 page_cache_release(new_page);
2223 if (old_page)
2224 page_cache_release(old_page);

So old_page->_count is actually decremented twice here, while the matching page_cache_get(old_page) was executed only once. Even so, _count cannot go negative, because fork() has already incremented it: 633 get_page(page);

In other words, every fork() increments page->_count once, so after fork() the page carries one extra reference per child. When a child then performs a write, the resulting page fault is handled as COW: the child gets a new physical page with the contents of the parent's page copied in, and the child's extra reference on the old page is dropped. Once every child has gone through its COW copy, only one user of the old page remains; a later write to the old page then no longer needs a new physical page and can simply write to it directly (the reuse path).
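The child's COW fault itself can also be observed directly (again a user-space sketch of mine): the child's first write to an inherited MAP_PRIVATE page shows up as an extra minor fault.

/* Counting the COW fault: the child's first write to an inherited page
 * bumps its minor-fault counter. */
#include <stdio.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    p[0] = 1;                          /* the parent faults the page in now */

    if (fork() == 0) {
        struct rusage ru;
        long before;

        getrusage(RUSAGE_SELF, &ru);
        before = ru.ru_minflt;

        p[0] = 2;                      /* write-protected pte -> do_wp_page() */

        getrusage(RUSAGE_SELF, &ru);
        printf("child COW minor faults: %ld\n", ru.ru_minflt - before);  /* >= 1 */
        _exit(0);
    }
    wait(NULL);
    return 0;
}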

After the _count decrement, the old page's permissions are left alone: its pte stays read-only so that a later write can still go through the COW handling.
The _count field is what resolves the race conditions here; the key to COW is exactly this reference-count handling.
Not allocating physical memory for the child at fork() time saves memory and time, but it makes the _count handling tricky and the code hard to read.
For instance, if the only goal were to cancel out the get_page() done in fork(), a single page_cache_release() here would be enough;
taking the other cases into account as well, the code does one page_cache_get() and two page_cache_release() calls.
So the reference count is at the heart of COW: COW reduces memory use and speeds things up, but at the cost of more complex code.

In this path dirty_page == NULL, so the function simply finishes here.
