proc is a memory-based filesystem; it lives entirely in RAM. Tuning a system used to mean editing and recompiling the kernel source. proc exposes the commonly used kernel parameters to userspace, so the system can be tuned by writing to the corresponding files — no source changes, not even a reboot. Very convenient.
Take the OOM killer: when free memory drops below certain thresholds, the kernel kills a process to free memory and keep the system from crashing. Changing those thresholds only takes writing to the corresponding proc files. Only if you want your own policy for picking the victim process do you have to touch the source.
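Reading and writing these tunables is just file I/O on /proc. A minimal sketch in C, equivalent to cat and echo (it assumes /proc/sys/vm/panic_on_oom, a tunable that shows up again later in this post):

/* A minimal sketch: read a kernel tunable through proc. Assumes only
 * that /proc/sys/vm/panic_on_oom exists (it does on this 2.6.18 box);
 * this is plain file I/O, the same thing cat and echo do. */
#include <stdio.h>

int main(void)
{
    char buf[64];
    FILE *f = fopen("/proc/sys/vm/panic_on_oom", "r");

    if (!f) {
        perror("fopen");
        return 1;
    }
    if (fgets(buf, sizeof(buf), f))
        printf("panic_on_oom = %s", buf);  /* 0 means: run the OOM killer */
    fclose(f);

    /* Writing works the same way, but needs root:
     *   f = fopen("/proc/sys/vm/panic_on_oom", "w");
     *   fputs("0\n", f);
     */
    return 0;
}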
Someone online reported that on his CentOS 5.4 box, with 600-odd MB of memory still free, a user process asking for 200 MB made the kernel kill one of the system's processes. /var/log/messages carried the OOM killer's record, but the log alone didn't say why. I had never run into the OOM killer myself, so this was a good chance to walk through it. His machine's key parameters:
# uname -a
Linux lisa1043.lisa.com 2.6.18.8-xen #1 SMP Fri Jul 2 17:26:35 CST 2010 i686 i686 i386 GNU/Linux
A 32-bit system.
# free -m
total used free shared buffers cached
Mem: 4021 3394 626 0 11 15
-/+ buffers/cache: 3367 653
Swap: 0 0 0
So: 4 GB of RAM, and the page cache is down to 26 MB (buffers 11 MB + cached 15 MB). Memory really is tight; the page cache has been squeezed almost dry. 653 MB remain free (after adding back buffers/cache). Since an OOM kill happened anyway, thresholds must be involved. If the threshold is very small, the OOM killer is next to useless: killing another process itself takes some spare memory to carry out, so by that point the system will mostly crash anyway. Set the threshold too large and memory is wasted. So choosing the threshold is a tradeoff: to gain stability you cannot be afraid of wasting some memory, and keeping that waste as small as possible is something the administrator has to settle by repeated experiment.
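The trigger itself is easy to reproduce. Presumably the runmem process that gets killed in the log below did something of this sort — a hypothetical reconstruction, with memset forcing every page to actually be faulted in:

/* Hypothetical reconstruction of "runmem": allocate ~200 MB and touch
 * it so the pages are really allocated, not just reserved. */
#include <stdlib.h>
#include <string.h>
#include <stdio.h>

int main(void)
{
    size_t sz = 200UL * 1024 * 1024;
    char *p = malloc(sz);

    if (!p) {
        perror("malloc");
        return 1;
    }
    memset(p, 1, sz);           /* fault every page in */
    printf("holding %zu MB, press Enter to exit\n", sz >> 20);
    getchar();
    free(p);
    return 0;
}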
# cat /var/log/messages
Jun 10 10:01:18 lisa1043 kernel: [23217.687231] oom-killer: gfp_mask=0x201d2, order=0
Jun 10 10:01:18 lisa1043 kernel: [23217.687237] [] out_of_memory+0x25/0x13d
Jun 10 10:01:18 lisa1043 kernel: [23217.687244] [] __alloc_pages+0x1fe/0x27e
Jun 10 10:01:18 lisa1043 kernel: [23217.687249] [] __do_page_cache_readahead+0xef/0x22f
Jun 10 10:01:18 lisa1043 kernel: [23217.687253] [] sync_page+0x0/0x3b
Jun 10 10:01:18 lisa1043 kernel: [23217.687256] [] __delayacct_blkio_end+0x32/0x35
Jun 10 10:01:18 lisa1043 kernel: [23217.687259] [] __wait_on_bit_lock+0x4b/0x52
Jun 10 10:01:18 lisa1043 kernel: [23217.687264] [] dm_any_congested+0x2f/0x35 [dm_mod]
Jun 10 10:01:18 lisa1043 kernel: [23217.687273] [] filemap_nopage+0x151/0x312
Jun 10 10:01:18 lisa1043 kernel: [23217.687278] [] __handle_mm_fault+0x71f/0x1481
Jun 10 10:01:18 lisa1043 kernel: [23217.687290] [] pty_write+0x2a/0x34
Jun 10 10:01:18 lisa1043 kernel: [23217.687296] [] tty_default_put_char+0x17/0x1a
Jun 10 10:01:18 lisa1043 kernel: [23217.687299] [] remove_wait_queue+0xc/0x34
Jun 10 10:01:18 lisa1043 kernel: [23217.687302] [] __wake_up+0x2a/0x3d
Jun 10 10:01:18 lisa1043 kernel: [23217.687308] [] schedule+0x6a9/0x788
Jun 10 10:01:18 lisa1043 kernel: [23217.687313] [] do_page_fault+0x611/0xa5c
Jun 10 10:01:18 lisa1043 kernel: [23217.687319] [] do_page_fault+0x0/0xa5c
Jun 10 10:01:18 lisa1043 kernel: [23217.687322] [] error_code+0x2b/0x30
Jun 10 10:01:18 lisa1043 kernel: [23217.687329] Mem-info:
Jun 10 10:01:18 lisa1043 kernel: [23217.687330] DMA per-cpu:
Jun 10 10:01:18 lisa1043 kernel: [23217.687331] cpu 0 hot: high 0, batch 1 used:0
Jun 10 10:01:18 lisa1043 kernel: [23217.687332] cpu 0 cold: high 0, batch 1 used:0
Jun 10 10:01:18 lisa1043 kernel: [23217.687334] cpu 1 hot: high 0, batch 1 used:0
Jun 10 10:01:18 lisa1043 kernel: [23217.687335] cpu 1 cold: high 0, batch 1 used:0
Jun 10 10:01:18 lisa1043 kernel: [23217.687336] cpu 2 hot: high 0, batch 1 used:0
Jun 10 10:01:18 lisa1043 kernel: [23217.687337] cpu 2 cold: high 0, batch 1 used:0
Jun 10 10:01:18 lisa1043 kernel: [23217.687339] cpu 3 hot: high 0, batch 1 used:0
Jun 10 10:01:18 lisa1043 kernel: [23217.687340] cpu 3 cold: high 0, batch 1 used:0
Jun 10 10:01:18 lisa1043 kernel: [23217.687341] DMA32 per-cpu: empty
Jun 10 10:01:18 lisa1043 kernel: [23217.687342] Normal per-cpu:
Jun 10 10:01:18 lisa1043 kernel: [23217.687343] cpu 0 hot: high 186, batch 31 used:19
Jun 10 10:01:18 lisa1043 kernel: [23217.687345] cpu 0 cold: high 62, batch 15 used:50
Jun 10 10:01:18 lisa1043 kernel: [23217.687346] cpu 1 hot: high 186, batch 31 used:183
Jun 10 10:01:18 lisa1043 kernel: [23217.687347] cpu 1 cold: high 62, batch 15 used:56
Jun 10 10:01:18 lisa1043 kernel: [23217.687349] cpu 2 hot: high 186, batch 31 used:128
Jun 10 10:01:18 lisa1043 kernel: [23217.687350] cpu 2 cold: high 62, batch 15 used:48
Jun 10 10:01:18 lisa1043 kernel: [23217.687351] cpu 3 hot: high 186, batch 31 used:30
Jun 10 10:01:18 lisa1043 kernel: [23217.687353] cpu 3 cold: high 62, batch 15 used:52
Jun 10 10:01:18 lisa1043 kernel: [23217.687354] HighMem per-cpu:
Jun 10 10:01:18 lisa1043 kernel: [23217.687355] cpu 0 hot: high 186, batch 31 used:31
Jun 10 10:01:18 lisa1043 kernel: [23217.687356] cpu 0 cold: high 62, batch 15 used:23
Jun 10 10:01:18 lisa1043 kernel: [23217.687358] cpu 1 hot: high 186, batch 31 used:31
Jun 10 10:01:18 lisa1043 kernel: [23217.687359] cpu 1 cold: high 62, batch 15 used:52
Jun 10 10:01:18 lisa1043 kernel: [23217.687360] cpu 2 hot: high 186, batch 31 used:29
Jun 10 10:01:18 lisa1043 kernel: [23217.687361] cpu 2 cold: high 62, batch 15 used:46
Jun 10 10:01:18 lisa1043 kernel: [23217.687363] cpu 3 hot: high 186, batch 31 used:30
Jun 10 10:01:18 lisa1043 kernel: [23217.687364] cpu 3 cold: high 62, batch 15 used:31
Jun 10 10:01:18 lisa1043 kernel: [23217.687366] Free pages: 565692kB (780kB HighMem)
Jun 10 10:01:18 lisa1043 kernel: [23217.687368] Active:806839 inactive:429
dirty:0 writeback:0 unstable:0 free:141423 slab:3667 mapped:400 pagetables:1866
Jun 10 10:01:18 lisa1043 kernel: [23217.687371] DMA free:12640kB min:68kB low:84kB
high:100kB active:0kB inactive:0kB present:16384kB pages_scanned:0 all_unreclaimable? yes
Jun 10 10:01:18 lisa1043 kernel: [23217.687372] lowmem_reserve[]: 0 0 851 24149
Jun 10 10:01:18 lisa1043 kernel: [23217.687375] DMA32 free:0kB min:0kB low:0kB
high:0kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
Jun 10 10:01:18 lisa1043 kernel: [23217.687377] lowmem_reserve[]: 0 0 851 24149
Jun 10 10:01:18 lisa1043 kernel: [23217.687380] Normal free:552272kB min:3696kB low:4620kB
high:5544kB active:0kB inactive:128kB present:872440kB pages_scanned:192 all_unreclaimable? no
Jun 10 10:01:18 lisa1043 kernel: [23217.687382] lowmem_reserve[]: 0 0 0 186383
Jun 10 10:01:18 lisa1043 kernel: [23217.687385] HighMem free:780kB min:512kB
low:25796kB high:51080kB active:3227372kB inactive:1588kB present:23857080kB
pages_scanned:6578854 all_unreclaimable? yes
Jun 10 10:01:18 lisa1043 kernel: [23217.687387] lowmem_reserve[]: 0 0 0 0
Jun 10 10:01:18 lisa1043 kernel: [23217.687389] DMA: 2*4kB 3*8kB 4*16kB 2*32kB
1*64kB 3*128kB 1*256kB 1*512kB 1*1024kB 1*2048kB 2*4096kB = 12640kB
Jun 10 10:01:18 lisa1043 kernel: [23217.687394] DMA32: empty
Jun 10 10:01:18 lisa1043 kernel: [23217.687395] Normal: 948*4kB 1066*8kB
621*16kB 281*32kB 99*64kB 29*128kB 10*256kB 1*512kB 0*1024kB 0*2048kB 124*4096kB = 552272kB
Jun 10 10:01:18 lisa1043 kernel: [23217.687400] HighMem: 71*4kB 4*8kB 5*16kB
0*32kB 0*64kB 1*128kB 1*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 780kB
Jun 10 10:01:18 lisa1043 kernel: [23217.687405] Swap cache: add 0, delete 0, find 0/0, race 0+0
Jun 10 10:01:18 lisa1043 kernel: [23217.687406] Free swap = 0kB
Jun 10 10:01:18 lisa1043 kernel: [23217.687407] Total swap = 0kB
Jun 10 10:01:18 lisa1043 kernel: [23217.687408] Free swap: 0kB
Jun 10 10:01:18 lisa1043 kernel: [23217.740869] 6186476 pages of RAM
Jun 10 10:01:18 lisa1043 kernel: [23217.740872] 6186476 pages of RAM
Jun 10 10:01:18 lisa1043 kernel: [23217.740874] 5964270 pages of HIGHMEM
Jun 10 10:01:18 lisa1043 kernel: [23217.740876] 72276 reserved pages
Jun 10 10:01:18 lisa1043 kernel: [23217.740877] 5441 pages shared
Jun 10 10:01:18 lisa1043 kernel: [23217.740879] 0 pages swap cached
Jun 10 10:01:18 lisa1043 kernel: [23217.740880] 0 pages dirty
Jun 10 10:01:18 lisa1043 kernel: [23217.740881] 0 pages writeback
Jun 10 10:01:18 lisa1043 kernel: [23217.740883] 400 pages mapped
Jun 10 10:01:18 lisa1043 kernel: [23217.740884] 3667 pages slab
Jun 10 10:01:18 lisa1043 kernel: [23217.740885] 1866 pages pagetables
Jun 10 10:01:18 lisa1043 kernel: [23217.740887] 5964270 pages of HIGHMEM
Jun 10 10:01:18 lisa1043 kernel: [23217.740888] 72276 reserved pages
Jun 10 10:01:18 lisa1043 kernel: [23217.740889] 5440 pages shared
Jun 10 10:01:19 lisa1043 kernel: [23217.740890] 0 pages swap cached
Jun 10 10:01:19 lisa1043 kernel: [23217.740891] 0 pages dirty
Jun 10 10:01:19 lisa1043 kernel: [23217.740891] 0 pages writeback
Jun 10 10:01:19 lisa1043 kernel: [23217.740892] 400 pages mapped
Jun 10 10:01:19 lisa1043 kernel: [23217.740893] 3667 pages slab
Jun 10 10:01:19 lisa1043 kernel: [23217.740894] 1866 pages pagetables
Jun 10 10:01:19 lisa1043 kernel: [23217.740978] Out of Memory: Kill process 23774 (bash)
score 22314 and children.
Jun 10 10:01:19 lisa1043 kernel: [23217.740980] Out of memory: Killed process 23970 (runmem).
....
I couldn't parse the whole log, but it offers plenty of keywords to work from:
DMA DMA32 Normal HighMem lowmem_reserve[] active inactive
These are all zone vocabulary, which also brings the LRU lists to mind.
# cat /proc/zoneinfo
Node 0, zone DMA
pages free 3160
min 17
low 21
high 25
active 0
inactive 0
scanned 0 (a: 17 i: 17)
spanned 4096
present 4096
nr_anon_pages 0
nr_mapped 1
nr_file_pages 0
nr_slab 0
nr_page_table_pages 0
nr_dirty 0
nr_writeback 0
nr_unstable 0
nr_bounce 0
protection: (0, 0, 851, 24149)
pagesets
all_unreclaimable: 1
prev_priority: 12
start_pfn: 0
Node 0, zone Normal
pages free 137757
min 924
low 1155
high 1386
active 48
inactive 35
scanned 97 (a: 5 i: 7)
spanned 218110
present 218110
nr_anon_pages 0
nr_mapped 1
nr_file_pages 80
nr_slab 4052
nr_page_table_pages 1827
nr_dirty 0
nr_writeback 0
nr_unstable 0
nr_bounce 0
protection: (0, 0, 0, 186383)
pagesets
cpu: 0 pcp: 0
count: 9
high: 186
batch: 31
cpu: 0 pcp: 1
count: 61
high: 62
batch: 15
vm stats threshold: 24
cpu: 1 pcp: 0
count: 46
high: 186
batch: 31
cpu: 1 pcp: 1
count: 59
high: 62
batch: 15
vm stats threshold: 24
cpu: 2 pcp: 0
count: 60
high: 186
batch: 31
cpu: 2 pcp: 1
count: 51
high: 62
batch: 15
vm stats threshold: 24
cpu: 3 pcp: 0
count: 121
high: 186
batch: 31
cpu: 3 pcp: 1
count: 53
high: 62
batch: 15
vm stats threshold: 24
all_unreclaimable: 0
prev_priority: 2
start_pfn: 4096
Node 0, zone HighMem
pages free 11114
min 128
low 6449
high 12770
active 795251
inactive 381
scanned 297953 (a: 0 i: 20)
spanned 5964270
present 5964270
nr_anon_pages 793116
nr_mapped 2155
nr_file_pages 2494
nr_slab 0
nr_page_table_pages 0
nr_dirty 38
nr_writeback 0
nr_unstable 0
nr_bounce 0
protection: (0, 0, 0, 0)
pagesets
cpu: 0 pcp: 0
count: 152
high: 186
batch: 31
cpu: 0 pcp: 1
count: 0
high: 62
batch: 15
vm stats threshold: 54
cpu: 1 pcp: 0
count: 184
high: 186
batch: 31
cpu: 1 pcp: 1
count: 5
high: 62
batch: 15
vm stats threshold: 54
cpu: 2 pcp: 0
count: 71
high: 186
batch: 31
cpu: 2 pcp: 1
count: 3
high: 62
batch: 15
vm stats threshold: 54
cpu: 3 pcp: 0
count: 22
high: 186
batch: 31
cpu: 3 pcp: 1
count: 5
high: 62
batch: 15
vm stats threshold: 54
all_unreclaimable: 0
prev_priority: 2
start_pfn: 222206
Above is the information for the system's three memory zones — DMA, Normal, and HighMem; making sense of it requires knowing what the fields mean. Each zone has its own free_pages count and its own threshold. When a user process asks for memory, the allocator first tries zone HighMem; if its free_pages are below the threshold it moves on to zone Normal; if Normal's free_pages are below its threshold too it moves on to zone DMA; and if even DMA's free_pages are below the threshold, the only thing left is to invoke the OOM killer and kill some process.
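With the numbers from the zoneinfo above, that walk can be sketched in a few lines of C. This is an illustration only, not the kernel's code (the real check is zone_watermark_ok() and the zonelist walk in mm/page_alloc.c); here "threshold" means the high watermark plus the protection entry for a HighMem-capable request, which is exactly the comparison worked out below:

/* Illustration of the zone fallback, with this machine's numbers. */
#include <stdio.h>

struct zinfo {
    const char   *name;
    unsigned long free_pages;
    unsigned long wmark_high;   /* "high" watermark, in pages */
    unsigned long protection;   /* protection[3] on this box */
};

int main(void)
{
    struct zinfo z[] = {        /* values from the zoneinfo above */
        { "HighMem", 11114,  12770, 0      },
        { "Normal",  137757, 1386,  186383 },
        { "DMA",     3160,   25,    24149  },
    };

    for (int i = 0; i < 3; i++) {
        if (z[i].free_pages > z[i].wmark_high + z[i].protection) {
            printf("allocate from %s\n", z[i].name);
            return 0;
        }
        printf("%s: %lu <= %lu + %lu, fall back\n", z[i].name,
               z[i].free_pages, z[i].wmark_high, z[i].protection);
    }
    printf("all zones below threshold -> OOM killer\n");
    return 0;
}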
man proc has nothing on zoneinfo. Under Documentation/ ($ locate proc | grep Docu) there is filesystems/proc.txt, but it doesn't describe these fields either; it does point to Documentation/sysctl/ for the details. Searching Documentation/sysctl/vm.txt for zoneinfo finally turns up the descriptions. Following the document:
In zone HighMem: 11114 < 12770 + protection[3] = 12770 + 0, free pages are below the threshold, so the allocation cannot come from HighMem and falls back to zone Normal.
In zone Normal: 137757 < 1386 + protection[3] = 1386 + 186383, free pages are below the threshold here as well, so it falls back to DMA.
In zone DMA: 3160 < 25 + 24149, still no good.
At that point the OOM killer picks a victim process by its scoring rule and kills it.
The document also says the thresholds can be tuned through /proc/sys/vm/lowmem_reserve_ratio: to raise a threshold, decrease lowmem_reserve_ratio; to lower it, increase lowmem_reserve_ratio. His system had:
# sysctl -a | grep vm
...
vm.lowmem_reserve_ratio = 256 256 32
vm.panic_on_oom = 0
...
man proc, searching for oom, shows that panic_on_oom = 0 means that when the system cannot satisfy an allocation, it invokes the OOM killer to kill one process instead of panicking outright. So what's wanted here is smaller thresholds, i.e. a larger lowmem_reserve_ratio, for example:
# echo 512 512 64 > /proc/sys/vm/lowmem_reserve_ratio
How much to raise it can be worked out from the formula in the document:
zone[i]'s protection[j] is calculated by following expression.
(i < j):
zone[i]->protection[j]
= (total sums of present_pages from zone[i+1] to zone[j] on the node)
/ lowmem_reserve_ratio[i];
(i = j):
(should not be protected. = 0;
(i > j):
(not necessary, but looks 0)
The formula shows that a zone's threshold drags the zones above it into the calculation: a zone's protection values are computed from the present_pages of the zones above it, so the more pages the higher zones hold, the larger this zone's protection (its lowmem_reserve) becomes, and the harder it is to serve high-zone requests from down here. The intent is obvious: as long as the zones above me still hold plenty of pages, don't allocate too greedily from me. Say I am Normal. If my threshold were set without considering HighMem's present_pages, then when a request arrives that neither I nor DMA can satisfy, the OOM killer fires — requests only fall from higher zones to lower ones, the walk never climbs back up to retry HighMem — even though HighMem might still have plenty of pages to hand out.
One thing about protection puzzled me: there are three zones here, so why does protection have four elements? Reading the document again, its example is an x86-64 box, while this machine is 32-bit i686. The source shows that 64-bit gains an extra zone, DMA32. $ vi -t zone, then searching the file for DMA, finds zone_type:
enum zone_type {
#ifdef CONFIG_ZONE_DMA
        /*
         * ZONE_DMA is used when there are devices that are not able
         * to do DMA to all of addressable memory (ZONE_NORMAL). Then we
         * carve out the portion of memory that is needed for these devices.
         * The range is arch specific.
         *
         * Some examples
         *
         * Architecture         Limit
         * ---------------------------
         * parisc, ia64, sparc  <4G
         * s390                 <2G
         * arm                  Various
         * alpha                Unlimited or 0-16MB.
         *
         * i386, x86_64 and multiple other arches
         *                      <16M.
         */
        ZONE_DMA,
#endif
#ifdef CONFIG_ZONE_DMA32
        /*
         * x86_64 needs two ZONE_DMAs because it supports devices that are
         * only able to do DMA to the lower 16M but also 32 bit devices that
         * can only do DMA areas below 4G.
         */
        ZONE_DMA32,
#endif
        /*
         * Normal addressable memory is in ZONE_NORMAL. DMA operations can be
         * performed on pages in ZONE_NORMAL if the DMA devices support
         * transfers to all addressable memory.
         */
        ZONE_NORMAL,
#ifdef CONFIG_HIGHMEM
        /*
         * A memory area that is only addressable by the kernel through
         * mapping portions into its own address space. This is for example
         * used by i386 to allow the kernel to address the memory beyond
         * 900MB. The kernel will set up special mappings (page
         * table entries on i386) for each page that the kernel needs to
         * access.
         */
        ZONE_HIGHMEM,
#endif
        ZONE_MOVABLE,
        __MAX_NR_ZONES
};
/var/log/messages also confirms that DMA32 is empty:
Jun 10 10:01:18 lisa1043 kernel: [23217.687341] DMA32 per-cpu: empty
Let's verify:
Normal's protection[3] = zone[2]->protection[3]
= (zone[3] present_pages) / lowmem_reserve_ratio[2] = 5964270 / 32 = 186383
DMA's protection[3] = zone[0]->protection[3]
= (zone[1] + zone[2] + zone[3] present_pages) / lowmem_reserve_ratio[0]
= (0 + 218110 + 5964270) / 256 = 24149
DMA's protection[2] = zone[0]->protection[2]
= (zone[1] + zone[2] present_pages) / lowmem_reserve_ratio[0] = (0 + 218110) / 256 = 851
(Once DMA32 takes index 1, DMA is zone[0]; the empty DMA32 zone contributes 0 to the sums.) All three match the system's numbers, so the document has it right and there's no need to chase the formula through the source.
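The whole protection matrix can also be recomputed in a few lines of C, mirroring in simplified form what setup_per_zone_lowmem_reserve() does in mm/page_alloc.c; present_pages and ratios below are this box's values, with DMA32 present in the enum but empty:

/* Recompute every zone's protection[] from the documented formula.
 * Zone order: DMA, DMA32, Normal, HighMem. */
#include <stdio.h>

#define NR 4

int main(void)
{
    unsigned long present[NR]   = { 4096, 0, 218110, 5964270 };
    unsigned long ratio[NR - 1] = { 256, 256, 32 };
    unsigned long prot[NR][NR]  = { { 0 } };

    for (int i = 0; i < NR - 1; i++) {
        unsigned long sum = 0;
        for (int j = i + 1; j < NR; j++) {
            sum += present[j];
            prot[i][j] = sum / ratio[i];   /* integer division, as the doc says */
        }
    }
    for (int i = 0; i < NR; i++) {
        printf("zone[%d] protection: (", i);
        for (int j = 0; j < NR; j++)
            printf("%lu%s", prot[i][j], j == NR - 1 ? ")\n" : ", ");
    }
    return 0;
}

It prints (0, 0, 851, 24149) for both DMA and DMA32, (0, 0, 0, 186383) for Normal, and all zeros for HighMem — exactly the lowmem_reserve[] lines in the log.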
The document also explains what lowmem_reserve_ratio means and why it exists. Note again that its example is an x86-64 machine with four zones, whereas 32-bit i686 has three:
lowmem_reserve_ratio
For some specialised workloads on highmem machines it is dangerous for
the kernel to allow process memory to be allocated from the "lowmem"
zone. This is because that memory could then be pinned via the mlock()
system call, or by unavailability of swapspace.
And on large highmem machines this lack of reclaimable lowmem memory
can be fatal.
So the Linux page allocator has a mechanism which prevents allocations
which _could_ use highmem from using too much lowmem. This means that
a certain amount of lowmem is defended from the possibility of being
captured into pinned user memory.
(The same argument applies to the old 16 megabyte ISA DMA region. This
mechanism will also defend that region from allocations which could use
highmem or lowmem).
The `lowmem_reserve_ratio' tunable determines how aggressive the kernel is
in defending these lower zones.
If you have a machine which uses highmem or ISA DMA and your
applications are using mlock(), or if you are running with no swap then
you probably should change the lowmem_reserve_ratio setting.
The lowmem_reserve_ratio is an array. You can see them by reading this file.
-
% cat /proc/sys/vm/lowmem_reserve_ratio
256 256 32
-
Note: # of this elements is one fewer than number of zones. Because the highest
zone's value is not necessary for following calculation.
But, these values are not used directly. The kernel calculates # of protection
pages for each zones from them. These are shown as array of protection pages
in /proc/zoneinfo like followings. (This is an example of x86-64 box).
Each zone has an array of protection pages like this.
-
Node 0, zone DMA
pages free 1355
min 3
low 3
high 4
:
:
numa_other 0
protection: (0, 2004, 2004, 2004)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
pagesets
cpu: 0 pcp: 0
:
-
These protections are added to score to judge whether this zone should be used
for page allocation or should be reclaimed.
In this example, if normal pages (index=2) are required to this DMA zone and
watermark[WMARK_HIGH] is used for watermark, the kernel judges this zone should
not be used because pages_free(1355) is smaller than watermark + protection[2]
(4 + 2004 = 2008). If this protection value is 0, this zone would be used for
normal page requirement. If requirement is DMA zone(index=0), protection[0]
(=0) is used.
The source comments on lowmem_reserve (and the sysctl_lowmem_reserve_ratio behind it) as well. $ vi -t zone:
struct zone {
        /* Fields commonly accessed by the page allocator */

        /* zone watermarks, access with *_wmark_pages(zone) macros */
        unsigned long watermark[NR_WMARK];

        /*
         * We don't know if the memory that we're going to allocate will be freeable
         * or/and it will be released eventually, so to avoid totally wasting several
         * GB of ram we must reserve some of the lower zone memory (otherwise we risk
         * to run OOM on the lower zones despite there's tons of freeable ram
         * on the higher zones). This array is recalculated at runtime if the
         * sysctl_lowmem_reserve_ratio sysctl changes.
         */
        unsigned long lowmem_reserve[MAX_NR_ZONES];

#ifdef CONFIG_NUMA
        int node;
        /*
         * zone reclaim becomes active if more unmapped pages exist.
         */
        unsigned long min_unmapped_pages;
        unsigned long min_slab_pages;
#endif
        struct per_cpu_pageset __percpu *pageset;
        /*
         * free areas of different sizes
         */
        spinlock_t lock;
        int all_unreclaimable; /* All pages pinned */
#ifdef CONFIG_MEMORY_HOTPLUG
        /* see spanned/present_pages for more description */
        seqlock_t span_seqlock;
#endif
        struct free_area free_area[MAX_ORDER];

#ifndef CONFIG_SPARSEMEM
        /*
         * Flags for a pageblock_nr_pages block. See pageblock-flags.h.
         * In SPARSEMEM, this map is stored in struct mem_section
         */
        unsigned long *pageblock_flags;
#endif /* CONFIG_SPARSEMEM */


        ZONE_PADDING(_pad1_)

        /* Fields commonly accessed by the page reclaim scanner */
        spinlock_t lru_lock;
        struct zone_lru {
                struct list_head list;
        } lru[NR_LRU_LISTS];

        struct zone_reclaim_stat reclaim_stat;

        unsigned long pages_scanned; /* since last reclaim */
        unsigned long flags; /* zone flags, see below */

        /* Zone statistics */
        atomic_long_t vm_stat[NR_VM_ZONE_STAT_ITEMS];

        /*
         * prev_priority holds the scanning priority for this zone. It is
         * defined as the scanning priority at which we achieved our reclaim
         * target at the previous try_to_free_pages() or balance_pgdat()
         * invocation.
         *
         * We use prev_priority as a measure of how much stress page reclaim is
         * under - it drives the swappiness decision: whether to unmap mapped
         * pages.
         *
         * Access to both this field is quite racy even on uniprocessor. But
         * it is expected to average out OK.
         */
        int prev_priority;

        /*
         * The target ratio of ACTIVE_ANON to INACTIVE_ANON pages on
         * this zone's LRU. Maintained by the pageout code.
         */
        unsigned int inactive_ratio;


        ZONE_PADDING(_pad2_)
        /* Rarely used or read-mostly fields */

        /*
         * wait_table -- the array holding the hash table
         * wait_table_hash_nr_entries -- the size of the hash table array
         * wait_table_bits -- wait_table_size == (1 << wait_table_bits)
         *
         * The purpose of all these is to keep track of the people
         * waiting for a page to become available and make them
         * runnable again when possible. The trouble is that this
         * consumes a lot of space, especially when so few things
         * wait on pages at a given time. So instead of using
         * per-page waitqueues, we use a waitqueue hash table.
         *
         * The bucket discipline is to sleep on the same queue when
         * colliding and wake all in that wait queue when removing.
         * When something wakes, it must check to be sure its page is
         * truly available, a la thundering herd. The cost of a
         * collision is great, but given the expected load of the
         * table, they should be so rare as to be outweighed by the
         * benefits from the saved space.
         *
         * __wait_on_page_locked() and unlock_page() in mm/filemap.c, are the
         * primary users of these fields, and in mm/page_alloc.c
         * free_area_init_core() performs the initialization of them.
         */
        wait_queue_head_t * wait_table;
        unsigned long wait_table_hash_nr_entries;
        unsigned long wait_table_bits;

        /*
         * Discontig memory support fields.
         */
        struct pglist_data *zone_pgdat;
        /* zone_start_pfn == zone_start_paddr >> PAGE_SHIFT */
        unsigned long zone_start_pfn;

        /*
         * zone_start_pfn, spanned_pages and present_pages are all
         * protected by span_seqlock. It is a seqlock because it has
         * to be read outside of zone->lock, and it is done in the main
         * allocator path. But, it is written quite infrequently.
         *
         * The lock is declared along with zone->lock because it is
         * frequently read in proximity to zone->lock. It's good to
         * give them a chance of being in the same cacheline.
         */
        unsigned long spanned_pages;   /* total size, including holes */
        unsigned long present_pages;   /* amount of memory (excluding holes) */

        /*
         * rarely used fields:
         */
        const char *name;
} ____cacheline_internodealigned_in_smp;
The free_area[] array in there is the data structure behind the buddy allocator, which manages pages in power-of-two chunks. The guy's buddy info:
# cat /proc/buddyinfo
Node 0, zone DMA 2 3 4 2 1 3 1 1 1 1 2
Node 0, zone Normal 562 884 360 93 15 7 0 1 1 1 129
Node 0, zone HighMem 703 380 180 111 34 26 8 3 2 0 2
The second row, zone Normal, reads as follows:
the list of 2^0-page chunks has 562 chunks available for allocation
the list of 2^1-page chunks has 884 chunks available
the list of 2^2-page chunks has 360 chunks available
.....
the list of 2^10-page chunks has 129 chunks available
The buddyinfo shows HighMem holding only two 2^10-page chunks, 8 MB in all, nowhere near the 200 MB requested; Normal has 129 chunks of 2^10 pages, 129 * 4 MB = 516 MB, so by itself it should have been able to satisfy the request.
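As a sanity check, the free memory implied by a buddyinfo row is just the chunk counts weighted by 2^order. A small sketch using the Normal row above, assuming 4 KB pages:

/* Total free memory implied by one /proc/buddyinfo row:
 * counts[k] chunks of 2^k pages each. */
#include <stdio.h>

int main(void)
{
    unsigned long n[11] = { 562, 884, 360, 93, 15, 7, 0, 1, 1, 1, 129 };
    unsigned long pages = 0;

    for (int order = 0; order <= 10; order++)
        pages += n[order] << order;     /* chunks * 2^order pages each */

    printf("%lu pages = %lu kB free in Normal\n", pages, pages * 4);
    return 0;
}

It prints 137970 pages, about 551880 kB — in the same ballpark as the 137757 free pages zoneinfo reported (the snapshots were taken at different moments).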
Everything has its upside and its downside. Dragging the higher zones into the threshold calculation is mostly a win, but on this guy's machine the downside surfaced: HighMem's free pages were exhausted, yet Normal's protection value, computed from HighMem's huge present_pages, stayed enormous, so Normal refused the request even though it had hundreds of MB free of its own — no benefit gained, just dragged down with the neighbor. Incidentally, buddyinfo is also a quick way to gauge how fragmented the buddy allocator's free memory is.
struct zone also contains the LRU lists used to cache pages; when the system's free pages run low, the kernel reclaims the pages that are no longer in frequent use:
        /* Fields commonly accessed by the page reclaim scanner */
        spinlock_t lru_lock;
        struct zone_lru {
                struct list_head list;
        } lru[NR_LRU_LISTS];
Seeing these LRU lists reminds me of a question about page allocation that puzzled me for a long time. At boot the kernel maps essentially all physical pages — they get PTEs, and even highmem pages without permanent mappings can be reached through temporary ones — whereas a user process only gets PTEs, and hence access to physical pages, when it requests them. But if the kernel can already reach every physical page, what is left for user processes to claim? And the other way round, couldn't the kernel trample pages a user process has allocated? It felt like kernel and user processes were bound to collide over physical pages.
In reality the kernel, as the manager, genuinely can access an ordinary user process's pages, but kernel programmers write the code so that the kernel never corrupts them. Reads of user-process data certainly happen, but only the kernel knows about them: other user processes cannot see what the kernel read, cannot reach kernel space at all, and can only be managed. So reads do no harm.
Although the kernel sets up PTEs at boot and can directly touch all non-highmem physical memory (and any highmem page through a temporary mapping), that is access, not ownership. When a user process asks for pages, the kernel must hand them over as long as free pages exist; it never stands in the way of a process taking possession. Conversely, when the kernel itself needs fresh pages and none are free, it has no right to confiscate pages a user process already owns: it has to wait until some process frees pages, however long that takes. To avoid landing in that state, kernel programmers give the kernel good housekeeping: commonly used pages are kept on the LRU lists for quick reuse, and a free-page watermark is maintained so that when free pages run too low the kernel stops what it is doing and first frees some pages from the LRU lists. That way the kernel neither robs user processes of their pages nor, in most cases, has to sit waiting for them to release memory.
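That housekeeping can be caricatured in a few lines. This is a toy model of the idea only, nothing like the real reclaim logic in mm/vmscan.c:

/* Toy model: when free pages dip to the watermark, "reclaim" entries
 * from an inactive LRU list before handing a page out. */
#include <stdio.h>

#define WMARK_MIN 4

static int free_pages   = 6;   /* free pool */
static int inactive_lru = 10;  /* reclaimable pages on the inactive list */

static void reclaim(int want)  /* move pages from the LRU back to the pool */
{
    while (want-- > 0 && inactive_lru > 0) {
        inactive_lru--;
        free_pages++;
    }
}

static int alloc_page(void)
{
    if (free_pages <= WMARK_MIN)
        reclaim(WMARK_MIN);    /* below the watermark: clean up first */
    if (free_pages == 0)
        return -1;             /* nothing reclaimable left: must wait */
    free_pages--;
    return 0;
}

int main(void)
{
    for (int i = 0; i < 18; i++)
        printf("alloc %2d: %s (free=%d inactive=%d)\n", i,
               alloc_page() == 0 ? "ok" : "wait", free_pages, inactive_lru);
    return 0;
}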
It is precisely because the kernel, written by kernel programmers, has to manage the user processes written by application programmers that the kernel programmer has the harder job.
Changes through proc are critical and easy to get wrong, so never make them directly on a production machine: experiment on a development machine first. And reading the documentation alone is not enough; it has to be checked against the kernel source.