commit b0807bc10a6ac95ab8bf3bbf57703a0f2edd9aa9 Author: Jiri Slaby Date: Fri Sep 26 11:55:45 2014 +0200 Linux 3.12.30 commit 99ed1bd0c77355d65de5f112eb92d79f9bace84f Author: Mel Gorman Date: Thu Aug 28 19:35:45 2014 +0100 mm: page_alloc: reduce cost of the fair zone allocation policy commit 4ffeaf3560a52b4a69cc7909873d08c0ef5909d4 upstream. The fair zone allocation policy round-robins allocations between zones within a node to avoid age inversion problems during reclaim. If the first allocation fails, the batch counts are reset and a second attempt made before entering the slow path. One assumption made with this scheme is that batches expire at roughly the same time and the resets each time are justified. This assumption does not hold when zones reach their low watermark as the batches will be consumed at uneven rates. Allocation failure due to watermark depletion result in additional zonelist scans for the reset and another watermark check before hitting the slowpath. On UMA, the benefit is negligible -- around 0.25%. On 4-socket NUMA machine it's variable due to the variability of measuring overhead with the vmstat changes. The system CPU overhead comparison looks like 3.16.0-rc3 3.16.0-rc3 3.16.0-rc3 vanilla vmstat-v5 lowercost-v5 User 746.94 774.56 802.00 System 65336.22 32847.27 40852.33 Elapsed 27553.52 27415.04 27368.46 However it is worth noting that the overall benchmark still completed faster and intuitively it makes sense to take as few passes as possible through the zonelists. Signed-off-by: Mel Gorman Acked-by: Johannes Weiner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Jiri Slaby commit 51aad0a51582e4147380137ba34785663a1b5f93 Author: Mel Gorman Date: Thu Aug 28 19:35:44 2014 +0100 mm: page_alloc: abort fair zone allocation policy when remotes nodes are encountered commit f7b5d647946aae1647bf5cd26c16b3a793c1ac49 upstream. The purpose of numa_zonelist_order=zone is to preserve lower zones for use with 32-bit devices. If locality is preferred then the numa_zonelist_order=node policy should be used. Unfortunately, the fair zone allocation policy overrides this by skipping zones on remote nodes until the lower one is found. While this makes sense from a page aging and performance perspective, it breaks the expected zonelist policy. This patch restores the expected behaviour for zone-list ordering. Signed-off-by: Mel Gorman Acked-by: Johannes Weiner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Jiri Slaby commit fc915114c8369c660b6876b06519902de3bd10c6 Author: Mel Gorman Date: Thu Aug 28 19:35:43 2014 +0100 mm: vmscan: only update per-cpu thresholds for online CPU commit bb0b6dffa2ccfbd9747ad0cc87c7459622896e60 upstream. When kswapd is awake reclaiming, the per-cpu stat thresholds are lowered to get more accurate counts to avoid breaching watermarks. This threshold update iterates over all possible CPUs which is unnecessary. Only online CPUs need to be updated. If a new CPU is onlined, refresh_zone_stat_thresholds() will set the thresholds correctly. Signed-off-by: Mel Gorman Acked-by: Johannes Weiner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Jiri Slaby commit 4a4ede23dd902513b3a17d3e61cef9baf650d33e Author: Mel Gorman Date: Thu Aug 28 19:35:42 2014 +0100 mm: move zone->pages_scanned into a vmstat counter commit 0d5d823ab4e608ec7b52ac4410de4cb74bbe0edd upstream. zone->pages_scanned is a write-intensive cache line during page reclaim and it's also updated during page free. Move the counter into vmstat to take advantage of the per-cpu updates and do not update it in the free paths unless necessary. On a small UMA machine running tiobench the difference is marginal. On a 4-node machine the overhead is more noticable. Note that automatic NUMA balancing was disabled for this test as otherwise the system CPU overhead is unpredictable. 3.16.0-rc3 3.16.0-rc3 3.16.0-rc3 vanillarearrange-v5 vmstat-v5 User 746.94 759.78 774.56 System 65336.22 58350.98 32847.27 Elapsed 27553.52 27282.02 27415.04 Note that the overhead reduction will vary depending on where exactly pages are allocated and freed. Signed-off-by: Mel Gorman Acked-by: Johannes Weiner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Jiri Slaby commit b4fc580f75325271de2841891bb5816cea5ca101 Author: Mel Gorman Date: Thu Aug 28 19:35:41 2014 +0100 mm: rearrange zone fields into read-only, page alloc, statistics and page reclaim lines commit 3484b2de9499df23c4604a513b36f96326ae81ad upstream. The arrangement of struct zone has changed over time and now it has reached the point where there is some inappropriate sharing going on. On x86-64 for example o The zone->node field is shared with the zone lock and zone->node is accessed frequently from the page allocator due to the fair zone allocation policy. o span_seqlock is almost never used by shares a line with free_area o Some zone statistics share a cache line with the LRU lock so reclaim-intensive and allocator-intensive workloads can bounce the cache line on a stat update This patch rearranges struct zone to put read-only and read-mostly fields together and then splits the page allocator intensive fields, the zone statistics and the page reclaim intensive fields into their own cache lines. Note that the type of lowmem_reserve changes due to the watermark calculations being signed and avoiding a signed/unsigned conversion there. On the test configuration I used the overall size of struct zone shrunk by one cache line. On smaller machines, this is not likely to be noticable. However, on a 4-node NUMA machine running tiobench the system CPU overhead is reduced by this patch. 3.16.0-rc3 3.16.0-rc3 vanillarearrange-v5r9 User 746.94 759.78 System 65336.22 58350.98 Elapsed 27553.52 27282.02 Signed-off-by: Mel Gorman Acked-by: Johannes Weiner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Jiri Slaby commit 59d4b11371a0a23eef41138c122074aabaeaff4b Author: Mel Gorman Date: Thu Aug 28 19:35:40 2014 +0100 mm: pagemap: avoid unnecessary overhead when tracepoints are deactivated commit 24b7e5819ad5cbef2b7c7376510862aa8319d240 upstream. This was formerly the series "Improve sequential read throughput" which noted some major differences in performance of tiobench since 3.0. While there are a number of factors, two that dominated were the introduction of the fair zone allocation policy and changes to CFQ. The behaviour of fair zone allocation policy makes more sense than tiobench as a benchmark and CFQ defaults were not changed due to insufficient benchmarking. This series is what's left. It's one functional fix to the fair zone allocation policy when used on NUMA machines and a reduction of overhead in general. tiobench was used for the comparison despite its flaws as an IO benchmark as in this case we are primarily interested in the overhead of page allocator and page reclaim activity. On UMA, it makes little difference to overhead 3.16.0-rc3 3.16.0-rc3 vanilla lowercost-v5 User 383.61 386.77 System 403.83 401.74 Elapsed 5411.50 5413.11 On a 4-socket NUMA machine it's a bit more noticable 3.16.0-rc3 3.16.0-rc3 vanilla lowercost-v5 User 746.94 802.00 System 65336.22 40852.33 Elapsed 27553.52 27368.46 This patch (of 6): The LRU insertion and activate tracepoints take PFN as a parameter forcing the overhead to the caller. Move the overhead to the tracepoint fast-assign method to ensure the cost is only incurred when the tracepoint is active. Signed-off-by: Mel Gorman Acked-by: Johannes Weiner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Jiri Slaby commit 04bab05a95fece32015d897d4058880bbb5c65eb Author: Jerome Marchand Date: Thu Aug 28 19:35:39 2014 +0100 memcg, vmscan: Fix forced scan of anonymous pages commit 2ab051e11bfa3cbb7b24177f3d6aaed10a0d743e upstream. When memory cgoups are enabled, the code that decides to force to scan anonymous pages in get_scan_count() compares global values (free, high_watermark) to a value that is restricted to a memory cgroup (file). It make the code over-eager to force anon scan. For instance, it will force anon scan when scanning a memcg that is mainly populated by anonymous page, even when there is plenty of file pages to get rid of in others memcgs, even when swappiness == 0. It breaks user's expectation about swappiness and hurts performance. This patch makes sure that forced anon scan only happens when there not enough file pages for the all zone, not just in one random memcg. [hannes@cmpxchg.org: cleanups] Signed-off-by: Jerome Marchand Acked-by: Michal Hocko Acked-by: Johannes Weiner Reviewed-by: Rik van Riel Cc: Mel Gorman Signed-off-by: Johannes Weiner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Jiri Slaby commit 788a2f69f9b8b77b30ace8d1ef9380fa4ea5c6ec Author: Joonsoo Kim Date: Thu Aug 28 19:35:38 2014 +0100 vmalloc: use rcu list iterator to reduce vmap_area_lock contention commit 474750aba88817c53f39424e5567b8e4acc4b39b upstream. Richard Yao reported a month ago that his system have a trouble with vmap_area_lock contention during performance analysis by /proc/meminfo. Andrew asked why his analysis checks /proc/meminfo stressfully, but he didn't answer it. https://lkml.org/lkml/2014/4/10/416 Although I'm not sure that this is right usage or not, there is a solution reducing vmap_area_lock contention with no side-effect. That is just to use rcu list iterator in get_vmalloc_info(). rcu can be used in this function because all RCU protocol is already respected by writers, since Nick Piggin commit db64fe02258f1 ("mm: rewrite vmap layer") back in linux-2.6.28 Specifically : insertions use list_add_rcu(), deletions use list_del_rcu() and kfree_rcu(). Note the rb tree is not used from rcu reader (it would not be safe), only the vmap_area_list has full RCU protection. Note that __purge_vmap_area_lazy() already uses this rcu protection. rcu_read_lock(); list_for_each_entry_rcu(va, &vmap_area_list, list) { if (va->flags & VM_LAZY_FREE) { if (va->va_start < *start) *start = va->va_start; if (va->va_end > *end) *end = va->va_end; nr += (va->va_end - va->va_start) >> PAGE_SHIFT; list_add_tail(&va->purge_list, &valist); va->flags |= VM_LAZY_FREEING; va->flags &= ~VM_LAZY_FREE; } } rcu_read_unlock(); Peter: : While rcu list traversal over the vmap_area_list is safe, this may : arrive at different results than the spinlocked version. The rcu list : traversal version will not be a 'snapshot' of a single, valid instant : of the entire vmap_area_list, but rather a potential amalgam of : different list states. Joonsoo: : Yes, you are right, but I don't think that we should be strict here. : Meminfo is already not a 'snapshot' at specific time. While we try to get : certain stats, the other stats can change. And, although we may arrive at : different results than the spinlocked version, the difference would not be : large and would not make serious side-effect. [edumazet@google.com: add more commit description] Signed-off-by: Joonsoo Kim Reported-by: Richard Yao Acked-by: Eric Dumazet Cc: Peter Hurley Cc: Zhang Yanfei Cc: Johannes Weiner Cc: Andi Kleen Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Jiri Slaby commit 84a6c7694aad1e1fe41ee7f66b9142e6c6b0347d Author: Jerome Marchand Date: Thu Aug 28 19:35:37 2014 +0100 mm: make copy_pte_range static again commit 21bda264f4243f61dfcc485174055f12ad0530b4 upstream. Commit 71e3aac0724f ("thp: transparent hugepage core") adds copy_pte_range prototype to huge_mm.h. I'm not sure why (or if) this function have been used outside of memory.c, but it currently isn't. This patch makes copy_pte_range() static again. Signed-off-by: Jerome Marchand Acked-by: David Rientjes Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Jiri Slaby commit 6a01f8dd2508cf79abbdccc44a6a41b2e17fb3cb Author: David Rientjes Date: Thu Aug 28 19:35:36 2014 +0100 mm, thp: only collapse hugepages to nodes with affinity for zone_reclaim_mode commit 14a4e2141e24304fff2c697be6382ffb83888185 upstream. Commit 9f1b868a13ac ("mm: thp: khugepaged: add policy for finding target node") improved the previous khugepaged logic which allocated a transparent hugepages from the node of the first page being collapsed. However, it is still possible to collapse pages to remote memory which may suffer from additional access latency. With the current policy, it is possible that 255 pages (with PAGE_SHIFT == 12) will be collapsed remotely if the majority are allocated from that node. When zone_reclaim_mode is enabled, it means the VM should make every attempt to allocate locally to prevent NUMA performance degradation. In this case, we do not want to collapse hugepages to remote nodes that would suffer from increased access latency. Thus, when zone_reclaim_mode is enabled, only allow collapsing to nodes with RECLAIM_DISTANCE or less. There is no functional change for systems that disable zone_reclaim_mode. Signed-off-by: David Rientjes Cc: Dave Hansen Cc: Andrea Arcangeli Acked-by: Vlastimil Babka Acked-by: Mel Gorman Cc: Rik van Riel Cc: "Kirill A. Shutemov" Cc: Bob Liu Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Jiri Slaby commit 1d08848674752a433634f5c144500447de8ab727 Author: Hugh Dickins Date: Thu Aug 28 19:35:35 2014 +0100 mm/memory.c: use entry = ACCESS_ONCE(*pte) in handle_pte_fault() commit c0d73261f5c1355a35b8b40e871d31578ce0c044 upstream. Use ACCESS_ONCE() in handle_pte_fault() when getting the entry or orig_pte upon which all subsequent decisions and pte_same() tests will be made. I have no evidence that its lack is responsible for the mm/filemap.c:202 BUG_ON(page_mapped(page)) in __delete_from_page_cache() found by trinity, and I am not optimistic that it will fix it. But I have found no other explanation, and ACCESS_ONCE() here will surely not hurt. If gcc does re-access the pte before passing it down, then that would be disastrous for correct page fault handling, and certainly could explain the page_mapped() BUGs seen (concurrent fault causing page to be mapped in a second time on top of itself: mapcount 2 for a single pte). Signed-off-by: Hugh Dickins Cc: Sasha Levin Cc: Linus Torvalds Cc: "Kirill A. Shutemov" Cc: Konstantin Khlebnikov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Jiri Slaby commit 335886a3689bb3be6bc37f6601006ee20ccd80a0 Author: Hugh Dickins Date: Thu Aug 28 19:35:34 2014 +0100 shmem: fix init_page_accessed use to stop !PageLRU bug commit 66d2f4d28cd030220e7ea2a628993fcabcb956d1 upstream. Under shmem swapping load, I sometimes hit the VM_BUG_ON_PAGE(!PageLRU) in isolate_lru_pages() at mm/vmscan.c:1281! Commit 2457aec63745 ("mm: non-atomically mark page accessed during page cache allocation where possible") looks like interrupted work-in-progress. mm/filemap.c's call to init_page_accessed() is fine, but not mm/shmem.c's - shmem_write_begin() is clearly wrong to use it after shmem_getpage(), when the page is always visible in radix_tree, and often already on LRU. Revert change to shmem_write_begin(), and use init_page_accessed() or mark_page_accessed() appropriately for SGP_WRITE in shmem_getpage_gfp(). SGP_WRITE also covers shmem_symlink(), which did not mark_page_accessed() before; but since many other filesystems use [__]page_symlink(), which did and does mark the page accessed, consider this as rectifying an oversight. Signed-off-by: Hugh Dickins Acked-by: Mel Gorman Cc: Johannes Weiner Cc: Vlastimil Babka Cc: Michal Hocko Cc: Dave Hansen Cc: Prabhakar Lad Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Jiri Slaby commit 9057c212b93709fa54e62d434502a64591cfa5e6 Author: Mel Gorman Date: Thu Aug 28 19:35:33 2014 +0100 mm: avoid unnecessary atomic operations during end_page_writeback() commit 888cf2db475a256fb0cda042140f73d7881f81fe upstream. If a page is marked for immediate reclaim then it is moved to the tail of the LRU list. This occurs when the system is under enough memory pressure for pages under writeback to reach the end of the LRU but we test for this using atomic operations on every writeback. This patch uses an optimistic non-atomic test first. It'll miss some pages in rare cases but the consequences are not severe enough to warrant such a penalty. While the function does not dominate profiles during a simple dd test the cost of it is reduced. 73048 0.7428 vmlinux-3.15.0-rc5-mmotm-20140513 end_page_writeback 23740 0.2409 vmlinux-3.15.0-rc5-lessatomic end_page_writeback Signed-off-by: Mel Gorman Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Jiri Slaby commit d618a27c7808608e376de803a4fd3940f33776c2 Author: Mel Gorman Date: Thu Aug 28 19:35:32 2014 +0100 mm: non-atomically mark page accessed during page cache allocation where possible commit 2457aec63745e235bcafb7ef312b182d8682f0fc upstream. aops->write_begin may allocate a new page and make it visible only to have mark_page_accessed called almost immediately after. Once the page is visible the atomic operations are necessary which is noticable overhead when writing to an in-memory filesystem like tmpfs but should also be noticable with fast storage. The objective of the patch is to initialse the accessed information with non-atomic operations before the page is visible. The bulk of filesystems directly or indirectly use grab_cache_page_write_begin or find_or_create_page for the initial allocation of a page cache page. This patch adds an init_page_accessed() helper which behaves like the first call to mark_page_accessed() but may called before the page is visible and can be done non-atomically. The primary APIs of concern in this care are the following and are used by most filesystems. find_get_page find_lock_page find_or_create_page grab_cache_page_nowait grab_cache_page_write_begin All of them are very similar in detail to the patch creates a core helper pagecache_get_page() which takes a flags parameter that affects its behavior such as whether the page should be marked accessed or not. Then old API is preserved but is basically a thin wrapper around this core function. Each of the filesystems are then updated to avoid calling mark_page_accessed when it is known that the VM interfaces have already done the job. There is a slight snag in that the timing of the mark_page_accessed() has now changed so in rare cases it's possible a page gets to the end of the LRU as PageReferenced where as previously it might have been repromoted. This is expected to be rare but it's worth the filesystem people thinking about it in case they see a problem with the timing change. It is also the case that some filesystems may be marking pages accessed that previously did not but it makes sense that filesystems have consistent behaviour in this regard. The test case used to evaulate this is a simple dd of a large file done multiple times with the file deleted on each iterations. The size of the file is 1/10th physical memory to avoid dirty page balancing. In the async case it will be possible that the workload completes without even hitting the disk and will have variable results but highlight the impact of mark_page_accessed for async IO. The sync results are expected to be more stable. The exception is tmpfs where the normal case is for the "IO" to not hit the disk. The test machine was single socket and UMA to avoid any scheduling or NUMA artifacts. Throughput and wall times are presented for sync IO, only wall times are shown for async as the granularity reported by dd and the variability is unsuitable for comparison. As async results were variable do to writback timings, I'm only reporting the maximum figures. The sync results were stable enough to make the mean and stddev uninteresting. The performance results are reported based on a run with no profiling. Profile data is based on a separate run with oprofile running. async dd 3.15.0-rc3 3.15.0-rc3 vanilla accessed-v2 ext3 Max elapsed 13.9900 ( 0.00%) 11.5900 ( 17.16%) tmpfs Max elapsed 0.5100 ( 0.00%) 0.4900 ( 3.92%) btrfs Max elapsed 12.8100 ( 0.00%) 12.7800 ( 0.23%) ext4 Max elapsed 18.6000 ( 0.00%) 13.3400 ( 28.28%) xfs Max elapsed 12.5600 ( 0.00%) 2.0900 ( 83.36%) The XFS figure is a bit strange as it managed to avoid a worst case by sheer luck but the average figures looked reasonable. samples percentage ext3 86107 0.9783 vmlinux-3.15.0-rc4-vanilla mark_page_accessed ext3 23833 0.2710 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed ext3 5036 0.0573 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed ext4 64566 0.8961 vmlinux-3.15.0-rc4-vanilla mark_page_accessed ext4 5322 0.0713 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed ext4 2869 0.0384 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed xfs 62126 1.7675 vmlinux-3.15.0-rc4-vanilla mark_page_accessed xfs 1904 0.0554 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed xfs 103 0.0030 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed btrfs 10655 0.1338 vmlinux-3.15.0-rc4-vanilla mark_page_accessed btrfs 2020 0.0273 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed btrfs 587 0.0079 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed tmpfs 59562 3.2628 vmlinux-3.15.0-rc4-vanilla mark_page_accessed tmpfs 1210 0.0696 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed tmpfs 94 0.0054 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed [akpm@linux-foundation.org: don't run init_page_accessed() against an uninitialised pointer] Signed-off-by: Mel Gorman Cc: Johannes Weiner Cc: Vlastimil Babka Cc: Jan Kara Cc: Michal Hocko Cc: Hugh Dickins Cc: Dave Hansen Cc: Theodore Ts'o Cc: "Paul E. McKenney" Cc: Oleg Nesterov Cc: Rik van Riel Cc: Peter Zijlstra Tested-by: Prabhakar Lad Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Jiri Slaby commit 967e64285aef3670019bd05bef74085d9ea98104 Author: Mel Gorman Date: Thu Aug 28 19:35:31 2014 +0100 fs: buffer: do not use unnecessary atomic operations when discarding buffers commit e7470ee89f003634a88e7b5e5a7b65b3025987de upstream. Discarding buffers uses a bunch of atomic operations when discarding buffers because ...... I can't think of a reason. Use a cmpxchg loop to clear all the necessary flags. In most (all?) cases this will be a single atomic operations. [akpm@linux-foundation.org: move BUFFER_FLAGS_DISCARD into the .c file] Signed-off-by: Mel Gorman Cc: Johannes Weiner Cc: Vlastimil Babka Cc: Jan Kara Cc: Michal Hocko Cc: Hugh Dickins Cc: Dave Hansen Cc: Theodore Ts'o Cc: "Paul E. McKenney" Cc: Oleg Nesterov Cc: Rik van Riel Cc: Peter Zijlstra Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Jiri Slaby commit 6ffef5d8bfc16845e25a7ee784426382b5c82c20 Author: Mel Gorman Date: Thu Aug 28 19:35:30 2014 +0100 mm: do not use unnecessary atomic operations when adding pages to the LRU commit 6fb81a17d21f2a138b8f424af4cf379f2b694060 upstream. When adding pages to the LRU we clear the active bit unconditionally. As the page could be reachable from other paths we cannot use unlocked operations without risk of corruption such as a parallel mark_page_accessed. This patch tests if is necessary to clear the active flag before using an atomic operation. This potentially opens a tiny race when PageActive is checked as mark_page_accessed could be called after PageActive was checked. The race already exists but this patch changes it slightly. The consequence is that that the page may be promoted to the active list that might have been left on the inactive list before the patch. It's too tiny a race and too marginal a consequence to always use atomic operations for. Signed-off-by: Mel Gorman Acked-by: Johannes Weiner Cc: Vlastimil Babka Cc: Jan Kara Cc: Michal Hocko Cc: Hugh Dickins Cc: Dave Hansen Cc: Theodore Ts'o Cc: "Paul E. McKenney" Cc: Oleg Nesterov Cc: Rik van Riel Cc: Peter Zijlstra Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Jiri Slaby commit fa6d2dd2223cafd36fcd901d83eed2f134c8b793 Author: Mel Gorman Date: Thu Aug 28 19:35:29 2014 +0100 mm: do not use atomic operations when releasing pages commit e3741b506c5088fa8c911bb5884c430f770fb49d upstream. There should be no references to it any more and a parallel mark should not be reordered against us. Use non-locked varient to clear page active. Signed-off-by: Mel Gorman Acked-by: Rik van Riel Cc: Johannes Weiner Cc: Vlastimil Babka Cc: Jan Kara Cc: Michal Hocko Cc: Hugh Dickins Cc: Dave Hansen Cc: Theodore Ts'o Cc: "Paul E. McKenney" Cc: Oleg Nesterov Cc: Peter Zijlstra Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Jiri Slaby commit 6d3e133532cd49fa925fe5c2a949c5900afd601c Author: Mel Gorman Date: Thu Aug 28 19:35:28 2014 +0100 mm: shmem: avoid atomic operation during shmem_getpage_gfp commit 07a427884348d38a6fd56fa4d78249c407196650 upstream. shmem_getpage_gfp uses an atomic operation to set the SwapBacked field before it's even added to the LRU or visible. This is unnecessary as what could it possible race against? Use an unlocked variant. Signed-off-by: Mel Gorman Acked-by: Johannes Weiner Acked-by: Rik van Riel Cc: Vlastimil Babka Cc: Jan Kara Cc: Michal Hocko Cc: Hugh Dickins Cc: Dave Hansen Cc: Theodore Ts'o Cc: "Paul E. McKenney" Cc: Oleg Nesterov Cc: Peter Zijlstra Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Jiri Slaby commit f161eedc71da293a9bcfcf3d7f6c1da070a61ef0 Author: Mel Gorman Date: Thu Aug 28 19:35:27 2014 +0100 mm: page_alloc: lookup pageblock migratetype with IRQs enabled during free commit cfc47a2803db42140167b92d991ef04018e162c7 upstream. get_pageblock_migratetype() is called during free with IRQs disabled. This is unnecessary and disables IRQs for longer than necessary. Signed-off-by: Mel Gorman Acked-by: Rik van Riel Cc: Johannes Weiner Acked-by: Vlastimil Babka Cc: Jan Kara Cc: Michal Hocko Cc: Hugh Dickins Cc: Dave Hansen Cc: Theodore Ts'o Cc: "Paul E. McKenney" Cc: Oleg Nesterov Cc: Peter Zijlstra Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Jiri Slaby commit 3e7379c0f4fae4e35784a1a3954bc43683b86308 Author: Mel Gorman Date: Thu Aug 28 19:35:26 2014 +0100 mm: page_alloc: convert hot/cold parameter and immediate callers to bool commit b745bc85f21ea707e4ea1a91948055fa3e72c77b upstream. cold is a bool, make it one. Make the likely case the "if" part of the block instead of the else as according to the optimisation manual this is preferred. Signed-off-by: Mel Gorman Acked-by: Rik van Riel Cc: Johannes Weiner Cc: Vlastimil Babka Cc: Jan Kara Cc: Michal Hocko Cc: Hugh Dickins Cc: Dave Hansen Cc: Theodore Ts'o Cc: "Paul E. McKenney" Cc: Oleg Nesterov Cc: Peter Zijlstra Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Jiri Slaby commit c01947d6dfa1a3fee7bd54523dd59414e1d1fefc Author: Mel Gorman Date: Thu Aug 28 19:35:25 2014 +0100 mm: page_alloc: reduce number of times page_to_pfn is called commit dc4b0caff24d9b2918e9f27bc65499ee63187eba upstream. In the free path we calculate page_to_pfn multiple times. Reduce that. Signed-off-by: Mel Gorman Acked-by: Rik van Riel Cc: Johannes Weiner Acked-by: Vlastimil Babka Cc: Jan Kara Cc: Michal Hocko Cc: Hugh Dickins Cc: Dave Hansen Cc: Theodore Ts'o Cc: "Paul E. McKenney" Cc: Oleg Nesterov Cc: Peter Zijlstra Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Jiri Slaby commit da530fd87d14f640f23d1ef51333951493f508af Author: Mel Gorman Date: Thu Aug 28 19:35:24 2014 +0100 mm: page_alloc: use unsigned int for order in more places commit 7aeb09f9104b760fc53c98cb7d20d06640baf9e6 upstream. X86 prefers the use of unsigned types for iterators and there is a tendency to mix whether a signed or unsigned type if used for page order. This converts a number of sites in mm/page_alloc.c to use unsigned int for order where possible. Signed-off-by: Mel Gorman Acked-by: Rik van Riel Cc: Johannes Weiner Cc: Vlastimil Babka Cc: Jan Kara Cc: Michal Hocko Cc: Hugh Dickins Cc: Dave Hansen Cc: Theodore Ts'o Cc: "Paul E. McKenney" Cc: Oleg Nesterov Cc: Peter Zijlstra Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Jiri Slaby commit 35515de99f95c55f6c416fc7cc2b16832b9f58ee Author: Mel Gorman Date: Thu Aug 28 19:35:23 2014 +0100 mm: page_alloc: take the ALLOC_NO_WATERMARK check out of the fast path commit 5dab29113ca56335c78be3f98bf5ddf2ef8eb6a6 upstream. ALLOC_NO_WATERMARK is set in a few cases. Always by kswapd, always for __GFP_MEMALLOC, sometimes for swap-over-nfs, tasks etc. Each of these cases are relatively rare events but the ALLOC_NO_WATERMARK check is an unlikely branch in the fast path. This patch moves the check out of the fast path and after it has been determined that the watermarks have not been met. This helps the common fast path at the cost of making the slow path slower and hitting kswapd with a performance cost. It's a reasonable tradeoff. Signed-off-by: Mel Gorman Acked-by: Johannes Weiner Reviewed-by: Rik van Riel Cc: Vlastimil Babka Cc: Jan Kara Cc: Michal Hocko Cc: Hugh Dickins Cc: Dave Hansen Cc: Theodore Ts'o Cc: "Paul E. McKenney" Cc: Oleg Nesterov Cc: Peter Zijlstra Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Jiri Slaby commit d6217476e187b28df3c99100e6f9a054b6ba80c6 Author: Mel Gorman Date: Thu Aug 28 19:35:22 2014 +0100 mm: page_alloc: only check the alloc flags and gfp_mask for dirty once commit a6e21b14f22041382e832d30deda6f26f37b1097 upstream. Currently it's calculated once per zone in the zonelist. Signed-off-by: Mel Gorman Acked-by: Johannes Weiner Reviewed-by: Rik van Riel Cc: Vlastimil Babka Cc: Jan Kara Cc: Michal Hocko Cc: Hugh Dickins Cc: Dave Hansen Cc: Theodore Ts'o Cc: "Paul E. McKenney" Cc: Oleg Nesterov Cc: Peter Zijlstra Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Jiri Slaby commit 6a85a5ad29a2582ae347ac3f38baa9e2e5daab5b Author: Mel Gorman Date: Thu Aug 28 19:35:21 2014 +0100 mm: page_alloc: only check the zone id check if pages are buddies commit d34c5fa06fade08a689fc171bf756fba2858ae73 upstream. A node/zone index is used to check if pages are compatible for merging but this happens unconditionally even if the buddy page is not free. Defer the calculation as long as possible. Ideally we would check the zone boundary but nodes can overlap. Signed-off-by: Mel Gorman Acked-by: Johannes Weiner Acked-by: Rik van Riel Cc: Vlastimil Babka Cc: Jan Kara Cc: Michal Hocko Cc: Hugh Dickins Cc: Dave Hansen Cc: Theodore Ts'o Cc: "Paul E. McKenney" Cc: Oleg Nesterov Cc: Peter Zijlstra Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Jiri Slaby commit dc2786f0c19a779395ef69189dd5e7df2573b29b Author: Mel Gorman Date: Thu Aug 28 19:35:20 2014 +0100 mm: page_alloc: calculate classzone_idx once from the zonelist ref commit d8846374a85f4290a473a4e2a64c1ba046c4a0e1 upstream. There is no need to calculate zone_idx(preferred_zone) multiple times or use the pgdat to figure it out. Signed-off-by: Mel Gorman Acked-by: Rik van Riel Acked-by: David Rientjes Cc: Johannes Weiner Cc: Vlastimil Babka Cc: Jan Kara Cc: Michal Hocko Cc: Hugh Dickins Cc: Dave Hansen Cc: Theodore Ts'o Cc: "Paul E. McKenney" Cc: Oleg Nesterov Cc: Peter Zijlstra Cc: Dan Carpenter Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Jiri Slaby commit ee1760b2b4920841f45b2ad07a1c9f99e08568e7 Author: Mel Gorman Date: Thu Aug 28 19:35:19 2014 +0100 mm: page_alloc: use jump labels to avoid checking number_of_cpusets commit 664eeddeef6539247691197c1ac124d4aa872ab6 upstream. If cpusets are not in use then we still check a global variable on every page allocation. Use jump labels to avoid the overhead. Signed-off-by: Mel Gorman Reviewed-by: Rik van Riel Cc: Johannes Weiner Cc: Vlastimil Babka Cc: Jan Kara Cc: Michal Hocko Cc: Hugh Dickins Cc: Dave Hansen Cc: Theodore Ts'o Cc: "Paul E. McKenney" Cc: Oleg Nesterov Cc: Peter Zijlstra Cc: Stephen Rothwell Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Jiri Slaby commit f99bfd27bda2e71932f41115098aacb16263988c Author: Mel Gorman Date: Thu Aug 28 19:35:18 2014 +0100 include/linux/jump_label.h: expose the reference count commit ea5e9539abf1258f23e725cb9cb25aa74efa29eb upstream. This patch exposes the jump_label reference count in preparation for the next patch. cpusets cares about both the jump_label being enabled and how many users of the cpusets there currently are. Signed-off-by: Peter Zijlstra Signed-off-by: Mel Gorman Cc: Johannes Weiner Cc: Vlastimil Babka Cc: Jan Kara Cc: Michal Hocko Cc: Hugh Dickins Cc: Dave Hansen Cc: Theodore Ts'o Cc: "Paul E. McKenney" Cc: Oleg Nesterov Cc: Rik van Riel Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Jiri Slaby commit 605cf7a17583cbbdbb58a99c6752133ad0f5c378 Author: Mel Gorman Date: Thu Aug 28 19:35:17 2014 +0100 mm: page_alloc: do not treat a zone that cannot be used for dirty pages as "full" commit 800a1e750c7b04c2aa2459afca77e936e01c0029 upstream. If a zone cannot be used for a dirty page then it gets marked "full" which is cached in the zlc and later potentially skipped by allocation requests that have nothing to do with dirty zones. Signed-off-by: Mel Gorman Acked-by: Johannes Weiner Reviewed-by: Rik van Riel Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Jiri Slaby commit 55dcadc2753639b07efed5d8baafeab47d4613a7 Author: Mel Gorman Date: Thu Aug 28 19:35:16 2014 +0100 mm: page_alloc: do not update zlc unless the zlc is active commit 65bb371984d6a2c909244eb749e482bb40b72e36 upstream. The zlc is used on NUMA machines to quickly skip over zones that are full. However it is always updated, even for the first zone scanned when the zlc might not even be active. As it's a write to a bitmap that potentially bounces cache line it's deceptively expensive and most machines will not care. Only update the zlc if it was active. Signed-off-by: Mel Gorman Acked-by: Johannes Weiner Reviewed-by: Rik van Riel Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Jiri Slaby commit 69aa12f2961a0310d58375815fa391e5528a79b3 Author: Jianyu Zhan Date: Thu Aug 28 19:35:15 2014 +0100 mm/swap.c: clean up *lru_cache_add* functions commit 2329d3751b082b4fd354f334a88662d72abac52d upstream. In mm/swap.c, __lru_cache_add() is exported, but actually there are no users outside this file. This patch unexports __lru_cache_add(), and makes it static. It also exports lru_cache_add_file(), as it is use by cifs and fuse, which can loaded as modules. Signed-off-by: Jianyu Zhan Cc: Minchan Kim Cc: Johannes Weiner Cc: Shaohua Li Cc: Bob Liu Cc: Seth Jennings Cc: Joonsoo Kim Cc: Rafael Aquini Cc: Mel Gorman Acked-by: Rik van Riel Cc: Andrea Arcangeli Cc: Khalid Aziz Cc: Christoph Hellwig Reviewed-by: Zhang Yanfei Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Jiri Slaby commit 4cd64dcede969cf30f47a6e6ba8e378e74d0790d Author: Vlastimil Babka Date: Thu Aug 28 19:35:14 2014 +0100 mm/page_alloc: prevent MIGRATE_RESERVE pages from being misplaced commit 5bcc9f86ef09a933255ee66bd899d4601785dad5 upstream. For the MIGRATE_RESERVE pages, it is useful when they do not get misplaced on free_list of other migratetype, otherwise they might get allocated prematurely and e.g. fragment the MIGRATE_RESEVE pageblocks. While this cannot be avoided completely when allocating new MIGRATE_RESERVE pageblocks in min_free_kbytes sysctl handler, we should prevent the misplacement where possible. Currently, it is possible for the misplacement to happen when a MIGRATE_RESERVE page is allocated on pcplist through rmqueue_bulk() as a fallback for other desired migratetype, and then later freed back through free_pcppages_bulk() without being actually used. This happens because free_pcppages_bulk() uses get_freepage_migratetype() to choose the free_list, and rmqueue_bulk() calls set_freepage_migratetype() with the *desired* migratetype and not the page's original MIGRATE_RESERVE migratetype. This patch fixes the problem by moving the call to set_freepage_migratetype() from rmqueue_bulk() down to __rmqueue_smallest() and __rmqueue_fallback() where the actual page's migratetype (e.g. from which free_list the page is taken from) is used. Note that this migratetype might be different from the pageblock's migratetype due to freepage stealing decisions. This is OK, as page stealing never uses MIGRATE_RESERVE as a fallback, and also takes care to leave all MIGRATE_CMA pages on the correct freelist. Therefore, as an additional benefit, the call to get_pageblock_migratetype() from rmqueue_bulk() when CMA is enabled, can be removed completely. This relies on the fact that MIGRATE_CMA pageblocks are created only during system init, and the above. The related is_migrate_isolate() check is also unnecessary, as memory isolation has other ways to move pages between freelists, and drain pcp lists containing pages that should be isolated. The buffered_rmqueue() can also benefit from calling get_freepage_migratetype() instead of get_pageblock_migratetype(). Signed-off-by: Vlastimil Babka Reported-by: Yong-Taek Lee Reported-by: Bartlomiej Zolnierkiewicz Suggested-by: Joonsoo Kim Acked-by: Joonsoo Kim Suggested-by: Mel Gorman Acked-by: Minchan Kim Cc: KOSAKI Motohiro Cc: Marek Szyprowski Cc: Hugh Dickins Cc: Rik van Riel Cc: Michal Nazarewicz Cc: "Wang, Yalin" Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Jiri Slaby commit 2ce666c175bb502cae050c97364acf48449b9d6a Author: Mel Gorman Date: Thu Aug 28 19:35:13 2014 +0100 mm: vmscan: use proportional scanning during direct reclaim and full scan at DEF_PRIORITY commit 1a501907bbea8e6ebb0b16cf6db9e9cbf1d2c813 upstream. Commit "mm: vmscan: obey proportional scanning requirements for kswapd" ensured that file/anon lists were scanned proportionally for reclaim from kswapd but ignored it for direct reclaim. The intent was to minimse direct reclaim latency but Yuanhan Liu pointer out that it substitutes one long stall for many small stalls and distorts aging for normal workloads like streaming readers/writers. Hugh Dickins pointed out that a side-effect of the same commit was that when one LRU list dropped to zero that the entirety of the other list was shrunk leading to excessive reclaim in memcgs. This patch scans the file/anon lists proportionally for direct reclaim to similarly age page whether reclaimed by kswapd or direct reclaim but takes care to abort reclaim if one LRU drops to zero after reclaiming the requested number of pages. Based on ext4 and using the Intel VM scalability test 3.15.0-rc5 3.15.0-rc5 shrinker proportion Unit lru-file-readonce elapsed 5.3500 ( 0.00%) 5.4200 ( -1.31%) Unit lru-file-readonce time_range 0.2700 ( 0.00%) 0.1400 ( 48.15%) Unit lru-file-readonce time_stddv 0.1148 ( 0.00%) 0.0536 ( 53.33%) Unit lru-file-readtwice elapsed 8.1700 ( 0.00%) 8.1700 ( 0.00%) Unit lru-file-readtwice time_range 0.4300 ( 0.00%) 0.2300 ( 46.51%) Unit lru-file-readtwice time_stddv 0.1650 ( 0.00%) 0.0971 ( 41.16%) The test cases are running multiple dd instances reading sparse files. The results are within the noise for the small test machine. The impact of the patch is more noticable from the vmstats 3.15.0-rc5 3.15.0-rc5 shrinker proportion Minor Faults 35154 36784 Major Faults 611 1305 Swap Ins 394 1651 Swap Outs 4394 5891 Allocation stalls 118616 44781 Direct pages scanned 4935171 4602313 Kswapd pages scanned 15921292 16258483 Kswapd pages reclaimed 15913301 16248305 Direct pages reclaimed 4933368 4601133 Kswapd efficiency 99% 99% Kswapd velocity 670088.047 682555.961 Direct efficiency 99% 99% Direct velocity 207709.217 193212.133 Percentage direct scans 23% 22% Page writes by reclaim 4858.000 6232.000 Page writes file 464 341 Page writes anon 4394 5891 Note that there are fewer allocation stalls even though the amount of direct reclaim scanning is very approximately the same. Signed-off-by: Mel Gorman Cc: Johannes Weiner Cc: Hugh Dickins Cc: Tim Chen Cc: Dave Chinner Tested-by: Yuanhan Liu Cc: Bob Liu Cc: Jan Kara Cc: Rik van Riel Cc: Al Viro Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Jiri Slaby commit 2d37a72e406e226656ff9009bb7913e6ff9c3025 Author: Tim Chen Date: Thu Aug 28 19:35:12 2014 +0100 fs/superblock: avoid locking counting inodes and dentries before reclaiming them commit d23da150a37c9fe3cc83dbaf71b3e37fd434ed52 upstream. We remove the call to grab_super_passive in call to super_cache_count. This becomes a scalability bottleneck as multiple threads are trying to do memory reclamation, e.g. when we are doing large amount of file read and page cache is under pressure. The cached objects quickly got reclaimed down to 0 and we are aborting the cache_scan() reclaim. But counting creates a log jam acquiring the sb_lock. We are holding the shrinker_rwsem which ensures the safety of call to list_lru_count_node() and s_op->nr_cached_objects. The shrinker is unregistered now before ->kill_sb() so the operation is safe when we are doing unmount. The impact will depend heavily on the machine and the workload but for a small machine using postmark tuned to use 4xRAM size the results were 3.15.0-rc5 3.15.0-rc5 vanilla shrinker-v1r1 Ops/sec Transactions 21.00 ( 0.00%) 24.00 ( 14.29%) Ops/sec FilesCreate 39.00 ( 0.00%) 44.00 ( 12.82%) Ops/sec CreateTransact 10.00 ( 0.00%) 12.00 ( 20.00%) Ops/sec FilesDeleted 6202.00 ( 0.00%) 6202.00 ( 0.00%) Ops/sec DeleteTransact 11.00 ( 0.00%) 12.00 ( 9.09%) Ops/sec DataRead/MB 25.97 ( 0.00%) 29.10 ( 12.05%) Ops/sec DataWrite/MB 49.99 ( 0.00%) 56.02 ( 12.06%) ffsb running in a configuration that is meant to simulate a mail server showed 3.15.0-rc5 3.15.0-rc5 vanilla shrinker-v1r1 Ops/sec readall 9402.63 ( 0.00%) 9567.97 ( 1.76%) Ops/sec create 4695.45 ( 0.00%) 4735.00 ( 0.84%) Ops/sec delete 173.72 ( 0.00%) 179.83 ( 3.52%) Ops/sec Transactions 14271.80 ( 0.00%) 14482.81 ( 1.48%) Ops/sec Read 37.00 ( 0.00%) 37.60 ( 1.62%) Ops/sec Write 18.20 ( 0.00%) 18.30 ( 0.55%) Signed-off-by: Tim Chen Signed-off-by: Mel Gorman Cc: Johannes Weiner Cc: Hugh Dickins Cc: Dave Chinner Tested-by: Yuanhan Liu Cc: Bob Liu Cc: Jan Kara Acked-by: Rik van Riel Cc: Al Viro Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Jiri Slaby commit 2f11e3a821aff1a50c8ebc2fadbeb22c1ea1adbd Author: Dave Chinner Date: Thu Aug 28 19:35:11 2014 +0100 fs/superblock: unregister sb shrinker before ->kill_sb() commit 28f2cd4f6da24a1aa06c226618ed5ad69e13df64 upstream. This series is aimed at regressions noticed during reclaim activity. The first two patches are shrinker patches that were posted ages ago but never merged for reasons that are unclear to me. I'm posting them again to see if there was a reason they were dropped or if they just got lost. Dave? Time? The last patch adjusts proportional reclaim. Yuanhan Liu, can you retest the vm scalability test cases on a larger machine? Hugh, does this work for you on the memcg test cases? Based on ext4, I get the following results but unfortunately my larger test machines are all unavailable so this is based on a relatively small machine. postmark 3.15.0-rc5 3.15.0-rc5 vanilla proportion-v1r4 Ops/sec Transactions 21.00 ( 0.00%) 25.00 ( 19.05%) Ops/sec FilesCreate 39.00 ( 0.00%) 45.00 ( 15.38%) Ops/sec CreateTransact 10.00 ( 0.00%) 12.00 ( 20.00%) Ops/sec FilesDeleted 6202.00 ( 0.00%) 6202.00 ( 0.00%) Ops/sec DeleteTransact 11.00 ( 0.00%) 12.00 ( 9.09%) Ops/sec DataRead/MB 25.97 ( 0.00%) 30.02 ( 15.59%) Ops/sec DataWrite/MB 49.99 ( 0.00%) 57.78 ( 15.58%) ffsb (mail server simulator) 3.15.0-rc5 3.15.0-rc5 vanilla proportion-v1r4 Ops/sec readall 9402.63 ( 0.00%) 9805.74 ( 4.29%) Ops/sec create 4695.45 ( 0.00%) 4781.39 ( 1.83%) Ops/sec delete 173.72 ( 0.00%) 177.23 ( 2.02%) Ops/sec Transactions 14271.80 ( 0.00%) 14764.37 ( 3.45%) Ops/sec Read 37.00 ( 0.00%) 38.50 ( 4.05%) Ops/sec Write 18.20 ( 0.00%) 18.50 ( 1.65%) dd of a large file 3.15.0-rc5 3.15.0-rc5 vanilla proportion-v1r4 WallTime DownloadTar 75.00 ( 0.00%) 61.00 ( 18.67%) WallTime DD 423.00 ( 0.00%) 401.00 ( 5.20%) WallTime Delete 2.00 ( 0.00%) 5.00 (-150.00%) stutter (times mmap latency during large amounts of IO) 3.15.0-rc5 3.15.0-rc5 vanilla proportion-v1r4 Unit >5ms Delays 80252.0000 ( 0.00%) 81523.0000 ( -1.58%) Unit Mmap min 8.2118 ( 0.00%) 8.3206 ( -1.33%) Unit Mmap mean 17.4614 ( 0.00%) 17.2868 ( 1.00%) Unit Mmap stddev 24.9059 ( 0.00%) 34.6771 (-39.23%) Unit Mmap max 2811.6433 ( 0.00%) 2645.1398 ( 5.92%) Unit Mmap 90% 20.5098 ( 0.00%) 18.3105 ( 10.72%) Unit Mmap 93% 22.9180 ( 0.00%) 20.1751 ( 11.97%) Unit Mmap 95% 25.2114 ( 0.00%) 22.4988 ( 10.76%) Unit Mmap 99% 46.1430 ( 0.00%) 43.5952 ( 5.52%) Unit Ideal Tput 85.2623 ( 0.00%) 78.8906 ( 7.47%) Unit Tput min 44.0666 ( 0.00%) 43.9609 ( 0.24%) Unit Tput mean 45.5646 ( 0.00%) 45.2009 ( 0.80%) Unit Tput stddev 0.9318 ( 0.00%) 1.1084 (-18.95%) Unit Tput max 46.7375 ( 0.00%) 46.7539 ( -0.04%) This patch (of 3): We will like to unregister the sb shrinker before ->kill_sb(). This will allow cached objects to be counted without call to grab_super_passive() to update ref count on sb. We want to avoid locking during memory reclamation especially when we are skipping the memory reclaim when we are out of cached objects. This is safe because grab_super_passive does a try-lock on the sb->s_umount now, and so if we are in the unmount process, it won't ever block. That means what used to be a deadlock and races we were avoiding by using grab_super_passive() is now: shrinker umount down_read(shrinker_rwsem) down_write(sb->s_umount) shrinker_unregister down_write(shrinker_rwsem) grab_super_passive(sb) down_read_trylock(sb->s_umount) .... up_read(shrinker_rwsem) up_write(shrinker_rwsem) ->kill_sb() .... So it is safe to deregister the shrinker before ->kill_sb(). Signed-off-by: Tim Chen Signed-off-by: Mel Gorman Cc: Johannes Weiner Cc: Hugh Dickins Cc: Dave Chinner Tested-by: Yuanhan Liu Cc: Bob Liu Cc: Jan Kara Acked-by: Rik van Riel Cc: Al Viro Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Jiri Slaby commit 94109cd2a5dbda0eb858ebe8d4d709150e4e04f0 Author: Hugh Dickins Date: Thu Aug 28 19:35:10 2014 +0100 mm: fix direct reclaim writeback regression commit 8bdd638091605dc66d92c57c4b80eb87fffc15f7 upstream. Shortly before 3.16-rc1, Dave Jones reported: WARNING: CPU: 3 PID: 19721 at fs/xfs/xfs_aops.c:971 xfs_vm_writepage+0x5ce/0x630 [xfs]() CPU: 3 PID: 19721 Comm: trinity-c61 Not tainted 3.15.0+ #3 Call Trace: xfs_vm_writepage+0x5ce/0x630 [xfs] shrink_page_list+0x8f9/0xb90 shrink_inactive_list+0x253/0x510 shrink_lruvec+0x563/0x6c0 shrink_zone+0x3b/0x100 shrink_zones+0x1f1/0x3c0 try_to_free_pages+0x164/0x380 __alloc_pages_nodemask+0x822/0xc90 alloc_pages_vma+0xaf/0x1c0 handle_mm_fault+0xa31/0xc50 etc. 970 if (WARN_ON_ONCE((current->flags & (PF_MEMALLOC|PF_KSWAPD)) == 971 PF_MEMALLOC)) I did not respond at the time, because a glance at the PageDirty block in shrink_page_list() quickly shows that this is impossible: we don't do writeback on file pages (other than tmpfs) from direct reclaim nowadays. Dave was hallucinating, but it would have been disrespectful to say so. However, my own /var/log/messages now shows similar complaints WARNING: CPU: 1 PID: 28814 at fs/ext4/inode.c:1881 ext4_writepage+0xa7/0x38b() WARNING: CPU: 0 PID: 27347 at fs/ext4/inode.c:1764 ext4_writepage+0xa7/0x38b() from stressing some mmotm trees during July. Could a dirty xfs or ext4 file page somehow get marked PageSwapBacked, so fail shrink_page_list()'s page_is_file_cache() test, and so proceed to mapping->a_ops->writepage()? Yes, 3.16-rc1's commit 68711a746345 ("mm, migration: add destination page freeing callback") has provided such a way to compaction: if migrating a SwapBacked page fails, its newpage may be put back on the list for later use with PageSwapBacked still set, and nothing will clear it. Whether that can do anything worse than issue WARN_ON_ONCEs, and get some statistics wrong, is unclear: easier to fix than to think through the consequences. Fixing it here, before the put_new_page(), addresses the bug directly, but is probably the worst place to fix it. Page migration is doing too many parts of the job on too many levels: fixing it in move_to_new_page() to complement its SetPageSwapBacked would be preferable, except why is it (and newpage->mapping and newpage->index) done there, rather than down in migrate_page_move_mapping(), once we are sure of success? Not a cleanup to get into right now, especially not with memcg cleanups coming in 3.17. Reported-by: Dave Jones Signed-off-by: Hugh Dickins Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Jiri Slaby commit c32064785ad77722a76437f7d0826f063158d3f2 Author: Shaohua Li Date: Thu Aug 28 19:35:09 2014 +0100 x86/mm: In the PTE swapout page reclaim case clear the accessed bit instead of flushing the TLB commit b13b1d2d8692b437203de7a404c6b809d2cc4d99 upstream. We use the accessed bit to age a page at page reclaim time, and currently we also flush the TLB when doing so. But in some workloads TLB flush overhead is very heavy. In my simple multithreaded app with a lot of swap to several pcie SSDs, removing the tlb flush gives about 20% ~ 30% swapout speedup. Fortunately just removing the TLB flush is a valid optimization: on x86 CPUs, clearing the accessed bit without a TLB flush doesn't cause data corruption. It could cause incorrect page aging and the (mistaken) reclaim of hot pages, but the chance of that should be relatively low. So as a performance optimization don't flush the TLB when clearing the accessed bit, it will eventually be flushed by a context switch or a VM operation anyway. [ In the rare event of it not getting flushed for a long time the delay shouldn't really matter because there's no real memory pressure for swapout to react to. ] Suggested-by: Linus Torvalds Signed-off-by: Shaohua Li Acked-by: Rik van Riel Acked-by: Mel Gorman Acked-by: Hugh Dickins Acked-by: Johannes Weiner Cc: linux-mm@kvack.org Cc: Peter Zijlstra Link: http://lkml.kernel.org/r/20140408075809.GA1764@kernel.org [ Rewrote the changelog and the code comments. ] Signed-off-by: Ingo Molnar Signed-off-by: Mel Gorman Signed-off-by: Jiri Slaby commit b25fd5de48e3ad079abd40a4ea946f126a8bceb0 Author: Vlastimil Babka Date: Thu Aug 28 19:35:08 2014 +0100 mm, compaction: properly signal and act upon lock and need_sched() contention commit be9765722e6b7ece8263cbab857490332339bd6f upstream. Compaction uses compact_checklock_irqsave() function to periodically check for lock contention and need_resched() to either abort async compaction, or to free the lock, schedule and retake the lock. When aborting, cc->contended is set to signal the contended state to the caller. Two problems have been identified in this mechanism. First, compaction also calls directly cond_resched() in both scanners when no lock is yet taken. This call either does not abort async compaction, or set cc->contended appropriately. This patch introduces a new compact_should_abort() function to achieve both. In isolate_freepages(), the check frequency is reduced to once by SWAP_CLUSTER_MAX pageblocks to match what the migration scanner does in the preliminary page checks. In case a pageblock is found suitable for calling isolate_freepages_block(), the checks within there are done on higher frequency. Second, isolate_freepages() does not check if isolate_freepages_block() aborted due to contention, and advances to the next pageblock. This violates the principle of aborting on contention, and might result in pageblocks not being scanned completely, since the scanning cursor is advanced. This problem has been noticed in the code by Joonsoo Kim when reviewing related patches. This patch makes isolate_freepages_block() check the cc->contended flag and abort. In case isolate_freepages() has already isolated some pages before aborting due to contention, page migration will proceed, which is OK since we do not want to waste the work that has been done, and page migration has own checks for contention. However, we do not want another isolation attempt by either of the scanners, so cc->contended flag check is added also to compaction_alloc() and compact_finished() to make sure compaction is aborted right after the migration. The outcome of the patch should be reduced lock contention by async compaction and lower latencies for higher-order allocations where direct compaction is involved. [akpm@linux-foundation.org: fix typo in comment] Reported-by: Joonsoo Kim Signed-off-by: Vlastimil Babka Reviewed-by: Naoya Horiguchi Cc: Minchan Kim Cc: Mel Gorman Cc: Bartlomiej Zolnierkiewicz Cc: Michal Nazarewicz Cc: Christoph Lameter Cc: Rik van Riel Acked-by: Michal Nazarewicz Tested-by: Shawn Guo Tested-by: Kevin Hilman Tested-by: Stephen Warren Tested-by: Fabio Estevam Cc: David Rientjes Cc: Stephen Rothwell Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Jiri Slaby commit 71a5b801344ae1f03e2fb6ddad600c554f243257 Author: Vlastimil Babka Date: Thu Aug 28 19:35:07 2014 +0100 mm/compaction: avoid rescanning pageblocks in isolate_freepages commit e9ade569910a82614ff5f2c2cea2b65a8d785da4 upstream. The compaction free scanner in isolate_freepages() currently remembers PFN of the highest pageblock where it successfully isolates, to be used as the starting pageblock for the next invocation. The rationale behind this is that page migration might return free pages to the allocator when migration fails and we don't want to skip them if the compaction continues. Since migration now returns free pages back to compaction code where they can be reused, this is no longer a concern. This patch changes isolate_freepages() so that the PFN for restarting is updated with each pageblock where isolation is attempted. Using stress-highalloc from mmtests, this resulted in 10% reduction of the pages scanned by the free scanner. Note that the somewhat similar functionality that records highest successful pageblock in zone->compact_cached_free_pfn, remains unchanged. This cache is used when the whole compaction is restarted, not for multiple invocations of the free scanner during single compaction. Signed-off-by: Vlastimil Babka Cc: Minchan Kim Cc: Mel Gorman Cc: Joonsoo Kim Cc: Bartlomiej Zolnierkiewicz Acked-by: Michal Nazarewicz Reviewed-by: Naoya Horiguchi Cc: Christoph Lameter Cc: Rik van Riel Acked-by: David Rientjes Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Jiri Slaby commit 5e4084a627820bd6a8d94bca6acb878e5308716c Author: Vlastimil Babka Date: Thu Aug 28 19:35:06 2014 +0100 mm/compaction: do not count migratepages when unnecessary commit f8c9301fa5a2a8b873c67f2a3d8230d5c13f61b7 upstream. During compaction, update_nr_listpages() has been used to count remaining non-migrated and free pages after a call to migrage_pages(). The freepages counting has become unneccessary, and it turns out that migratepages counting is also unnecessary in most cases. The only situation when it's needed to count cc->migratepages is when migrate_pages() returns with a negative error code. Otherwise, the non-negative return value is the number of pages that were not migrated, which is exactly the count of remaining pages in the cc->migratepages list. Furthermore, any non-zero count is only interesting for the tracepoint of mm_compaction_migratepages events, because after that all remaining unmigrated pages are put back and their count is set to 0. This patch therefore removes update_nr_listpages() completely, and changes the tracepoint definition so that the manual counting is done only when the tracepoint is enabled, and only when migrate_pages() returns a negative error code. Furthermore, migrate_pages() and the tracepoints won't be called when there's nothing to migrate. This potentially avoids some wasted cycles and reduces the volume of uninteresting mm_compaction_migratepages events where "nr_migrated=0 nr_failed=0". In the stress-highalloc mmtest, this was about 75% of the events. The mm_compaction_isolate_migratepages event is better for determining that nothing was isolated for migration, and this one was just duplicating the info. Signed-off-by: Vlastimil Babka Reviewed-by: Naoya Horiguchi Cc: Minchan Kim Cc: Mel Gorman Cc: Joonsoo Kim Cc: Bartlomiej Zolnierkiewicz Acked-by: Michal Nazarewicz Cc: Christoph Lameter Cc: Rik van Riel Acked-by: David Rientjes Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Jiri Slaby commit 0d9b79244228f7ce9d2b2d3a16418a1927669ace Author: David Rientjes Date: Thu Aug 28 19:35:05 2014 +0100 mm, compaction: terminate async compaction when rescheduling commit aeef4b83806f49a0c454b7d4578671b71045bee2 upstream. Async compaction terminates prematurely when need_resched(), see compact_checklock_irqsave(). This can never trigger, however, if the cond_resched() in isolate_migratepages_range() always takes care of the scheduling. If the cond_resched() actually triggers, then terminate this pageblock scan for async compaction as well. Signed-off-by: David Rientjes Acked-by: Mel Gorman Acked-by: Vlastimil Babka Cc: Mel Gorman Cc: Vlastimil Babka Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Jiri Slaby commit 812fcdf3758425af5bd8c8ec32ffdb7a9615e00c Author: David Rientjes Date: Thu Aug 28 19:35:04 2014 +0100 mm, compaction: embed migration mode in compact_control commit e0b9daeb453e602a95ea43853dc12d385558ce1f upstream. We're going to want to manipulate the migration mode for compaction in the page allocator, and currently compact_control's sync field is only a bool. Currently, we only do MIGRATE_ASYNC or MIGRATE_SYNC_LIGHT compaction depending on the value of this bool. Convert the bool to enum migrate_mode and pass the migration mode in directly. Later, we'll want to avoid MIGRATE_SYNC_LIGHT for thp allocations in the pagefault patch to avoid unnecessary latency. This also alters compaction triggered from sysfs, either for the entire system or for a node, to force MIGRATE_SYNC. [akpm@linux-foundation.org: fix build] [iamjoonsoo.kim@lge.com: use MIGRATE_SYNC in alloc_contig_range()] Signed-off-by: David Rientjes Suggested-by: Mel Gorman Acked-by: Vlastimil Babka Cc: Greg Thelen Cc: Naoya Horiguchi Signed-off-by: Joonsoo Kim Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Jiri Slaby commit 7e95430e4232f55bfd8f1728fb620957912c32f4 Author: David Rientjes Date: Thu Aug 28 19:35:03 2014 +0100 mm, compaction: add per-zone migration pfn cache for async compaction commit 35979ef3393110ff3c12c6b94552208d3bdf1a36 upstream. Each zone has a cached migration scanner pfn for memory compaction so that subsequent calls to memory compaction can start where the previous call left off. Currently, the compaction migration scanner only updates the per-zone cached pfn when pageblocks were not skipped for async compaction. This creates a dependency on calling sync compaction to avoid having subsequent calls to async compaction from scanning an enormous amount of non-MOVABLE pageblocks each time it is called. On large machines, this could be potentially very expensive. This patch adds a per-zone cached migration scanner pfn only for async compaction. It is updated everytime a pageblock has been scanned in its entirety and when no pages from it were successfully isolated. The cached migration scanner pfn for sync compaction is updated only when called for sync compaction. Signed-off-by: David Rientjes Acked-by: Vlastimil Babka Reviewed-by: Naoya Horiguchi Cc: Greg Thelen Cc: Mel Gorman Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Jiri Slaby commit b264e9ab205b3038b032a18df181fd6ed7a77ce3 Author: David Rientjes Date: Thu Aug 28 19:35:02 2014 +0100 mm, compaction: return failed migration target pages back to freelist commit d53aea3d46d64e95da9952887969f7533b9ab25e upstream. Greg reported that he found isolated free pages were returned back to the VM rather than the compaction freelist. This will cause holes behind the free scanner and cause it to reallocate additional memory if necessary later. He detected the problem at runtime seeing that ext4 metadata pages (esp the ones read by "sbi->s_group_desc[i] = sb_bread(sb, block)") were constantly visited by compaction calls of migrate_pages(). These pages had a non-zero b_count which caused fallback_migrate_page() -> try_to_release_page() -> try_to_free_buffers() to fail. Memory compaction works by having a "freeing scanner" scan from one end of a zone which isolates pages as migration targets while another "migrating scanner" scans from the other end of the same zone which isolates pages for migration. When page migration fails for an isolated page, the target page is returned to the system rather than the freelist built by the freeing scanner. This may require the freeing scanner to continue scanning memory after suitable migration targets have already been returned to the system needlessly. This patch returns destination pages to the freeing scanner freelist when page migration fails. This prevents unnecessary work done by the freeing scanner but also encourages memory to be as compacted as possible at the end of the zone. Signed-off-by: David Rientjes Reported-by: Greg Thelen Acked-by: Mel Gorman Acked-by: Vlastimil Babka Reviewed-by: Naoya Horiguchi Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Jiri Slaby commit ee92d4d6d28561c5019da10ed62c5bb8bdcd3c1e Author: David Rientjes Date: Thu Aug 28 19:35:01 2014 +0100 mm, migration: add destination page freeing callback commit 68711a746345c44ae00c64d8dbac6a9ce13ac54a upstream. Memory migration uses a callback defined by the caller to determine how to allocate destination pages. When migration fails for a source page, however, it frees the destination page back to the system. This patch adds a memory migration callback defined by the caller to determine how to free destination pages. If a caller, such as memory compaction, builds its own freelist for migration targets, this can reuse already freed memory instead of scanning additional memory. If the caller provides a function to handle freeing of destination pages, it is called when page migration fails. If the caller passes NULL then freeing back to the system will be handled as usual. This patch introduces no functional change. Signed-off-by: David Rientjes Reviewed-by: Naoya Horiguchi Acked-by: Mel Gorman Acked-by: Vlastimil Babka Cc: Greg Thelen Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Jiri Slaby commit e644c10bf85cd655e616a5985a42b907bc7f6f19 Author: Vlastimil Babka Date: Thu Aug 28 19:35:00 2014 +0100 mm/compaction: cleanup isolate_freepages() commit c96b9e508f3d06ddb601dcc9792d62c044ab359e upstream. isolate_freepages() is currently somewhat hard to follow thanks to many looks like it is related to the 'low_pfn' variable, but in fact it is not. This patch renames the 'high_pfn' variable to a hopefully less confusing name, and slightly changes its handling without a functional change. A comment made obsolete by recent changes is also updated. [akpm@linux-foundation.org: comment fixes, per Minchan] [iamjoonsoo.kim@lge.com: cleanups] Signed-off-by: Vlastimil Babka Cc: Minchan Kim Cc: Mel Gorman Cc: Joonsoo Kim Cc: Bartlomiej Zolnierkiewicz Cc: Michal Nazarewicz Cc: Naoya Horiguchi Cc: Christoph Lameter Cc: Rik van Riel Cc: Dongjun Shin Cc: Sunghwan Yun Signed-off-by: Joonsoo Kim Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Jiri Slaby commit 87db4a8abdf503c5e67eca7373dadb177bd0d715 Author: Heesub Shin Date: Thu Aug 28 19:34:59 2014 +0100 mm/compaction: clean up unused code lines commit 13fb44e4b0414d7e718433a49e6430d5b76bd46e upstream. Remove code lines currently not in use or never called. Signed-off-by: Heesub Shin Acked-by: Vlastimil Babka Cc: Dongjun Shin Cc: Sunghwan Yun Cc: Minchan Kim Cc: Mel Gorman Cc: Joonsoo Kim Cc: Bartlomiej Zolnierkiewicz Cc: Michal Nazarewicz Cc: Naoya Horiguchi Cc: Christoph Lameter Cc: Rik van Riel Cc: Dongjun Shin Cc: Sunghwan Yun Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Jiri Slaby commit 32e8fcae4ace22d053805fe258a2ae78973d4a85 Author: Fabian Frederick Date: Thu Aug 28 19:34:58 2014 +0100 mm/readahead.c: inline ra_submit commit 29f175d125f0f3a9503af8a5596f93d714cceb08 upstream. Commit f9acc8c7b35a ("readahead: sanify file_ra_state names") left ra_submit with a single function call. Move ra_submit to internal.h and inline it to save some stack. Thanks to Andrew Morton for commenting different versions. Signed-off-by: Fabian Frederick Suggested-by: Andrew Morton Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Jiri Slaby commit 72ef5b50e048aae663fe9b8a3702646a773ea414 Author: Al Viro Date: Thu Aug 28 19:34:57 2014 +0100 callers of iov_copy_from_user_atomic() don't need pagecache_disable() commit 9e8c2af96e0d2d5fe298dd796fb6bc16e888a48d upstream. ... it does that itself (via kmap_atomic()) Signed-off-by: Al Viro Signed-off-by: Mel Gorman Signed-off-by: Jiri Slaby commit 4fb08e5ab578244164bd2269bd0d5d14669ed35a Author: Sasha Levin Date: Thu Aug 28 19:34:56 2014 +0100 mm: remove read_cache_page_async() commit 67f9fd91f93c582b7de2ab9325b6e179db77e4d5 upstream. This patch removes read_cache_page_async() which wasn't really needed anywhere and simplifies the code around it a bit. read_cache_page_async() is useful when we want to read a page into the cache without waiting for it to complete. This happens when the appropriate callback 'filler' doesn't complete its read operation and releases the page lock immediately, and instead queues a different completion routine to do that. This never actually happened anywhere in the code. read_cache_page_async() had 3 different callers: - read_cache_page() which is the sync version, it would just wait for the requested read to complete using wait_on_page_read(). - JFFS2 would call it from jffs2_gc_fetch_page(), but the filler function it supplied doesn't do any async reads, and would complete before the filler function returns - making it actually a sync read. - CRAMFS would call it using the read_mapping_page_async() wrapper, with a similar story to JFFS2 - the filler function doesn't do anything that reminds async reads and would always complete before the filler function returns. To sum it up, the code in mm/filemap.c never took advantage of having read_cache_page_async(). While there are filler callbacks that do async reads (such as the block one), we always called it with the read_cache_page(). This patch adds a mandatory wait for read to complete when adding a new page to the cache, and removes read_cache_page_async() and its wrappers. Signed-off-by: Sasha Levin Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Jiri Slaby commit 5c3ce5b2b6bd0c685914b68c1e6106278845add0 Author: Johannes Weiner Date: Thu Aug 28 19:34:55 2014 +0100 mm: madvise: fix MADV_WILLNEED on shmem swapouts commit 55231e5c898c5c03c14194001e349f40f59bd300 upstream. MADV_WILLNEED currently does not read swapped out shmem pages back in. Commit 0cd6144aadd2 ("mm + fs: prepare for non-page entries in page cache radix trees") made find_get_page() filter exceptional radix tree entries but failed to convert all find_get_page() callers that WANT exceptional entries over to find_get_entry(). One of them is shmem swap readahead in madvise, which now skips over any swap-out records. Convert it to find_get_entry(). Fixes: 0cd6144aadd2 ("mm + fs: prepare for non-page entries in page cache radix trees") Signed-off-by: Johannes Weiner Reported-by: Hugh Dickins Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Jiri Slaby commit e714f0cf03c3adfba331c85deca6844e47b60cc5 Author: Johannes Weiner Date: Thu Aug 28 19:34:54 2014 +0100 mm + fs: prepare for non-page entries in page cache radix trees commit 0cd6144aadd2afd19d1aca880153530c52957604 upstream. shmem mappings already contain exceptional entries where swap slot information is remembered. To be able to store eviction information for regular page cache, prepare every site dealing with the radix trees directly to handle entries other than pages. The common lookup functions will filter out non-page entries and return NULL for page cache holes, just as before. But provide a raw version of the API which returns non-page entries as well, and switch shmem over to use it. Signed-off-by: Johannes Weiner Reviewed-by: Rik van Riel Reviewed-by: Minchan Kim Cc: Andrea Arcangeli Cc: Bob Liu Cc: Christoph Hellwig Cc: Dave Chinner Cc: Greg Thelen Cc: Hugh Dickins Cc: Jan Kara Cc: KOSAKI Motohiro Cc: Luigi Semenzato Cc: Mel Gorman Cc: Metin Doslu Cc: Michel Lespinasse Cc: Ozgun Erdogan Cc: Peter Zijlstra Cc: Roman Gushchin Cc: Ryan Mallon Cc: Tejun Heo Cc: Vlastimil Babka Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Jiri Slaby commit 3721b421230f4e8546bff908156187cba5b49215 Author: Johannes Weiner Date: Thu Aug 28 19:34:53 2014 +0100 mm: filemap: move radix tree hole searching here commit e7b563bb2a6f4d974208da46200784b9c5b5a47e upstream. The radix tree hole searching code is only used for page cache, for example the readahead code trying to get a a picture of the area surrounding a fault. It sufficed to rely on the radix tree definition of holes, which is "empty tree slot". But this is about to change, though, as shadow page descriptors will be stored in the page cache after the actual pages get evicted from memory. Move the functions over to mm/filemap.c and make them native page cache operations, where they can later be adapted to handle the new definition of "page cache hole". Signed-off-by: Johannes Weiner Reviewed-by: Rik van Riel Reviewed-by: Minchan Kim Acked-by: Mel Gorman Cc: Andrea Arcangeli Cc: Bob Liu Cc: Christoph Hellwig Cc: Dave Chinner Cc: Greg Thelen Cc: Hugh Dickins Cc: Jan Kara Cc: KOSAKI Motohiro Cc: Luigi Semenzato Cc: Metin Doslu Cc: Michel Lespinasse Cc: Ozgun Erdogan Cc: Peter Zijlstra Cc: Roman Gushchin Cc: Ryan Mallon Cc: Tejun Heo Cc: Vlastimil Babka Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Jiri Slaby commit a3d18e49a8fdaed9fba4b65b00ca6dffc0418f45 Author: Johannes Weiner Date: Thu Aug 28 19:34:52 2014 +0100 mm: shmem: save one radix tree lookup when truncating swapped pages commit 6dbaf22ce1f1dfba33313198eb5bd989ae76dd87 upstream. Page cache radix tree slots are usually stabilized by the page lock, but shmem's swap cookies have no such thing. Because the overall truncation loop is lockless, the swap entry is currently confirmed by a tree lookup and then deleted by another tree lookup under the same tree lock region. Use radix_tree_delete_item() instead, which does the verification and deletion with only one lookup. This also allows removing the delete-only special case from shmem_radix_tree_replace(). Signed-off-by: Johannes Weiner Reviewed-by: Minchan Kim Reviewed-by: Rik van Riel Acked-by: Mel Gorman Cc: Andrea Arcangeli Cc: Bob Liu Cc: Christoph Hellwig Cc: Dave Chinner Cc: Greg Thelen Cc: Hugh Dickins Cc: Jan Kara Cc: KOSAKI Motohiro Cc: Luigi Semenzato Cc: Metin Doslu Cc: Michel Lespinasse Cc: Ozgun Erdogan Cc: Peter Zijlstra Cc: Roman Gushchin Cc: Ryan Mallon Cc: Tejun Heo Cc: Vlastimil Babka Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Jiri Slaby commit 50c4613d2293063a18c05dccfc8f4335be21b3e4 Author: Johannes Weiner Date: Thu Aug 28 19:34:51 2014 +0100 lib: radix-tree: add radix_tree_delete_item() commit 53c59f262d747ea82e7414774c59a489501186a0 upstream. Provide a function that does not just delete an entry at a given index, but also allows passing in an expected item. Delete only if that item is still located at the specified index. This is handy when lockless tree traversals want to delete entries as well because they don't have to do an second, locked lookup to verify the slot has not changed under them before deleting the entry. Signed-off-by: Johannes Weiner Reviewed-by: Minchan Kim Reviewed-by: Rik van Riel Acked-by: Mel Gorman Cc: Andrea Arcangeli Cc: Bob Liu Cc: Christoph Hellwig Cc: Dave Chinner Cc: Greg Thelen Cc: Hugh Dickins Cc: Jan Kara Cc: KOSAKI Motohiro Cc: Luigi Semenzato Cc: Metin Doslu Cc: Michel Lespinasse Cc: Ozgun Erdogan Cc: Peter Zijlstra Cc: Roman Gushchin Cc: Ryan Mallon Cc: Tejun Heo Cc: Vlastimil Babka Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Jiri Slaby commit c05ac84a8523a88e23dfd89f5bb96934409cda8a Author: Linus Torvalds Date: Thu Aug 28 19:34:50 2014 +0100 mm: don't pointlessly use BUG_ON() for sanity check commit 50f5aa8a9b248fa4262cf379863ec9a531b49737 upstream. BUG_ON() is a big hammer, and should be used _only_ if there is some major corruption that you cannot possibly recover from, making it imperative that the current process (and possibly the whole machine) be terminated with extreme prejudice. The trivial sanity check in the vmacache code is *not* such a fatal error. Recovering from it is absolutely trivial, and using BUG_ON() just makes it harder to debug for no actual advantage. To make matters worse, the placement of the BUG_ON() (only if the range check matched) actually makes it harder to hit the sanity check to begin with, so _if_ there is a bug (and we just got a report from Srivatsa Bhat that this can indeed trigger), it is harder to debug not just because the machine is possibly dead, but because we don't have better coverage. BUG_ON() must *die*. Maybe we should add a checkpatch warning for it, because it is simply just about the worst thing you can ever do if you hit some "this cannot happen" situation. Reported-by: Srivatsa S. Bhat Cc: Davidlohr Bueso Cc: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Jiri Slaby commit 9c0073071075474cb066cca50b7b8574c1314e3d Author: Davidlohr Bueso Date: Thu Aug 28 19:34:49 2014 +0100 mm: per-thread vma caching commit 615d6e8756c87149f2d4c1b93d471bca002bd849 upstream. This patch is a continuation of efforts trying to optimize find_vma(), avoiding potentially expensive rbtree walks to locate a vma upon faults. The original approach (https://lkml.org/lkml/2013/11/1/410), where the largest vma was also cached, ended up being too specific and random, thus further comparison with other approaches were needed. There are two things to consider when dealing with this, the cache hit rate and the latency of find_vma(). Improving the hit-rate does not necessarily translate in finding the vma any faster, as the overhead of any fancy caching schemes can be too high to consider. We currently cache the last used vma for the whole address space, which provides a nice optimization, reducing the total cycles in find_vma() by up to 250%, for workloads with good locality. On the other hand, this simple scheme is pretty much useless for workloads with poor locality. Analyzing ebizzy runs shows that, no matter how many threads are running, the mmap_cache hit rate is less than 2%, and in many situations below 1%. The proposed approach is to replace this scheme with a small per-thread cache, maximizing hit rates at a very low maintenance cost. Invalidations are performed by simply bumping up a 32-bit sequence number. The only expensive operation is in the rare case of a seq number overflow, where all caches that share the same address space are flushed. Upon a miss, the proposed replacement policy is based on the page number that contains the virtual address in question. Concretely, the following results are seen on an 80 core, 8 socket x86-64 box: 1) System bootup: Most programs are single threaded, so the per-thread scheme does improve ~50% hit rate by just adding a few more slots to the cache. +----------------+----------+------------------+ | caching scheme | hit-rate | cycles (billion) | +----------------+----------+------------------+ | baseline | 50.61% | 19.90 | | patched | 73.45% | 13.58 | +----------------+----------+------------------+ 2) Kernel build: This one is already pretty good with the current approach as we're dealing with good locality. +----------------+----------+------------------+ | caching scheme | hit-rate | cycles (billion) | +----------------+----------+------------------+ | baseline | 75.28% | 11.03 | | patched | 88.09% | 9.31 | +----------------+----------+------------------+ 3) Oracle 11g Data Mining (4k pages): Similar to the kernel build workload. +----------------+----------+------------------+ | caching scheme | hit-rate | cycles (billion) | +----------------+----------+------------------+ | baseline | 70.66% | 17.14 | | patched | 91.15% | 12.57 | +----------------+----------+------------------+ 4) Ebizzy: There's a fair amount of variation from run to run, but this approach always shows nearly perfect hit rates, while baseline is just about non-existent. The amounts of cycles can fluctuate between anywhere from ~60 to ~116 for the baseline scheme, but this approach reduces it considerably. For instance, with 80 threads: +----------------+----------+------------------+ | caching scheme | hit-rate | cycles (billion) | +----------------+----------+------------------+ | baseline | 1.06% | 91.54 | | patched | 99.97% | 14.18 | +----------------+----------+------------------+ [akpm@linux-foundation.org: fix nommu build, per Davidlohr] [akpm@linux-foundation.org: document vmacache_valid() logic] [akpm@linux-foundation.org: attempt to untangle header files] [akpm@linux-foundation.org: add vmacache_find() BUG_ON] [hughd@google.com: add vmacache_valid_mm() (from Oleg)] [akpm@linux-foundation.org: coding-style fixes] [akpm@linux-foundation.org: adjust and enhance comments] Signed-off-by: Davidlohr Bueso Reviewed-by: Rik van Riel Acked-by: Linus Torvalds Reviewed-by: Michel Lespinasse Cc: Oleg Nesterov Tested-by: Hugh Dickins Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Jiri Slaby commit edce92fcf24868572b57678126f17de622ae660d Author: Christoph Lameter Date: Thu Aug 28 19:34:48 2014 +0100 vmscan: reclaim_clean_pages_from_list() must use mod_zone_page_state() commit 83da7510058736c09a14b9c17ec7d851940a4332 upstream. Seems to be called with preemption enabled. Therefore it must use mod_zone_page_state instead. Signed-off-by: Christoph Lameter Reported-by: Grygorii Strashko Tested-by: Grygorii Strashko Cc: Tejun Heo Cc: Santosh Shilimkar Cc: Ingo Molnar Cc: Mel Gorman Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Jiri Slaby commit 4b946426a0af6a745684f38d365ef505f29f7188 Author: Vladimir Davydov Date: Thu Aug 28 19:34:47 2014 +0100 mm: vmscan: shrink_slab: rename max_pass -> freeable commit d5bc5fd3fcb7b8dfb431694a8c8052466504c10c upstream. The name `max_pass' is misleading, because this variable actually keeps the estimate number of freeable objects, not the maximal number of objects we can scan in this pass, which can be twice that. Rename it to reflect its actual meaning. Signed-off-by: Vladimir Davydov Acked-by: David Rientjes Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Jiri Slaby commit b3b0bd3970bbe5c770a3f0764c4da9111605ad47 Author: Vladimir Davydov Date: Thu Aug 28 19:34:46 2014 +0100 mm: vmscan: respect NUMA policy mask when shrinking slab on direct reclaim commit 99120b772b52853f9a2b829a21dd44d9b20558f1 upstream. When direct reclaim is executed by a process bound to a set of NUMA nodes, we should scan only those nodes when possible, but currently we will scan kmem from all online nodes even if the kmem shrinker is NUMA aware. That said, binding a process to a particular NUMA node won't prevent it from shrinking inode/dentry caches from other nodes, which is not good. Fix this. Signed-off-by: Vladimir Davydov Cc: Mel Gorman Cc: Michal Hocko Cc: Johannes Weiner Cc: Rik van Riel Cc: Dave Chinner Cc: Glauber Costa Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Jiri Slaby commit 5a6ad555bd1cc9c3126df8b8d69ea332b9a4ec8b Author: Jens Axboe Date: Thu Aug 28 19:34:45 2014 +0100 mm/filemap.c: avoid always dirtying mapping->flags on O_DIRECT commit 7fcbbaf18392f0b17c95e2f033c8ccf87eecde1d upstream. In some testing I ran today (some fio jobs that spread over two nodes), we end up spending 40% of the time in filemap_check_errors(). That smells fishy. Looking further, this is basically what happens: blkdev_aio_read() generic_file_aio_read() filemap_write_and_wait_range() if (!mapping->nr_pages) filemap_check_errors() and filemap_check_errors() always attempts two test_and_clear_bit() on the mapping flags, thus dirtying it for every single invocation. The patch below tests each of these bits before clearing them, avoiding this issue. In my test case (4-socket box), performance went from 1.7M IOPS to 4.0M IOPS. Signed-off-by: Jens Axboe Acked-by: Jeff Moyer Cc: Al Viro Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Jiri Slaby commit 337c9823cf0b93c8f4d1c4654bd93cf24e5b837b Author: Mel Gorman Date: Thu Aug 28 19:34:44 2014 +0100 mm: optimize put_mems_allowed() usage commit d26914d11751b23ca2e8747725f2cae10c2f2c1b upstream. Since put_mems_allowed() is strictly optional, its a seqcount retry, we don't need to evaluate the function if the allocation was in fact successful, saving a smp_rmb some loads and comparisons on some relative fast-paths. Since the naming, get/put_mems_allowed() does suggest a mandatory pairing, rename the interface, as suggested by Mel, to resemble the seqcount interface. This gives us: read_mems_allowed_begin() and read_mems_allowed_retry(), where it is important to note that the return value of the latter call is inverted from its previous incarnation. Signed-off-by: Peter Zijlstra Signed-off-by: Mel Gorman Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Jiri Slaby commit 8010da4957163cce03a9a27b7f3dbeb5a77a3954 Author: Raghavendra K T Date: Thu Aug 28 19:34:43 2014 +0100 mm/readahead.c: fix readahead failure for memoryless NUMA nodes and limit readahead pages commit 6d2be915e589b58cb11418cbe1f22ff90732b6ac upstream. Currently max_sane_readahead() returns zero on the cpu whose NUMA node has no local memory which leads to readahead failure. Fix this readahead failure by returning minimum of (requested pages, 512). Users running applications on a memory-less cpu which needs readahead such as streaming application see considerable boost in the performance. Result: fadvise experiment with FADV_WILLNEED on a PPC machine having memoryless CPU with 1GB testfile (12 iterations) yielded around 46.66% improvement. fadvise experiment with FADV_WILLNEED on a x240 machine with 1GB testfile 32GB* 4G RAM numa machine (12 iterations) showed no impact on the normal NUMA cases w/ patch. Kernel Avg Stddev base 7.4975 3.92% patched 7.4174 3.26% [Andrew: making return value PAGE_SIZE independent] Suggested-by: Linus Torvalds Signed-off-by: Raghavendra K T Acked-by: Jan Kara Cc: Wu Fengguang Cc: David Rientjes Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Jiri Slaby commit c4199ba1a1d8802a69ed9f65026bc8781a72c708 Author: David Rientjes Date: Thu Aug 28 19:34:42 2014 +0100 mm, compaction: ignore pageblock skip when manually invoking compaction commit 91ca9186484809c57303b33778d841cc28f696ed upstream. The cached pageblock hint should be ignored when triggering compaction through /proc/sys/vm/compact_memory so all eligible memory is isolated. Manually invoking compaction is known to be expensive, there's no need to skip pageblocks based on heuristics (mainly for debugging). Signed-off-by: David Rientjes Acked-by: Mel Gorman Cc: Rik van Riel Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Jiri Slaby commit d36e7004412992303202433b25bdbb1e7c929f57 Author: David Rientjes Date: Thu Aug 28 19:34:41 2014 +0100 mm, compaction: determine isolation mode only once commit da1c67a76f7cf2b3404823d24f9f10fa91aa5dc5 upstream. The conditions that control the isolation mode in isolate_migratepages_range() do not change during the iteration, so extract them out and only define the value once. This actually does have an effect, gcc doesn't optimize it itself because of cc->sync. Signed-off-by: David Rientjes Cc: Mel Gorman Acked-by: Rik van Riel Acked-by: Vlastimil Babka Cc: Joonsoo Kim Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Jiri Slaby commit 0a1802eaa49d84dcd7052afb4c334a0c94af1832 Author: Joonsoo Kim Date: Thu Aug 28 19:34:40 2014 +0100 mm/compaction: clean-up code on success of ballon isolation commit b6c750163c0d138f5041d95fcdbd1094b6928057 upstream. It is just for clean-up to reduce code size and improve readability. There is no functional change. Signed-off-by: Joonsoo Kim Acked-by: Vlastimil Babka Cc: Mel Gorman Cc: Rik van Riel Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Jiri Slaby commit fc8dc0a9f5b062d2b5fbfe368e4d9cdff23ca76b Author: Joonsoo Kim Date: Thu Aug 28 19:34:39 2014 +0100 mm/compaction: check pageblock suitability once per pageblock commit c122b2087ab94192f2b937e47b563a9c4e688ece upstream. isolation_suitable() and migrate_async_suitable() is used to be sure that this pageblock range is fine to be migragted. It isn't needed to call it on every page. Current code do well if not suitable, but, don't do well when suitable. 1) It re-checks isolation_suitable() on each page of a pageblock that was already estabilished as suitable. 2) It re-checks migrate_async_suitable() on each page of a pageblock that was not entered through the next_pageblock: label, because last_pageblock_nr is not otherwise updated. This patch fixes situation by 1) calling isolation_suitable() only once per pageblock and 2) always updating last_pageblock_nr to the pageblock that was just checked. Additionally, move PageBuddy() check after pageblock unit check, since pageblock check is the first thing we should do and makes things more simple. [vbabka@suse.cz: rephrase commit description] Signed-off-by: Joonsoo Kim Acked-by: Vlastimil Babka Cc: Mel Gorman Cc: Rik van Riel Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Jiri Slaby commit e45dcd3deabda867dbb5ceb05706d32bb4267d79 Author: Joonsoo Kim Date: Thu Aug 28 19:34:38 2014 +0100 mm/compaction: change the timing to check to drop the spinlock commit be1aa03b973c7dcdc576f3503f7a60429825c35d upstream. It is odd to drop the spinlock when we scan (SWAP_CLUSTER_MAX - 1) th pfn page. This may results in below situation while isolating migratepage. 1. try isolate 0x0 ~ 0x200 pfn pages. 2. When low_pfn is 0x1ff, ((low_pfn+1) % SWAP_CLUSTER_MAX) == 0, so drop the spinlock. 3. Then, to complete isolating, retry to aquire the lock. I think that it is better to use SWAP_CLUSTER_MAX th pfn for checking the criteria about dropping the lock. This has no harm 0x0 pfn, because, at this time, locked variable would be false. Signed-off-by: Joonsoo Kim Acked-by: Vlastimil Babka Cc: Mel Gorman Cc: Rik van Riel Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Jiri Slaby commit 093b8ab7059129701849d3816f3d16f7b57fbfe0 Author: Joonsoo Kim Date: Thu Aug 28 19:34:37 2014 +0100 mm/compaction: do not call suitable_migration_target() on every page commit 01ead5340bcf5f3a1cd2452c75516d0ef4d908d7 upstream. suitable_migration_target() checks that pageblock is suitable for migration target. In isolate_freepages_block(), it is called on every page and this is inefficient. So make it called once per pageblock. suitable_migration_target() also checks if page is highorder or not, but it's criteria for highorder is pageblock order. So calling it once within pageblock range has no problem. Signed-off-by: Joonsoo Kim Acked-by: Vlastimil Babka Cc: Mel Gorman Cc: Rik van Riel Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Jiri Slaby commit d9c7a696a1bec88a4c5e4a54f8ae3919957c208f Author: Joonsoo Kim Date: Thu Aug 28 19:34:36 2014 +0100 mm/compaction: disallow high-order page for migration target commit 7d348b9ea64db0a315d777ce7d4b06697f946503 upstream. Purpose of compaction is to get a high order page. Currently, if we find high-order page while searching migration target page, we break it to order-0 pages and use them as migration target. It is contrary to purpose of compaction, so disallow high-order page to be used for migration target. Additionally, clean-up logic in suitable_migration_target() to simplify the code. There is no functional changes from this clean-up. Signed-off-by: Joonsoo Kim Acked-by: Vlastimil Babka Cc: Mel Gorman Cc: Rik van Riel Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Jiri Slaby commit 5939eba4f79c4d689957bd64877aef2da9310286 Author: David Rientjes Date: Thu Aug 28 19:34:35 2014 +0100 mm, compaction: avoid isolating pinned pages commit 119d6d59dcc0980dcd581fdadb6b2033b512a473 upstream. Page migration will fail for memory that is pinned in memory with, for example, get_user_pages(). In this case, it is unnecessary to take zone->lru_lock or isolating the page and passing it to page migration which will ultimately fail. This is a racy check, the page can still change from under us, but in that case we'll just fail later when attempting to move the page. This avoids very expensive memory compaction when faulting transparent hugepages after pinning a lot of memory with a Mellanox driver. On a 128GB machine and pinning ~120GB of memory, before this patch we see the enormous disparity in the number of page migration failures because of the pinning (from /proc/vmstat): compact_pages_moved 8450 compact_pagemigrate_failed 15614415 0.05% of pages isolated are successfully migrated and explicitly triggering memory compaction takes 102 seconds. After the patch: compact_pages_moved 9197 compact_pagemigrate_failed 7 99.9% of pages isolated are now successfully migrated in this configuration and memory compaction takes less than one second. Signed-off-by: David Rientjes Acked-by: Hugh Dickins Acked-by: Mel Gorman Cc: Joonsoo Kim Cc: Rik van Riel Cc: Greg Thelen Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Jiri Slaby commit c824c468fd7660207f10cd333b2a1fa188d533b0 Author: Yasuaki Ishimatsu Date: Thu Aug 28 19:34:34 2014 +0100 mm: get rid of unnecessary pageblock scanning in setup_zone_migrate_reserve commit 943dca1a1fcbccb58de944669b833fd38a6c809b upstream. Yasuaki Ishimatsu reported memory hot-add spent more than 5 _hours_ on 9TB memory machine since onlining memory sections is too slow. And we found out setup_zone_migrate_reserve spent >90% of the time. The problem is, setup_zone_migrate_reserve scans all pageblocks unconditionally, but it is only necessary if the number of reserved block was reduced (i.e. memory hot remove). Moreover, maximum MIGRATE_RESERVE per zone is currently 2. It means that the number of reserved pageblocks is almost always unchanged. This patch adds zone->nr_migrate_reserve_block to maintain the number of MIGRATE_RESERVE pageblocks and it reduces the overhead of setup_zone_migrate_reserve dramatically. The following table shows time of onlining a memory section. Amount of memory | 128GB | 192GB | 256GB| --------------------------------------------- linux-3.12 | 23.9 | 31.4 | 44.5 | This patch | 8.3 | 8.3 | 8.6 | Mel's proposal patch | 10.9 | 19.2 | 31.3 | --------------------------------------------- (millisecond) 128GB : 4 nodes and each node has 32GB of memory 192GB : 6 nodes and each node has 32GB of memory 256GB : 8 nodes and each node has 32GB of memory (*1) Mel proposed his idea by the following threads. https://lkml.org/lkml/2013/10/30/272 [akpm@linux-foundation.org: tweak comment] Signed-off-by: KOSAKI Motohiro Signed-off-by: Yasuaki Ishimatsu Reported-by: Yasuaki Ishimatsu Tested-by: Yasuaki Ishimatsu Cc: Mel Gorman Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Jiri Slaby commit 760265401f255337347c42bc8fb0b9c1c682d904 Author: Vladimir Davydov Date: Thu Aug 28 19:34:33 2014 +0100 mm: vmscan: call NUMA-unaware shrinkers irrespective of nodemask commit ec97097bca147d5718a5d2c024d1ec740b10096d upstream. If a shrinker is not NUMA-aware, shrink_slab() should call it exactly once with nid=0, but currently it is not true: if node 0 is not set in the nodemask or if it is not online, we will not call such shrinkers at all. As a result some slabs will be left untouched under some circumstances. Let us fix it. Signed-off-by: Vladimir Davydov Reported-by: Dave Chinner Cc: Mel Gorman Cc: Michal Hocko Cc: Johannes Weiner Cc: Rik van Riel Cc: Glauber Costa Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Jiri Slaby commit 671133cd6bd95dd7c752655b4041f05645d3bc27 Author: Vladimir Davydov Date: Thu Aug 28 19:34:32 2014 +0100 mm: vmscan: shrink all slab objects if tight on memory commit 0b1fb40a3b1291f2f12f13f644ac95cf756a00e6 upstream. When reclaiming kmem, we currently don't scan slabs that have less than batch_size objects (see shrink_slab_node()): while (total_scan >= batch_size) { shrinkctl->nr_to_scan = batch_size; shrinker->scan_objects(shrinker, shrinkctl); total_scan -= batch_size; } If there are only a few shrinkers available, such a behavior won't cause any problems, because the batch_size is usually small, but if we have a lot of slab shrinkers, which is perfectly possible since FS shrinkers are now per-superblock, we can end up with hundreds of megabytes of practically unreclaimable kmem objects. For instance, mounting a thousand of ext2 FS images with a hundred of files in each and iterating over all the files using du(1) will result in about 200 Mb of FS caches that cannot be dropped even with the aid of the vm.drop_caches sysctl! This problem was initially pointed out by Glauber Costa [*]. Glauber proposed to fix it by making the shrink_slab() always take at least one pass, to put it simply, turning the scan loop above to a do{}while() loop. However, this proposal was rejected, because it could result in more aggressive and frequent slab shrinking even under low memory pressure when total_scan is naturally very small. This patch is a slightly modified version of Glauber's approach. Similarly to Glauber's patch, it makes shrink_slab() scan less than batch_size objects, but only if the total number of objects we want to scan (total_scan) is greater than the total number of objects available (max_pass). Since total_scan is biased as half max_pass if the current delta change is small: if (delta < max_pass / 4) total_scan = min(total_scan, max_pass / 2); this is only possible if we are scanning at high prio. That said, this patch shouldn't change the vmscan behaviour if the memory pressure is low, but if we are tight on memory, we will do our best by trying to reclaim all available objects, which sounds reasonable. [*] http://www.spinics.net/lists/cgroups/msg06913.html Signed-off-by: Vladimir Davydov Cc: Mel Gorman Cc: Michal Hocko Cc: Johannes Weiner Cc: Rik van Riel Cc: Dave Chinner Cc: Glauber Costa Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Jiri Slaby commit df94648c0144550a0d6205ad8260fe626ec39caa Author: Shaohua Li Date: Thu Aug 28 19:34:31 2014 +0100 swap: add a simple detector for inappropriate swapin readahead commit 579f82901f6f41256642936d7e632f3979ad76d4 upstream. This is a patch to improve swap readahead algorithm. It's from Hugh and I slightly changed it. Hugh's original changelog: swapin readahead does a blind readahead, whether or not the swapin is sequential. This may be ok on harddisk, because large reads have relatively small costs, and if the readahead pages are unneeded they can be reclaimed easily - though, what if their allocation forced reclaim of useful pages? But on SSD devices large reads are more expensive than small ones: if the readahead pages are unneeded, reading them in caused significant overhead. This patch adds very simplistic random read detection. Stealing the PageReadahead technique from Konstantin Khlebnikov's patch, avoiding the vma/anon_vma sophistications of Shaohua Li's patch, swapin_nr_pages() simply looks at readahead's current success rate, and narrows or widens its readahead window accordingly. There is little science to its heuristic: it's about as stupid as can be whilst remaining effective. The table below shows elapsed times (in centiseconds) when running a single repetitive swapping load across a 1000MB mapping in 900MB ram with 1GB swap (the harddisk tests had taken painfully too long when I used mem=500M, but SSD shows similar results for that). Vanilla is the 3.6-rc7 kernel on which I started; Shaohua denotes his Sep 3 patch in mmotm and linux-next; HughOld denotes my Oct 1 patch which Shaohua showed to be defective; HughNew this Nov 14 patch, with page_cluster as usual at default of 3 (8-page reads); HughPC4 this same patch with page_cluster 4 (16-page reads); HughPC0 with page_cluster 0 (1-page reads: no readahead). HDD for swapping to harddisk, SSD for swapping to VertexII SSD. Seq for sequential access to the mapping, cycling five times around; Rand for the same number of random touches. Anon for a MAP_PRIVATE anon mapping; Shmem for a MAP_SHARED anon mapping, equivalent to tmpfs. One weakness of Shaohua's vma/anon_vma approach was that it did not optimize Shmem: seen below. Konstantin's approach was perhaps mistuned, 50% slower on Seq: did not compete and is not shown below. HDD Vanilla Shaohua HughOld HughNew HughPC4 HughPC0 Seq Anon 73921 76210 75611 76904 78191 121542 Seq Shmem 73601 73176 73855 72947 74543 118322 Rand Anon 895392 831243 871569 845197 846496 841680 Rand Shmem 1058375 1053486 827935 764955 764376 756489 SSD Vanilla Shaohua HughOld HughNew HughPC4 HughPC0 Seq Anon 24634 24198 24673 25107 21614 70018 Seq Shmem 24959 24932 25052 25703 22030 69678 Rand Anon 43014 26146 28075 25989 26935 25901 Rand Shmem 45349 45215 28249 24268 24138 24332 These tests are, of course, two extremes of a very simple case: under heavier mixed loads I've not yet observed any consistent improvement or degradation, and wider testing would be welcome. Shaohua Li: Test shows Vanilla is slightly better in sequential workload than Hugh's patch. I observed with Hugh's patch sometimes the readahead size is shrinked too fast (from 8 to 1 immediately) in sequential workload if there is no hit. And in such case, continuing doing readahead is good actually. I don't prepare a sophisticated algorithm for the sequential workload because so far we can't guarantee sequential accessed pages are swap out sequentially. So I slightly change Hugh's heuristic - don't shrink readahead size too fast. Here is my test result (unit second, 3 runs average): Vanilla Hugh New Seq 356 370 360 Random 4525 2447 2444 Attached graph is the swapin/swapout throughput I collected with 'vmstat 2'. The first part is running a random workload (till around 1200 of the x-axis) and the second part is running a sequential workload. swapin and swapout throughput are almost identical in steady state in both workloads. These are expected behavior. while in Vanilla, swapin is much bigger than swapout especially in random workload (because wrong readahead). Original patches by: Shaohua Li and Konstantin Khlebnikov. [fengguang.wu@intel.com: swapin_nr_pages() can be static] Signed-off-by: Hugh Dickins Signed-off-by: Shaohua Li Signed-off-by: Fengguang Wu Cc: Rik van Riel Cc: Wu Fengguang Cc: Minchan Kim Cc: Konstantin Khlebnikov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Jiri Slaby commit 3d3de516335991816962c2027830f88d8a412a4a Author: Vlastimil Babka Date: Thu Aug 28 19:34:30 2014 +0100 mm: compaction: reset scanner positions immediately when they meet commit 55b7c4c99f6a448f72179297fe6432544f220063 upstream. Compaction used to start its migrate and free page scaners at the zone's lowest and highest pfn, respectively. Later, caching was introduced to remember the scanners' progress across compaction attempts so that pageblocks are not re-scanned uselessly. Additionally, pageblocks where isolation failed are marked to be quickly skipped when encountered again in future compactions. Currently, both the reset of cached pfn's and clearing of the pageblock skip information for a zone is done in __reset_isolation_suitable(). This function gets called when: - compaction is restarting after being deferred - compact_blockskip_flush flag is set in compact_finished() when the scanners meet (and not again cleared when direct compaction succeeds in allocation) and kswapd acts upon this flag before going to sleep This behavior is suboptimal for several reasons: - when direct sync compaction is called after async compaction fails (in the allocation slowpath), it will effectively do nothing, unless kswapd happens to process the compact_blockskip_flush flag meanwhile. This is racy and goes against the purpose of sync compaction to more thoroughly retry the compaction of a zone where async compaction has failed. The restart-after-deferring path cannot help here as deferring happens only after the sync compaction fails. It is also done only for the preferred zone, while the compaction might be done for a fallback zone. - the mechanism of marking pageblock to be skipped has little value since the cached pfn's are reset only together with the pageblock skip flags. This effectively limits pageblock skip usage to parallel compactions. This patch changes compact_finished() so that cached pfn's are reset immediately when the scanners meet. Clearing pageblock skip flags is unchanged, as well as the other situations where cached pfn's are reset. This allows the sync-after-async compaction to retry pageblocks not marked as skipped, such as blocks !MIGRATE_MOVABLE blocks that async compactions now skips without marking them. Signed-off-by: Vlastimil Babka Cc: Rik van Riel Acked-by: Mel Gorman Cc: Joonsoo Kim Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Jiri Slaby commit b488972c6afdf534055b9413f4da0359087db991 Author: Vlastimil Babka Date: Thu Aug 28 19:34:29 2014 +0100 mm: compaction: do not mark unmovable pageblocks as skipped in async compaction commit 50b5b094e683f8e51e82c6dfe97b1608cf97e6c0 upstream. Compaction temporarily marks pageblocks where it fails to isolate pages as to-be-skipped in further compactions, in order to improve efficiency. One of the reasons to fail isolating pages is that isolation is not attempted in pageblocks that are not of MIGRATE_MOVABLE (or CMA) type. The problem is that blocks skipped due to not being MIGRATE_MOVABLE in async compaction become skipped due to the temporary mark also in future sync compaction. Moreover, this may follow quite soon during __alloc_page_slowpath, without much time for kswapd to clear the pageblock skip marks. This goes against the idea that sync compaction should try to scan these blocks more thoroughly than the async compaction. The fix is to ensure in async compaction that these !MIGRATE_MOVABLE blocks are not marked to be skipped. Note this should not affect performance or locking impact of further async compactions, as skipping a block due to being !MIGRATE_MOVABLE is done soon after skipping a block marked to be skipped, both without locking. Signed-off-by: Vlastimil Babka Cc: Rik van Riel Acked-by: Mel Gorman Cc: Joonsoo Kim Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Jiri Slaby commit 6e3335e29177af49e8e4b8ba67452aa7afd91d7e Author: Vlastimil Babka Date: Thu Aug 28 19:34:28 2014 +0100 mm: compaction: encapsulate defer reset logic commit de6c60a6c115acaa721cfd499e028a413d1fcbf3 upstream. Currently there are several functions to manipulate the deferred compaction state variables. The remaining case where the variables are touched directly is when a successful allocation occurs in direct compaction, or is expected to be successful in the future by kswapd. Here, the lowest order that is expected to fail is updated, and in the case of successful allocation, the deferred status and counter is reset completely. Create a new function compaction_defer_reset() to encapsulate this functionality and make it easier to understand the code. No functional change. Signed-off-by: Vlastimil Babka Acked-by: Mel Gorman Reviewed-by: Rik van Riel Cc: Joonsoo Kim Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Jiri Slaby commit 280f6d65ae22f07d7a8e36b006e2bb39c2d287bb Author: Mel Gorman Date: Thu Aug 28 19:34:27 2014 +0100 mm: compaction: trace compaction begin and end commit 0eb927c0ab789d3d7d69f68acb850f69d4e7c36f upstream. The broad goal of the series is to improve allocation success rates for huge pages through memory compaction, while trying not to increase the compaction overhead. The original objective was to reintroduce capturing of high-order pages freed by the compaction, before they are split by concurrent activity. However, several bugs and opportunities for simple improvements were found in the current implementation, mostly through extra tracepoints (which are however too ugly for now to be considered for sending). The patches mostly deal with two mechanisms that reduce compaction overhead, which is caching the progress of migrate and free scanners, and marking pageblocks where isolation failed to be skipped during further scans. Patch 1 (from mgorman) adds tracepoints that allow calculate time spent in compaction and potentially debug scanner pfn values. Patch 2 encapsulates the some functionality for handling deferred compactions for better maintainability, without a functional change type is not determined without being actually needed. Patch 3 fixes a bug where cached scanner pfn's are sometimes reset only after they have been read to initialize a compaction run. Patch 4 fixes a bug where scanners meeting is sometimes not properly detected and can lead to multiple compaction attempts quitting early without doing any work. Patch 5 improves the chances of sync compaction to process pageblocks that async compaction has skipped due to being !MIGRATE_MOVABLE. Patch 6 improves the chances of sync direct compaction to actually do anything when called after async compaction fails during allocation slowpath. The impact of patches were validated using mmtests's stress-highalloc benchmark with mmtests's stress-highalloc benchmark on a x86_64 machine with 4GB memory. Due to instability of the results (mostly related to the bugs fixed by patches 2 and 3), 10 iterations were performed, taking min,mean,max values for success rates and mean values for time and vmstat-based metrics. First, the default GFP_HIGHUSER_MOVABLE allocations were tested with the patches stacked on top of v3.13-rc2. Patch 2 is OK to serve as baseline due to no functional changes in 1 and 2. Comments below. stress-highalloc 3.13-rc2 3.13-rc2 3.13-rc2 3.13-rc2 3.13-rc2 2-nothp 3-nothp 4-nothp 5-nothp 6-nothp Success 1 Min 9.00 ( 0.00%) 10.00 (-11.11%) 43.00 (-377.78%) 43.00 (-377.78%) 33.00 (-266.67%) Success 1 Mean 27.50 ( 0.00%) 25.30 ( 8.00%) 45.50 (-65.45%) 45.90 (-66.91%) 46.30 (-68.36%) Success 1 Max 36.00 ( 0.00%) 36.00 ( 0.00%) 47.00 (-30.56%) 48.00 (-33.33%) 52.00 (-44.44%) Success 2 Min 10.00 ( 0.00%) 8.00 ( 20.00%) 46.00 (-360.00%) 45.00 (-350.00%) 35.00 (-250.00%) Success 2 Mean 26.40 ( 0.00%) 23.50 ( 10.98%) 47.30 (-79.17%) 47.60 (-80.30%) 48.10 (-82.20%) Success 2 Max 34.00 ( 0.00%) 33.00 ( 2.94%) 48.00 (-41.18%) 50.00 (-47.06%) 54.00 (-58.82%) Success 3 Min 65.00 ( 0.00%) 63.00 ( 3.08%) 85.00 (-30.77%) 84.00 (-29.23%) 85.00 (-30.77%) Success 3 Mean 76.70 ( 0.00%) 70.50 ( 8.08%) 86.20 (-12.39%) 85.50 (-11.47%) 86.00 (-12.13%) Success 3 Max 87.00 ( 0.00%) 86.00 ( 1.15%) 88.00 ( -1.15%) 87.00 ( 0.00%) 87.00 ( 0.00%) 3.13-rc2 3.13-rc2 3.13-rc2 3.13-rc2 3.13-rc2 2-nothp 3-nothp 4-nothp 5-nothp 6-nothp User 6437.72 6459.76 5960.32 5974.55 6019.67 System 1049.65 1049.09 1029.32 1031.47 1032.31 Elapsed 1856.77 1874.48 1949.97 1994.22 1983.15 3.13-rc2 3.13-rc2 3.13-rc2 3.13-rc2 3.13-rc2 2-nothp 3-nothp 4-nothp 5-nothp 6-nothp Minor Faults 253952267 254581900 250030122 250507333 250157829 Major Faults 420 407 506 530 530 Swap Ins 4 9 9 6 6 Swap Outs 398 375 345 346 333 Direct pages scanned 197538 189017 298574 287019 299063 Kswapd pages scanned 1809843 1801308 1846674 1873184 1861089 Kswapd pages reclaimed 1806972 1798684 1844219 1870509 1858622 Direct pages reclaimed 197227 188829 298380 286822 298835 Kswapd efficiency 99% 99% 99% 99% 99% Kswapd velocity 953.382 970.449 952.243 934.569 922.286 Direct efficiency 99% 99% 99% 99% 99% Direct velocity 104.058 101.832 153.961 143.200 148.205 Percentage direct scans 9% 9% 13% 13% 13% Zone normal velocity 347.289 359.676 348.063 339.933 332.983 Zone dma32 velocity 710.151 712.605 758.140 737.835 737.507 Zone dma velocity 0.000 0.000 0.000 0.000 0.000 Page writes by reclaim 557.600 429.000 353.600 426.400 381.800 Page writes file 159 53 7 79 48 Page writes anon 398 375 345 346 333 Page reclaim immediate 825 644 411 575 420 Sector Reads 2781750 2769780 2878547 2939128 2910483 Sector Writes 12080843 12083351 12012892 12002132 12010745 Page rescued immediate 0 0 0 0 0 Slabs scanned 1575654 1545344 1778406 1786700 1794073 Direct inode steals 9657 10037 15795 14104 14645 Kswapd inode steals 46857 46335 50543 50716 51796 Kswapd skipped wait 0 0 0 0 0 THP fault alloc 97 91 81 71 77 THP collapse alloc 456 506 546 544 565 THP splits 6 5 5 4 4 THP fault fallback 0 1 0 0 0 THP collapse fail 14 14 12 13 12 Compaction stalls 1006 980 1537 1536 1548 Compaction success 303 284 562 559 578 Compaction failures 702 696 974 976 969 Page migrate success 1177325 1070077 3927538 3781870 3877057 Page migrate failure 0 0 0 0 0 Compaction pages isolated 2547248 2306457 8301218 8008500 8200674 Compaction migrate scanned 42290478 38832618 153961130 154143900 159141197 Compaction free scanned 89199429 79189151 356529027 351943166 356326727 Compaction cost 1566 1426 5312 5156 5294 NUMA PTE updates 0 0 0 0 0 NUMA hint faults 0 0 0 0 0 NUMA hint local faults 0 0 0 0 0 NUMA hint local percent 100 100 100 100 100 NUMA pages migrated 0 0 0 0 0 AutoNUMA cost 0 0 0 0 0 Observations: - The "Success 3" line is allocation success rate with system idle (phases 1 and 2 are with background interference). I used to get stable values around 85% with vanilla 3.11. The lower min and mean values came with 3.12. This was bisected to commit 81c0a2bb ("mm: page_alloc: fair zone allocator policy") As explained in comment for patch 3, I don't think the commit is wrong, but that it makes the effect of compaction bugs worse. From patch 3 onwards, the results are OK and match the 3.11 results. - Patch 4 also clearly helps phases 1 and 2, and exceeds any results I've seen with 3.11 (I didn't measure it that thoroughly then, but it was never above 40%). - Compaction cost and number of scanned pages is higher, especially due to patch 4. However, keep in mind that patches 3 and 4 fix existing bugs in the current design of compaction overhead mitigation, they do not change it. If overhead is found unacceptable, then it should be decreased differently (and consistently, not due to random conditions) than the current implementation does. In contrast, patches 5 and 6 (which are not strictly bug fixes) do not increase the overhead (but also not success rates). This might be a limitation of the stress-highalloc benchmark as it's quite uniform. Another set of results is when configuring stress-highalloc t allocate with similar flags as THP uses: (GFP_HIGHUSER_MOVABLE|__GFP_NOMEMALLOC|__GFP_NORETRY|__GFP_NO_KSWAPD) stress-highalloc 3.13-rc2 3.13-rc2 3.13-rc2 3.13-rc2 3.13-rc2 2-thp 3-thp 4-thp 5-thp 6-thp Success 1 Min 2.00 ( 0.00%) 7.00 (-250.00%) 18.00 (-800.00%) 19.00 (-850.00%) 26.00 (-1200.00%) Success 1 Mean 19.20 ( 0.00%) 17.80 ( 7.29%) 29.20 (-52.08%) 29.90 (-55.73%) 32.80 (-70.83%) Success 1 Max 27.00 ( 0.00%) 29.00 ( -7.41%) 35.00 (-29.63%) 36.00 (-33.33%) 37.00 (-37.04%) Success 2 Min 3.00 ( 0.00%) 8.00 (-166.67%) 21.00 (-600.00%) 21.00 (-600.00%) 32.00 (-966.67%) Success 2 Mean 19.30 ( 0.00%) 17.90 ( 7.25%) 32.20 (-66.84%) 32.60 (-68.91%) 35.70 (-84.97%) Success 2 Max 27.00 ( 0.00%) 30.00 (-11.11%) 36.00 (-33.33%) 37.00 (-37.04%) 39.00 (-44.44%) Success 3 Min 62.00 ( 0.00%) 62.00 ( 0.00%) 85.00 (-37.10%) 75.00 (-20.97%) 64.00 ( -3.23%) Success 3 Mean 66.30 ( 0.00%) 65.50 ( 1.21%) 85.60 (-29.11%) 83.40 (-25.79%) 83.50 (-25.94%) Success 3 Max 70.00 ( 0.00%) 69.00 ( 1.43%) 87.00 (-24.29%) 86.00 (-22.86%) 87.00 (-24.29%) 3.13-rc2 3.13-rc2 3.13-rc2 3.13-rc2 3.13-rc2 2-thp 3-thp 4-thp 5-thp 6-thp User 6547.93 6475.85 6265.54 6289.46 6189.96 System 1053.42 1047.28 1043.23 1042.73 1038.73 Elapsed 1835.43 1821.96 1908.67 1912.74 1956.38 3.13-rc2 3.13-rc2 3.13-rc2 3.13-rc2 3.13-rc2 2-thp 3-thp 4-thp 5-thp 6-thp Minor Faults 256805673 253106328 253222299 249830289 251184418 Major Faults 395 375 423 434 448 Swap Ins 12 10 10 12 9 Swap Outs 530 537 487 455 415 Direct pages scanned 71859 86046 153244 152764 190713 Kswapd pages scanned 1900994 1870240 1898012 1892864 1880520 Kswapd pages reclaimed 1897814 1867428 1894939 1890125 1877924 Direct pages reclaimed 71766 85908 153167 152643 190600 Kswapd efficiency 99% 99% 99% 99% 99% Kswapd velocity 1029.000 1067.782 1000.091 991.049 951.218 Direct efficiency 99% 99% 99% 99% 99% Direct velocity 38.897 49.127 80.747 79.983 96.468 Percentage direct scans 3% 4% 7% 7% 9% Zone normal velocity 351.377 372.494 348.910 341.689 335.310 Zone dma32 velocity 716.520 744.414 731.928 729.343 712.377 Zone dma velocity 0.000 0.000 0.000 0.000 0.000 Page writes by reclaim 669.300 604.000 545.700 538.900 429.900 Page writes file 138 66 58 83 14 Page writes anon 530 537 487 455 415 Page reclaim immediate 806 655 772 548 517 Sector Reads 2711956 2703239 2811602 2818248 2839459 Sector Writes 12163238 12018662 12038248 11954736 11994892 Page rescued immediate 0 0 0 0 0 Slabs scanned 1385088 1388364 1507968 1513292 1558656 Direct inode steals 1739 2564 4622 5496 6007 Kswapd inode steals 47461 46406 47804 48013 48466 Kswapd skipped wait 0 0 0 0 0 THP fault alloc 110 82 84 69 70 THP collapse alloc 445 482 467 462 539 THP splits 6 5 4 5 3 THP fault fallback 3 0 0 0 0 THP collapse fail 15 14 14 14 13 Compaction stalls 659 685 1033 1073 1111 Compaction success 222 225 410 427 456 Compaction failures 436 460 622 646 655 Page migrate success 446594 439978 1085640 1095062 1131716 Page migrate failure 0 0 0 0 0 Compaction pages isolated 1029475 1013490 2453074 2482698 2565400 Compaction migrate scanned 9955461 11344259 24375202 27978356 30494204 Compaction free scanned 27715272 28544654 80150615 82898631 85756132 Compaction cost 552 555 1344 1379 1436 NUMA PTE updates 0 0 0 0 0 NUMA hint faults 0 0 0 0 0 NUMA hint local faults 0 0 0 0 0 NUMA hint local percent 100 100 100 100 100 NUMA pages migrated 0 0 0 0 0 AutoNUMA cost 0 0 0 0 0 There are some differences from the previous results for THP-like allocations: - Here, the bad result for unpatched kernel in phase 3 is much more consistent to be between 65-70% and not related to the "regression" in 3.12. Still there is the improvement from patch 4 onwards, which brings it on par with simple GFP_HIGHUSER_MOVABLE allocations. - Compaction costs have increased, but nowhere near as much as the non-THP case. Again, the patches should be worth the gained determininsm. - Patches 5 and 6 somewhat increase the number of migrate-scanned pages. This is most likely due to __GFP_NO_KSWAPD flag, which means the cached pfn's and pageblock skip bits are not reset by kswapd that often (at least in phase 3 where no concurrent activity would wake up kswapd) and the patches thus help the sync-after-async compaction. It doesn't however show that the sync compaction would help so much with success rates, which can be again seen as a limitation of the benchmark scenario. This patch (of 6): Add two tracepoints for compaction begin and end of a zone. Using this it is possible to calculate how much time a workload is spending within compaction and potentially debug problems related to cached pfns for scanning. In combination with the direct reclaim and slab trace points it should be possible to estimate most allocation-related overhead for a workload. Signed-off-by: Mel Gorman Signed-off-by: Vlastimil Babka Cc: Rik van Riel Cc: Joonsoo Kim Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Jiri Slaby commit b33660ee981e0667fdd1c3ef3902dcb1684bf64c Author: Mel Gorman Date: Thu Aug 28 19:34:26 2014 +0100 x86/mm: Eliminate redundant page table walk during TLB range flushing commit 71b54f8263860a37dd9f50f81880a9d681fd9c10 upstream. When choosing between doing an address space or ranged flush, the x86 implementation of flush_tlb_mm_range takes into account whether there are any large pages in the range. A per-page flush typically requires fewer entries than would covered by a single large page and the check is redundant. There is one potential exception. THP migration flushes single THP entries and it conceivably would benefit from flushing a single entry instead of the mm. However, this flush is after a THP allocation, copy and page table update potentially with any other threads serialised behind it. In comparison to that, the flush is noise. It makes more sense to optimise balancing to require fewer flushes than to optimise the flush itself. This patch deletes the redundant huge page check. Signed-off-by: Mel Gorman Tested-by: Davidlohr Bueso Reviewed-by: Rik van Riel Signed-off-by: Andrew Morton Cc: Alex Shi Cc: Linus Torvalds Cc: Peter Zijlstra Link: http://lkml.kernel.org/n/tip-sgei1drpOcburujPsfh6ovmo@git.kernel.org Signed-off-by: Ingo Molnar Signed-off-by: Mel Gorman Signed-off-by: Jiri Slaby commit 05f7ec4c8472b294cfec973540c9a583f66ee980 Author: Mel Gorman Date: Thu Aug 28 19:34:25 2014 +0100 x86/mm: Clean up inconsistencies when flushing TLB ranges commit 15aa368255f249df0b2af630c9487bb5471bd7da upstream. NR_TLB_LOCAL_FLUSH_ALL is not always accounted for correctly and the comparison with total_vm is done before taking tlb_flushall_shift into account. Clean it up. Signed-off-by: Mel Gorman Tested-by: Davidlohr Bueso Reviewed-by: Alex Shi Reviewed-by: Rik van Riel Signed-off-by: Andrew Morton Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Hugh Dickins Link: http://lkml.kernel.org/n/tip-Iz5gcahrgskIldvukulzi0hh@git.kernel.org Signed-off-by: Ingo Molnar Signed-off-by: Mel Gorman Signed-off-by: Jiri Slaby commit 1af7e30f9a0fdaa4e73579cf76ff14159fadb51c Author: Mel Gorman Date: Thu Aug 28 19:34:24 2014 +0100 mm, x86: Account for TLB flushes only when debugging commit ec65993443736a5091b68e80ff1734548944a4b8 upstream. Bisection between 3.11 and 3.12 fingered commit 9824cf97 ("mm: vmstats: tlb flush counters") to cause overhead problems. The counters are undeniably useful but how often do we really need to debug TLB flush related issues? It does not justify taking the penalty everywhere so make it a debugging option. Signed-off-by: Mel Gorman Tested-by: Davidlohr Bueso Reviewed-by: Rik van Riel Signed-off-by: Andrew Morton Cc: Hugh Dickins Cc: Alex Shi Cc: Linus Torvalds Cc: Peter Zijlstra Link: http://lkml.kernel.org/n/tip-XzxjntugxuwpxXhcrxqqh53b@git.kernel.org Signed-off-by: Ingo Molnar Signed-off-by: Mel Gorman Signed-off-by: Jiri Slaby commit e961e1e65ada5c41c1f03280c9929bb8c5ac3651 Author: KOSAKI Motohiro Date: Thu Aug 28 19:34:23 2014 +0100 mm: __rmqueue_fallback() should respect pageblock type commit 0cbef29a782162a3896487901eca4550bfa397ef upstream. When __rmqueue_fallback() doesn't find a free block with the required size it splits a larger page and puts the rest of the page onto the free list. But it has one serious mistake. When putting back, __rmqueue_fallback() always use start_migratetype if type is not CMA. However, __rmqueue_fallback() is only called when all of the start_migratetype queue is empty. That said, __rmqueue_fallback always puts back memory to the wrong queue except try_to_steal_freepages() changed pageblock type (i.e. requested size is smaller than half of page block). The end result is that the antifragmentation framework increases fragmenation instead of decreasing it. Mel's original anti fragmentation does the right thing. But commit 47118af076f6 ("mm: mmzone: MIGRATE_CMA migration type added") broke it. This patch restores sane and old behavior. It also removes an incorrect comment which was introduced by commit fef903efcf0c ("mm/page_alloc.c: restructure free-page stealing code and fix a bug"). Signed-off-by: KOSAKI Motohiro Cc: Mel Gorman Cc: Michal Nazarewicz Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Jiri Slaby commit 0a0809a87b57b648be1dec3c311a5bfb34ea3b14 Author: KOSAKI Motohiro Date: Thu Aug 28 19:34:22 2014 +0100 mm: get rid of unnecessary overhead of trace_mm_page_alloc_extfrag() commit 52c8f6a5aeb0bdd396849ecaa72d96f8175528f5 upstream. In general, every tracepoint should be zero overhead if it is disabled. However, trace_mm_page_alloc_extfrag() is one of exception. It evaluate "new_type == start_migratetype" even if tracepoint is disabled. However, the code can be moved into tracepoint's TP_fast_assign() and TP_fast_assign exist exactly such purpose. This patch does it. Signed-off-by: KOSAKI Motohiro Acked-by: Mel Gorman Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Jiri Slaby commit cd5900b2c8e9b9a4c71541d7ec158a3b579cbb67 Author: Damien Ramonda Date: Thu Aug 28 19:34:21 2014 +0100 readahead: fix sequential read cache miss detection commit af248a0c67457e5c6d2bcf288f07b4b2ed064f1f upstream. The kernel's readahead algorithm sometimes interprets random read accesses as sequential and triggers unnecessary data prefecthing from storage device (impacting random read average latency). In order to identify sequential cache read misses, the readahead algorithm intends to check whether offset - previous offset == 1 (trivial sequential reads) or offset - previous offset == 0 (sequential reads not aligned on page boundary): if (offset - (ra->prev_pos >> PAGE_CACHE_SHIFT) <= 1UL) The current offset is stored in the "offset" variable of type "pgoff_t" (unsigned long), while previous offset is stored in "ra->prev_pos" of type "loff_t" (long long). Therefore, operands of the if statement are implicitly converted to type long long. Consequently, when previous offset > current offset (which happens on random pattern), the if condition is true and access is wrongly interpeted as sequential. An unnecessary data prefetching is triggered, impacting the average random read latency. Storing the previous offset value in a "pgoff_t" variable (unsigned long) fixes the sequential read detection logic. Signed-off-by: Damien Ramonda Reviewed-by: Fengguang Wu Acked-by: Pierre Tardy Acked-by: David Cohen Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Jiri Slaby commit 56f7f3617f64dd1530671600faf44fbe90dd2407 Author: Dan Streetman Date: Thu Aug 28 19:34:20 2014 +0100 swap: change swap_list_head to plist, add swap_avail_head commit 18ab4d4ced0817421e6db6940374cc39d28d65da upstream. Originally get_swap_page() started iterating through the singly-linked list of swap_info_structs using swap_list.next or highest_priority_index, which both were intended to point to the highest priority active swap target that was not full. The first patch in this series changed the singly-linked list to a doubly-linked list, and removed the logic to start at the highest priority non-full entry; it starts scanning at the highest priority entry each time, even if the entry is full. Replace the manually ordered swap_list_head with a plist, swap_active_head. Add a new plist, swap_avail_head. The original swap_active_head plist contains all active swap_info_structs, as before, while the new swap_avail_head plist contains only swap_info_structs that are active and available, i.e. not full. Add a new spinlock, swap_avail_lock, to protect the swap_avail_head list. Mel Gorman suggested using plists since they internally handle ordering the list entries based on priority, which is exactly what swap was doing manually. All the ordering code is now removed, and swap_info_struct entries and simply added to their corresponding plist and automatically ordered correctly. Using a new plist for available swap_info_structs simplifies and optimizes get_swap_page(), which no longer has to iterate over full swap_info_structs. Using a new spinlock for swap_avail_head plist allows each swap_info_struct to add or remove themselves from the plist when they become full or not-full; previously they could not do so because the swap_info_struct->lock is held when they change from full<->not-full, and the swap_lock protecting the main swap_active_head must be ordered before any swap_info_struct->lock. Signed-off-by: Dan Streetman Acked-by: Mel Gorman Cc: Shaohua Li Cc: Steven Rostedt Cc: Peter Zijlstra Cc: Hugh Dickins Cc: Dan Streetman Cc: Michal Hocko Cc: Christian Ehrhardt Cc: Weijie Yang Cc: Rik van Riel Cc: Johannes Weiner Cc: Bob Liu Cc: Paul Gortmaker Cc: Thomas Gleixner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Jiri Slaby commit af1f48ee7af9d57dbcb3fc19a488eae4a22f4922 Author: Dan Streetman Date: Thu Aug 28 19:34:19 2014 +0100 lib/plist: add plist_requeue commit a75f232ce0fe38bd01301899ecd97ffd0254316a upstream. Add plist_requeue(), which moves the specified plist_node after all other same-priority plist_nodes in the list. This is essentially an optimized plist_del() followed by plist_add(). This is needed by swap, which (with the next patch in this set) uses a plist of available swap devices. When a swap device (either a swap partition or swap file) are added to the system with swapon(), the device is added to a plist, ordered by the swap device's priority. When swap needs to allocate a page from one of the swap devices, it takes the page from the first swap device on the plist, which is the highest priority swap device. The swap device is left in the plist until all its pages are used, and then removed from the plist when it becomes full. However, as described in man 2 swapon, swap must allocate pages from swap devices with the same priority in round-robin order; to do this, on each swap page allocation, swap uses a page from the first swap device in the plist, and then calls plist_requeue() to move that swap device entry to after any other same-priority swap devices. The next swap page allocation will again use a page from the first swap device in the plist and requeue it, and so on, resulting in round-robin usage of equal-priority swap devices. Also add plist_test_requeue() test function, for use by plist_test() to test plist_requeue() function. Signed-off-by: Dan Streetman Cc: Steven Rostedt Cc: Peter Zijlstra Acked-by: Mel Gorman Cc: Paul Gortmaker Cc: Thomas Gleixner Cc: Shaohua Li Cc: Hugh Dickins Cc: Dan Streetman Cc: Michal Hocko Cc: Christian Ehrhardt Cc: Weijie Yang Cc: Rik van Riel Cc: Johannes Weiner Cc: Bob Liu Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Jiri Slaby commit 80e85acd070d4fce00ac5d75012e0182efbc066e Author: Dan Streetman Date: Thu Aug 28 19:34:18 2014 +0100 lib/plist: add helper functions commit fd16618e12a05df79a3439d72d5ffdac5d34f3da upstream. Add PLIST_HEAD() to plist.h, equivalent to LIST_HEAD() from list.h, to define and initialize a struct plist_head. Add plist_for_each_continue() and plist_for_each_entry_continue(), equivalent to list_for_each_continue() and list_for_each_entry_continue(), to iterate over a plist continuing after the current position. Add plist_prev() and plist_next(), equivalent to (struct list_head*)->prev and ->next, implemented by list_prev_entry() and list_next_entry(), to access the prev/next struct plist_node entry. These are needed because unlike struct list_head, direct access of the prev/next struct plist_node isn't possible; the list must be navigated via the contained struct list_head. e.g. instead of accessing the prev by list_prev_entry(node, node_list) it can be accessed by plist_prev(node). Signed-off-by: Dan Streetman Acked-by: Mel Gorman Cc: Paul Gortmaker Cc: Steven Rostedt Cc: Thomas Gleixner Cc: Shaohua Li Cc: Hugh Dickins Cc: Dan Streetman Cc: Michal Hocko Cc: Christian Ehrhardt Cc: Weijie Yang Cc: Rik van Riel Cc: Johannes Weiner Cc: Bob Liu Cc: Peter Zijlstra Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Jiri Slaby commit 75b1f2d3ed3169675b69b2f68217ebb839414657 Author: Dan Streetman Date: Thu Aug 28 19:34:17 2014 +0100 swap: change swap_info singly-linked list to list_head commit adfab836f4908deb049a5128082719e689eed964 upstream. The logic controlling the singly-linked list of swap_info_struct entries for all active, i.e. swapon'ed, swap targets is rather complex, because: - it stores the entries in priority order - there is a pointer to the highest priority entry - there is a pointer to the highest priority not-full entry - there is a highest_priority_index variable set outside the swap_lock - swap entries of equal priority should be used equally this complexity leads to bugs such as: https://lkml.org/lkml/2014/2/13/181 where different priority swap targets are incorrectly used equally. That bug probably could be solved with the existing singly-linked lists, but I think it would only add more complexity to the already difficult to understand get_swap_page() swap_list iteration logic. The first patch changes from a singly-linked list to a doubly-linked list using list_heads; the highest_priority_index and related code are removed and get_swap_page() starts each iteration at the highest priority swap_info entry, even if it's full. While this does introduce unnecessary list iteration (i.e. Schlemiel the painter's algorithm) in the case where one or more of the highest priority entries are full, the iteration and manipulation code is much simpler and behaves correctly re: the above bug; and the fourth patch removes the unnecessary iteration. The second patch adds some minor plist helper functions; nothing new really, just functions to match existing regular list functions. These are used by the next two patches. The third patch adds plist_requeue(), which is used by get_swap_page() in the next patch - it performs the requeueing of same-priority entries (which moves the entry to the end of its priority in the plist), so that all equal-priority swap_info_structs get used equally. The fourth patch converts the main list into a plist, and adds a new plist that contains only swap_info entries that are both active and not full. As Mel suggested using plists allows removing all the ordering code from swap - plists handle ordering automatically. The list naming is also clarified now that there are two lists, with the original list changed from swap_list_head to swap_active_head and the new list named swap_avail_head. A new spinlock is also added for the new list, so swap_info entries can be added or removed from the new list immediately as they become full or not full. This patch (of 4): Replace the singly-linked list tracking active, i.e. swapon'ed, swap_info_struct entries with a doubly-linked list using struct list_heads. Simplify the logic iterating and manipulating the list of entries, especially get_swap_page(), by using standard list_head functions, and removing the highest priority iteration logic. The change fixes the bug: https://lkml.org/lkml/2014/2/13/181 in which different priority swap entries after the highest priority entry are incorrectly used equally in pairs. The swap behavior is now as advertised, i.e. different priority swap entries are used in order, and equal priority swap targets are used concurrently. Signed-off-by: Dan Streetman Acked-by: Mel Gorman Cc: Shaohua Li Cc: Hugh Dickins Cc: Dan Streetman Cc: Michal Hocko Cc: Christian Ehrhardt Cc: Weijie Yang Cc: Rik van Riel Cc: Johannes Weiner Cc: Bob Liu Cc: Steven Rostedt Cc: Peter Zijlstra Cc: Paul Gortmaker Cc: Thomas Gleixner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Jiri Slaby commit f14c889d082607a242c3f960c57937f6f659fb9d Author: Michal Hocko Date: Thu Aug 28 19:34:15 2014 +0100 mm: exclude memoryless nodes from zone_reclaim commit 70ef57e6c22c3323dce179b7d0d433c479266612 upstream. We had a report about strange OOM killer strikes on a PPC machine although there was a lot of swap free and a tons of anonymous memory which could be swapped out. In the end it turned out that the OOM was a side effect of zone reclaim which wasn't unmapping and swapping out and so the system was pushed to the OOM. Although this sounds like a bug somewhere in the kswapd vs. zone reclaim vs. direct reclaim interaction numactl on the said hardware suggests that the zone reclaim should not have been set in the first place: node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 node 0 size: 0 MB node 0 free: 0 MB node 2 cpus: node 2 size: 7168 MB node 2 free: 6019 MB node distances: node 0 2 0: 10 40 2: 40 10 So all the CPUs are associated with Node0 which doesn't have any memory while Node2 contains all the available memory. Node distances cause an automatic zone_reclaim_mode enabling. Zone reclaim is intended to keep the allocations local but this doesn't make any sense on the memoryless nodes. So let's exclude such nodes for init_zone_allows_reclaim which evaluates zone reclaim behavior and suitable reclaim_nodes. Signed-off-by: Michal Hocko Acked-by: David Rientjes Acked-by: Nishanth Aravamudan Tested-by: Nishanth Aravamudan Acked-by: Mel Gorman Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Jiri Slaby commit a32d86748d0b0de86eddeb4f544a41caa413000c Author: Nishanth Aravamudan Date: Thu Aug 28 19:34:14 2014 +0100 hugetlb: ensure hugepage access is denied if hugepages are not supported commit 457c1b27ed56ec472d202731b12417bff023594a upstream. Currently, I am seeing the following when I `mount -t hugetlbfs /none /dev/hugetlbfs`, and then simply do a `ls /dev/hugetlbfs`. I think it's related to the fact that hugetlbfs is properly not correctly setting itself up in this state?: Unable to handle kernel paging request for data at address 0x00000031 Faulting instruction address: 0xc000000000245710 Oops: Kernel access of bad area, sig: 11 [#1] SMP NR_CPUS=2048 NUMA pSeries .... In KVM guests on Power, in a guest not backed by hugepages, we see the following: AnonHugePages: 0 kB HugePages_Total: 0 HugePages_Free: 0 HugePages_Rsvd: 0 HugePages_Surp: 0 Hugepagesize: 64 kB HPAGE_SHIFT == 0 in this configuration, which indicates that hugepages are not supported at boot-time, but this is only checked in hugetlb_init(). Extract the check to a helper function, and use it in a few relevant places. This does make hugetlbfs not supported (not registered at all) in this environment. I believe this is fine, as there are no valid hugepages and that won't change at runtime. [akpm@linux-foundation.org: use pr_info(), per Mel] [akpm@linux-foundation.org: fix build when HPAGE_SHIFT is undefined] Signed-off-by: Nishanth Aravamudan Reviewed-by: Aneesh Kumar K.V Acked-by: Mel Gorman Cc: Randy Dunlap Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Jiri Slaby commit 75858faa45de3bef7519b046f5453c2d7a3527c7 Author: Hugh Dickins Date: Thu Aug 28 19:34:13 2014 +0100 mm: fix bad rss-counter if remap_file_pages raced migration commit 887843961c4b4681ee993c36d4997bf4b4aa8253 upstream. Fix some "Bad rss-counter state" reports on exit, arising from the interaction between page migration and remap_file_pages(): zap_pte() must count a migration entry when zapping it. And yes, it is possible (though very unusual) to find an anon page or swap entry in a VM_SHARED nonlinear mapping: coming from that horrid get_user_pages(write, force) case which COWs even in a shared mapping. Signed-off-by: Hugh Dickins Tested-by: Sasha Levin sasha.levin@oracle.com> Tested-by: Dave Jones davej@redhat.com> Cc: Cyrill Gorcunov Cc: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Jiri Slaby commit 73210d0adac4c6f283eb825e85febdc12dbd3e36 Author: Han Pingtian Date: Thu Aug 28 19:34:12 2014 +0100 mm: prevent setting of a value less than 0 to min_free_kbytes commit da8c757b080ee84f219fa2368cb5dd23ac304fc0 upstream. If echo -1 > /proc/vm/sys/min_free_kbytes, the system will hang. Changing proc_dointvec() to proc_dointvec_minmax() in the min_free_kbytes_sysctl_handler() can prevent this to happen. mhocko said: : You can still do echo $BIG_VALUE > /proc/vm/sys/min_free_kbytes and make : your machine unusable but I agree that proc_dointvec_minmax is more : suitable here as we already have: : : .proc_handler = min_free_kbytes_sysctl_handler, : .extra1 = &zero, : : It used to work properly but then 6fce56ec91b5 ("sysctl: Remove references : to ctl_name and strategy from the generic sysctl table") has removed : sysctl_intvec strategy and so extra1 is ignored. Signed-off-by: Han Pingtian Acked-by: Michal Hocko Acked-by: David Rientjes Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Jiri Slaby commit 3ddc614ca27c84a91d3b51ab57e50c68fdf4889b Author: Joonsoo Kim Date: Thu Aug 28 19:34:11 2014 +0100 slab: correct pfmemalloc check commit 73293c2f900d0adbb6a415b312cd57976d5ae242 upstream. We checked pfmemalloc by slab unit, not page unit. You can see this in is_slab_pfmemalloc(). So other pages don't need to be set/cleared pfmemalloc. And, therefore we should check pfmemalloc in page flag of first page, but current implementation don't do that. virt_to_head_page(obj) just return 'struct page' of that object, not one of first page, since the SLAB don't use __GFP_COMP when CONFIG_MMU. To get 'struct page' of first page, we first get a slab and try to get it via virt_to_head_page(slab->s_mem). Acked-by: Andi Kleen Signed-off-by: Joonsoo Kim Signed-off-by: Pekka Enberg Signed-off-by: Mel Gorman Signed-off-by: Jiri Slaby commit fc2dd02e0959cb07fd49ab57662ad755d5b3fed2 Author: Bob Liu Date: Thu Aug 28 19:34:10 2014 +0100 mm: thp: khugepaged: add policy for finding target node commit 9f1b868a13ac36bd207a571f5ea1193d823ab18d upstream. Khugepaged will scan/free HPAGE_PMD_NR normal pages and replace with a hugepage which is allocated from the node of the first scanned normal page, but this policy is too rough and may end with unexpected result to upper users. The problem is the original page-balancing among all nodes will be broken after hugepaged started. Thinking about the case if the first scanned normal page is allocated from node A, most of other scanned normal pages are allocated from node B or C.. But hugepaged will always allocate hugepage from node A which will cause extra memory pressure on node A which is not the situation before khugepaged started. This patch try to fix this problem by making khugepaged allocate hugepage from the node which have max record of scaned normal pages hit, so that the effect to original page-balancing can be minimized. The other problem is if normal scanned pages are equally allocated from Node A,B and C, after khugepaged started Node A will still suffer extra memory pressure. Andrew Davidoff reported a related issue several days ago. He wanted his application interleaving among all nodes and "numactl --interleave=all ./test" was used to run the testcase, but the result wasn't not as expected. cat /proc/2814/numa_maps: 7f50bd440000 interleave:0-3 anon=51403 dirty=51403 N0=435 N1=435 N2=435 N3=50098 The end result showed that most pages are from Node3 instead of interleave among node0-3 which was unreasonable. This patch also fix this issue by allocating hugepage round robin from all nodes have the same record, after this patch the result was as expected: 7f78399c0000 interleave:0-3 anon=51403 dirty=51403 N0=12723 N1=12723 N2=13235 N3=12722 The simple testcase is like this: int main() { char *p; int i; int j; for (i=0; i < 200; i++) { p = (char *)malloc(1048576); printf("malloc done\n"); if (p == 0) { printf("Out of memory\n"); return 1; } for (j=0; j < 1048576; j++) { p[j] = 'A'; } printf("touched memory\n"); sleep(1); } printf("enter sleep\n"); while(1) { sleep(100); } } [akpm@linux-foundation.org: make last_khugepaged_target_node local to khugepaged_find_target_node()] Reported-by: Andrew Davidoff Tested-by: Andrew Davidoff Signed-off-by: Bob Liu Cc: Andrea Arcangeli Cc: Kirill A. Shutemov Cc: Mel Gorman Cc: Yasuaki Ishimatsu Cc: Wanpeng Li Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Jiri Slaby commit c3bd31a1700393488d631dce29e38a71bf1c413f Author: Bob Liu Date: Thu Aug 28 19:34:09 2014 +0100 mm: thp: cleanup: mv alloc_hugepage to better place commit 10dc4155c7714f508fe2e4667164925ea971fb25 upstream. Move alloc_hugepage() to a better place, no need for a seperate #ifndef CONFIG_NUMA Signed-off-by: Bob Liu Reviewed-by: Yasuaki Ishimatsu Acked-by: Kirill A. Shutemov Cc: Andrea Arcangeli Cc: Mel Gorman Cc: Andrew Davidoff Cc: Wanpeng Li Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Jiri Slaby