Performance Analysis Guide
Performance Analysis Guide for Intel®
Core™ i7 Processor and Intel® Xeon™ 5500
processors
By Dr David Levinthal PhD.
Version 1.0
INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL®
PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO
ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS
PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL
ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR
IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS
INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR
PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR
OTHER INTELLECTUAL PROPERTY RIGHT.
UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT
DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE
INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH
MAY OCCUR.
Intel may make changes to specifications and product descriptions at any time, without
notice. Designers must not rely on the absence or characteristics of any features or
instructions marked "reserved" or "undefined." Intel reserves these for future definition
and shall have no responsibility whatsoever for conflicts or incompatibilities arising from
future changes to them. The information here is subject to change without notice. Do not
finalize a design with this information.
Intel® Hyper-Threading Technology requires a computer system with an Intel® processor
supporting Hyper-Threading Technology and an Intel® HT Technology enabled chipset,
BIOS and operating system. Performance will vary depending on the specific hardware
and software you use. For more information, see
http://www.intel.com/technology/hyperthread/index.htm; including details on which
processors support Intel HT Technology.
The products described in this document may contain design defects or errors known as
errata which may cause the product to deviate from published specifications. Current
characterized errata are available on request.
Intel, Pentium, Intel Xeon, Intel NetBurst, Intel Core Solo, Intel Core Duo, Intel Core 2 Duo,
Intel Core 2 Extreme, Intel Pentium D, Itanium, Intel SpeedStep, MMX, Intel Atom, and
VTune are trademarks or registered trademarks of Intel Corporation or its subsidiaries in
the United States and other countries.
*Other names and brands may be claimed as the property of others.
Contact your local Intel sales office or your distributor to obtain the latest specifications
and before placing your product order.
Copies of documents which have an order number and are referenced in this document, or
other Intel literature, may be obtained by calling 1-800-548-4725, or by visiting
Intel's Web Site.
Copyright © 2008-2009 Intel Corporation
Introduction
Basic Intel® Core™ i7 Processor and Intel® Xeon™ 5500 Processor Architecture and Performance Analysis
Core Out of Order Pipeline
Core Memory Subsystem
Uncore Memory Subsystem
Overview
Intel® Xeon™ 5500 Processor
Core Performance Monitoring Unit (PMU)
Uncore Performance Monitoring Unit (PMU)
Performance Analysis and the Intel® Core™ i7 Processor and Intel® Xeon™ 5500 processor Performance Events: Overview
Cycle Accounting and Uop Flow
Branch mispredictions, Wasted Work, Misprediction Penalties and UOP Flow
Stall Decomposition Overview
Measuring Penalties
Core Precise Events
Overview
Precise Memory Access Events
Latency Event
Precise Execution Events
Shadowing
Loop Tripcounts
Last Branch Record (LBR)
Non-PEBS Core Memory Access Events
Bandwidth per core
L1D, L2 Cache Access and More Offcore events
Store Forwarding
Front End Events
Branch Mispredictions
FE Code Generation Metrics
Microcode and Exceptions
Uncore Performance Events
The Global Queue
L3 CACHE Events
Intel® QuickPath Interconnect Home Logic (QHL)
Integrated Memory Controller (IMC)
Intel® QuickPath Interconnect Home Logic Opcode Matching
Measuring Bandwidth From the Uncore
Conclusion:
Intel® Core™ i7 Processors and Intel® Xeon™ 5500 Processors open a new class of performance analysis capabilities
Appendix 1
Profiles
General Exploration
Branch Analysis
Cycles and Uops
Memory Access
False-True Sharing
FE Investigation
Working Set
Loop Analysis with call sites
Client Analysis with/without call sites
Appendix II PMU Programming
Introduction
With the introduction of the Intel® Core™ i7 processor and Intel® Xeon™ 5500
processors, mass market computing enters a new era and with it a new need for
performance analysis techniques and capabilities. The performance monitoring unit
(PMU) of the processor has progressed in step, providing a wide variety of new
capabilities to illuminate the code interaction with the architecture.
In this paper I will discuss the basic performance analysis methodology that applies to
Intel® Core™ i7 processors and to platforms that support Non-Uniform Memory Access
(NUMA) using two Intel® Xeon™ 5500 processors. Because Intel® Xeon™ 5500
processors are based on the same microarchitecture as the Intel® Core™ i7 processor,
the events and methodology described for the Intel® Core™ i7 processor apply to them
as well. Thus statements made only about Intel® Core™ i7 processors in this document
also apply to Intel® Xeon™ 5500 processor based systems. The discussion starts with
extensions to the basic cycle accounting methodology outlined for Intel® Core™2
processors (1) and also covers both the NUMA-specific capabilities and the large
extension to precise event based sampling (PEBS).
Software optimization based on performance analysis of large existing
applications, in most cases, reduces to optimizing the code generation by the compiler
and optimizing the memory access. This paper will focus on this approach. Optimizing
the code generation by the compiler requires inspecting the assembly of the
time-consuming parts of the application and verifying that the compiler generated a
reasonable code stream. Optimizing the memory access is a complex issue involving the
bandwidth and latency capabilities of the platform, hardware and software prefetching
efficiencies and the virtual address layout of the heavily accessed variables. Memory
access is where the NUMA nature of the Intel® Core™ i7 processor based platforms
becomes an issue.
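To make the NUMA point concrete, here is a minimal sketch of how the average memory latency seen by a thread shifts as more of its accesses go to memory attached to the other socket. The function name and the latency figures are hypothetical placeholders chosen for illustration only; they are not measurements from this guide, and real values must be measured on the target platform.

```python
def average_latency_ns(local_fraction, local_ns, remote_ns):
    """Weighted average latency for a mix of local and remote accesses.

    local_fraction: share of accesses served by the local memory controller.
    local_ns, remote_ns: per-access latencies (hypothetical inputs).
    """
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    return local_fraction * local_ns + (1.0 - local_fraction) * remote_ns


# Hypothetical latencies, for illustration only.
LOCAL_NS = 60.0
REMOTE_NS = 100.0

all_local = average_latency_ns(1.0, LOCAL_NS, REMOTE_NS)
half_remote = average_latency_ns(0.5, LOCAL_NS, REMOTE_NS)
print(all_local, half_remote)
```

The model ignores bandwidth limits and prefetching, both of which the text names as part of the problem. It only shows why the placement of heavily accessed data relative to the accessing core matters on a two-socket system.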
Performance analysis illuminates how the existing invocation of an algorithm
executes. It allows a software developer to improve the performance of that invocation. It
does not offer much insight about how to change an algorithm, as that really requires a
better understanding of the problem being solved rather than the performance of the
existing solution. That being said, the performance gains that can be achieved on a large
existing code base can regularly exceed a factor of 2 (particularly in HPC), which is
certainly worth the comparatively small effort required.
Basic Intel® Core™ i7 Processor and Intel® Xeon™ 5500
Processor Architecture and Performance Analysis
Performance analysis on a microarchitecture is the experimental investigation of
the microarchitecture's response to a given instruction and data stream. As such, a
reasonable understanding of the microarchitecture is required to understand what is
actually being measured with the performance events that are available.
This section will cover the basics of the Intel® Core™ i7 processor and Intel®
Xeon™ 5500 processor architecture. It is not meant to be complete but merely the
briefest of introductions. For more details the reader should consult the Software
Developers Programming Optimization Guide. This introduction is broken into the
following sections:
1) Overview
2) Core out of order pipeline
3) Core memory subsystem
4) Uncore overview
5) Last Level Cache and Integrated memory controller
6) Intel® QuickPath Interconnect (Intel QPI)
7) Core and Uncore PMUs
Overview
The Intel® Core™ i7 Processor and Intel® Xeon™ 5500 processors are multi-core,
Intel® Hyper-Threading Technology (HT) enabled designs. Each socket has one to
eight cores, which share a last level cache (L3 CACHE), a local integrated memory
controller and an Intel® QuickPath Interconnect. Thus a two-socket platform with
quad-core sockets might be drawn as:
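[Figure 1: Block diagram of a two-socket platform. Labeled components: two sockets, each with cores C0-C3, an 8M LLC, an IMC with attached DDR3, and QPI links; an I/O Hub; discrete graphics (Discrete Gfx).]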
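As a rough illustration of the two-socket, quad-core platform described above, the following sketch tallies the logical processors an operating system would see and the LLC capacity available per core. The class and function names are hypothetical. The two hardware threads per core assume HT is enabled, and the 8 MiB LLC size is taken from the figure labels.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Socket:
    cores: int             # physical cores on the socket
    threads_per_core: int  # 2 assumes Hyper-Threading is enabled
    llc_mib: int           # last level cache shared by all cores, in MiB

    @property
    def logical_cpus(self) -> int:
        # Each physical core exposes threads_per_core logical processors.
        return self.cores * self.threads_per_core

    @property
    def llc_mib_per_core(self) -> float:
        # Even split of the shared LLC; a real workload may share it unevenly.
        return self.llc_mib / self.cores


def platform_logical_cpus(sockets) -> int:
    """Total logical processors across all sockets."""
    return sum(s.logical_cpus for s in sockets)


# Two identical quad-core sockets, as in the example platform.
platform = [Socket(cores=4, threads_per_core=2, llc_mib=8) for _ in range(2)]
total_logical = platform_logical_cpus(platform)
print(total_logical, platform[0].llc_mib_per_core)
```

Keeping the socket as the unit of the model mirrors the hardware: the LLC and the memory controller are per-socket resources, so per-core counts alone do not describe what the cores compete for.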