
Intel's XScale Microarchitecture processors provide a Performance
Monitoring Unit (PMU) that can be utilized to provide information
that can be useful for fine tuning of code.  This text file describes
the API that's been developed for use by Linux kernel programmers.
When I have some extra time on my hand, I will extend the code to
provide support for user mode performance monitoring (which is
probably much more useful).  Note that to get the most usage out
of the PMU, I highly reccomend getting the XScale reference manual
from Intel and looking at chapter 12.

To use the PMU, you must #include <asm/xscale-pmu.h> in your source file.

Since there's only one PMU, only one user can currently use the PMU
at a given time.  To claim the PMU for usage, call pmu_claim() which
returns an identifier.  When you are done using the PMU, call
pmu_release() with the identifier that you were given by pmu_claim.

In addition, the PMU can only be used on XScale based systems that
provide an external timer.  Systems that the PMU is currently supported
on are:

	- Cyclone IQ80310

Before delving into how to use the PMU code, let's do a quick overview
of the PMU itself.  The PMU consists of three registers that can be
used for performance measurements.  The first is the CCNT register with
provides the number of clock cycles elapsed since the PMU was started.
The next two register, PMN0 and PMN1, are eace user programmable to
provide 1 of 20 different performance statistics.  By combining different
statistics, you can derive complex performance metrics.

To start the PMU, just call pmu_start(pm0, pmn1).  pmn0 and pmn1 tell
the PMU what statistics to capture and can each be one of:

EVT_ICACHE_MISS
	Instruction fetches requiring access to external memory

EVT_ICACHE_NO_DELIVER
	Instruction cache could not deliver an instruction.  Either an
	ICACHE miss or an instruction TLB miss.

EVT_ICACHE_DATA_STALL
	Stall in execution due to a data dependency. This counter is
	incremented each cycle in which the condition is present.

EVT_ITLB_MISS
	Instruction TLB miss

EVT_DTLB_MISS
	Data TLB miss

EVT_BRANCH
	A branch instruction was executed and it may or may not have
	changed program flow

EVT_BRANCH_MISS
	A branch (B or BL instructions only) was mispredicted

EVT_INSTRUCTION
	An instruction was executed

EVT_DCACHE_FULL_STALL
	Stall because data cache buffers are full.  Incremented on every
	cycle in which condition is present.

EVT_DCACHE_FULL_STALL_CONTIG
	Stall because data cache buffers are full.  Incremented on every
	cycle in which condition is contigous.

EVT_DCACHE_ACCESS
	Data cache access (data fetch)

EVT_DCACHE_MISS
	Data cache miss

EVT_DCACHE_WRITE_BACK
	Data cache write back.  This counter is incremented for every
	1/2 line (four words) that are written back.

EVT_PC_CHANGED
	Software changed the PC.  This is incremented only when the
	software changes the PC and there is no mode change.  For example,
	a MOV instruction that targets the PC would increment the counter.
	An SWI would not as it triggers a mode change.

EVT_BCU_REQUEST
	The Bus Control Unit(BCU) received a request from the core

EVT_BCU_FULL
	The BCU request queue if full.  A high value for this event means
	that the BCU is often waiting for to complete on the external bus.

EVT_BCU_DRAIN
	The BCU queues were drained due to either a Drain Write Buffer
	command or an I/O transaction for a page that was marked as
	uncacheable and unbufferable.

EVT_BCU_ECC_NO_ELOG
	The BCU detected an ECC error on the memory bus but noe ELOG
	register was available to to log the errors.

EVT_BCU_1_BIT_ERR
	The BCU detected a 1-bit error while reading from the bus.

EVT_RMW
	An RMW cycle occurred due to narrow write on ECC protected memory.

To get the results back, call pmu_stop(&results) where results is defined
as a struct pmu_results:

	struct pmu_results
	{
		u32     ccnt;	/* Clock Counter Register */
		u32	ccnt_of; /
		u32     pmn0;	/* Performance Counter Register 0 */
		u32	pmn0_of;
		u32     pmn1;	/* Performance Counter Register 1 */
		u32	pmn1_of;
	};

Pretty simple huh?  Following are some examples of how to get some commonly
wanted numbers out of the PMU data.  Note that since you will be dividing
things, this isn't super useful from the kernel and you need to printk the
data out to syslog.  See [1] for more examples.

Instruction Cache Efficiency

	pmu_start(EVT_INSTRUCTION, EVT_ICACHE_MISS);
	...
	pmu_stop(&results);

	icache_miss_rage = results.pmn1 / results.pmn0;
	cycles_per_instruction = results.ccnt / results.pmn0;

Data Cache Efficiency

	pmu_start(EVT_DCACHE_ACCESS, EVT_DCACHE_MISS);
	...
	pmu_stop(&results);

	dcache_miss_rage = results.pmn1 / results.pmn0;

Instruction Fetch Latency

	pmu_start(EVT_ICACHE_NO_DELIVER, EVT_ICACHE_MISS);
	...
	pmu_stop(&results);

	average_stall_waiting_for_instruction_fetch =
		results.pmn0 / results.pmn1;

	percent_stall_cycles_due_to_instruction_fetch =
		results.pmn0 / results.ccnt;


ToDo:

- Add support for usermode PMU usage.  This might require hooking into
  the scheduler so that we pause the PMU when the task that requested
  statistics is scheduled out.

--
This code is still under development, so please feel free to send patches,
questions, comments, etc to me.

Deepak Saxena <dsaxena@mvista.com>

