Loop unrolling
Sum = 0;
For ( i = 0; i < 3; i ++)
Sum += 5;
===================
Mov R0, #0 @ i
Loop:
Cmp r0, #3
Beq next
Add r1, #5
Add r0, #1
B loop
=================== or
Unroll the loop
Add r1, #5
Add r1, #5
Add r1, #5
(if we already know that it will run 3 times)
Variation for arrays
Standard approach
For (i = 0; i < n; i ++)
Sum += d[i];
===================
Partially unrolled
For ( i = 0; i<n l i+=2)
{
Sum += d[i];
Sum += d[i+1];
}
// assume n is even number
Avoid:
Repeatedly calculating something that doesn’t change
For ( i = 0; i< strlen(s); i ++)
if( ‘0’ <= s[i] && s[i] <= ‘9’ )
Count ++;
Should do
Int len = strlen(s);
For ( i = 0; i< len; i ++)
if( ‘0’ <= s[i] && s[i] <= ‘9’ )
Count ++;
EX2
For ( i=0; i< count; i++)
Sum += i/sqrt(5);
Replace with:
Double s5= sqrt(5);
For ( i = 0; i < count; i++)
Sum += i / s5;
Reduce calls to memory
(Caching may reduce the cost)
For ( i=1; i<n; i++)
D[k] += i; //k isn't changing
//d[k] == *(d +k)
Replace with
Sum = 0;
for(i = 1; i<n; i++)
sum+=i;
D[k] = sum;
- Any non branch command treat as 1, else branch is 2, conditional branch is 3 or 1 depending on execution
Interrupts and exception
- Some authors treat the words as synonyms
- They alter the normal flow of the program
- Somewhat like a conditional branch, in terms of how they affect the pipeline
From Patterson p336
ARMv8 uses these terms
- System reset → Exception
- (not bad) → I/O Device Request → interrupt
- Invoke O/S from user program → exception (like creating or removing directories)
- Floating-point over/under-flow → exception
- Undefined instruction → exception
- Hardware malfunction → Exception or interrupt
From Sloss, p 317 (20 y)
“ARM defines an interrupt as a special kind of exception”
Stack
Common analogy → plate dispenser at buffet
Push
Push
Works the same way like how the functions being called looks
(this is just a visualization of data)
- ARM allows you to create your own stack with an array, but we will use The Stack
Stack models
- When something is pushed onto the stack, do addresses increase or decrease?
- Increasing → addresses get bigger (# go up)
- Decreasing → addresses get smaller (# go down)
- Where does the stack pointer point after a push? 3. A full stack, if SP points at top value of stack 4. A empty stack, if SP points at next available spot
- The Stack, ARM, is full, descending
Examples of stacks
Ex.s
.global_start
_start:
Mov r1, #99
Mov r2, #123
Mov r3, #444
Push {r1} @ this is a shortcut of str r1, [sp, #-4]! But dont actually write this
Push {r2}
Pop {r3} @ takes what SP is pointing at and puts it into r3, then after change sp
End:
Mov r7, #1
Swi 0
In gdb
Info r sp (tells us sp, address and val)
X /wd address (examine as a word at this address)
This is legal
Push
Equivalent to…
Push
Equal to pushing R8, then R5, then R1
Lowest register numbers are pushed to the lowest addresses
Same logic applies when you pop
.global_start
_start:
Mov r1, #99
Mov r2, #123
Mov r3, #444
Push {r1-r3} (equivlent to push r1, r2 r3)
Pop {R3} @ take r1 → r3
Pop {R1} @ r2 → r1
Pop {R2} @ r3 → r2
End:
Mov r7, #1
Swi 0
- func1(R0, R1, R2, R3, Stack(#5), Stack(#6))
- Which order is it?
- When you pop #5 it comes out first
- Then you can pop #6 to get to the last element
Hw8
How he did shift by 2 left
A =
Means that we are copying
A=
Shifts left by 2, and copies extra 1 and extra 2 where 3 and 4 were
If we have 4 values, how many are copied to a new position, total amt - shift amt
4 val - 2 = 2 copies
If we have a loop that goes through our array we only need to loop **copy amt of times **
In c terms
D[0] = d[[2]
D[1] = d[3]
d[2] = d[4]
We start at the beginning and have a count
A =
0 1 2 3
We go to 1, when we are at 1 we only have 1 copy more to go, 4 - 3 = 1 away from next copy, then add one to that and then take the copy value and put in pos 0, repeat for next inst
So basically Total amt -shift amt left + 1 = copy val
Second func.
Count matches
Given 2 strings starting from left, we want to count how many consecutive chars match. For 2 strings being compared.
(we dont include null to the match)
(we dont restart the count after consecutive chars have already gotten a number.)
“abcdFEG” == “avcDFEG” ⇒ 4 char matching
Example similar to hw
.global _start
_start:
bl sub
End:
Mov r7, #1
swi 0
sub:
Mov r1, #3000000
Mov r2, #0x5000
Push {r1, r2. Lr}
Cache
2012 values from Patterson
Tech | Access times | $ per gib ← Gib → 2^30 bytes | Space requirements per sq inch | |
cache | SRAM | .5-2.5ns | $500-$1000 | Highest |
mem | DRAM | 50-70ns | $10-$20 | |
usb/ssd | flash | 5k-50kns | $.75-$1.00 | |
HDD | Mag disk | 5m-20m ns | $.05-$.10 | lowest |
1980 | 64-Kibibit | $1.5million/gib | 250ns Access time |
2012 | 4-gigibit | $1/gib | 35ns |
2^10 = 1024 != 10^3 = 1000
Kibi kilo
Mibi mega
Gibi giga
Goal: appear to have,
fast, **cheap **memory
Tend to go in diff directions
How do we achieve our goal of appearing to have fast, cheap memory?
2 principles
- Spacial locality
- When you request #2, computer grabs all of those next to it in memory
- Move from main mem → cache
- Temporal locality
- When you keep on calling sum a lot, we put in cache from memory. Because we will use it for a certain amount of time.
- Try to figure out what you need in near future and have it on hand