So this is something that bit me in the ass pretty hard recently, until someone far more experienced than me on Twitter was able to walk me through optimizing my crappy code, and so I figured I’d make a blog post about it.
In Verilog, there’s a concept of a memory. A memory is basically an array of registers, it looks like this:
reg [n-1:0] myMemory [0:m-1];
It’s called a memory because, well, that’s pretty much exactly what it looks like: a big array of n-wide registers accessible by a memory address from 0 to m – 1. While this looks like a plain old array, there’s more to it than that and this is what gave me some heartburn.
The thing is, most synthesis software is actually pretty good at optimizing this. Normally, if it can get away with it, it will implement the “myMemory” register array above in on-chip RAM blocks. That’s fast, it doesn’t take up logic slices, and it’s generally preferable when you’ve got LOTS of those registers.
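As a concrete example, here’s a minimal sketch of the kind of access pattern synthesis tools will happily map onto a RAM block – one synchronous read and at most one write per clock cycle. (The module and signal names here are my own, not from any real design.)

```verilog
// A 16-bit-wide, 256-entry memory with one synchronous read and at
// most one write per clock -- an access pattern most synthesis tools
// will infer as an on-chip RAM block.
module inferred_ram (
    input  wire        clk,
    input  wire        we,
    input  wire [7:0]  addr,
    input  wire [15:0] din,
    output reg  [15:0] dout
);
    reg [15:0] mem [0:255];

    always @(posedge clk) begin
        if (we)
            mem[addr] <= din;   // single write port
        dout <= mem[addr];      // registered (synchronous) read
    end
endmodule
```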
The problem is that if you aren’t careful with how you access them, you can very easily break your software’s ability to prove that it CAN implement the memory in on-chip RAM, and it will instead resort to things like implementing it as a HUGE array of flip-flops. This is exactly what I was experiencing while I was working on the VCP.
Thing is, my VCP draws one line at a time. Rather than try to have the VCP draw everything on demand right as each pixel is needed, I used linebuffers so that it can just prepare a whole line of pixels in advance. Actually, several linebuffers are used in my VCP implementation – multiple tile and sprite plane buffers are prepared and then all muxed together into the final linebuffer according to layering rules.
Here’s the thing about that: the way I had implemented my VCP, a single 16-bit word read from VRAM encodes four pixels, each a 4-bit index into a 16-color row of the color palette. So what I was doing was just blasting all four pixels directly into the target line buffer as soon as the word was read from memory.
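In code, the write pattern looked roughly like this. (Signal names and buffer sizes are illustrative, not the actual VCP source.)

```verilog
reg [3:0] linebuf [0:319];   // one 4-bit palette index per pixel

always @(posedge clk) begin
    if (word_ready) begin
        // Four writes into the same memory in a single clock cycle --
        // this is exactly what breaks RAM inference.
        linebuf[x + 0] <= vram_word[15:12];
        linebuf[x + 1] <= vram_word[11:8];
        linebuf[x + 2] <= vram_word[7:4];
        linebuf[x + 3] <= vram_word[3:0];
    end
end
```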
This was apparently a horrible idea.
In order for Quartus to infer on-chip RAM, it needs to be able to safely prove that the memory can be implemented with a single address and data port – and that means only one write per clock cycle. Doing multiple writes like this completely violates that, so Quartus instead resorted to huge flip-flop dumps, which took up easily 60% of my FPGA’s logic resources. Bad bad bad.
So instead, I had to completely restructure how the writes were done. My approach was to stick the 16-bit word into one buffer (which, itself, could be synthesized as RAM, since it only takes one write per cycle) and then read one 4-bit value per clock cycle out of it into the target line buffer (which, once again, can be synthesized as RAM, since there’s one read and one write per clock cycle). This took a bit of effort and a bit of proving that nothing overlapped in bad ways, but I finally got it working. New logic utilization: 15%. Whew! That’s WITH two independent scrolling tilemap layers and 128 sprites. I also haven’t fully optimized everything – I’m pretty sure the tile fetch buffer (which fetches 41 tiles each scanline) is being inferred as flops right now, as is the color palette – but this is still a huge savings and let me continue on with the project without worrying about running out of room just yet.
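The restructured datapath looks roughly like this sketch. (Again, names and sizes are my own inventions; the write side of the staging buffer, reset, and pipeline priming are all omitted.)

```verilog
// The staging buffer takes one 16-bit write per cycle elsewhere in the
// design; here it gets one read per cycle, and the line buffer gets one
// 4-bit write per cycle.  With at most one write per memory per clock,
// both can be inferred as on-chip RAM.
reg [15:0] wordbuf [0:79];   // staged VRAM words
reg [3:0]  linebuf [0:319];  // final line of 4-bit pixel indices

reg [6:0]  word_addr;        // which staged word we're draining
reg [1:0]  nibble;           // which pixel within that word
reg [8:0]  pix_addr;         // line buffer write position
reg [15:0] word_q;           // registered read data from wordbuf

// Mux out one 4-bit pixel from the registered word.
wire [3:0] pixel = (nibble == 2'd0) ? word_q[15:12] :
                   (nibble == 2'd1) ? word_q[11:8]  :
                   (nibble == 2'd2) ? word_q[7:4]   :
                                      word_q[3:0];

always @(posedge clk) begin
    word_q <= wordbuf[word_addr];   // one read per cycle
    linebuf[pix_addr] <= pixel;     // one write per cycle
    pix_addr <= pix_addr + 1'b1;
    nibble   <= nibble + 1'b1;
    // Advance one cycle early (at nibble 2, not 3) so the registered
    // read has the next word ready when nibble wraps back to 0.
    if (nibble == 2'd2)
        word_addr <= word_addr + 1'b1;
end
```

Spreading the unpack over four cycles trades a little latency for RAM-friendly access patterns, which is an easy win when the line is being prepared well ahead of when it’s displayed.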
Anyway, let that be a lesson: pay attention to your memory usage patterns in Verilog! It just might save you a huge amount of flops.