So over the last couple of days I started work on the video co-processor (from here on I think I’ll refer to it as the VCP) of the Athena. The goal of this co-processor is to get two independently scrolling tilemap layers, 128 sprites, and 256 onscreen colors.
This turned out to be a bit of an adventure. Not the least of my problems was finding a way to allow the CPU and the VCP to both access VRAM. The thing is, the CPU needs to be able to write into VRAM so that we can change color palettes, scroll tilemap layers, modify the sprite table, upload image data and tilemap data, etc. And the VCP needs to be able to read from VRAM while drawing. And, unfortunately, a single RAM module can only do one of those things at a time.
My first attempt was to stick a priority arbiter in front of the VRAM chip. The idea was that I would use the DTACK line of the CPU to implement a sort of blocking wait state – if the VCP and the CPU both wanted to access VRAM at the same time, the arbiter would grant the VCP’s request and de-assert the CPU’s DTACK line, stalling the CPU until the VCP was done and DTACK was re-asserted.
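For the record, here’s roughly what I had in mind: a minimal sketch of the arbiter idea in Verilog, with made-up signal names (remember the 68000’s DTACK is active-low, so making the CPU wait means withholding the assertion).

```verilog
// Rough sketch of the arbiter idea (hypothetical signal names; DTACK is
// active-low on the 68000, so holding it high inserts wait states).
module vram_arbiter (
    input  wire clk,
    input  wire cpu_req,       // CPU wants a VRAM cycle
    input  wire vcp_req,       // VCP wants a VRAM cycle
    output reg  cpu_dtack_n,   // held high (negated) while the CPU must wait
    output reg  grant_to_vcp   // selects which side drives the VRAM bus
);
    always @(posedge clk) begin
        if (vcp_req) begin
            // VCP always wins; stall the CPU by withholding DTACK
            grant_to_vcp <= 1'b1;
            cpu_dtack_n  <= 1'b1;
        end else if (cpu_req) begin
            grant_to_vcp <= 1'b0;
            cpu_dtack_n  <= 1'b0;   // acknowledge the CPU's cycle
        end else begin
            grant_to_vcp <= 1'b0;
            cpu_dtack_n  <= 1'b1;
        end
    end
endmodule
```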
This failed miserably. The CPU mostly just locked up completely. Maybe I just didn’t implement it correctly, but I decided to try a different approach anyway.
My next attempt was to use a control port on the VCP. The idea is very similar to the way VRAM writes work on the Sega Genesis (and, probably, for extremely similar reasons): the CPU would write into an address port and a data port on the VCP, and then write a value into a control port to signal a write request. The VCP is then free to schedule this to be processed whenever it isn’t already interacting with VRAM.
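To make the idea a bit more concrete, here’s a rough sketch of the kind of VCP-side register decode this implies. The register offsets and signal names here are made up for illustration, not the real Athena memory map.

```verilog
// Hypothetical CPU-facing register decode for the VCP. Offsets and signal
// names are illustrative only.
module vcp_cpu_ports (
    input  wire        cpu_clk,
    input  wire        cpu_cs,          // VCP register space selected
    input  wire        cpu_we,          // CPU write strobe
    input  wire [1:0]  cpu_reg_sel,     // which register is being accessed
    input  wire [15:0] cpu_wdata,
    output reg  [15:0] vram_addr_latch,
    output reg  [15:0] vram_data_latch,
    output reg         write_request    // pulses when the control port is written
);
    localparam REG_ADDR = 2'd0;  // VRAM address port
    localparam REG_DATA = 2'd1;  // VRAM data port
    localparam REG_CTRL = 2'd2;  // control port: write = "go", read = FIFO full flag

    always @(posedge cpu_clk) begin
        write_request <= 1'b0;
        if (cpu_cs && cpu_we) begin
            case (cpu_reg_sel)
                REG_ADDR: vram_addr_latch <= cpu_wdata;
                REG_DATA: vram_data_latch <= cpu_wdata;
                REG_CTRL: write_request   <= 1'b1;  // queue a VRAM write
                default: ;
            endcase
        end
    end
endmodule
```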
So I set about trying to implement this and ran into various odd problems, and that’s when I realized something pretty crucial: my CPU and my VCP run on different clocks, and I was running into clock domain crossing issues.
What is clock domain crossing? Basically, if you have a signal tied to one clock, and a process on another clock trying to sample that signal, you can end up with some very strange results in real hardware. If the signal happens to change right around the sampling clock edge, the flip-flop sampling it can go metastable and settle to an unpredictable value. There are various ways to deal with this problem. For very simple cases, you can just feed a value through a chain of two or three flip-flops on the target clock to (relatively) safely synchronize the value (the chance of an error is never mathematically zero, but it is very, very near zero). However, in this case I went with a different approach: a dual-clocked FIFO queue.
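For completeness, the flip-flop-chain version looks something like this (a generic sketch, not code lifted from the Athena):

```verilog
// Classic two-flop synchronizer for a single-bit signal crossing into dst_clk.
// The first flop may go metastable; the second gives it a clock period to settle.
module sync_2ff (
    input  wire dst_clk,
    input  wire async_in,   // signal coming from the other clock domain
    output wire synced_out
);
    reg stage1, stage2;
    always @(posedge dst_clk) begin
        stage1 <= async_in;
        stage2 <= stage1;
    end
    assign synced_out = stage2;
endmodule
```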
A FIFO queue is of course easy to grasp if you’re a software developer: you can enqueue multiple values, and then later dequeue them in the order they were enqueued. In this case, there’s a special kind of FIFO queue that’s specifically designed to run on two clock signals (one for writing and one for reading), and as a result it provides a safe way of moving values from one clock domain to another.
The nice thing is that Quartus provides a built-in, customizable FIFO construct that allows for this exact setup, so I just went ahead and used that. I decided to make the FIFO 32 bits wide – when I enqueue values onto this FIFO from the source clock domain (the clock my CPU runs on), I merge the 16-bit address and 16-bit data ports into a single packed 32-bit value. On the receiving end, when I dequeue values from this FIFO on the destination clock domain (the pixel clock my VCP is running on), I split them back out and use them to generate VRAM address and data port values. The FIFO also provides a write-full signal, so I wired things up so that reading from the control port mentioned previously yields a 0 or 1: 0 if the FIFO still has room, 1 if it has become full. That way, on the CPU side you can manually check whether a write has filled up the FIFO and wait until it empties a bit before adding more write requests. This generally shouldn’t happen, though, since the CPU is running at 12.5MHz and the pixel clock is 25.127MHz (so it shouldn’t be possible for the CPU to outstrip the VCP’s read rate).
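Picking up from the port sketch above, here’s roughly what the glue around the FIFO looks like. The dc_write_fifo instance below stands in for the wrapper Quartus generates around its dual-clock FIFO IP, so treat its name and port names as placeholders rather than the real generated interface.

```verilog
// Sketch of the dual-clock FIFO glue between the CPU ports and VRAM writes.
module vram_write_queue (
    // CPU clock domain
    input  wire        cpu_clk,
    input  wire [15:0] vram_addr_latch,
    input  wire [15:0] vram_data_latch,
    input  wire        write_request,
    output wire        fifo_full,      // read back through the control port as 0/1
    // pixel clock domain
    input  wire        pixel_clk,
    input  wire        fifo_rdreq,     // pulsed by the VCP when VRAM is free
    output wire [15:0] queued_addr,
    output wire [15:0] queued_data,
    output wire        fifo_empty
);
    wire [31:0] fifo_q;

    // Pack address (high half) and data (low half) into one 32-bit entry.
    wire [31:0] packed_write = {vram_addr_latch, vram_data_latch};

    // Placeholder for the Quartus-generated dual-clock FIFO wrapper.
    dc_write_fifo fifo (
        .wrclk   (cpu_clk),
        .wrreq   (write_request && !fifo_full),
        .data    (packed_write),
        .wrfull  (fifo_full),

        .rdclk   (pixel_clk),
        .rdreq   (fifo_rdreq),
        .q       (fifo_q),
        .rdempty (fifo_empty)
    );

    // Unpack on the pixel-clock side and hand off to the VRAM write logic.
    assign queued_addr = fifo_q[31:16];
    assign queued_data = fifo_q[15:0];
endmodule
```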
So that solves getting data from the CPU into VRAM, but what about actually rendering?
Well that, too, was an adventure.
At the moment, I’ve only got a single tilemap layer rendering, but I should be able to build on this and implement the rest of the VCP’s rendering features.
In the end, what I went with was a state machine that works like this (there’s a stripped-down sketch of it after the list):
- In STATE_idle, wait for a Horizontal Interrupt signal from the VGA generator. On receiving the signal, latch the current state of the scanline counter, flip the currently active line buffer, and transition to STATE_draw_line
- In STATE_draw_line, fill the current line buffer with the first color entry in the color palette and set up the first read for the next clock cycle (reading the tile that intersects the leftmost pixel of the line, from the tilemap address given in tile plane A’s registers), then transition to STATE_fetch_tile_row
- In STATE_fetch_tile_row, fetch the result of the read set up in the last cycle from VRAM into a row of tile data registers, set up the next read address, and loop back into this state, fetching one tile per clock cycle until 41 tiles have been fetched. Then set up a VRAM read for the pixel data of the first tile and transition into STATE_draw_tile_row
- In STATE_draw_tile_row, fetch the result of the read set up in the last cycle. Each word encodes four pixels of data, indexing into one 16-color row of the color palette at a time, so blit those four pixels into the current line buffer, set up the next four-pixel VRAM read, and loop back into this state until all 41 tiles have been blitted into the current line buffer. Then transition back into STATE_idle.
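Here’s that stripped-down skeleton. The register widths, the VRAM handshake, the line-buffer writes and the address math are all simplified or elided, so treat it as the shape of the thing rather than the real module.

```verilog
// Skeleton of the line-rendering state machine described above. Only the
// state flow is meant to be accurate; everything else is a placeholder.
module line_renderer (
    input  wire        pixel_clk,
    input  wire        hint,             // horizontal interrupt from the VGA generator
    input  wire [9:0]  scanline,
    input  wire [15:0] vram_rdata,       // result of the VRAM read set up last cycle
    output reg  [15:0] vram_addr,
    output reg         line_buffer_sel   // which of the two line buffers is being drawn
);
    localparam STATE_idle           = 2'd0;
    localparam STATE_draw_line      = 2'd1;
    localparam STATE_fetch_tile_row = 2'd2;
    localparam STATE_draw_tile_row  = 2'd3;

    // Placeholder for "the tilemap address given in tile plane A's registers".
    localparam [15:0] TILEMAP_BASE_A = 16'h0000;

    reg [1:0]  state = STATE_idle;
    reg [9:0]  line;                // latched scanline (used by the elided address math)
    reg [5:0]  tile_index;          // 0..40, 41 tiles per row
    reg [15:0] tile_row [0:40];     // tile entries fetched for this line
    reg        row_done;            // placeholder: driven by the (elided) pixel counter

    always @(posedge pixel_clk) begin
        case (state)
            STATE_idle: if (hint) begin
                line            <= scanline;          // latch the scanline counter
                line_buffer_sel <= ~line_buffer_sel;  // flip the active line buffer
                state           <= STATE_draw_line;
            end

            STATE_draw_line: begin
                // (fill the line buffer with palette entry 0 here)
                vram_addr  <= TILEMAP_BASE_A;         // set up the first tilemap read
                tile_index <= 6'd0;
                state      <= STATE_fetch_tile_row;
            end

            STATE_fetch_tile_row: begin
                // Last cycle's read comes back now; store it and queue the next one.
                tile_row[tile_index] <= vram_rdata;
                vram_addr            <= TILEMAP_BASE_A + tile_index + 1'b1;
                tile_index           <= tile_index + 1'b1;
                if (tile_index == 6'd40) begin
                    // (set up the read for the first tile's pixel data here)
                    state <= STATE_draw_tile_row;
                end
            end

            STATE_draw_tile_row: begin
                // Each word packs four 4-bit pixels indexing a 16-color palette row;
                // blit them into the line buffer, set up the next pixel read, and
                // loop until all 41 tiles are done. (blitting elided)
                if (row_done)
                    state <= STATE_idle;
            end
        endcase
    end
endmodule
```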
It took a little bit to get this working properly, but in the end I did succeed. You can see the results below:
The other thing I did in the above video was get external interrupts working so that I could make use of a vblank interrupt. This isn’t super tough if you read the M68K data sheets, but I’ll outline the basics: first, you need to set the Status Register to indicate which interrupts you want to handle (by default the interrupt mask is 7, which means interrupts at level 6 and below will be ignored). I set the interrupt mask portion of the Status Register to 0 so that all interrupts would be handled. Next, to assert an interrupt, you set the 3 IPL lines to encode an interrupt priority (you can encode a value between 1 and 7). After that, you watch the 3 FC output lines. If they become 111, you know the CPU is about to handle the interrupt, and the level it’s about to handle has been placed on address lines A1–A3. You can use that to de-assert that particular interrupt request. I personally put together a couple of modules to help with latching interrupt requests and unlatching them in response to the interrupt acknowledge from the CPU.
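The latching modules I mentioned look roughly like this. This is a simplified single-level sketch with made-up names (and it glosses over which clock domain the request pulse arrives on):

```verilog
// Simplified sketch of an interrupt request latch for one interrupt level.
module irq_latch #(
    parameter [2:0] LEVEL = 3'd6      // which IPL level this request asserts
) (
    input  wire       clk,
    input  wire       irq_pulse,      // e.g. the vblank pulse from the VCP
    input  wire [2:0] fc,             // 68000 function code outputs
    input  wire [2:0] iack_level,     // interrupt level seen on address lines A1-A3
    output wire [2:0] ipl_n           // active-low IPL lines to the CPU
);
    reg pending = 1'b0;

    // FC = 111 means the CPU is running an interrupt-acknowledge cycle, and the
    // level being acknowledged is on A1-A3: that's the cue to drop the request.
    wire acked = (fc == 3'b111) && (iack_level == LEVEL);

    always @(posedge clk) begin
        if (irq_pulse)
            pending <= 1'b1;
        else if (acked)
            pending <= 1'b0;
    end

    // Drive the encoded level while the request is pending (IPL is active-low,
    // so "no interrupt" is all ones).
    assign ipl_n = pending ? ~LEVEL : 3'b111;
endmodule
```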
So, what’s next? Well, since I’ve got one tilemap done, next I’ll probably work on sprite rendering, and then extend this to support multiple tile and sprite layers. And then it’ll probably be onto the audio subsystem!