Winter arrives once more, and like I mentioned in August’s progress report, I am extremely excited for what’s in store. In fact, I’ve decided to separate many of the really big goodies into their own articles, which should be coming up in the next few weeks.
There have also been many changes this month that improve the speed of emulation across the board, on top of the usual improvements in accuracy and features. And because of that, I’ve dubbed this month #Speedtember. Let’s dive right in.
Hello everyone! We’re all terribly sorry for the delay in getting this progress report out the door, but our main technical writer, anodium, was just a bit busy surviving both Hurricane Irma and Maria. Although she’s a trooper and claims it’s not an excuse for the delay, we find that her personal safety is a tad more important. We’re all glad that she’s safe and sound, and in a state where she can keep pumping out quality articles for Citra!
Switchable Page Tables by MerryMage
Citra has a component called dynarmic, which recompiles ARM11 code to x86-64 code at run time, and then executes that generated code, rather than interpreting the ARM11 instructions directly.
Because the 3DS has a 32-bit address bus, it can address 2^32 unique memory locations. And because the 3DS can address data down to a byte, it can address up to 2^32 unique bytes, or about 4 gigabytes of memory. When considering that no 3DS has ever been released with more than 256 megabytes of memory, this sounds absurd! And it is… unless you consider that a 3DS uses chunks of that huge address space to address peripherals, among other things. This is called memory-mapped input/output (MMIO), and is a great use of millions of addresses that would otherwise have been ignored. It also allows handling IO the exact same way memory is handled, so the design can be a bit simpler, as it doesn’t need special circuitry to handle IO.
Herein lies our problem. Because that code is now being run on a PC, those MMIO devices don’t actually exist anymore, so Citra needs to handle those reads and writes itself. There are a few ways to go about it, but the simplest and most naïve is to replace every memory read or write with a function that checks if that address is mapped to memory or IO. Unfortunately, this is extremely slow, and we can’t afford to have extremely slow address translation when games can access memory upwards of a few hundred thousand times per second.
With this, MerryMage has changed this behaviour so that rather than replacing a read/write with a function, it instead translates the address using a page table, and then tries to access that address directly. On the page table, all addresses that map to memory simply have a memory address written down. But on addresses that map to IO, it has address 0 written down. Trying to read or write to memory address 0 on x86 is illegal for every process except the operating system… and Citra tries to do it anyways!
When an invalid memory address (or a memory address that that process doesn’t have permission to access) is read from or written to, x86 CPUs throw a page fault exception. Citra takes advantage of this behaviour by also registering an exception handler for page faults. If a page fault is thrown, Citra knows the game tried to access IO, and thus recompiles the memory read/write to a direct call to Citra’s IO functions. This makes the usual case (memory access) extremely fast, and the less usual case slow, but only the first time it happens. Subsequent IO accesses use the recompiled functions which are faster.
This technique is called fastmem, and is not new at all. In fact, Dolphin uses it extensively in its JIT recompiler to speed up memory access as well. And thanks to MerryMage’s hard work, this same technique is now used extensively by Citra.
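To make the idea a bit more concrete, here is a minimal sketch of a page-table lookup with an IO fallback. It is not dynarmic’s code: the names (`PageTable`, `Read32`, `ReadIO32`), the 4 KiB page size, and the dummy IO value are all assumptions made for illustration. It also uses an explicit null check where real fastmem would let the hardware page fault do the detection.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Assumed layout: 4 KiB pages covering a 32-bit guest address space.
constexpr std::uint32_t kPageBits = 12;
constexpr std::uint32_t kPageSize = 1u << kPageBits;
constexpr std::size_t kNumPages = std::size_t{1} << (32 - kPageBits);

struct PageTable {
    // One host pointer per guest page. nullptr marks a page that is not
    // backed by plain memory (for example, an MMIO region).
    std::vector<std::uint8_t*> pages = std::vector<std::uint8_t*>(kNumPages, nullptr);
};

// Hypothetical stand-in for the emulator's slow IO path; returns a dummy value.
inline std::uint32_t ReadIO32(std::uint32_t vaddr) {
    return 0xDEAD0000u | (vaddr & 0xFFFFu);
}

inline std::uint32_t Read32(const PageTable& table, std::uint32_t vaddr) {
    const std::uint32_t offset = vaddr & (kPageSize - 1);
    // This sketch does not handle reads that straddle two pages.
    assert(offset <= kPageSize - sizeof(std::uint32_t));

    std::uint8_t* page = table.pages[vaddr >> kPageBits];
    if (page == nullptr) {
        // Real fastmem would fault here and patch the access; we just branch.
        return ReadIO32(vaddr);
    }
    std::uint32_t value;
    std::memcpy(&value, page + offset, sizeof(value));
    return value;
}
```

The fast path is a shift, a table load, and a copy. The whole point of fastmem is that the generated code can drop even the null check and rely on the page fault handler for the rare IO case.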
Give each process its own page table by Subv
In order to support running multiple processes at the same time, like your computer, Citra implements virtual memory, in which each process has its own page table. The page table represents a translation from the process’ virtual addresses, to the 3DS’ physical (or “real”) addresses.
Before this, because Citra did not support multiple page tables, it also didn’t support running multiple processes at once, such as a game and the software keyboard applet. Now, thanks to Subv, Citra has an important building block in place.
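As a rough illustration of what “one page table per process” means, the sketch below keeps a table inside each process and switches which one is active. Everything here (`Process`, `MemorySystem`, `SwitchTo`, the map-based table) is a hypothetical simplification and not Citra’s actual kernel code.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Simplified page table: virtual page number -> physical page number.
using PageTable = std::unordered_map<std::uint32_t, std::uint32_t>;

constexpr std::uint32_t kPageBits = 12;  // assumed 4 KiB pages
constexpr std::uint32_t kPageMask = (1u << kPageBits) - 1;

struct Process {
    PageTable page_table;  // each process owns its own mappings
};

class MemorySystem {
public:
    // Called when the scheduler switches to another process.
    void SwitchTo(const Process& process) { current_ = &process.page_table; }

    // Translates a virtual address using the active process' table.
    std::uint32_t Translate(std::uint32_t vaddr) const {
        assert(current_ != nullptr);
        const auto it = current_->find(vaddr >> kPageBits);
        assert(it != current_->end());  // this sketch treats unmapped pages as a bug
        return (it->second << kPageBits) | (vaddr & kPageMask);
    }

private:
    const PageTable* current_ = nullptr;
};
```

With this arrangement the same virtual address can point at different physical memory depending on which process is running, which is what lets a game and an applet coexist.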
Add support for loading application updates by shinyquagsire23
Nintendo 3DS titles are contained within *.app files on the SD card or on the game cartridge, in the NCCH container format. This format is further divided into two formats, CXI and CFA, which stand for __C__TR e__X__ecutable __I__mage and __C__TR __F__ile __A__rchive, respectively. CXIs contain executable code, whereas CFAs cannot. CFAs usually accompany a CXI to provide other features such as the digital instruction manual, the Download Play child application, or in the case of game cartridges, system updates.
Both types of NCCH start with a header, followed by either an ExeFS image, a RomFS image, or both. The entire structure of an NCCH header may be best explained by a diagram:
Now, games and applications need updates from time to time, and 3DSes handle these by installing the update as a separate title from the base game. From that point on, whenever the user tries to launch the game, instead of loading the extended header (or ExHeader for short) and ExeFS image from the base game’s NCCH, the system loads the update’s ExHeader and ExeFS in their place. As for RomFS, the 3DS System Software will actually load both the base game’s and the update’s RomFS image, rather than replacing one with the other. Games are left to their own devices on how to handle these, and so the methods used per game can vary, though they usually just replace changed files, picking files from the base game RomFS if they haven’t been modified.
Citra, before this PR, had the code for loading games and reading NCCH files all mixed into one big piece that fit in with everything else. With this patch, shinyquagsire23 has separated the loader from the NCCH reader, allowing the loader to read multiple NCCHs at once. Additionally, whenever a game is loaded, the loader now also checks if there is an update title installed on Citra’s virtual SD card. If there is, it replaces the base game’s ExHeader and ExeFS with the update’s, and loads the update RomFS as well. Just like a real console!
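The selection rule described above can be sketched in a few lines. The types and the function name (`Ncch`, `LoadedTitle`, `LoadTitle`) are hypothetical stand-ins and do not mirror Citra’s loader classes; strings stand in for the real binary sections.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a parsed NCCH container.
struct Ncch {
    std::string exheader;
    std::string exefs;
    std::string romfs;
};

struct LoadedTitle {
    std::string exheader;
    std::string exefs;
    std::vector<std::string> romfs_images;  // base first, then update if present
};

// `update` may be nullptr when no update title is installed.
inline LoadedTitle LoadTitle(const Ncch& base, const Ncch* update) {
    assert(!base.exefs.empty());  // a CXI is expected to carry executable code

    LoadedTitle out;
    // ExHeader and ExeFS come from the update when there is one.
    const Ncch& code_source = (update != nullptr) ? *update : base;
    out.exheader = code_source.exheader;
    out.exefs = code_source.exefs;

    // RomFS images are both kept; the game decides how to combine them.
    out.romfs_images.push_back(base.romfs);
    if (update != nullptr) {
        out.romfs_images.push_back(update->romfs);
    }
    return out;
}
```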
Most games worked out of the box with updates, and because they wrote the code with accuracy in mind, this very same PR has also laid part of the foundation needed to handle other features such as DLC support or even using real 3DS SD cards! Though, do note that we don’t have any estimates for those or any other features, as no one is actively working on them.
Implement geometry shader by wwylele
The PICA200 GPU has a pipeline similar to OpenGL’s pipeline for rendering 3D objects into a 2D display. I won’t go through them all here, only the optional geometry shader step. Just after the vertex shader step, if enabled, all the vertices are processed by a shader kernel (which is a small program that runs directly on a GPU), taking as many vertices as the kernel wants as input, and outputting as many vertices as the kernel wants.
Because the kernel in the geometry shader is allowed as many inputs and outputs as it wants, it is significantly more powerful and flexible than the vertex shader, whose kernel is restricted to only one vertex at a time, both for input and output. But for that same reason, geometry shaders are much more complex to program, and so many games simply disable it. The games that do not disable it though, tend to use it very extensively, to the point of completely breaking graphics if it’s not implemented.
Multiple uses have been found in the wild for geometry shaders, including but absolutely not limited to:
Taking one vertex as input, and outputting a rectangle of vertices which can be textured with a sprite. Pokémon uses this extensively to render particles whenever a move is used. Monster Hunter takes it a step further and renders all of its HUD and GUI with this kernel.
Taking a handful of vertices as input, and outputting even more vertices which are interpolations between the inputs, thus making the resulting mesh look smoother and less jagged when rendered.
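The first use case, one vertex in and a textured rectangle out, can be sketched on the CPU like this. Real geometry shaders run as PICA200 shader programs, so this is only an illustration of the data flow; `Vertex` and `ExpandPointToQuad` are made-up names.

```cpp
#include <array>
#include <cassert>

struct Vertex {
    float x, y;  // position
    float u, v;  // texture coordinates
};

// Expands one center point into the four corners of an axis-aligned quad,
// in triangle-strip order, with texture coordinates covering the whole sprite.
inline std::array<Vertex, 4> ExpandPointToQuad(float cx, float cy, float half_size) {
    assert(half_size >= 0.0f);
    return {{
        {cx - half_size, cy - half_size, 0.0f, 0.0f},
        {cx + half_size, cy - half_size, 1.0f, 0.0f},
        {cx - half_size, cy + half_size, 0.0f, 1.0f},
        {cx + half_size, cy + half_size, 1.0f, 1.0f},
    }};
}
```

A vertex shader cannot do this, since it may only emit one vertex for each vertex it receives.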
At first glance, geometry shaders looked like an easy problem, since they use the same instruction set and format as vertex shaders, so a lot of the same code could be reused. At second glance, it turned out that configuring inputs and outputs for geometry shaders is much more complex than it is for vertex shaders.
There were actually three attempts to implement geometry shaders in Citra. The first was written by ds84182 about two years ago, only to be abandoned due to not knowing how the configuration of them was done. The second attempt was written by JayFoxRox, but was also abandoned for the same reason.
But, after extensive research on geometry shaders was made by fincs, the API was implemented in ctrulib and citro3d, and examples were written to demonstrate how to use it. Now that the community knew exactly how they worked, wwylele picked up where JayFoxRox left off, cleaned up the code he wrote, and added the missing pieces.
After almost two years, and three different attempts to make it work, Citra now has a full, complete, and correct implementation of geometry shaders!
Implement custom clip plane by wwylele
After the geometry shader (or the vertex shader, if it wasn’t enabled), the vertices are “assembled” into a collection of triangles. After that, to make rendering more efficient, the triangles are then compared to 6 planes that make up the cube in which objects are actually visible by the camera. Any triangles outside of that cube are deleted, and any triangles that are partially inside the cube are split by the sides of the cube, and the resulting triangle outside of the cube is also deleted.
But the 3DS allows games to add a 7th plane whose position is fully customizable. Although no games are known to use this feature right now, it is indeed a feature of the 3DS’ GPU. Because implementing it was fairly straightforward, wwylele decided to just go ahead and implement it, in case someone decided to use it in the future.
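Clipping against any single plane, whether one of the six fixed ones or a custom seventh, follows the same pattern. The sketch below uses the textbook Sutherland-Hodgman step and is not necessarily how Citra’s clipper is written; the plane is given as four coefficients, and a point counts as visible when its dot product with the plane is non-negative (an assumed convention).

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

using Vec4 = std::array<float, 4>;

inline float Dot(const Vec4& a, const Vec4& b) {
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
}

// Clips a convex polygon against one plane, keeping the part where
// Dot(plane, point) >= 0. New vertices are created where edges cross the plane.
inline std::vector<Vec4> ClipAgainstPlane(const std::vector<Vec4>& polygon, const Vec4& plane) {
    assert(plane != Vec4{});  // an all-zero plane does not describe a half-space

    std::vector<Vec4> out;
    for (std::size_t i = 0; i < polygon.size(); ++i) {
        const Vec4& cur = polygon[i];
        const Vec4& next = polygon[(i + 1) % polygon.size()];
        const float d_cur = Dot(plane, cur);
        const float d_next = Dot(plane, next);

        if (d_cur >= 0.0f) {
            out.push_back(cur);  // vertex is on the visible side
        }
        if ((d_cur >= 0.0f) != (d_next >= 0.0f)) {
            // The edge crosses the plane: emit the intersection point.
            const float t = d_cur / (d_cur - d_next);
            Vec4 p;
            for (std::size_t k = 0; k < 4; ++k) {
                p[k] = cur[k] + t * (next[k] - cur[k]);
            }
            out.push_back(p);
        }
    }
    return out;
}
```

Running a triangle through this once per plane yields the visible polygon, which is then split back into triangles.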
Optimized Morton by huwpascoe
Morton code is a function that interleaves multi-dimensional numbers into a one-dimensional number. Although it may seem like a very esoteric function, it’s actually extremely useful in fields like linear algebra, databases, and what the 3DS uses it for: texture mapping.
Computers have an intermediate chunk of memory between RAM and the CPU called a cache. Caches are separated into lines, each of which can hold one data item. GPUs also have a cache, also separated into lines. Because they are separated like this, if a texture is loaded into the cache, it would have to span multiple cache lines, or even not fit into the cache completely, thus making transformations on it slow, as it would have to load and store pieces of it from RAM multiple times.
To avoid this, GPUs can Morton encode textures so that two-dimensional manipulations are more likely to only need data already in the cache. Textures that have been Morton coded are usually referred to as swizzled or twiddled textures.
In the function where Morton encoding is implemented, there was a lookup table of Morton codes in the comments, and huwpascoe thought it’d be best if we just used the lookup table directly. It worked just as well as before, but required less than a third of the math. Because this function is called so often during emulation (their rough estimate is “millions of times a second”), this change, although small, made a very big difference in CPU performance.
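To illustrate the difference between doing the math and using a table, here is a small sketch that interleaves two 3-bit coordinates both ways. The 3-bit width (as for a position inside an 8x8 tile) and the function names are assumptions made for the example, and the code is not copied from Citra.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Reference version: interleave bit by bit, x in the even bits, y in the odd bits.
inline std::uint32_t MortonInterleaveLoop(std::uint32_t x, std::uint32_t y) {
    assert(x < 8 && y < 8);  // this sketch only handles 3-bit coordinates
    std::uint32_t code = 0;
    for (std::uint32_t bit = 0; bit < 3; ++bit) {
        code |= ((x >> bit) & 1u) << (2 * bit);
        code |= ((y >> bit) & 1u) << (2 * bit + 1);
    }
    return code;
}

// Table version: each entry is a 3-bit value with a zero bit inserted
// between its bits, so the loop collapses into two lookups and an OR.
inline std::uint32_t MortonInterleaveTable(std::uint32_t x, std::uint32_t y) {
    static constexpr std::array<std::uint32_t, 8> kSpread = {
        0x00, 0x01, 0x04, 0x05, 0x10, 0x11, 0x14, 0x15};
    return kSpread[x & 7] | (kSpread[y & 7] << 1);
}
```

Both functions return the same code for every input in range; the table version simply trades a handful of shifts and masks for two small memory reads.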
Add draw for immediate and batch modes by jroweboy
The 3DS’ GPU has two main modes for drawing to the screen, immediate and batch mode. In the former, the GPU takes and immediately draws every vertex as it is handed to it. In the latter, the GPU accepts vertices given to it, but doesn’t actually bother drawing them until absolutely necessary, saving a bit of time from not having to go through the drawing procedure for every individual vertex. Although most games don’t use immediate mode at all due to it being extremely slow, a few, like New Super Mario Bros. 2, do use it for a handful of visual effects.
About a year ago when the GPU code on Citra was rewritten, a handful of calls to the drawing routine were removed, as it was believed they were unnecessary. Turns out, one of the calls was actually needed for some effects in games, as it handled immediate mode drawing. This wasn’t noticed for a very long time, as most games appeared to carry on with no side-effects at all from the rewrite, but was eventually found after some research courtesy of ds84182.
Interpolate audio samples on a frame-by-frame basis by MerryMage
When a 3DS game needs some sort of audio processing, it can access the 3DS’ DSP, or __D__igital __S__ignal __P__rocessor. It’s another processor, alongside the ARM9 and ARM11, that is given a firmware to run, which in turn is given a bunch of audio samples and parameters by the game. The DSP then plays back the buffer in chunks of about 5 milliseconds. Each one of these chunks is called an audio frame.
As of today, we don’t know how the DSP exactly works, and we don’t know how any of the firmwares exactly work. (Did I forget to mention earlier there’s multiple versions of the firmware?) But we do know how to use it, and from there we can reimplement its behaviour directly in Citra. Which is exactly what MerryMage did back in June of 2016, which in turn brought audio support for the first time in Citra.
This approach has the advantages of being easier to implement, easier to understand in code, and potentially faster, but it has the disadvantage that accuracy suffers significantly, especially when shortcuts are taken for the sake of speed. One of these shortcuts was in the audio interpolation, which is a way of inferring more audio samples from relatively few existing samples.
On a real 3DS, games are allowed to interpolate different audio frames with different functions, even when they’re in the same buffer. On the other hand, Citra interpolated the entire buffer with one function as soon as it was loaded. This led to various effects and music in games sounding strange or inaccurate in some way.
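The difference can be sketched as follows: instead of picking one interpolation function for a whole buffer, the function is chosen per frame. The two modes shown, the integer upsampling factor, and all names are simplifications invented for this example. The real DSP’s modes and resampling ratios are not modeled, and samples are not carried across frame boundaries.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class InterpMode { None, Linear };

// Upsamples by an integer factor, either repeating each sample (None)
// or blending linearly towards the next one (Linear).
inline std::vector<float> Resample(const std::vector<float>& in, std::size_t factor,
                                   InterpMode mode) {
    assert(factor >= 1);
    std::vector<float> out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        const float a = in[i];
        const float b = (i + 1 < in.size()) ? in[i + 1] : in[i];
        for (std::size_t k = 0; k < factor; ++k) {
            const float t = static_cast<float>(k) / static_cast<float>(factor);
            out.push_back(mode == InterpMode::Linear ? a + t * (b - a) : a);
        }
    }
    return out;
}

struct Frame {
    std::vector<float> samples;
    InterpMode mode;  // each frame carries its own interpolation setting
};

// Processes a buffer frame by frame, honoring each frame's own mode.
inline std::vector<float> ProcessBuffer(const std::vector<Frame>& frames, std::size_t factor) {
    std::vector<float> out;
    for (const Frame& frame : frames) {
        const std::vector<float> resampled = Resample(frame.samples, factor, frame.mode);
        out.insert(out.end(), resampled.begin(), resampled.end());
    }
    return out;
}
```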
One example of this is Deku Link’s footsteps in The Legend of Zelda: Majora’s Mask 3D.
Here’s the output of a real 3DS console, for reference:
And here’s the output of Citra, before this was fixed:
Now that it’s been fixed, his footsteps sound a lot better:
Audio emulation in Citra is still somewhat inaccurate for now, though MerryMage is gradually working on fixing and improving it. Perhaps some day we may even be able to emulate the DSP firmware directly, which will be much more accurate than merely emulating its behaviour.
Use deque instead of vector for the audio buffer by Subv
Whenever the DSP consumes some frames from the audio buffer, Citra deletes them from it. This normally wouldn’t pose any problems, but because the buffer was being stored as a vector, this led to some unnecessary operations. Namely, the C++ standard requires that all the data of a standard vector be in one contiguous block of memory. Because deleting frames from the front of the buffer would leave a gap at the start of that block, every remaining sample had to be moved down to fill it, each and every time frames were consumed.
This copying is a huge waste of time, as Citra doesn’t need to guarantee that the audio buffer is in one contiguous block. So Subv changed the type of the buffer from a vector to a deque, which is essentially a queue that you can add data to or remove data from at both the beginning and the end. Because the contiguity requirement doesn’t exist in deques, Citra doesn’t do the unnecessary copying, leading to huge speed boosts in audio-bound titles like Super Mario 3D Land, and even the Home Menu. Now, both run significantly faster!
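A minimal sketch of consuming samples from the front of a deque is shown below. The alias and function name are hypothetical, and the 16-bit sample type is an assumption. The relevant property is that erasing at the front of a `std::deque` does not move the remaining elements.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>

using AudioBuffer = std::deque<std::int16_t>;

// Removes up to `count` samples from the front of the buffer and
// returns how many were actually removed.
inline std::size_t Consume(AudioBuffer& buffer, std::size_t count) {
    const std::size_t before = buffer.size();
    const std::size_t n = count < before ? count : before;
    // Erasing at the front of a deque leaves the remaining elements in place.
    buffer.erase(buffer.begin(), buffer.begin() + static_cast<std::ptrdiff_t>(n));
    assert(buffer.size() == before - n);
    return n;
}
```

The same call on a `std::vector` would compile and behave identically from the outside, but it would shift every remaining sample on each call.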
Add mingw64 compile support to appveyor by jroweboy
When a program is written in a high-level programming language, such as C++, Rust, or Go, before the program can be run on a machine, it must be translated or “compiled” to machine code. Although it is possible to do this translation by hand, it is usually extremely difficult and very time consuming. So instead, we have a program called a compiler that can automatically do this translation for us.
This is also why a program compiled for an ARM machine cannot be run directly on an x86 machine, even when the source can work on either machine without issues. Instead this program must be translated, interpreted, or recompiled from source to x86. (In fact, this translation is exactly what dynarmic does to run code from a 3DS.)
Every statement in a program must have an exact, unambiguous definition of what it does (its semantics). But, in the same way that a statement that means one thing can be written many different ways, different compilers can translate the same statement many different ways.
On Windows, there are two popular C++ compilers available as of today: MSVC++, which is the compiler Microsoft has written for Windows, and MINGW GCC, which is actually a port of the Linux gcc compiler to Windows. For better or worse, MINGW GCC optimizes Citra a little better than MSVC++, and so jroweboy has changed the Citra AppVeyor build script to add support for MINGW GCC as well as MSVC++. Do note that the MSVC++ builds are only available through GitHub, since they’re only useful for debugging, and MINGW GCC builds are faster in most, if not all, cases, which is why the installer will only install those. This change has also closed the gap in performance the new Nightly builds had compared to the old Bleeding Edge builds.
Load different shared font depending on the region by wwylele
Remember that last month wwylele changed Citra so that instead of loading the shared font from a separate file, it would load it from the system archive? This builds on top of that behaviour. You see, a 3DS doesn’t have a shared font, it has four. One contains glyphs for Latin script (for English, Spanish, Italian, French, etc.) and Japanese scripts, another contains glyphs for Traditional Chinese, the third font contains those for Simplified Chinese, and the last font contains the ones for Korean.
Before this PR, Citra would simply load the first shared font regardless of game or region. This made non-Latin or non-Japanese script games display completely incorrect characters at best, or crash at worst. Now Citra will load the appropriate shared font from the system archive depending on the region selected, just like a real console! Though, this will not work on machines that only have the shared_font.bin file, because it only contains the shared font for the region of the console it was dumped from. (e.g.: If you dump a Korean console, it’ll only contain the Korean font.) If you want to use this feature, you must dump the system archive using the latest version of
And of course, big thanks to everyone who’s contributed this September, because Citra as a whole would not be the same without everyone involved having placed their pieces, big or small.