The 3D audio in particular is interesting, because it seems it's all virtualised internally. There appear to be no "channels" to speak of beyond the individual sound sources themselves. This is a good sign that they have a flexible internal representation, one that can be adapted to almost any output format or "channel" configuration.
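To make the idea concrete, here's a minimal sketch of such an "object-based" representation — every source is just a position plus samples, and the channel layout is only chosen at the final render. All names (`pan_gains`, `render`) and the simple constant-power panner are illustrative assumptions, not Sony's actual design:

```python
import numpy as np

def pan_gains(azimuth, n_channels):
    """Constant-power gains for one source over a ring of n_channels speakers.
    (Raised-cosine spread toward each speaker, normalised to unit energy.)"""
    angles = np.linspace(0, 2 * np.pi, n_channels, endpoint=False)
    g = 0.5 * (1.0 + np.cos(angles - azimuth))
    return g / np.sqrt(np.sum(g ** 2))

def render(sources, n_channels):
    """Mix (azimuth, samples) source objects into n_channels output channels."""
    n = max(len(s) for _, s in sources)
    out = np.zeros((n_channels, n))
    for azimuth, samples in sources:
        out[:, :len(samples)] += np.outer(pan_gains(azimuth, n_channels), samples)
    return out

# The same source objects render to stereo or a 5-channel layout unchanged.
srcs = [(0.0, np.ones(4)), (np.pi / 2, np.ones(4))]
stereo = render(srcs, 2)
surround = render(srcs, 5)
```

The point of the sketch is that nothing about the sources themselves mentions channels; only `render` does.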
Sinc resampling is high quality and will avoid aliasing for any sounds that are pitch-shifted and/or stored at a reduced sample rate (vs. the output rate). It also avoids certain "ringing" artefacts in the FFT and convolution steps to come, allowing the best use of their power.
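A minimal windowed-sinc resampler, to illustrate what "sinc resampling" means in practice. The Hann window, 32-tap half-width, and function name are my assumptions — the actual kernel Sony uses is unknown — but the band-limited interpolation idea is the same:

```python
import numpy as np

def sinc_resample(x, ratio, taps=32):
    """Resample x by `ratio` (output_rate / input_rate) with a windowed sinc."""
    n_out = int(len(x) * ratio)
    t = np.arange(n_out) / ratio           # fractional read positions in x
    # When downsampling, widen and scale the sinc to cut above the new Nyquist.
    cutoff = min(1.0, ratio)
    out = np.empty(n_out)
    for i, pos in enumerate(t):
        k = np.arange(int(pos) - taps, int(pos) + taps + 1)
        valid = (k >= 0) & (k < len(x))
        d = pos - k[valid]
        # Sinc kernel tapered by a Hann window to bound the filter length.
        h = cutoff * np.sinc(cutoff * d) * 0.5 * (1.0 + np.cos(np.pi * d / (taps + 1)))
        out[i] = np.dot(h, x[k[valid]])
    return out

y = sinc_resample(np.ones(100), 2.0)   # 2x upsample of a constant signal
```

Unlike linear interpolation, the sinc kernel rejects the spectral images that would otherwise fold back as aliasing when the pitch is shifted.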
The HRTF is sampled temporally - i.e. it is varied smoothly over time - which is good, as it means that source and listener movement (with respect to the virtual sound field) is captured much more crisply. The "ramps", then, are just cross-fades to this effect. Notice that the HRTF is also interpolated over the "listening sphere" co-ordinates, rather than coefficients simply being picked from the nearest measured direction, which further improves the crispness.
The crosses are convolution operations - implemented as element-wise multiplications, since by that point the signals are in the frequency domain, where multiplication is equivalent to time-domain convolution - because what the HRTF effectively represents is a frequency response. This is analogous to the impulse response you get with certain reverb plugins, for example, except that the HRTF here is held in the frequency domain while a reverb's impulse response lives in the time domain. The accumulator just finishes the job of summing the filtered sources into the mix, likely kept as a separate step for efficiency.
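The convolution theorem that makes this work is easy to verify numerically — multiplying spectra gives exactly the same result as convolving in the time domain (the signals and filter lengths below are arbitrary test values):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(64)        # a block of source audio
h = rng.standard_normal(16)        # an HRTF-like impulse response

n = len(x) + len(h) - 1            # zero-pad to hold the full linear convolution
X = np.fft.rfft(x, n)
H = np.fft.rfft(h, n)
y_fast = np.fft.irfft(X * H, n)    # multiply spectra, then back to time domain
y_ref = np.convolve(x, h)          # direct time-domain convolution

assert np.allclose(y_fast, y_ref)  # identical up to floating-point error
```

For long filters this is much cheaper than direct convolution, which is presumably why the pipeline bothers with the FFT at all.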
FFT and IFFT are the Fast Fourier Transform and its inverse, which convert the sounds between the normal time domain that we listen to them in and the frequency domain (like a spectrum plot, or spectrogram), where heavyweight filtering operations of this specific nature are simply faster and easier.
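This also suggests why the accumulator sits where it does: because the transform is linear, per-source filtered spectra can be summed in the frequency domain and brought back with a single IFFT per output, rather than one IFFT per source. A sketch with arbitrary test data:

```python
import numpy as np

rng = np.random.default_rng(1)
sources = [rng.standard_normal(64) for _ in range(4)]
hrtfs = [rng.standard_normal(16) for _ in range(4)]

n = 64 + 16 - 1                       # full linear-convolution length
acc = np.zeros(n // 2 + 1, dtype=complex)
for x, h in zip(sources, hrtfs):
    acc += np.fft.rfft(x, n) * np.fft.rfft(h, n)   # filter and accumulate
mix = np.fft.irfft(acc, n)                          # one IFFT for the whole mix

# Matches filtering every source in the time domain and summing afterwards.
ref = sum(np.convolve(x, h) for x, h in zip(sources, hrtfs))
```

Whether Sony's accumulator works exactly this way is speculation, but it's the standard reason to accumulate before the inverse transform.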
Note that each output channel needs its own spatial mix (e.g. left or right headphone, surround channels etc.), and it's not actually clear here how - or whether - Sony will achieve this at constant computational cost regardless of the number of output channels.
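A back-of-envelope way to see the scaling concern (all numbers illustrative, and `naive_cost` is a hypothetical model, not a measurement): a naive object renderer performs one filtered pass per source per output channel, so cost grows with the product of the two.

```python
def naive_cost(n_sources, n_channels, fft_bins=512):
    """Rough op count: one spectrum multiply-accumulate of fft_bins bins
    for every (source, channel) pair."""
    return n_sources * n_channels * fft_bins

# 4x the output channels means 4x the work under this naive model,
# which is why a constant-cost claim would be notable.
cost_stereo = naive_cost(32, 2)
cost_7_1 = naive_cost(32, 8)
```

Techniques like intermediate-format rendering (e.g. mixing into a fixed set of virtual speakers or spherical-harmonic channels first) can decouple cost from the final channel count, but the diagram alone doesn't say whether that's what's happening here.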