# DirectX 12: Asynchronous Compute (An exercise in Crowd-sourcing)



## Mahigan

**This is a work in progress and should be viewed as such**

It's been several weeks since the Ashes of the Singularity benchmarks hit the PC gaming scene and brought a new feature into our collective vocabulary. Throughout these past few weeks, there has been a lot of confusion and misinformation spreading across the web as a result of the rather complex nature of this topic. In an effort to combat this misinformation, @GorillaSceptre asked if a new thread could be started in order to condense a lot of the information which has been gathered on the topic. This thread is by no means final. It is set to change as new information comes to light. If you have new information, feel free to bring it to the attention of the Overclock.net community as a whole by commenting on this thread.

As things stand right now, Sept 6, 2015, we're waiting for a new driver from nVIDIA to rectify an issue which has inhibited the Maxwell 2 series from supporting Asynchronous Compute. Both Oxide, the developer of the Ashes of the Singularity video game and benchmark, and nVIDIA are working hard to implement a fix for this issue. While we wait for that fix, let's take the opportunity to break down some misconceptions.

**nVIDIA HyperQ**

nVIDIA implements Asynchronous Compute through what it calls "HyperQ". HyperQ is a hybrid solution which is part software scheduling and part hardware scheduling. While little information is available as it pertains to its implementation in nVIDIA's Maxwell/Maxwell 2 architectures, we've been able to piece together how it works from various sources.

*Now I'm certain many of you have seen this image floating around, as it pertains to Kepler's HyperQ implementation*:


[image]



Or this updated one by Ext3h:



What the image shows is a series of CPU cores (indicated by blue squares) scheduling a series of tasks to a series of queues (indicated by black squares), which are then distributed to the various SMMs throughout the Maxwell 2 architecture. While this image is useful, it doesn't truly tell us what is going on. In order to figure out what those black squares represent, we need to take a look at the nVIDIA Kepler white papers. Within these white papers, we find HyperQ defined as follows:


[image]







*Based on this diagram, we can infer that HyperQ works through what we now speculate to be a hardwired ARM processor built into the Maxwell die, comprising two components that handle tasks before they are scheduled to the GPU*:

- Grid Management Unit
- Work Distributor

*So far the scheduling works like this*:

1. The developer records a command list.
2. This command list is sent to the nVIDIA software driver.
3. The nVIDIA software driver translates the commands into ISA.
4. The ISA commands are fed to a Grid Management Unit.
5. The Grid Management Unit transfers 32 pending grids (32 Compute, or 1 Graphics and 31 Compute) to the Work Distributor.
6. The Work Distributor transfers the 32 Compute (or 1 Graphics and 31 Compute) tasks to the SMMs, which are a hardware component within the nVIDIA GPU.
7. The components within the SMMs which receive the tasks are called Asynchronous Warp Schedulers, and they assign the tasks to available CUDA cores for processing.
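The front end described above can be sketched as a toy model in C++. To be clear, the class and member names below are illustrative stand-ins, not actual driver or hardware structures; only the 32-grid limit comes from the Kepler white paper.

```cpp
#include <algorithm>
#include <cstddef>
#include <deque>
#include <string>
#include <vector>

// Toy model of the HyperQ front end: the Grid Management Unit buffers
// pending grids and hands at most 32 of them to the Work Distributor
// per transfer (the limit cited in the Kepler white paper).
struct Grid {
    std::string kind;  // "graphics" or "compute"
};

class GridManagementUnit {
public:
    void submit(Grid g) { pending_.push_back(std::move(g)); }

    // Transfer up to 32 pending grids to the Work Distributor.
    std::vector<Grid> transfer_to_work_distributor() {
        const std::size_t n = std::min<std::size_t>(32, pending_.size());
        std::vector<Grid> batch(pending_.begin(), pending_.begin() + n);
        pending_.erase(pending_.begin(), pending_.begin() + n);
        return batch;
    }

    std::size_t pending() const { return pending_.size(); }

private:
    std::deque<Grid> pending_;
};
```

Submitting, say, 1 graphics grid and 40 compute grids would see only the first 32 forwarded in one transfer, with the remaining 9 left pending for a later one.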
Quote:


> That's all fine and dandy but why doesn't it work?


New information has come to light which stipulates that asynchronous compute + graphics does not work due to the lack of proper resource barrier support under HyperQ. What is a resource barrier? Resource barriers add commands to convert a resource (or resources) from one type to another (such as a render target to a texture), and prevent further command execution until the GPU has finished doing any work needed to convert the resources as requested. Without this feature, nVIDIA's HyperQ implementation cannot be used by DirectX 12 to execute graphics and compute commands in parallel. (Compute commands should, however, still be able to be executed in parallel with one another.)
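As a rough illustration of the rule just described, here is a minimal C++ model of a transition barrier. This is not the real D3D12 API; `Resource`, `transition_barrier`, and the state names are simplified stand-ins for D3D12-style resource state tracking.

```cpp
#include <stdexcept>

// A resource must be transitioned to the correct state before later
// commands may use it; the barrier is the point where that conversion
// (and any GPU work it requires) must complete.
enum class ResourceState { RenderTarget, ShaderResource };

struct Resource {
    ResourceState state = ResourceState::RenderTarget;
};

// Records a transition barrier: the resource's current state must match
// 'before', and subsequent commands observe the 'after' state.
void transition_barrier(Resource& r, ResourceState before, ResourceState after) {
    if (r.state != before)
        throw std::runtime_error("barrier 'before' state does not match");
    r.state = after;
}

// A draw that samples the resource as a texture requires ShaderResource state.
bool can_sample_as_texture(const Resource& r) {
    return r.state == ResourceState::ShaderResource;
}
```

A render target fresh out of a draw pass cannot be sampled as a texture until the barrier runs; that enforced wait is exactly what the quoted definition describes.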


[image]







An in-depth explanation can be found here: http://ext3h.makegames.de/DX12_Compute.html

Quote:


> What is preemption?


David Kanter explains it here (starts around 1:18:00 into the video):


[video]











Preemption is important for VR. GCN has finer-grained preemption, which allows the ACEs to execute a compute task asynchronously, in parallel with other tasks and at a lower latency, whenever VR headset movement is detected. The compute task being executed is an Asynchronous Timewarp: a compute shader which slightly alters the previously rendered frame in order to adjust for the new angle of the VR headset. Without this feature operating at low latency (20ms or less), motion sickness can ensue.
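The latency budget above can be put into a back-of-the-envelope sketch. The 16ms shader and 20ms comfort budget come from this thread; the 5ms timewarp cost used in the test is an assumed figure for illustration only.

```cpp
// With coarse-grained preemption, the timewarp must wait for the
// currently running shader to finish; with fine-grained preemption it
// can start almost immediately.
double timewarp_latency_ms(double running_shader_ms,
                           double timewarp_ms,
                           bool fine_grained) {
    const double wait = fine_grained ? 0.0 : running_shader_ms;
    return wait + timewarp_ms;
}

// The ~20ms threshold mentioned above, past which motion sickness can ensue.
bool within_comfort_budget(double latency_ms) { return latency_ms <= 20.0; }
```

With a 16ms shader in flight and an assumed 5ms timewarp, fine-grained preemption lands well inside the budget while coarse-grained preemption misses it.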

*What about Maxwell?*:

Say you have a graphics shader in a frame that is taking a particularly long time to complete. With Maxwell 2, you have to wait for this graphics shader to complete before you can preempt it. If a graphics shader takes 16ms, you have to wait until it completes before executing another graphics or compute command. This is because nVIDIA do not support finer-grained preemption; they support coarse-grained preemption. nVIDIA have made great strides with their VR implementation from Kepler to Maxwell, going from 57ms of latency down to 34ms, but they're still not there.


[image]







Quote:


> What is Slow Context Switching?


Slow context switching has to do with preemption in VR. Do you remember that 16ms graphics shader? Well, if it were a graphics task being executed instead of a shader, and you needed to preempt it with an Asynchronous Timewarp shader, then you would be switching from a graphics context to a compute context. Due to the resources within an SMM that are shared between graphics and compute jobs, such as the shared L1 cache, not only would you have to wait until the end of the execution of that graphics task (as mentioned in the preemption section), but you'd also need to perform a full "flush" of the SMM. A flush means emptying the caches and other shared state in the SMM. This can incur latency of upwards of 1,000ms.

Quote:


> Why is this different than GCN?


GCN doesn't have this issue because GCN has built-in hardware redundancy (and the power usage that goes along with it). Each CU has its own L1 cache, as do the Render Back-Ends, Texture Units, etc. GCN can switch contexts in a single cycle. With GCN, you can execute tasks simultaneously (without a waiting period), and the ACEs will also check for errors and re-issue tasks, if needed, to correct an error. You don't need to wait for one task to complete before you work on the next. So say, on GCN, a graphics shader task takes 16ms to execute; in that same 16ms you can execute many other tasks in parallel (like the compute and copy commands above). Your frame therefore ends up taking only 16ms, because you're executing several tasks in parallel. There's little to no latency or lag between executions, because they execute much like Hyper-Threading (hence the benefits to the LiquidVR implementation).


[image]







Quote:


> What does this mean for gaming performance?


Developers need to be careful about how they program for Maxwell 2; if they aren't, then far too much latency will be added to a frame. This is true even once nVIDIA fix their driver issue. It's an architectural issue, not a software issue.

Quote:


> It's architectural? How so?


Well, that's all in how a context switch is performed in hardware. In order to understand this, we need to understand something about the Compute Units found in every GCN-based graphics card since Tahiti. We already know that a Compute Unit can hold several threads, executing in flight, at the same time. The maximum number of threads executed concurrently, per CU, is 2,560 (40 wavefronts @ 64 threads each). GCN can, within one cycle, switch to and begin the execution of a different wavefront. While that's happening, GCN can also be working on graphics tasks. This allows the entire GCN GPU to execute and process both graphics and compute tasks simultaneously, with extremely low latency associated with switching between them. Idle CUs can be switched to a different task at a very fine-grained level, with a minimal performance penalty associated with the switch.
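The occupancy figure quoted above is simple arithmetic, written out here as a one-liner sketch:

```cpp
// Per-CU occupancy on GCN, per the figures in the text:
// up to 40 wavefronts in flight, each 64 threads wide.
constexpr int kWavefrontsPerCU = 40;
constexpr int kThreadsPerWavefront = 64;

constexpr int max_threads_per_cu() {
    return kWavefrontsPerCU * kThreadsPerWavefront;  // 2,560
}
```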


[image]








On top of what the Compute Units can do, the ACEs are also far more flexible than the AWSs found in Maxwell 2. Each ACE can synchronize with other ACEs in order to execute large workloads requiring dependencies (enforced by fences on the developer side). On top of this, if one ACE dispatches work to one particular Compute Unit (say CU1) and the result of the computed shader is required by another Compute Unit (say CU2), then the intermediate result can be placed into the LDS (Local Data Share) or the GDS (Global Data Share). The result is then pulled straight from the LDS/GDS by CU2 in order to complete the shader calculation. Each ACE can stop, start, or pause a task, or move intermediate data into memory. The image below explains just what ACEs can do:


[image]
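The LDS/GDS hand-off described above can be sketched, very loosely, with plain C++ threads. This models only the producer/consumer pattern, not actual GCN hardware; `SharedStore`, the "CU" labels, and the two-stage shader are all illustrative.

```cpp
#include <condition_variable>
#include <mutex>
#include <optional>
#include <thread>

// Stand-in for the Global Data Share: one unit of work writes an
// intermediate result, another blocks until it can read it.
class SharedStore {
public:
    void put(int value) {
        std::lock_guard<std::mutex> lk(m_);
        value_ = value;
        cv_.notify_one();
    }
    int get() {  // blocks until a value is available
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return value_.has_value(); });
        return *value_;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::optional<int> value_;
};

// "CU1" computes an intermediate result and deposits it in the shared
// store; "CU2" pulls it out to finish the calculation.
int run_two_stage_shader(int input) {
    SharedStore gds;
    std::thread cu1([&] { gds.put(input * input); });   // stage 1 on "CU1"
    int result = 0;
    std::thread cu2([&] { result = gds.get() + 1; });   // stage 2 on "CU2"
    cu1.join();
    cu2.join();
    return result;
}
```

The point of the pattern is that the consumer never re-reads main memory for the intermediate value; it waits on the shared store, which is what the LDS/GDS path provides in hardware.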







Quote:


> Well that's GCN's architecture... What about Maxwell 2 and that Slow Context Switching thing?


In terms of a context switch, Maxwell 2 can switch between a compute and a graphics task in a coarse-grained fashion, and pays a penalty (sometimes on the order of over a thousand cycles in worst-case scenarios) for doing so. While Maxwell 2 excels at ordering tasks, based on priority, in a way which minimizes conflicts between graphics and compute tasks, Maxwell 2 doesn't necessarily gain a boost in performance from doing so. This is why it remains to be seen whether Maxwell 2 will gain a performance boost from Asynchronous Compute. A developer would need to finely tune his or her code in order to derive any sort of performance benefit. From all the sources I've seen, Pascal will perhaps fix this problem (but wait and see, as it could just be speculation). There is also evidence that Pascal will not fix this issue, as you can see here:


[image]






Source here: Page 23

nVIDIA may not have thought that the industry would jump on DX12 the way it is right now, or VR for that matter, but many AAA titles will be heading down the DX12 route in 2016. We'll even get a few titles in the closing months of 2015. What's worse is that the majority of these titles have partnered with AMD. We can therefore be quite certain that Asynchronous Compute will be implemented, for AMD GCN at least, throughout the majority of DX12 titles to arrive in 2016.


[image]







*TechReport: David Kanter discusses Asynchronous Compute*


[video]











*Extra Goodies!*


[media]















There is no underestimating nVIDIA's capacity to fix their driver. It will be fixed. As for the performance derived from nVIDIA's solution? Best to wait and see.

Take Care









***If you spot a mistake, either PM me or post it in the comment section. Let's get this whole issue as factual as possible***


----------



## GorillaSceptre

Damn, nice work









We need an "Over 9 000!" REP+ button









Tons of info there, this should stop people from getting confused. Despite all the negativity that's been thrown your way, some of us really appreciate all the research you've put into this.


----------



## Mahigan

Quote:


> Originally Posted by *GorillaSceptre*
> 
> Damn, nice work
> 
> 
> 
> 
> 
> 
> 
> 
> 
> We need an "Over 9 000!" REP+ button
> 
> 
> 
> 
> 
> 
> 
> 
> 
> Tons of info there, this should stop people from getting confused. Despite all the negativity that's been thrown your way, some of us really appreciate all the research you've put into this.


Thank You brother


----------



## axizor

Before everyone yells wrong section at you, just wanted to say well done. Even though I dont have nVidia, I'm finding this very interesting and informative.

+rep


----------



## Shivansps

From the developer's point of view, what's the difference between sending a compute task to the DX12 device that is doing graphics and sending it to a secondary DX12 device that is doing nothing? The way I see it, this is a perfect example of a simple DX12 multi-adapter solution: one does graphics, the secondary does compute tasks. You just can't ignore that option.


----------



## Nehabje

Fantastic, no, awesome post!. +rep indeed







I'm looking forward to how this all will work out. That will decide whether or not I'll buy a second 970 next year or an AMD card. Or perhaps even something Pascal.


----------



## GorillaSceptre

Quote:


> Originally Posted by *Nehabje*
> 
> Fantastic, no, awesome post!. +rep indeed
> 
> 
> 
> 
> 
> 
> 
> I'm looking forward to how this all will work out. That will decide wether or not I'll buy a second 970 next year or a AMD card. Or perhaps even something Pascal.


Yup, solid research like this goes a long way in helping us make educated buying decisions. Something i used to count on tech websites for


----------



## Wuest3nFuchs

THX alot Mahigan !

I also follow this thread with big opened eyes http://www.overclock.net/t/1569897/various-ashes-of-the-singularity-dx12-benchmarks/0_20 !

+rep
+


----------



## ImJJames

I love you.


----------



## Kourin

Quote:


> Originally Posted by *axizor*
> 
> Before everyone yells wrong section at you, just wanted to say well done. Even though I dont have nVidia, I'm finding this very interesting and informative.
> 
> +rep


Ditto~


----------



## GnarlyCharlie

Aren't DirectX 12 and Asynchronous Compute two different things? There seems to be a bit of confusion, posters declaring that only AMD can implement DX12 since only their hardware architecture supports ASC.


----------



## Mahigan

Quote:


> Originally Posted by *GnarlyCharlie*
> 
> Aren't DirectX 12 and Asynchronous Compute two different things? There seems to be a bit of confusion, posters declaring that only AMD can implement DX12 since only their hardware architecture supports ASC.


Asynchronous Compute is a feature now made available by DirectX 12. It serves as a form of optimization to derive better performance out of a GPU by executing tasks in parallel... keeping the available graphics and compute resources fed.

GCN was built, from the ground up, with this sort of usage scenario in mind. GCN wasn't too great at DX11, though not horrible, because most of its capabilities were not exposed to the DX11 API.

Maxwell 2 is a step towards the direction of a more GCN-like architecture. We can only assume that Pascal will take an even greater step in that direction.

Current DX11 performance will remain the same, with nVIDIA's Maxwell 2 reigning as top dog. Under DX12, however, things will likely be quite different. We should see AMD and nVIDIA trading blows at first, at least until Pascal and Greenland arrive... then, who knows.


----------



## Nehabje

Quote:


> Originally Posted by *ImJJames*
> 
> I love you.


This comes to mind: https://www.youtube.com/watch?v=-7hjdC8-jbw









[Edit:] I'm sorry for being off topic, but I thought this was too much fun.


----------



## TopicClocker

Glad to see this thread!









A lot of people don't understand Asynchronous Compute or DX12 very well, so this should be really helpful for them.

Before this whole DX12 thing blew up, I read a lot about GCN's Asynchronous Compute capabilities and I understand it well; so I'm not unfamiliar with the subject at all.

In fact, I'd also like to help other people understand Asynchronous Compute and DX12.


----------



## Devnant

Great stuff Mahigan! Easy to understand too


----------



## Mad Pistol

Quote:


> Originally Posted by *Mahigan*
> 
> Asynchronous Compute is a feature, now made available, by DirectX 12. It serves as a form of optimization, in order to derive better performance out of a GPU, by executing tasks in parallel... keeping the available Graphic and Compute resources fed.
> 
> *GCN was built, from the ground up, with this sort of usage scenario in mind. GCN wasn't too great at DX11, though not horrible, because most of its capabilities were not exposed to the DX11 API.*
> 
> Maxwell 2 is a step towards the direction of a more GCN-like architecture. We can only assume that Pascal will take an even greater step in that direction.
> 
> Current DX11 performance will remain the same, with nVIDIAs Maxwell 2 reigning as top dog. Under DX12, however, things will likely be quite different. We should see AMD and nVIDIA trading blows at first, at least until Pascal and Greenland arrive... then who knows.


How interesting. In this context, GCN was a better architecture than Kepler or Maxwell, but because DX11 does not support a feature of the architecture natively, it has never been fully realized.

That's quite eye-opening.


----------



## mtcn77

Does this qualify for an editorial? We should have more of these in house analyses.


----------



## Sleazybigfoot

Clears up a lot of misinformation, no doubt.

Personally, I've thought of nVIDIA's driver fix as being like virtual 120Hz on TVs (doubling of the frames) versus a native 120Hz monitor. Yes, it'll work, but it'll only get you so far.

Nevertheless, nice post


----------



## HiTechPixel

Quote:


> Originally Posted by *Shivansps*
> 
> From the developer point of view, what the diference of sending a compute task to the DX12 device that is doing graphics or send it to a secondary DX12 device that is doing nothing? the way i see it, this is a perfect example of a simple DX12 Multiadapter solution, one does graphics, the secondary does compute tasks. You just cant ignore that option.


Is it possible to use 1 GPU for graphics and 1 GPU for compute WHILE AT THE SAME TIME rendering in SFR (split frame rendering)?


----------



## provost

Quote:


> Originally Posted by *mtcn77*
> 
> Does this qualify for an editorial? We should have more of these in house analyses.


My









I don't think there is any other site that can provide the breadth and exposure for talented independents to showcase their work in this rather niche "pc gaming"/"enthusiast" journalist segment.

There can be a lot of win-win opportunities, especially for the consumers, should this be expanded further.


----------



## lucasj1974

Thanks for your work on this subject. Top notch!


----------



## ToTheSun!

Quote:


> Originally Posted by *HiTechPixel*
> 
> Quote:
> 
> 
> 
> Originally Posted by *Shivansps*
> 
> From the developer point of view, what the diference of sending a compute task to the DX12 device that is doing graphics or send it to a secondary DX12 device that is doing nothing? the way i see it, this is a perfect example of a simple DX12 Multiadapter solution, one does graphics, the secondary does compute tasks. You just cant ignore that option.
> 
> 
> 
> Is it possible to use 1 GPU for graphics and 1 GPU for compute WHILE AT THE SAME TIME rendering in SFR (split frame rendering)?

AsynX?


----------



## PostalTwinkie

Don't want to be a downer....

But are we allowed to make our own news source, and thread, when we become the cited source for all the news??

EDIT:

Is this a situation of "why buy a cow when you can get the milk free?"


----------



## Randomdude

Yeah, this should not be in news regardless of how you look at it.


----------



## Paul17041993

This won't be for some time though, but does anyone have ideas they would like to share for a possible in-depth benchmark tool? I currently have plans for a circle-raster particle renderer and a real-time ray-tracer; AI is another idea, but a little more complex. All of these would use async extensively, and I intend to have it run on DX12, OGL + OCL, and Vulkan when the API goes public.


----------



## Mad Pistol

Quote:


> Originally Posted by *Randomdude*
> 
> Yeah, this should not be in news regardless of how you look at it.


I agree, but in this case, it's quite a revelation, so I'm ok with it staying.


----------



## mtcn77

Quote:


> Originally Posted by *PostalTwinkie*
> 
> Don't want to be a downer....
> 
> But are we allowed to make our own news source, and thread, when we become the cited source for all the news??
> 
> EDIT:
> 
> Is this a situation of "why buy a cow when you can get the milk free?"


He isn't an average amateur.


----------



## Mahigan

The mods and admins can move this thread if they wish... I honestly, my mistake, wasn't paying proper attention when I created it.

Sorry folks.


----------



## provost

Quote:


> Originally Posted by *Mahigan*
> 
> The mods and admins can move this thread if they wish... I honestly, my mistake, wasn't paying proper attention what I created it.
> 
> Sorry folks.


It would be a waste moving it anywhere else if this is the "hot topic", which it is. But, of course, it's their call.

Quote:


> Originally Posted by *Paul17041993*
> 
> This wont be for some time though, but anyone have ideas they would like to share for a possible in-depth benchmark tool? I currently have plans for a circle-raster particle renderer and real-time ray-trace, AI is another idea but a little more complex. All of these would use async extensively and I intend to have it run on DX12, OGL + OCL and Vulkan when the API goes public.


I say why not









I am always up for learning something new, even if what you wrote above sounds like Greek to me at this time... lol


----------



## Shivansps

Quote:


> Originally Posted by *HiTechPixel*
> 
> Is it possible to use 1 GPU for graphics and 1 GPU for compute WHILE AT THE SAME TIME rendering in SFR (split frame rendering)?


It's not needed, since DX12 supports asymmetric multi-GPU; if you can send rendering work to a secondary card, I guess you can send compute-only work as well.


----------



## PostalTwinkie

I am kind of just waiting for the patch from Oxide and nVIDIA; hopefully that shows up sooner rather than later.

EDIT:

Even if Nvidia only manages 50% of the gains that AMD has seen, do we expect further gains from AMD to be enough to surpass that?


----------



## Noufel

Many thnx Mahigan








DX12 is certainly a win-win for both AMD and nVIDIA owners. My only hope is that game developers will take advantage of this API and bring the PC gaming community a large panel of console games.


----------



## batman900

I don't usually ask many questions, to keep it short and sweet. Is nVIDIA trying to do, and in turn make developers do, a software trick to match what AMD is doing in their hardware?

Sorry, just trying to dumb it down for myself in how I read it.


----------



## Hattifnatten

Quote:


> Originally Posted by *Shivansps*
> 
> Quote:
> 
> 
> 
> Originally Posted by *HiTechPixel*
> 
> Is it possible to use 1 GPU for graphics and 1 GPU for compute WHILE AT THE SAME TIME rendering in SFR (split frame rendering)?
> 
> 
> 
> Its not needed, since DX12 supports Asymmetric Multi-GPU, if you can send a rendering work to a secondary card as will i guess you can send compute-only as well.

That would be wasting GPU resources though. ASC (as I've understood it) allows you to do multiple (different types of) workloads on the GPU at the same time = using all the resources available. Splitting the workload up so it will run on two GPUs is essentially taking what could be done on one GPU and throwing it at two instead, for no performance increase. I might be wrong here, but that's how I (and I believe many others) have understood the issue.


----------



## Randomdude

Quote:


> Originally Posted by *Shivansps*
> 
> Its not needed, since DX12 supports Asymmetric Multi-GPU, if you can send a rendering work to a secondary card as will i guess you can send compute-only as well.


Somehow these comments make me mentally prepare myself for the next gen milking where you will only have "amazing" performance if you further buy into the ecosystem (i.e. another card outside of SLI with different functions, not required but... might as well get it or gimp the performance you'd have with only a gfx). Seems reasonable to prepare the crowds.

EDIT: Since all your posts are related to nVidia not really having an issue if compute tasks can be offloaded, is that the driver fix coming for Maxwell? Offloading work? Sounds reasonable if it can be done, makes up for the hardware deficiency as well. Pascal might not need to be a "parallel" arch after all, who knows.


----------



## Anna Torrent

I think a nice table comparing some key features of current NV and AMD architectures would be nice - can make stuff a lot more readable

Excellent work!


----------



## Clocknut

Expecting game developers to watch out for how to program Maxwell 2, which has a smaller market share than the entire GCN lineup? Well... that's a lot to hope for.

In fact, it's also a lot to hope that nVIDIA will continue to maintain driver optimization for Maxwell 2 once Pascal is out.


----------



## 47 Knucklehead

Quote:


> Originally Posted by *Anna Torrent*
> 
> I think a nice table comparing some key features of current NV and AMD architectures would be nice - can make stuff a lot more readable
> 
> Excellent work!


That would be a great thing to have.









Since I'm on a multi-video card kick today ...

Crossfire scaled better than SLI. Meaning adding a second (or third or fourth) card will get you a higher FPS with each additional card. nVidia doesn't scale very well beyond 2 cards. Some will argue 3 cards is "ok", but pretty much no one argues that 4 nVidia cards is a waste.

SLi, unlike Crossfire, will work in non-Full Screen modes. As good as Crossfire is, when you are running in Windowed Mode or Full Screen Windowed mode, Crossfire just flat out does not work on cards 2, 3, or 4. SLi will take a hit when in non-Full Screen mode, about 50%, but at least cards 2, 3, and 4 do SOMETHING.


----------



## 47 Knucklehead

Quote:


> Originally Posted by *Clocknut*
> 
> Expecting game developer to watch out how to program Maxwell 2 which has smaller market share than entire GCN? Well..... thats a lot to hope for..
> 
> infact that is a lot to hope for Nvidia to continue to maintain driver optimization for Maxwell 2, once Pascal is out.


As opposed to what Developers had to do and watch out for from each card maker in the past, and do game optimizations for AMD and nVidia? Seriously?


----------



## Shivansps

Quote:


> Originally Posted by *Hattifnatten*
> 
> That would be wasting gpu-resources though. ASC (as I've understood it) allows you to do multiple (different types of) workloads on the gpu at the same time = using all the resources available. Splitting the workload up so it will run on two gpus is essentially taking what could be done on one gpu, and throw it at two instead for no performance increase. I might be wrong here though, but that's how I've (and I believe many others) have understood the issue.


It's far from ideal, but the thing with AC is the ability to process compute and graphics at the exact same time on a single card. The big question is, do you even need that when you can just send the compute task to another GPU? Even a DX12 IGP may work well here.

It is wasting GPU resources on your card, but if a card suffers in performance because of it, it may be an option. There has been a lot of talk about DX12 multi-GPU and asymmetric multi-GPU, of using different GPUs for rendering. This is simpler because you don't need the 2nd card to do graphics, and even a DX12 IGP may be enough depending on the task.


----------



## Slaughterem

I believe there can be another solution for nVIDIA: having an internal GPU, either an APU or Intel iGPU, and assigning async compute to it. Maybe that is what Kollock was alluding to when he wrote
Quote:


> Regarding SLI and cross fire situations, yes support is coming. However, those options in the ini file probablly do not do what you think they do, just FYI. Some posters here have been remarkably perceptive on different multi-GPU modes that are coming, and let me just say that we are looking beyond just the standard Crossfire and SLI configurations of today. We think that Multi-GPU situations are an area where D3D12 will really shine. (once we get all the kinks ironed out, of course). I can't promise when this support will be unvieled, but we are commited to doing it right.


----------



## looniam

Quote:


> Originally Posted by *Mahigan*
> 
> The mods and admins can move this thread if they wish... I honestly, my mistake, wasn't paying proper attention what I created it.
> 
> Sorry folks.


TBH when i first saw the title i had assumed it was an official MS explanation.

this might help:
*Consolidated News Category Guidelines*

but don't let me be a "debbie downer," this will still get exposure on the front page under the "hot topics" LATEST DISCUSSIONS header.


----------



## Mahigan

Quote:


> Originally Posted by *looniam*
> 
> TBH when i first saw the title i had assumed it was an official MS explanation.
> 
> this might help:
> *Consolidated News Category Guidelines*
> 
> but don't let me be a "debbie downer," this will still get exposure on the front page under the "hot topics" LATEST DISCUSSIONS header.


In that case I hope the admins move it to the proper location.









As for the posting itself, anyone have any changes, found any errors and/or things they'd like to add?


----------



## Hattifnatten

Quote:


> Originally Posted by *Shivansps*
> 
> Its far from ideal, but the thing with AC is the ability to process compute and graphics at the exact same time on a single card, but the big question is, you even need that when you can just send the compute task to another GPU? even a DX12 IGP may work well here.
> 
> It is wasting GPU resources on your card, but if a card suffers in performance because of it, it may be a option, there has been a lot of talk about DX12 Multi GPU and Asymmetric Multi-GPU, of using diferrent GPUs for rendering, this is simpler because you dont need the 2nd card to do graphics, and even a DX12 IGP may be enoght depending on the task.


Since HiTechPixel referred to SFR, I was thinking more along the lines of SLI/XFire-setups, meaning two or more identical cards









Having a weaker dGPU and/or iGPU alongside your primary one does indeed make for interesting cases, and is also surprisingly "common". How many people here have had/experimented with a low(er)-end GPU dedicated to PhysX? Not quite what we're talking about here, but the concept in its most simple form is already working, and has been for years. I also believe Thief was demoed on a Kaveri test rig with a 290X, where the iGPU did some compute work alongside the 290X. Such scenarios would demand careful programming though; if the iGPU cannot keep up with the dGPU while handling crucial workloads, performance would take a nose-dive.


----------



## FastEddieNYC

Great job condensing the material into an easy-to-understand summary.
I have a hard time understanding why nVIDIA, with their large driver team, was caught completely unprepared for a core feature of DX12. They were anxious to demonstrate conservative rasterization with Maxwell 2, but I have not found any public statements about async.


----------



## Paul17041993

Quote:


> Originally Posted by *Slaughterem*
> 
> I believe there can be another solution for Nvidia, Having an internal gpu either an APU or Intel and assigning Async compute to the IGPU. Maybe that is what Kollock was eluding to when he wrote


Using the iGPU to offload certain tasks is already a good option for us developers, however only certain iGPUs are really useful for this. The majority of intel iGPUs and older generation APUs have the same latency and non-global memory issues as using a dGPU.


----------



## Mahigan

Have you guys seen this?

http://megagames.com/news/nvidia-add-full-async-compute-support-driver
Quote:


> In a discussion on Overclock.com, Mahigan provided the following technical outline of the solution:


No, I did not provide a technical outline for the solution. I don't know what the solution to the issue is. I don't work for nVIDIA. All I did was explain the differences between HyperQ and AMD's Asynchronous Shading feature.

He goes on to say:
Quote:


> Of course this software-based solution will always be slower than native hardware implementation of the feature. The real question is: how considerable will the performance impact be?


This is what I want to avoid. I don't want people quoting me in press material. We don't even have all of the information, for both nVIDIA and AMD, as it pertains to their respective Asynchronous Compute implementations.

What we have are white papers, and those haven't been updated for nVIDIA's Maxwell/Maxwell 2. We're relying on Kepler documentation. What we're attempting to do is crowd-source as much information as we can on the topic in order to better understand the issue at hand.

We have an overall idea of how both Maxwell 2 and GCN implement the feature. We have David Kanter and Oculus discussing the issue. We have nVIDIA developer papers for Oculus VR revealing some pieces of the puzzle as well. It's shaping up to be a full explanation.

If anyone from the media wants to help out... you can refer folks to this thread instead of quoting the words in the thread. People can bring in the information they have in order to complete the pieces of the puzzle we're missing. What pieces? Well, for one... what are the explicit features of an Asynchronous Warp Scheduler? That information isn't available in the white papers. If anyone has this information... it could shed more light on this issue.


----------



## Slaughterem

Quote:


> Originally Posted by *Paul17041993*
> 
> Using the iGPU to offload certain tasks is already a good option for us developers, however only certain iGPUs are really useful for this. The majority of intel iGPUs and older generation APUs have the same latency and non-global memory issues as using a dGPU.


Thank you for providing this feedback. This was a theory that I had, but obviously latency is the key and would cause a performance hit. It is great to have those in the know, such as yourself and Mahigan, providing expertise in these matters.


----------



## provost

Quote:


> Originally Posted by *Mahigan*
> 
> Have you guys seen this?
> 
> http://megagames.com/news/nvidia-add-full-async-compute-support-driver
> No, I did not provide a technical outline for the solution. I don't know what the solution to the issue is. I don't work for nVIDIA. All I did was explain the differences in HyperQ as well as AMDs Asynchronous Shading feature.
> 
> He goes on to say:
> This is what I want to avoid. I don't want people quoting me in press material. We don't even have all of the information, for both nVIDIA and AMD, as it pertains to their respective Asynchronous Compute implementations.
> 
> What we have are White papers and those haven't been updated for nVIDIAs Maxwell/2. We're relying on Kepler documentation. What we're attempting to do is to crowd source as much information as we can on the topic in order to better understand the issue at hand.
> 
> We have an overall idea of how both Maxwell 2 and GCN implement the feature. We have David Kanter and Oculus discussing the issue. We have nVIDIA developer papers for Oculus VR revealing some pieces of the puzzle as well. It's shaping up to be a full explanation.
> 
> If anyone from the media wants to help out... you can reference folks to this thread instead of quoting the words in the thread. People can bring in the information they have in order to complete the pieces of the puzzle we're missing. What pieces? Well for one... what are the explicit features of an Asynchronous Warp Scheduler? That information isn't available in the White papers. If anyone has this information... this could shed more light on this issue.


Well, since there isn't a real response from Nvidia, I think Crazy Elf's description is the most plausible:

http://www.overclock.net/t/1569897/various-ashes-of-the-singularity-dx12-benchmarks/2340#post_24383714
Quote:


> I think that at this point, Nvidia has come to the conclusion that the damage they would do to themselves by saying something exceeds the damage by keeping quiet.
> 
> Admitting that their new architecture will do "x" is more or less an admission that their current architecture isn't going to do it well, especially given the scrutiny this is now getting. They likely won't say it until after they release a GPU better at parallel.


But I am still hoping that we can hear something concrete directly from Nvidia soon, rather than through Oxide's second-hand account of their version of it...


----------



## Mahigan

Question guys/gals,

What do you get from... "long way off" in terms of implementation of fine grain preemption?

"Long way off" would seem to indicate more than a year's time, no?


----------



## Clocknut

Quote:


> Originally Posted by *Mahigan*
> 
> Question guys/gals,
> 
> What do you get from... "long way off" in terms of implementation of fine grain preemption?
> 
> "Long way off" would seem to indicate more than a year's time, no?


It probably means more than one generation, rather than just a number of years.


----------



## Mahigan

Quote:


> Originally Posted by *batman900*
> 
> I don't usually ask many questions, to keep it short and sweet. Is Nvidia trying to use, and in turn make developers use, a software trick to match what AMD is doing in their hardware?
> 
> Sorry, just trying to dumb it down for myself in how I read it.


I think that what nVIDIA is trying to do is:

- Implement Asynchronous Compute in their driver
- Optimize the developers' code for use with their Asynchronous Compute implementation
- Make sure the developers limit the performance penalty associated with getting stuck behind a long draw call boundary

The nVIDIA hardware can perform Asynchronous Compute, but not on the same level as AMD's GCN.


----------



## Mahigan

Quote:


> Originally Posted by *provost*
> 
> Well, since there isn't a real response from Nvidia, I think Crazy Elf's description is the most plausible:
> 
> http://www.overclock.net/t/1569897/various-ashes-of-the-singularity-dx12-benchmarks/2340#post_24383714
> But, I am still hoping that we can hear something concrete directly from Nvidia soon, rather than through Oxide's second hand account of their version of it ...


I'm hoping the same









Quote:


> Originally Posted by *Clocknut*
> 
> It probably means more than one generation, rather than just a number of years.


Hmm....


----------



## PostalTwinkie

Quote:


> Originally Posted by *Mahigan*
> 
> Question guys/gals,
> 
> What do you get from... "long way off" in terms of implementation of fine grain preemption?
> 
> "Long way off" would seem to indicate more than a year's time, no?


Initially I didn't think much of it, until I realized how recently it was presented by Nvidia.

I would consider it at least one generation/architecture away. I took it as _"Pascal won't have it."_, or at least not in any meaningful way, which goes against what I would have thought Nvidia would prepare for. When I saw the slide I thought _"Well, maybe Pascal will be another pure 'gaming' GPU, like Maxwell"_. Are we now going to see a Pascal, and then a Pascal v2 that has true hardware compute?

Or one of the following:

- ASC really isn't that big of a deal, this is an edge case, and we are just blowing this way up.
- Nvidia was caught with their pants down.


----------



## Paul17041993

Quote:


> Originally Posted by *Mahigan*
> 
> Question guys/gals,
> 
> What do you get from... "long way off" in terms of implementation of fine grain preemption?
> 
> "Long way off" would seem to indicate more than a year's time, no?


3-5 years from the time of the statement, that is.


----------



## Sammael7

Does anyone have any insight into whether Pascal was altered early enough to better leverage async compute?

How long does it take to go from the initial design stage of a GPU architecture to taping it out for production? Two years? Less? More?

AMD made a big splash with Mantle in late 2013; that is when their intentions on how games ought to run were put out into the world in practice... well, perhaps early 2014, as some games needed updates.

Was the Mantle-like low overhead of DX12 something that was sparked by Mantle, or was it independently designed into DX12? I have heard it said that Nvidia was in talks with Microsoft about DX12 for YEARS before Mantle arrived; is that true? Was it ever really a thing before Mantle?

VR is another area where it became clear that lower latency was extremely important. When did that first become clear to the graphics parties at Nvidia/AMD? The Kickstarter was in 2012, was it not? But when did Nvidia/AMD realize that lower latency was important enough to start building it into their future GPUs?

If the answer to these questions is that they knew early, then chances are Pascal will be much better. If not? ....


----------



## dogen1

Quote:


> Originally Posted by *Paul17041993*
> 
> This won't be for some time though, but does anyone have ideas they would like to share for a possible in-depth benchmark tool? I currently have plans for a circle-raster particle renderer and a real-time ray tracer; AI is another idea but a little more complex. All of these would use async extensively, and I intend to have them run on DX12, OGL + OCL, and Vulkan when the API goes public.


Are you planning on simulating a best case scenario for async compute?


----------



## Paul17041993

Quote:


> Originally Posted by *dogen1*
> 
> Are you planning on simulating a best case scenario for async compute?


Ideally (for at least one test) it'll stress the GPU dynamically and graph out the results.


----------



## Randomdude

Quote:


> Originally Posted by *Mahigan*
> 
> Question guys/gals,
> 
> What do you get from... "long way off" in terms of implementation of fine grain preemption?
> 
> "Long way off" would seem to indicate more than a year's time, no?


I'd say a minimum of 2 months, up to a year or a year and a half. See what I did there? hint hint: Ti


----------



## bluezone

I don't know if this will help any of the discussion on the timing of implementation, but a while back I ran into this on RedGamingTech.

http://www.redgamingtech.com/playstation-4-gpu-next-gen-amd-radeon-volcanic-island-gpu-compute-similarities/
http://www.redgamingtech.com/dx12-asynchronous-shader-command-buffer-analysis-amd-exclusive-interview-details/

or vids

https://www.youtube.com/watch?v=YWqoprZWDcY
https://www.youtube.com/watch?v=hUCyxmAVH64

P.S. I've been happily reading your thoughts and information on this subject.
Keep up the good work.


----------



## malitze

Hey guys,

first of all, many thanks to Mahigan for your efforts; I really enjoy this insightful discussion here and in the other thread.









However, there are some things I have yet to understand; maybe you can help clarify a little. Concerning Nvidia, the slides mention the need for more fine-grained preemption. Wouldn't that, on the other hand, introduce more scheduling overhead and increase the number of (rather costly) context switches? And how would that help increase parallelism (if it does)?

Sorry if I'm somehow missing the obvious.


----------



## HiTechPixel

Wouldn't surprise me if NVIDIA made it so you can offload compute to another GPU entirely and thus force people to use SLI in order to get better performance out of DX12. They're after your money, after all.


----------



## Paul17041993

Quote:


> Originally Posted by *malitze*
> 
> Hey guys,
> 
> first of all, many thanks to Mahigan for your efforts; I really enjoy this insightful discussion here and in the other thread.
> 
> 
> 
> 
> 
> 
> 
> 
> 
> However, there are some things I have yet to understand; maybe you can help clarify a little. Concerning Nvidia, the slides mention the need for more fine-grained preemption. Wouldn't that, on the other hand, introduce more scheduling overhead and increase the number of (rather costly) context switches? And how would that help increase parallelism (if it does)?
> 
> Sorry if I'm somehow missing the obvious.


That's just the thing: we're not going to know what exactly will happen until they actually get to that stage. For now, an overhead increase remains "likely", but of low significance.
Quote:


> Originally Posted by *HiTechPixel*
> 
> Wouldn't surprise me if NVIDIA made it so you can offload compute to another GPU entirely and thus force people to use SLI in order to get better performance out of DX12. They're after your money, after all.


Well, I don't think they can; it would defy the core DX12 and Vulkan API behaviour.


----------



## flopper

Quote:


> Originally Posted by *Paul17041993*
> 
> 3-5 years from the time of the statement, that is.


I've said for a long time that Nvidia sold old tech disguised as "new".









Good plan, AMD.


----------



## airfathaaaaa

I wonder if that means we will be able to use Nvidia and AMD cards together again...


----------



## Anna Torrent

Mahigan, I really think a basic table would be very useful, comparing GCN vs. Maxwell 2 and maybe others.


----------



## PostalTwinkie

Quote:


> Originally Posted by *airfathaaaaa*
> 
> I wonder if that means we will be able to use Nvidia and AMD cards together again...


If the developer wants to support it in a particular title, it is technically possible.


----------



## Mahigan

Quote:


> Originally Posted by *malitze*
> 
> Hey guys,
> 
> first of all, many thanks to Mahigan for your efforts; I really enjoy this insightful discussion here and in the other thread.
> 
> 
> 
> 
> 
> 
> 
> 
> 
> However, there are some things I have yet to understand; maybe you can help clarify a little. Concerning Nvidia, the slides mention the need for more fine-grained preemption. Wouldn't that, on the other hand, introduce more scheduling overhead and increase the number of (rather costly) context switches? And how would that help increase parallelism (if it does)?
> 
> Sorry if I'm somehow missing the obvious.


The issue with context switches is that, while they do incur a performance penalty, they're almost inevitable. Therefore you do want to have the hardware in place to mitigate the associated performance penalty. A move to a more fine-grained preemption implementation would at least alleviate some of that penalty.

As architectures move further down the road of parallelism and compute, it seems rather imperative that they take these issues seriously. I don't think developers want to be coding two different sets of games, one for the consoles and one for the PC. Microsoft is pushing for a merger between the two platforms (cross-platform gaming), and developers are looking to make more money by investing fewer resources in porting games from the console platform to the PC platform. There's a market incentive, profit by controlling costs, behind this move.

If either GPU manufacturer thinks they can sit on their laurels and ignore where developers are heading in terms of what they're developing for the consoles... they're in for a shock, imo.
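The context-switch argument above can be put into a toy model. All numbers here are made up purely for illustration (they are not measured Maxwell 2 or GCN figures): a high-priority compute task arrives while a long draw call is running, and we compare how long it waits under coarse-grained (draw-call boundary) versus fine-grained (wavefront-level) preemption.

```python
# Toy preemption model with purely illustrative timings, in microseconds.
DRAW_CALL_US = 16_000      # a hypothetical 16 ms graphics shader
WAVEFRONT_US = 10          # hypothetical duration of one wavefront slice
CONTEXT_SWITCH_US = 100    # hypothetical cost of the switch itself

def wait_coarse(arrival_us: int) -> int:
    """Coarse-grained: the compute task cannot start until the whole
    draw call reaches its boundary, then pays the switch cost."""
    return (DRAW_CALL_US - arrival_us) + CONTEXT_SWITCH_US

def wait_fine(arrival_us: int) -> int:
    """Fine-grained: the task waits only for the current wavefront
    slice to finish, then pays the same switch cost."""
    remainder = arrival_us % WAVEFRONT_US
    return (WAVEFRONT_US - remainder) + CONTEXT_SWITCH_US

arrival = 2_000  # task arrives 2 ms into the draw call
print(wait_coarse(arrival))  # 14100 us: stuck behind the draw call
print(wait_fine(arrival))    # 110 us: one wavefront slice plus the switch
```

Note that fine-grained preemption doesn't remove the switch cost itself; it removes the long, unpredictable wait for a draw-call boundary, which is exactly the latency that async compute and VR care about.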

Quote:


> Originally Posted by *PostalTwinkie*
> 
> Initially I didn't think much of it, until I realized how recently it was presented by Nvidia.
> 
> I would consider it at least one generation/architecture away. I took it as _"Pascal won't have it."_, or at least not in any meaningful way, which goes against what I would have thought Nvidia would prepare for. When I saw the slide I thought _"Well, maybe Pascal will be another pure 'gaming' GPU, like Maxwell"_. Are we now going to see a Pascal, and then a Pascal v2 that has true hardware compute?
> 
> 
> Spoiler: Warning: Spoiler!
> 
> 
> 
> Or one of the following:
> 
> - ASC really isn't that big of a deal, this is an edge case, and we are just blowing this way up.
> - Nvidia was caught with their pants down.


That's the way I took it. I took it as "wait until Volta"... but I'm not ready to discount nVIDIA, as they have enormous talent on their engineering team(s). One of the things I'll be looking for, when Pascal launches, is the white papers. I'll be doing the same for Greenland. I don't feel comfortable buying a GPU based on the present benchmarking trends, especially considering that I tend to purchase two high-end GPUs (if not more; I did have two GTX 680s and a GTX 560 for PhysX a while back). I don't want to make these sorts of investments if they're only going to bring short-term gains. That's my approach... everyone has their own subjective views on the matter.


----------



## PostalTwinkie

Quote:


> Originally Posted by *Mahigan*
> 
> That's the way I took it. I took it as "wait until Volta"... but I'm not ready to discount nVIDIA, as they have enormous talent on their engineering team(s). One of the things I'll be looking for, when Pascal launches, is the white papers. I'll be doing the same for Greenland. I don't feel comfortable buying a GPU based on the present benchmarking trends, especially considering that I tend to purchase two high-end GPUs (if not more; I did have two GTX 680s and a GTX 560 for PhysX a while back). I don't want to make these sorts of investments if they're only going to bring short-term gains. That's my approach... everyone has their own subjective views on the matter.


I actually considered Volta, but thought _"They will have to still offer Quadro from Pascal, so it has to have heavy compute somewhere."_ Or, that is how I assume it would work! Either way, I was thinking that maybe they planned on Volta, and just planned on hacking Pascal's compute off for GeForce. Now they might not do that.

You built GPUs! You tell us how this works!


----------



## Mahigan

Quote:


> Originally Posted by *PostalTwinkie*
> 
> I actually considered Volta, but thought _"They will have to still offer Quadro from Pascal, so it has to have heavy compute somewhere."_ Or, that is how I assume it would work! Either way, I was thinking that maybe they planned on Volta, and just planned on hacking Pascal's compute off for GeForce. Now they might not do that.
> 
> You built GPUs! You tell us how this works!


Back then, we had fixed-function rendering pipelines. I have no idea how things work nowadays; it's like learning everything from scratch. I've written some articles since then for some online publications, mainly around Cypress (HD 5000). That's also around the time I was frequenting the Beyond3D forums as well as the RealWorldTech comment sections (David Kanter's website). GCN and Maxwell 2 (or Fermi, for that matter) are extremely complex architectures. Most of the information in the white papers is basic at best. I'm hoping that this trend changes, but I understand that IP needs to be protected.

I had some pretty good exchanges with folks over at HardOCP on the topic; some claimed that I was looking to make some extra pocket change. Not the case at all, of course. I figured I'd be as open and transparent as possible... it seems this has backfired to a degree.


----------



## Tojara

Quote:


> Originally Posted by *PostalTwinkie*
> 
> I actually considered Volta, but thought _"They will have to still offer Quadro from Pascal, so it has to have heavy compute somewhere."_ Or, that is how I assume it would work! Either way, I was thinking that maybe they planned on Volta, and just planned on hacking Pascal's compute off for GeForce. Now they might not do that.


I think that even though workstation-type programs are labelled by some as compute, the work there is still closer to what games (at least older ones) do. At least here, the performance of the Quadro K5000 is reasonably competitive even though it has, at best, 1/20 of the DP capability of the other cards, aside from the specific FP64 tests. I'd imagine most workloads today are very heavily biased towards either graphics or compute, not both.


----------



## Paul17041993

Quote:


> Originally Posted by *PostalTwinkie*
> 
> If the developer wants to support it in a particular title, it is technically possible.


I think it'd happen regardless; all GPUs in a system are fairly generic to DX12, so when developers do multi-GPU support it'd likely end up scaling across vendors and models anyway. Though it would depend on how the developer codes it: if the code specifically checks that each card has the same number of threads or amount of memory, then you'll need cards that are identical to each other.
I haven't really gone that far into DX12 though, so I'm not really sure.

Quote:


> Originally Posted by *Mahigan*
> 
> Back then, we had fixed-function rendering pipelines. I have no idea how things work nowadays.


Well, we have instruction, function and data L1 caches, and ALUs (or IPU+FPU) that can do both integer and floating-point operations; those are the major differences that are fairly obvious. Of course there are going to be a lot of smaller differences everywhere.

A couple of common notes:
- Shader code must be able to fit, fully expanded, into the L1 cache, which is small, so it's not uncommon for massive for-loops to fail to compile or to work incorrectly. Nvidia, I believe, still calls these caches registers.
- 32-bit + 32-bit integer mathematics will run as double-precision, making it very slow, so ideally use 24-bit integers or smaller.
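To put a rough number on the first note, here is a quick sketch, with entirely hypothetical figures (a made-up 32 KiB instruction cache and 8-byte instruction encodings), of how fast a fully unrolled loop outgrows a small cache:

```python
def unrolled_instructions(body_instructions: int, trip_count: int) -> int:
    """Instruction count after a compiler fully unrolls a loop:
    the body is duplicated once per iteration."""
    return body_instructions * trip_count

# Hypothetical cache budget: 32 KiB of instruction cache, 8 bytes each.
ICACHE_BUDGET = 32 * 1024 // 8   # 4096 instructions

for trips in (16, 256, 4096):
    size = unrolled_instructions(20, trips)  # 20-instruction loop body
    verdict = "fits" if size <= ICACHE_BUDGET else "too large"
    print(f"{trips:5d} iterations -> {size:6d} instructions ({verdict})")
```

Even a modest 20-instruction body blows past this hypothetical budget a few hundred iterations in, which is one plausible reading of why massive for-loops can fail to compile or misbehave.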


----------



## Klocek001

Quote:


> Originally Posted by *HiTechPixel*
> 
> Wouldn't surprise me if NVIDIA made it so you can offload compute to another GPU entirely and thus *force people to use SLI* in order to get better performance out of DX12. They're after your money, after all.


You can't force anyone into buying a card, let alone into using SLI. It's just plain ridiculous to think it's going to happen like that. Frankly, if anyone really ran SLI only for the async compute capabilities of GeForce, they'd deserve to be ripped off; they had an alternative in the form of GCN cards and consciously decided not to use it.
But I am curious whether async compute would run better in a system with a Pascal card and a 980 Ti in multi-adapter mode, even if both of these architectures supported async only via software/hardware like Maxwell 2.


----------



## willibj

Spoiler: Warning: Spoiler!



Quote:


> Originally Posted by *Mahigan*
> 
> **This is a work in progress and should be viewed as such**
> 
> It's been several weeks since the Ashes of the Singularity benchmarks hit the PC gaming scene and brought a new feature into our collective vocabulary. Throughout these past few weeks, there has been a lot of confusion and misinformation spreading throughout the web as a result of the rather complex nature of this topic. In an effort to combat this misinformation, @GorillaSceptre asked if a new thread could be started in order to condense a lot of the information which has been gathered on the topic. This thread is by no means final. This thread is set to change as new information comes to light. If you have new information, feel free to bring it to the attention of the Overclock.net community as a whole by way of commenting on this thread.
> 
> As things stand right now, Sept 6, 2015, we're waiting for a new driver from nVIDIA to rectify an issue which has inhibited the Maxwell 2 series from supporting Asynchronous Compute. Both Oxide, the developer of the Ashes of the Singularity video game and benchmark, and nVIDIA are working hard to implement a fix to this issue. While we wait for this fix, let's take an opportunity to break down some misconceptions.
> 
> nVIDIA HyperQ
> 
> nVIDIA implement Asynchronous Compute through what nVIDIA calls "HyperQ". HyperQ is a hybrid solution which is part software scheduling and part hardware scheduling. While little information is available as it pertains to its implementation in nVIDIAs Maxwell/Maxwell 2 architecture, we've been able to piece together just how it works from various sources.
> 
> *Now I'm certain many of you have seen this image floating around, as it pertains to Kepler's HyperQ implementation*:
> 
> 
> Spoiler: Warning: Spoiler!
> 
> 
> 
> 
> 
> 
> 
> What the image shows is a series of CPU Cores (indicated by blue squares) scheduling a series of tasks to another series of queues (indicated in black squares) which are then distributed to the various SMMs throughout the Maxwell 2 architecture. While this image is useful, it doesn't truly tell us what is going on. In order to figure out just what those black squares represent, we need to take a look at the nVIDIA Kepler White Papers. Within these white papers we find the HyperQ being defined as such:
> 
> 
> Spoiler: Warning: Spoiler!
> 
> 
> 
> 
> 
> 
> 
> *Based on this diagram we can infer that HyperQ works through two software components before tasks are scheduled to the hardware*:
> 
> Grid Management Unit
> Work Distributor
> 
> *So far the scheduling works like this*:
> 
> The developer sends the batch to the nVIDIA software driver
> Within the nVIDIA software driver, the batch is held in a Grid Management Unit
> The Grid Management Unit transfers 32 pending grids (32 Compute or 1 Graphic and 31 Compute) to the Work Distributor
> The Work Distributor transfers the 32 Compute or 1 Graphic and 31 Compute tasks to the SMMs, which are a hardware component within the nVIDIA GPU.
> The components within the SMMs which receive the tasks are called Asynchronous Warp Schedulers and they assign the tasks to available CUDA cores for processing.
> While it is true that nVIDIAs HyperQ solution would incur a larger degree of CPU overhead, under DirectX12, than AMDs hardware scheduling implementation, it remains to be seen whether or not this will affect performance. It could very well affect performance on, say, a Core i3 or AMD FX processor but you don't see too many people pairing a powerful nVIDIA GeForce GTX 970/980/980 Ti with an under-powered CPU. If folks have the money to dish out on a powerful GPU, they tend to also have the money to dish out on a robust Quad Core processor (Core i5/i7 series).
> 
> Therefore there seems to be confusion as to what "software scheduling" means. *It doesn't mean that Maxwell 2 will be doing compute tasks over the CPU*. It means that there is a limitation added to the way nVIDIAs solution can execute tasks in parallel. It's all about what David Kanter was saying on preemption here (starts around 1:18:00 into the video):
> 
> 
> Spoiler: Warning: Spoiler!
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> The main problem with nVIDIAs HyperQ implementation is in terms of *preemption*. It is in part due to the software scheduling aspect of nVIDIAs Maxwell 2 solution as well as the limitations, in terms of *Not being able to execute a second compute task until the end of a draw call boundary (lag or latency being introduced) and lack of error verification*, from the Asynchronous Warp Schedulers.
> 
> *Example*:
> 
> Say you have a graphic shader, in a frame, that is taking a particularly long time to complete. With Maxwell 2, you have to wait for this graphic shader to complete before you can execute another task. If a graphic shader takes 16ms, you have to wait till it completes before executing another graphic or compute command. This is what is called "*slow context switching*". Every millisecond brings down your FPS for that frame. Therefore if a graphic shader takes 16ms to execute, a compute task takes 20ms to execute and you've got a copy command taking 5ms to execute, you end up with a frame which takes 41ms to execute. This introduces a delay, or lag/latency, between executions. While preemption wasn't important for DX11, and nVIDIA primarily designed their Kepler/Maxwell/Maxwell 2 architectures with DX11 in mind, it becomes quite important for DX12 Asynchronous Compute as well as VR.
> 
> 
> Spoiler: Warning: Spoiler!
> 
> 
> 
> 
> 
> 
> With GCN, you can execute tasks simultaneously (without a waiting period), the ACEs will also check for errors and re-issue, if needed, to correct an error. You don't need to wait for one task to complete before you work on the next. So say, on GCN, a Graphic shader task takes 16ms to execute, in that same 16ms you can execute many other tasks in parallel (like the compute and copy command above). Therefore your frame ends up taking only 16ms because you're executing several tasks in parallel. There's little to no latency or lag between executions because they execute like Hyper-threading (hence the benefits to the LiquidVR implementation).
> 
> 
> Spoiler: Warning: Spoiler!
> 
> 
> 
> 
> 
> 
> Developers need to be careful about how they program for Maxwell 2, if they aren't... then far too much latency will be added to a frame. This is true even once nVIDIA fix their driver issue. It's an architectural issue, not a software issue.
> Well that's all in how a context switch is performed in hardware. In order to understand this, we need to understand something about the Compute Units found in every GCN based Graphics card since Tahiti. We already know that a Compute Unit can hold several threads, executing in flight, at the same time. The maximum amount of simultaneous threads executed concurrently, per CU, is 2,560 (40 Wavefronts @ 64 Threads ea). GCN can, within one cycle, switch to and begin the execution of a different Wavefront . This allows the entire GCN GPU to perform both a Graphics and Compute task, simultaneously, with extremely low-latency being associated with switching between them. Idle CUs can be switched to a different task at a very fine grained level with a minimum performance penalty associated with the switch.
> 
> 
> Spoiler: Warning: Spoiler!
> 
> 
> 
> 
> 
> 
> 
> In terms of a Context Switch, Maxwell 2 can switch between a Compute and Graphics task in a coarse-grained fashion and pays a penalty (sometimes to the order of over a thousand cycles in worst case scenarios) for doing so. While Maxwell 2 excels at ordering tasks, based on priority, in a way which minimizes conflicts between Graphics and Compute tasks; Maxwell 2 doesn't necessarily gain a boost in performance from doing so. The reason for this is that Maxwell 2 takes a rather large hit when switching contexts. This is why it remains to be seen if Maxwell 2 will gain a performance boost from Asynchronous Compute. A developer would need to finely tune his/her code in order to derive any sort of performance benefits. From all the sources I've seen, Pascal will perhaps fix this problem (but wait and see as it could just be speculation). There is also evidence that Pascal will not fix this issue as you can see here (provided by @Vesku):
> 
> 
> Spoiler: Warning: Spoiler!
> 
> 
> 
> 
> 
> 
> Source here
> 
> nVIDIA may not have thought that the industry would jump on DX12 the way it is right now... pretty much every single title, in 2016, will be built around the DX12 API. We'll even get a few titles in a few months in 2015. What's worse is that the majority of titles, in 2015 and 2016, have partnered with AMD. We can therefore be quite certain that Asynchronous Compute will be implemented, for AMD GCN at least, throughout the majority of DX12 titles to arrive in 2015/2016.
> 
> 
> Spoiler: Warning: Spoiler!
> 
> 
> 
> 
> 
> 
> 
> There is no underestimating nVIDIAs capacity to fix their driver. It will be fixed. As for the performance derived out of nVIDIAs HyperQ solution? Best to wait and see.
> 
> Take Care
> 
> 
> 
> 
> 
> 
> 
> 
> 
> ***If you spot a mistake, either PM me or post it in the comment section. Lets get this whole issue as factual as possible***






Absolutely bloody brilliant write up mate. +Rep and eternal thanks, this has been easily the best and most objective write up on the Async issue I've read so far, and there's been a lot of reading on the matter. Subbed too


----------



## Paul17041993

Quote:


> Originally Posted by *Klocek001*
> 
> you can't force anyone into buying a card, let alone use SLI.


weeelll the old nvidia chipsets kinda did, but that was very early stuff and why ATI went for a special DVI cable trick instead...


----------



## Klocek001

Quote:


> Originally Posted by *Paul17041993*
> 
> weeelll the old nvidia chipsets kinda did, but that was very early stuff and why ATI went for a special DVI cable trick instead...


No, they did not force you into running SLI; it worked with a single card as well.


----------



## Mahigan

New article up on Extreme Tech:
http://www.extremetech.com/extreme/213519-asynchronous-shading-amd-nvidia-and-dx12-what-we-know-so-far


----------



## Anna Torrent

I wonder what the comparisons would look like between a fully utilized Maxwell 2 and the latest GCN-based Radeon GPUs


----------



## Mahigan

My take is that Maxwell 2 is already being well utilized by the DX11 API, and this of course carries over to DX12. GCN was not being fully utilized under DX11. DX12 frees up GCN, while Fiji appears to still be underutilized, in AotS at least.

We ought to get a better glimpse of Fiji vs Maxwell 2 in some upcoming DX12 titles.


----------



## Anna Torrent

1. Well, some benchmarks did show higher results for a GTX 980 in some cases
2. I guess I should say - results for optimized code. In some cases the DX11 results were higher
3. If the GTX 980 is fully and optimally utilized, ACE-like functionality wouldn't have such a huge impact, right? Even with good context switching you wouldn't be able to squeeze more, given the code is optimized, unless there is some other parameter in this process like better caching. OR, an optimized code in this case is not optimized globally for this architecture.

Bottomline, like you've said, we'll have to wait

BTW, you know if GPUs like the mobile Radeon 8750M/8850M also have ACEs?


----------



## Mahigan

Quote:


> Originally Posted by *Anna Torrent*
> 
> 1. Well, some benchmarks did show higher results for a GTX 980 in some cases
> 2. I guess I should say - results for optimized code. In some cases the DX11 results were higher
> 3. If the GTX 980 is fully and optimally utilized, ACE-like functionality wouldn't have such a huge impact, right? Even with good context switching you wouldn't be able to squeeze more, given the code is optimized, unless there is some other parameter in this process like better caching. OR, an optimized code in this case is not optimized globally for this architecture.
> 
> Bottomline, like you've said, we'll have to wait
> 
> BTW, you know if GPUs like the mobile Radeon 8750M/8850M also have ACEs?


I'm not 100% sure on this, but I believe the Mobile Radeon 8800M series is based on AMD's Cape Verde architecture: something in between a Radeon HD 7730 and a 7770.

Like the Radeon HD 7730:
24 texture units
32 z/stencil ROP units
8 color ROP units

Like the Radeon HD 7770:
10 compute units (640 stream processors)

Source.

Therefore it should, in theory, have retained two ACE units (16 Queues).
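As a quick sanity check on that figure, assuming (as elsewhere in this thread) that each GCN ACE manages 8 compute queues:

```python
# Queue arithmetic for GCN parts, assuming 8 queues per ACE.
# The per-part ACE counts below are as discussed in this thread,
# not official AMD specifications.

QUEUES_PER_ACE = 8

def compute_queues(ace_count):
    """Total compute queues exposed by a given number of ACEs."""
    return ace_count * QUEUES_PER_ACE

print(compute_queues(2))  # Cape Verde-class parts: 16 queues
print(compute_queues(8))  # later GCN parts such as Fiji: 64 queues
```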




----------



## Anna Torrent

AMD did pretty badly in the laptop space even though you could get an 8850M + i5 for $500 (vs a GT 840M at most), and indeed, if I remember correctly, it was underutilized in some games I tested.


----------



## Mahigan

Quote:


> Originally Posted by *Anna Torrent*
> 
> AMD did pretty badly in the laptop space even though you could get an 8850M + i5 for $500 (vs a GT 840M at most), and indeed, if I remember correctly, it was underutilized in some games I tested.


For laptops, I've generally gone nVIDIA, mostly because I couldn't find a decent laptop with AMD graphics and an Intel CPU at a competitive price point. I think that, sometime next year or maybe in 2017, we might see AMD Zen-based mobile APUs utilizing HBM memory. That's something I'd likely purchase (assuming Zen makes good on its IPC promises).


----------



## Anna Torrent

Well, Clevo had variants with 8970M/M290X. You can still get these for $1000-$1100 new. And for $500 you could get the deceased Dell Latitude 3540 with an I5 and 8850M (I think it was a DDR3 variant)


----------



## Mahigan

Quote:


> Originally Posted by *Anna Torrent*
> 
> Well, Clevo had variants with 8970M/M290X. You can still get these for $1000-$1100 new. And for $500 you could get the deceased Dell Latitude 3540 with an I5 and 8850M (I think it was a DDR3 variant)


A refurbished Dell? Might be a good idea since I still have contacts over at Dell


----------



## Paul17041993

Quote:


> Originally Posted by *Mahigan*
> 
> For laptops, I've generally gone nVIDIA, mostly because I couldn't find a decent laptop with AMD graphics and an Intel CPU at a competitive price point. I think that, sometime next year or maybe in 2017, we might see AMD Zen-based mobile APUs utilizing HBM memory. That's something I'd likely purchase (assuming Zen makes good on its IPC promises).


AMD dedicated GPUs in laptops are pretty common for low-mid range in AU, friend and his brother got a Dell each with AMD + intel and I haven't heard any complaints about them. My laptop however has a GT 540M and 2nd gen i5, went through 3 motherboard replacements and re-builds in its 3 year warranty and all 4 540M's were defective. As you might imagine I'm not very fond of nvidia since that ordeal...


----------



## Mahigan

Quote:


> Originally Posted by *Paul17041993*
> 
> AMD dedicated GPUs in laptops are pretty common for low-mid range in AU, friend and his brother got a Dell each with AMD + intel and I haven't heard any complaints about them. My laptop however has a GT 540M and 2nd gen i5, went through 3 motherboard replacements and re-builds in its 3 year warranty and all 4 540M's were defective. As you might imagine I'm not very fond of nvidia since that ordeal...


When I worked at Dell, I worked for the Executive Team (Michael Dell's Office). My work consisted of analyzing consumer complaints as they pertained to engineering issues which could potentially lead to lawsuits, and attempting to resolve those issues outside of the courts. While I can't comment on any specific cases, for legal reasons, I can say that there was an issue associated with vertical lines caused by defective solder points on nVIDIA GPUs. That issue cost me sleep and caused a lot of headaches for over a year, so I'm very much aware of the problems with nVIDIA's mobile products (at least from that point in time). I don't know if nVIDIA have rectified these issues, I'd assume they have, but I haven't experienced that issue with my Quadro-based laptop.

On another point... regarding Fiji...

Author: David Kanter
Quote:


> AMD's Fiji GPU uses a tweaked GCN architecture and the new High Bandwidth Memory to deliver a sizable 20-30% performance boost on DX11 games, *with the promise of even greater benefits for DX12*. The new GPU powers the already shipping Fury X, the first graphics card to come standard with water cooling. Competitively, Fiji improves AMD's situation, but it still cannot rival the performance (or power efficiency) of Nvidia's Maxwell architecture on existing APIs. *With the release of Windows 10 and DX12, however, performance judgments require reevaluation.*
> 
> Fiji (and the smaller Tonga, known as the Radeon R9 285) is the company's third GPU generation developed in 28nm. For many foundry customers, including AMD and Nvidia, the latest process technology has proven to be far from the greatest. Both vendors are skipping 20nm; this planar process fails to offer superior performance compared with 28nm HKMG, and the theoretical cost reduction from geometric scaling is offset by cost increases from double patterning.
> 
> Rather than invest in a less attractive technology, AMD and Nvidia opted to wait for the 16/14nm FinFET node, which actually reuses the 20nm metal interconnect, and is expected in 2016. In retrospect, staying at 28nm was a wise move; we are unaware of any 20nm customers shipping large chips from TSMC (beyond low-volume Xilinx FPGAs), strongly suggesting the process has yield problems. To make up for TSMC's inability to deliver an attractive process technology, AMD is leaning on a new memory technology, whereas Nvidia chose to deliver a new architecture.
> 
> Fiji uses a new and highly efficient memory interface: High Bandwidth Memory (HBM). At a system level, the power and area advantages of HBM compared with GDDR5 enable AMD to increase the compute power of the shader array, thereby boosting performance and performance per watt significantly. The first Fiji-based products come with liquid cooling, which further improves performance per watt.


Source: http://www.linleygroup.com/mpr/article.php?id=11441

I believe he knew exactly what was coming with DirectX 12. My question is... where were the tech journalists when they reviewed Fiji? I haven't seen a mention, from any publication, of the impact that DX12 would have on Fiji. What we got instead were the usual benchmarks (many reviewers using GameWorks titles) with no mention of what was to come 6 months down the line. David Kanter knew, and his publication was available to any tech reviewer who pays for a subscription. I think that, going forward, it would be wise to consider such information.

My 2 cents.


----------



## Forceman

Quote:


> Originally Posted by *Mahigan*
> 
> My work consisted of analyzing consumer complaints, as it pertained to engineering issues which could potentially lead to lawsuits, etc and attempting to resolve those issues outside of the courts.


So you were Edward Norton in Fight Club?
Quote:


> Originally Posted by *Mahigan*
> 
> where were the Tech Journalists when they reviewed Fiji? I haven't seen a mention, from any publication, as to the impact that DX12 would have on Fiji. What we got, instead, were the usual benchmarks (any reviewers using GameWorks titles) with no mention of what was to come 6 months down the line.


What are they supposed to say in a review? "AMD says the card will be faster at some point in the future"? They can only review what they can test; anything more would be speculation (or marketing). Until we get DX12 games, we don't know how current cards are going to perform in them, especially considering no one knows which feature sets will be supported by which games.


----------



## GorillaSceptre

I've had the impression that Kanter holds his tongue a bit when discussing AMD; on the Tech Report it always seems like he wants to say certain things but decides against it (or just gets interrupted).

I guess he's also just waiting to see what happens, but it does seem like he thinks there's more to GCN as far as DX12 goes.


----------



## Mahigan

Quote:


> Originally Posted by *Forceman*
> 
> So you were Edward Norton from Fight Club?
> What are they supposed to say in a review? "AMD says the card will be faster at some point in the future"? They can only review what they can test, anything more would be speculation. Until we get DX12 games we don't know how current cards are going to perform in them. Especially considering no one knows which feature sets will be supported by which games.


I'm not familiar with Fight Club. I know, I know, I should watch it... everyone tells me LOL

I think that a comparative architectural analysis coupled with what we already knew of DX12 (highly threaded, closer to the metal, etc.) would have been sufficient to substantiate some forward-looking statements. This is what David Kanter did in his analysis.

One way of doing this, without getting too speculative, is to:

1. Analyze industry trends to see in which direction the industry is headed (talking to developers, API folks, etc.)
2. Analyze GPU architectures (talking to hardware manufacturers or dissecting white papers)
3. Match the pros and cons of an architecture to the direction of the software industry
4. Add statements, at the end of an analysis/review, giving consumers an idea of what to expect in the near future

I mean, the benchmarks and all are great at informing customers as to present-day performance. An analysis, on top of that, could shed light on the longer-term viability of a product. You do both. Of course, that might mean a far more in-depth review of a product, but it could make for some interesting reading.


----------



## Forceman

Quote:


> Originally Posted by *Mahigan*
> 
> I'm not familiar with Fight Club. I know, I know, I should watch it... everyone tells me LOL
> 
> I think that a comparative architectural analysis coupled with what we already knew of DX12 (highly threaded, closer to the metal, etc.) would have been sufficient to substantiate some forward-looking statements. This is what David Kanter did in his analysis.
> 
> One way of doing this, without getting too speculative, is to:
> 
> 1. Analyze industry trends to see in which direction the industry is headed (talking to developers, API folks, etc.)
> 2. Analyze GPU architectures (talking to hardware manufacturers or dissecting white papers)
> 3. Match the pros and cons of an architecture to the direction of the software industry
> 4. Add statements, at the end of an analysis/review, giving consumers an idea of what to expect in the near future
> 
> I mean, the benchmarks and all are great at informing customers as to present-day performance. An analysis, on top of that, could shed light on the longer-term viability of a product. You do both. Of course, that might mean a far more in-depth review of a product, but it could make for some interesting reading.


His job was determining whether it was better for car companies to issue recalls, or just pay the resulting lawsuits.

Those are great topics of discussion, but I think they would be better in a feature piece, not in a review.


----------



## Anna Torrent

Quote:


> Originally Posted by *Mahigan*
> 
> A refurbished Dell? Might be a good idea since I still have contacts over at Dell


I don't know if they still sell those - almost two years old machine now. They've had these with a nice 1080p display, but I guess you can grab the 768p model and replace the screen, if you really want it


----------



## Mahigan

PontiacGTX shared these links with me: http://gamingbolt.com/dx12-will-allow-game-code-to-run-faster-on-xbox-onepc-better-post-processing-effects
Quote:


> "Direct X 12 is not something that allows you to play around with new shiny effects during the development. It's more of a big, structural development that gives the programmers a lot more opportunity by providing them easier access directly to the GPU," Producer, Zoltán Pozsonyi explained to GamingBolt. "This means that they'll have a lot more wiggle room for performance optimization and making the code run faster; if you're able to squeeze out higher performance out of the GPU, that can be translated into framerate or more beautiful content in the game. *DX12 supports asynchronous compute shaders, which for example allows you to use more and better quality special effects and post process stuff, a lot faster (screen space ambient occlusion, screen space reflection, better quality shadow mapping, translucency, tone mapping).*"
> 
> Warhammer 40,000: Inquisitor Martyr will be heading to the PS4 and Xbox One alongside PC and Mac when it launches next year.


http://www.tweaktown.com/news/47507/dice-teases-frostbite-engine-already-supports-directx-12/index.html
Quote:


> The news comes directly from Technical Director Stefan Boberg on Twitter, where '@CentroXer' asked "when is DX12 going to be part of frosbite?" with Boberg replying that "it already is, no word on which game will be first though". With DICE working very closely with AMD on Mantle, which was used in Battlefield 4, it's only a few steps away from Battlefield 5 being announced on the latest Frostbite engine with DirectX 12 capabilities. Then we have to think about the amount of Frostbite-powered games in the next year or so, with Mass Effect 4 at the end of 2016, Need for Speed on November 3, and Star Wars Battlefront two weeks later on November 17. Let's not forget Mirror's Edge Catalyst on February 23, 2016. So we should expect a next-gen Battlefield game to be announced early next year, with a release date later next year, hopefully.


----------



## dogen1

Quote:


> Originally Posted by *Mahigan*
> 
> PontiacGTX shared these links with me: http://gamingbolt.com/dx12-will-allow-game-code-to-run-faster-on-xbox-onepc-better-post-processing-effects
> http://www.tweaktown.com/news/47507/dice-teases-frostbite-engine-already-supports-directx-12/index.html


I'd be very surprised if battlefront shipped without dx12 support.


----------



## Forceman

Quote:


> Originally Posted by *dogen1*
> 
> I'd be very surprised if battlefront shipped without dx12 support.


If it did support DX12 (at launch) I would think they'd have already said that, rather than keeping it secret. I'd say that points more toward it not having DX12 at launch - although I'm basing that on how they touted Mantle pre-launch. Why talk up Mantle support for BF4, and not talk up DX12 support for Battlefront?


----------



## dogen1

Quote:


> Originally Posted by *Forceman*
> 
> If it did support DX12 (at launch) I would think they'd have already said that, rather than keeping it secret. I'd say that points more toward it not having DX12 at launch - although I'm basing that on how they touted Mantle pre-launch. Why talk up Mantle support for BF4, and not talk up DX12 support for Battlefront?


Well, the engine supports it. Seems like it would be a pointless limitation to not release with either dx12, or at least mantle.


----------



## flopper

Quote:


> Originally Posted by *Forceman*
> 
> If it did support DX12 (at launch) I would think they'd have already said that, rather than keeping it secret. I'd say that points more toward it not having DX12 at launch - although I'm basing that on how they touted Mantle pre-launch. Why talk up Mantle support for BF4, and not talk up DX12 support for Battlefront?


Johan said it was a stretch for Battlefront, so it won't have DX12.
If it did, you'd see them mention it already.
They don't, but they might patch it in later.
BC3 and BF5 will be DX12.


----------



## Paul17041993

Quote:


> Originally Posted by *dogen1*
> 
> Well, the engine supports it. Seems like it would be a pointless limitation to not release with either dx12, or at least mantle.


However, Battlefront was in development before DX12 was close to ready. It's not simply a matter of pulling the newest branch of the engine from the repo; you have to ensure that everything works correctly with the introduced changes, which is rarely the case for anything.

That being said, it shouldn't be too hard for the devs to have at least basic DX12 support in the game either before or after release.


----------



## Noshuru

Any updates on this yet?


----------



## Mahigan

Nothing new thus far. Just a lot of waiting.


----------



## Anna Torrent

Can we do other stuff in the meanwhile?


----------



## GorillaSceptre

http://wccftech.com/gears-war-ultimate-unlocked-frame-rate-devs-explain-dx12-async-compute/

Gears is using a bit of async, and they're looking to use it for more features too.

Also, this goes against the assumption that the Unreal Engine is more suited to Nvidia and that Unreal games therefore won't bother using async.

As another bonus, it looks like most of the games that MS is associated with will be using it.


----------



## Noufel

Quote:


> Originally Posted by *GorillaSceptre*
> 
> http://wccftech.com/gears-war-ultimate-unlocked-frame-rate-devs-explain-dx12-async-compute/
> 
> Gears is using a bit of Async , they're looking to use it for more features too.
> 
> Also goes against some assumptions that the Unreal Engine is more suited for Nvidia, and therefore Unreal games won't bother using Async.
> 
> As another bonus, it looks like most of the games that MS is associated with will be using it.


If it only uses a bit of async compute, it won't be a problem for the software scheduler that nVidia will use in their DX12 driver... or am I wrong?


----------



## Paul17041993

Quote:


> Originally Posted by *Noufel*
> 
> If it only uses a bit of async compute, it won't be a problem for the software scheduler that nVidia will use in their DX12 driver... or am I wrong?


Well there's no way to tell until it happens really...


----------



## semitope

nVidia will likely be better off turning it off anyway. Depends on how it's done, maybe.


----------



## Mahigan

Tech journalists quoting PR...

Example: http://wccftech.com/nvidia-recommends-geforce-gtx-980-ti-gtx-980-1080p-vr-gaming-highend-market-share-rises-apac-markets/
Quote:


> Regarding Preemption Context Switching, NVIDIA uses a feature called TimeWrap in the Oculus SDK to render an image and then perform a post process on that rendered image to adjust it for changes in head motion during rendering. With Async TimeWrap, NVIDIA can make timewarp not have to wait for the app to finish rendering. Timewarp should behave like a separate process running on the GPU, which wakes up and does its thing right before vsync, every vsync, regardless of whether the app has finished rendering or not. NVIDIA is enabling VR platform vendors to implement Async timewrap by exposing support in our driver for a feature called a high-priority graphics context. It's a separate graphics context that takes over the entire GPU whenever work is submitted to it - pre-empting any other rendering that may already be taking place. A recent report has suggested that Preemption Context Switching is by far best on AMD GPUs, good on Intel Gen9 chips and potentialy catastrophic on NVIDIA's GPUs. It could be possible that AMD hasn't yet enable support through the drivers since recent AOTS benchmark also showed NVIDIA having lower gains in DX12 compared to AMD GPUs. *NVIDIA recently commented on the news and said that they'll be fully implementing Async shaders support through drivers. Once NVIDIA features Async support through drivers, it will be possible to see better DX12 performance and Preemption Context Switching on NVIDIA hardware.*


No. The performance-oriented preemption is not a driver-related issue; it is a hardware-related issue. Watch nVIDIA's GDC 2015 presentation, and I quote nVIDIA's representative: "On future GPUs, we're working to enable finer-grained preemption, but that's still a long way off". nVIDIA use coarse-grained preemption, which can cause delays of upwards of 1000ms in worst cases. It is therefore not a performance-oriented feature on Maxwell 2.

As for Async Compute, it can only take effect at draw call boundaries on nVIDIA Maxwell 2 products. You're therefore not likely to see any performance benefit from its usage (it remains to be seen, but I doubt it). See WCCF's own Async Timewarp image, "previous image re-warped": meaning you cannot asynchronously render a new image (frame); you have to re-use the image (frame) that was previously rendered.

*Source*:
The nVIDIA GDC 2015 PDF is available here (page 23): http://www.reedbeta.com/talks/VR_Direct_GDC_2015.pdf
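The coarse- vs fine-grained distinction above can be illustrated with a toy latency model. The draw length and save/restore cost below are invented numbers, not figures from the GDC slides:

```python
# Toy model (invented numbers): how long a high-priority task such as
# timewarp waits once a preemption request arrives mid-draw.

def wait_coarse(draw_length, elapsed):
    """Coarse-grained: the in-flight draw must run to completion
    before the GPU can switch to the high-priority context."""
    return draw_length - elapsed

def wait_fine(save_restore_cost):
    """Fine-grained: stop almost anywhere; pay only the state save."""
    return save_restore_cost

# A long draw preempted early leaves a large worst-case wait:
print(wait_coarse(1000, 10))  # 990 units until the warp can start
print(wait_fine(5))           # 5 units, regardless of draw length
```

This is why coarse-grained preemption is fine for correctness but poor as a performance feature: the worst-case wait scales with the longest draw call in flight, not with the cost of the switch itself.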




----------



## Paul17041993

Quote:


> Originally Posted by *Mahigan*
> 
> Quote:
> 
> 
> 
> A recent report has suggested that Preemption Context Switching is by far best on AMD GPUs, good on Intel Gen9 chips and potentialy catastrophic on NVIDIA's GPUs. It could be possible that *AMD hasn't yet enable support through the drivers* since recent AOTS benchmark also showed NVIDIA having lower gains in DX12 compared to AMD GPUs.


----------



## Anna Torrent

Quote:


> Originally Posted by *Mahigan*
> 
> Tech journalists quoting PR...
> 
> 
> 
> 
> 
> 
> 
> 
> 
> Example: http://wccftech.com/nvidia-recommends-geforce-gtx-980-ti-gtx-980-1080p-vr-gaming-highend-market-share-rises-apac-markets/
> No. The performance oriented preemption is not a driver related issue. It is a hardware related issue. Watch GDC 2015's nVIDIA presentation.. and I quote nVIDIAs representative "On future GPUs, we're working to enable Finer-grained preemption, but that's still a long way off". nVIDIA use coarse grained preemption which can cause delays of upwards of 1000ms. It is therefore not a performance oriented feature on Maxwell 2.
> 
> As for Async Compute, it can only be used at the end of Draw Call Boundaries on nVIDIA Maxwell 2 products. You're therefore not likely going to see any performance benefits from its usage (remains to be seen but I doubt it). See WCCF's own Async Time Warp image "previous image re-warped" meaning you cannot asynchronously render a new image (frame), you have to re-use the same image (frame) that was previously rendered.
> 
> *Source*:
> The nVIDIA GDC 2015 PDF is available here (page 23): http://www.reedbeta.com/talks/VR_Direct_GDC_2015.pdf
> 
> 
> Spoiler: Warning: Spoiler!




Many sites are really tabloids, including most of the biggest and most popular of them. Rarely do they go in-depth, and even then, only a fraction of it is good journalism. It's not that people don't make mistakes, but you can see that many of them don't even bother correcting past information; they just put out the same article with a different headline.

And another thing: all the praise for Google aside, if you are going to investigate stuff and write only after full understanding is achieved, slowly, you won't get a lot of juice out of Google. The bigger sites will have the spotlight almost always, good info or not.

Anyway, I think we all agree you should do that kind of journalism on your new site


----------



## GorillaSceptre

https://www.youtube.com/watch?v=v_eRwxqhAGo

From The Tech Report.


----------



## Mahigan

Quote:


> Originally Posted by *GorillaSceptre*
> 
> https://www.youtube.com/watch?v=v_eRwxqhAGo
> 
> From The Tech Report.


Yep!

Adding this to the first page.


----------



## fellix

https://forum.beyond3d.com/posts/1872750/

That's probably the most comprehensive summary of the situation so far, from a technical PoV.

In short, GCN implements hardware-mapped async scheduling with independent queues, while Maxwell and Kepler rely on driver-managed scheduling from a single pool of tasks.
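A crude way to picture that difference (a deliberate oversimplification, under the assumption that independent queues can drain fully in parallel while a merged pool drains serially):

```python
# Sketch of the two submission models: independent hardware queues
# drain on their own timelines; a single merged pool drains in order.
# Task durations are arbitrary illustrative units.

graphics_q = [300, 300]      # graphics tasks
compute_q = [100, 100, 100]  # compute tasks

# Independent queues (GCN-style): total time is the longest queue.
independent_makespan = max(sum(graphics_q), sum(compute_q))

# One merged pool (driver-managed): total time is everything, serially.
merged_makespan = sum(graphics_q) + sum(compute_q)

print(independent_makespan)  # 600
print(merged_makespan)       # 900
```

Real drivers interleave work far more cleverly than this, but the sketch shows why hardware queues are the more forgiving model: the compute work costs nothing extra as long as the graphics queue is the longer one.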


----------



## sblantipodi

Great post; I hope nVidia will not trail AMD too much when DX12 arrives.


----------



## GorillaSceptre

http://www.anandtech.com/show/9659/fable-legends-directx-12-benchmark-analysis

Why would the Fury X get higher frame rates with an i3 than with an i7? Doesn't seem right..

But then again the Unreal engine has always done funny stuff on AMD hardware.

Edit:

https://www.extremetech.com/gaming/214834-fable-legends-amd-and-nvidia-go-head-to-head-in-latest-directx-12-benchmark

They show the Fury X "winning".


----------



## Noufel

Quote:


> Originally Posted by *GorillaSceptre*
> 
> http://www.anandtech.com/show/9659/fable-legends-directx-12-benchmark-analysis
> 
> Why would the Fury X get higher frames with an i3 compared to an i7? Doesn't seem right..
> 
> But then again the Unreal engine has always done funny stuff on AMD hardware.
> 
> Edit:
> 
> https://www.extremetech.com/gaming/214834-fable-legends-amd-and-nvidia-go-head-to-head-in-latest-directx-12-benchmark
> 
> They show the Fury X "winning".


They are using AMD's results for their 390X comparison, and the CPU-scaling results with a Fury X are weird


----------



## Mahigan

ExtremeTech were suffering from a weird power issue: the CPU was downclocked to 1.7GHz during their first run. They re-did their tests.

What we did see was that with a weaker CPU, NVIDIA's Maxwell 2 did showcase a higher degree of CPU overhead. This was mentioned, theoretically, in the first post of this thread.
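That CPU-overhead effect can be sketched with a back-of-envelope frame-time model (all numbers invented for illustration):

```python
# A frame is gated by whichever side, CPU or GPU, finishes last.
# Times are in milliseconds and purely illustrative.

def frame_time_ms(cpu_work, driver_overhead, gpu_time):
    """CPU work plus driver overhead races the GPU; the loser sets
    the frame time."""
    return max(cpu_work + driver_overhead, gpu_time)

# Fast CPU: extra driver overhead hides behind the GPU entirely.
print(frame_time_ms(6, 3, 16))   # 16 ms, still GPU-bound

# Weak CPU: the same overhead now sets the frame time.
print(frame_time_ms(14, 3, 16))  # 17 ms, CPU-bound
```

The point is that identical driver overhead is invisible on a fast CPU and becomes the bottleneck on a slow one, which is consistent with what the i3 vs i7 scaling runs showed.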


----------



## Nehabje

I don't really like that... CPU overhead, especially on my old system, isn't really a pleasure. I wonder how bad it will be. Is there any sufficient info on that yet?

Other than that, this thread has been great


----------



## GorillaSceptre

Interview with Chris Roberts from Gamers Nexus, pretty cool channel, they deserve more subs I think.

https://www.youtube.com/watch?v=XD9_L5o4mhQ

CR did use the words "Massively parallel" in regards to GPU architecture and rendering.

He also went on to say " We're working closely with AMD, and it's one of the things that's a big deal to them (DX12), it's also on the compute side where there's a whole bunch of stuff we're starting to lay off to compute to allow us to do some pretty cool stuff."

So I guess we can add Star Citizen to the AMD-sponsored list. This game really looks like it's going to push the boundaries on all fronts; I might even go back to it.


----------



## mtcn77

GG stream.


----------



## Mahigan

Quote:


> Originally Posted by *GorillaSceptre*
> 
> Interview with Chris Roberts from Gamers Nexus, pretty cool channel, they deserve more subs i think.
> 
> 
> 
> 
> 
> 
> 
> 
> 
> https://www.youtube.com/watch?v=XD9_L5o4mhQ
> 
> CR did use the words "Massively parallel" in regards to GPU architecture and rendering.
> 
> He also went on to say " We're working closely with AMD, and it's one of the things that's a big deal to them (DX12), it's also on the compute side where there's a whole bunch of stuff we're starting to lay off to compute to allow us to do some pretty cool stuff."
> 
> So i guess we can add Star Citizen to the AMD sponsored list.. This game really looks like it's going to push the boundaries on all fronts, i might even go back it.


Apparently he's not even texturing characters in the traditional way; Star Citizen will utilize shaders which represent "materials". This is what I meant about the Fury X becoming much more powerful in the long run: that 8+ TFLOPS of compute performance will really come in handy. Without all that texturing, you also save on VRAM, as I mentioned before.

Basically... he's confirming everything I said in this regard.

He seems to want to really push the boundaries of what the CryEngine can do. He's going where no one has gone before, in terms of trying new things and tricks with the compute pipeline.
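Rough arithmetic on the VRAM side of that trade-off; the texture sizes below are illustrative assumptions, not Star Citizen figures:

```python
# VRAM cost of conventional texturing: the memory that procedural
# "materials" can avoid, at the price of extra shader ALU work.

def texture_bytes(width, height, bytes_per_texel, mip_overhead=4 / 3):
    """Size of one texture map including a full mip chain (~4/3x)."""
    return int(width * height * bytes_per_texel * mip_overhead)

# One uncompressed 4096x4096 RGBA map with mips:
per_map = texture_bytes(4096, 4096, 4)
print(per_map // 2**20, "MiB")      # 85 MiB

# A character carrying, say, 5 such maps (albedo, normal, etc.):
print(5 * per_map // 2**20, "MiB")  # 426 MiB
```

Compressed formats shrink these numbers considerably, but the point stands: material parameters evaluated in a shader occupy almost no memory, shifting the cost onto the compute throughput the Fury X has in abundance.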


----------



## GorillaSceptre

Quote:


> Originally Posted by *Mahigan*
> 
> Apparently he's not even texturing characters in the traditional way; Star Citizen will utilize shaders which represent "materials". This is what I meant about the Fury X becoming much more powerful in the long run: that 8+ TFLOPS of compute performance will really come in handy. Without all that texturing, you also save on VRAM, as I mentioned before.
> 
> Basically... he's confirming everything I said in this regard.
> 
> He seems to want to really push the boundaries of what the CryEngine can do. He's going where no one has gone before, in terms of trying new things and tricks with the compute pipeline.


Well, this thread you've made is one of the most detailed and thorough I've seen. It's gone from you being a supposed rabid fanboy to devs confirming what you've been saying. Pretty funny.

I'm still annoyed that the big sites haven't gone in depth with it, not because I want to say "haha Nvidia," but because there is finally some competition and impressive things happening on PC again.


----------



## p4inkill3r

Via the *Red Team Insider*:
Quote:


> Have questions about DirectX® 12 and the blazingly fast performance, higher frames per second and reduced latency you can get with AMD GCN powered hardware? We have you covered! Join us on Thursday, October 8th from 2 - 4PM CT, for our DirectX® 12 Q&A. Technical Marketing Manager Robert Hallock will be live on the Red Team forums answering your questions about asynchronous shading, multi-threaded command buffers, Ashes of the Singularity™ and Fable® Legends benchmarks and more!


----------



## Forceman

Quote:


> Originally Posted by *p4inkill3r*
> 
> Via the *Red Team Insider*:


So in other words, expect a bunch of new "AMD is the future" and "Nvidia is dead" hysteria starting on the 9th?


----------



## Robenger

Quote:


> Originally Posted by *Forceman*
> 
> So in other words, expect a bunch of new "AMD is the future" and "Nvidia is dead" hysteria starting on the 9th?


Way to be a Debbie Downer


----------



## p4inkill3r

Bolded and emphasized for the enjoyment of all.


----------



## mtcn77

Let them try.









Quote:


> Originally Posted by *Forceman*
> 
> So in other words, expect a bunch of new "AMD is the future" and "Nvidia is dead" hysteria starting on the 9th?


----------



## Robenger

Quote:


> Originally Posted by *mtcn77*
> 
> Let them try.


lol I have never seen that graph before, is it true?


----------



## Blameless

Quote:


> Originally Posted by *Robenger*
> 
> lol I have never seen that graph before, is it true?


Strictly speaking, yes, but there is more to it:

- DX12 includes feature levels ranging from 11_0 to 12_1 and three resource binding tiers.

- ALL of NVIDIA's DX11 GPUs going back to Fermi support at least DX12 at the 11_0 feature level and resource binding tier 1. Kepler supports feature level 11_1 and resource binding tier 2. Maxwell 1 supports 12_0, also with binding tier 2. Maxwell 2 supports 12_1 with binding tier 2.

- AMD doesn't support any DX12 feature levels until GCN. GCN "1.0" (Tahiti) supports feature level 11_1. GCN "1.1" and "1.2" support 12_0. All GCN cards of any generation support resource binding tier 3.

Very few DX12 titles should require anything over feature level 11_0 or 11_1 to run and probably nothing will mandate 12_1.

NVIDIA has the edge in the sense that old GeForce 400 and 500 series cards can run DX12 games in a DX12 render mode...how relevant this may be is another question. No AMD GPUs prior to the 7000 series will run DX12.
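The matrix above can be condensed into a small lookup table. Here's a conceptual sketch in Python (the values come straight from this post; `can_run` is a hypothetical helper for illustration — real code would query `ID3D12Device::CheckFeatureSupport` instead):

```python
# Feature level and resource binding tier per GPU architecture,
# as summarized in the post above. Note that D3D12 feature levels
# are distinct from the "DX12" marketing name.
SUPPORT = {
    "Fermi":     {"feature_level": "11_0", "binding_tier": 1},
    "Kepler":    {"feature_level": "11_1", "binding_tier": 2},
    "Maxwell 1": {"feature_level": "12_0", "binding_tier": 2},
    "Maxwell 2": {"feature_level": "12_1", "binding_tier": 2},
    "GCN 1.0":   {"feature_level": "11_1", "binding_tier": 3},
    "GCN 1.1":   {"feature_level": "12_0", "binding_tier": 3},
    "GCN 1.2":   {"feature_level": "12_0", "binding_tier": 3},
}

def can_run(arch, required_level="11_0"):
    """True if the architecture meets the feature level a title requires.
    Levels compare lexicographically here because they all share the
    "1x_y" single-digit format."""
    return SUPPORT[arch]["feature_level"] >= required_level

# Almost every early DX12 title targets 11_0, so even Fermi qualifies:
print(can_run("Fermi", "11_0"))    # True
print(can_run("GCN 1.0", "12_1"))  # False
```

Which is the point made above: feature-level support says a card *can* run a DX12 render path, not how well it performs on one.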


----------



## mtcn77

Quote:


> Originally Posted by *Robenger*
> 
> lol I have never seen that graph before, is it true?


Yeah, fantastic journalism by Steve Burke, always in the know. [Source]
Don't worry - _the green team is never gonna give you up for Directx 12_.


----------



## Robenger

Quote:


> Originally Posted by *mtcn77*
> 
> Yeah, fantastic journalism by Steve Burke, always in the know. [Source]
> Don't worry - _the green team is never gonna give you up for Directx 12_.


I really can't tell if you're being sarcastic, because as far as I have seen, more AMD cards support DX12 than Nvidia at this point.


----------



## mtcn77

Quote:


> Originally Posted by *Robenger*
> 
> I really can't tell if you're being sarcastic, because as far as I have seen, more AMD cards support DX12 than Nvidia at this point.


I am.
_Though, the green team is never gonna let you down_...
You can click the link, I didn't rickroll it.


----------



## Robenger

Quote:


> Originally Posted by *mtcn77*
> 
> I am.
> _Though, the green team is never gonna let you down_...
> You can click the link, I didn't rick'roll it.


Praise the Lord.


----------



## Mahigan

Quote:


> Originally Posted by *p4inkill3r*
> 
> Via the *Red Team Insider*:


LOL









Now that ought to be interesting...


----------



## semitope

I find a lot of AMDs talks are nice. Lots of useful information. I expect good information there.


----------



## Robenger

Quote:


> Originally Posted by *Mahigan*
> 
> LOL
> 
> 
> 
> 
> 
> 
> 
> 
> 
> Now that ought to be interesting...


You should call in!


----------



## Mahigan

Fable Legends,

Maxwell 2 is still not doing async compute. The benchmark spends only 19% of its time in the compute queue and uses only a single ACE.

http://wccftech.com/asynchronous-compute-investigated-in-fable-legends-dx12-benchmark/?utm_campaign=website&utm_source=sendgrid.com&utm_medium=email


----------



## Noufel

Quote:


> Originally Posted by *Mahigan*
> 
> Fable Legends,
> 
> Maxwell 2 is still not doing async compute. The benchmark spends only 19% of its time in the compute queue and uses only a single ACE.
> 
> http://wccftech.com/asynchronous-compute-investigated-in-fable-legends-dx12-benchmark/?utm_campaign=website&utm_source=sendgrid.com&utm_medium=email


That sounds bad, especially if a Microsoft game only uses one ACE and very little AC. It's like you've said, Mahigan: AC won't be fully used for now.


----------



## Mahigan

I'm interested to see what Deus Ex will offer at this point.


----------



## Noufel

I'm sure that Deus Ex will be the first game to really take advantage of AC.


----------



## GorillaSceptre

Media Molecule: "Massively untapped potential" in compute shaders.

https://www.youtube.com/watch?v=u9KNtnCZDMI

Anyone else annoyed that consoles are doing this stuff? While the "master race" are busy bickering over tessellation..

Around the 20 min mark he talks about stalls and doing things in parallel.

27:30 is the untapped potential of compute shaders.


----------



## Noufel

Quote:


> Originally Posted by *GorillaSceptre*
> 
> Media Molecule: "Massively untapped potential" in compute shaders.
> 
> https://www.youtube.com/watch?v=u9KNtnCZDMI
> 
> Anyone else annoyed that consoles are doing this stuff? While the "master race" are busy bickering over tessellation..


If the gaming industry won't use heavy AC in upcoming games, I don't see why Microsoft even bothered with DX12. I hope that won't be the case.


----------



## semitope

During the session AMD is having soon, I hope they answer how easy it is to use asynchronous compute: what would make it easier (e.g. a game already having tons of compute shaders) and what would make it harder.


----------



## Xuper

Mahigan, up until now, is your post still valid?

Quote:


> The Asynchronous Warp Schedulers are in the hardware. Each SMM (which is a shader engine in GCN terms) holds four AWSs. Unlike GCN, the scheduling aspect is handled in software for Maxwell 2. In the driver there's a Grid Management Queue which holds pending tasks and assigns the pending tasks to another piece of software which is the work distributor. The work distributor then assigns the tasks to available Asynchronous Warp Schedulers. It's quite a few different "parts" working together. A software and a hardware component if you will.
> 
> With GCN the developer sends work to a particular queue (Graphic/Compute/Copy) and the driver just sends it to the Asynchronous Compute Engine (for Async compute) or Graphic Command Processor (Graphic tasks but can also handle compute), DMA Engines (Copy). The queues, for pending Async work, are held within the ACEs (8 deep each)... and ACEs handle assigning Async tasks to available compute units.
> 
> Simplified...
> 
> Maxwell 2: Queues in Software, work distributor in software (context switching), Asynchronous Warps in hardware, DMA Engines in hardware, CUDA cores in hardware.
> GCN: Queues/Work distributor/Asynchronous Compute engines (ACEs/Graphic Command Processor) in hardware, Copy (DMA Engines) in hardware, CUs in hardware.


1- According to WCCF's Fable benchmark article, some claim that using all 8 ACEs would kill AMD's performance because there would be no power left for rendering?

I mean, how could it kill performance when the other 7 ACEs are idle?

Quote:


> It would kill the performance of AMD if they used all 8 ACEs... You would have no power left for rendering. The ACEs themselves don't handle the compute, they just manage it. Every queue you run is a reduction of power on the GPU, as it still uses the GPU cores. So, if every queue was used, each one would only have 1/64th the power of the GPU at its disposal.


2- Here's a screenshot from a GeForce 980 Ti under Batman: Arkham Knight: https://a.disquscdn.com/uploads/mediaembed/images/2611/2051/original.jpg

Does that mean scheduling is done in hardware? I mean, is Maxwell async at the hardware level?
Or is it like this: https://forum.beyond3d.com/posts/1870075/

Edit: this is worth reading:

https://forum.beyond3d.com/posts/1870144/


----------



## Mahigan

Quote:


> Originally Posted by *Xuper*
> 
> Mahigan , up until now , IS your post valid ?
> 
> 1- according to WCCF about fable bench , some claims that if they use all 8 ACE , this would kill AMD performance ? because there is no power left for rendering?
> I mean how could kill Performance when other 7 ACE are idle?
> 2- Here Screenshot from Geforce 980Ti under Batman arkham knight : https://a.disquscdn.com/uploads/mediaembed/images/2611/2051/original.jpg
> does that mean scheduling is done in hardware? I mean Maxwell Async at hardware level ?
> 
> or like this : https://forum.beyond3d.com/posts/1870075/
> 
> Edit: it's worth to read
> https://forum.beyond3d.com/posts/1870144/


ACEs have the capacity to start, stop, pause and transfer content into LDS (Local Data Share Cache). ACEs utilize un-used or under used Compute resources, when available, and can, within a single cycle, pre-empt workloads for high priority jobs (finer-grained preemption).

It wouldn't kill GCN to use more than one ACE; that's why there are 8 ACEs in most GCN products. A developer would have to balance the use of Asynchronous Compute with the rest of the rendering pipeline. It is usually preferred to use Asynchronous Compute across identical contexts (Compute + Compute) rather than mixed ones (Graphics + Compute), as the latter costs an added latency of 1 cycle on GCN. The folks claiming it would kill GCN's performance don't know what they're talking about (1 cycle is not killing performance). It would kill Maxwell 2's performance, because there a context switch can incur latency of upwards of 1,000 cycles and the architecture employs coarse-grained preemption.
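To put the scale of that difference into numbers, here is a toy cost model (purely illustrative Python; the 1-cycle and ~1,000-cycle figures are the ones quoted above, while the workload size and switch count are made-up values for the sake of the example):

```python
# Toy comparison of preemption overhead, using the figures from the post:
# GCN fine-grained preemption is quoted at ~1 cycle per context switch,
# while a Maxwell 2 style coarse-grained switch is quoted at upwards of
# ~1000 cycles.
GCN_SWITCH_CYCLES = 1
MAXWELL2_SWITCH_CYCLES = 1000

def total_cycles(work_cycles, n_switches, switch_cost):
    """Useful work plus the overhead of n context switches."""
    return work_cycles + n_switches * switch_cost

work = 100_000   # hypothetical per-frame workload, in cycles
switches = 50    # hypothetical graphics<->compute transitions per frame

gcn = total_cycles(work, switches, GCN_SWITCH_CYCLES)
mx2 = total_cycles(work, switches, MAXWELL2_SWITCH_CYCLES)
print(f"GCN overhead:       {gcn / work - 1:.3%}")   # 0.050%
print(f"Maxwell 2 overhead: {mx2 / work - 1:.3%}")   # 50.000%
```

With these assumed numbers, the same 50 switches that are noise on GCN cost Maxwell 2 half a frame's worth of cycles, which is why switch frequency matters so much more there.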

As for the Arkham Knight screenshot: the scheduling, or execution of workloads, is done in software, while the asynchronous workloads themselves are processed in hardware on the Maxwell 2 architecture (so far Maxwell 2 throws everything into a single queue, the Graphics queue). You're limited in functionality due to the aforementioned caveats of the Maxwell 2 architecture as it pertains to Asynchronous Compute (see the first post of this thread).


----------



## Xuper

Alright, thanks for the reply. I have a bigger question: what's the definition of Asynchronous Compute?

Like I mentioned (Link) :

Quote:


> A better description for what people on this thread are interested in is "concurrent graphics and compute". Asynchronous compute for GPUs is as old as CUDA. But the ability to run graphics workloads concurrently with compute workloads is what this thread is really about, and is a relatively new thing.


Someone posted this (running a bench with PhysX) (Link). Now, if GPUView shows that both the 3D and compute hardware queues are running, this doesn't prove that they're doing Asynchronous Compute, am I right? I was confused between the terms "Asynchronous Compute" and "concurrent graphics and compute".

Edit: I think this might answer my question: Link


----------



## Ext3h

Quote:


> or 1 Graphic and 31 Compute


That part isn't correct. Often quoted, never read correctly.

A single draw call consists of 5 active grids, one for each shader type (vertex, tessellation control, tessellation evaluation, geometry, fragment). GK110 and Maxwell GM20x each have 6 graphics command processors, which each have 5 grids hardwired to them as long as the GPU is in 3D mode.

That "1" in that 1+31 isn't the draw call. It's the other way around. 30 grids for 6 draw calls, 1 grid for compute shaders, and one got bitten by the dogs.

I also have serious doubts about whether the GMU is actually involved in providing the "Compute Engine", because even pre-GK110 hardware (which lacks the GMU completely!) can already switch between "Compute Engine" and "3D Engine" mode.

It's always 1 active CS grid while in 3D mode. In compute mode, all available grids are available for compute kernels.

That "Work distributor" precedes the GK110. Only the "Grid Management Unit" was newly added. So the GMU isn't actually involved with the DX12 path.

The following is just a theory about that "missing" grid slot; don't take it as true yet:
It is there to provide high priority "preemption" for compute grids from the GMU while in 3D mode. I think it is e.g. used for dwm.exe.


----------



## Mahigan

Quote:


> Originally Posted by *Xuper*
> 
> Allright.thanks for reply,I have bigger question: What's definition for Asynchronous Compute?
> 
> Like I mentioned (Link) :
> 
> Someone posted this (running a bench with PhysX) (Link).Now If GPUview shows that Both Hardware Queue 3D & compute are running then, This doesn't prove that they're doing Asynchronous Compute? am I right? I was confused between term "Asynchronous Compute" and "concurrent graphics and compute".


With Arkham Knight, you have two "engines" running concurrently: the Arkham Knight engine as well as the PhysX libraries. It is evident that both would be running concurrently, though executed sequentially, from two separate sources. Asynchronous Compute is a single command, within a batch of commands, which instructs the GPU to execute two workloads at once (like Hyper-Threading). This command would, for example, instruct the GPU to execute a Graphics task and concurrently execute a Compute task. It could just as well instruct the GPU to execute two compute commands concurrently. Two commands from the same source. This requires a context switch (when executing a Graphics and a Compute command in parallel). The commands are sent in parallel to the GPU.

With PhysX, in Arkham Knight, you have the game engine sending instructions to the GPU as well as the PhysX libraries based on two separate commands emanating from two separate sources. This does not require a context switch because the commands are not sent at the same time but are executed at the same time. The commands are sent sequentially to the GPU.


----------



## Xuper

@Ext3h Reply to who ?

Edit: @Mahigan Thanks a lot. Now I get it!









----------



## Mahigan

Quote:


> Originally Posted by *Ext3h*
> 
> That part isn't correct. Often quoted, never read correctly.
> 
> A single draw call consists of 5 active grids, one for each shader type (vertex, tesselation control, tesselation evaluation, geometry, fragment). GK110 and Maxwell GM20x have each 6 graphics command processors which each have 5 grids hardwired to them as long as the GPU is in 3D mode.
> 
> *That "1" in that 1+31 isn't the draw call. It's the other way around. 30 grids for 6 draw calls, 1 grid for compute shaders, and one got bitten by the dogs.*
> 
> I have also serious doubts whether the GMU is actually involved when providing the "Compute Engine", because even pre-GK110 hardware (which is lacking the GMU completely!) can already switch between "Compute Engine" and "3D Engine" mode.
> 
> It's always 1 active CS grid while in 3D mode. In compute mode, all available grids are available for compute kernels.
> 
> That "Work distributor" precedes the GK110. Only the "Grid Management Unit" was newly added. So the GMU isn't actually involved with the DX12 path.
> 
> The following is just a theory about that "missing" grid slot, don't take that for true yet:
> It is there to provide high priority "preemption" for compute grids from the GMU while in 3D mode. I think it is e.g. used for dwm.exe.


That's entirely plausible, though I was relying on Anandtech's information, as well as the information given to members of the press by nVIDIA, when I wrote that into the theory. If this is the case, then the entire graph from Anandtech would be erroneous, and the information presented to the press by nVIDIA would also be erroneous.

This would be questionable no?


----------



## Ext3h

@Xuper Reply to the original post.
Quote:


> Originally Posted by *Mahigan*
> 
> With PhysX, in Arkham Knight, you have the game engine sending instructions to the GPU as well as the PhysX libraries based on two separate commands emanating from two separate sources. This does not require a context switch because the commands are not sent at the same time but are executed at the same time. The commands are sent sequentially to the GPU.


That isn't true. PhysX utilizes HyperQ the same way as any other CUDA application does. It only goes via the 3D queue on pre-GK110 GPUs. So they are in fact sent to the GPU asynchronously.

I wouldn't be so sure about that "unnecessary" context switch either. It is definitely necessary on pre-GK110. And even past GK110, it would require that PhysX only needs a single concurrent grid, assuming that that one missing grid slot is actually for HyperQ.

Only under these premises was PhysX possible without causing a context switch. Well, at least no context switch in the scheduling frontend; there is still another one happening inside the SMM units.


----------



## Mahigan

Quote:


> Originally Posted by *Ext3h*
> 
> @Xuper Reply to the original post.
> That isn't true. PhysX utilizes HyperQ the same way as any other CUDA application does. It only goes via the 3D queue on pre-GK110 GPUs. So they are in fact sent to the GPU asynchronously.
> 
> I wouldn't be so sure about that "unnecessary" context switch either. It is definitely necessary on pre-GK110. And even past GK110, it would require that PhysX is only requiring a single concurrent grid, assuming that that one missing grid slot is actually for HyperQ.
> 
> Only under these premises, PhysX was possible without causing a context switch. Well, at least no context switch in the scheduling frontend. There is still another one happening inside the SMM units.


Hmm... but PhysX is a set of libraries executed separately from the game engine itself, no? (using CUDA or HyperQ) If that is the case, then both the game engine and the PhysX libraries would be sending tasks to the GPU synchronously (though I suppose they could be executed asynchronously in the nVIDIA driver). Would the asynchronicity simply be the result of the GPU receiving workloads from two separate sources, or would it be the result of the Work Distributor arranging and assigning tasks based on a priority level?

As for the context switch, I was referring to Maxwell 2, as this was the GPU the graph was tested on.


----------



## Ext3h

Quote:


> Originally Posted by *Mahigan*
> 
> That's entirely plausible though I was relying on Anandtech's information, as well as the information given to members of the press by nVIDIA, when I wrote that into the theory. If this is the case then the entire graph, from Anandtech, would be erroneous and the information presented to members of the press, by nVIDIA, would also be erroneous.
> 
> This would be questionable no?


Wouldn't be the first time that Nvidia is happily accepting wrong interpretations as good PR.

And I have yet to see a second source for that claim. No other tech review site has backed that information so far. They are all relying on a single source. At least two sites I know tried to get that information confirmed and didn't get an answer.

The +31 part on Maxwell v2 couldn't be replicated experimentally at all.

Scratch that. I think I get what it means. Nvidia wasn't lying per se. Only happily accepting a misunderstanding.

I think they intended to make the hard-wired grid locks held by the GPCs easier to release selectively. This only has an effect on workloads scheduled via the GMU, though. They intended to say that 1 grid slot remains reserved to the GPCs and the other 31 can be rebound lazily. I have no clue how Anandtech managed to mix up "grids" with "queues".

That would have made a lot of sense, if, but only if, the GMU was actually used for compute shaders. I don't think that has even been tested yet.

It is still unknown why the GMU isn't used in DX12. It looks as if they got caught red-handed by a minor API spec, possibly something along the lines of "Compute Engines can have resource barriers".
Quote:


> Originally Posted by *Mahigan*
> 
> Hmm... but PhysX is a set of libraries executed separately from the Game Engine itself no? If that is the case then both the game engine, as well as the PhysX libraries, would be sending tasks to the GPU synchronously. The asynchronicity would simply be the result of the GPU receiving workloads from two separate sources no?


No, the asynchronicity refers to scheduling work ahead, blocked only by fences and nothing else, with the GPU/driver deciding autonomously when to execute it. Nvidia does this part just fine, even though there are some mode transitions involved which can add up quite badly if you provoke them.

Efficient asynchronicity is when concurrency gets involved. This part isn't working, or rather it only works for work committed to different queues. These additional queues are only available to CUDA-based applications, though. Which PhysX is.

Normally, the compute engines should have had independent queues, so the hardware would have been able to use at least draw-call-level preemption. But since everything goes through the 3D queue, and even involves a full pipeline flush, there is no concurrency with pure DX12. And that can cause significant differences between AMD's and Nvidia's performance levels if the engine designer was expecting concurrency.
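The "scheduling work ahead, blocked only by fences" part can be sketched with a CPU-side analogy (plain Python threads standing in for independent hardware queues; this is a conceptual illustration, not GPU code):

```python
import queue
import threading

# Conceptual analogy only: work is "submitted" ahead of time, and
# execution is gated solely by a fence, as described above. Two worker
# threads stand in for two independent hardware queues.
fence = threading.Event()
results = queue.Queue()

def compute_job(name):
    fence.wait()              # blocked only by the fence, nothing else
    results.put(f"{name} done")

# Submit work to both "queues" before the fence has signaled:
workers = [threading.Thread(target=compute_job, args=(n,))
           for n in ("graphics", "compute")]
for w in workers:
    w.start()

fence.set()                   # signal the fence; both jobs now run concurrently
for w in workers:
    w.join()

print(sorted(results.get() for _ in range(2)))
# ['compute done', 'graphics done']
```

The DX12 problem described above maps onto this picture as both jobs being forced through one queue with a flush in between, so the second job cannot start until the first has fully drained.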


----------



## Mahigan

Quote:


> Originally Posted by *Ext3h*
> 
> Normally, the compute engines should have had independent queues, so the hardware would have been able to use at least draw call level preemption. But since everything goes through the 3D queue, and even involves a full pipeline flush, there is no concurrency with pure DX12. *And that can cause significant differences between AMDs and Nvidias performance levels if the engine designer was expecting concurrency.*


Ahhh,

Ashes of the Singularity's Kollock mentioned that he was expecting concurrency, as the nVIDIA driver exposed it, but once he attempted to make use of it... performance was dreadful.

I wonder what this will mean for titles such as Deus Ex, if this pans out to be true. Deus Ex is supposed to make heavy use of Asynchronous Compute relative to other titles. I would imagine that, much like Oxide, the Deus Ex developers will be required to program a separate Vendor ID-specific path for nVIDIA's GPUs.

This also makes me think of other titles on the horizon. I wonder if all of this will incur further delays for Hitman, Mirror's Edge, Tomb Raider, as well as Star Citizen (an ambitious, perhaps too ambitious, title).

I suppose a workaround would be to use CUDA rather than pure DX12.


----------



## Xuper

@Ext3h I highly doubt it; hard to believe without a second source.

Edit: So Nvidia has to implement CUDA inside DX12 for a specific path?


----------



## Ext3h

Quote:


> Originally Posted by *Mahigan*
> 
> Ashes of the Singularity's Kollock mentioned that he was expecting concurrency, as the nVIDIA driver exposed it, but once he attempted to make use of it... performance was dreadful.


Precisely.

He took a long-running, heavily stalling job, converted it into a compute shader, and tried offloading it to a Compute Engine. Instant boost on GCN, as it stopped blocking barriers and signals.

Instant failure on Nvidia as the job was now running in solitary, which caused the stalls to become even more severe.
Quote:


> Originally Posted by *Mahigan*
> 
> I wonder what this will mean for titles such as Deus Ex, if this pans out to be true. Deus Ex is supposed to make heavy use of Asynchronous Compute, relative to other titles, I would imagine that much like Oxide, the developers for Deus Ex will be required to program a separate Vendor ID specific path for nVIDIAs GPUs.
> [...]
> I suppose a workaround would be to use CUDA rather than pure DX12.


Check your PMs









For the public: it's working partially in Fable. They are using a single path for both platforms. It isn't perfect for either platform (AMD especially could have been utilized better), but it is working.

Funny enough, async compute in Fable has different effects on AMD's and Nvidia's hardware. Both result in a speedup, but the reasons for the speedup differ greatly.


----------



## Mahigan

Quote:


> Originally Posted by *Xuper*
> 
> @Ext3h I highly doubt , hard to believe for second source.


Well, it does point to nVIDIA having subpar hardware in several respects. While on the graphics side nVIDIA hardware is superior on several fronts, for mixed-mode or pure compute usage nVIDIA's current architectures are, as Simon Cowell would say, "dreadful" (though pure compute is acceptable, if nowhere near Fiji's parallel compute capabilities).

Of course, I wouldn't go repeating this throughout the interwebs, as mentioning what appears to be factual information would get you relegated to "fanboi" status.

What I find disheartening is that AMD hardware opens us all up to a world of possibilities, of innovation in game engine design, that just isn't possible with nearly 80% of the market share controlled by nVIDIA's architectural designs (I hope Volta rectifies this).

I'm not sure how everyone else feels, but having a cape waving around in the wind isn't what I call "cinematic gaming". Though I do realize that most physics-based approaches could be programmed entirely in a game engine, not needing or requiring PhysX. My only hope is that Microsoft purchasing Havok from Intel will open up the physics front to GCN-accelerated Havok capabilities, which could make their way onto several PC titles in the near future. I would like to see Havok become an open standard, tbh.


----------



## Mahigan

Quote:


> Originally Posted by *Ext3h*
> 
> Precisely.
> 
> He took a long running, heavily stalling job, converted it into a compute shader, and tried offloading into to an Compute Engine. Instant boost on GCN as it stopped blocking barriers and signals.
> 
> Instant failure on Nvidia as the job was now running in solitary, which caused the stalls to become even more severe.
> Check your PNs
> 
> 
> 
> 
> 
> 
> 
> 
> 
> For the public: It's working partially in Fable. They are using a single path for both platforms. It isn't perfect for either platform (especially AMD could have been utilized better), but it is working.
> 
> Funny enough, async compute in Fable has different effects on AMD and Nvidias hardware. Both result in a speedup, but the reason why there is a speedup differ greatly.


Ahhh... I've got mail.

Ok.. having a look


----------



## Xuper

Quote:


> Originally Posted by *Ext3h*
> 
> I have also serious doubts whether the GMU is actually involved when providing the "Compute Engine", because even pre-GK110 hardware (which is lacking the GMU completely!) can already switch between "Compute Engine" and "3D Engine" mode.
> 
> It's always 1 active CS grid while in 3D mode. In compute mode, all available grids are available for compute kernels.
> 
> *That "Work distributor" precedes the GK110*. Only the "Grid Management Unit" was newly added. So the GMU isn't actually involved with the DX12 path.


Quote:


> Simplified...
> 
> Maxwell 2: Queues in Software, work distributor in software (context switching), Asynchronous Warps in hardware, DMA Engines in hardware, CUDA cores in hardware.
> GCN: Queues/Work distributor/Asynchronous Compute engines (ACEs/Graphic Command Processor) in hardware, Copy (DMA Engines) in hardware, CUs in hardware.


Strange... so is the work distributor pure hardware or software?


----------



## Mahigan

Quote:


> Originally Posted by *Xuper*
> 
> strange...So work distributor is pure hardware or Software ?


Software, but the context switch is processed by the SMM/SMX, which requires a complete flush, meaning that all work the SMM/SMX was working on is lost when a high-priority/preemption request is sent. That is worse than I had originally thought, and it is what results in the added latency. Multi-engine concurrency (working on both Graphics and Compute tasks concurrently) is also problematic, and if this is true, then pure D3D12 programming is not well suited to mixed-mode loads under Maxwell 2. You'd be better off resorting to CUDA/HyperQ. This last part could, perhaps, still be fixed in the driver by assigning/converting DX12 requests to CUDA requests, and this is probably what nVIDIA is working on in their driver (as mentioned by Kollock).

In many ways, I was being kind to nVIDIA


----------



## HalGameGuru

2 Things.

1. I was under the impression most PhysX implementations in the wild right now are CPU-side, not GPU-side. From what I had heard, there are very few old-school, full-bore PhysX implementations that make use of bespoke hardware.

2. Is there perhaps an unfair conflation of ideas going on under "async compute"? There is working out of order, there is parallelism, there are concurrent graphics and compute operations, etc. I get the feeling we are often arguing and debating different things against each other without realizing it. There is a difference between making computations out of order, or on pieces of data that are not yet mature between operations, and doing multiple interconnected computations concurrently.


----------



## drSeehas

Quote:


> Originally Posted by *mtcn77*
> 
> ... Steve Burke, ... [Source] ...


Quote:


> NVidia highlighted that they've been working with Microsoft for *over four years* on DirectX 12, which almost came across as a bit sore -- as in, "we were doing this before Mantle was a thought."


And still no working Asynchronous Compute (concurrent graphics and compute)?


----------



## Tivan

Quote:


> Originally Posted by *drSeehas*
> 
> And still no working Asynchronous Compute (concurrent graphics and compute)?


Maybe async compute wasn't that major a feature of DX12 until AMD came along, so Nvidia is a little sore over AMD "hijacking" DX12.

Remember, Nvidia did suggest some DX12 features that are present in hardware on Nvidia cards but not on AMD's. They probably worked on those with MS.


----------



## Xuper

Quote:


> Originally Posted by *Mahigan*
> 
> Software, but the context switch is processed by the SMM/SMX, which requires a complete flush, meaning that all work the SMM/SMX was working on is lost when a high-priority/preemption request is sent. That is worse than I had originally thought, and it is what results in the added latency. Multi-engine concurrency (working on both Graphics and Compute tasks concurrently) is also problematic, and if this is true, then pure D3D12 programming is not well suited to mixed-mode loads under Maxwell 2. You'd be better off resorting to CUDA/HyperQ. This last part could, perhaps, still be fixed in the driver by assigning/converting DX12 requests to CUDA requests, and this is probably what nVIDIA is working on in their driver (as mentioned by Kollock).
> 
> In many ways, I was being kind to nVIDIA


Quote:


> Originally Posted by *Ext3h*
> 
> No, the asynchronicity refers to scheduling work ahead which is only blocked by fences, nothing else. And the GPU/driver deciding autonomously when to execute it. Nvidia does this part just fine, even though there are some mode transitions involved which can sum up quite badly if you provoke it.
> 
> Efficient asynchronicity is when concurrency gets involved. This part isn't working, respectively only for work committed to different queues. These additional queues are only available to CUDA based applications though. Which PhysX is.
> 
> Normally, the compute engines should have had independent queues, so the hardware would have been able to use at least draw call level preemption. But since everything goes through the 3D queue, and even involves a full pipeline flush, there is no concurrency with pure DX12. And that can cause significant differences between AMDs and Nvidias performance levels if the engine designer was expecting concurrency.


So nVidia was able to use Async Compute with CUDA (PhysX) but couldn't do it with pure DX12? If it's true that nVidia was able to use Async, then can I say that Async Compute is being done at the hardware level?
Still, I'm confused. Is it possible to use Async Compute without using context switching, or is it still required?

Quote:


> Originally Posted by *Mahigan*
> 
> With Arkham Knight, you have two "engines" running concurrently. There is the Arkham Knight engine as well as the PhysX libraries. It is evident that both would be running concurrently, though executed sequentially, from two separate sources. Asynchronous Compute is a *single command* within, a batch of commands, which instructs the GPU to execute two workloads at once (like Hyperthreading). This command would, for example, instruct the GPU to execute a Graphic task and concurrently execute a Compute task, within the same command line. This command could just as well instruct the GPU to execute two compute commands concurrently. *Two commands from the same source*. This requires a context switch (when executing a Graphic and Compute command in parallel). The commands are sent in parallel to the GPU.


I read somewhere that "async isn't a single command, it's multiple commands spread across multiple command queues"

and here https://msdn.microsoft.com/en-us/library/windows/desktop/dn899124%28v=vs.85%29.aspx

I think :

Quote:


> 1. A resource can be accessed for read and write from multiple command queues simultaneously, including across processes, only if it is in the state *D3D12_RESOURCE_STATE_UNORDERED_ACCESS*.
> 
> 2. A resource can be read from multiple command queues simultaneously, including across processes, only if it is in the state *D3D12_RESOURCE_STATE_GENERIC_READ*.
> 
> 3. All write operations, except for the unordered access case described in case 1 above, must be done exclusively by a single command queue at a time. When a resource has transitioned to a writeable state on a queue, it is considered exclusively owned by that queue and it must transition to *D3D12_RESOURCE_STATE_GENERIC_READ* before it can be accessed by another queue.
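
To make those three rules concrete, here is a toy Python sketch of who may touch a resource in each state. This is my own illustration, not the real D3D12 API: `access_allowed` and the `owner` parameter are hypothetical stand-ins for the bookkeeping the runtime does internally.

```python
# Toy model of the three sharing rules quoted above (not the real API):
# concurrent read/write needs UNORDERED_ACCESS, concurrent read-only
# sharing needs GENERIC_READ, and any other writeable state is owned
# exclusively by a single queue at a time.
UNORDERED_ACCESS = "D3D12_RESOURCE_STATE_UNORDERED_ACCESS"
GENERIC_READ = "D3D12_RESOURCE_STATE_GENERIC_READ"

def access_allowed(state, op, queue, owner):
    """owner: the queue holding exclusive write access, or None."""
    if state == UNORDERED_ACCESS:
        return True                      # rule 1: shared read and write
    if state == GENERIC_READ:
        return op == "read"              # rule 2: shared read only
    return queue == owner                # rule 3: exclusive writeable state

assert access_allowed(UNORDERED_ACCESS, "write", "compute", None)
assert access_allowed(GENERIC_READ, "read", "copy", None)
assert not access_allowed(GENERIC_READ, "write", "copy", None)
assert access_allowed("RENDER_TARGET", "write", "gfx", owner="gfx")
assert not access_allowed("RENDER_TARGET", "write", "compute", owner="gfx")
```

The point of rule 3 is the transition cost: once a queue has written, the resource must go back to `D3D12_RESOURCE_STATE_GENERIC_READ` before any other queue can even read it.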


----------



## Ext3h

Quote:


> Originally Posted by *Xuper*
> 
> So nVidia was able to use Async Compute with CUDA (PhysX) but couldn't do it with pure DX12? If it's true that nVidia was able to use Async, then can I say that Async Compute is being done at the hardware level?
> 
> Still, I'm confused. Is it possible to use Async Compute without using context switching, or is it still required?


Context switching can't be avoided entirely.

It's happening on multiple levels: once in the scheduling frontend, where the GPU switches between 3D and pure compute mode as a whole, and once again at the SMM/SMX level, where each individual unit needs to switch as well.

That first mode switch could be avoided, if it were working correctly. The second one can't be avoided yet, at least not with Maxwell.
Quote:


> Originally Posted by *Xuper*
> 
> I read somewhere that "async isn't a single command, it's multiple commands spread across multiple command queues"


One step further, and it would be complete. Async also requires that the queues can move freely in relation to each other. There is no "happens before" or "happens after" relation unless explicitly modelled with signals and fences. The execution order is just fixed inside each queue.

But that alone doesn't get you much, except for some freedom in the execution schedule. In order to take full advantage of that freedom, you need to be responsive in terms of using every possible idle phase for interleaved execution. If you don't, your possible gains are limited to solely reducing context switching, nothing else.
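A rough way to picture that "order is fixed inside each queue, free across queues unless fenced" idea is as a toy Python model. The queue names, task labels, and the `valid_schedules` helper are all made up for illustration; this is not the D3D12 API.

```python
# Toy model of D3D12-style multi-engine submission: execution order is
# fixed *within* each queue, while separate queues may interleave freely
# unless a fence imposes a "happens-before" edge between them.
import itertools

def valid_schedules(queues, fences):
    """Enumerate global orderings consistent with per-queue order and fences.

    queues: dict name -> list of task labels (in submission order)
    fences: set of (before_task, after_task) cross-queue constraints
    """
    tasks = [t for q in queues.values() for t in q]
    constraints = set(fences)
    for q in queues.values():
        # Submission order implies (earlier, later) pairs inside a queue.
        constraints.update((q[i], q[j])
                           for i in range(len(q))
                           for j in range(i + 1, len(q)))
    for perm in itertools.permutations(tasks):
        index = {t: i for i, t in enumerate(perm)}
        if all(index[a] < index[b] for a, b in constraints):
            yield perm

queues = {"gfx": ["G1", "G2"], "compute": ["C1"]}
# Without a fence, C1 may land anywhere relative to G1/G2.
free = list(valid_schedules(queues, set()))
# A fence forcing C1 to wait on G1 removes some interleavings.
fenced = list(valid_schedules(queues, {("G1", "C1")}))
print(len(free), len(fenced))  # prints "3 2"
```

Every fence shrinks the set of legal schedules; the freedom that remains is exactly what a responsive scheduler can exploit for interleaved execution.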
Quote:


> Originally Posted by *Mahigan*
> 
> Software but the context switch is processed by the SMM/SMX which requires a complete flush, meaning that all work the SMM/SMX was working on is lost when a high priority/preemption request is sent, which is worse than I had originally thought. This is what results in the added latency.


A flush doesn't necessarily imply data loss, as it can easily wait until the SMM underruns its current queue, plus this only goes for mixing compute/3D kernels. It's really just a minor issue, compared to the rest.

Maxwell doesn't even have such fine-grained preemption that it would be possible to evict running jobs in any way. It can only preempt jobs which haven't started execution yet. That's what NV meant by "Preemption at draw call boundaries" at this year's GDC talk.

It's not losing any progress either; in fact, it looks like Nvidia is abusing regular signals to communicate to the GPU how far the work in each queue has advanced. You can even see these signals in GPUView, in the device context. You will notice that there is an additional fence placed on all command buffers whose value correlates with the length of that specific command buffer.

Well, actual "preemption" with data loss does happen, whenever the driver decides to commit suicide because something timed out. That even causes duplicated work. But it's by no means the regular case. It's more like a reset button on the GPU.


----------



## PontiacGTX

Another game using Async Compute:
http://www.overclock.net/t/1590939/wccf-hitman-to-feature-best-implementation-of-dx12-async-compute-yet-says-amd/


----------



## Mahigan

It would also appear that we've been comparing apples to oranges all along in AotS. AMD cards are producing more effects and rendering more content than nvidia cards.
http://forums.anandtech.com/showthread.php?t=2462951&page=6

Dynamic lighting is missing on nvidia cards. Dynamic lighting uses Async Compute. Smoke effects, like the smoke clouds, are also missing on nvidia.


----------



## GorillaSceptre

No way! Not surprising though..

Edit:

Tech sites doing their jobs as usual..









As great as this thread is, I think that info needs a new one..


----------



## p00q

New patch, new tests from AoS - http://www.nordichardware.se/Grafikkort-Recensioner/ashes-of-the-singularity-beta-2-vi-testar-directx-12-och-multi-gpu/Prestandatester.html#content

If it doesn't work, try using Google cache.


----------



## Roboyto

@Mahigan

http://www.anandtech.com/show/10067/ashes-of-the-singularity-revisited-beta

They specifically tested the effects of Async, which has been increased in the latest bench apparently:





Things are looking good for us GCN card owners


----------



## PontiacGTX

Quote:


> Originally Posted by *Roboyto*
> 
> @Mahigan
> 
> http://www.anandtech.com/show/10067/ashes-of-the-singularity-revisited-beta
> 
> They specifically tested the effects of Async, which has been increased in the latest bench apparently:
> 
> 
> 
> 
> 
> 
> 
> Things are looking good for us GCN card owners


GCN2.0 and 3.0 more than 1.0


----------



## Roboyto

Quote:


> Originally Posted by *PontiacGTX*
> 
> GCN2.0 and 3.0 more than 1.0


1.1 & 1.2?


----------



## PontiacGTX

Quote:


> Originally Posted by *Roboyto*
> 
> 1.1 & 1.2?


yeah but AMD uses 1.0,2.0,3.0 iteration as their official nomenclature


----------



## Roboyto

Quote:


> Originally Posted by *PontiacGTX*
> 
> yeah but AMD uses 1.0,2.0,3.0 iteration as their official nomenclature


Nomenclature aside, it is nice to see the hardware will be able to be utilized to its full potential; Some more than others obviously.

Even though they added more Async operations to the current benchmark, AotS as a whole doesn't use a 'ton' of it. There are more games to come that will likely rely on it more...exciting to see if this trend continues where we are currently seeing *up to 20% improvement under the most demanding circumstances* in resolution and settings.


----------



## NightAntilli

I think we can potentially see way more than 20%. In fact, I would be surprised if we can't surpass 50% with it.


----------



## Mahigan

Quote:


> Originally Posted by *Roboyto*
> 
> Nomenclature aside, it is nice to see the hardware will be able to be utilized to its full potential; Some more than others obviously.
> 
> Even though they added more Async operations to the current benchmark, AotS as a whole doesn't use a 'ton' of it. There are more games to come that will likely rely on it more...exciting to see if this trend continues where we are currently seeing *up to 20% improvement under the most demanding circumstances* in resolution and settings.


Wait for Hitman







March 11.


----------



## Roboyto

Quote:


> Originally Posted by *NightAntilli*
> 
> I think we can potentially see way more than 20%. In fact, I would be surprised if we can't surpass 50% with it.


I am still being cautiously optimistic. Though that optimism grows with each piece of supporting evidence.

Quote:


> Originally Posted by *Mahigan*
> 
> Wait for Hitman
> 
> 
> 
> 
> 
> 
> 
> March 11.


I'm awares







I've been sitting here patiently with my 'lowly' R9 290 that I bought so long ago...happily watching as it has surpassed 780Ti performance, usually wins over a 970, and, who knows, if DX12 pans out that great in my favor it may even be tackling the 980 next, judging by how it is trailing the Ti in AotS. I will still want a Polaris for my main rig since I know I will be shredding new titles in 3K Eyefinity, but my HTPC at 1080P might be good for a little while longer.

Too bad DX12 was canned for the release of RotR...although it looks like it may be coming sooner than anticipated: http://www.techtimes.com/articles/131968/20160209/rise-of-the-tomb-raider-for-pc-has-dx12-option-its-just-not-enabled-yet.htm


----------



## Forceman

Quote:


> Originally Posted by *Mahigan*
> 
> Wait for Hitman
> 
> 
> 
> 
> 
> 
> 
> March 11.


I seem to recall hearing the same thing about Fable: Legends.


----------



## Roboyto

Quote:


> Originally Posted by *Forceman*
> 
> I seem to recall hearing the same thing about Fable: Legends.


Fable is optimized to run on the XBONE, which has a fraction of the capabilities of big GCN hardware. I'm not sure of the number of DX12 features in Fable, but Hitman is looking good on a 390.

Hitman preview 970 VS 390 1080P all max settings. AMD nearly always leading in this short snip...and it's no secret the AMD cards get better with higher resolution:

https://www.youtube.com/watch?v=fq6GyUzyuJQ


----------



## Forceman

Quote:


> Originally Posted by *Roboyto*
> 
> Fable optimized to run on XBONE which has a fraction of the capabilities of big GCN hardware. I'm not sure of the amount of DX12 features in Fable, but Hitman is looking good on a 390.
> 
> Hitman preview 970 VS 390 1080P all max settings. AMD nearly always leading in this short snip...and it's no secret the AMD cards get better with higher resolution:
> 
> https://www.youtube.com/watch?v=fq6GyUzyuJQ


My point is more in the vein of it always being "wait for X" with AMD.

Personally I think it's a bad sign that the Hitman beta/demo that released (and is shown in that video) is DX11, not DX12. If they are releasing a major DX12 title in 3 weeks, why was this demo DX11?


----------



## Roboyto

Quote:


> Originally Posted by *Forceman*
> 
> My point is more in the vein of it always being "wait for X" with AMD.
> 
> Personally I think it's a bad sign that the Hitman beta/demo that released (and is shown in that video) is DX11, not DX12. If they are releasing a major DX12 title in 3 weeks, why was this demo DX11?


Didn't realize it was DX11, but that could pan out even better.

I don't know why it's DX11, but an equally pressing question is why are they now going to release a patch for RotR to enable DX12? Is the same situation going to happen for Hitman?

A waiting game...yeah it's happened, but no one was waiting when the 290 dethroned the mighty Titan for $400. The wait for FX chips was disappointing, and consequently pushed me to buy an i5. They are capable of playing games just fine...they obviously don't bench as well. The wait for Fury was also disappointing...I was very excited to purchase one...and haven't...yet. But maybe the wait for Fury's shining moment is coming to a close; a moment the FX chips never got to see.

My wait, and investment, for a 290 has panned out better than anticipated, with the GPU aging like a fine wine. The same can probably be said of most/all of AMD's rebrands of the 7XXX GPUs, which are still competing with Nvidia's latest offerings. I'm not talking about power consumption, but I will mention the Nano, which is freaking fantastic in that respect. If the Nano is a hint at what's to come in the efficiency department, that's a wait to be pumped about IMO.

Maybe this particular wait for DX12 implementation will be worth it.

Either way I'll be sitting here watching the 'show', with my 2 year old GPU, where the 'previews' are hinting at things still getting better.


----------



## Mahigan

I think people will start to realize why AMD was able to re-brand Hawaii into Grenada. It's a very sturdy architecture.

Probably the best architecture ever; I mean, it's been competitive with two NVIDIA generations. The 8800 GTX was impressive, but Hawaii has it beat.

The 2900XT/3870 were low-ball GPUs; ATi reclaimed the throne with the 4870.

The 290/290x, on the other hand, have supplanted their Kepler competitors and now their Maxwell competitors.


----------



## ronnin426850

Very nice thread, op, you have a typo there:
Quote:


> 4.The Work Distributor transfers the 32 Compute or 1 Graphic and 31 Compute tasks to the SMMs which are a hardware component within*G* the nVIDIA GPU.


I only noticed it because I always do it too


----------



## Mahigan

I've updated the guide.


----------



## sawe

Let the discovery work continue. Async on Geforce but how it works. http://www.dsogaming.com/news/directx-12-async-compute-supported-in-latest-nvidia-drivers-steam-overlay-works-in-dx12-mode/


----------



## p4inkill3r

I don't know how much of the difference lies in async comp, but now that AoTS is out, there are some more reviews coming in.
[H] shows Radeon with a significant lead in dx11 & dx12 over nvidia.

http://www.hardocp.com/article/2016/04/01/ashes_singularity_day_1_benchmark_preview/1#.Vv85KaQrIuU


----------



## GnarlyCharlie

Wonder if the cards were stock clocks or what during the [H] tests?


----------



## mtcn77

[Source]


----------



## bluezone

Quote:


> Originally Posted by *mtcn77*
> 
> 
> 
> [Source]


makes me want to re-download star swarm to see if it improves.


----------



## mtcn77

Quote:


> Originally Posted by *bluezone*
> 
> makes me want to re-download star swarm to see if it improves.


Good point, it says Hitman and Ashes of the Singularity already use it. Makes you wonder about Star Swarm for sure...
Bring us the details AMD.


----------



## PontiacGTX

Quote:


> Originally Posted by *mtcn77*
> 
> Good point, it says Hitman and Ashes of the Singularity already uses it. Makes you wonder about Star Swarm for sure...
> Bring us the details AMD.


It doesn't use asynchronous compute? It just had the basic DX12 features.


----------



## bluezone

Maybe I'm taking this part of an old interview with oxide out of context? (not sarcasm) It took me a while to find the quote.

"On the GPU side, the graphics core runs through a list of commands, but what will happen is that you can get bubbles within those lists, in terms of whether the rasterizer is busy or the AOUs are busy. There's a bunch of compute units working within that graphics card. Mantle opens up and allows us to submit commands into that asynchronously, so the GPU can keep itself full the whole time. You don't have to wait for those in a serial fashion."

from: http://venturebeat.com/2014/01/21/a-deep-dive-into-the-making-of-the-eye-popping-star-swarm-demo-interview/.


----------



## PontiacGTX

Quote:


> Originally Posted by *bluezone*
> 
> Maybe I'm taking this part of an old interview with oxide out of context? (not sarcasm) It took me a while to find the quote.
> 
> "On the GPU side, the graphics core runs through a list of commands, but what will happen is that you can get bubbles within those lists, in terms of whether the rasterizer is busy or the AOUs are busy. There's a bunch of compute units working within that graphics card. Mantle opens up and allows us to submit commands into that asynchronously, so the GPU can keep itself full the whole time. You don't have to wait for those in a serial fashion."
> 
> from: http://venturebeat.com/2014/01/21/a-deep-dive-into-the-making-of-the-eye-popping-star-swarm-demo-interview/.


They didn't mention DirectX 12; maybe that's why their game was better on Mantle than on DX12.


----------



## bluezone

Quote:


> Originally Posted by *PontiacGTX*
> 
> they didnt mention Directx 12,maybe that why their game was better on Mantle than DX12


Mantle was created because DX12 didn't exist as far as anyone knew at the time.

http://www.pcworld.com/article/2109596/directx-12-vs-mantle-comparing-pc-gamings-software-supercharged-future.html


----------



## airfathaaaaa

Quote:


> Originally Posted by *bluezone*
> 
> Mantle was created because DX12 didn't exist as far as anyone knew at the time.
> 
> http://www.pcworld.com/article/2109596/directx-12-vs-mantle-comparing-pc-gamings-software-supercharged-future.html


Pretty sure DX11.3 is essentially DX12 with the added commands that they took from Mantle.


----------



## bluezone

Quote:


> Originally Posted by *airfathaaaaa*
> 
> pretty sure dx11.3 is essentially dx12 with the added commands that they took from mantle


I had not heard of DX11.3. Good to know. I haven't been at this long.

http://wccftech.com/directx-11-3-api/

I couldn't find anything about Asynchronous Compute in dx11.3. Only a few features of dx12. Did it have Asynchronous Compute?


----------



## L36

Quote:


> Originally Posted by *airfathaaaaa*
> 
> pretty sure dx11.3 is essentially dx12 with the added commands that they took from mantle


Source? DirectX 11.3 is just a simplified version of DX12 that retains the simple API layer similar to what is present in DX11-11.2, for less experienced developers.
Essentially it was created to help inexperienced developers utilize the new features that came with DX12, but without the in-depth technical know-how needed to really utilize DX12 to its full potential.


----------



## Mahigan

Just for a bit of fun...

*Maxwell vs. Fiji*

*ROPs*

Maxwell:

_A few corrections for GM204/200: the ROP ratio is up from 8 to 16, and the available L2 cache is down from 1MB to 512KB, compared to the GM107 pictured here_

Fiji:

_A few corrections for Fiji: the L2 cache ratio is up from 128KB to 256KB per 8 ROPs, compared to the Hawaii pictured here_


_Depth test (Z)/stencil ops and color ops both have dedicated caches_

*Notes*:
1. Fiji's ROPs are fully pipelined and work together whereas Maxwell's are not and must communicate through a bi-directional crossbar.
2. Fiji's ROPs have access to an extra cache layer being the Global Data Share cache.
3. Fiji has more dedicated color cache than Maxwell. One pool of color cache per 4 ROPs as opposed to a single pool of color cache per 16 ROPs on Maxwell.

*Conclusion*:
Fiji's ROPs are individually more powerful than Maxwell's ROPs when pressure is applied to the L2 cache by other logical units within the GPU. As we will see in my next post, that is very much the case when we look at Maxwell's SMMs and their caching hierarchy.

Under synthetic tests, Maxwell will appear to be much more powerful than Fiji ROP wise but once pressure is applied to Maxwell's L2 cache this synthetic advantage disappears. Ex: 4K gaming scenarios.


----------



## Mahigan

To be continued... Saving this for next post.

*Note*: _A Wavefront is 64 threads wide and a Warp is 32 threads wide_

*Shader Multiprocessors (CU vs. SMM)*

Maxwell SMM:


Maximum of 64 concurrent Warps (64 x 32 threads = 2,048 threads).
96KB L1 cache shared between: 128 CUDA Cores, 8 Texture Mapping Units, 32 Load/Store Units, 32 Special Function Units.
Access to L2 cache when L1 is not enough (cache spill).
*Notes*:
1. Though there is a maximum of 64 concurrent Warps per SMM, performance drops off after 16 concurrent Warps due to a spill into L2 cache. That sets a maximum of 512 threads, per SMM, before performance falls off. See below:

2. Beyond3D easily hit Maxwell's limits, leading to system crashes and driver/Windows safety features terminating testing.
3. Not a good candidate for Asynchronous Compute + Graphics due to weak threading capabilities.

Fiji CU:


Maximum of 40 concurrent Wavefronts (40 x 64 = 2,560 threads).
88KB L1 cache shared between: 64 Stream Processors, 4 Texture Mapping Units, 1 Scalar Unit.
Access to L2 cache when L1 is not enough (cache spill).
*Notes*:
1. The maximum of 40 concurrent Wavefronts per CU is achievable without a performance hit provided there is available register space. Register space allows for 32-40 concurrent Wavefronts.
2. Beyond3D were not able to push GCN beyond its limitations.
3. Perfect candidate for Asynchronous Compute + Graphics due to untapped parallel/threading resources.
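
For reference, here is the thread-count arithmetic behind the two comparisons above, using only the figures quoted in this thread (assuming a 32-thread Warp and a 64-thread Wavefront, as noted at the top of the post):

```python
# Per-unit thread capacity, from the numbers quoted above.
WARP, WAVEFRONT = 32, 64

maxwell_max = 64 * WARP     # 2048 threads per SMM (theoretical ceiling)
maxwell_sweet = 16 * WARP   # 512 threads before the reported L2 spill
fiji_max = 40 * WAVEFRONT   # 2560 threads per CU (register space permitting)

print(maxwell_max, maxwell_sweet, fiji_max)  # prints "2048 512 2560"
```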

*Conclusion*:
Maxwell is an incredible architecture for light compute loads mixed with strong Graphics rendering demands. Fiji is an incredible architecture for medium-to-heavy Graphics rendering demands and heavy compute loads. We will most likely see Fiji surpass Maxwell GM200 in around 6 months' time as more DX12 titles and console ports make their way to the PC, provided that developers take note of Fiji's 4GB frame buffer limitations and program accordingly. This is primarily due to the increased ratio of compute to Graphics rendering loads brought on by the console effect. The console effect stems from the console APUs having run out of Graphics Rendering steam by the first half of 2015's title releases. Developers have had to find ways to use the compute capabilities of the consoles' APUs in order to process what was traditionally Graphics Rendering pipeline work. The end result has been an increase in GCN's capabilities and performance, relative to Maxwell and Kepler, when running the subsequent console-to-PC ports.

ROPs should not be a limiting factor for Fiji, at least no more than they are for Maxwell. Increased L2 cache strain on Maxwell, due to SMM cache spills brought on by heavier compute demands, really places a burden on its real world ROP throughput.

Asynchronous Compute + Graphics might still yield beyond-10% performance improvements for Fiji, depending on the implementation. Consoles are seeing +30%. Asynchronous Compute + Graphics remains hard to tune on the PC due to the varying GCN configurations.

Once Pascal launches, Maxwell may very well end up, relative to Pascal, the way Kepler now stands relative to Maxwell. Expect to see more cache redundancy in Pascal relative to Maxwell. HBM2 should also play a large role in alleviating some of these bottlenecks in NVIDIA's uarch, particularly ROP throughput. Expect to see GP100 perform admirably.


----------



## PontiacGTX

How exactly do ROPs impact performance with Asynchronous Compute? It seems that lower resolutions see a higher benefit from Async Compute; or does this explain the ROP bottleneck in Fiji?


----------



## Mahigan

Quote:


> Originally Posted by *PontiacGTX*
> 
> How exactly do ROPs impact performance with Asynchronous Compute? It seems that lower resolutions see a higher benefit from Async Compute; or does this explain the ROP bottleneck in Fiji?


The added compute occupancy per SMM leads to the L2 cache being occupied by both compute and ROP work. You get lower frame-time latency and higher SMM occupancy. A higher SMM occupancy means more concurrent Warps.

Without Asynchronous Compute + Graphics, we have synchronous execution of compute tasks, which leads to lower SMM occupancy but longer frame-time latency. Fewer concurrent Warps.


----------



## Mahigan

And look here... This confirms the above. Funny I just posted the above before NV announced Pascal.

Pascal maps directly to one Wavefront (like GCN). It is more GCN-like. There are two Warp schedulers per SM. Each Warp is 32 threads wide for a total of 64 threads which is a Wavefront.


So yeah, I stated that NV would have to be more GCN-like to compete due to the Console effect and that's where they're headed. GP100 is a 300Watt monster and a 610mm2 die.
Quote:


> *Overall shared memory across the GP100 GPU is also increased due to the increased SM count, and aggregate shared memory bandwidth is effectively more than doubled. A higher ratio of shared memory, registers, and warps per SM in GP100 allows the SM to more efficiently execute code. There are more warps for the instruction scheduler to choose from, more loads to initiate, and more per-thread bandwidth to shared memory (per thread).*


Source: https://devblogs.nvidia.com/parallelforall/inside-pascal/


----------



## GnarlyCharlie

Quote:


> Originally Posted by *p4inkill3r*
> 
> I don't know how much of the difference lies in async comp, but now that AoTS is out, there are some more reviews coming in.
> [H] shows Radeon with a significant lead in dx11 & dx12 over nvidia.
> 
> http://www.hardocp.com/article/2016/04/01/ashes_singularity_day_1_benchmark_preview/1#.Vv85KaQrIuU


Quote:


> Originally Posted by *GnarlyCharlie*
> 
> Wonder if the cards were stock clocks or what during the [H] tests?


It appears that the [H] cards were reference cards at stock reference clocks, so 1050 for Fury X, 1075 for 980Ti FWIW.


----------



## PontiacGTX

Quote:


> Originally Posted by *GnarlyCharlie*
> 
> It appears that the [H] cards were reference cards at stock reference clocks, so 1050 for Fury X, 1075 for 980Ti FWIW.


There is no problem; an R9 390X still beats a 980 Ti.


----------



## Paul17041993

Quote:


> Originally Posted by *Mahigan*
> 
> 2. Beyond3D easily hit Maxwell's limits, leading to system crashes and driver/Windows safety features terminating testing.


I've been able to crash and hang nvidia cards through OGL and OCL/HLSL simply because nvidia's driver team don't understand the concepts of scheduling safety, concurrency/multi-tasking, and undefined/unimplemented operations (e.g. integer division).

That or they're so obsessed with performance that the majority of the code is just a hack-job...


----------



## amd955be5670

Long-time lurker here; in fact, for the past 2 days at work I've been reading this thread, lol.

So I have a question: if Maxwell fails at async, can't nvidia disable it properly so as not to incur a performance loss when it is requested by the DX12 API?


----------



## GnarlyCharlie

Quote:


> Originally Posted by *PontiacGTX*
> 
> There is no problem; an R9 390X still beats a 980 Ti.


Sorry, never said there was a problem, I was just curious as to what, if any, overclock was applied in the [H] test.

And +.4 fps max/ -6.35 fps min ain't exactly "beating", but take what you can I guess. I wish I had W10 on my 980Ti rig, I'd like to do a few measurements of my own.


----------



## Mahigan

Quote:


> Originally Posted by *amd955be5670*
> 
> Long-time lurker here; in fact, for the past 2 days at work I've been reading this thread, lol.
> 
> So I have a question: if Maxwell fails at async, can't nvidia disable it properly so as not to incur a performance loss when it is requested by the DX12 API?


The developer needs to add a non-Asynchronous compute + graphics path in order for NVIDIA to avert a performance loss associated with the fences used by the Async compute + graphics feature.

NVIDIA have disabled Asynchronous compute + graphics in their driver. If Asynchronous compute + graphics is enabled in an application, the NVIDIA software driver will simply execute the code synchronously. The issue is that the fences remain in place.

Fences are like tiny pauses, added to the code, which control synchronization between various contexts. Contexts are different queues found in DX12. There are three contexts, Graphics, Compute and Copy. If work from the Graphics queue needs to be synchronized with work from the compute queue, a fence is put in place. This fence introduces a wait time on the graphics or compute queue (whichever has less work to do time wise) so that the other queue (graphics or compute whichever has more work to do time wise) has time to complete its work. This wait time or pause stops the queue, under a fence, from processing more work. In other words, a fence places the GPU in an idle state.

A fence affects both NVIDIA and AMD GPUs, but for AMD, the performance cost associated with a fence is mitigated by the decreased time it takes to perform both the Compute and Graphics tasks, because they're performed in parallel (at the same time). For NVIDIA, since the compute and graphics tasks are performed synchronously (one after the other), the fence introduces a pause in the processing, leading to a performance loss.
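
As a toy illustration of that cost (the millisecond figures are made up; this is just the arithmetic behind the explanation, not a real profiler trace):

```python
# Hypothetical frame-time arithmetic: with concurrent execution the fence
# wait is hidden by overlap, so the frame costs max(gfx, compute); with
# serial execution the queues run back to back and the fence is pure stall.
def frame_time(gfx_ms, compute_ms, concurrent):
    if concurrent:
        # Both queues run in parallel; the shorter one idles at the fence
        # until the longer one signals completion.
        return max(gfx_ms, compute_ms)
    # Serial execution: one workload after the other.
    return gfx_ms + compute_ms

gfx, compute = 10.0, 4.0  # made-up per-frame workloads in ms
print(frame_time(gfx, compute, concurrent=True))   # prints 10.0
print(frame_time(gfx, compute, concurrent=False))  # prints 14.0
```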

Hope that answers your question.


----------



## amd955be5670

Yeah, that did indeed answer my question. And a few posts ago a possible Maxwell solution was suggested, but it's still not as effective as AMD's. Dang.

Thanks for this thread, it's been an eye opener. Keep up the good work


----------



## CrazyElf

So, to summarize for Vega, here is what we think will happen (our best guess):


4096 cores
CUs are still 64, but this time potentially, as you've suggested, split between 2 groups of 32 SIMD lanes with an upgraded L1 cache; this should result in shader efficiency improvements (and address the occupancy limits)
8 shader engines, so this means 8 shader engines x 8 CUs per shader engine x 64 SIMD lanes per CU (= 4096 cores)
ROP count will be doubled to 128, as will the geometry processor count
Also the geometry processor output will be upgraded significantly, so more than double effective triangle output (important as this is a bottleneck on AMD GPUs right now and one reason why tessellation causes such huge frame rate drops)
New hardware scheduler, which should significantly upgrade the power efficiency
L2 cache is vastly upgraded (fewer DRAM requests needed)
Instruction pre-fetching upgrades
Major upgrades to the memory compression
Last, but not least, there is of course HBM2 which doubles the memory bandwidth - effective bandwidth is more than doubled due to the color compression
Then there's the multimedia cores and display engine. HEVC encoding is supported now. I'm expecting AMD to make a big push for VR with all of this. They'll retain the async advantage with the 8 ACEs.

All in all, even though they kept the SIMDs the same, Vega should be a big upgrade.

Quote:


> Originally Posted by *Mahigan*
> 
> And look here... This confirms the above. Funny I just posted the above before NV announced Pascal.
> 
> Pascal maps directly to one Wavefront (like GCN). It is more GCN-like. There are two Warp schedulers per SM. Each Warp is 32 threads wide for a total of 64 threads which is a Wavefront.
> 
> 
> So yeah, I stated that NV would have to be more GCN-like to compete due to the Console effect and that's where they're headed. GP100 is a 300Watt monster and a 610mm2 die.
> Source: https://devblogs.nvidia.com/parallelforall/inside-pascal/


They upgraded the cache quite a bit: https://images.nvidia.com/content/pdf/tesla/whitepaper/pascal-architecture-whitepaper.pdf
Quote:


> L1/L2 Cache Changes in GP100
> While Fermi and Kepler GPUs featured a 64 KB configurable shared memory and L1 cache that could split
> the allocation of memory between L1 and shared memory functions depending on workload, beginning
> with Maxwell, the cache hierarchy was changed. The GM200 SM has its own dedicated pool of shared
> memory (64 KB/SM) and an L1 cache that can also serve as a texture cache depending on workload. The
> unified L1/texture cache acts as a coalescing buffer for memory accesses, gathering up the data
> requested by the threads of a warp prior to delivery of that data to the warp.
> 
>  Note: One CUDA Thread Block cannot allocate 64 KB of shared memory by itself, but two Thread Blocks
> could use 32 KB each, etc..
> A dedicated shared memory per SM means applications no longer need to select a preference of the
> L1/shared split for optimal performance- the full 64 KB per SM is always available for shared memory.
> GP100 features a unified 4096 KB L2 cache that provides efficient, high speed data sharing across the
> GPU. In comparison, GK110's L2 cache was 1536 KB, while GM200 shipped with 3072 KB of L2 cache.
> With more cache located on-chip, fewer requests to the GPU's DRAM are needed, which reduces overall
> board power, reduces memory bandwidth demand, and improves performance.


Like AMD, they've upgraded the cache a lot.

They still don't have the 8 ACEs unlike the 290X / Fury X. That is the big difference. I think though that it may have to wait for Volta. Volta must be a huge leap then over Pascal.

An interesting speculation then becomes what will Volta vs Navi look like. We might see in the future a sort of convergence between the two GPU makers on what GPUs are like - compute heavy for heavier tasks.



We know from the slides that Navi will be upgrading the memory again - as for what scalability could mean remains a mystery. It could mean a giant die - the 14/20nm hybrid should be more mature by then.

My personal wet dream is to see a giant >600mm^2 AMD GPU + whatever memory upgrades they make.


----------



## mtcn77

Quote:


> Originally Posted by *CrazyElf*
> 
> So to summarize then for Vega, what we think will happen (our best guess)
> 
> 
> 4096 cores
> CUs are still 64 but this time potentially as you've suggested split between 2 groups of 32 SMIDs with an upgraded L1 cache; should result in shader efficiency improvements (and address the occupancy limits)
> 8 shader engines, so this means that there will be 8 shader engines x 8 CUs per shader engine x 64 SIMDs per CU
> ROP count will be doubled to 128, as will the geometry processor count
> Also the geometry processor output will be upgraded significantly, so more than double effective triangle output (important as this is a bottleneck on AMD GPUs right now and one reason why tessellation causes such huge frame rate drops)
> New hardware scheduler, which should significantly upgrade the power efficiency
> L2 cache is vastly upgraded (fewer DRAM requests needed)
> Instruction pre-fetching upgrades
> Major upgrades to the memory compression
> Last, but not least, there is of course HBM2 which doubles the memory bandwidth - effective bandwidth is more than doubled due to the color compression
> Then there's the multimedia cores and display engine. HEVC encoding is supported now. I'm expecting AMD to make a big push for VR with all of this. They'll retain the async advantage with the 8 ACEs.
> 
> All in all, even though they kept the SIMDs the same, Vega should be a big upgrade.
> They upgraded the cache quite a bit: https://images.nvidia.com/content/pdf/tesla/whitepaper/pascal-architecture-whitepaper.pdf
> They still don't have the 8 ACEs unlike the 290X / Fury X. That is the big difference. I think though that it may have to wait for Volta. Volta must be a huge leap then over Pascal.
> 
> An interesting speculation then becomes what will Volta vs Navi look like. We might see in the future a sort of convergence between the two GPU makers on what GPUs are like - compute heavy for heavier tasks.
> 
> We know from the slides that Navi will be upgrading the memory again - as for what scalability could mean remains a mystery. It could mean a giant die - the 14/20nm hybrid should be more mature by then.


Nvidia already said the next architecture would resemble GCN, so that's a given. The real question is whether they can really make up for the time lost to the failure of HMC and the scramble to HBM. We know that on the original roadmap Volta would be here by now; it got replaced by Pascal and delayed by a year, and now we learn both of them are delayed by another full year cycle.


----------



## Mahigan

I think AMD went a step further than splitting each CU into two groups of 32 ALUs. I think that AMD did one of two things (or both) for Polaris...

1. Each CU retains 4 groups of 16 ALUs, but each ALU can be individually power gated. Meaning that unused ALUs are powered down and used ALUs are boosted. So while a Polaris GPU may have a clock speed of, say, 1 GHz, individual ALUs will be capable of boosting to, say, 1.8 GHz. The power savings from the shut-down ALUs allow for the higher clock speed of the active ALUs.

2. AMD will split each CU into 4 groups of ALUs of differing sizes. The first group may have 2 ALUs, the second 4, and the third 8. Each group can support concurrent wavefronts (like hyperthreading), basically executing multiple workloads at once. The power gating remains as described in point 1.

That's what I figure AMD have done based on recent patent filings.
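A toy model of the power-gating idea in point 1: hold the CU's power budget fixed and redistribute it to the active ALUs. If per-ALU dynamic power scales roughly as f³ (voltage tracking frequency), fewer active ALUs means higher clocks. All numbers here are illustrative, not AMD's actual DVFS behavior:

```python
# Toy model: a fixed power budget shared by the active ALUs in a CU.
# Dynamic power ~ f * V^2, and V roughly tracks f, so P ~ f^3 per ALU.
# Purely illustrative -- not AMD's real power-management scheme.
BASE_CLOCK_GHZ = 1.0
TOTAL_ALUS = 64

def boosted_clock(active_alus):
    # Whole-CU budget at base clock, redistributed to the active ALUs.
    budget = TOTAL_ALUS * BASE_CLOCK_GHZ ** 3
    return (budget / active_alus) ** (1 / 3)

for active in (64, 32, 16, 8):
    print(active, round(boosted_clock(active), 2))
```

Under this crude model, gating off three quarters of the ALUs buys roughly a 1.6x clock boost, the same ballpark as the speculated 1 GHz to 1.8 GHz jump.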


----------



## PontiacGTX

Quote:


> Originally Posted by *Mahigan*
> 
> I think AMD went a step further than splitting each CU into two groups of 32 ALUs. I think that AMD did one of two things (or both) for Polaris...
> 
> 1. Each CU retains 4 groups of 16 ALUs but each ALU can be individually power gated. Meaning that unused ALUs are powered down and used ALUs are boosted. So while a Polaris GPU may have a clockspeed of say 1Ghz, individual ALUs will have the capability of boosting to say 1.8GHz. The power savings from the shut down ALUs allow for the higher clock speed of the active ALUs.
> 
> 2. AMD will split each CU into 4 Groups of ALUs of differing sizes. The first group may have 2 ALUs, the second 4 and the third 8. Each group can support concurrent wavefronts (like hyperthreading). Basically executing multiple workloads at once. The power gating remains as described in post number 1.
> 
> That's what I figure AMD have done based on recent patent filings.


and more cache on LDS?


----------



## Mahigan

I don't think Local CU caches will change, neither will the LDS or GDS but the L2 cache will get a healthy increase. Increasing the cache sizes takes away from implementing more ROPs, CUs etc. So there's always a delicate balance that must be maintained.


----------



## Paul17041993

Quote:


> Originally Posted by *Mahigan*
> 
> I think AMD went a step further than splitting each CU into two groups of 32 ALUs. I think that AMD did one of two things (or both) for Polaris...
> 
> 1. Each CU retains 4 groups of 16 ALUs but each ALU can be individually power gated. Meaning that unused ALUs are powered down and used ALUs are boosted. So while a Polaris GPU may have a clockspeed of say 1Ghz, individual ALUs will have the capability of boosting to say 1.8GHz. The power savings from the shut down ALUs allow for the higher clock speed of the active ALUs.
> 
> 2. AMD will split each CU into 4 Groups of ALUs of differing sizes. The first group may have 2 ALUs, the second 4 and the third 8. Each group can support concurrent wavefronts (like hyperthreading). Basically executing multiple workloads at once. The power gating remains as described in post number 1.
> 
> That's what I figure AMD have done based on recent patent filings.


Reading the patent suggests that it's #1, and with there being a secondary scalar unit in the diagram, it's possible that when only one of the 16 ALUs in a cluster is active it can run at 4x speed (i.e. ~4 GHz).

Interesting stuff.


----------



## ubbb69

I think AMD has laid the groundwork to destroy NVIDIA. Taking over the consoles has established their low-level API with the game makers, and DX12 is just a low-level API. DX12 should also support multi-GPU natively. I see the future top-end cards being small-die multi-chip cards: small dies keep costs low, and if they push their CrossFire technology they should lose no performance on multi-chip cards.

Thanks for the great read in this thread.


----------



## PontiacGTX

Quote:


> Originally Posted by *Paul17041993*
> 
> Reading the patient suggests that it's #1, and with there being a secondary scalar unit in the diagram it's possible that when only one of the 16 ALUs in a cluster are active it can run at x4 speed (ie ~4GHz).
> 
> Interesting stuff.


How would a 2nd scalar unit allow higher frequencies in the vector SIMDs?


----------



## Paul17041993

Quote:


> Originally Posted by *PontiacGTX*
> 
> how a 2nd scalar would allow higher frequency in vectors SIMD?


The diagram indicates that the first ALU cluster was allocated as a scalar unit, hence a second scalar unit and only 3 remaining ALU clusters (set to 2, 4 and 8 respectively).

http://www.freepatentsonline.com/20160085551.pdf

I also just noticed, page 10, [0030] mentions the CU structure could be both physically designed or dynamically allocated. So the next GCN could possibly follow a hybrid style shader structure (ie not just a simple shader count) that varies between models...


----------



## Slaughterem

http://patents.justia.com/patent/9317296 I am not that versed in this, but if they can interrupt a mask, run an override mask momentarily, and then resume where they interrupted, I believe this would be a good thing for branch prediction, CPU or compiler culling of triangles, etc.
Quote:


> The provided method and storage medium have several beneficial attributes that promote increased performance of single program multiple thread code on SIMD hardware. For example, higher utilization of the SIMD hardware may be achieved. Furthermore, string comparison and other Standard Template Library (STL) like services within branchy code are improved and software prefetching performance in branchy code is improved. Furthermore, the impact of memory divergence on performance is reduced because workgroups are able to coordinate accesses instead of operating in separate logical execution streams. Additionally, permitting programmers to write more convergent code may improve power efficiency.


Quote:


> The execution mask of the SIMD array 121 is overridden at block 216. For example, software code may be generated to override the execution mask. Overriding the execution mask enables certain lanes 123 of the SIMD array 121. For example, an instruction may be included to set or clear a bit of the execution mask that indicates whether the lane associated with the bit will execute the current instruction. When the override portion of the code has completed, the execution mask may revert back to the status of the execution mask when the override portion was entered. Accordingly, a programmer may effectively take control of all of the execution resources of the machine when the programmer knows that the parallel nature of the hardware would improve execution of the software.
> 
> In some embodiments, a compiler inserts code to perform the operations 200. In some embodiments, the high level application programmer may insert_override_exec_mask_(OxFFFF) to override the execution mask for the lower 16 work items of the workgroup. In some embodiments, the high level language programmer may alter the code of Table 1 to resemble the code presented in Table 2.
> 
> TABLE 2: void kernel_begin(int N, char* str1, char* str2) { if ( threadIdx < N ) { _override_exec_mask_ { do_string_compare(str1, str2); } } }

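The override mechanism in the patent text can be mimicked in a few lines. This is a conceptual sketch only — the lane mask is modeled as a list of booleans, and `masked_apply` / `override_exec_mask` are made-up names, not the patent's API:

```python
# Conceptual sketch of the execution-mask override described in the
# patent excerpt above. NOT real GPU code.
WIDTH = 16  # work items in one SIMD cluster

def masked_apply(data, mask, fn):
    # Execute fn only on lanes whose mask bit is set (a divergent branch).
    return [fn(x) if active else x for x, active in zip(data, mask)]

def override_exec_mask(data, fn):
    # Temporarily force every lane on, like _override_exec_mask_(0xFFFF);
    # on real hardware the previous mask is restored afterwards.
    return masked_apply(data, [True] * WIDTH, fn)

data = list(range(WIDTH))
mask = [i < 4 for i in range(WIDTH)]  # only 4 of 16 lanes pass the branch
partial = masked_apply(data, mask, lambda x: x + 100)
full = override_exec_mask(partial, lambda x: x * 2)
print(partial[:5])
print(full[:5])
```

The point of the patent is exactly this: during the override, all 16 lanes do useful work instead of 12 of them idling through the divergent region.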

----------



## Tugrul512bit

Maybe dynamic allocation is an indicator of an FPGA? So it adapts to every different workload in milliseconds?


----------



## GorillaSceptre

Around 24 minutes in.


----------



## FunGamer1

Hi everyone.
I'm a long-time reader here, and because of this async stuff I decided about 6 months ago to buy a Fury X instead of a 980 Ti, which I thought to be best for my 2560x1080 screen. I'm not the typical enthusiast gamer, but I like tech, and in my opinion AMD has the more advanced technology. That's the reason I went for a GCN GPU: the tech inside the hardware is what fascinates me, not necessarily the output on the screen. Not complaining about performance though.

One question I do have, which is the reason why I finally signed up, even though it hasn't much to do with DX12: it's widely assumed that GCN, especially the Fury series, is bottlenecked by ROPs and/or geometry. But why does the gap to the big Maxwell cards close the higher the resolution goes? Doesn't resolution stress pixel throughput? If not, what does (other than tessellation)? Can someone explain how pixel throughput, geometry, and rasterization relate to each other? What part does memory bandwidth play?

I somewhere read an assumption that GCN might be memory-bandwidth bottlenecked, and so is big Maxwell, which is the reason performance evens out at high resolutions. Might that be true?
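One back-of-envelope way to look at this question: raw per-frame pixel traffic grows with the pixel count, while geometry work is roughly resolution-independent, so a geometry-limited GPU looks relatively better at 4K. A rough sketch (illustrative only — it ignores overdraw, texture traffic, and color compression, which multiply the real numbers considerably):

```python
# Rough per-frame pixel traffic from resolution alone.
# Ignores overdraw, texturing, and compression -- illustration only.
BYTES_PER_PIXEL = 4 + 4  # 32-bit color write + 32-bit depth

def gb_per_second(width, height, fps):
    return width * height * BYTES_PER_PIXEL * fps / 1e9

for w, h in ((1920, 1080), (2560, 1440), (3840, 2160)):
    print(f"{w}x{h}: {gb_per_second(w, h, 60):.2f} GB/s at 60 fps")
```

Pixel-side demand quadruples from 1080p to 2160p while the triangle count stays the same, which is one plausible reason GCN's geometry bottleneck matters less at high resolutions.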

Thanks in advance for a more in-depth reply; even though I'm not a tech expert, I like reading these topics to understand (my) hardware better.

Sorry if my English isn't perfect, it's not my native language.

Cheers,
Fungamer


----------



## NightAntilli

Quote:


> Originally Posted by *FunGamer1*
> 
> Hi everyone.
> I'm a long-time reader here and because of this async stuff I decided about 6 months ago to buy a Fury X instead of a 980Ti, which I thought to be best for my 2560*1080 screen. I'm not the typical enthusiast gamer but I like tech and in my opinion AMD has the more advanced technology. Thats the reason I went for a GCN GPU because the tech inside the hardware is what fascinates me, not necessarily the output on the screen. Not complaining about performance tho
> 
> One question I do have, which is the reason why i finally signed up, even though it hasnt much to do with DX12. Its widely assumed that GCN, especially the Fury series, is bottlenecked by ROPs and/or geometry. But why is the gap to the big Maxwell cards closing the higher the resolution goes? Doesnt resolution stress pixel throughput? If not, what does (other than tesselation)? Can someone explain me the stuff around pixel throughput, geometry, rasterization and how it relates between each other? Which part does memory bandwith play?
> 
> I somewhere read an assumption that GCN might be memory bandwith bottlenecked and so is big Maxwell which is the reason that performance evens out in high resolutions. Might that be true?
> 
> Thanks in advance for a more in depth reply because even tho I'm not a tech expert I like reading those topics to understand (my) hardware more.
> 
> Sorry if my english isnt perfect, it's not my native language
> 
> Cheers,
> Fungamer


I don't directly have an answer to your question, but the Fury cards definitely don't have a memory-bandwidth issue. HBM provides a lot of bandwidth, and I doubt that's a problem right now. I even think (just a wild thought of mine, no proof) AMD engineered the Fury cards with excess bandwidth because they thought that's where the bottleneck was in their prior cards, but it proved to be elsewhere, as we've seen in the day-to-day performance of the Fiji cards compared to their predecessors. It really has to be elsewhere, but where, I have no idea.


----------



## airfathaaaaa

Given that the Fury line was originally supposed to be 20 nm, it's safe to say they cut some things from the cards in order to make them happen on 28 nm. Now, what was cut, only Raja knows...


----------



## L36


Quote:


> Originally Posted by *CrazyElf*
> 
> So to summarize then for Vega, what we think will happen (our best guess)
> 
> 
> 4096 cores
> CUs are still 64 but this time potentially as you've suggested split between 2 groups of 32 SMIDs with an upgraded L1 cache; should result in shader efficiency improvements (and address the occupancy limits)
> 8 shader engines, so this means that there will be 8 shader engines x 8 CUs per shader engine x 64 SIMDs per CU
> ROP count will be doubled to 128, as will the geometry processor count
> Also the geometry processor output will be upgraded significantly, so more than double effective triangle output (important as this is a bottleneck on AMD GPUs right now and one reason why tessellation causes such huge frame rate drops)
> New hardware scheduler, which should significantly upgrade the power efficiency
> L2 cache is vastly upgraded (fewer DRAM requests needed)
> Instruction pre-fetching upgrades
> Major upgrades to the memory compression
> Last, but not least, there is of course HBM2 which doubles the memory bandwidth - effective bandwidth is more than doubled due to the color compression
> Then there's the multimedia cores and display engine. HEVC encoding is supported now. I'm expecting AMD to make a big push for VR with all of this. They'll retain the async advantage with the 8 ACEs.
> 
> All in all, even though they kept the SIMDs the same, Vega should be a big upgrade.
> They upgraded the cache quite a bit: https://images.nvidia.com/content/pdf/tesla/whitepaper/pascal-architecture-whitepaper.pdf
> Like AMD, they've upgraded the cache a lot.
> 
> They still don't have the 8 ACEs unlike the 290X / Fury X. That is the big difference. I think though that it may have to wait for Volta. Volta must be a huge leap then over Pascal.
> 
> An interesting speculation then becomes what will Volta vs Navi look like. We might see in the future a sort of convergence between the two GPU makers on what GPUs are like - compute heavy for heavier tasks.
> 
> 
> 
> We know from the slides that Navi will be upgrading the memory again - as for what scalability could mean remains a mystery. It could mean a giant die - the 14/20nm hybrid should be more mature by then.
> 
> My personal wet dream is to see a giant >600mm^2 AMD GPU + whatever memory upgrades they make.






Scalability is an easy one. As we go down the nodes, creating a larger die will increase in cost to insane amounts. Add the increasing complexity of chips at lower nodes, and yields will become atrocious.

I'm betting serious bank that AMD will use the interposer to place two, three, or however many much smaller dies and have them work together as a single GPU for Navi. There is a reason AMD keeps pushing CrossFire, and I think this will also come to consoles in the generation after the Polaris-based ones hit the market. This will be cheaper to manufacture as we go to very small nodes, rather than making big fat dies.


----------



## Paul17041993

Quote:


> Originally Posted by *NightAntilli*
> 
> I don't directly have an answer for your question, but the Fury cards definitely don't have a memory bandwidth issue. The HBM memory has a lot of bandwidth and I doubt that's an issue right now. I even think (just a wild thought of mine, no proof) AMD engineered the Fury cards with excess bandwidth because they thought that's where the bottlenecks are in their prior cards, but were proven to be elsewhere, as we've seen in the daily performance of the Fiji cards compared to their predecessors. It really has to be elsewhere, but where, I have no idea.


Shader optimisation can make a difference of about 1:4.5.
Quote:


> Originally Posted by *L36*
> 
> Scalability is an easy one. As we go down the nodes, creating a larger die will increase in cost to insane amounts. Add the increasing complexity of the chips at lower nodes, yields will become atrocious.
> 
> I'm betting serious bank that AMD will use the interposer to place two, tree or x amount of much smaller dies and have them work together as a single GPU for navi. There is a reason why AMD keeps pushing crossfire and I think this will also come to consoles in a generation after Polaris based ones hit the market. This will be cheaper to manufacture as we go to very small nodes rather than making bid fat dies.


Multi-die processors are nothing new; CPUs have been doing it for more than a decade. The trouble with trying to do it with GPUs is that you need a massive amount of bandwidth between the dies, far more than HBM1's method can provide. However, with what HBM2 can deliver, it's not impossible to reach the needed bandwidth between dies; in fact, that's what AMD was planning to do with at least their 300 W APU plan (2 CPU and 2 GPU clusters cross-linked via HBM2 on the GPU side and DDR4 on the CPU side).


----------



## amd955be5670

Is there any news about Pascal and Async Compute? Is it real this time? What conclusion would the experts draw from the available slides.


----------



## airfathaaaaa

Quote:


> Originally Posted by *amd955be5670*
> 
> Is there any news about Pascal and Async Compute? Is it real this time? What conclusion would the experts draw from the available slides.


It's showing regression, so no, it's not real: http://www.computerbase.de/2016-05/geforce-gtx-1080-test/11/#diagramm-ashes-of-the-singularity-async-compute
Nothing much in real-life terms, but it's still the only sign we have to tell they are still emulating it.


----------



## PontiacGTX

Quote:


> Originally Posted by *amd955be5670*
> 
> Is there any news about Pascal and Async Compute? Is it real this time? What conclusion would the experts draw from the available slides.


Nvidia added preemption, but not asynchronous compute.


----------



## bluezone

Quote:


> Originally Posted by *airfathaaaaa*
> 
> its showing regression so no its not real http://www.computerbase.de/2016-05/geforce-gtx-1080-test/11/#diagramm-ashes-of-the-singularity-async-compute
> tho nothing really much in real life but still its the only sign we have to actually know they are emulating it still


My version of that page is different, showing a slight improvement with async. I bookmarked it when it first came out.

http://www.computerbase.de/2016-05/geforce-gtx-1080-test/11/#diagramm-ashes-of-the-singularity-async-compute_2



Did they retest?

Never mind - it's 2160 vs 1440. 2160 only shows regression.


----------



## GnarlyCharlie

Different resolution?


----------



## bluezone

Quote:


> Originally Posted by *GnarlyCharlie*
> 
> Different resolution?


Thank you, I already caught that and edited as you were posting. Makes me wonder what's happening @ 2160 on the GTX 1080.


----------



## airfathaaaaa

Quote:


> Originally Posted by *bluezone*
> 
> thank you I already caught that and edited as you were posting. Makes me wonder what's happening @ 2160 on the GTX 1080.


Software emulation is never good compared to any hardware solution.


----------



## Paul17041993

The 1080, or Pascal so far in general, doesn't have anything at all to take true advantage of DX12 etc. It doesn't even have branch units, so a lot of the (I can't call them ALUs) FPUs are probably running idle in complex tasks, unless of course their pre-emption _actually works™_...

The 1080 is just a mid-range card with stupidly inefficient clocks.


----------



## CrazyElf

For the record (and this is widely circulated), rumors are there will be 2 parts: a 4096-core version and a 6144-core version.

 

One question I do have is the RBEs - what is the optimal ratio of Render Back Ends (RBEs) to CUs?

 

Going by this, it may be that, as you hypothesized, the 128 Z/Stencil ROPs are bottlenecking the RX 480: we see no improvement compared to the older 380X. I wonder, though, about the Fury X - do we have a bottleneck going on there too? Could more RBEs have helped? They increased the shader count by 45% compared to Hawaii, but the RBE count remained largely unchanged (although there were improvements elsewhere).

In Hawaii, we had 16 RBEs and 44 CUs, so 1 RBE : 2.75 CUs.
In Fiji, the ratio dropped: still 16 RBEs, but now 64 CUs, so 1 RBE : 4 CUs.
Finally, in Polaris 10, there are 8 RBEs but 36 CUs, so it has dropped even further, to 1 RBE : 4.5 CUs.
Is there an optimal ratio? Perhaps something in between Hawaii and Fiji?
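The ratios above can be reproduced directly from the unit counts given in this post:

```python
# RBE-to-CU ratios for the GPUs discussed above (RBEs, CUs).
chips = {"Hawaii": (16, 44), "Fiji": (16, 64), "Polaris 10": (8, 36)}

ratios = {name: cus / rbes for name, (rbes, cus) in chips.items()}
for name, ratio in ratios.items():
    print(f"{name}: 1 RBE : {ratio:.2f} CUs")
```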

Another question is, will this ratio change with future improvements? The CUs will get better for sure. Could the RBEs get improved somehow?

Quote:


> Originally Posted by *Mahigan*
> 
> I think AMD went a step further than splitting each CU into two groups of 32 ALUs. I think that AMD did one of two things (or both) for Polaris...
> 
> 1. Each CU retains 4 groups of 16 ALUs but each ALU can be individually power gated. Meaning that unused ALUs are powered down and used ALUs are boosted. So while a Polaris GPU may have a clockspeed of say 1Ghz, individual ALUs will have the capability of boosting to say 1.8GHz. The power savings from the shut down ALUs allow for the higher clock speed of the active ALUs.
> 
> 2. AMD will split each CU into 4 Groups of ALUs of differing sizes. The first group may have 2 ALUs, the second 4 and the third 8. Each group can support concurrent wavefronts (like hyperthreading). Basically executing multiple workloads at once. The power gating remains as described in post number 1.
> 
> That's what I figure AMD have done based on recent patent filings.


The real question is what percentage of performance they can get per CU compared to Polaris. Polaris got 15% more performance per CU than base GCN.

There is certainly potential for performance improvements in Vega:

- Vega will be shipping with HBM2 - some power savings right off the bat.
- If they add more RBEs, as discussed above, there is potential for gains.
- Vega may have other areas where the architecture has improved. Hopefully they will have resolved the occupancy limits with the two proposals: power gating, and splitting the CUs into different ALU group sizes.
- Finally, and least important, if the 16/20nm TSMC process is actually more efficient than the 14/20nm Samsung/GF process, there is potential there as well, should AMD switch to TSMC for Vega.

I think there is a lot of potential for sure.

Quote:


> Originally Posted by *Mahigan*
> 
> I don't think Local CU caches will change, neither will the LDS or GDS but the L2 cache will get a healthy increase. Increasing the cache sizes takes away from implementing more ROPs, CUs etc. So there's always a delicate balance that must be maintained.


4 MB may very well be the healthy compromise then, at least on the 4096 part. Would the 6144 part benefit from a larger cache - say, scaled up, 6 MB of L2?

Quote:


> Originally Posted by *Paul17041993*
> 
> The 1080, or pascal so far in general, doesn't have anything at all to take true advantage of DX12 etc. Doesn't even have branch units so a lot of the (I cant call them ALUs) FPUs are probably running idle in complex tasks, unless of course their pre-emption _actually works™_...
> 
> 1080 is just a mid-range card with stupidly inefficient clocks.


We will not know until Volta.
https://www.computerbase.de/2016-05/geforce-gtx-1080-test/11/



The situation is largely unchanged compared to the GTX 1080 vs. Fury X matchup, only the GTX 1080 now has a lot more brute force. A 4096-core Vega should expose this, and if there is a 6144-core Vega, it should blow the large Pascal out of the water, assuming a larger cache and sufficient RBEs.


----------



## Paul17041993

Pretty sure the L2 scales linearly with the number of active CUs due to the linking design, but changing the available size per L2 unit probably isn't difficult, and it would very likely be increased anyway, since more cache can fit on the smaller node.

As for the RBEs and their ratio, from what I can tell they were just left in low numbers because they have little importance to overall performance in real-world scenarios; increasing them along with the CU count might just result in reduced efficiency, and possibly even reduced performance, due to the complexity.

I haven't found much about what specifically goes on in these units, though, so I can't say much about them...


----------



## CrazyElf

Quote:


> Originally Posted by *Paul17041993*
> 
> Pretty sure the L2 scales linearly with the amount of active CUs due to the linking design, but changing the active available size of L2 per L2 unit probably isn't anything difficult, and would highly likely be increased anyway due to being able to fit more on the smaller node.


Not true.

Cache sizes can be upgraded - for example the Fury X with 4096 SP had 2 MB of L2 cache, which is the same as on Polaris with 2304 SP. It just comes at the expense of transistors elsewhere so there are trade-offs.
Quote:


> Originally Posted by *Paul17041993*
> 
> As for the RBE's and their ratio, from what I can tell they were just left in low numbers as they have little importance to the overall performance in real world scenarios, increasing them with the amount of CU's might just result in reduced efficiency and possibly even reduced performance due to the complexity.
> 
> Haven't found much about what specifically goes on in these units though so I cant say much about them...


The problem is that in Polaris, Mahigan noted that the Z/Stencil ROPs are bottlenecking the chip. It needed 16 RBEs to have 256 Z/Stencil ROPs.



Note the lack of improvement compared to the R9 380X, despite the higher clocks and the new architecture. The 119 GT/s is not far from the theoretical maximum of 128 GT/s. In order for Polaris to reach its potential, it needed 16 RBEs, for a 256 GT/s maximum. GCN would have allowed this.

I'm wondering if, at 249 GT/s, the Fury X was also bottlenecked (its theoretical maximum is 256 GT/s). In that case, the Fury X also needs more RBEs.
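The theoretical figures quoted here follow from RBE count x Z/stencil units per RBE x clock. The sketch below assumes 16 Z/stencil units per RBE (as implied by "16 RBEs to have 256 Z/Stencil ROPs") and a nominal 1 GHz clock for simplicity:

```python
# Theoretical Z/stencil fill rate in gigatests per second (GT/s).
# Assumes 16 Z/stencil units per RBE and a nominal 1 GHz clock --
# back-of-envelope only, not measured figures.
Z_UNITS_PER_RBE = 16

def peak_gts(rbes, clock_ghz=1.0):
    return rbes * Z_UNITS_PER_RBE * clock_ghz

print(peak_gts(8))   # RX 480-class: 8 RBEs
print(peak_gts(16))  # Fury X-class: 16 RBEs
```

With measured rates of 119 GT/s and 249 GT/s against these 128 and 256 GT/s ceilings, both chips sit above 90% of theoretical, which is why the Z/stencil path looks like the limiter.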



Logically, a chip with almost 2x as many CUs is going to need 32 RBEs (4096 SPs vs 2304 SPs in Polaris), and likewise the 6144-core chip will need 48 RBEs. At the very least, it needs 24 RBEs on the 4096-core part and 36 on the 6144-core part.

My preferred designs:

4096-shader part:

- 4 MB of L2 cache
- 8 shader engines x 8 CUs per shader engine x 64 ALUs per CU
- Each CU is then divided into different groups - rather than 4 groups of 16, multiple groups of 2, 4, 8, and 16 ALUs, each of which can power gate, while the ALUs most in use can turbo
- Double the triangle performance of the RX 480 (from the 8 shader engines)
- I would prefer 4 Render Back Ends per shader engine (so a total of 32) - meaning 128 color ROPs and 512 Z/Stencil ROPs (we think this may be a bottleneck on Polaris)
- Front end will have a new Command Processor and 8 ACEs
- 8 GB of HBM2 (2 stacks of 4 GB) on a 2048-bit bus
- Whatever other architectural upgrades they have made for Vega
- They will also need to make sure the memory controller can actually use the VRAM to its full potential (the Fury X only used 333-387 GB/s of its 512 GB/s)

This would be a GTX 1080 competitor. If the color compression is good enough, then GDDR5X might be used instead, at the expense of power consumption.

The 6144 part would take everything and add 50%, save the HBM2, which would be 16 GB (4 stacks of 4 GB) on a 4096-bit bus. It must have HBM2.

6144 shader part
6 MB of L2 cache
12 shader engines x 8 CUs per shader engine x 64 ALUs per CU
Each CU is then divided into different groups - rather than 4 groups of 16, multiple groups of 2, 4, 8, and 16 ALUs, each of which can power gate, while the ALUs most in use can turbo
Will need to triple the triangle performance of the RX480 (3x as many geometry processors)
I would prefer 4 Render Back Ends per shader engine (so a total of 48) - meaning 192 color ROPs and 768 Z/Stencil ROPs (we think this may be a bottleneck on Polaris)
Front end will have a new Command Processor and 12 ACEs
16 GB of HBM2 (4 stacks of 4 GB) on a 4096 bit bus
Whatever other architectural upgrades they have made for Vega
They will also need to make sure the memory controller can actually use the VRAM to its full potential (Fury X only used 333-387 GB/s of the 512 GB/s) - especially important for the big Vega as it will be for high resolution gaming
This would take on big Pascal.
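
The "add 50%" scaling for the 6144-shader part checks out arithmetically, using the same assumed 1.2 GHz clock and 2.0 Gbps pin speed as for the smaller part:

```python
# The 6144-SP part is the 4096-SP design scaled by 1.5x, with a doubled HBM2 bus.
# The 1.2 GHz clock and 2.0 Gbps pin speed are assumptions.
shaders, clock_ghz = 6144, 1.2
tflops = shaders * 2 * clock_ghz / 1000.0   # 2 FLOPs per FMA per ALU per clock
bandwidth_gbs = 4096 * 2.0 / 8.0            # 4096-bit bus at 2.0 Gbps per pin

print(round(tflops, 2))   # 14.75 TFLOPS FP32, 1.5x the 4096-SP part
print(bandwidth_gbs)      # 1024.0 GB/s, double the 2-stack configuration
```

That roughly 1 TB/s of bandwidth is why the post insists the big part "must have HBM2" rather than GDDR5X.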

With Polaris, they've all but wiped out the triangle "gap" that GCN had relative to Nvidia. We've also seen some improvements to Color Compression. I suspect Nvidia is still better here, but with so much bandwidth in HBM2, it might not matter. On the Polaris front, I'd like to see a revised Polaris with double the RBEs - kind of like what AMD did going from the 4870 to the 4890: refinements.

RTG has had plenty of time to refine Vega, so they must have gone back to the drawing board by now to see what works well and what doesn't, based on their years of experience with GCN. By nature, they are at a disadvantage under DX11, but with compute becoming more important, the FP64 hardware is by no means wasted die area for gaming from now on.


----------



## Paul17041993

Quote:


> Originally Posted by *CrazyElf*
> 
> Quote:
> 
> 
> 
> Originally Posted by *Paul17041993*
> 
> Pretty sure the L2 scales linearly with the amount of active CUs due to the linking design, *but changing the active available size of L2 per L2 unit probably isn't anything difficult*, and would highly likely be increased anyway due to being able to fit more on the smaller node.
> 
> 
> 
> Not true.
> 
> Cache sizes can be upgraded
Click to expand...

Did I not say that?



Quote:


> Originally Posted by *CrazyElf*
> 
> The problem is that in Polaris, Mahigan noted that the Z/Stencil ROPs are bottlenecking the chip. It needed 16 RBEs to have 256 Z/Stencil ROPs.


Considering the 480 is optimised for 1080p, I highly doubt increasing the RBEs would have an effect in actual games. The Fury also has an optimal amount for its 4K target resolution, and I would expect the new "Fury" to also have an optimal balance. On the other hand, it may have the same 16 RBEs, as the majority of render time in modern games is the CUs doing lighting calculations, especially with the rise of pure-compute render tasks such as ray tracing.


----------



## PontiacGTX

Wccftech went digging for rumours and either found that patent or had a source:
http://wccftech.com/amd-vega-10-vega-11-magnum/


----------



## Paul17041993

Quote:


> Originally Posted by *PontiacGTX*
> 
> Wccftech went digging for rumours and either found that patent or had a source:
> http://wccftech.com/amd-vega-10-vega-11-magnum/


Yea, I had seen that article a while back. It's basically reinforcing the theory that Polaris was the basic shrink-and-tune and Vega is the true successor update.

Depending on how effective the gating is, we're looking at something that is simultaneously faster at basic _and_ complex/branching code as well as being more efficient with async. I think it's getting to a point where CPUs are almost pointless...


----------



## L36

Quote:


> Originally Posted by *Paul17041993*
> 
> Yea, I had seen that article a while back. It's basically reinforcing the theory that polaris was the basic shrink-and-tune and vega is the true successor update.
> 
> Depending on how effective the gating is, we're looking at something that is simultaneously faster at basic _and_ complex/branching code as well as being more efficient with async. *I think it's getting to a point where CPUs are almost pointless...*


That's a bold statement to make...


----------



## Paul17041993

Quote:


> Originally Posted by *L36*
> 
> Quote:
> 
> 
> 
> Originally Posted by *Paul17041993*
> 
> *I think it's getting to a point where CPUs are almost pointless...*
> 
> 
> 
> That's a bold statement to make...
Click to expand...

Certainly is now.

But jokes and x86 domination aside, GCN is already intelligent and capable enough to run literally everything with reasonable efficiency; adding a more dynamic ALU system further increases its capability to do so.


----------



## L36

Spoiler: Warning: Spoiler!



Quote:


> Originally Posted by *CrazyElf*
> 
> Not true.
> 
> Cache sizes can be upgraded - for example the Fury X with 4096 SP had 2 MB of L2 cache, which is the same as on Polaris with 2304 SP. It just comes at the expense of transistors elsewhere so there are trade-offs.
> The problem is that in Polaris, Mahigan noted that the Z/Stencil ROPs are bottlenecking the chip. It needed 16 RBEs to have 256 Z/Stencil ROPs.
> 
> 
> 
> Note the lack of improvement compared to the R9 380X, despite the higher clocks and the new architecture. The 119 GT/s is not far from the theoretical maximum of 128 GT/s. In order for Polaris to reach its potential, it needed 16 RBEs, for a 256 GT/s maximum. GCN would have allowed this to happen.
> 
> I'm wondering if at 249 GT/s, the Fury X was also bottlenecked (theoretical maximum is 256 GT/s). In that case, the Fury X also needs more RBEs.
> 
> 
> 
> Logically a chip with almost 2x as many CUs is going to need 32 RBEs (4096 SPs vs 2304 SPs in Polaris) and, again, the 6144 core chip will need 48 RBEs. At the very least, it needs 24 RBEs on the 4096 core part and 36 on the 6144 core part.
> 
> My preferred designs:
> 
> 4096 shader part
> 4 MB of L2 cache
> 8 shader engines x 8 CUs per shader engine x 64 ALUs per CU
> Each CU is then divided into different groups - rather than 4 groups of 16, multiple groups of 2, 4, 8, and 16 ALUs, each of which can power gate, while the ALUs most in use can turbo
> Double the triangle performance of the RX480 (from 8 Shader Engines)
> I would prefer 4 Render Back Ends per shader engine (so a total of 32) - meaning 128 color ROPs and 512 Z/Stencil ROPs (we think this may be a bottleneck on Polaris)
> Front end will have a new Command Processor and 8 ACEs
> 8 GB of HBM2 (2 stacks of 4 GB) on a 2048 bit bus
> Whatever other architectural upgrades they have made for Vega
> They will also need to make sure the memory controller can actually use the VRAM to its full potential (Fury X only used 333-387 GB/s of the 512 GB/s)
> This would be a GTX 1080 competitor. If the color compression is good enough, then GDDR5X might be used, at the expense of power consumption.
> 
> The 6144 part would take everything and add 50%, save the HBM2, which would have 16 GB (4 stacks of 4GB) on a 4096 bit bus. It must have HBM2.
> 
> 6144 shader part
> 6 MB of L2 cache
> 12 shader engines x 8 CUs per shader engine x 64 ALUs per CU
> Each CU is then divided into different groups - rather than 4 groups of 16, multiple groups of 2, 4, 8, and 16 ALUs, each of which can power gate, while the ALUs most in use can turbo
> Will need to triple the triangle performance of the RX480 (3x as many geometry processors)
> I would prefer 4 Render Back Ends per shader engine (so a total of 48) - meaning 192 color ROPs and 768 Z/Stencil ROPs (we think this may be a bottleneck on Polaris)
> Front end will have a new Command Processor and 12 ACEs
> 16 GB of HBM2 (4 stacks of 4 GB) on a 4096 bit bus
> Whatever other architectural upgrades they have made for Vega
> They will also need to make sure the memory controller can actually use the VRAM to its full potential (Fury X only used 333-387 GB/s of the 512 GB/s) - especially important for the big Vega as it will be for high resolution gaming
> This would take on big Pascal.
> 
> With Polaris, they've all but wiped out the triangle "gap" in GCN that it had with Nvidia. We've also seen some improvements to Color Compression. I suspect Nvidia is still better here, but with so much bandwidth in HBM2, it might not matter. On the Polaris front, I'd like to see a revised Polaris with double the RBEs - kind of like what AMD did from 4870 to 4890, made refinements.
> 
> RTG has had plenty of time to refine Vega, so they must have gone back to the drawing boards to see what works well and what doesn't by now based on their years of experience with GCN. By nature, they are at a drawback with DX11, but with Compute being more important, the FP64 is by no means wasted die for gaming from now on.






Based on recent rumors, this post seems to be materializing - certainly for the alleged 4096-core part and the performance it produced in AoS.


----------



## Paul17041993

Two big questions I have, assuming Vega has full ALU gating: how fast can the multipliers change, and can the switching be controlled via shader instructions/hints? However, if the core can allocate a single CU to sequential work and leave the rest for pure parallel work, then simply switch the required threads between them or use additional fences, that would work almost perfectly, with only very minor potential drops in parallel performance (for a 64-CU core this would be moot).
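
The "moot for a 64-CU core" remark can be quantified with a toy calculation: the parallel-throughput cost of reserving a handful of CUs for sequential work is just the reserved fraction of the chip.

```python
# Fraction of parallel throughput lost by reserving some CUs for sequential work.
# Toy model: loss is simply the reserved fraction of the total CU count.
def parallel_loss(total_cus: int, reserved: int = 1) -> float:
    return reserved / total_cus

print(f"{parallel_loss(64):.1%}")   # 1.6% on a 64-CU part -> effectively moot
print(f"{parallel_loss(16):.1%}")   # 6.2% on a small 16-CU part -> starts to matter
```

This ignores scheduling overhead and fences entirely; it only shows why the raw ALU cost is negligible on a big chip.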


----------



## mtcn77

I think I finally understand what the texture filtering rate signifies: it means "writes to memory".
The random textures are non-delta-compressible, and the black textures are the delta-compressible reads from memory, so we have 1 write benchmark and 2 different read benchmarks.
From this perspective, AMD cards run at a surplus of total bandwidth for writes and Nvidia cards at a deficit; however, Nvidia cards read much faster as long as textures are compressible.


----------



## Paul17041993

Quote:


> Originally Posted by *mtcn77*
> 
> I think I understood at last what the texture filtering rate signifies: it means 'writes to memory'.
> The random is non-delta-compressible and the black textures are the delta-compressible reads from the memory, so we have 1 write and 2 different read benchmarks.
> In this perspective, AMD cards run at a surplus of total bandwidth for writes and Nvidia cards at a deficit; however Nvidia cards read much faster so long as textures are compressible.


The 2:1 read:write ratio exists for the same reason Zen/Ryzen has it: general compute. Additionally, reading textures can be faster per unit due to smaller colour data (24- or 32-bit most of the time), compared to writing buffers, which tend to use floating point (128-bit per texel, or 64-bit if using FP16).
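
The per-texel size gap described above is easy to put in numbers; the formats below are illustrative examples of typical read and write targets, not a claim about any specific benchmark:

```python
# Bytes per texel for common read (texture) vs write (render-target) formats.
FORMATS = {
    "RGBA8 (typical texture read)": 4,    # 32-bit colour
    "FP16 RGBA (HDR buffer write)": 8,    # 64-bit
    "FP32 RGBA (compute buffer)":   16,   # 128-bit
}

read_bytes = FORMATS["RGBA8 (typical texture read)"]
write_bytes = FORMATS["FP32 RGBA (compute buffer)"]
print(write_bytes // read_bytes)   # a 128-bit write moves 4x the data of a 32-bit read
```

So even with equal read and write unit counts, write bandwidth per texel is the heavier load, which is one argument for the 2:1 unit ratio.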


----------



## mtcn77

Quote:


> Originally Posted by *Paul17041993*
> 
> The reason for the 2:1 r:w is for the same reason that zen/ryzen has it, general compute. Additionally reading textures can be faster per unit due to smaller colour data (24 or 32bit most of the time), compared to writing buffers that tend to use FP (128bit per texture or 64 if using FP16).


Yet, I still don't understand why they put so much reliance on delta compression for general compute. If memory stalls due to incompressible textures, throughput suffers by 18% (247/292 GB/s). Nvidia cards seem to have sufficient bandwidth to write 128-bit loads with 10-14% higher throughput than 32/64-bit loads (this signifies a ROP deficit rather than a bandwidth deficit).
Note that none of this happens with AMD. If all this trouble was for general compute, the cards would not jeopardize performance on that feature, imo.
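
There are two ways to read the 18% figure from the 247 vs 292 GB/s numbers quoted above, and the arithmetic is worth spelling out (figures taken as quoted in the post):

```python
# Two readings of the 247 vs 292 GB/s gap (figures as quoted in the post).
achieved, peak = 247.0, 292.0
loss = 1 - achieved / peak      # throughput lost when textures are incompressible
needed = peak / achieved - 1    # extra throughput needed to close the gap

print(f"{loss:.1%}")     # 15.4% loss relative to peak
print(f"{needed:.1%}")   # 18.2% gain needed -> the "18%" quoted in the post
```

The quoted 18% corresponds to the second reading (gain needed to reach peak); the raw deficit is closer to 15%.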


----------



## Paul17041993

Quote:


> Originally Posted by *mtcn77*
> 
> Yet, I still don't understand why they put so much reliance on delta compression for general compute. If memory stalls due to incompressible textures, throughput suffers by 18% (247/292 GB/s). Nvidia cards seem to have sufficient bandwidth to write 128-bit loads with 10-14% higher throughput than 32/64-bit loads (this signifies a ROP deficit rather than a bandwidth deficit).
> Note that none of this happens with AMD. If all this trouble was for general compute, the cards would not jeopardize performance on that feature, imo.


Compute can't really make much use of compression due to accuracy loss; delta compression is usually only active on standard FP32 RGBA buffers in OGL and DX. Vulkan and DX12, however, use a long list of different rules for different techniques that activate in the hardware (at least for GCN, anyway).

If you're thinking about the 4x0 series' memory-bound-ness, that simply comes down to them being mid-range cards, and it's better to be memory bound, as that's more efficient than over-stressing the core.


----------



## mtcn77

Quote:


> Originally Posted by *Paul17041993*
> 
> Compute can't really make much use of compression due to accuracy loss; delta compression is usually only active on standard FP32 RGBA buffers in OGL and DX. Vulkan and DX12, however, use a long list of different rules for different techniques that activate in the hardware (at least for GCN, anyway).
> 
> If you're thinking about the 4x0 series' memory-bound-ness, that simply comes down to them being mid-range cards, and it's better to be memory bound, as that's more efficient than over-stressing the core.


Quote:


> Originally Posted by *Paul17041993*
> 
> The reason for the 2:1 r:w is for the same reason that zen/ryzen has it, general compute.


Are these two statements related? They seem contrary to one another.


----------



## Paul17041993

Quote:


> Originally Posted by *mtcn77*
> 
> Are these two statements related? They seem contrary to one other.


Which way do you mean? The amount of read and write units is separate from their techniques; however, both are generally optimised for compute accuracy and flexibility.


----------



## mtcn77

I bet Mr. Scott Wasson's words have much to say about Fury's performance:
Quote:


> So why is AMD testing in this fashion? Probably because it plays to the Fiji GPU's strengths - that enormous shader array - while not leaning so hard on its potential areas of relative weakness, like ROP throughput (for MSAA resolve), texturing (for anisotropic filtering), and small triangles (since polys are relatively smaller at lower resolutions). Taking this peculiar path likely puts the R9 Nano in the best possible competitive light.


[Source]
I think I'm figuring out what all this means. Texture quality might be high, but with shadows low, anisotropic filtering at 2x (my find), surface texture optimisations on, and the driver's texture quality set to performance, there's lots of resolution to throw around (I wish there was a separate polygon count setting).
Ironically, TechReport suggests Fiji has low texturing performance; however, in ixbt's GTX 1080 review, the Fury X shows quite a bit more performance than either Nvidia card until the hardest test, which pushes the fillrate to inordinate amounts, I suppose. Then again, I never suggested anisotropic filtering should be more than 2x, so it doesn't matter, imo.[Source]
I wish there was a best-practices comparison between the two graphics makers.


----------



## semitope

Quote:


> Originally Posted by *mtcn77*
> 
> I bet Mr. Scott Wasson's words have much to say about Fury's performance:
> [Source]
> I think I'm figuring out what all this means. Texture quality might be high, but shadows low, anisotropic filtering 2x(my find), surface texture optimisations on, texture quality of the driver to be performance with lots of resolution to throw around(I wish there was a separate polygon count setting).
> Ironically, TechReport suggests Fiji has low texturing performance; however in the GTX 1080 review of ixbt, Fury X has quite a bit more performance than either Nvidia card until the hardest test which pushes the fillrate to inordinate amounts, I suppose. Then again, I never suggested anisotropic filtering should be more than 2x, so doesn't matter, imo.[Source]
> I wish there was a best practices comparison between the two graphics makers.


MSAA at 4K is pointless to demand right now, and I would expect it off; I see no problem with them having it at 0. I hope that author does the same if Nvidia puts up numbers, because that whole section seems pointless. A Nano compared to a mini 970 would win at whatever settings; there is no reason to get suspicious about it. I would guess AMD has their reasons for using 0x anisotropic filtering. In practice, the performance impact does not suggest it's a big deal. One could counter-speculate that they leave it off because Nvidia's filtering is sub-par and would give them an unfair advantage, so it's better to just leave it off.

Not sure he's accurate about those points either. ROP and texturing: unlikely. Tessellation: sure. But tessellation-generated polygons? I doubt it.


----------



## mtcn77

Quote:


> Originally Posted by *semitope*
> 
> MSAA at 4K is pointless to demand right now and I would expect it off. I see no problem with them having it at 0. I hope that author does the same if nvidia puts up numbers because that whole section seems pointless. Nano compared to a mini 970 at whatever settings would win, there is no reason to get suspicious about it. I would guess AMD has their reasons for using 0x anisotropic filtering. In practice the performance impact does not suggest its a big deal. One could counter-speculate that they do that because nvidia's filtering is sub-par and would give them an unfair advantage, so its better to just leave it off.
> 
> Not sure he's accurate about those points either. ROP and texturing unlikely. Tessellation sure. But not tessellation generated polygons I doubt.


I recall an ATi engineer reciting that native 16x anisotropic filtering would look up 128 texture calls, which he connoted as "insane" for the task. Cross-checking that with the ixbt review findings, AMD Radeon seems rather strong until the setting that calls for 80-400 textures. I cannot exactly correlate which setting that level matches, but 2x is all I need. Whether that is a fault of my card or my driver, it is the one setting where I can dismiss any concern about moire patterns.
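
The "128 texture calls" figure lines up with the textbook sampling arithmetic for trilinear anisotropic filtering (this breakdown is the standard one, not from the cited engineer): 4 bilinear taps, across 2 mip levels, for each anisotropic sample along the line of anisotropy.

```python
# Worst-case texture fetches for trilinear anisotropic filtering:
# bilinear taps (4) x mip levels (2) x anisotropic samples (N).
def aniso_fetches(max_aniso: int, bilinear_taps: int = 4, mip_levels: int = 2) -> int:
    return bilinear_taps * mip_levels * max_aniso

print(aniso_fetches(16))  # 128 fetches at 16x AF -- the "insane" figure
print(aniso_fetches(2))   # 16 fetches at 2x AF
```

This is a worst-case count; real hardware takes fewer samples on surfaces that aren't at a steep angle.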


----------



## PontiacGTX

Enhanced async compute is possibly going to be used, along with 2x FP16 ops and some changes to the arithmetic operations/instructions:


Quote:


> "Vega" NCU with Rapid Packed Math
> GPUs today often use more mathematical precision than
> necessary for the calculations they perform. Years ago, GPU
> hardware was optimized solely for processing the 32-bit
> floating point operations that had become the standard for
> 3D graphics. However, as rendering engines have become
> more sophisticated - and as the range of applications for
> GPUs has extended beyond graphics processing - the value
> of data types beyond FP32 has grown.
> The programmable compute units at the heart of "Vega"
> GPUs have been designed to address this changing
> landscape with the addition of a feature called Rapid
> Packed Math. Support for 16-bit packed math doubles peak
> floating-point and integer rates relative to 32-bit
> operations. It also halves the register space as well as the
> data movement required to process a given number of
> operations. The new instruction set includes a rich mix of
> 16-bit floating point and integer instructions, including
> FMA, MUL, ADD, MIN/MAX/MED, bit shifts, packing
> operations, and many more.
> 
> For applications that can leverage this capability, Rapid
> Packed Math can provide a substantial improvement in
> compute throughput and energy efficiency. In the case of
> specialized applications like machine learning and training,
> video processing, and computer vision, 16-bit data types are
> a natural fit, but there are benefits to be had for more
> traditional rendering operations, as well. Modern games,
> for example, use a wide range of data types in addition to
> the standard FP32. Normal/direction vectors, lighting
> values, HDR color values, and blend factors are some
> examples of where 16-bit operations can be used.
> With mixed-precision support, "Vega" can accelerate the
> operations that don't benefit from higher precision while
> maintaining full precision for the ones that do. Thus, the
> resulting performance increases need not come at the
> expense of image quality.
> In addition to Rapid Packed Math, the NCU introduces a
> variety of new 32-bit integer operations that can improve
> performance and efficiency in specific scenarios. These
> include a set of eight instructions to accelerate memory
> address generation and hashing functions (commonly used
> in cryptographic processing and cryptocurrency mining), as
> well as new ADD/SUB instructions designed to minimize
> register usage.
> The NCU also supports a set of 8-bit integer SAD (Sum of
> Absolute Differences) operations. These operations are
> important for a wide range of video and image processing
> algorithms, including image classification for machine
> learning, motion detection, gesture recognition, stereo
> depth extraction, and computer vision. The QSAD
> instruction can evaluate 16 4x4-pixel tiles per NCU per
> clock cycle and accumulate the results in 32-bit or 16-bit
> registers. A maskable version (MQSAD) can provide
> further optimization by ignoring background pixels and
> focusing computation on areas of interest in an image.
> The potent combination of innovations like Rapid Packed
> Math and the increased clock speeds of the NCU deliver a
> major boost in peak math throughput compared with
> earlier products, with a single "Vega" 10 GPU capable of
> exceeding 27 teraflops or 55 trillion integer ops per second.
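
The 27 TFLOPS claim at the end of the quote follows directly from packed-math doubling. The ~1.67 GHz boost clock assumed below is an illustrative figure, not from the whitepaper:

```python
# Peak throughput for a 64-CU "Vega" 10 with Rapid Packed Math.
# The ~1.67 GHz boost clock is an assumed figure for illustration.
cus, alus_per_cu, clock_ghz = 64, 64, 1.67
fp32_tflops = cus * alus_per_cu * 2 * clock_ghz / 1000.0   # 2 FLOPs per FMA
fp16_tflops = fp32_tflops * 2                               # two FP16 ops packed per 32-bit lane

print(round(fp32_tflops, 2))   # 13.68 TFLOPS FP32
print(round(fp16_tflops, 2))   # 27.36 TFLOPS FP16 -> the "exceeding 27 teraflops" claim
```

Packing two FP16 values into each 32-bit register lane is what doubles the rate; it also halves register space and data movement, exactly as the quote describes.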


----------

