Explanation for how different processor cores are connected?

by Drew Sebastino on 2018-05-06 (#218081)

With how just about every modern processor is multi core, this is something that should be trivial to find, but it isn't for whatever reason.

From what I've seen, most every multi core processor has at least two levels of cache; L1 is for each core, and L2 is shared by all of them. What I'm wondering, is how is L2 shared? Assuming we're talking about a dual core processor, is it that even cycles go to one core, and odd cycles go to the other? Or does each processor generally handle this differently? Of course, there's the issue of general external ram then; can each core even interact with ram directly, or must it perform a dma transfer to move the data in and out of cache in chunks?

Re: Explanation for how different processor cores are connec
by Bregalad on 2018-05-07 (#218089)

Quote:

Assuming we're talking about a dual core processor, is it that even cycles go to one core, and odd cycles go to the other?

No. Only L1 cache is accessed randomly; L2 cache is read in bursts, which are copied to the L1 caches before being accessed randomly, and when the data is no longer needed it's written back to L2 cache (or further down). So basically, when either core needs data that is not in its L1 cache, its code execution halts and there's a burst read from L2 to L1 before it can resume its activity. I suppose that if the other core needs data at the same time, it has to wait for the first core's burst and then its own burst before continuing. This only works because execution can be so much faster once the data is in L1 cache.

This is an extremely complex topic which is the subject of research at master's and PhD levels, and actually this is the source of the infamous Spectre and Meltdown vulnerabilities.

Quote:

Of course, there's the issue of general external ram then; can each core even interact with ram directly, or must it perform a dma transfer to move the data in and out of cache in chunks?

Multiple models of caching exist, and each has advantages and drawbacks. For instance, write-through cache is simpler because writes are performed to cache and RAM simultaneously, while only reads are truly cached. This avoids the problem of data in RAM being out of sync with reality (note: this is really a huge problem when you have multiple cores), but it really halves the efficiency of caching in the first place.
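
To make the write-through trade-off concrete, here is a minimal toy model in Python. It is only a sketch: the class names, the dictionary standing in for RAM, and the write-back variant used for comparison are all illustrative assumptions, not a model of any real CPU. It counts how many writes reach "RAM" under each policy when the same address is written repeatedly.

```python
# Toy model only: class names and the dict-based "RAM" are illustrative.

class WriteThroughCache:
    def __init__(self):
        self.lines = {}       # address -> value held in the cache
        self.ram = {}         # backing store
        self.ram_writes = 0

    def write(self, addr, value):
        self.lines[addr] = value
        self.ram[addr] = value    # every write also goes straight to RAM
        self.ram_writes += 1


class WriteBackCache:
    def __init__(self):
        self.lines = {}
        self.dirty = set()    # addresses modified since the last flush
        self.ram = {}
        self.ram_writes = 0

    def write(self, addr, value):
        self.lines[addr] = value
        self.dirty.add(addr)      # RAM is now stale for this address

    def flush(self):
        for addr in sorted(self.dirty):
            self.ram[addr] = self.lines[addr]
            self.ram_writes += 1
        self.dirty.clear()


wt, wb = WriteThroughCache(), WriteBackCache()
for i in range(1000):
    wt.write(0x10, i)     # the same address is written 1000 times
    wb.write(0x10, i)

stale_before_flush = wb.ram.get(0x10)   # None: RAM has seen nothing yet
wb.flush()
print(wt.ram_writes, wb.ram_writes)     # prints: 1000 1
```

In this sketch the write-back version touches RAM once instead of a thousand times, but until flush() runs, RAM holds stale data, which is exactly the out-of-sync problem described above.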

Re: Explanation for how different processor cores are connec
by Oziphantom on 2018-05-07 (#218106)

The cache is transparent to the CPU; the CPU just says "I want address X", it never engages in cache management. So whether it's in L1/L2/L3 or RAM makes no difference to the CPU.

The cache managers deal with what pages should and shouldn't be swapped in and out of each cache. A variety of techniques and methods have been used over the years, based upon co-locality, recent usage hits, etc. The x86 actually forms read and flush lanes (or at least it used to; this might have changed by now), and once a lane gets full, it flushes the data out and down. There are ways to command the cache controllers to force a RAM read, with special addresses and commands etc.; it's only something you need to do very rarely. I think I've done it a whole 5 times in my career.

There is nothing to stop both CPUs trying to write the same RAM at the same time; this is known as a race condition. While the logic for the RAM will have a priority system to handle such clashes, the cache logic is unable to determine which data is the right one, and the onus is on the programmer to ensure that no race conditions happen. Ensuring this is what makes multi-core or even multi-thread programming very tricky.
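
As a rough illustration of what "the onus is on the programmer" means in practice, here is a minimal Python sketch using threads and a lock. The names (counter, add_many) are made up for the example, and Python threads are only a stand-in for cores here; the point is the shape of the fix, not the hardware behaviour.

```python
import threading

counter = 0
lock = threading.Lock()

def add_many(n):
    global counter
    for _ in range(n):
        # "counter += 1" is a read-modify-write. Without the lock, two
        # threads can interleave between the read and the write and one
        # update can be lost. Holding the lock makes the step exclusive.
        with lock:
            counter += 1

threads = [threading.Thread(target=add_many, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)   # prints: 40000
```

Removing the "with lock:" line turns this back into a race: the result may or may not still be 40000, and that unpredictability is what makes such bugs hard to find.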

The ARM multi-cores are typically not connected via cache, and you have to use special "pigeon hole" transfers to get data from one to the other.

Re: Explanation for how different processor cores are connec
by Drew Sebastino on 2018-05-07 (#218140)

Bregalad wrote:

this is the source of the infamous Spectre and Meltdown vulnerabilities.

I thought it was due to some sort of out-of-order execution mumbo jumbo, unless that's what this is.

Bregalad wrote:

write-through cache is simpler because writes are performed to cache and RAM simultaneously

So there's no discrepancy between cache and ram I guess?

Oziphantom wrote:

The cache is transparent to the CPU; the CPU just says "I want address X", it never engages in cache management. So whether it's in L1/L2/L3 or RAM makes no difference to the CPU.

Interesting... Do you have any idea where cache and ram are located in the address space, unless I'm not understanding this correctly?

Oziphantom wrote:

There is nothing to stop both CPUs trying to write the same RAM at the same time, this is known as a race condition, while the logic for the RAM will have a priority system to handle such clashes

Wouldn't it be the logic for the memory controller managing the processor's FSB?

So what I'm getting from all of this is that to the programmer, a multi core processor isn't much different than a single core?

Re: Explanation for how different processor cores are connec
by adam_smasher on 2018-05-07 (#218142)

Cache doesn't live in the address space. When the CPU executes a memory read, it checks to see if the address(es) it's accessing are in the cache. If they are, it fetches the data from cache; if not, it loads the data from RAM into the cache in order to fetch it. From the programmer's perspective, the only real difference is that one is way slower than the other. Also, things can get weird if different cores are reading/writing to memory at the same time, because operations can happen out of order and the CPU won't necessarily write back data written in one core's cache to main memory before another core reads it. Maybe give this paper a gander if you're curious about that.
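
The check-the-cache-first behaviour described above can be sketched as a toy in Python. Everything here is an assumption for illustration: the ReadCache class, the dictionary standing in for RAM, the capacity of four entries, and the least-recently-used eviction rule (real hardware uses various policies and works on whole cache lines, not single addresses).

```python
from collections import OrderedDict

class ReadCache:
    """Toy fully associative cache with least-recently-used eviction."""

    def __init__(self, ram, capacity):
        self.ram = ram
        self.capacity = capacity
        self.lines = OrderedDict()   # address -> cached value, oldest first
        self.hits = 0
        self.misses = 0

    def read(self, addr):
        if addr in self.lines:
            self.hits += 1
            self.lines.move_to_end(addr)          # mark as recently used
        else:
            self.misses += 1
            if len(self.lines) >= self.capacity:
                self.lines.popitem(last=False)    # evict the oldest entry
            self.lines[addr] = self.ram[addr]     # the slow trip to RAM
        return self.lines[addr]

ram = {addr: addr * 2 for addr in range(16)}
cache = ReadCache(ram, capacity=4)
values = [cache.read(a) for a in [0, 1, 0, 1, 2, 3, 4, 0]]
print(cache.hits, cache.misses)   # prints: 2 6
```

The caller only ever asks for an address and gets a value back; whether it came from the cache or from RAM is invisible apart from the hit and miss counters, which stand in for the speed difference.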

There's no standard for where memory is mapped into the physical address space, AFAIK. The BIOS/EFI tells the OS kernel that.

Applications get a virtual address space. The OS configures the CPU so that when an application tries to read from a certain address, it maps onto certain physical addresses - or doesn't map onto anything, which triggers an interrupt (page fault) that lets the OS, say, load the desired data out of swap on the hard disk, or crash the program for its bad access. The OS dynamically allocates physical address space for processes, so they can be basically anywhere in physical memory, and they don't really know anything about where.
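
A very simplified sketch of that translation step, in Python. The 4096-byte page size, the single flat page table, and the names translate and PageFault are assumptions for the example; real MMUs use multi-level tables and hardware TLBs, and the fault is delivered to the OS rather than raised as an exception.

```python
PAGE_SIZE = 4096   # assumed page size for this sketch

class PageFault(Exception):
    """Stand-in for the interrupt the CPU raises on an unmapped access."""

def translate(page_table, vaddr):
    # Split the virtual address into a page number and an offset in the page.
    page, offset = divmod(vaddr, PAGE_SIZE)
    if page not in page_table:
        raise PageFault(f"no mapping for virtual page {page}")
    # Same offset, but inside whichever physical frame the OS picked.
    return page_table[page] * PAGE_SIZE + offset

# Hypothetical mapping: virtual page 0 -> frame 7, virtual page 1 -> frame 3.
page_table = {0: 7, 1: 3}

paddr = translate(page_table, PAGE_SIZE + 20)
print(paddr)   # prints: 12308  (frame 3 * 4096 + 20)

try:
    translate(page_table, 5 * PAGE_SIZE)   # virtual page 5 is not mapped
    faulted = False
except PageFault:
    faulted = True   # a real OS would now load from swap or kill the process
```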

Even a process' virtual address space tends to be dynamically arranged on modern OSes (see address space layout randomization).

Re: Explanation for how different processor cores are connec
by 93143 on 2018-05-07 (#218144)

(ninja'd, but oh well)

Cache isn't a separate address space; it's just a way for the CPU to remember what the data (or code) at some address was last time it saw it, so it doesn't incur the latency of hitting RAM every time it needs it. At least, that's the idea for reads; writes are a similar principle in that it's faster to pretend to operate on RAM and then just have the cache make sure RAM is actually updated at some point.

The Super FX has an instruction cache. Normally, instruction execution is 3 cycles per byte in 10.7 MHz mode, or 5 in 21.4 MHz mode (plus any internal processing that happens after the instruction is loaded, such as with multiplication), and it has to wait for data loads through the ROM buffer (or reads/writes/pixel plotting through the RAM buffer, if you're executing from RAM. Don't execute from RAM if you can avoid it). But if you throw in a cache instruction, the GSU starts copying executed code to the instruction cache, so the next time you loop through that code it's only one cycle per byte and loads in parallel with both bus buffers. There's no way for the programmer to access the instruction cache specifically (at least, not with the GSU - it's mapped to the A bus for some reason, so the S-CPU can write to it); it just does its thing in the background. The only difference is that execution magically gets faster.

Modern CPUs have data caching as well as code caching, write as well as read, and the specifics are far more sophisticated. But the basic idea is the same, that being that it's transparent to the programmer. (Also, I suspect the use of an explicit start-caching-here instruction is archaic.) This is apparently why some people object to the term "cache" being used for TMEM on the N64, because unlike the PSX's texture cache, you have to explicitly load data into TMEM in order to use it.

Re: Explanation for how different processor cores are connec
by rainwarrior on 2018-05-07 (#218148)

93143 wrote:

Also, I suspect the use of an explicit start-caching-here instruction is archaic.

It's not archaic. Cache control instructions are very useful in high performance development.

Re: Explanation for how different processor cores are connec
by Drew Sebastino on 2018-05-07 (#218149)

93143 wrote:

This is apparently why some people object to the term "cache" being used for TMEM on the N64, because unlike the PSX's texture cache, you have to explicitly load data into TMEM in order to use it.

I'd have thought this would be irrelevant; what's the textbook definition of cache in computer science? I'd have thought it'd just be "smaller but faster ram."

rainwarrior wrote:

93143 wrote:

Also, I suspect the use of an explicit start-caching-here instruction is archaic.

It's not archaic. Cache control instructions are very useful in high performance development.

You just reiterated what he said. :lol:

Anyway, if the processor itself does so much abstracting already, then do you still need to program for each core specifically, or does the processor somehow handle that during runtime as well?

Re: Explanation for how different processor cores are connec
by 93143 on 2018-05-07 (#218150)

rainwarrior wrote:

93143 wrote:

Also, I suspect the use of an explicit start-caching-here instruction is archaic.

It's not archaic. Cache control instructions are very useful in high performance development.

Interesting. Is that the only thing I got wrong?

Espozo wrote:

You just reiterated what he said. :lol:

No, he... OH SNAP

Quote:

Anyway, if the processor itself does so much abstracting already, then do you still need to program for each core specifically, or does the processor somehow handle that during runtime as well?

I've used MPI, and you do need to pay attention. Even in C++, there's a bunch of functions involving broadcasting data, communicating between specific processors, checks to see which processor this particular instance of the code is running on, and blocking execution at checkpoints until all processors have finished what they need to get done before the next step starts. It's all scalable to any number of processors, but it's not entirely transparent to the programmer.

To be fair, the version I've been using is over ten years old...
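
The "block at a checkpoint until everyone is done" idea can be sketched without MPI itself. The example below uses Python threads and threading.Barrier as a stand-in for MPI processes and MPI_Barrier; the worker function, the rank numbering, and the shared lists are illustrative assumptions, and real MPI ranks would not share memory like this.

```python
import threading

NUM_WORKERS = 4
barrier = threading.Barrier(NUM_WORKERS)
partial = [0] * NUM_WORKERS   # one result slot per worker
totals = [0] * NUM_WORKERS

def worker(rank):
    # Step 1: each worker picks its slice of the job based on its own rank.
    partial[rank] = sum(range(rank * 100, (rank + 1) * 100))
    # Checkpoint: nobody continues until every worker has finished step 1.
    barrier.wait()
    # Step 2: now it is safe to read everyone else's results.
    totals[rank] = sum(partial)

threads = [threading.Thread(target=worker, args=(r,)) for r in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(totals)   # prints: [79800, 79800, 79800, 79800]
```

Without the barrier, a fast worker could reach step 2 and sum the partial list while other slots are still zero, which is the kind of thing the programmer has to pay attention to.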

Re: Explanation for how different processor cores are connec
by rainwarrior on 2018-05-07 (#218151)

Espozo wrote:

93143 wrote:

This is apparently why some people object to the term "cache" being used for TMEM on the N64, because unlike the PSX's texture cache, you have to explicitly load data into TMEM in order to use it.

I'd have thought this would be irrelevant; what's the textbook definition of cache in computer science? I'd have thought it'd just be "smaller but faster ram."

Cache is a more generic term, it's not memory-specific. A cache optimization stores some result and reuses it, instead of having to do it the "long" way. That long way could refer to a slow main memory fetch, or it could be some kind of computation, etc.

For instance, a search engine will cache recent search results so that when someone else asks for the same thing, they can just reuse that result instead of doing the whole search process again.

Outside the computer context a cache is just a place where you store stuff, or a collection of stored stuff. In computer science "cache" means "use a cache as an optimization". Store something to save effort.

Espozo wrote:

rainwarrior wrote:

93143 wrote:

Also, I suspect the use of an explicit start-caching-here instruction is archaic.

It's not archaic. Cache control instructions are very useful in high performance development.

You just reiterated what he said. :lol:

I don't think I did, but I guess 93143 really just meant that the SuperFX's mechanism is archaic, or maybe even more specifically that the SuperFX's instruction caching doesn't happen unless you use an instruction to enable it? Modern CPUs do most of their caching in an automated way, without programmer or compiler intervention. For high performance code, though, and occasionally for other reasons, you have some explicit control instructions that can be taken advantage of. Ultimately, you can know more about your code and data than the compiler or CPU does; in the right place doing it "by hand" will outperform the automated version.

Espozo wrote:

Anyway, if the processor itself does so much abstracting already, then do you still need to program for each core specifically, or does the processor somehow handle that during runtime as well?

I think controlling which thread goes on which core is normally the operating system's job. It's got to multitask all running programs, not just yours. If you have 4 cores and create 4 threads and do heavy activity on each of them, the OS will generally automatically spread that load over the 4 cores. That's its job, and it will normally manage to do it fairly well. (Threads can move across cores, too. They don't have to stay where they began.)

...but again, there's ways to explicitly control this too. You can ask an OS how many cores you have, and you can tell it to put a thread on a specific core. On a console where you're not really multitasking in the same way, and the computer architecture is guaranteed, it's often very appropriate to do this. It's a refinement that can be used if needed.

See: Wikipedia: Processor Affinity
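
For a rough idea of what asking the OS looks like, here is a small Python sketch. os.cpu_count() is available everywhere; os.sched_getaffinity and os.sched_setaffinity only exist on some platforms (notably Linux), so the calls are guarded. Pinning the whole process to the lowest-numbered allowed core is just an arbitrary choice for the example.

```python
import os

cores = os.cpu_count()            # logical cores reported by the OS (may be None)
print("logical cores:", cores)

pinned = None
if hasattr(os, "sched_setaffinity"):
    allowed = os.sched_getaffinity(0)          # 0 means "the calling process"
    os.sched_setaffinity(0, {min(allowed)})    # restrict to a single core
    pinned = os.sched_getaffinity(0)
    os.sched_setaffinity(0, allowed)           # restore the original mask
    print("was pinned to:", pinned)
```

On platforms without these calls the script simply reports the core count and leaves scheduling to the OS.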

Re: Explanation for how different processor cores are connec
by Drew Sebastino on 2018-05-07 (#218152)

rainwarrior wrote:

I don't think I did

It was a poor joke; I was saying that "high performance development" was another way of saying "archaic." Never mind. :lol:

Yeah, for running multiple programs, the operating system definitely can spread the load onto multiple cores (a program or two per core), but I don't know how it would do this for a single intensive program that wants to use the system's full resource; I wouldn't think you could just generically divide up the load, not only because how would you even divide it in the first place, but because obviously programs are going to assume things are going to be processed linearly.

Re: Explanation for how different processor cores are connec
by rainwarrior on 2018-05-07 (#218154)

Espozo wrote:

If you want to control it, you query the OS about how many cores there are, create a thread that's bound to each core, and run what you want to on each thread.

If a program doesn't create multiple threads, it will only ever use one core. The OS can't split it up.

Though, you could just create a few threads, assign them independent work as you can, and let the OS sort out balancing them for you. That's probably easier to write, and will translate to a variety of systems better. Doing it manually is a little more common on consoles where you can make a lot stronger assumptions about the architecture.
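
The "create a few threads, give them independent work, let the OS balance" approach might look like the following Python sketch. The work function and the way the data is chunked are made up for illustration. One caveat, stated as an assumption about the runtime: in standard CPython the global interpreter lock keeps CPU-bound threads from truly running in parallel, so this shows the structure rather than a speedup.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def work(chunk):
    # Independent work: no chunk needs anything from another chunk.
    return sum(x * x for x in chunk)

data = list(range(100_000))
n_workers = os.cpu_count() or 1
chunk_size = -(-len(data) // n_workers)   # ceiling division
chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

# No core is named anywhere; placement of the threads is left to the OS.
with ThreadPoolExecutor(max_workers=n_workers) as pool:
    total = sum(pool.map(work, chunks))

print(total)
```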

Re: Explanation for how different processor cores are connec
by 93143 on 2018-05-07 (#218157)

rainwarrior wrote:

I guess 93143 really just meant that the SuperFX's mechanism is archaic, or maybe even more specifically that the SuperFX's instruction caching doesn't happen unless you use an instruction to enable it?

Kinda, but I also didn't know that manual was an option on modern CPUs. I never dive deeper than C++ on modern processors.

Re: Explanation for how different processor cores are connec
by rainwarrior on 2018-05-07 (#218160)

The main case I've seen it used for performance is just that the automated caching only helps with repeated use of the same block, because it doesn't know what memory blocks to cache until the CPU actually uses it for the first time. Since a cache fetch can happen in parallel with other execution (that doesn't depend on it), you can give the CPU advance notice to have the block cached before your code needs it, and bypass that first fetch penalty altogether.
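
Python has no way to issue a real CPU prefetch, so the sketch below is only a software analogy for the "advance notice" idea: kick off a slow fetch early, do unrelated work while it is in flight, and collect the result when it is actually needed. The slow_fetch and other_work functions and the 50 ms delay are invented for the example.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def slow_fetch(block_id):
    time.sleep(0.05)               # stand-in for memory latency
    return [block_id] * 4

def other_work():
    return sum(range(1000))        # does not depend on the fetched block

with ThreadPoolExecutor(max_workers=1) as pool:
    pending = pool.submit(slow_fetch, 7)   # the "advance notice"
    side_result = other_work()             # overlaps with the fetch
    block = pending.result()               # waits only if it isn't ready yet

print(block, side_result)   # prints: [7, 7, 7, 7] 499500
```

The analogy also shows the risk: if the early fetch is for something that never gets used, the effort is wasted and it competes with useful work.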

There are a bunch of other reasons to use manual cache control, for threading or communication between devices. The Xbox 360 GPU shared some memory with the main CPU, but they had separate caches that didn't know about each other. If you didn't manually flush the cache, there was no guarantee about when that data would migrate to the GPU. I once had a weird bug because of this where a character would load in as a ball of random spikes that would slowly turn into the character as you kept playing.

Re: Explanation for how different processor cores are connec
by lidnariq on 2018-05-07 (#218162)

(x86 has had the PREFETCH instruction since Pentium III days.)

Re: Explanation for how different processor cores are connec
by pubby on 2018-05-08 (#218208)

The built-in prefetcher tends to be pretty good though. PREFETCH makes code slower in many cases.

Espozo wrote:

I don't know how it would do this for a single intensive program that wants to use the system's full resource; I wouldn't think you could just generically divide up the load, not only because how would you even divide it in the first place, but because obviously programs are going to assume things are going to be processed linearly.

It's stupidly hard. See: concurrent programming. The billion dollar question is how to generically parallelize things.

Re: Explanation for how different processor cores are connec
by rainwarrior on 2018-05-08 (#218214)

pubby wrote:

The built-in prefetcher tends to be pretty good though. PREFETCH makes code slower in many cases.

Built in prefetching works well for some access patterns (esp. serial/linear access), and in a similar way you can often improve performance by reorganizing how your data is stored, which doesn't in itself require manual cache control, and is probably a better first approach to cache optimization if your data structures are still malleable.
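
To see why data layout and access order matter, here is a toy cache-line model in Python. The line size, the number of lines, the direct-mapped placement, and the 64x64 grid are all assumed values picked to make the effect obvious; real caches are far larger and usually set-associative, so the numbers are not representative, only the trend.

```python
LINE_SIZE = 8    # elements per cache line (assumed)
NUM_LINES = 4    # lines in the toy direct-mapped cache (assumed)

def count_misses(addresses):
    slots = [None] * NUM_LINES       # which line each slot currently holds
    misses = 0
    for addr in addresses:
        line = addr // LINE_SIZE
        slot = line % NUM_LINES
        if slots[slot] != line:
            slots[slot] = line       # fetch the whole line into the slot
            misses += 1
    return misses

ROWS, COLS = 64, 64
# Same elements, stored row by row, visited in two different orders.
row_major = [r * COLS + c for r in range(ROWS) for c in range(COLS)]
col_major = [r * COLS + c for c in range(COLS) for r in range(ROWS)]

print(count_misses(row_major))   # prints: 512
print(count_misses(col_major))   # prints: 4096
```

Walking the grid in storage order misses once per line, while walking it column by column misses on every single access in this model, even though both loops touch exactly the same data.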

A misused manual prefetch absolutely does make code slower, but IMO if you're effectively using manual cache control you should already be measuring its effect to make sure, so you shouldn't fall into this trap unless you're using it blindly.

For instruction prefetching, the automated version tends to work quite well (the main mitigating factor here is branch prediction), since code mostly runs linearly anyway. I've never actually seen manually prefetched code, only data, though I'm sure there's a use case out there somewhere.

Re: Explanation for how different processor cores are connec
by psycopathicteen on 2018-05-09 (#218289)

I remember the late 90's, early 2000's computers where speed was VERY unpredictable. Every time there was slightly too much going on, the computer would go from 60 fps to 5 fps. I'm guessing it's a combination of cache misses, and hard-drive reading/writing.

Re: Explanation for how different processor cores are connec
by rainwarrior on 2018-05-09 (#218290)

When you're multitasking, if two programs you're running need to share more RAM than you have available, when the OS switches between them it will temporarily copy RAM to the hard drive to make room. Or even if one program requests too much RAM (web browsers are notorious memory hogs) it may page some of it to the hard disk.

I think the larger availability of RAM has made a huge difference for this, which is not the only cause of unpredictable performance loss but it was definitely a major one. You might also consider this a caching issue, as the physical RAM is essentially a faster cache for the larger virtual RAM space including the hard drive.

Re: Explanation for how different processor cores are connec
by adam_smasher on 2018-05-09 (#218291)

A couple of other technology improvements that have helped: SSDs mean the difference between disk and memory access is no longer quite so dramatic; multicore processors mean that other processes are less likely to steal your compute time out from underneath you.

Re: Explanation for how different processor cores are connec
by rainwarrior on 2018-05-09 (#218295)

Oh yeah, and there's also hybrid SSDs where you have a smaller SSD operating on top of a larger traditional hard drive, again a form of cache. Actually hard drives generally have varying types of cache devices, all of it internal and transparent to the interface, usually. Also helps with durability, e.g. you can hold data in the cache if sudden movement is detected to try and prevent a crash.

Re: Explanation for how different processor cores are connec
by rainwarrior on 2018-05-09 (#218296)

rainwarrior wrote:

For some more reference on this: Memoization is a term that describes the technique of result caching, and strongly correlated with this is Dynamic Programming which is a large family of algorithms that can be made very efficient by caching results progressively.
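
A minimal Python illustration of memoization, using the standard library's functools.lru_cache. The Fibonacci function and the calls counter are just a convenient example chosen here to show how many times the "long way" actually runs.

```python
from functools import lru_cache

calls = 0   # how many times the function body really executes

@lru_cache(maxsize=None)
def fib(n):
    global calls
    calls += 1
    return n if n < 2 else fib(n - 1) + fib(n - 2)

result = fib(30)
print(result, calls)   # prints: 832040 31
```

Without the cache, the same recursive definition would run the body over a million times for fib(30); with results stored and reused, each value from 0 to 30 is computed exactly once.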

Re: Explanation for how different processor cores are connec
by koitsu on 2018-05-09 (#218297)

adam_smasher wrote:

Yes, all true, except no longer is CPU load what we expect (and here's a 5 minute presentation of the same content from the same author). Thanks, Meltdown/Spectre!

Re: Explanation for how different processor cores are connec
by 93143 on 2018-05-09 (#218300)

adam_smasher wrote:

multicore processors mean that other processes are less likely to steal your compute time out from underneath you.

Even core multithreading by itself helps. Without it, multitasking had a lot of overhead on a single-core CPU because the OS could only run one thread at a time, and every time it switched tasks the CPU had to get rid of what it was doing and start from scratch on the new thread. This meant that a single high-priority process could basically stall the whole machine. When I got a Pentium 4 Prescott (single-core hyperthreading) in 2004, I stopped having this problem - Matlab could get stuck in a tight loop and forget to listen to the command line just like before, but Windows still worked fine around it. I could run Emulator X and DOSBox (both CPU hogs) in parallel without hiccups, to get realistic orchestra music out of TIE Fighter. It was pretty neat for the time, I thought... at least during winter, with an aftermarket cooler...