Architecture of Massively Parallel Microprocessor Systems (Computer Architecture Book 9)



Embedded hardware is often much simpler than a desktop system, but it can also be far more complex. An embedded computer may be implemented in a single chip with just a few support components, and its purpose may be as crude as a controller for a garden-watering system. Alternatively, the embedded computer may be a multiprocessor, distributed parallel machine responsible for all the flight and control systems of a commercial jet.

As diverse as embedded hardware may be, the underlying principles of design are the same. This chapter introduces some important concepts relating to computer architecture, with specific emphasis on those topics relevant to embedded systems. Its purpose is to give you grounding before moving on to the more hands-on information that begins in Chapter 2. In essence, a computer is a machine designed to process, store, and retrieve data.

Data may be numbers in a spreadsheet, characters of text in a document, dots of color in an image, waveforms of sound, or the state of some system, such as an air conditioner or a CD player. All data is stored in the computer as numbers. The computer manipulates the data by performing operations on the numbers. Displaying an image on a screen is accomplished by moving an array of numbers to the video memory, each number representing a pixel of color.
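The idea that an image is just an array of numbers can be sketched in C. The framebuffer size and the RGB packing below are invented for illustration, not any particular display's format:

```c
#include <stdint.h>

#define WIDTH  16
#define HEIGHT 16

/* A toy "video memory": one 24-bit RGB value packed per pixel. */
static uint32_t video_memory[WIDTH * HEIGHT];

/* Pack red, green, and blue intensities into a single number. */
uint32_t pack_rgb(uint8_t r, uint8_t g, uint8_t b)
{
    return ((uint32_t)r << 16) | ((uint32_t)g << 8) | b;
}

/* "Display" a pixel by writing its number into video memory. */
void set_pixel(int x, int y, uint32_t color)
{
    video_memory[y * WIDTH + x] = color;
}

uint32_t get_pixel(int x, int y)
{
    return video_memory[y * WIDTH + x];
}
```

Displaying a picture is then nothing more than filling the array with the right numbers.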


To play an MP3 audio file, the computer reads an array of numbers from disk into memory, manipulates those numbers to convert the compressed audio data into raw audio data, and then outputs the new set of numbers (the raw audio data) to the audio chip. Everything that a computer does, from web browsing to printing, involves moving and processing numbers. The electronics of a computer is nothing more than a system designed to hold, move, and change numbers. A computer system is composed of many parts, both hardware and software.

At the heart of the computer is the processor, the hardware that executes the computer programs. The computer also has memory, often several different types in one system. The memory is used to store programs while the processor is running them, as well as store the data that the programs are manipulating. The computer also has devices for storing data, or exchanging data with the outside world.

These may allow the input of text via a keyboard, the display of information on a screen, or the movement of programs and data to or from a disk drive. The software controls the operation and functionality of the computer. Typically, a given layer will only interact with the layers immediately above or below it. At the lowest level, there are programs that are run by the processor when the computer first powers up.

These programs initialize the other hardware subsystems to a known state and configure the computer for correct operation. This startup code is known as the firmware, and the bootloader is located within it. The bootloader is a special program run by the processor that reads the operating system from disk (or nonvolatile memory, or a network interface) and places it in memory so that the processor may then run it. The bootloader is present in desktop computers and workstations, and may be present in some embedded computers.

Above the firmware, the operating system controls the operation of the computer. It organizes the use of memory and controls devices such as the keyboard, mouse, screen, disk drives, and so on. It is also the software that often provides an interface to the user, enabling her to run application programs and access her files on disk. The operating system typically provides a set of software tools for application programs, providing a mechanism by which they too can access the screen, disk drives, and so on.

Not all embedded systems use or even need an operating system. Often, an embedded system will simply run code dedicated to its task, and the presence of an operating system is overkill. In other instances, such as network routers, an operating system provides necessary software integration and greatly simplifies the development process. Whether an operating system is needed and useful really depends on the intended purpose of the embedded computer and, to a lesser degree, on the preference of the designer.

At the highest level, the application software constitutes the programs that provide the functionality of the computer. Everything below the application is considered system software. For embedded computers, the boundary between application and system software is often blurred.

This reflects the underlying principle in embedded design that a system should be designed to achieve its objective in as simple and straightforward a manner as possible. The processor is the most important part of a computer, the component around which everything else is centered. In essence, the processor is the computing part of the computer. A processor is an electronic device capable of manipulating data (information) in a way specified by a sequence of instructions. The instructions are also known as opcodes or machine code. This sequence of instructions may be altered to suit the application, and, hence, computers are programmable.

A sequence of instructions is what constitutes a program. Instructions in a computer are numbers, just like data. Different numbers, when read and executed by a processor, cause different things to happen. A good analogy is the mechanism of a music box.

A music box has a rotating drum with little bumps, and a row of prongs. As the drum rotates, different prongs in turn are activated by the bumps, and music is produced.


In a similar way, the bit patterns of instructions feed into the execution unit of the processor. Different bit patterns activate or deactivate different parts of the processing core. Thus, the bit pattern of a given instruction may activate an addition operation, while another bit pattern may cause a byte to be stored to memory.

A sequence of instructions is a machine-code program. Each type of processor has a different instruction set, meaning that the functionality of the instructions and the bit patterns that activate them vary. The processor alone is incapable of successfully performing any tasks. It requires memory (for program and data storage), support logic, and at least one I/O device to transfer data between the computer and the outside world. The basic computer system is shown in the accompanying figure. A microprocessor is a processor usually implemented on a single integrated circuit.
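The way bit patterns select operations can be sketched with a toy machine-code interpreter. The opcode values and the two-byte instruction format below are invented for illustration; real instruction sets are far richer:

```c
#include <stdint.h>

/* Invented opcodes for a toy accumulator machine. */
enum { OP_LOAD = 0x01, OP_ADD = 0x02, OP_STORE = 0x03, OP_HALT = 0xFF };

/* Run a tiny machine-code program: each instruction is an opcode
 * byte followed by one operand byte (a literal or a memory address).
 * Different bit patterns activate different operations, exactly as
 * described in the text. */
uint8_t run(const uint8_t *program, uint8_t *memory)
{
    uint8_t acc = 0;                     /* accumulator register */
    for (int pc = 0; ; pc += 2) {        /* pc: program counter  */
        uint8_t opcode  = program[pc];
        uint8_t operand = program[pc + 1];
        switch (opcode) {
        case OP_LOAD:  acc = operand;          break;
        case OP_ADD:   acc += operand;         break;
        case OP_STORE: memory[operand] = acc;  break;
        case OP_HALT:  return acc;
        default:       return acc;  /* unknown bit pattern: stop */
        }
    }
}
```

Feeding the interpreter the byte sequence `LOAD 5, ADD 7, STORE 0, HALT` makes it compute 12 and write it to memory, purely because of the numbers it was given.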

With the exception of those found in some large supercomputers, nearly all modern processors are microprocessors, and the two terms are often used interchangeably. A microcontroller combines a processor with memory and I/O peripherals on a single chip, intended for embedded control applications. The range of available microcontrollers is very broad. In this book, we will look at both microprocessors and microcontrollers. Microcontrollers are very similar to System-on-Chip (SoC) processors, which are intended for use in conventional computers such as PCs and workstations. Microcontrollers usually have all their memory on-chip and may provide only limited support for external memory devices.

The memory of the computer system contains both the instructions that the processor will execute and the data it will manipulate. The memory of a computer system is never empty. It always contains something, whether it be instructions, meaningful data, or just the random garbage that appeared in the memory when the system powered up. Instructions are read (fetched) from memory, while data is both read from and written to memory, as shown in the accompanying figure. This form of computer architecture is known as a Von Neumann machine, named after John Von Neumann, one of the originators of the concept.

With very few exceptions, nearly all modern computers follow this form. Von Neumann computers are what can be termed control-flow computers. The steps taken by the computer are governed by the sequential control of a program. In other words, the computer follows a step-by-step program that governs its operation. There are some interesting non-Von Neumann architectures, such as the massively parallel Connection Machine and the nascent efforts at building biological and quantum computers, or neural networks.

A processor can be directed to begin execution at a given point in memory, and it has no way of knowing whether the sequence of numbers beginning at that point is data or instructions. The processor has no way of telling what is data or what is an instruction. If a number is to be executed by the processor, it is an instruction; if it is to be manipulated, it is data.

Because of this lack of distinction, the processor is capable of changing its instructions (treating them as data) under program control. And because the processor has no way of distinguishing between data and instructions, it will blindly execute anything that it is given, whether it is a meaningful sequence of instructions or not.

There is nothing to distinguish between a number that represents a dot of color in an image and a number that represents a character in a text document. Meaning comes from how these numbers are treated under the execution of a program. This means that sequences of instructions in a program may be treated as data by another program. A compiler creates a program binary by generating a sequence of numbers (instructions) in memory.

To the compiler, the compiled program is just data, and it is treated as such. It is a program only when the processor begins execution. Similarly, an operating system loading an application program from disk does so by treating the sequence of instructions of that program as data. The program is loaded to memory just as an image or text file would be, and this is possible due to the shared memory space. Each location in the memory space has a unique, sequential address. The address of a memory location is used to specify and select that location.

The address space is the array of all addressable memory locations. A processor with 16 address lines, for example, can access 2^16 (65,536) locations; hence, the processor is said to have a 64K address space. Most microprocessors available are standard Von Neumann machines. The main deviation from this is the Harvard architecture, in which instructions and data have different memory spaces (see the accompanying figure), with separate address, data, and control buses for each memory space. This has a number of advantages in that instruction and data fetches can occur concurrently, and the size of an instruction is not set by the size of the standard data unit (word).
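The arithmetic behind an address space is simple: n address lines can select 2^n distinct locations. A one-line sketch:

```c
#include <stdint.h>

/* Number of distinct locations addressable with n address lines.
 * Each line doubles the number of selectable locations. */
uint64_t address_space(unsigned address_lines)
{
    return (uint64_t)1 << address_lines;
}
```

So 16 lines give the 64K space mentioned above, 20 lines give 1M, and 32 lines give 4G.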

A bus is a physical group of signal lines that have a related function. Buses allow for the transfer of electrical signals between different parts of the computer system and thereby transfer information from one device to another. For example, the data bus is the group of signal lines that carry data between the processor and the various subsystems that comprise the computer.

For example, an 8-bit-wide bus transfers 8 bits of data in parallel. The majority of microprocessors available today (with some exceptions) use the three-bus system architecture shown in the accompanying figure. The three buses are the address bus, the data bus, and the control bus. The data bus is bidirectional, the direction of transfer being determined by the processor.

The address bus carries the address, which points to the location in memory that the processor is attempting to access. It is the job of external circuitry to determine in which external device a given memory location exists and to activate that device. This is known as address decoding. The control bus carries information from the processor about the state of the current access, such as whether it is a write or a read operation.
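Address decoding can be modeled as a function from an address to a chip select. The memory map below (16K of ROM at 0x0000, 16K of RAM at 0x8000) is invented for illustration; in real hardware this decision is made by logic gating the upper address lines, not by software:

```c
#include <stdint.h>

typedef enum { SELECT_ROM, SELECT_RAM, SELECT_NONE } chip_select_t;

/* Decode an address into a chip select for an invented 64K memory
 * map: 16K ROM at 0x0000-0x3FFF, 16K RAM at 0x8000-0xBFFF, and the
 * rest unmapped. Activating the selected device is the job of the
 * external decode circuitry this function stands in for. */
chip_select_t decode_address(uint16_t address)
{
    if (address <= 0x3FFF)
        return SELECT_ROM;
    if (address >= 0x8000 && address <= 0xBFFF)
        return SELECT_RAM;
    return SELECT_NONE;   /* unmapped: no device responds */
}
```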

The control bus can also carry information back to the processor regarding the current access, such as an address error. Different processors have different control lines, but some control lines are common among many processors. The control bus may consist of output signals such as read, write, valid address, etc. A processor usually has several input control lines too, such as reset, one or more interrupt lines, and a clock input.

There are six basic types of access that a processor can perform with external chips. The internal data storage of the processor is known as its registers. The instructions that are read and executed by the processor control the data flow between the registers and the ALU. A symbolic representation of an ALU is shown in the accompanying figure. The ALU takes two input values. These values, called operands, are typically obtained from two registers, or from one register and a memory location.

The result of the operation is then placed back into a given destination register or memory location. The status outputs indicate any special attributes about the operation, such as whether the result was zero or negative, or whether an overflow or carry occurred.
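The status outputs can be modeled in C. The sketch below shows an 8-bit addition producing zero, negative, and carry flags; the structure and flag set are invented for illustration and do not match any particular processor:

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint8_t result;
    bool    zero;      /* result was zero        */
    bool    negative;  /* top bit of result set  */
    bool    carry;     /* carry out of bit 7     */
} alu_status_t;

/* 8-bit addition with status flags, as an ALU might report them. */
alu_status_t alu_add(uint8_t a, uint8_t b)
{
    uint16_t wide = (uint16_t)a + (uint16_t)b;  /* keep the carry bit */
    alu_status_t s;
    s.result   = (uint8_t)wide;
    s.zero     = (s.result == 0);
    s.negative = (s.result & 0x80) != 0;
    s.carry    = (wide & 0x100) != 0;
    return s;
}
```

Adding 200 and 100, for example, overflows the 8-bit result (300 mod 256 = 44) and sets the carry flag, which is exactly the kind of condition the status outputs report.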

Some processors have separate units for multiplication and division, and for bit shifting, providing faster operation and increased throughput. Each architecture has its own unique ALU features, and this can vary greatly from one processor to another. However, all are just variations on a theme, and all share the common characteristics just described. Interrupts (also known as traps or exceptions in some processors) are a technique of diverting the processor from the execution of the current program so that it may deal with some event that has occurred.

An interrupt is generated in your computer every time you type a key or move the mouse. You can think of it as a hardware-generated function call. Rather than having to continually check its peripherals for attention, the processor may continue with other tasks and respond to a device only when that device requests service. Interrupts can be of varying priorities in some processors, thereby assigning differing importance to the events that can interrupt the processor. If the processor is servicing a low-priority interrupt, it will pause it in order to service a higher-priority interrupt.

However, if the processor is servicing an interrupt and a second, lower-priority interrupt occurs, the processor will ignore that interrupt until it has finished the higher-priority service. When an interrupt occurs, the usual procedure is for the processor to save its state by pushing its registers and program counter onto the stack. The processor then loads an interrupt vector into the program counter. The interrupt vector is the address at which an interrupt service routine (ISR) lies. Thus, loading the vector into the program counter causes the processor to begin execution of the ISR, performing whatever service the interrupting device required.

The final instruction of an ISR is a return-from-interrupt. This causes the processor to reload its saved state (registers and program counter) from the stack and resume its original program. Interrupts are largely transparent to the original program. Processors with shadow registers use these to save their current state, rather than pushing their register bank onto the stack. This saves considerable memory accesses (and therefore time) when processing an interrupt. However, shadow registers typically hold only a single saved state, so a second interrupt must not be allowed to overwrite them before the first has completed. If it does, important state information will be lost.
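The vector-table mechanism can be sketched in C as an array of function pointers indexed by interrupt number. The table size, the handler, and the tick counter below are invented for illustration, and the state save/restore that real hardware performs around the call is not modeled:

```c
#define NUM_VECTORS 8

/* The vector table: one interrupt service routine per vector. */
typedef void (*isr_t)(void);
static isr_t vector_table[NUM_VECTORS];

static int timer_ticks = 0;

/* An invented ISR: service a timer interrupt. */
static void timer_isr(void) { timer_ticks++; }

void install_isr(int vector, isr_t handler)
{
    vector_table[vector] = handler;
}

/* Conceptually what the hardware does: load the vector into the
 * program counter, i.e. call the handler, then resume the
 * original program. */
void dispatch_interrupt(int vector)
{
    if (vector_table[vector])
        vector_table[vector]();
}
```

Installing `timer_isr` at vector 3 and then "raising" interrupt 3 runs the routine; vectors with no handler installed are simply ignored in this sketch.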

Upon returning from an ISR, the contents of the shadow registers are swapped back into the main register array. One way for the processor to know that a peripheral requires attention is to poll it, repeatedly reading its status under software control. For some time-critical applications, polling can reduce the time it takes for the processor to respond to a change of state in a peripheral, but it wastes processor cycles whenever the device has nothing to report. A better way is usually for the device to generate an interrupt to the processor when it is ready for a transfer to take place. Small, simple processors may have only one or two interrupt inputs, so several external devices may have to share the interrupt lines of the processor.

When an interrupt occurs, the processor must check each device to determine which one generated the interrupt. This can also be considered a form of polling. The advantage of interrupt polling over ordinary polling is that the polling occurs only when there is a need to service a device. Polling interrupts is suitable only in systems that have a small number of devices; otherwise, the processor will spend too long trying to determine the source of the interrupt. Vectored interrupts considerably reduce the time it takes the processor to determine the source of the interrupt.
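Polling a shared interrupt line amounts to asking each device in turn whether it raised the interrupt. The device count and the pending flags below are invented stand-ins for each device's status register:

```c
#include <stdbool.h>

#define NUM_DEVICES 4

/* Invented device status flags: true means "this device is
 * requesting service". In real hardware these would be bits read
 * from each device's status register. */
static bool device_pending[NUM_DEVICES];

/* On a shared interrupt line the processor must ask each device
 * in turn whether it raised the interrupt. Returns the first
 * requesting device, or -1 if none (a spurious interrupt). The
 * cost of this scan is why polling scales poorly with device
 * count, and why vectored interrupts help. */
int find_interrupt_source(void)
{
    for (int dev = 0; dev < NUM_DEVICES; dev++)
        if (device_pending[dev])
            return dev;
    return -1;
}
```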

If an interrupt request can be generated from more than one source, it is therefore necessary to assign priority levels to the different interrupts. This can be done in either hardware or software, depending on the particular application. In this scheme, the processor has numerous interrupt lines, with each interrupt corresponding to a given interrupt vector. Vectored interrupts can be taken one step further. Some processors and devices support the device itself placing the appropriate vector onto the data bus when it generates an interrupt. This means the system can be even more versatile, so that instead of being limited to one interrupt per peripheral, each device can supply an interrupt vector specific to the event that is causing the interrupt.

However, the processor must support this function, and most do not. Some processors have a feature known as a fast hardware interrupt. With this interrupt, only the program counter is saved. It assumes that the ISR will protect the contents of the registers by manually saving their state as required. A special and separate interrupt line is used to generate fast interrupts. A software interrupt is generated by an instruction. It is the lowest-priority interrupt and is generally used by programs to request a service to be performed by the system software (operating system or firmware).

So why are software interrupts used? For that matter, why use an operating system to perform tasks for us at all? It gets back to compatibility. Jumping to a subroutine (calling a function) is jumping to a specific address in memory. A future version of the system software may not locate the subroutines at the same addresses as earlier versions.

By using a software interrupt, our program does not need to know where the routines lie. It relies on the entry in the vector table to direct it to the correct location. CISC (Complex Instruction Set Computer) processors have a single processing unit, external memory, a relatively small register set, and many hundreds of different instructions. In many ways, they are just smaller versions of the processing units of mainframe computers from the 1960s. The tendency in processor design throughout the late 70s and early 80s was toward bigger and more complicated instruction sets.

The diversity of instructions in a CISC processor can run to well over 1,000 opcodes in some processors, such as the Motorola 68000. This had the advantage of making the job of the assembly-language programmer easier, since you had to write fewer lines of code to get the job done. As memory was slow and expensive, it also made sense to make each instruction do more. This reduced the number of instructions needed to perform a given function, and thereby reduced memory space and the number of memory accesses required to fetch instructions. As memory got cheaper and faster, and compilers became more efficient, the relative advantages of the CISC approach began to diminish.

One main disadvantage of CISC is that the processors themselves get increasingly complicated as a consequence of supporting such a large and diverse instruction set.


The control and instruction decode units are complex and slow, the silicon is large and hard to produce, and they consume a lot of power and therefore generate a lot of heat. As processors became more advanced, the overheads that CISC imposed on the silicon became oppressive. A given processor feature, when considered alone, may increase processor performance but may actually decrease the performance of the total system if it increases the total complexity of the device.

It was found that by streamlining the instruction set to the most commonly used instructions, the processors become simpler and faster. Fewer cycles are required to decode and execute each instruction, and the cycles are shorter. The drawback is that more (simpler) instructions are required to perform a task, but this is more than made up for in the performance boost to the processor. The realization of this led to a rethink of processor design. The result was the RISC architecture, which has led to the development of very high-performance processors. The basic philosophy behind RISC is to move the complexity from the silicon to the language compiler.

The hardware is kept as simple and fast as possible. A given complex instruction can be performed by a sequence of much simpler instructions. For example, many processors have an xor (exclusive OR) instruction for bit manipulation, and they also have a clear instruction to set a given register to zero. However, a register can also be set to zero by xor-ing it with itself.

Thus, the separate clear instruction is no longer required.


It can be replaced with the already present xor. Further, many processors are able to clear a memory location directly by writing a zero to it. That same function can be implemented by clearing a register and then storing that register to the memory location. The instruction to load a register with a literal number can be replaced with the instruction for clearing a register, followed by an add instruction with the literal number as its operand. Thus, six instructions (xor, clear reg, clear memory, load literal, store, and add) can be replaced with just three (xor, store, and add).
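The register-clearing trick is easy to check in C: any value xor-ed with itself is zero, and "load literal" really can be rebuilt from clear-plus-add. The function names below are just for illustration:

```c
#include <stdint.h>

/* x ^ x is always 0, so a separate "clear" instruction is
 * redundant on a machine that has xor. */
uint32_t clear_via_xor(uint32_t x)
{
    return x ^ x;
}

/* "Load literal n" rebuilt from the two simpler operations
 * described in the text. */
uint32_t load_literal(uint32_t reg, uint32_t n)
{
    reg = reg ^ reg;   /* clear the register */
    reg = reg + n;     /* add the literal    */
    return reg;
}
```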


The resulting code size is bigger, but the reduced complexity of the instruction decode unit can result in faster overall operation. Dozens of such code optimizations exist to give RISC its simplicity. RISC processors have a number of distinguishing characteristics. They have large register sets (in some architectures numbering over 1,000), thereby reducing the number of times the processor must access main memory. Often-used variables can be left inside the processor, reducing the number of accesses to slow external memory.

Compilers of high-level languages such as C take advantage of this to optimize processor performance. By having smaller and simpler instruction decode units, RISC processors have fast instruction execution, and this also reduces the size and power consumption of the processing unit. Generally, RISC instructions will take only one or two cycles to execute (this depends greatly on the particular processor).

This is in contrast to instructions for a CISC processor, whose instructions may take many tens of cycles to execute. For example, one instruction (integer multiplication) on one common CISC processor takes 42 cycles to complete. The same instruction on a RISC processor may take just one cycle. Instructions on a RISC processor have a simple format. All instructions are generally the same length, which makes instruction decode units simpler. RISC processors typically have a load/store architecture. This means that the only instructions that actually reference memory are load and store.

In contrast, many (if not most) instructions on a CISC processor may access or manipulate memory. On a RISC processor, all other instructions (aside from load and store) work on the registers only. This facilitates the ability of RISC processors to complete most of their instructions in a single cycle.

RISC processors also often have pipelined instruction execution. This means that while one instruction is being executed, the next instruction in the sequence is being decoded, while the third one is being fetched. At any given moment, several instructions will be in the pipeline and in the process of being executed.


Again, this provides improved processor performance. Thus, even though not all instructions may be completed in a single cycle, the processor may issue and retire instructions on each cycle, thereby achieving effective single-cycle execution. Some RISC processors have overlapped instruction execution, in which load operations may allow the execution of subsequent, unrelated instructions to continue before the data requested by the load has been returned from memory. This allows these instructions to overlap the load , thereby improving processor performance.
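The benefit of pipelining shows up in a simple cycle-count model: with s stages, n instructions take roughly s + (n - 1) cycles instead of s * n, because one instruction is retired per cycle once the pipeline is full. This sketch ignores stalls, branches, and other hazards:

```c
/* Cycles to run n instructions on an unpipelined machine where
 * each instruction takes `stages` cycles start to finish. */
long cycles_unpipelined(long n, long stages)
{
    return n * stages;
}

/* Cycles on an ideal pipeline: `stages` cycles to fill, then one
 * instruction retired per cycle thereafter. */
long cycles_pipelined(long n, long stages)
{
    if (n == 0)
        return 0;
    return stages + (n - 1);
}
```

For 1,000 instructions on a 5-stage machine the model gives 5,000 cycles unpipelined versus 1,004 pipelined, which is the "effective single-cycle execution" described above.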

Due to their low power consumption and computing power, RISC processors are becoming widely used, particularly in embedded computer systems, and many RISC attributes are appearing in traditionally CISC architectures, such as the Intel Pentium. If power consumption needs to be low, RISC is probably the better architecture to use. Digital Signal Processors (DSPs) have instruction sets and architectures optimized for the numerical processing of array data.

They often extend the Harvard architecture concept further by not only having separate data and code spaces, but also by splitting the data spaces into two or more banks. This allows concurrent instruction fetch and data accesses for multiple operands. DSPs have special hardware well suited to numerical processing of arrays. They often have hardware looping, whereby special registers allow for and control the repeated execution of an instruction sequence.

This is also often known as zero-overhead looping, since no conditions need to be explicitly tested by the software as part of the looping process. DSPs often have dedicated hardware for increasing the speed of arithmetic operations. DSP processors are commonly used in embedded applications, and many conventional embedded microcontrollers include some DSP functionality. Memory is used to hold data and software for the processor.
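The operation that hardware looping and dedicated arithmetic units accelerate is, in most DSP work, a multiply-accumulate (MAC) loop, i.e. a dot product. A C sketch of what the hardware loop computes (the function and its types are illustrative, not any DSP's API):

```c
#include <stdint.h>

/* The inner loop of most DSP work: multiply corresponding samples
 * and accumulate the products. A DSP's zero-overhead looping and
 * MAC hardware can execute one iteration of this per cycle. */
int32_t dot_product(const int16_t *a, const int16_t *b, int n)
{
    int32_t acc = 0;   /* a wide accumulator avoids overflow */
    for (int i = 0; i < n; i++)
        acc += (int32_t)a[i] * b[i];
    return acc;
}
```

Filters, correlations, and transforms all reduce to repeated runs of this loop over arrays of samples, which is why DSP architectures are built around it.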

There is a variety of memory types, and often a mix is used within a single system. Some memory will retain its contents while there is no power, yet will be slow to access. Other memory devices will be high-capacity, yet will require additional support circuitry and will be slower to access. Still other memory devices will trade capacity for speed, yielding relatively small devices, yet will be capable of keeping up with the fastest of processors.

Memory chips can be organized in two ways, either in word-organized or bit-organized schemes. In the word-organized scheme, complete nybbles, bytes, or words are stored within a single component, whereas with bit-organized memory, each bit of a byte or word is allocated to a separate component (see the accompanying figure). Memory chips come in different sizes, with the width specified as part of the size description. A chip of a given capacity may be organized as many narrow locations or as fewer wide ones; in both cases, the chip has exactly the same storage capacity, but organized in different ways.
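The two organizations hold the same number of bits. For example (an illustrative pairing, not from the original text), one 64K-by-8 chip and eight 64K-by-1 chips both store 512 Kbits. The arithmetic:

```c
#include <stdint.h>

/* Total bits stored by `chips` memory devices, each organized as
 * `locations` addressable words of `width` bits. */
uint64_t total_bits(uint64_t chips, uint64_t locations, uint64_t width)
{
    return chips * locations * width;
}
```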

When bit-organized devices are used to build a wider memory, the individual DRAMs are wired in parallel and are accessed simultaneously.
