Designing Embedded Hardware, Second Edition




















The second edition includes some software: "I won't even attempt to cover the instructions of each processor in this book. What I will do is show some simple assembly language techniques. While the instructions may be wildly different between architectures, the basic concepts are the same." Software professionals who want to design their own hardware will find a wealth of information in Designing Embedded Hardware, Second Edition, to help them penetrate the mysteries of building their own specialized devices and start them well on their way.

I hardly need to say at this point that I highly recommend it for hardware design rookies. By page 77, you're ready to start applying those principles. For too long, hardware design has been a black art. Or maybe just the dearth of good books has made it seem so.

I hope John Catsoulis' new book will be as successful in the market as it's been in showing me how to design hardware. I can't wait to put my new knowledge into practice.


Because of this lack of distinction, the processor is capable of changing its instructions (treating them as data) under program control. And because the processor has no way of distinguishing between data and instruction, it will blindly execute anything that it is given, whether it is a meaningful sequence of instructions or not.

Data has no inherent meaning. There is nothing to distinguish between a number that represents a dot of color in an image and a number that represents a character in a text document.
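To make this concrete, here is a minimal C sketch that reads the same four bytes first as text and then as a number. The function names and the big-endian byte order are illustrative assumptions, not something the book specifies.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Treat four bytes as characters: copy them into a NUL-terminated
   buffer supplied by the caller (at least 5 bytes long). */
void bytes_as_text(const uint8_t bytes[4], char out[5])
{
    assert(bytes != NULL && out != NULL);
    memcpy(out, bytes, 4);
    out[4] = '\0';
}

/* Treat the same four bytes as one unsigned number.
   Big-endian order is an assumption made for this sketch. */
uint32_t bytes_as_number(const uint8_t bytes[4])
{
    assert(bytes != NULL);
    return ((uint32_t)bytes[0] << 24) | ((uint32_t)bytes[1] << 16) |
           ((uint32_t)bytes[2] << 8)  |  (uint32_t)bytes[3];
}
```

With the bytes 0x43, 0x4F, 0x44, 0x45, the first function yields the ASCII text "CODE" and the second yields the number 0x434F4445. Nothing in the bytes themselves says which reading is intended.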

Meaning comes from how these numbers are treated under the execution of a program. Data and instructions share the same memory.

This means that sequences of instructions in a program may be treated as data by another program. A compiler creates a program binary by generating a sequence of numbers (instructions) in memory.

To the compiler, the compiled program is just data, and it is treated as such. It is a program only when the processor begins execution. Similarly, an operating system loading an application program from disk does so by treating the sequence of instructions of that program as data.

The program is loaded into memory just as an image or text file would be, and this is possible due to the shared memory space.

Memory is a linear one-dimensional array of storage locations. The processor's memory space may contain the operating system, various programs, and their associated data, all within the same linear space. Each location in the memory space has a unique, sequential address. The address of a memory location is used to specify and select that location.
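The number of unique addresses follows directly from the number of address bits. A small sketch, assuming a processor whose address is a plain binary number (the function name is hypothetical):

```c
#include <assert.h>
#include <stdint.h>

/* Number of distinct locations that can be selected with
   `address_bits` address lines: 2 raised to that power. */
uint64_t address_space_size(unsigned address_bits)
{
    assert(address_bits < 64);   /* keep the shift well defined */
    return (uint64_t)1 << address_bits;
}
```

For example, 16 address bits give 65,536 locations, which is 64 x 1,024, or "64K".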

The address space is the array of all addressable memory locations. For example, a processor with 16 address lines can select 2^16 = 65,536 locations. Hence, the processor is said to have a 64K address space.

Most microprocessors available are standard Von Neumann machines. The main deviation from this is the Harvard architecture, in which instructions and data have different memory spaces (see figure), with separate address, data, and control buses for each memory space.

This has a number of advantages in that instruction and data fetches can occur concurrently, and the size of an instruction is not set by the size of the standard data unit (word).

Figure: Harvard architecture

Buses allow for the transfer of electrical signals between different parts of the computer system and thereby transfer information from one device to another. For example, the data bus is the group of signal lines that carry data between the processor and the various subsystems that comprise the computer.

The "width" of a bus is the number of signal lines dedicated to transferring information. For example, an 8-bit-wide bus transfers 8 bits of data in parallel. The majority of microprocessors available today (with some exceptions) use the three-bus system architecture (see figure). The three buses are the address bus, the data bus, and the control bus.

Figure: Three-bus system

The data bus is bidirectional, the direction of transfer being determined by the processor.

The address bus carries the address, which points to the location in memory that the processor is attempting to access.

It is the job of external circuitry to determine in which external device a given memory location exists and to activate that device. This is known as address decoding. The control bus carries information from the processor about the state of the current access, such as whether it is a write or a read operation. The control bus can also carry information back to the processor regarding the current access, such as an address error. Different processors have different control lines, but there are some control lines that are common among many processors.
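The address decoding described above is done in hardware, but its logic can be sketched in C. The memory map below (ROM, RAM, and I/O regions within a 64K space) is entirely hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stdint.h>

typedef enum { SELECT_ROM, SELECT_RAM, SELECT_IO } chip_select_t;

/* Hypothetical 64K memory map (an assumption for this sketch):
     0x0000-0x7FFF  ROM
     0x8000-0xEFFF  RAM
     0xF000-0xFFFF  I/O devices
   Returns which device the decoder would activate. */
chip_select_t decode_address(uint32_t address)
{
    assert(address <= 0xFFFFu);  /* a 16-bit address bus carries no more */
    if (address <= 0x7FFFu) return SELECT_ROM;
    if (address <= 0xEFFFu) return SELECT_RAM;
    return SELECT_IO;
}
```

In a real design the same decision is made by logic gates or a programmable device looking at the upper address lines, so that exactly one device responds to each access.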

The control bus may consist of output signals such as read, write, valid address, etc. A processor usually has several input control lines too, such as reset, one or more interrupt lines, and a clock input. This was one of the world's first digital computers, designed and built in Sydney, Australia, in the late s. It was a massive machine, filling a very big room with the type of solid hardware that you can really kick.

It was quite an experience looking over the old machine. I remember at one stage walking through the disk controller (it was the size of a small room) and looking up at a mass of wires strung overhead.

I asked what they were for.

The internal data storage of the processor is known as its registers. The instructions that are read and executed by the processor control the data flow between the registers and the ALU (arithmetic logic unit). A symbolic representation of an ALU is shown in the figure. The values on which the ALU operates, called operands, are typically obtained from two registers, or from one register and a memory location. The result of the operation is then placed back into a given destination register or memory location.

The status outputs indicate any special attributes about the operation, such as whether the result was zero, negative, or if an overflow or carry occurred. Some processors have separate units for multiplication and division, and for bit shifting, providing faster operation and increased throughput. Each architecture has its own unique ALU features, and this can vary greatly from one processor to another.
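A small software model can show how an ALU result and its status outputs relate. This is a sketch of a generic 8-bit ALU with three operations; the type and function names are assumptions, and real ALUs differ in which flags each operation affects.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef enum { ALU_ADD, ALU_AND, ALU_XOR } alu_op_t;

typedef struct {
    uint8_t result;
    bool zero, negative, carry, overflow;   /* status outputs */
} alu_out_t;

/* Apply `op` to operands a and b and derive the status flags. */
alu_out_t alu(alu_op_t op, uint8_t a, uint8_t b)
{
    assert(op == ALU_ADD || op == ALU_AND || op == ALU_XOR);
    alu_out_t out = {0};
    unsigned wide = 0;              /* wider than 8 bits to catch the carry */

    if (op == ALU_ADD)      wide = (unsigned)a + b;
    else if (op == ALU_AND) wide = a & b;
    else                    wide = a ^ b;

    out.result   = (uint8_t)wide;
    out.zero     = (out.result == 0);
    out.negative = (out.result & 0x80u) != 0;   /* top bit set */
    out.carry    = (op == ALU_ADD) && (wide > 0xFFu);
    /* Signed overflow: both operands differ in sign from the result. */
    out.overflow = (op == ALU_ADD) &&
                   (((a ^ out.result) & (b ^ out.result) & 0x80u) != 0);
    return out;
}
```

Adding 0x7F and 0x01 gives 0x80 with the negative and overflow flags set, while adding 0xFF and 0x01 gives 0x00 with the zero and carry flags set.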

All, however, are just variations on a theme, and all share the common characteristics just described.

Interrupts

Interrupts (also known as traps or exceptions in some processors) are a technique of diverting the processor from the execution of the current program so that it may deal with some event that has occurred.

An interrupt is generated in your computer every time you type a key or move the mouse. You can think of it as a hardware-generated function call. Rather than repeatedly checking its devices for events, the processor may continue with other tasks until a device signals it. Interrupts can be of varying priorities in some processors, thereby assigning differing importance to the events that can interrupt the processor. If the processor is servicing a low-priority interrupt, it will pause it in order to service a higher-priority interrupt.

However, if the processor is servicing an interrupt and a second, lower-priority interrupt occurs, the processor will ignore that interrupt until it has finished the higher-priority service. When an interrupt occurs, the usual procedure is for the processor to save its state by pushing its registers and program counter onto the stack.

The processor then loads an interrupt vector into the program counter. The interrupt vector is the address at which an interrupt service routine (ISR) lies. Thus, loading the vector into the program counter causes the processor to begin execution of the ISR, performing whatever service the interrupting device required.

Returning from the ISR causes the processor to reload its saved state (registers and program counter) from the stack and resume its original program. Interrupts are largely transparent to the original program. This means that the original program is completely "unaware" that the processor was interrupted, save for a lost interval of time. Processors with shadow registers use these to save their current state, rather than pushing their register bank onto the stack.

This saves considerable memory accesses (and therefore time) when processing an interrupt. However, since only one set of shadow registers exists, a processor servicing multiple interrupts must "manually" preserve the state of the registers before servicing the higher-priority interrupt. If it does not, important state information will be lost. Upon returning from an ISR, the contents of the shadow registers are swapped back into the main register array.
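The stack-based save, vector, and restore sequence described above can be modeled in a few lines of C. This is a simplified simulation, not any particular processor's mechanism: only the program counter is saved, the vector table holds function pointers rather than addresses, and all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_VECTORS 8
#define STACK_DEPTH 16

typedef void (*isr_t)(void);

typedef struct {
    uint32_t pc;                    /* program counter */
    uint32_t stack[STACK_DEPTH];    /* simulated stack */
    size_t   sp;                    /* number of words on the stack */
    isr_t    vectors[NUM_VECTORS];  /* interrupt vector table */
} cpu_t;

static unsigned key_presses;        /* work done by the example ISR */

/* Example ISR: counts events from a hypothetical keyboard. */
void keyboard_isr(void) { key_presses++; }

/* Save state, run the ISR selected by `vector`, then restore state. */
void take_interrupt(cpu_t *cpu, unsigned vector)
{
    assert(cpu != NULL && vector < NUM_VECTORS);
    assert(cpu->vectors[vector] != NULL && cpu->sp < STACK_DEPTH);

    cpu->stack[cpu->sp++] = cpu->pc;   /* push the program counter */
    cpu->vectors[vector]();            /* vector to the ISR and run it */
    cpu->pc = cpu->stack[--cpu->sp];   /* return from interrupt: pop it */
}
```

After `take_interrupt` returns, the program counter holds its original value, which is what makes the interrupt transparent to the interrupted program.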

One way to tell when a device is ready is busy waiting, or polling, where the processor continuously checks the device's status register until the device is ready. This wastes the processor's time but is the simplest to implement. For some time-critical applications, polling can reduce the time it takes for the processor to respond to a change of state in a peripheral.
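A busy-wait loop might look like the following sketch. Because there is no real hardware here, the status register is simulated by a structure that becomes ready after a set number of reads; the READY bit position and all names are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define STATUS_READY 0x01u   /* assumed position of the "ready" bit */

typedef struct {
    unsigned reads_until_ready;  /* simulation only: delay before ready */
    uint8_t  data;               /* value the device will deliver */
} sim_device_t;

/* Stand-in for reading a hardware status register. */
uint8_t read_status(sim_device_t *dev)
{
    if (dev->reads_until_ready > 0) {
        dev->reads_until_ready--;
        return 0;                /* not ready yet */
    }
    return STATUS_READY;
}

/* Poll until the device is ready, then fetch its data.
   *polls receives the number of unsuccessful checks. */
uint8_t poll_and_read(sim_device_t *dev, unsigned *polls)
{
    assert(dev != NULL && polls != NULL);
    *polls = 0;
    while ((read_status(dev) & STATUS_READY) == 0) {
        (*polls)++;              /* processor time spent doing nothing useful */
    }
    return dev->data;
}
```

On real hardware the status register would be accessed through a volatile pointer to a memory-mapped address, and the wasted loop iterations are the cost of this approach.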

A better way is for the device to generate an interrupt to the processor when it is ready for a transfer to take place. Small, simple processors may only have one or two interrupt inputs, so several external devices may have to share the interrupt lines of the processor.

When an interrupt occurs, the processor must check each device to determine which one generated the interrupt. This can also be considered a form of polling. The advantage of interrupt polling over ordinary polling is that the polling occurs only when there is a need to service a device. Polling interrupts is suitable only in systems that have a small number of devices; otherwise, the processor will spend too long trying to determine the source of the interrupt.
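Interrupt polling on a shared line can be sketched as a single service routine that checks every device for a pending request. The device structure and its fields are hypothetical stand-ins for real status registers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    bool     pending;    /* device is requesting service */
    unsigned serviced;   /* how many times it has been serviced */
} shared_dev_t;

/* Called when the shared interrupt line is asserted: check each device
   in turn and service those with a request pending.
   Returns the number of devices serviced. */
unsigned service_shared_interrupt(shared_dev_t devs[], size_t count)
{
    assert(devs != NULL);
    unsigned handled = 0;
    for (size_t i = 0; i < count; i++) {
        if (devs[i].pending) {
            devs[i].pending = false;   /* acknowledge the request */
            devs[i].serviced++;
            handled++;
        }
    }
    return handled;
}
```

The loop runs only when an interrupt has actually occurred, but its cost still grows with the number of devices sharing the line.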

Vectored interrupts considerably reduce the time it takes the processor to determine the source of the interrupt. If an interrupt request can be generated from more than one source, it is therefore necessary to assign priority levels to the different interrupts. This can be done in either hardware or software, depending on the particular application. In this scheme, the processor has numerous interrupt lines, with each interrupt corresponding to a given interrupt vector.

So, for example, when an interrupt of priority 7 occurs (interrupt lines corresponding to "7" are asserted), the processor loads vector 7 into its program counter and starts executing the service routine specific to interrupt 7. Vectored interrupts can be taken one step further. Some processors and devices support having the device itself place the appropriate vector onto the data bus when it generates an interrupt.

This means the system can be even more versatile, so that instead of being limited to one interrupt per peripheral, each device can supply an interrupt vector specific to the event that is causing the interrupt. However, the processor must support this function, and most do not. Some processors have a feature known as a fast hardware interrupt. With this interrupt, only the program counter is saved. It assumes that the ISR will protect the contents of the registers by manually saving their state as required.

A special and separate interrupt line is used to generate fast interrupts.

A software interrupt is generated by an instruction rather than by a hardware signal. It is the lowest-priority interrupt and is generally used by programs to request a service to be performed by the system software (operating system or firmware). So why are software interrupts used? Why isn't the appropriate section of code called directly? For that matter, why use an operating system to perform tasks for us at all? It gets back to compatibility.

Jumping to a subroutine (calling a function) is jumping to a specific address in memory. A future version of the system software may not locate the subroutines at the same addresses as earlier versions. By using a software interrupt, our program does not need to know where the routines lie. It relies on the entry in the vector table to direct it to the correct location.

CISC processors have a single processing unit, external memory, a relatively small register set, and many hundreds of different instructions.

In many ways, they are just smaller versions of the processing units of mainframe computers from the s. The tendency in processor design throughout the late 70s and early 80s was toward bigger and more complicated instruction sets. Well, with CISC (80x86 family), there's a single instruction to do it! The diversity of instructions in a CISC processor can run to well over 1, opcodes in some processors, such as the Motorola This had the advantage of making the job of the assembly-language programmer easier, since you had to write fewer lines of code to get the job done.

As memory was slow and expensive, it also made sense to make each instruction do more. This reduced the number of instructions needed to perform a given function, and thereby reduced memory space and the number of memory accesses required to fetch instructions. As memory got cheaper and faster, and compilers became more efficient, the relative advantages of the CISC approach began to diminish.

One main disadvantage of CISC is that the processors themselves get increasingly complicated as a consequence of supporting such a large and diverse instruction set. The control and instruction decode units are complex and slow, the silicon is large and hard to produce, and they consume a lot of power and therefore generate a lot of heat. As processors became more advanced, the overheads that CISC imposed on the silicon became oppressive. A given processor feature when considered alone may increase processor performance but may actually decrease the performance of the total system, if it increases the total complexity of the device.

It was found that by streamlining the instruction set to the most commonly used instructions, the processors become simpler and faster. Fewer cycles are required to decode and execute each instruction, and the cycles are shorter.

The drawback is that more (simpler) instructions are required to perform a task, but this is more than made up for in the performance boost to the processor. The realization of this led to a rethink of processor design. The result was the RISC architecture, which has led to the development of very high-performance processors.

The basic philosophy behind RISC is to move the complexity from the silicon to the language compiler. The hardware is kept as simple and fast as possible.

A given complex instruction can be performed by a sequence of much simpler instructions. For example, many processors have an xor (exclusive OR) instruction for bit manipulation, and they also have a clear instruction to set a given register to zero. However, a register can also be set to zero by xor-ing it with itself. Thus, the separate clear instruction is no longer required.

It can be replaced with the already present xor. Further, many processors are able to clear a memory location directly by writing a zero to it. That same function can be implemented by clearing a register and then storing that register to the memory location. The instruction to load a register with a literal number can be replaced with the instruction for clearing a register, followed by an add instruction with the literal number as its operand.

Thus, six instructions (xor, clear reg, clear memory, load literal, store, and add) can be replaced with just three (xor, store, and add). So the following CISC assembly pseudocode:

    clear 0x          ; clear memory location 0x
    load r1, 5        ; load register 1 with the value 5

becomes the following RISC pseudocode:

    xor r1,r1         ; clear register 1
    store r1,0x       ; clear memory location 0x
    add r1, 5         ; load register 1 with the value 5

The resulting code size is bigger, but the reduced complexity of the instruction decode unit can result in faster overall operation.
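The equivalence of the two sequences can be checked with a small C model. The machine structure, the memory size, and the use of an array index in place of a real address are assumptions made only for this sketch.

```c
#include <assert.h>
#include <stdint.h>

#define MEM_SIZE 16

typedef struct {
    uint32_t r1;             /* register 1 */
    uint32_t mem[MEM_SIZE];  /* a small simulated memory */
} machine_t;

/* CISC style: two richer instructions. */
void cisc_sequence(machine_t *m, unsigned addr)
{
    assert(m != 0 && addr < MEM_SIZE);
    m->mem[addr] = 0;        /* clear addr   */
    m->r1 = 5;               /* load r1, 5   */
}

/* RISC style: three simpler instructions with the same effect. */
void risc_sequence(machine_t *m, unsigned addr)
{
    assert(m != 0 && addr < MEM_SIZE);
    m->r1 ^= m->r1;          /* xor r1,r1    : any value xor itself is 0 */
    m->mem[addr] = m->r1;    /* store r1,addr: writes the zero to memory */
    m->r1 += 5;              /* add r1, 5    : 0 + 5 leaves 5 in r1      */
}
```

Whatever the register and the memory location held beforehand, both sequences leave the memory location at zero and register 1 holding 5.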

Dozens of such code optimizations exist to give RISC its simplicity. RISC processors have a number of distinguishing characteristics. They have large register sets in some architectures numbering over 1, , thereby reducing the number of times the processor must access main memory. Often-used variables can be left inside the processor, reducing the number of accesses to slow external memory.

Compilers of high-level languages such as C take advantage of this to optimize processor performance. By having smaller and simpler instruction decode units, RISC processors have fast instruction execution, and this also reduces the size and power consumption of the processing unit. Generally, RISC instructions will take only one or two cycles to execute (this depends greatly on the particular processor).

This is in contrast to a CISC processor, whose instructions may take many tens of cycles to execute. For example, one instruction (integer multiplication) on one CISC processor takes 42 cycles to complete. The same instruction on a RISC processor may take just one cycle.

Instructions on a RISC processor have a simple format. All instructions are generally the same length, which makes instruction decode units simpler. RISC processors use a load/store architecture. This means that the only instructions that actually reference memory are load and store. In contrast, many (most) instructions on a CISC processor may access or manipulate memory. On a RISC processor, all other instructions (aside from load and store) work on the registers only. This facilitates the ability of RISC processors to complete most of their instructions in a single cycle.

RISC processors also often have pipelined instruction execution. This means that while one instruction is being executed, the next instruction in the sequence is being decoded, while the third one is being fetched.
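The benefit of overlapping fetch, decode, and execute can be estimated with some simple arithmetic. The sketch below assumes an idealized pipeline in which every stage takes one cycle and nothing ever stalls, which real processors do not guarantee; the function names are hypothetical.

```c
#include <assert.h>

/* Cycles needed when each instruction passes through every stage
   before the next instruction starts. */
unsigned long unpipelined_cycles(unsigned long instructions, unsigned stages)
{
    return instructions * stages;
}

/* Cycles needed with an ideal pipeline: the first instruction takes
   `stages` cycles, and each later one completes one cycle after the
   instruction ahead of it. */
unsigned long pipelined_cycles(unsigned long instructions, unsigned stages)
{
    assert(stages > 0);
    if (instructions == 0) {
        return 0;
    }
    return stages + (instructions - 1);
}
```

For 100 instructions on a three-stage pipeline this gives 102 cycles instead of 300, which approaches one instruction per cycle as the instruction count grows.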

At any given moment, several instructions will be in the pipeline and in the process of being executed. Again, this provides improved processor performance. Thus, even though not all instructions may be completed in a single cycle, the processor may issue and retire instructions on each cycle, thereby achieving effective single-cycle execution. Some RISC processors have overlapped instruction execution, in which load operations may allow the execution of subsequent, unrelated instructions to continue before the data requested by the load has been returned from memory.

This allows these instructions to overlap the load, thereby improving processor performance. Due to their low power consumption and computing power, RISC processors are becoming widely used, particularly in embedded computer systems, and many RISC attributes are appearing in what are traditionally CISC architectures such as with the Intel Pentium.

If power consumption needs to be low, then RISC is probably the better architecture to use. However, if the available space for program storage is small, then a CISC processor may be a better alternative, since CISC instructions get more "bang" for the byte.

Digital signal processors (DSPs) have instruction sets and architectures optimized for numerical processing of array data. They often extend the Harvard architecture concept further by not only having separate data and code spaces, but also by splitting the data spaces into two or more banks.

This allows concurrent instruction fetch and data accesses for multiple operands. DSPs have special hardware well suited to numerical processing of arrays. They often have hardware looping, whereby special registers allow for and control the repeated execution of an instruction sequence. This is also often known as zero-overhead looping, since no conditions need to be explicitly tested by the software as part of the looping process.

DSPs often have dedicated hardware for increasing the speed of arithmetic operations. DSP processors are commonly used in embedded applications, and many conventional embedded microcontrollers include some DSP functionality.

Memory

Memory is used to hold data and software for the processor. There is a variety of memory types, and often a mix is used within a single system. Some memory will retain its contents while there is no power, yet will be slow to access. Other memory devices will be high-capacity, yet will require additional support circuitry and will be slower to access.

Still other memory devices will trade capacity for speed, yielding relatively small devices, yet will be capable of keeping up with the fastest of processors. Memory chips can be organized in two ways, either in word-organized or bit-organized schemes. In the word-organized scheme, complete nybbles, bytes, or words are stored within a single component, whereas with bit-organized memory, each bit of a byte or word is allocated to a separate component (see figure).

Figure: Eight bit-organized 8x1 devices and one word-organized 8x8 device

Memory chips come in different sizes, with the width specified as part of the size description.

In both cases, each chip has exactly the same storage capacity, but organized in different ways. However, because the DRAMs are organized in parallel, they are accessed simultaneously. It is common practice for multiple DRAMs to be placed on a memory module.
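Because devices placed in parallel each contribute their own width to the bus, the number of chips needed is simply the bus width divided by the chip width. A minimal sketch (the function name is hypothetical):

```c
#include <assert.h>

/* Number of memory chips of `chip_width_bits` needed, side by side,
   to fill a data bus of `bus_width_bits`. */
unsigned chips_for_bus(unsigned bus_width_bits, unsigned chip_width_bits)
{
    assert(chip_width_bits > 0);
    assert(bus_width_bits % chip_width_bits == 0);  /* must divide evenly */
    return bus_width_bits / chip_width_bits;
}
```

For a 32-bit bus, this gives thirty-two x1 devices, eight x4 devices, or four x8 devices.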

This is the common way that DRAMs are installed in standard computers. The common widths for memory chips are x1, x4, and x8, although x16 devices are available. A 32-bit-wide bus can be implemented with thirty-two x1 devices, eight x4 devices, or four x8 devices.

RAM stands for random access memory. This is a bit of a misnomer, since most (all) computer memory may be considered "random access." RAM is where the processor may easily write data for temporary storage. RAM is generally volatile, losing its contents when the system loses power.

Any information stored in RAM that must be retained must be written to some form of permanent storage before the system powers down. There are special nonvolatile RAMs that integrate a battery-backup system, such that the RAM remains powered even when the rest of the computer system has shut down.

SRAMs use pairs of logic gates to hold each bit of data. SRAMs are the fastest form of RAM available, require little external support circuitry, and have relatively low power consumption. Their drawbacks are that their capacity is considerably less than DRAM, while being much more expensive. Their relatively low capacity requires more chips to be used to implement the same amount of memory. A modern PC built using nothing but SRAM would be a considerably bigger machine and would cost a small fortune to produce.

It would be very fast, however. DRAM uses arrays of what are essentially capacitors to hold individual bits of data. The capacitor arrays will hold their charge only for a short period before it begins to diminish. Therefore, DRAMs need continuous refreshing, every few milliseconds or so. This perpetual need for refreshing requires additional support and can delay processor access to the memory. If a processor access conflicts with the need to refresh the array, the refresh cycle must take precedence.

DRAMs are the highest-capacity memory devices available and come in a wide and diverse variety of subspecies. Interfacing DRAMs to small microcontrollers is generally not possible, and certainly not practical. Most processors with large address spaces include support for DRAMs.

Connecting DRAMs to such processors is simply a case of "connecting the dots" (or pins, as the case may be).

Caches are often, but not always, internal to the processor and are implemented with fast memory cells and high-speed data paths. Instruction execution normally runs out of the instruction cache, providing for fast execution. The processor is capable of rapidly reloading the caches from main memory should a cache miss occur. Some processors have logic that is able to anticipate a cache miss and begin the cache reload prior to the cache miss occurring.

ROM stands for read-only memory. This is also a bit of a misnomer, since many modern ROMs can also be written to. ROMs are nonvolatile memory, requiring no power to retain their contents. The primary purpose of ROM within a system is to hold the code (and sometimes data) that needs to be present at power-up.

It may contain either a bootloader program (to load an operating system off disk or network) or, in the case of an embedded system, it may contain the application itself. Many microcontrollers contain on-chip ROM, thereby reducing component count and simplifying system design. Standard ROM is fabricated (in a simplistic sense) from a large array of diodes.

The term "burning" a ROM comes from the fact that the programming process is performed by passing a sufficiently large current through the appropriate diodes to "blow them," or burn them, thereby creating a zero at that bit location. A device known as a ROM burner can accomplish this, or, if the system supports it, the ROM may be programmed in-circuit.

Computer manufacturers typically use one-time programmable (OTP) ROMs in systems where the firmware is stable and the product is shipping in bulk to customers.

Mask-programmable ROMs are also one-time programmable, but unlike OTPs, they are burned by the chip manufacturer prior to shipping. Like OTPs, they are used once the software is known to be stable and have the advantage of lowering production costs for large shipments. Because they cannot be reprogrammed, OTPs make for a very expensive development option. No sane person uses OTPs for development work.

Shining ultraviolet light through a small window on the top of the chip can erase the EPROM, allowing it to be reprogrammed and reused. EPROMs are pin- and signal-compatible with comparable OTP and mask devices. The drawback with EPROM technology is that the chip must be removed from the circuit to be erased, and the erasure can take many minutes to complete. The chip is then inserted into the burner, loaded with software, and then placed back in-circuit. This can lead to very slow debug cycles.

Further, it makes the device useless for storing changeable system parameters. EPROMs are relatively rare these days. You can still buy them, but flash-based memory (to be discussed shortly) is far more common and is the medium of choice.

EEROMs can be erased and reprogrammed in-circuit. Their capacity is significantly smaller than standard ROM (typically only a few kilobytes), and so they are not suited to holding firmware. Instead, they are typically used for holding system parameters and mode information to be retained during power-off. It is common for many microcontrollers to incorporate a small EEROM on-chip for holding system parameters.

This is especially useful in embedded systems and may be used for storing network addresses, configuration settings, serial numbers, servicing records, and so on. Flash is normally organized as sectors and has the advantage that individual sectors may be erased and rewritten without affecting the contents of the rest of the device. Typically, before a sector can be written, it must be erased. It can't just be written over as with a RAM. There are several different flash technologies, and the erasing and programming requirements of flash devices vary from manufacturer to manufacturer.
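The erase-before-write rule can be illustrated with a toy model of a flash device. The model assumes behavior typical of many flash parts, namely that erasing sets every bit in a sector to 1 and programming can only change bits from 1 to 0. That behavior, the tiny sector sizes, and all names are assumptions for this sketch, since requirements vary from manufacturer to manufacturer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SECTOR_SIZE 8    /* bytes per sector (tiny, for illustration) */
#define NUM_SECTORS 4

typedef struct {
    uint8_t cells[NUM_SECTORS][SECTOR_SIZE];
} flash_t;

/* Erase one sector: every bit in it becomes 1. Other sectors are untouched. */
void flash_erase_sector(flash_t *f, unsigned sector)
{
    assert(f != NULL && sector < NUM_SECTORS);
    memset(f->cells[sector], 0xFF, SECTOR_SIZE);
}

/* Program one byte. In this model programming can only clear bits,
   so the stored value is the AND of the old and new values.
   Returns true if the byte now holds exactly `value`. */
bool flash_program_byte(flash_t *f, unsigned sector, unsigned offset,
                        uint8_t value)
{
    assert(f != NULL && sector < NUM_SECTORS && offset < SECTOR_SIZE);
    f->cells[sector][offset] &= value;
    return f->cells[sector][offset] == value;
}
```

Writing 0x5A to an erased byte succeeds, but then writing 0xA5 over it leaves 0x00 rather than 0xA5, so the sector must be erased first. Erasing that sector leaves data in the other sectors intact.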

Some examples of I/O devices are serial controllers that communicate with keyboards, mice, modems, etc. An external device will interrupt the processor (assert an interrupt control line into the processor), at which time the processor will suspend the current task (program) and begin executing an interrupt service routine.

The service of an interrupt may involve transferring data from input to memory, or from memory to output. DMA is used in high-speed systems, where the rate of data transfer is important. Not all processors support DMA. Let's say you want to read in M from disk and store it in memory. You have two options. One option is for the processor to read one byte at a time from the disk controller into a register and then store the contents of the register to the appropriate memory location.

For each byte transferred, the processor must read an instruction, decode the instruction, read the data, read the next instruction, decode the instruction, and then store the data. Then the process starts over again for the next byte. The second option in moving large amounts of data around the system is DMA. There are several ways in which this could be implemented by the system designer. The most common approach (and probably the simplest) is to suspend the operation of the processor and for the processor to "release" its buses (the buses are tristate).

This allows the DMAC (DMA controller) to "take over" the buses for the short period required to perform the transfer.

The transfers involve a load operation from a source address followed by a store operation to a destination address. Standard block transfers are initiated under software control and are used for moving data structures from one region of memory to another.

Demand-mode transfers
Similar to standard mode except that the transfer is controlled by an external device.

Fly-by transfers
Provide high-speed data movement in the system. Instead of using multiple bus accesses as with conventional DMA transfers, fly-by transfers move data from source to destination in a single access. The data is not read into the DMAC before going to its destination.

Data-chaining transfers
Allow DMA transfers to be performed as specified by a linked list in memory. Data chaining is started by specifying a pointer to a descriptor in memory. The descriptor is a table specifying byte count, source address, destination address, and a pointer to the next descriptor. The DMAC loads the relevant information about the transfer from this table and begins moving data. The transfer continues until the number of bytes transferred is equal to the entry in the byte-count field. On completion, the pointer to the next descriptor is loaded. This continues until a null pointer is found.

To illustrate the use of DMA, let's consider the example of a fly-by transfer of data from a hard-disk controller to memory. This setup involves specifying the source, destination, and size of the data, as well as other parameters. The disk controller generates a request for service to the DMAC (not the processor).

The processor completes the current instruction; places the address, control, and data buses in a high-impedance state (floats, tristates, or releases them); responds to the DMAC with a HOLD-acknowledge or BG (bus granted); and enters a dormant state. Upon receiving a HOLD-acknowledge, the DMAC places the address of the memory location where the transfer to memory will begin onto the address bus and generates a WRITE to the memory while the disk controller places the data on the data bus.

Hence, a direct memory access is accomplished from the disk controller to the memory. DMACs are capable of handling block transfers of data. Once the transfer is complete, the buses are returned to the processor and it resumes normal operation. Some DMA controllers simply read data from a source, hold it internally, and then store it to a destination.

They perform the transfer in exactly the same way that a processor would. The advantage in using a DMA controller instead of a processor is that if the transfer were to be performed by the processor, each transfer would still have program fetches associated with it. Thus, even though the transfer takes place by sequential reads and writes, the DMA controller does not also have to fetch and execute code, thereby providing a faster transfer than a processor.
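The data-chaining scheme described earlier, with its linked list of descriptors, can be modeled in software. This sketch uses memcpy in place of the controller's bus cycles, and the structure layout and names are assumptions rather than the register format of any real DMAC.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One entry in the chain: what to move, where to, and what comes next. */
typedef struct dma_descriptor {
    size_t                       byte_count;
    const uint8_t               *source;
    uint8_t                     *destination;
    const struct dma_descriptor *next;   /* NULL marks the end of the chain */
} dma_descriptor_t;

/* Walk the chain, performing each transfer in turn, until a null
   pointer is found. Returns the total number of bytes moved. */
size_t dma_run_chain(const dma_descriptor_t *first)
{
    size_t total = 0;
    for (const dma_descriptor_t *d = first; d != NULL; d = d->next) {
        assert(d->source != NULL && d->destination != NULL);
        memcpy(d->destination, d->source, d->byte_count);
        total += d->byte_count;
    }
    return total;
}
```

The processor's only job is to build the descriptors and hand over a pointer to the first one. Several scattered blocks can then be moved without further software involvement.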

Support for DMA is normally not found in small microcontrollers. Some mid-range processors (16-bit, low-end 32-bit) may have DMA support. Similarly, peripherals intended for small-scale computers will not provide DMA support, whereas peripherals intended for high-speed and powerful computers definitely will have DMA support.

Parallel and Distributed Computers

Some embedded applications require greater performance than is achievable from a single processor. For cost reasons, it may not be practical to implement a design with the latest superscalar RISC processor, or perhaps the application lends itself to distributed processing where the tasks are run across several communicating machines.

It may make more sense to use a fleet of lower-cost processors, distributed throughout the installation. It is becoming increasingly common to see embedded systems implemented using parallel processors.

Conventional computers usually have a single, sequential processor.

The main limitation of this form of computing architecture is that the conventional processor is able to execute only one instruction at a time. Algorithms that run on these machines must therefore be expressed as a sequential problem. A given task must be broken down into a series of sequential steps, each to be executed in order, one at a time.

Many problems that are computationally intensive are also highly parallel. These problems are characterized by an algorithm that is applied to a large data set.

Often the computation for each element in the data set is the same and is only loosely reliant on the results from computations on neighboring data.
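A small sketch makes this property concrete. The three-point smoothing filter below is my own choice of example, and the function names are hypothetical. Each output depends only on an input element and its immediate neighbors, and the input is never modified, so the elements can be computed in any order and give the same answer.

```python
def smooth_element(data, i):
    """Average element i with its immediate neighbors (edges are clamped)."""
    left = data[max(i - 1, 0)]
    right = data[min(i + 1, len(data) - 1)]
    return (left + data[i] + right) / 3


def smooth_serial(data):
    """Compute every element in order, one at a time."""
    return [smooth_element(data, i) for i in range(len(data))]


def smooth_in_order(data, order):
    """Compute the same results, visiting elements in an arbitrary order."""
    result = [None] * len(data)
    for i in order:
        result[i] = smooth_element(data, i)
    return result


if __name__ == "__main__":
    samples = [3, 6, 9, 12, 0]
    print(smooth_serial(samples))
    print(smooth_in_order(samples, [4, 2, 0, 3, 1]))
```

Because the visiting order does not matter, nothing forces the work to be done one element at a time.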

Thus, speed advantages may be gained from performing calculations in parallel for each element in the data set, rather than sequentially moving through the data set and computing each result in a serial manner. Machines with multitudes of processors working on a data structure in parallel often far outperform conventional computers in such applications.

The grain of the computer is defined as the number of processing elements within the machine. A coarsely grained machine has relatively few processors, whereas a finely grained machine may have tens of thousands of processing elements. Typically, the processing elements of a finely grained machine are much less powerful than those of a coarsely grained computer. The processing power is achieved through the brute-force approach of having such a large number of processing elements.

There are several different forms of parallel machine. Each architecture has its own advantages and limitations, and each has its share of supporters.

In an SIMD (single instruction, multiple data) machine, each processing element has a small amount of local memory. The instructions executed by the SIMD computer are broadcast from a central instruction server to every processing element within the machine.

In this way, each processor executes the same instruction as all other processing elements within the machine. Since each processor executes the instruction on its local data, all elements within the data structure are worked upon simultaneously. The SIMD machine is generally used in conjunction with a conventional computer.
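The broadcast scheme can be sketched in a few lines. This is a toy model only: the class and method names are hypothetical, each processing element's local memory is reduced to a single value, and an ordinary loop stands in for hardware that would run all elements in lockstep.

```python
from typing import Callable, List


class ProcessingElement:
    """A processing element with a small local memory (here, one value)."""

    def __init__(self, value: int = 0) -> None:
        self.local = value

    def execute(self, instruction: Callable[[int], int]) -> None:
        # Apply the broadcast instruction to this element's local data.
        self.local = instruction(self.local)


class SimdArray:
    """An array of identical elements driven by one instruction stream."""

    def __init__(self, size: int) -> None:
        self.elements = [ProcessingElement() for _ in range(size)]

    def load(self, data: List[int]) -> None:
        # The host places one data item in each element's local memory.
        if len(data) != len(self.elements):
            raise ValueError("data set must match the number of elements")
        for element, value in zip(self.elements, data):
            element.local = value

    def broadcast(self, instruction: Callable[[int], int]) -> None:
        # Every element receives the same instruction. The loop is a
        # sequential stand-in for simultaneous execution.
        for element in self.elements:
            element.execute(instruction)

    def read_back(self) -> List[int]:
        return [element.local for element in self.elements]


if __name__ == "__main__":
    array = SimdArray(4)
    array.load([1, 2, 3, 4])
    array.broadcast(lambda x: x * 2)
    array.broadcast(lambda x: x + 1)
    print(array.read_back())
```

Note that `load` rejects a data set whose size differs from the number of elements. A real machine would need some scheme for handling that mismatch, which this sketch does not attempt.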

The Connection Machine CM-1, for example, was a finely grained SIMD computer with up to 64K processing elements that appeared as a block of 64K of "intelligent memory" to the host system. An application running on the host downloaded a data set into the processor array of the CM-1, each processor within the CM-1 acting as a single memory unit. The host then issued instructions to each processing element of the CM-1 simultaneously.

After the computations were completed, the host then read back the result from the CM-1 as though it were conventional memory.

The primary advantage of the SIMD machine is that simple and cheap processing elements are used to form the computer. Thus, significant computing power is available using inexpensive, off-the-shelf components. In addition, since each processor is executing the same instructions and therefore sharing a common instruction fetch, the architecture of the machine is somewhat simpler. Only one instruction store is required for the entire computer.

The use of multiple processing elements, each executing the same instructions in unison, is also the SIMD's main disadvantage.

Many problems do not lend themselves to being broken down into a form suitable for executing on an SIMD computer. In addition, the data sets associated with a given problem may not match well with a given SIMD architecture. For example, an SIMD machine with 10k processing elements does not mesh well with a data set of 12k data elements.

MIMD (multiple instruction, multiple data) machines are typically coarsely grained collections of semi-autonomous processors, each with its own local memory and local programs.

An algorithm being executed on an MIMD computer is typically broken up into a series of smaller sub-problems, each executed on a processor of the MIMD machine. MIMD computers tend to use a smaller number of very powerful processors, rather than a large number of less powerful ones. Shared-memory MIMD systems have an array of high-speed processors, each with local memory or cache, and each with access to a large, global memory Figure


