MCLR5 QUAD-ISSUE SUPERSCALAR RISC V

I have completed the framework for the MCLR5 quad-issue superscalar RISC-V core. It supports most of the RV32I Base Instruction Set, except for the timers and shifters. I have targeted the core to the Xilinx UltraScale+ and the Intel Stratix 10 series FPGAs.

Each of the MCLR5 RISC-V cores is implemented as a combinational, single-cycle ALU. Up to four register updates can occur per clock cycle, and results are forwarded to the cores that need them within the same clock cycle. For example, if core0 updates r4 and core1 uses r4 as an input, the updated value is passed to core1.
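
Here is a minimal behavioral sketch of that same-cycle forwarding in Python. It is not the MCLR5 RTL: the function name `execute_group` and the `(rd, rs1, imm)` tuple format are made up for illustration, and only ADDI-style operations are modeled.

```python
def execute_group(regs, group):
    """Model one clock cycle of up to four ADDI-style ops (rd, rs1, imm).

    'forwarded' starts as the register file at the beginning of the cycle
    and picks up each slot's result, so a later core sees an earlier
    core's value within the same cycle.
    """
    assert len(group) <= 4, "quad-issue: at most four instructions per cycle"
    forwarded = dict(regs)
    for rd, rs1, imm in group:  # slot order: core0, core1, core2, core3
        result = (forwarded[rs1] + imm) & 0xFFFFFFFF  # 32-bit wraparound
        if rd != 0:  # register 0 is hard-wired to zero in RISC-V
            forwarded[rd] = result
    return forwarded  # register file contents after the clock edge


regs = {i: 0 for i in range(32)}
# core0 writes r4 and core1 reads r4 in the same cycle.
after = execute_group(regs, [(4, 0, 7), (5, 4, 1)])
print(after[4], after[5])  # prints: 7 8
```

In this model, four back-to-back `(1, 1, 1)` operations leave r1 at 4 after a single call, since each slot sees the previous slot's result.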

Loads and stores are handled by core0 and are treated initially as JUMP opcodes. The MCLR5 performs a JUMP to the address containing the LOAD/STORE instruction which aligns it with core0. Instructions following the LOAD/STORE are blocked until the LOAD/STORE is completed.
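
The grouping that results from this rule can be sketched in Python. This is only an illustration under assumptions: the `MEM_OPS` set and the `issue_groups` function are hypothetical, and the sketch does not model the clock consumed by the JUMP or the memory latency. It only shows which instructions end up sharing a clock.

```python
MEM_OPS = {"LW", "SW"}  # assumption: a small illustrative subset of loads/stores


def issue_groups(program, width=4):
    """Split a list of mnemonics into per-clock issue groups.

    A load/store always starts a new group so it lands on core0, and it
    issues alone because the instructions after it are blocked until it
    completes.
    """
    groups = []
    pc = 0
    while pc < len(program):
        if program[pc] in MEM_OPS:
            groups.append([program[pc]])  # load/store alone, aligned to core0
            pc += 1
            continue
        group = []
        while pc < len(program) and len(group) < width and program[pc] not in MEM_OPS:
            group.append(program[pc])
            pc += 1
        groups.append(group)
    return groups


program = ["ADD"] * 6 + ["SW"] + ["ADD"] * 2
for cycle, group in enumerate(issue_groups(program)):
    print(cycle, group)
```

With six ADDs, a SW, and two more ADDs, the sketch produces groups of four ADDs, two ADDs, the SW by itself, and then the last two ADDs.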

The User’s Program ROM is four opcodes wide and dual-ported, so any instruction alignment is supported. Only one clock is consumed for JUMP opcodes. Because the ALU is single-cycle, no other pipelining penalties are incurred.
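
One way a four-wide, dual-ported ROM can serve any alignment is to read two adjacent rows at once and pick four consecutive opcodes out of the eight. The Python sketch below assumes that scheme; the post does not spell out how the two ports are actually used, and `fetch_four` and the row layout are hypothetical.

```python
def fetch_four(rom_rows, pc_word):
    """Return four consecutive opcodes starting at word address pc_word.

    rom_rows is a list of rows, each holding four opcodes. Port A reads the
    row containing pc_word and port B reads the next row, so the four-opcode
    window can start at any offset within a row.
    """
    row, offset = divmod(pc_word, 4)
    port_a = rom_rows[row]
    # Assumption: reading past the end of the ROM returns empty slots.
    port_b = rom_rows[row + 1] if row + 1 < len(rom_rows) else [None] * 4
    window = port_a + port_b  # eight candidate opcodes
    return window[offset:offset + 4]


rom_rows = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
print(fetch_four(rom_rows, 3))  # prints: [3, 4, 5, 6]
```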

The quad-issue superscalar MCLR5 achieves nearly 90 MHz when targeted to Stratix 10 or UltraScale+. A single MCLR5 core can reach over 250 MHz. The relatively slow timing of the quad-issue core is due to the very long combinational paths through all four cores, which is not surprising. I am actually impressed that these clock frequencies were reached at all.  🙂

It is hard to come up with a picture to illustrate a quad-issue CPU core, so I will just post a screenshot of sixteen ADD r1, r1, 1 instructions with a SW stuck in the middle.

[Screenshot: simulation capture of sixteen ADD r1, r1, 1 instructions with a SW in the middle]

MCLR5 Quad-issue Superscalar RISC V initial results

I'm working on a quad-issue superscalar RISC-V processor at the moment and I thought I would share an exciting milestone.

This is a simulation snippet of four RISC-V ADDI r1,r1,0x1 instructions in a row. The core fetches and executes the four instructions simultaneously and writes the result back to r1, all in one clock cycle. Neat!!!

The MCLR5 is a quad-issue superscalar RISC-V processor core with single-cycle instruction timing. There are four combinational RISC-V ALU cores which process four consecutive instructions and can update up to four registers per clock cycle. The core should reach an aggregate IPC of nearly four.

 

[Screenshot: simulation of four ADDI r1,r1,0x1 instructions executing in one clock cycle]

MCL65 running Apple II+ Programs

I uploaded some videos of the system running a few applications and games. My hope was to test the MCL65 on a variety of programs that could demonstrate both the instruction accuracy and the cycle accuracy of the core.

MicroCore Labs YouTube Videos

The MCL65 is an ultra-small footprint, microsequencer-based, 100% instruction-set compatible, cycle-exact NMOS 6502 core that can be implemented in any FPGA or ASIC technology. It can use as little as 252 LUTs (0.77%) of a Xilinx Spartan-7 FPGA. It has also been ported to a Xilinx Spartan-3 device, where it uses about 10% of the part.

The MCL65 is instruction set compatible with the original NMOS version of the 6502 which was the processor used in computers and game machines such as the Commodore VIC20, Apple II, Atari-2600, and the Commodore-64 as well as many others.

Key Features:

100% Compatible with NMOS 6502 instruction set
Cycle-exact with the original processor
All signals from the original DIP packaged CPU are supported such as SO, SYNC, INT_n, and NMI_n.

Bus timing is identical to the original 6502. All over-fetches, read/write sequences, and addressing mode wrapping/errors are supported.

BCD (Binary Coded Decimal) addition and subtraction are supported.
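
As a reminder of what decimal-mode addition does on a 6502, here is a short Python sketch. It is not the MCL65 microcode. It covers only valid BCD operands and only the result and carry, leaving out the other flags and the NMOS behavior for invalid BCD digits. The name `adc_decimal` is made up for this example.

```python
def adc_decimal(a, b, carry_in=0):
    """Add two packed-BCD bytes plus carry, returning (result, carry_out).

    Assumes a and b each hold two valid decimal digits (0x00..0x99).
    """
    lo = (a & 0x0F) + (b & 0x0F) + carry_in
    if lo > 9:
        lo += 6  # skip the six unused codes 0xA..0xF
    hi = (a >> 4) + (b >> 4) + (1 if lo > 0x0F else 0)
    if hi > 9:
        hi += 6
    carry_out = 1 if hi > 0x0F else 0
    return ((hi << 4) | (lo & 0x0F)) & 0xFF, carry_out


result, carry = adc_decimal(0x58, 0x46, carry_in=1)
print(hex(result), carry)  # prints: 0x5 1  (58 + 46 + 1 = 105)
```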

The MCL65 6502 core is an embedded processor core implemented with a high-performance 32-bit microsequencer which can use as little as 252 Xilinx LUTs and two block RAMs in a Spartan-7 FPGA. The core is 100% compatible with the original processor and is designed to be cycle-exact, which allows it to be used in applications where firmware cycle timing is critical.

The core was tested on a Commodore VIC-20, Apple II Plus, and the Atari-2600.

Here are a few pictures I took of the system in action in the Apple II Plus.

[Photos of the MCL65 running in the Apple II Plus, followed by an FPGA utilization screenshot]

MCL65 works in Apple II+

Received the Apple II+ in the mail today but it did not come with any diskettes. I used a terrific tool, ADTPro, to transfer disk images from my PC over to the Apple using the cassette port. It is slow but works great! I was able to transfer over DOS 3.3 and a few games such as Castle Wolfenstein, Zaxxon, and Lode Runner.  They all appear to work fine with the MCL65 and I will take some pictures and video in a day or so.


MCL65 Working!

The MCL65 is currently running inside of a Commodore VIC-20 computer!  I have no game cartridges at the moment, so I am just running the classic a=a+1 BASIC counting program.

I am using a Digilent Arty S7 board which has a Xilinx Spartan-7 XC7S50. The core utilizes about 0.77% of the device!

The MCL65 is designed to be cycle-exact to the original MOS 6502 microprocessor, so it should be able to run timing-dependent computers like the Apple II. (I believe the disk controller requires certain instruction cycle timing.) Hopefully I can get one of these machines soon to give it a try.

I also hope to test the core on an Atari-2600 and a Commodore-64.

Pictures and videos will be coming soon!

 


World’s fastest IBM PCjr

I added 128KB of memory inside of the FPGA and disabled the MCL86 cycle compatibility with the original 4.77 MHz 8088 processor and got some interesting results:

[Photo of speed test results: img_4510]

If these speed test results are to be believed, then this IBM PCjr is many times faster than the original IBM PC XT and, for some tests, even faster than the 6 MHz IBM PC AT.

[Photos of further speed test results: img_4509, img_4513]

I am using DOS 2.1 and PCJRMEM.COM /C to allow these test programs to run from the upper/faster 128KB of memory.

It is interesting that Norton Utilities SI.EXE now thinks the processor is a NEC V20.  I think this may have something to do with the prefetch queue and the speed at which it fills when running programs from the upper/fast memory.

The lower 128KB of physical DRAM is accessed in the normal fashion with four to six 4.77 MHz clock cycles. The upper 128KB is located inside of the FPGA and is accessed in a number of 100 MHz clock cycles, so it is many times faster than the 4.77 MHz local bus.
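
To put rough numbers on that difference, here is a small Python calculation. The four-to-six cycle figure for the DRAM comes from the paragraph above. The cycle count for the FPGA memory is not stated, so the value of 4 below is purely an assumption for illustration, and `access_time_ns` is a made-up helper.

```python
def access_time_ns(cycles, clock_mhz):
    """Time in nanoseconds for a given number of clock cycles."""
    return cycles * 1000.0 / clock_mhz


dram_fast = access_time_ns(4, 4.77)  # lower 128KB, best case
dram_slow = access_time_ns(6, 4.77)  # lower 128KB, worst case
fpga_cycles = 4                      # assumption: the real count is not given
fpga = access_time_ns(fpga_cycles, 100)

print(round(dram_fast), round(dram_slow), round(fpga))
```

Under that assumed cycle count, an access to the FPGA memory takes 40 ns against roughly 839 ns to 1258 ns for the DRAM, which fits the "many times faster" description.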

The MCL86 clock cycle compatibility mode is turned off once the PCjr exits its POST. This means that once the microsequencer finishes processing an instruction, it immediately fetches the next one. With cycle compatibility turned on, the microsequencer will pause for the same number of 4.77 MHz clock cycles that the original processor takes for that instruction.

Is this the world’s fastest IBM PCjr?  🙂

Please visit us at: www.MicroCoreLabs.com for more information.

 
