New 64GC IP in the SCRx family of the RISC-V compatible cores by Syntacore

Alexander Redkin
Executive director

RISC-V Summit
December 2018, Santa Clara
Outline

- Company intro
- Introducing new 64GC IP
Syntacore introduction

IP company, founding member of RISC-V foundation

Develops and licenses state-of-the-art RISC-V cores
- Initial line is available and shipping to customers
- 3+ years of focused RISC-V development
- Core team has 10+ years of highly relevant background
- SDKs, samples in silicon, full collateral

Full service to specialize CPU IP for customer needs
- One-stop workload-specific customization for 10x improvements
  ▪ with tools/compiler support
- IP hardening at the required library node
- SoC integration and SW migration support

Visit us at booth T3
**Syntacore background**

**Company:**
- Est 2015
- HQ in EU
  - R&D offices in St. Petersburg and Moscow
  - Representatives in EMEA, APAC

**Team background:**
- 10+ years in corporate R&D (major semiconductor MNC)
- Developed cores and SoCs are in mass production
- 15+ tapeouts, 180..14nm

**Expertise:**
- Low-power and high-performance embedded cores and IP
- ASIP technologies and reconfigurable architectures
- Architectural exploration & workload characterization
- Compiler technologies
SCRx baseline cores

- **SCR1**: Compact MCU-class open-source core
  - Minimal area configuration is ~11 kGates
  - [https://github.com/syntacore/scr1](https://github.com/syntacore/scr1)

- **SCR3**: High-performance 32-bit MCU with privilege modes
  - Competitive characteristics

- **SCR4**: 32-bit MCU core with high-performance FPU
  - IEEE 754-2008 compatible

- **SCR5**: Efficient mid-range APU/embedded core
  - 1GHz@28nm, virtual memory, 2-4 cores SMP, Linux

Stable industrial-strength designs, available since 2017
- SDKs, silicon samples, tools, documentation, support
- All cores are licensed
SCR1 overview

Compact MCU core for deeply embedded applications and accelerator control

- RV32I|E[MC] ISA
- 2- to 4-stage pipeline
- M-mode only
- Optional configurable IPIC
  - 8..32 IRQs
- Optional integrated Debug Controller
  - OpenOCD compatible
- Choice of optional MUL/DIV unit
  - Area- or performance-optimized
- Open sourced under SHL-license (Apache 2.0 derivative)
  - Unrestricted commercial use allowed

- High quality free MCU IP
- Among the top SystemVerilog GitHub repositories worldwide
- Best-effort support provided, commercial offered
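
The ISA string `RV32I|E[MC]` is compact notation: `I|E` selects the base integer set, and each bracketed letter is an optional extension. A minimal sketch that expands it into concrete configurations, assuming the bracketed extensions can be selected independently; the helper name is hypothetical:

```python
from itertools import product


def expand_scr1_isa():
    """Enumerate the ISA strings implied by the notation RV32I|E[MC]."""
    configs = []
    for base in ("I", "E"):  # I|E: choose exactly one base integer ISA
        # [MC]: M (mul/div) and C (compressed) are each optional
        for m, c in product(("", "M"), ("", "C")):
            configs.append(f"RV32{base}{m}{c}")
    return configs


print(expand_scr1_isa())
```

The result includes both RV32EC and RV32IMC, the minimal and default configurations named in the synthesis data.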
### SCR1 overview cont

<table>
<thead>
<tr>
<th>Performance*, per MHz</th>
<th>DMIPS -O2</th>
<th>DMIPS -best**</th>
<th>Coremark -best**</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>1.28</td>
<td>1.89</td>
<td>2.95</td>
</tr>
</tbody>
</table>

* Dhrystone 2.1, Coremark 1.0, GCC 7.1 BM from TCM
** -O3 -funroll-loops -fpeel-loops -fgcse-sm -fgcse-las -flto

**What’s new:**
- Extensive user guide and quick start collateral
- Works out of the box in all major simulators
- Verilator support (version 3.922 and later)
- More tests/samples: RISC-V compliance, others
- Regular talk at ORCONF
- Updated and maintained

**Synthesis data:**
- Minimal RV32EC config: 11 kGates
- Default RV32IMC config: 32 kGates
- Range 10..40+ kGates

**250+ MHz @ tsmc90lp {typical, 1.0V, +25C}**
SCR1 SDK

https://github.com/syntacore/scr1-sdk

Repository content:
- docs - SDK documentation
- fpga - SCR1 SDK FPGA projects
- images - precompiled binary files
- scr1 - SCR1 core source files
- sw - sample SW projects

Supported platforms:
- Digilent Arty (Xilinx)
- Terasic DE10-Lite (Intel)
- Arria V GX Starter (Intel)

Software:
- Bootloader
- Zephyr OS
- Tests/sample apps
- Pre-built GCC-based toolchain (Win/Linux)

Open designs and pre-built images for a quick start
Fully featured SW development tools

Stable IDE in production:

- GCC 8.1
- GNU Binutils 2.31.0
- Newlib 3.0
- GNU GDB 8.0.50
- Open On-Chip Debugger 0.10.0
- Eclipse 4.9.0

Hosts: Linux, Windows
Targets: BM, Linux

Simulators:
- Qemu
- Spike

Also available:
- LLVM 5.0
- CompCert 2.6
- 3rd party vendors in 2019

Debug solution:
Segger J-Link, Olimex ARM-USB-OCD family, Digilent JTAG-HS2 supported; Lauterbach: H2’18
RV64 SCR3

High-performance multicore capable MCU-class core

- RV64I[MCA]
- Machine and User privilege modes
- Optional MPU (Memory Protection Unit)
- Optional Tightly Coupled Memory (TCM), L1 caches ECC/parity
- 32|64bit AHB or AXI4 external interface
- Optional high-performance or area-optimized MUL/DIV unit
- Integrated IRQ controller and PLIC
- Advanced debug with JTAG i/f
- Multicore configs up to 4 SCRx cores
  - SMP and heterogeneous

<table>
<thead>
<tr>
<th>Performance*, per MHz</th>
<th>DMIPS -O2</th>
<th>DMIPS -best**</th>
<th>Coremark -best**</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>1.97</td>
<td>3.27</td>
<td>3.40</td>
</tr>
</tbody>
</table>

* Dhrystone 2.1, Coremark 1.0, GCC 8.1 BM from TCM
** -O3 -funroll-loops -fpeel-loops -fgcse-sm -fgcse-las -flto
RV64 SCR4

High-performance multicore capable MCU core with FPU

- RV64IMCF[DA] ISA
- U- and M-mode
- Configurable advanced BP, fast MUL/DIV
- Integrated IRQ controller
- 32- or 64-bit AHB or AXI4 external interface
- Optional MPU, TCM, L1 caches w/ECC
- Advanced debug controller with JTAG
- Configurable SP or DP FPU
  - IEEE 754-2008 compliant
- Multicore support up to 4 SCRx cores
  - SMP and heterogeneous

Performance*, per MHz

<table>
<thead>
<tr>
<th>Benchmark</th>
<th>Options</th>
<th>Result</th>
</tr>
</thead>
<tbody>
<tr>
<td>Dhrystone 2.1</td>
<td>-O2</td>
<td>1.97</td>
</tr>
<tr>
<td>Coremark 1.0</td>
<td>-best**</td>
<td>3.27</td>
</tr>
<tr>
<td>DP Whetstone</td>
<td>-best**</td>
<td>3.40</td>
</tr>
</tbody>
</table>

* Dhrystone 2.1, Coremark 1.0, GCC 8.1 BM from TCM
** -O3 -funroll-loops -fpeel-loops -fgcse-sm -fgcse-las -flto
RV64 SCR5

Efficient entry-level APU/embedded core
- RV64IMC[AFD] ISA
- Multicore configs up to 4 SCRx cores
  - SMP and heterogeneous
- Advanced BP (BTB/BHT/RAS)
- IRQ controller (integrated and PLIC)
- M-, S- and U-modes
- Virtual memory support, full MMU
- L1, L2 caches with coherency, atomics, ECC
- High performance double-precision FPU
- Linux and FreeBSD support
- 1GHz+ @28nm
- Advanced debug with JTAG i/f

Performance*, per MHz

<table>
<thead>
<tr>
<th>Performance</th>
<th>DMIPS -O2</th>
<th>DMIPS -best**</th>
<th>Coremark -best**</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>1.70</td>
<td>2.62</td>
<td>3.02</td>
</tr>
</tbody>
</table>

* Dhrystone 2.1, Coremark 1.0, GCC 8.1 BM from TCM
** -O3 -funroll-loops -fpeel-loops -fgcse-sm -fgcse-las -flto
RV64 SCR7

Efficient mid-range application core
- RV64GC ISA
- Multicore configs up to 8 cores
- Flexible uarch template, 10-12 stage integer pipeline
- Initial SCR7 configuration (Q1’19):
  - Decode and dispatch of up to two instructions per cycle
  - Out-of-order issue of up to four micro-ops
  - Out-of-order completion, in-order retirement
- M-, S- and U-modes
- Virtual memory support, full MMU
- 16-64KB L1, up to 2MB L2 cache with ECC
- 1.2GHz+ @28nm
- Advanced debug with JTAG i/f

Performance*, per MHz

<table>
<thead>
<tr>
<th>Performance</th>
<th>DMIPS -O2</th>
<th>DMIPS -best**</th>
<th>Coremark -best**</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>2.73</td>
<td>3.00</td>
<td>4.00</td>
</tr>
</tbody>
</table>

* Preliminary data. Dhrystone 2.1, Coremark 1.0, GCC 8.1 BM
** -O3 -funroll-loops -fpeel-loops -fgcse-sm -fgcse-las -flto
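
As a rough illustration of what the per-MHz figures mean in absolute terms, the sketch below multiplies the preliminary SCR7 numbers by the 1.2 GHz clock quoted for 28nm. It assumes scores scale linearly with frequency, which holds only while the memory system keeps pace with the core; the function and variable names are illustrative.

```python
# Per-MHz scores from the SCR7 table (preliminary data).
SCR7_PER_MHZ = {
    "DMIPS -O2": 2.73,
    "DMIPS -best": 3.00,
    "Coremark -best": 4.00,
}


def absolute_scores(per_mhz, clock_mhz):
    """Scale per-MHz scores to a given clock, assuming linear scaling."""
    return {name: score * clock_mhz for name, score in per_mhz.items()}


# 1.2 GHz is the lower bound quoted for SCR7 at 28nm.
scores = absolute_scores(SCR7_PER_MHZ, clock_mhz=1200)
for name, value in scores.items():
    print(f"{name}: {value:.0f}")
```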
## SCRx features at a glance, Q1’19

<table>
<thead>
<tr>
<th>Features</th>
<th>SCR1</th>
<th>SCR3</th>
<th>SCR4</th>
<th>SCR5</th>
<th>SCR7</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Width</strong></td>
<td>32bit</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td>64bit</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>ISA</td>
<td>RV32I[E][MC]</td>
<td>RV[32|64I][MC][A]</td>
<td>RV[32|64I][MCF][AD]</td>
<td></td>
<td></td>
</tr>
<tr>
<td><strong>Pipeline type</strong></td>
<td>In-order</td>
<td>In-order</td>
<td>In-order</td>
<td>In-order</td>
<td>Superscalar</td>
</tr>
<tr>
<td><strong>Pipeline, stages</strong></td>
<td>2-4</td>
<td>3-5</td>
<td>3-5</td>
<td>7-9</td>
<td>10-12</td>
</tr>
<tr>
<td><strong>Branch prediction</strong></td>
<td>Static BP, RAS</td>
<td>Static BP, RAS</td>
<td>Static BP, RAS</td>
<td>Static BP, BTB, BHT, RAS</td>
<td>Dynamic BP, BTB, BHT, RAS</td>
</tr>
<tr>
<td><strong>Execution privilege levels</strong></td>
<td>Machine</td>
<td>Machine, User</td>
<td>User, Machine</td>
<td>User, Supervisor, Machine</td>
<td>User, Supervisor, Machine</td>
</tr>
<tr>
<td><strong>Extensibility/customization</strong></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td><strong>Execution units</strong></td>
<td>MUL/DIV</td>
<td>area-opt</td>
<td>hi-perf</td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td>FPU</td>
<td>V</td>
<td>V</td>
<td>V</td>
<td>V</td>
</tr>
<tr>
<td></td>
<td>TCM</td>
<td>O</td>
<td>O</td>
<td>V</td>
<td>V</td>
</tr>
<tr>
<td><strong>Memory subsystem</strong></td>
<td>L1$</td>
<td>O</td>
<td>O</td>
<td>V</td>
<td>V</td>
</tr>
<tr>
<td></td>
<td>L2$</td>
<td>V</td>
<td>V</td>
<td>V</td>
<td>V</td>
</tr>
<tr>
<td></td>
<td>MPU</td>
<td>V</td>
<td>V</td>
<td>V</td>
<td>V</td>
</tr>
<tr>
<td><strong>MMU, virtual memory</strong></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td><strong>Debug</strong></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Integrated JTAG debug</td>
<td>V</td>
<td>V</td>
<td>V</td>
<td>V</td>
<td>V</td>
</tr>
<tr>
<td>HW BP</td>
<td>1-2</td>
<td>1-8 adv ctrl</td>
<td>1-8 adv ctrl</td>
<td>1-8 adv ctrl</td>
<td>1-8 adv ctrl</td>
</tr>
<tr>
<td>Performance counters</td>
<td>O</td>
<td>O</td>
<td>O</td>
<td>O</td>
<td>O</td>
</tr>
<tr>
<td><strong>Interrupt Controller</strong></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>IRQs</td>
<td>8-32</td>
<td>8-1024</td>
<td>8-1024</td>
<td>8-1024</td>
<td>8-1024</td>
</tr>
<tr>
<td><strong>Features</strong></td>
<td>basic</td>
<td>advanced</td>
<td>advanced</td>
<td>advanced+</td>
<td>advanced+</td>
</tr>
<tr>
<td><strong>SMP support</strong></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>up to 4 cores with coherency</td>
<td>up to 8 cores</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td><strong>I/O options</strong></td>
<td>AHB</td>
<td>O</td>
<td>O</td>
<td>O</td>
<td>O</td>
</tr>
<tr>
<td></td>
<td>AXI4</td>
<td>O</td>
<td>V</td>
<td>V</td>
<td>V</td>
</tr>
<tr>
<td></td>
<td>ACE</td>
<td>O</td>
<td>V</td>
<td>V</td>
<td>O</td>
</tr>
</tbody>
</table>

*Download SCR1 free at [https://github.com/syntacore/scr1](https://github.com/syntacore/scr1)*

Baseline cores: extensible and customizable
Extensibility/customization: how it works

[Figure: dynamic power versus processing time, comparing the full energy of a customized core with that of a general-purpose core]
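
The quantities on this slide (dynamic power, processing time, full energy) are tied together by the usual energy identity. A sketch, assuming dynamic power is roughly constant over the run; the subscripts cust and gp are labels introduced here for the customized and general-purpose cores:

```latex
E = P_{\mathrm{dyn}} \cdot t
\qquad\Longrightarrow\qquad
\frac{E_{\mathrm{cust}}}{E_{\mathrm{gp}}}
  = \frac{P_{\mathrm{cust}}}{P_{\mathrm{gp}}}
    \cdot
    \frac{t_{\mathrm{cust}}}{t_{\mathrm{gp}}}
```

Under that assumption, a customized core may draw somewhat more power and still use less total energy, provided it shortens processing time by a larger factor.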
Workload-specific customization

Extensibility features:
- Computational capabilities
  - New functions using existing HW
  - New Functional Units
- Extended storage
  - Mems/RF, addressable or state
  - Custom AGU
- I/O ports
- Specialized system behavior
  - Standard events processing
  - Custom events

Domain examples:
- Computationally intensive algorithms acceleration
- Specialized processors (including DSP)
- High-throughput applications
  - Wire Speed Processing/DPI/Real-time/Comms
SCRx extensibility example

Custom ISA extension accelerating AES and other crypto kernels on SCR5

- Data
  - RV32G: FPGA-based devkit, g++ 5.2.0, Linux 4.6, optimized C++ implementation
  - RV32G + custom: same + intrinsics
  - Core i7 6800K @ 3.4GHz: g++ 5.4.0, Linux 64, optimized C++ implementation

  60..575x speedup @ modest area increase: 11.7% core, 3.7% at the CPU cluster level

<table>
<thead>
<tr>
<th rowspan="2">Platform</th>
<th rowspan="2">Fmax, MHz</th>
<th colspan="3">Encoding throughput, MB/s</th>
</tr>
<tr>
<th>Crypto-1</th>
<th>Crypto-2</th>
<th>AES-128</th>
</tr>
</thead>
<tbody>
<tr>
<td>RV32G</td>
<td>20</td>
<td>0.025</td>
<td>0.129</td>
<td>0.238</td>
</tr>
<tr>
<td>RV32G + custom</td>
<td>20</td>
<td>14.375</td>
<td>15.188</td>
<td>14.502</td>
</tr>
<tr>
<td>Core i7</td>
<td>3400</td>
<td>79.115</td>
<td>235.343</td>
<td>335.212</td>
</tr>
<tr>
<td>Core i7 + NI</td>
<td>3400</td>
<td>3874.552</td>
<td>3874.552</td>
<td>3874.552</td>
</tr>
</tbody>
</table>

Disclaimer: the authors are aware that AES allows more efficient dedicated accelerator designs; it is used here only as an example algorithm.
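
The per-MHz and speed-up figures follow directly from the throughput numbers. A minimal sketch that recomputes them from the table values; the dictionary layout and function names are illustrative and not part of any Syntacore tooling:

```python
# Fmax in MHz and encoding throughput in MB/s, copied from the table.
FMAX_MHZ = {"RV32G": 20, "RV32G + custom": 20, "Core i7": 3400}
THROUGHPUT = {
    "RV32G": {"Crypto-1": 0.025, "Crypto-2": 0.129, "AES-128": 0.238},
    "RV32G + custom": {"Crypto-1": 14.375, "Crypto-2": 15.188, "AES-128": 14.502},
    "Core i7": {"Crypto-1": 79.115, "Crypto-2": 235.343, "AES-128": 335.212},
}


def per_mhz(platform):
    """Throughput normalized by clock frequency, in MB/s per MHz."""
    return {k: v / FMAX_MHZ[platform] for k, v in THROUGHPUT[platform].items()}


def speedup(base, fast):
    """Ratio of throughput between two platforms, per kernel."""
    return {k: THROUGHPUT[fast][k] / THROUGHPUT[base][k] for k in THROUGHPUT[base]}


speedups = speedup("RV32G", "RV32G + custom")
for kernel, ratio in speedups.items():
    print(f"{kernel}: {ratio:.1f}x")

normalized = {p: per_mhz(p) for p in THROUGHPUT}
```

The resulting speed-ups span roughly 61x to 575x, consistent with the 60..575x range quoted on the slide.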
Conclusion

- Syntacore offers high-quality RISC-V compatible CPU IP
  - Founding member, fully focused on RISC-V since 2015
  - Silicon-proven and shipping to customers
  - Turnkey customization services

- Introduced 4 RV64 cores, available starting Q1’19

- Visit us at booth T3 for SCRx demos, including RISC-V silicon
Thank you!