The Stack

Truffle's software stack is built as follows:

In essence, our hardware engines are organized as an array of cortexes. Each cortex is built to extract maximal performance from the hardware available to it.

For instance, a future version of our vision cortex will rely on the Orin's built-in DLA to handle convolutions, whereas an LLM doesn't need that capability.

Audiocortexes are transcription model engines (Whisper, etc.).

Speechcortexes are TTS (text-to-speech) model engines.

Neocortexes are language model engines.
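
To make the taxonomy concrete, here is a minimal sketch of the "array of cortexes" idea. The names mirror the terms above, but the classes themselves are illustrative assumptions, not Truffle's actual internal API.

```python
from dataclasses import dataclass
from enum import Enum, auto

# Hypothetical sketch -- the kinds mirror the taxonomy above,
# but this is not Truffle's actual internal API.
class CortexKind(Enum):
    AUDIOCORTEX = auto()   # transcription models (e.g. Whisper)
    SPEECHCORTEX = auto()  # TTS models
    NEOCORTEX = auto()     # language models
    VISIONCORTEX = auto()  # vision models (future: Orin DLA for convolutions)
    HIPPOCAMPUS = auto()   # embeddings-based long-term memory for RAG

@dataclass
class Cortex:
    kind: CortexKind
    model_name: str

# The "array of cortexes": each entry is an engine specialized
# for one model family on the available hardware.
stack = [
    Cortex(CortexKind.AUDIOCORTEX, "whisper-small"),
    Cortex(CortexKind.NEOCORTEX, "mixtral-8x7b"),
]
```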

Model weights from PyTorch or Hugging Face are run through our compiler, which builds an optimized model binary for the Truffle-1 engine and makes full use of all hardware capabilities. For comparison, GGML runs Mixtral 8x7B at just 8-9 tokens/s on Truffle, while our compiler can run it at 20+ tokens/s.
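
In practice, the flow looks roughly like the sketch below. The weight download uses the standard `huggingface_hub` API; `truffle_compile()` is a hypothetical stand-in for the compiler's real entry point, which isn't documented here.

```python
# Sketch of the compile flow. truffle_compile() is a hypothetical
# stand-in; only the Hugging Face download step is a real API.
from huggingface_hub import snapshot_download


def truffle_compile(weights_dir: str, target: str) -> str:
    """Placeholder for the compiler (assumed interface). A real run
    would trace the model, fuse kernels, and emit a Truffle-1 binary."""
    return f"{weights_dir}/model.truffle"


# 1. Fetch the PyTorch/safetensors weights from Hugging Face.
weights_dir = snapshot_download("mistralai/Mixtral-8x7B-Instruct-v0.1")

# 2. Build an optimized model binary for the Truffle-1 engine.
binary_path = truffle_compile(weights_dir, target="truffle-1")
```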

Hippocampuses are interesting: we use them as long-term memory for RAG (retrieval-augmented generation), via embeddings-based retrieval.
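
The retrieval mechanism itself is standard. Here is a minimal sketch, with a toy `embed()` standing in for a real embedding model:

```python
import numpy as np

# Minimal sketch of embeddings-based retrieval, the mechanism a
# hippocampus uses. embed() is a toy stand-in for a real embedding model.
def embed(text: str) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(text)) % 2**32)
    v = rng.standard_normal(8)
    return v / np.linalg.norm(v)

memory = [
    "Truffle-1 runs Mixtral 8x7B at 20+ tokens/s",
    "Whisper handles transcription in the audiocortex",
]
memory_vecs = np.stack([embed(doc) for doc in memory])

def recall(query: str, k: int = 1) -> list[str]:
    # Cosine similarity reduces to a dot product on unit vectors.
    scores = memory_vecs @ embed(query)
    return [memory[i] for i in np.argsort(scores)[::-1][:k]]

print(recall("how fast is Mixtral?"))
```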

Let's move on to running your first LLM on Truffle.
