Research Round-up: Recent LLM blitz

tl;dr The AI language model world moves fast; use the table of contents below to jump to quick insights on the models that interest you:

  1. DBRX
  2. Command R+
  3. Command R
  4. Mistral Large
  5. Mixtral 8x22b
  6. Llama 3 70b
  7. Llama 3 8b
  8. Phi-3 mini (4k)

April 2024 was a month that produced a vortex in the AI/language model world, throwing the full range at us, from Small Language Models (SLMs) to Large Language Models (LLMs) and everything in between. Let's cut through the clutter: 'Lingatrix' or 'LLM' will do for simplicity. Amid the chaos, my backlog of unpublished articles and research has been growing, with guilt over unpublished findings gnawing at me.

So I thought I'd throw together this "Research Round-Up". This post won't dive deep but will sketch a rough outline of the various LLMs under scrutiny recently and my preliminary thoughts on each.

A detailed analysis is on the cards, as my notes are still being refined.

New models like Llama 3 are (partially) out and Phi-3 is being drip-released, each one eagerly trampling over the achievements of its predecessors and its competitors.

This flood of updates is thrilling, sure, but overwhelming to some, so I'll strip away the niceties and get straight to the point: here's where each LLM stands in the grand scheme, from my critical, perhaps slightly jaded, perspective.

Alright, let's get into it; this follows a rough release order.

DBRX by Databricks: Setting the scene

Introduction: DBRX marked the beginning of this latest wave of LLM releases and immediately set itself apart with its mixture-of-experts architecture. As the first of its kind to catch my attention, DBRX not only outperformed its contemporaries (it was exceptional at the time) but also reshaped my expectations for what LLMs can achieve.

Core Features:

  • Model Size and Structure: 132 billion total parameters (roughly 36 billion active per input) in a fine-grained mixture-of-experts design, surpassing the older Mixtral 8x7b model in efficiency and effectiveness (a routing sketch follows this list).
  • Context Window: Utilizes a 32k context window, offering expansive room for n-shot learning to tailor responses to specific use cases.
  • Retrieval Augmented Generation (RAG): Shows proficient logic in task execution and data retrieval, though it primarily excels in systematic tasks over creative or narrative-driven applications.
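For readers new to the architecture, here is a tiny, self-contained sketch of how mixture-of-experts routing works conceptually. The layer sizes are toy values of mine, not DBRX's actual configuration:

```python
# Toy mixture-of-experts layer: a router scores experts per token and
# only the top-k experts actually run, which is what makes MoE efficient.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    def __init__(self, dim=64, n_experts=4, top_k=2):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_experts))
        self.top_k = top_k

    def forward(self, x):  # x: (tokens, dim)
        weights = F.softmax(self.router(x), dim=-1)      # score every expert
        top_w, top_i = weights.topk(self.top_k, dim=-1)  # keep only top-k
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = top_i[:, k] == e                  # tokens routed to expert e
                if mask.any():
                    out[mask] += top_w[mask, k:k+1] * expert(x[mask])
        return out

print(TinyMoE()(torch.randn(8, 64)).shape)  # torch.Size([8, 64])
```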

Performance Insights:

  • Practical Applications: DBRX shines as a tool within specific workflows, particularly where logical precision and data retrieval are paramount. It's not best suited as a conversational assistant or for tasks requiring substantial creative input.
  • Comparative Analysis: Arguably more useful than GPT-3.5 in practical scenarios, though the latter retains certain advantages in areas DBRX struggles with, such as creative outputs.
  • Limitations: Despite its capabilities, expect some data 'hallucinations' or inaccuracies, typical of current LLMs, underlining the importance of cautious integration into critical applications.

Final Thoughts: For about a month, DBRX was the benchmark setter in the LLM arena, raising the bar for what we expect from such technologies. While it has its limitations, particularly in creative tasks, it was an evolution and it made waves. It is still extremely capable; it was simply overshadowed by the string of big releases that came later.

Command-R Plus: A new challenger

Introduction: Before the echoes of DBRX's debut had even faded, Command-R Plus burst onto the scene. It was released by a somewhat obscure company named Cohere, completely unknown to me until I stumbled upon it on HuggingFace, and it's clear they weren't here to play games. This model wasn't just trending; it was setting a new trend. It is also an open-weight model, and it moves the open-weight community forward by leaps and bounds.

Specifications and Performance:

  • Context and Capacity: Boasting a staggering 128k context length (albeit with some bugs beyond 112k tokens), Command-R Plus doesn't just push the envelope—it shreds it. The sheer speed of this thing in demo was enough to make your head spin.
  • Capabilities: Robust RAG and logical reasoning, topped off with function calling and "preamble" support (Cohere's name for the system prompt); it doesn't just deliver, it goes above and beyond (a minimal API sketch follows this list).
  • Model Size and Impact: At a colossal 104 billion parameters, this isn't just another model; it's a titan. Throw anything 'real world' at it, and it doesn't just respond, it excels.
  • Creativity: From what I can see, this model is hardly censored; in fact, even reading about how it was trained was a refreshing change from what we are seeing from those engaging in "anticompetitive practices" in this space. Thanks to its logical abilities, when it's doing something creative it sticks close to the flow, with no off-topic or forgetful responses that break "immersion".
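To make the preamble and RAG points concrete, here is a minimal sketch using Cohere's Python SDK. The API key, document snippet, and prompts are placeholders of mine, and parameter names can shift between SDK versions, so treat it as illustrative rather than canonical:

```python
# Minimal Command-R Plus call with a "preamble" (system prompt) and
# grounded documents for RAG. Key, documents, and message are placeholders.
import cohere

co = cohere.Client("YOUR_API_KEY")

response = co.chat(
    model="command-r-plus",
    preamble="You are a terse research assistant. Cite only the documents provided.",
    message="What did the Q1 report conclude about churn?",
    documents=[{"title": "q1_report", "snippet": "Churn fell 12% quarter over quarter..."}],
)
print(response.text)
```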

Cost Efficiency: Here's where it gets even more interesting. Cohere is practically giving away power at $3 per million input tokens and $15 per million output tokens. Compare that to Mistral's laughable rates for their inferior "Mistral Large," and it's no contest. Command-R Plus isn't just competing; it's rewriting the rules on value.
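A quick back-of-the-envelope, using only the rates quoted above, shows what that pricing means for long-context work; the workload numbers below are hypothetical:

```python
# Cost per request at the quoted Command-R Plus rates:
# $3 per 1M input tokens, $15 per 1M output tokens.
INPUT_RATE = 3.00 / 1_000_000    # USD per input token
OUTPUT_RATE = 15.00 / 1_000_000  # USD per output token

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """USD cost of a single request."""
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# Example: 100k tokens of stuffed context, a 1k-token answer.
print(f"${request_cost(100_000, 1_000):.3f}")  # $0.315
```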

Final Verdict: Command-R Plus took the bar set by DBRX, raised it, and then smashed it with a sledgehammer. It's a model that doesn't forget, even if you're cautious about pushing it past 100k tokens. If you're looking for a solid choice for your toolkit, this is it. Priced really well, it shows finesse when handling inputs and retrieving data, and when it needs to reason about something it's really capable. Not GPT-4 level, but of what's out there for research purposes, not much can keep up.

Command-R: The struggling little sibling

While Command-R Plus was out there setting the stage on fire, its smaller sibling, Command-R, was lurking somewhere in the backstage shadows. With a 35 billion parameter size and a pricing scheme that could only be described as 'dirt cheap,' it seemed poised to offer something worthwhile. But it fails, badly.

Initial Brush: I’ll be frank—I gave Command-R the kind of fleeting attention one might give a secondary character in a play. Its more impressive sibling had already captured the spotlight, after all. Yet, given its similar 128K context window and purported RAG capabilities, one would think it deserves a fair chance.

Stumbling Blocks:

  • Performance Issues: It wasn't entirely without merit, but the devil, as they say, is in the details. The model stumbled over its own feet, hallucinating data when faced with tasks requiring even a hint of logical prowess. Basic arithmetic seemed a hurdle too high, which is a glaring red flag in my book. It loses the context of the chat between questions, can't make sense of its input, and has nothing really going for it; sorry, but it's just bad.
  • Reliability Woes: Relying on a tool that trips over foundational logic isn't just risky; it's a veritable minefield. For anyone in the thick of work, babysitting an errant algorithm is not on the to-do list.

Verdict: Passable, Not Preferable: So, would I recommend using Command-R for rigorous tasks or as a reliable RAG implement? That’s a resounding no. It's too much of a gamble when precision is paramount.

Mistral Large: Striding, Not Sprinting, Past GPT-3.5

NOTE: I must add a disclaimer here: Mistral Large is not a part of this latest LLM wave, but it is a poor performer that will be compared against constantly, so I feel obligated to add my current findings here.

Upon its introduction, Mistral Large seemed promising, a solid effort to step past the shadows of GPT-3.5. But as the curtain rose, it became clear that while it's not floundering, it's certainly not leading the pack, either.

First Impressions: Mistral Large comes equipped with all the standard bells and whistles of a contemporary large language model: a 32k context window, function-calling capabilities, and a system prompt designed to be more tool than tyrant. After all, not everyone appreciates a model with more personality than utility.

Technical Takedown:

  • Logic Leap: It did impress me somewhat—managing to outperform GPT-3.5 in logic-related tasks and tests. But, let's not get carried away; it still trips over the high bars set by GPT-4.
  • Creative Conundrums: Here, 'Mistral Large' proves to be a middling performer. It attempts creativity, yes, but with the excitement of a rehearsed school play. The responses are often predictably dull, hinting that either its 'Top P' value needs tweaking, or its creative juices need a serious stir.

Step Up or Sit Down: For what it's worth, Mistral Large has shed some of the more pretentious aspects of its predecessors' personalities. It presents itself as a reliable assistant more than an aloof savant. It's a dependable tool in the toolbox; serviceable, if not sensational. Its major drawback is that it's just too mid for me: it does not have a single unique or standout thing going for it.

Mixtral 8x22b: A Real Contender Emerges

Introduction: Just as I was about to resign myself to the mediocrity of previous models, along came Mixtral 8x22b, cutting through the noise just as Command-R started gaining traction. This model demanded attention, and rightfully so.

Performance Highlights:

  • Context Window Expansion: With a 64k context window—yes, that's double what Mistral Large offers—it made the 'Large' look like a training wheel. It’s not just bigger; it’s smarter, more capable.
  • Benchmarking and Capabilities: The impressive benchmark results were enough to shift my focus. This model doesn’t just perform; it excels, leaving its predecessors in the dust.
  • Technological Edge: Adopting a sparse Mixture-of-Experts (SMoE) technique, Mixtral 8x22b isn’t just playing the game; it’s changing how it’s played.

Drawbacks:

  • Shared Limitations: Despite its advancements, it inherits some of the same limitations as the Mistral Large—lackluster in areas where you expect advancement, innovation or excitement. But let's be clear, it's a minor gripe compared to its gains.
  • Timing and Competition: Just when I was diving deep into what Mixtral 8x22b could offer, the tech world had to throw Command-R Plus and Llama 3 into the mix, cutting my research short. Annoying, but that's the pace of innovation for you.

Verdict: If you're considering the Mistral Large, don't. Mixtral 8x22b is the clear upgrade—more robust, more intelligent, and frankly, more worth your time. It’s a tool that doesn’t just do the job; it does it with a flair that its predecessors can only dream of.

Llama 3 70b: Earned its reputation, and raised the bar yet again

In a world where tech giants are clamoring to seal off their secrets, Llama 3 70b blasts onto the scene like a breath of fresh, open-weight air. Created for those left cold by Mistral's cozying up under Microsoft's wing, this model is a beacon for the true believers in open AI development. Given how much the world has already benefitted from Llama 2, Llama 3 had big shoes to fill, and it has already proven itself worth the hype and admiration.

Technical Specs:

  • Model Size: Like its predecessor, 70 billion parameters.
  • Vast Training: Trained on a mind-boggling 15 trillion tokens. Without knowing how much GPT is trained on, I'd say this is a first, and it has intrigued a lot of researchers: it defies what people thought was "optimal" and proves you can still rely on raw scale.
  • Advanced Tokenizer: Wielding a new tokenizer with a 128,000-token vocabulary, it slices through sentences with surgical precision; Llama 2's tokenizer had a 32k vocabulary. This also means Llama 2 and Llama 3 tokenize text slightly differently (see the comparison sketch after this list).
  • Open-Source with a Twist: It champions open-source ideals, kind of. It's more open-hearted than open-book, enticing you to dive deep and fine-tune its gears. It's what we now call "open weight"; still, the commitment is welcome.
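To see the tokenizer difference in practice, here is a minimal sketch using the HuggingFace transformers library. Both repos are gated, so it assumes you have accepted Meta's license and logged in; exact token counts will vary with the input text:

```python
# Compare Llama 2 and Llama 3 tokenizers: bigger vocabulary, fewer pieces.
from transformers import AutoTokenizer

llama2 = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
llama3 = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

text = "Mixture-of-experts models route each token to a handful of experts."
print(llama2.vocab_size, len(llama2.tokenize(text)))  # ~32k vocab, more tokens
print(llama3.vocab_size, len(llama3.tokenize(text)))  # ~128k vocab, fewer tokens
```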

Flaws? Sure, It Has a Few:

  • Contextually Challenged: The initial context window is laughably short at 8,000 tokens. While this may be a hiccup for some, as a chatbot, it's still top-tier. To put it in perspective, its so-called rival Gemini 1.5 might as well be in another galaxy with its million-token window.
  • Maths Woes: It can conjure up the right formulas like a pro, but when it's time to crunch the numbers, it's more "uh-oh" than "eureka." Yes, it's a head-scratcher, but this is a known limitation of all LLMs, so nothing out of the ordinary here.

Why It’s a Game-Changer:

  • Personality Galore: The default personality is so affable it's like chatting with your cleverest friend: engaging, never condescending. This isn't just an LLM; it's the life of the party in AI form. I cannot praise this point enough; Mistral's and Microsoft's offerings are all condescending, belittling, and prone to lecturing me. This one, however, is a complete social butterfly, happy to crack a joke or two to keep the mood light.
  • Logic Leaps: Beyond friendly banter, its logical prowess is quietly revolutionary. Outshines GPT-3.5 and makes Mistral Large seem obsolete. Further testing is needed to compare it directly with Gemini 1.5, but initial impressions are promising.
  • Social Savant: Its social intelligence is so advanced, it feels like it’s from another dimension. This isn’t just a tool; it’s the most enjoyable LLM, ready to push boundaries and charm its way into being an indispensable part of your digital life.
  • Retrieval Augmented Generation (RAG) Capabilities: Shows potential in its ability to retrieve information within its limited context, but the full extent of its retrieval capabilities remains untested because of the initial release's dreadful 8k context window (a minimal retrieval sketch follows this list).
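As a concrete illustration of working around the 8k window, here is a minimal RAG sketch of my own (not Meta's recipe): embed the document chunks, retrieve only the most relevant ones, and stuff just those into the prompt. The embedding model, chunks, and query are placeholder assumptions:

```python
# Retrieve top-k chunks so the prompt stays well inside an 8k-token window.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")

chunks = ["Chunk about billing...", "Chunk about refunds...", "Chunk about SLAs..."]
query = "What is the refund policy?"

chunk_emb = embedder.encode(chunks, convert_to_tensor=True)
query_emb = embedder.encode(query, convert_to_tensor=True)

hits = util.semantic_search(query_emb, chunk_emb, top_k=2)[0]
context = "\n\n".join(chunks[h["corpus_id"]] for h in hits)

prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
# `prompt` is what you would now hand to Llama 3 70b.
```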

Final Thoughts: With its bold stance on open-source and a personality that could outshine your average social butterfly, Llama 3 70b isn't just playing in the big leagues—it's making its own rules. Sure, it may stumble over some algebra, but in the realms of conversation and charisma, it's unmatched. Stay tuned for more tests, but as of now, it's clear that Llama 3 70b is here to make a splash.

Llama 3 8b: Punching Above Its Weight

Meet Llama 3 8b, the little sibling in the Llama 3 series, which packs a surprisingly hefty punch with just 8 billion parameters. Despite its smaller size, this model has been trained on the same colossal 15 trillion tokens as its bigger counterpart, showcasing an efficiency that's nothing short of revolutionary. Cramming 15T tokens into 8b parameters is its own form of madness, and I am here for it.

Capabilities:

  • Social Skills: Smooth and savvy, this model handles social interactions with a finesse that belies its size. The conversation flows so naturally you might forget you're chatting with a smaller model, that is until you hit a snag.
  • Speed Demon: What it lacks in depth, it makes up for in speed. This model responds so quickly it kept me wondering: am I really using an 8b model? The cost-performance ratio is amazing.

Limitations:

  • Logic Lapses: Here's where the charm wears thin. Ask it to perform a logical gymnastics routine, and it’s more likely to trip up than stick the landing. It struggles with deeper comprehension, often misinterpreting complex inputs.
  • Surface-Level Understanding: While it can mimic understanding quite convincingly, don’t be fooled—it’s mostly just skimming the surface. This can lead to errors when responses depend on a deeper grasp of the context.

Conclusion: Llama 3 8b may not replace your need for more powerful models like the 70b, but it's not trying to. Instead, it excels within its limits, offering a user-friendly, quick-witted companion that's perfect for those who need speed over depth. It's not perfect, but calling it a failure? That would be a gross misunderstanding of its niche appeal.

Phi-3 Mini: A Compact Powerhouse with a Touch of Drama

Having experimented with the Phi-3 Mini, it's clear that despite its compact size, this model packs a significant punch, much like its contemporary, the Llama 3 8b. My anticipation for testing this model was piqued after reviewing the technical report prior to the official release of the model weights.

Technical Specifications:

  • Parameters: 3.8 Billion
  • Training Data: 3.3 Trillion tokens
  • Context Lengths: 4K and 128K (using the LongRope technique)

Initially, only the mini version has been released, and there seems to be an uncertain wait for the rest of the family—a bit disappointing, really. For my testing, I utilized the 4K context length to gauge its basic capabilities.

Performance Analysis:

  • Baseline with 4K Context: While the Phi-3 Mini demonstrates commendable capabilities within the 4K context, it's not without its quirks. The model does recall information but tends to embellish or distort facts, which muddles its reliability. It’s like having a chat with someone who can’t resist throwing in a few tall tales for flavor. Completely unnecessary and mildly infuriating.
  • Scaling Up to 128K Context: Increasing the context length to 128K doesn't alleviate the issue; if anything, it exacerbates it. The larger context offers more room for the model to weave its intricate web of fabrications alongside actual data. This tendency to smother straightforward queries with verbose, unsolicited advice turns simple interactions into tedious lectures.

Logic Capabilities: Despite its flaws in recall, the Phi-3 Mini excels in logical reasoning. It claims to outperform GPT-3.5 in logic, and my tests tentatively confirm this; much remains untested, but to say it's not keeping up would be a lie. It adeptly handles complex queries that stump older models. This level of performance suggests potential in more demanding computational tasks.

Practical Applications:

  • Data Processing: It can convert input into JSON format, but it tends to overstep by adding irrelevant or false information. Imagine asking for a straightforward conversion and getting unnecessary embellishments that cloud the output. Now you're dealing with a JSON object that has the keys you asked for, but with lectures built into the values! It just cannot keep its self-righteous attitude to itself. The result? You asked for vanilla and got tutti frutti (see the validation sketch after this list).
  • Service Use: Imagine ordering a simple, neatly-cut slice of cake, only to watch in horror as the server pulls out a chainsaw instead of a knife. Sure, you get your slice, but so does everyone and everything within a five-foot radius. It delivers, but not without a mess.
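One practical mitigation, purely a workaround of mine rather than anything Phi-3 offers, is to parse the reply defensively and whitelist the keys you asked for; the keys and sample reply below are hypothetical:

```python
# Tame chatty JSON output: extract the object, keep only the expected keys.
import json

EXPECTED_KEYS = {"name", "email", "plan"}

def clean_json_reply(model_reply: str) -> dict:
    # Models often wrap JSON in prose; grab the outermost braces first.
    start, end = model_reply.find("{"), model_reply.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object found in reply")
    data = json.loads(model_reply[start : end + 1])
    # Drop unsolicited keys so lectures can't ride along.
    return {k: v for k, v in data.items() if k in EXPECTED_KEYS}

reply = 'Sure! {"name": "Ada", "email": "ada@example.com", "note": "Always be kind online!"}'
print(clean_json_reply(reply))  # {'name': 'Ada', 'email': 'ada@example.com'}
```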

Recommended Usage: Given its tendency to lose focus in extended conversations, the Phi-3 Mini is best used in short, single-instance interactions. Fire your query and let it process that alone without threading it into a longer conversation. This approach minimizes its narrative embellishments, making it somewhat more reliable—like using a scalpel instead of a chainsaw for precision work.
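To make the fire-and-forget advice concrete, here is a sketch of that single-instance pattern; `generate` is a hypothetical stand-in for whatever Phi-3 inference path you use:

```python
# Each query gets a fresh history instead of being threaded into one chat.
def generate(messages: list[dict]) -> str:
    """Placeholder for a real Phi-3 call (local pipeline or API)."""
    return "<model reply>"

queries = ["Summarize this paragraph: ...", "Convert this address to JSON: ..."]

for q in queries:
    # No accumulated turns: one system line plus the single user message.
    messages = [
        {"role": "system", "content": "Answer directly. No extra advice."},
        {"role": "user", "content": q},
    ]
    print(generate(messages))
```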

After a lull in the tech world, DBRX hit the scene and proved itself quite capable, but its reign was cut short by the absolute flurry of models that followed. It's a shame DBRX did not get more time in the spotlight, and this has all been a little unpredictable, but life is not a straight line.