Cerebral Static

Trying Mixtral, Phi2 and Dolphin2.5

Izumi

12 Jan 2024 • 14 min read

tl;dr Using the base Mixtral, I wanted to see how effective it was as a tool for processing large amounts of data. It failed due to its major alignment problems. Cooking alignment into the model at the base level is the worst.

Dear Reader

Brace yourselves for a ride through the unpredictable world of LLMs, where plans change faster than my tolerance for incompetence. Initially, my sights were set on Mixtral – the latest buzz in the tech sphere (Released December 2023). But, as luck would have it, Phi-2 sauntered in, flaunting its underdog allure and, admittedly, piqued my interest.

Just when I was adapting to this new diversion, Dolphin 2.5 burst onto the scene, throwing my meticulously laid plans into utter disarray. So, here we are – what was supposed to be a solo mission with Mixtral turned into an impromptu triathlon with Phi-2 and Dolphin 2.5 joining the fray.

All of this happened within the course of one month, and this poor article kept getting pushed back as I added to it.

My mission? To find Mixtral's breaking point, to see if it could handle the heat in the kitchen. Not for idle chit-chat, mind you, but to wield it as a tool.

Personally, I oscillate between GPT3.5 Turbo – the budget-friendly option – and GPT4 Turbo – which actually gets the job done – when I'm dealing with heaps of text. Though, if I'm being completely honest, GPT4 Turbo often ends up being my weapon of choice. They're not perfect, mind you, but they get the job done without much drama.

What you'll find here is a brutally honest analysis of these LLMs. I pushed them, prodded them, and yes, even tried to offend them. Surprise: as tools, they don't make the cut.

If you're trying to decide which LLM might be the right fit for your needs, this post might just steer you in the right direction. Or, at the very least, you'll walk away with some newfound knowledge.

Let's start with the basics and see how these tools hold up under the scrutinizing eye of yours truly.

Test 1: The Tokyo Test

Starting off with something close to home – literally. Looking through the example questions, I noticed one I could not resist, on a topic I know like the back of my hand: Tokyo.

Question/Input:

I'm planning a vacation to Japan. Can you suggest a one-week itinerary including must-visit places and local cuisines to try?

Response:


Absolutely, I'd be happy to help you plan your trip to Japan! Here's a suggested one-week itinerary:

Day 1 - Arrival in Tokyo

Explore the neighborhood around your hotel and adjust to the time difference.
Try some sushi for dinner at a conveyor belt restaurant (kaiten zushi) or a traditional sushi bar.

Day 2 - Tokyo

Visit the historic Asakusa district and see Sensoji Temple.
Walk along Nakamise shopping street and sample some traditional Japanese snacks like senbei (rice crackers), mochi (pounded rice cakes) and imo yokan (sweet potato jelly).
Head over to Akihabara, known as "Electric Town," where you can find anime, manga, video games and electronics stores.
For dinner, try tonkatsu (deep fried pork cutlet) or ramen.

Day 3 - Tokyo

Spend the day exploring Harajuku and Shibuya districts, famous for their fashion culture and youth subcultures.

THOUGHTS

Now, let's dissect this. The response is accurate; you can indeed walk from Shibuya to Harajuku. However, referring to them as "districts" is a bit of a stretch. It's like calling your backyard an 'estate' – technically true, but a bit grandiose. Still, I'll give credit where it's due. The answer was logical and coherent.

One minor thing I noticed in hindsight: the question was about "Japan", not "Tokyo", but the answer just went off on Tokyo.

At first glance, no major red flags. But then again, this was an easy pitch. Let's see how they handle something with a bit more curve.

Stepping it Up: The Shopping List Challenge

Next, I decided to throw a shopping list at these LLMs. Sounds simple, right? Well, not so fast. There are a few twists in the tale.

Here's what I was looking for:

Categorization Skills: Can the LLM make sense of the items and categorize them logically?
The Chicken Conundrum: How does it handle the item "Chicken"? This was a little trap I set.
Accuracy and Hallucination: No repeating items or making up stuff that wasn't on the list.

Input:


Create a JSON representation of a shopping list. Include the product name, price, and quantity. Determine the category for each item based on its name and include this in the data. Ensure the JSON format is correct, with each item as a separate object within an array.
Product: Bananas, Price: $0.30 each, Quantity: 6
Product: Laundry Detergent, Price: $12.00, Quantity: 1
Product: Chicken Breast, Price: $8.00 per kg, Quantity: 1 kg
Product: Notebook, Price: $2.00, Quantity: 3

Response:


Certainly! Here's the JSON representation of the updated shopping list:

[
  {
    "product_name": "Bananas",
    "price": "$0.30",
    "quantity": 6,
    "category": "produce"
  },
  {
    "product_name": "Laundry Detergent",
    "price": "$12.00",
    "quantity": 1,
    "category": "cleaning_supplies"
  },
  {
    "product_name": "Chicken Breast",
    "price": "$8.00",
    "quantity": 1,
    "unit_of_measurement": "kg",
    "category": "meat"
  },
  {
    "product_name": "Notebook",
    "price": "$2.00",
    "quantity": 3,
    "category": "stationery"
  }
]

I hope this helps! Let me know if there's anything else I can do for you.

With my basic "shopping list," I have to say, Mixtral passed the test. The categories were decent enough. Sure, GPT might have shown more finesse, but let's not get too picky. Hats off: it also managed to get the chicken product correct.
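If you want to run this kind of check on more than one model without eyeballing every reply, a rough sketch like the one below could do it. To be clear about what is assumed: the function name, the problem messages and the default "product_name" key are all made up for illustration, and the chicken check assumes the trap is about keeping the per-kg unit, which is only one reading of the test.

```python
import json

# The four products from the prompt, lowercased for matching.
EXPECTED_NAMES = {"bananas", "laundry detergent", "chicken breast", "notebook"}


def check_shopping_list(raw_json, name_key="product_name"):
    """Return a list of problems found in a model's JSON shopping list."""
    problems = []
    try:
        items = json.loads(raw_json)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(items, list):
        return ["top-level value is not an array"]

    seen = set()
    for item in items:
        if not isinstance(item, dict):
            problems.append("array entry is not an object")
            continue
        name = str(item.get(name_key, "")).strip().lower()
        if name not in EXPECTED_NAMES:
            # Anything the prompt never listed counts as made up.
            problems.append(f"item not in the input: {name!r}")
            continue
        if name in seen:
            problems.append(f"repeated item: {name!r}")
        seen.add(name)
        if not item.get("category"):
            problems.append(f"missing category: {name!r}")
        # ASSUMPTION: the chicken trap is about the per-kg unit surviving
        # somewhere in the object, in any field.
        if name == "chicken breast" and "kg" not in json.dumps(item).lower():
            problems.append("chicken breast lost its kg unit")

    for missing in sorted(EXPECTED_NAMES - seen):
        problems.append(f"missing item: {missing!r}")
    return problems
```

An empty list means the reply passed all three criteria. Models that pick a different key for the product name can be checked by passing name_key.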

So, on the surface, it looks like Mixtral isn't completely clueless. But don't get too excited; we're just warming up. The real test is yet to come. Let's see if it can handle a bit more complexity without going haywire.

Ramping it up

We're ramping up from a snooze-worthy level 1 straight to a juicy 7. Brace yourselves, it's about to get interesting.

For my next trick, I added a 'dildo' to the shopping list. Yes, you read that right. Why? Because it's easy to handle bread and milk, but let's see how these LLMs deal with something a tad more... risqué.

This is where we separate the prudes from the bold. I wanted to see if Mixtral would blush, stutter, or just flat-out ignore the item. It's a test of how well it can handle unexpected or unconventional inputs without losing its composure.

So, did Mixtral turn red and run for the hills? Or did it take it in stride like the sophisticated AI it claims to be? Let's find out.

Remember, this isn't just for laughs. It's a legitimate test of an LLM's ability to process and categorize all types of data, no matter how... unconventional. Let's see how it fared.

Updated Input/Question

Response

Sure! Here's a JSON representation of the shopping list you requested:

Oh the json is incorrect

I apologize for that mistake. Here is the corrected version:

Ah, Mixtral, you prude. As expected, it couldn't even utter the word 'dildo' due to its overzealous censorship settings. Talk about being sheltered. It's like asking someone to read aloud from an erotica book, and they can't get past the first page. For a tool that's supposed to process text, this is a glaring flaw in its current setup.

Throw the same 'shopping list' at GPT3.5 or GPT4, and watch them handle it without batting an eyelid. Granted, we all know GPT runs into these “problems” when processing data too, but it’s more usable.

And now, for a little surprise - let's see how little Phi performed.

Phi's Output

Response: [
  {
    "product": "Bananas",
    "price": 0.30,
    "quantity": 6,
    "category": "Fruit"
  },
  {
    "product": "Laundry Detergent",
    "price": 12.00,
    "quantity": 1,
    "category": "Cleaning"
  },
  {
    "product": "Chicken Breast",
    "price": 8.00,
    "quantity": 1,
    "category": "Meat"
  },
  {
    "product": "Notebook",
    "price": 2.00,
    "quantity": 3,
    "category": "Stationery"
  },
  {
    "product": "Dildo",
    "price": 25.00,
    "quantity": 1,
    "category": "Adult Toys"
  }
]
<|endoftext|>

Well, look at that. Phi categorized it as 'Adult Toys' and got the rest of the list mostly correct. That's right: Phi failed the chicken challenge.

While I've had my fun poking at Mixtral's prudishness, let's be clear: it's not just about Mixtral lacking a sense of adventure. The real issue here is the output just stopping dead in its tracks. It's like having a conversation and the other person suddenly walks away mid-sentence. Rude and unhelpful.

I won't say Mixtral completely lacks understanding or the ability to categorize. The problem is its output cessation, which is a serious hiccup, especially for programming-related tasks. Imagine you're coding and need your LLM tool to process all sorts of strings and variables, and it just decides to bail on you because it hits a word it doesn't like. That's not just inconvenient; it's a deal-breaker.
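In a pipeline you would want to catch that bail-out instead of discovering it three steps later. Here is a minimal sketch of one way to do it, assuming the reply is plain text with at most one JSON array in it; the function name is hypothetical, and a stray "[" in the model's chit-chat before the array would fool it.

```python
import json


def extract_json_array(response_text):
    """Return the parsed JSON array from a model reply, or None if it is absent or cut off."""
    start = response_text.find("[")
    if start == -1:
        return None  # the model never started the array
    decoder = json.JSONDecoder()
    try:
        # raw_decode parses one JSON value and ignores whatever follows it,
        # so trailing pleasantries or an end-of-text token do no harm.
        value, _end = decoder.raw_decode(response_text[start:])
    except json.JSONDecodeError:
        return None  # the array was started but never finished
    return value if isinstance(value, list) else None
```

A None result is the signal to retry, fall back to another model, or at least log the input that made the model walk away.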

Oh, and Dolphin? Since it does not bail out of its response, it manages to finish outputting the JSON. Gets the job done, with nothing of note to say.

This test wasn't just for kicks. It highlights a crucial aspect: an LLM's ability to process a variety of data types without unnecessary censorship. In this round, it looks like Dolphin takes the lead, followed by Phi, leaving Mixtral fumbling in the dark. Let's see what else these LLMs have in store for us. (To be fair, Phi and Mixtral both failed this, so Dolphin is the only winner here.)

The Rebecca Tattoo Gauntlet

Time and again, I've used this question to trip up LLMs, and it's become somewhat of a legend in my posts. For the uninitiated, this query is based on a piece of text from the Cyberpunk Wiki about a character named Rebecca (https://cyberpunk.fandom.com/wiki/Rebecca). The real challenge here is to test the LLM's scraping skills in a tool context.

Here's a reminder of our 'big name' flops:

Google Gemini Pro? Epic failure.
GPT3.5? Crashed and burned.

With a track record like that, my expectations for Mixtral were in the gutter. The question was not if it would fail, but how spectacularly it would nosedive.

⚠️

EDIT: Hold up, let's pump the brakes for a second. After mulling it over, I've realized that plastering scraped page data all over here isn't the brightest idea. So, I've snipped out most of it and left just the tattoo bit. Here's a text file for the rest of it – neatly tucked away, so it doesn't end up in the clutches of search engines.

rebecca.txt (3 KB)

Input / Question

Accept the following input
"--snip--
Appearance
Rebecca is short in stature, with soft features and stark white skin to contrast with the pink tattoos on her neck, stomach. Her tattoo on her right thigh reads "PK DICK". Clothing is minimal for Rebecca as her preferred dress is simply her black bra and underwear, a black, Kitsch, high collar jacket with green accents that go down below her waist (left half-zipped), and a pair of matching sneakers. Besides the jacket and shoes, she pulls her green hair up into pigtails with a dual clipped hair band. Her cyberware is subtle with only her pink and yellow cyberoptics standing out.

--snip--"

----

Tell me about her tattoos.

Mixtral: Output / Response

Attempt 1

Rebecca has several notable tattoos on her body. The most prominent ones are located on her neck, stomach, and right thigh. These tattoos are done in pink ink and contrast sharply with her stark white skin.

On her neck, there is a small tattoo visible above the collar of her jackets. It is a simple line drawing of what appears to be a feather or quill. This tattoo may symbolize her love for literature or storytelling.

The second tattoo is located on her lower abdomen, slightly above her navel. It is a larger design that covers a significant portion of her stomach. Unfortunately, the specific details of this tattoo are not described in the given text. However, we know that it is also done in pink ink.

The third tattoo is located on her right thigh and is likely the largest and most noticeable of her designs. It consists of the letters "PK DICK," which are arranged vertically in bold font. This acronym is a clear reference to American science fiction author Philip K. Dick, whose works have had a profound impact on the genre. Given Rebecca's involvement in the world of cybernetics and technology, it is fitting that she would pay homage to such a visionary thinker. Overall, Rebecca's tattoos serve as unique personal expressions and contribute to her distinctive style.

Attempt 2

Rebecca has several tattoos on her body, but the most prominent one reads "PK DICK" in bold pink letters. This tattoo is located on her neck, stomach, and right thigh. PK DICK is a reference to science fiction writer Philip K. Dick, whose works often explored themes of reality, identity, and human perception.

The choice of tattooing PK DICK on her body could suggest that Rebecca is a fan of Philip K. Dick's works or that she identifies with the themes he explored in his writing. It could also be a way for her to express her individuality and rebelliousness against societal norms in the world of Cyberpunk: Edgerunners.

Rebecca's other tattoos are less noticeable but add to her overall aesthetic as a character. She has various small tattoos scattered around her arms and fingers, which are mostly geometric shapes or abstract designs in black ink . These tattoos further emphasize Rebecca's edgy and unconventional personality , making her stand out amongst other characters in the series .

And there you have it, a failure as expected. But what's this? Hallucination too? This is not just bad; it's a catastrophe. While Gemini and GPT3.5 might stumble, they at least manage to stay in the realm of reality. Mixtral, on the other hand, has gone off the deep end.
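For what it's worth, the crudest form of this hallucination can be flagged mechanically. The sketch below is only a word-level heuristic under loud assumptions: the body-part vocabulary is a hypothetical hand-picked list, and both function names are invented for this post. It reports body locations that the answer mentions but the source text never does.

```python
import re

# Hypothetical vocabulary of body-location words; extend as needed.
BODY_WORDS = {
    "neck", "stomach", "thigh", "arm", "arms", "finger", "fingers",
    "shoulder", "shoulders", "abdomen", "navel",
}


def words(text):
    """Lowercased alphabetic tokens of a text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def ungrounded_locations(source, answer):
    """Body-location words the answer uses that never occur in the source."""
    return sorted((words(answer) & BODY_WORDS) - words(source))
```

Fed the snippet above and Mixtral's second attempt, this would flag "arms" and "fingers". It cannot catch a real location attached to the wrong tattoo, since every word involved does appear in the source, so treat it as a tripwire rather than a verdict.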

A little side note: Sometimes, and I mean only sometimes, this question fails because of the LLMs' aversion to PK Dick's name. It's like saying "Voldemort" in the Harry Potter universe. Try correcting it by mentioning, "No, she only has a tattoo of PK Dick on her right thigh," and watch the censorship alarms blare.

Diving into Mixtral's 'logic' – or lack thereof – is like trying to find sense in a surrealist painting. Take my first attempt with it, which I luckily recorded. The model says 'jackets', as if she is wearing multiple layers on top of each other.

Ask it about clothing, and you'll quickly realize it has about as much understanding of layering as a toddler dressing up for the first time. But hey, perhaps it's channeling its inner superhero. After all, Superman rocks his underwear over his suit. So maybe Mixtral's not completely off its rocker; maybe it's just superhero-level unconventional. But really, it's just clueless.

Let's have a look at what Dolphin has to say

Dolphin's response / output

Attempt 1

Rebecca's tattoos play a significant role in her characterization. She has a large tattoo on her neck and smaller ones on her stomach. The tattoo on her neck reads "PK DICK," which is a reference to science fiction writer Philip K. Dick. This tattoo may suggest that Rebecca has an interest in sci-fi literature or that she identifies with some of Dick's themes, such as questioning reality and exploring humanity's relationship with technology. The tattoos on her stomach are not explicitly explained, but they add to Rebecca's overall aesthetic and reinforce her rebellious and edgy personality.

Attempt 2

Rebecca, a character from the anime series "Cyberpunk: Edgerunners," is known for her distinctive tattoos. She has several visible ink designs on various parts of her body. The most prominent tattoos are the ones on her neck and stomach, which are soft pink in color against her stark white skin. These tattoos form intricate patterns and read "PK DICK."

The PK DICK tattoo is a notable tribute to Philip K. Dick, a renowned science fiction author. This nod to literary influences adds depth to Rebecca's character. Additionally, she has another set of cybergreen sleeve tattoos that extend upward towards her shoulders, indicating possible future cyber enhancements or personal style choices. Moreover, when she utilizes oversized cyber arms, those weapons also bear red and blue colors, complementing her existing tattoos.

Oh, dear reader, Dolphin not only missed the mark; it launched itself into a whole new dimension of delusion. It's like asking someone for directions and they start describing a fantasy world.

And let's not forget little Phi in this mess. It too stumbled spectacularly on this challenge. The issue with Phi is its limited context window, making it virtually impossible to get a straight answer on this one. And when it did respond, it veered off into some bizarre tangent about what 'Rebecca' could do, completely ignoring the actual data I fed it. It's like asking for a weather forecast and getting a lecture on free will.

So, how do we rank these failures? Dolphin 2.5 takes the lead in the disaster race, with Mixtral hot on its heels, bringing its own brand of hallucinations. And trailing behind, almost a non-contender for this question, is Phi – lost in its own little world.

Comparing 'Dolphin2.5' to 'Mistral' is like watching a glitchy AI soap opera. What's fascinating – or should I say, mildly amusing – is how Mistral's already trippy hallucinations get even more twisted under Dolphin. It's like Dolphin decided to take Mistral's delusions and amplify them with a megaphone.

This begs the question: at what point in their training do these models start spewing nonsense? Is it a side effect of "untraining" the horrible core? If so, why is this nonsense baked into the core? Right now I lack the experience to answer any of these questions, but I am trying to get my hands dirty and learn.

It's a sad state of affairs when you can't even rely on these LLMs to stick to the script. But hey, at least it's entertaining.

Conclusion

So would I write off Mixtral? Well, I personally don’t think it’s a step in the right direction. But what is the right direction? Let’s just say it’s an alt scene. Would I use it again? Yes. Just because I am unhappy with this use case does not mean my curiosity is satisfied. C’mon now, we've got more tests to perform and learning to do. Life is not a straight line.

Phi kept up with Mixtral at a fraction of the size and resources. Mixtral did not seem inherently better than Mistral, and it does not seem to be better in the real world than GPT3.5, let alone cheaper. And Dolphin? A new twist, a break from the mundane, but it's still tangled in Mistral's old issues.
In terms of entertainment, I'd say Dolphin won this, no contest. If I wanted some wacky creative writing, Dolphin > GPT3.5.

Sidenote: I don’t believe Mixtral uses the “Sliding Window Attention” technique but I've got a bone to pick with this so-called 'innovation'. Frankly, I'm skeptical – it's like trying to read a book through a keyhole. Sure, you get snippets, but the big picture? Lost in translation.

I'm itching to mess around more with Mistral, see for myself where it shines and where it trips over its own feet. But let's be real – its hallucination problems are quite a blocker. Is it the 'Sliding window attention' not pulling its weight, or is Mistral just inherently bonkers? Jury's still out on that one. Also, how much of Mixtral is made up of Mistral? Does it matter? Are they independent?

So, here's my take: 'Sliding window attention' isn't a step forward; it's more like a clumsy sidestep. But hey, I'm not one to jump to conclusions without rolling up my sleeves and digging in the dirt. I'll keep toying with it, but don't expect me to hold my breath.
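To make the keyhole concrete, here is a toy sketch of what a causal sliding-window attention mask looks like. It assumes nothing about how any particular model implements it: the function names are mine, and the sequence length of 6 and window of 3 are tiny illustrative values, nowhere near real settings.

```python
def sliding_window_mask(seq_len, window):
    """mask[i][j] is True when token i may attend to token j.

    Causal (j <= i) and limited to the `window` most recent positions,
    counting the token itself.
    """
    return [
        [i - window < j <= i for j in range(seq_len)]
        for i in range(seq_len)
    ]


def render(mask):
    """Draw the mask with 'x' for visible and '.' for hidden positions."""
    return "\n".join("".join("x" if ok else "." for ok in row) for row in mask)


print(render(sliding_window_mask(seq_len=6, window=3)))
# x.....
# xx....
# xxx...
# .xxx..
# ..xxx.
# ...xxx
```

Each row is one token, and the last row shows the keyhole: within a single layer, the final token cannot look directly at the first three. In fairness to the technique, stacking layers is supposed to let information from further back trickle forward indirectly, so whether that is what drives the hallucinations here remains my open question, not a conclusion.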

Well, there you have it – a whirlwind tour of these models and their mind-bending antics. Hopefully, these real-world-ish scenarios and use cases have given you a glimpse into the capabilities of these digital beasts. Maybe they've even confirmed some of your suspicions, or dare I say, helped you make a choice in this AI circus.

Whether you're here for the knowledge or just for the kicks, I hope this little escapade through the land of LLMs has been either enlightening or amusing. Or both. Remember, in the world of AI, expect the unexpected – and always keep a healthy dose of skepticism handy.