"Hello World!" with GenAI

How It All Started
To be honest, it all started with ChatGPT.
If someday someone asks, "Have you seen the evolution of ChatGPT, Gemini, and all that?" I can proudly raise my hand as one of those people who experienced it.
I still remember how, after every 2–3 chats, it would completely forget the context, and we’d be like, ‘Wait, what were we even talking about?’ And now, you can literally talk to it the whole day, and it still feels familiar — almost like your own.
So yeah, this is just me exploring this whole thing a little deeper (and I mean just exploring), because even now I have no intention of diving into heavy stuff like Calculus, Statistics, Probability, and all that: the kind of topics that still steal sleep from so many people's nights.
What is GenAI?
Let’s start with the word itself — GenAI, short for Generative AI.
Generative = Generates Stuff
“Generative” basically means something that generates.
In our case, we’re talking about AI that can generate the “next thing” for us.
But What’s This “Thing”?
Well, that depends on the tool:
In ChatGPT’s case, it’s text
In DALL·E’s case, it’s images
In Suno.ai’s case, it’s audio
So basically, you give it some input, and it gives you back something meaningful.
The GenAI Ecosystem is Growing Fast
We’ve already seen a bunch of GenAI tools out there.
The list doesn’t stop at ChatGPT, Gemini, Claude.ai, or DALL·E.
Just when you think you’ve seen them all, a new one pops up out of nowhere.
Why GenAI is a Game Changer for Builders
This shift is super useful for developers, makers, or founders, especially those who don’t want to go down the traditional AI rabbit hole of:
Calculus
Linear Regression
Probability
Statistics
Instead, they can focus on business logic and still benefit from the power of AI. The only limit is us, not AI.
What's Happening Behind the Scenes?
All these tools — ChatGPT, Gemini, Claude — are powered by models running in the background.
A Quick Look at Some Examples
ChatGPT runs on models like
GPT-4o, GPT-4.5, etc.
Gemini uses
gemini-2.5-flash, gemini-2.5-pro, and so on.
Claude.ai is backed by
Claude Sonnet 3.5, Claude Opus 4, etc.
And that’s just a few — there are many companies building and running their own AI models.
What are LLMs (Large Language Models)?
Now, to be a bit more specific, these “models” are commonly known as LLMs, short for Large Language Models.
They’re trained on huge datasets, and in most cases, they’re proprietary, which means only the companies that own them can train, fine-tune, or update them.
LLMs in Simple Terms
Think of an LLM as a kind of machine brain — one that’s trained to understand and generate human language.
It reads tons of content:
Books
Articles
Websites
Chat logs
...and then learns the patterns behind how humans speak, ask questions, and explain stuff.
How It All Comes Together
The “smartness” you see in ChatGPT or Claude? That’s the LLM working behind the curtain.
You type something
This is your input, or prompt, to the AI.
The model processes your input
It understands the context, analyzes patterns, and figures out what you might be looking for.
It replies in a way that feels natural and relevant
Based on its training, it generates a response that makes sense, just like a human would.

And remember — no magic here. Just data, patterns, and probability.
Up Next: Going a Bit Deeper
So far, we’ve covered:
What GenAI is
Why it's different from traditional AI
Which tools are out there
What powers these tools behind the scenes
Now, let’s go a bit deeper — not to get “fancy,” but so we don’t get lost when someone drops terms like "tokenization," "transformers," or "fine-tuning."
Once you get familiar with these concepts, you’ll feel more confident in any GenAI discussion — and you’ll know how these tools actually work under the hood.
So let's get started.
GPT?
GPT stands for Generative Pre-trained Transformer.
And honestly, I think by now it’s kind of obvious what that means, right?
Generative — something that generates stuff.
Pre-trained — probably trained on a huge amount of data beforehand.
Transformer — sounds like something that can transform one thing into another, yeah?
Doesn’t it sound a lot like an LLM?
I mean, that also takes input, processes it, and gives a natural output.
Exactly. It’s the same thing.
It’s just that in this context, we call it GPT.
It’s like the whole AI space is split into two camps now — one that builds AI, and one that uses it.
So, hey, if that’s how it works… why can’t we start naming stuff our own way too? 😄
Tokens & Sequences — What's That All About?
If you've been around GenAI stuff even a little, you might’ve heard things like:
“You’re only allowed this many inbound/outbound tokens.”
And if that sounds confusing, no worries. It’s really just a fancier way of saying something super familiar.
Do you remember how, in English grammar, we first learn:
A collection of characters makes a word, and a collection of words makes a sentence?
The same logic applies here, just with slightly different names.
But wait — how do we even decide if a bunch of characters is actually a word?
Because technically, “XYZ” or “HULULULU” are also collections of letters, right?
But they don’t mean anything (at least not in standard English).
That’s because we’ve defined only some character combinations as meaningful — the ones stored in our mental vocabulary or a dictionary. And when we want to form a sentence, we pick our words from that collection.
Now here’s the GenAI twist:
A collection of characters is called a token, and a collection of tokens is called a sequence.
So yeah — "token" is basically just GenAI’s version of a "word" (except it can be part of a word too).
And just like different human languages have different vocabularies,
different LLMs (like GPT-4o, Gemini-2.5-pro, Claude, etc.) have their own vocabulary systems too.
Some models might store the full word “Hello” as a single token.
Others might break it up and store “He” and “llo” separately.
If you’re curious and want to see this in action, check out tiktokenizer — it shows how different models split up words into tokens.
Cool? Alright, let’s look at an example.
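Here's a tiny, made-up sketch of the idea. Neither vocabulary belongs to a real model; the point is just that the same word can split differently depending on what's in the vocab:

```python
# Toy illustration (not any real model's tokenizer): two made-up
# vocabularies split the same word differently.

def tokenize(text, vocab):
    """Greedy longest-match tokenization against a fixed vocabulary."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest possible chunk first, shrink until we find a match.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # Unknown character: fall back to a single-character token.
            tokens.append(text[i])
            i += 1
    return tokens

vocab_a = {"Hello", "world"}      # stores "Hello" as one token
vocab_b = {"He", "llo", "world"}  # stores it as two pieces

print(tokenize("Hello", vocab_a))  # ['Hello']
print(tokenize("Hello", vocab_b))  # ['He', 'llo']
```

Same input, two different token lists: that's exactly why your token count for the same sentence changes from model to model.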


If you look closely, you’ll notice that different models handle vocab and tokenization a bit differently.
For example, take GPT-4o — it doesn’t just start breaking the input into tokens directly. First, it adds some special tokens (chat-formatting markers that wrap your message, so the model knows where it starts and ends). Only after that does it start encoding your actual input.
On the other hand, if you try the same sentence with Google's Gemini or Gemma, you’ll see they skip that initial step and jump straight into breaking the sentence into tokens, based on their own vocabulary.
Just look at the color coding on the tokenizer tools — you can literally see how each model splits the sentence differently.
Wanna try it yourself? Just head over to a tokenizer playground like TikTokenizer, plug in a sentence, and compare how GPT tokenizes it versus how another model does. You'll see how even something as small as “Hello!” might get split in totally different ways depending on the model.
Tokenization
So far, we’ve already covered all the heavy stuff. Now what’s left is just how to actually use all that complexity — which is surprisingly chill.
See, we all know one thing:
Unlike humans, computers love numbers.
They prefer storing and processing things in numbers for accuracy and speed.
We, on the other hand, turn numbers into readable stuff, like words or websites.
Quick example — ever seen this IP address: 142.250.4.139?
Any idea whose IP this is?
And you’ve probably used this website hundreds of times… but have you ever typed that IP directly? Nope.
Because humans like names. Computers like numbers. That’s just how it is. 😄
The same goes for LLMs.
Whatever sentence you give, it breaks it down into a sequence of tokens, and behind the scenes, every token is just a number.
Basically, LLMs have a huge vocabulary storage, where every word or chunk (token) is mapped to a specific number.
So yeah, in simple terms:
Tokenization is the process of converting a sentence into tokens,
and then turning those tokens into a list of numbers —
each number pointing to a specific token in the model’s vocabulary.

You can also search for "GPT-4 vocab size".
For GPT-4, the vocabulary size includes 100,256 predefined common tokens, while this number increases to 199,997 in GPT-4o. This tokenizer deviates from strict BPE merge rules when an input token is already part of the vocabulary.
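To make that definition concrete, here's a toy sketch. The vocabulary and the IDs are completely made up (real models map 100k+ tokens this way):

```python
# A hypothetical mini-vocabulary: each token maps to one number.
vocab = {"Hey": 0, ",": 1, " this": 2, " is": 3, " An": 4, "G": 5}
id_to_token = {i: t for t, i in vocab.items()}

def encode(tokens):
    """Turn a list of tokens into their vocabulary IDs."""
    return [vocab[t] for t in tokens]

def decode(ids):
    """Reverse the mapping: IDs back into text."""
    return "".join(id_to_token[i] for i in ids)

ids = encode(["Hey", ",", " this", " is", " An", "G"])
print(ids)          # [0, 1, 2, 3, 4, 5]
print(decode(ids))  # Hey, this is AnG
```

Notice that decoding is just the same lookup run backwards — the numbers are the only thing the model ever actually sees.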
Vector Embedding
See, till now we’ve talked about how LLMs take your input and break it down into tokens — like tiny Lego pieces of your sentence. That’s cool, but it’s just the surface.
Now comes the real magic.
Those tokens? They’re just IDs — like labels or entry numbers in a dictionary. On their own, they don’t really mean anything.
But now, we’re moving into a multidimensional space —
A whole other world where every token is placed based on its semantic meaning.
Think of it like this:
"Embedding" is the process of placing each token in a kind of coordinate system —
but not in 2D or 3D — this is like 1,000D+ space.
And the position of a token in this space depends on its meaning and how it relates to other tokens.
So “king” and “queen” will live close to each other in that world.
Same for “apple” and “fruit”.
But “king” and “banana”? Miles apart.
This is how LLMs understand context and generate human-friendly, meaningful replies — not just based on word match, but actual semantic relationships.
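You can see the "close vs. far apart" idea with cosine similarity. These 3-D vectors are hand-made purely for illustration; real embeddings have hundreds or thousands of dimensions and are learned, not hand-written:

```python
import math

# Hand-made toy vectors: NOT real embeddings, just shaped so that
# "king" and "queen" point in a similar direction and "banana" doesn't.
embeddings = {
    "king":   [0.90, 0.80, 0.10],
    "queen":  [0.88, 0.82, 0.12],
    "banana": [0.10, 0.05, 0.95],
}

def cosine_similarity(a, b):
    """Angle-based closeness: near 1.0 means 'semantically close' here."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

print(cosine_similarity(embeddings["king"], embeddings["queen"]))   # close to 1
print(cosine_similarity(embeddings["king"], embeddings["banana"]))  # much lower
```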
You can also explore vector embedding visually on this website: vector embedding visualization. Each dot represents a semantic meaning that you can check out for yourself.

We could say:
"Embeddings are where tokens stop being numbers... and start making sense."
🔁 In Short:
Let’s say we give some input like:"Hey, this is AnG"
🔹 Step 1: Tokenization
The model breaks it into token IDs:[2, 5564, 456, 445, 33, 56, 75]
(Just random example IDs)
🔹 Step 2: Embedding Lookup
Each of those token IDs is then mapped into a high-dimensional vector, like its address in the semantic world.
So now we get something like:
[
[124, ..., 546], // for token ID 2
[768, ..., 332], // for token ID 5564
...
[4568, 7214, ..., 4567] // for token ID 75
]
Basically, each token gets converted into a long array of numbers (a vector), which tells the model where that token lives in its semantic world.
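The two steps above can be sketched in a few lines. The IDs and the tiny 3-D vectors are made up; a real embedding table is a huge learned matrix:

```python
# Step 1 output (pretend tokenizer IDs for "Hey, this is AnG"):
token_ids = [2, 5564, 75]

# Step 2: the embedding table maps each ID to its vector "address".
embedding_table = {
    2:    [0.12, -0.40, 0.88],
    5564: [0.77, 0.05, -0.31],
    75:   [-0.02, 0.64, 0.19],
}

vectors = [embedding_table[i] for i in token_ids]
print(vectors)  # one vector per token, in input order
```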
Positional encoding
We created vector embeddings, meaning each token was given a specific place in the large semantic world.
But here’s the catch...
Take these two sentences:
👉 “Why did Bahubali kill Katappa?”
👉 “Why did Katappa kill Bahubali?”
We humans instantly get that these mean two completely different things, right?
But now imagine this from a model’s perspective —
Both sentences use the same tokens, just in a different order.
So if we only go by the vector embeddings of the tokens —
They’ll look the same, just shuffled.
🤔 But bro, in language — position changes everything.
Like here, switching two names flips the whole meaning!
🎯 So what’s the fix?
That’s where Positional Encoding comes in.
We attach some extra metadata to each token’s vector —
which tells the model:
“Hey, this word appeared at position 1, this one at position 2, and so on.”
This way, even if the tokens are the same, their role in the sentence is preserved.
And the model can actually "understand" the full context.
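One common way to attach that "position metadata" is the sinusoidal encoding from the original Transformer paper. Here's a minimal sketch (the 4-D embedding is made up; real models use far more dimensions):

```python
import math

def positional_encoding(position, dim):
    """Sinusoidal positional encoding: even indices use sin, odd use cos,
    at frequencies that vary across the dimensions."""
    pe = []
    for i in range(dim):
        angle = position / (10000 ** (2 * (i // 2) / dim))
        pe.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return pe

emb = [0.5, 0.1, -0.3, 0.8]  # the same token embedding...
# ...lands on different final vectors depending on where it appears:
at_pos_0 = [e + p for e, p in zip(emb, positional_encoding(0, 4))]
at_pos_1 = [e + p for e, p in zip(emb, positional_encoding(1, 4))]
print(at_pos_0 != at_pos_1)  # True: same token, different position
```

So "Bahubali" at position 1 and "Bahubali" at position 3 are no longer identical to the model — which is exactly what fixes the shuffled-sentence problem.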
Self Attention
So far, we’ve:
Converted input into tokens
Got vector embeddings
Added their position using positional encoding
Cool. But there's still one problem...
A sentence is not just a list of words — it’s a story.
And to understand that story, tokens need to talk to each other.
Take this example again:
👉 “Why did Bahubali kill Katappa?”
👉 “Why did Katappa kill Bahubali?”
Now just ask —
“Who killed whom?”
You need to know the relationship between words like “Bahubali”, “Katappa”, and “maara”.
So, what do we do?
🎯 Self-Attention = Tokens Gossiping About Each Other
We give every token the power to look at every other token in the sentence and decide:
“How important are you to me?”
Each token creates three versions of itself:
Query (Q) – What am I looking for?
Key (K) – What do I offer?
Value (V) – What’s my actual content?
Using Q & K, every token checks its relationship with all others.
And then gathers the most relevant information using V.
Basically: Every word asks, "Whom should I pay attention to, to make sense?"
🔁 Real-Life Analogy:
Imagine you're reading a murder mystery.
You don’t just look at each word alone — you connect dots:
"Katappa", "maara", "Bahubali" — now that starts to make sense.
Self-attention makes the model do exactly that —
connect dots, understand who is related to whom, and why.
🤖 So now:
Every token has meaning (embedding)
Has a place (positional encoding)
And now also knows what other tokens matter to it (self-attention)
This is what gives the model real context understanding.
Without it, you’d just get shallow, keyword-based replies.
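Here's a toy sketch of that Q/K/V dance: scaled dot-product attention over three tokens with made-up 2-D vectors. In a real model, Q, K, and V come from learned weight matrices, not hand-written lists:

```python
import math

def softmax(xs):
    """Turn raw scores into attention weights that sum to 1."""
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(Q, K, V):
    """Scaled dot-product attention for a handful of toy vectors."""
    dim = len(Q[0])
    out = []
    for q in Q:
        # How relevant is every token (via its Key) to this token's Query?
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
                  for k in K]
        weights = softmax(scores)
        # Blend all the Value vectors by those weights.
        out.append([sum(w * v[d] for w, v in zip(weights, V))
                    for d in range(dim)])
    return out

# Three toy tokens (made-up numbers, 2-D for readability).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

for row in self_attention(Q, K, V):
    print(row)  # each token's context-aware mix of all the Values
```

Each output row is a weighted mix of every token's Value — that's the "gossip" in action: every token walks away carrying a bit of everyone else.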
Transformer
If there’s one thing we could call the mindset or thinking engine behind all GPTs —
It’s the Transformer architecture.
And honestly, if that one paper —
“Attention Is All You Need”
hadn’t dropped back in 2017,
then bro, we probably wouldn’t even be talking about GenAI today.
It’s this one architecture that gave every LLM the power to:
Understand complex language
Focus on relevant words using attention
Handle long-range dependencies (i.e., how words at the start and end of a sentence relate to each other)
And learn at a massive scale
Basically, it’s the blueprint for how models “think”.
So next time someone asks:
“How does GPT actually understand stuff?”
Just tell them:
It thinks in Transformers.
It’s not magic — it’s a damn smart system of attention, layers, and token gossip. 😄
🔚 So What Have We Really Seen So Far?
If you look closely, till now we’ve only understood how LLMs generate output —
But one big question is still left:
How do they remember what we said earlier?
We all remember the early days, right?
You’d chat with an AI for 2–3 messages, and boom —
it would forget everything you just said. 😑
But now?
These models seem to remember full conversations, like they’ve got memory superpowers. 🧠⚡
😬 One Problem Though…
Most LLMs like GPT-4, Claude, Gemini, etc. are proprietary —
That means we can’t train them ourselves or tweak them fully to behave exactly how we want.
We can use them, but we can’t control them completely.
So yeah, there’s a limitation...
🎩 But Wait — Here Comes the Magic...
Turns out, both problems —
Remembering context properly
Making the model behave more like “ours” —
can be tackled with just one approach.
Yup, it’s possible to give these models custom memory,
and even make them feel like they were trained just for us —
without actually training them from scratch.
And how?
That’s exactly what we’re going to explore in the next blogs.
So if you’re curious how to build your own GPT,
or how to give it custom memory and personality —
Subscribe / Follow — because this journey just got real. 🚀



