<?xml version="1.0" encoding="UTF-8"?>
<!-- Catalog of LLM-related technologies, grouped by category.
     Each <Technology> has a required <Name> and <Description>;
     <KeyComponents> and <Examples> are optional child lists. -->
<Technologies>
  <!-- Core neural-network architectures -->
  <Category name="Model Architectures">
    <Technology>
      <Name>Transformer</Name>
      <Description>Attention-based neural architecture used by most modern LLMs.</Description>
      <KeyComponents>
        <Component>Self-Attention</Component>
        <Component>Feed-Forward Networks</Component>
        <Component>Positional Encoding</Component>
      </KeyComponents>
    </Technology>
    <Technology>
      <Name>Mixture of Experts (MoE)</Name>
      <Description>Scales model capacity by routing tokens to specialized expert networks.</Description>
      <KeyComponents>
        <Component>Router</Component>
        <Component>Experts</Component>
        <Component>Gating Mechanism</Component>
      </KeyComponents>
    </Technology>
    <Technology>
      <Name>RNN Derivatives</Name>
      <Description>Legacy architectures (LSTM/GRU) used before transformers; sometimes combined with attention.</Description>
      <KeyComponents>
        <Component>LSTM</Component>
        <Component>GRU</Component>
      </KeyComponents>
    </Technology>
  </Category>

  <!-- How models are trained and aligned -->
  <Category name="Training Paradigms">
    <Technology>
      <Name>Supervised Pretraining</Name>
      <Description>Training on large text corpora to predict next tokens.</Description>
      <KeyComponents>
        <Component>Causal Language Modeling</Component>
        <Component>Masked Language Modeling</Component>
      </KeyComponents>
    </Technology>
    <Technology>
      <Name>Instruction Tuning</Name>
      <Description>Fine-tuning with curated instruction–response datasets.</Description>
    </Technology>
    <Technology>
      <Name>Reinforcement Learning from Human Feedback (RLHF)</Name>
      <Description>Alignment technique using reward models and policy optimization.</Description>
      <KeyComponents>
        <Component>Preference Data</Component>
        <Component>Reward Model</Component>
        <Component>PPO or DPO</Component>
      </KeyComponents>
    </Technology>
    <Technology>
      <Name>DPO (Direct Preference Optimization)</Name>
      <Description>Alternative to RLHF that directly optimizes preference loss without RL loops.</Description>
    </Technology>
    <Technology>
      <Name>Continual / Online Learning</Name>
      <Description>Updates model over time without full retraining.</Description>
    </Technology>
  </Category>

  <!-- Techniques for faster / cheaper serving -->
  <Category name="Inference and Optimization">
    <Technology>
      <Name>Quantization</Name>
      <Description>Reduces weight precision (e.g., 8-bit, 4-bit) to speed up inference.</Description>
    </Technology>
    <Technology>
      <Name>Pruning</Name>
      <Description>Removes redundant parameters to reduce model size.</Description>
    </Technology>
    <Technology>
      <Name>Distillation</Name>
      <Description>Trains smaller models to emulate larger teacher LLMs.</Description>
    </Technology>
    <Technology>
      <Name>Speculative Decoding</Name>
      <Description>Accelerates generation using draft and verifier models.</Description>
    </Technology>
    <Technology>
      <Name>KV Caching</Name>
      <Description>Stores attention key/values to avoid recomputation in autoregressive decoding.</Description>
    </Technology>
    <Technology>
      <Name>Batching / Continuous Batching</Name>
      <Description>Combines multiple requests for higher throughput.</Description>
    </Technology>
  </Category>

  <!-- Grounding and extending the usable context -->
  <Category name="Retrieval and Context Extension">
    <Technology>
      <Name>RAG (Retrieval-Augmented Generation)</Name>
      <Description>Integrates external document retrieval into LLM responses.</Description>
      <KeyComponents>
        <Component>Embedding Model</Component>
        <Component>Vector Database</Component>
        <Component>Retriever</Component>
      </KeyComponents>
    </Technology>
    <Technology>
      <Name>Long-Context Attention Mechanisms</Name>
      <Description>Techniques like sliding window, recurrence, or ALiBi enabling &gt;100K context tokens.</Description>
    </Technology>
    <Technology>
      <Name>Memory Systems</Name>
      <Description>Persistent or session memory extending the model’s ability to reference past information.</Description>
    </Technology>
  </Category>

  <!-- Semantic representation and similarity search -->
  <Category name="Embedding and Search Technologies">
    <Technology>
      <Name>Vector Databases</Name>
      <Description>Stores and retrieves high-dimensional embeddings efficiently.</Description>
      <Examples>
        <Example>FAISS</Example>
        <Example>Milvus</Example>
        <Example>Pinecone</Example>
        <Example>Weaviate</Example>
      </Examples>
    </Technology>
    <Technology>
      <Name>Embedding Models</Name>
      <Description>Generate dense vector representations for semantic similarity.</Description>
    </Technology>
  </Category>

  <!-- Coordinating LLMs with tools and multi-step workflows -->
  <Category name="Agents and Orchestration">
    <Technology>
      <Name>Tool Use / Function Calling</Name>
      <Description>Structured invocation of tools or APIs based on model output.</Description>
    </Technology>
    <Technology>
      <Name>Agent Frameworks</Name>
      <Description>Systems coordinating LLM tasks with memory, tools, and planning.</Description>
      <Examples>
        <Example>LangChain</Example>
        <Example>LlamaIndex</Example>
        <Example>OpenAI Agents API</Example>
      </Examples>
    </Technology>
  </Category>

  <!-- Runtime, packaging, and training-scale infrastructure -->
  <Category name="Deployment Infrastructure">
    <Technology>
      <Name>Inference Runtimes</Name>
      <Description>Optimized execution engines for LLMs.</Description>
      <Examples>
        <Example>TensorRT-LLM</Example>
        <Example>ONNX Runtime</Example>
        <Example>vLLM</Example>
        <Example>GGML</Example>
      </Examples>
    </Technology>
    <Technology>
      <Name>Containerization</Name>
      <Description>Packaging and distribution using Docker, OCI images, and orchestration platforms.</Description>
    </Technology>
    <Technology>
      <Name>Distributed Training Frameworks</Name>
      <Description>Scaling LLM training across multiple GPUs/nodes.</Description>
      <Examples>
        <Example>DeepSpeed</Example>
        <Example>Megatron-LM</Example>
        <Example>PyTorch FSDP</Example>
      </Examples>
    </Technology>
  </Category>
</Technologies>