2023 was the year AI engineering went from "interesting research area" to "where we're hiring." Every tech company and most non-tech companies had an LLM strategy — or were urgently developing one. The tooling ecosystem exploded: LangChain went from a GitHub project to a company with $25M in funding; vector databases raised hundreds of millions; and the "AI engineer" role emerged as a distinct job title, sitting at the intersection of software engineering, ML, and product thinking.
It was also a year of productive chaos. New abstractions appeared and were immediately deprecated. Best practices formed and reformed. RAG, first described in a 2020 research paper, was popularized, critiqued, improved, and critiqued again, all within twelve months. The pace of change made it genuinely difficult to know which tools and patterns would survive. Looking back from the end of 2023, here's what actually mattered.
GPT-4 Changes the Baseline
March 14, 2023: GPT-4 launches. The capability jump from GPT-3.5 (the model powering ChatGPT at launch) was substantial, not in benchmark numbers alone but in the qualitative sense that GPT-4 could follow complex multi-step instructions, write better code, reason through problems more consistently, and handle longer contexts. Claude 1 (from Anthropic) also launched in 2023 with different strengths: a longer context window and less tendency to confabulate.
The practical effect for AI engineers: the "what can we actually build?" question became dramatically more permissive. GPT-3.5-level models required extensive prompt engineering and careful task decomposition to get reliable outputs. GPT-4 could handle more complex, open-ended tasks with less coaxing. The range of viable LLM applications expanded.
GPT-4's context window (8K tokens by default, with a 32K-token GPT-4-32k variant in limited access) also changed the retrieval calculus. Larger context meant you could stuff more retrieved documents into a single prompt, so retrieval needed less precision and answers could draw on more material. The race to longer context windows was on.
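To make the retrieval trade-off concrete, here is a minimal sketch of packing ranked chunks into a fixed context budget. The names `estimate_tokens` and `pack_context`, the 1,000-token reserve, and the rough four-characters-per-token estimate are all assumptions for illustration; a real system would count tokens with the model's own tokenizer.

```python
from typing import List


def estimate_tokens(text: str) -> int:
    """Rough token estimate: assumes ~4 characters per token (an approximation)."""
    return max(1, len(text) // 4)


def pack_context(chunks: List[str], context_window: int, reserved: int = 1000) -> List[str]:
    """Greedily keep ranked chunks until the token budget is used up.

    `chunks` is assumed to be sorted best-first by the retriever.
    `reserved` holds back tokens for the question, instructions, and the answer.
    """
    budget = context_window - reserved
    packed: List[str] = []
    used = 0
    for chunk in chunks:
        cost = estimate_tokens(chunk)
        if used + cost > budget:
            break  # stop at the first chunk that does not fit, preserving rank order
        packed.append(chunk)
        used += cost
    return packed


# Hypothetical retrieved chunks of roughly 500 tokens each (2,000 characters)
chunks = ["x" * 2000 for _ in range(100)]
print(len(pack_context(chunks, context_window=8_000)))   # chunks that fit in an 8K window
print(len(pack_context(chunks, context_window=32_000)))  # chunks that fit in a 32K window
```

Under these assumptions the larger window holds several times as many chunks, which is why a less precise retriever became tolerable.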
LangChain: The Framework That Ate AI Engineering
LangChain was the most polarizing library in AI engineering in 2023. By June 2023, it was the fastest-growing open-source project in GitHub history by some measures. By November 2023, there was a significant "LangChain is over-engineered" backlash from engineers who had tried to maintain LangChain-based systems in production.
Both reactions were understandable. LangChain had abstracted the common patterns of LLM application development (retrieval chains, memory, tools, agents) in a way that made prototyping very fast. But the abstractions were leaky, the API changed frequently, and debugging a complex LangChain chain was painful. Even so, it had two genuine strengths: an enormous community generating tutorials and cookbook examples, and a large set of pre-built integrations with tools, vector databases, and document loaders.
The practical advice by end of 2023: use LangChain's integrations and individual utilities, but avoid deep coupling to its chain abstractions for production systems. Build the orchestration layer yourself; use LangChain for the connectors.
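One way to read that advice is as a thin, framework-free orchestration layer that depends only on small interfaces. The sketch below is an assumption about what such a layer could look like, not a prescribed design; `Retriever`, `LLM`, `answer`, and the stub classes are hypothetical names, and in practice a LangChain vector store or document loader would sit behind the `Retriever` interface.

```python
from typing import List, Protocol


class Retriever(Protocol):
    def retrieve(self, query: str, k: int) -> List[str]: ...


class LLM(Protocol):
    def complete(self, prompt: str) -> str: ...


def answer(query: str, retriever: Retriever, llm: LLM, k: int = 4) -> str:
    """Plain-Python orchestration: retrieve, build the prompt, call the model."""
    passages = retriever.retrieve(query, k)
    context = "\n\n".join(passages)
    prompt = (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    return llm.complete(prompt)


# Stub implementations so the sketch runs without any external service.
class KeywordRetriever:
    def __init__(self, docs: List[str]) -> None:
        self.docs = docs

    def retrieve(self, query: str, k: int) -> List[str]:
        # Rank documents by how many query words they share (toy scoring).
        words = set(query.lower().split())
        ranked = sorted(self.docs, key=lambda d: -len(words & set(d.lower().split())))
        return ranked[:k]


class EchoLLM:
    def complete(self, prompt: str) -> str:
        return f"[stub model saw {len(prompt)} characters]"


docs = ["Refunds are issued within 30 days.", "Shipping takes 5 business days."]
print(answer("refunds policy", KeywordRetriever(docs), EchoLLM(), k=1))
```

Because `answer` only knows about the two protocols, swapping a connector or a model provider does not touch the orchestration code.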
RAG Becomes the Production Pattern
Retrieval-Augmented Generation (RAG) was the most important architectural pattern that emerged and stabilized in 2023. The concept: instead of relying on the LLM's parametric knowledge (what it learned during training), augment each query with relevant documents retrieved at query time from a vector database or search index.
RAG solved the key problems with pure LLM applications in enterprise settings:
- Knowledge cutoff: LLMs have a training cutoff; your internal documents are updated continuously. RAG bridges this by retrieving current documents.
- Hallucination mitigation: Grounding responses in specific retrieved documents reduces (not eliminates) fabrication.
- Source attribution: You can cite the specific document passages used to generate a response, which is essential for enterprise trust.
- Data privacy: Your sensitive internal data stays in your vector store; you only send the relevant retrieved chunks to the LLM API, not your entire corpus.
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

# Basic RAG setup (2023-style LangChain API)
# `documents` is assumed to be a list of already-chunked LangChain Document objects
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(documents, embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})  # top 4 chunks per query

qa_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model_name="gpt-4"),  # GPT-4 is a chat model, so use the chat wrapper
    chain_type="stuff",  # "stuff" concatenates all retrieved chunks into one prompt
    retriever=retriever,
    return_source_documents=True,
)

result = qa_chain({"query": "What is our refund policy?"})
print(result["result"])
# Assumes each document was loaded with a "source" entry in its metadata
print([doc.metadata["source"] for doc in result["source_documents"]])
By the end of 2023, the "basic RAG" above was the minimum baseline. Teams that had deployed it were now dealing with the next level of problems: retrieval quality (top-k isn't enough for complex queries), chunk size optimization, query rewriting, hybrid search (dense + sparse), reranking. The pattern was solid; the production tuning remained hard.
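Hybrid search needs some way to merge a dense (embedding) ranking with a sparse (keyword) ranking. One common option is reciprocal rank fusion (RRF); the sketch below illustrates that technique and is not something the basic chain above does. The document IDs are made up, and k=60 is the conventional default constant rather than a tuned value.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document IDs into one.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. A higher fused score means a better position.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical results from a dense retriever and a sparse (BM25-style) retriever
dense = ["doc_a", "doc_b", "doc_c"]
sparse = ["doc_c", "doc_a", "doc_d"]
print(reciprocal_rank_fusion([dense, sparse]))
```

Documents that rank well in both lists rise to the top, while a document found by only one retriever still makes the fused list, which is the point of combining the two.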
The Open-Source LLM Surprise: Llama 1 and 2
February 2023: Meta releases Llama 1 (weights restricted to researchers). Leaked within days. The response from the AI engineering community was immediate: fine-tuning experiments, quantization (running 7B and 13B models on consumer GPUs), and a flood of derived models (Alpaca, Vicuna, WizardLM) appeared within weeks. Open-source LLM development, which had been essentially dormant compared to the closed-model labs, suddenly had a foundation to build on.
July 2023: Llama 2, with weights licensed for commercial use (subject to some restrictions in Meta's license). At 70B parameters with RLHF fine-tuning, Llama 2 was competitive with GPT-3.5 on many benchmarks. For enterprises concerned about data privacy or API costs, self-hosted Llama 2 became a real option for many use cases.
The practical implications were significant for AI engineering:
- Fine-tuning on proprietary data was now accessible without sending data to OpenAI
- Local inference became viable for latency-sensitive applications
- The "best model is always OpenAI" assumption started cracking
- A whole category of tooling (llama.cpp, Ollama, vLLM, text-generation-inference) emerged for efficient open-model inference
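The arithmetic behind quantization explains why these models suddenly fit on consumer hardware. The sketch below estimates weight memory only, under the assumption that memory is roughly parameter count times bits per weight; it ignores the KV cache, activations, and quantization overhead, so real usage is higher. The function name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Approximate memory for model weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


for name, n_params in [("7B", 7e9), ("13B", 13e9), ("70B", 70e9)]:
    fp16 = weight_memory_gb(n_params, 16)
    int4 = weight_memory_gb(n_params, 4)
    print(f"{name}: 16-bit ~ {fp16:.1f} GB, 4-bit ~ {int4:.1f} GB")
```

Under this estimate a 13B model drops from about 26 GB at 16-bit precision to about 6.5 GB at 4-bit, which puts it within reach of consumer GPUs.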
Function Calling / Tool Use Changes the Agent Story
June 2023: OpenAI releases function calling in the API. Before this, getting an LLM to produce structured output that could trigger a tool was a prompt engineering exercise with unreliable results. Function calling provided a first-class mechanism: you describe available functions in the API call; the model returns a structured function call when it determines a function is needed; your code executes the function and passes results back to the model.
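The round trip described above can be sketched without calling any API. In the example below the function description follows the JSON Schema shape that OpenAI's function calling uses (`name`, `description`, `parameters`), but the model's reply is a hand-written stand-in, and `get_order_status`, `TOOLS`, and `dispatch` are hypothetical names for illustration.

```python
import json
from typing import Any, Callable, Dict


def get_order_status(order_id: str) -> Dict[str, Any]:
    """Hypothetical tool; a real one would query a database or service."""
    return {"order_id": order_id, "status": "shipped"}


# 1. Describe the function to the model (JSON Schema for the parameters).
function_spec = {
    "name": "get_order_status",
    "description": "Look up the shipping status of an order.",
    "parameters": {
        "type": "object",
        "properties": {"order_id": {"type": "string"}},
        "required": ["order_id"],
    },
}

# Registry mapping function names to the Python callables that implement them.
TOOLS: Dict[str, Callable[..., Any]] = {"get_order_status": get_order_status}


def dispatch(function_call: Dict[str, str]) -> str:
    """2. Execute the call the model asked for and return the result as JSON."""
    name = function_call["name"]
    if name not in TOOLS:
        raise ValueError(f"Model requested unknown function: {name}")
    arguments = json.loads(function_call["arguments"])  # arguments arrive as a JSON string
    return json.dumps(TOOLS[name](**arguments))


# Stand-in for the structured call a model might return (not a real API response).
model_function_call = {"name": "get_order_status", "arguments": '{"order_id": "A123"}'}

# 3. This result would be sent back to the model as a function-result message.
print(dispatch(model_function_call))
```

Validating the function name before dispatching matters in practice, since the model's output is still untrusted input to your code.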
This unlocked reliable tool-using agents. Not the ReAct-style agents that had been possible before, which required careful prompting and often went in circles, but agents that could reliably invoke a database query, call an API, or run a calculation as a native model behavior rather than a prompt engineering trick.
By the end of 2023, function calling was the standard mechanism for LLM-tool integration. Anthropic's tool use API (for Claude) and Gemini's function declarations followed the same pattern. The pattern was converging.
The Evaluation Problem Nobody Wanted to Talk About
Building LLM applications is easy. Knowing if they're good is hard. 2023 was the year that "LLM evaluation" became a serious engineering problem rather than a research footnote. The challenge: unlike traditional ML models where you have ground-truth labels and measurable metrics, LLM output quality is often subjective, multi-dimensional, and expensive to label.
The emerging evaluation approaches in 2023:
- LLM-as-judge: Use a more capable LLM (GPT-4) to evaluate the output of a less capable one. Scales well, reasonable correlation with human judgment for specific criteria.
- RAG-specific metrics: RAGAS (Retrieval Augmented Generation Assessment) introduced context precision, context recall, faithfulness, and answer relevancy as measurable dimensions.
- Human evaluation pipelines: Systematic A/B testing with human raters for high-stakes applications. Expensive and slow but remains the ground truth.
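A minimal sketch of the LLM-as-judge approach, assuming the judge is any callable that takes a prompt and returns text. The prompt wording, the 1 to 5 scale, and the names `build_judge_prompt`, `parse_score`, and `evaluate` are illustrative assumptions; the stub judge stands in for a real call to a stronger model.

```python
import re
from statistics import mean
from typing import Callable, Dict, List, Optional


def build_judge_prompt(question: str, answer: str, criterion: str) -> str:
    return (
        f"Rate the answer for {criterion} on a scale of 1 to 5.\n"
        f"Question: {question}\nAnswer: {answer}\n"
        "Reply in the form 'Score: <number>'."
    )


def parse_score(judge_reply: str) -> Optional[int]:
    """Extract the score; return None if the judge did not follow the format."""
    match = re.search(r"Score:\s*([1-5])\b", judge_reply)
    return int(match.group(1)) if match else None


def evaluate(samples: List[Dict[str, str]], judge: Callable[[str], str], criterion: str) -> float:
    """Mean judge score over samples; unparseable replies are skipped."""
    scores = []
    for sample in samples:
        reply = judge(build_judge_prompt(sample["question"], sample["answer"], criterion))
        score = parse_score(reply)
        if score is not None:
            scores.append(score)
    if not scores:
        raise ValueError("Judge returned no parseable scores")
    return mean(scores)


def stub_judge(prompt: str) -> str:
    # Stand-in for a real model call: rewards answers that mention "30 days".
    return "Score: 5" if "30 days" in prompt else "Score: 2"


samples = [
    {"question": "What is the refund window?", "answer": "Refunds are accepted within 30 days."},
    {"question": "What is the refund window?", "answer": "Please contact support."},
]
print(evaluate(samples, stub_judge, criterion="faithfulness"))
```

Keeping the judge behind a plain callable makes it easy to spot-check its scores against a small set of human ratings before trusting it at scale.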
The tooling was nascent (LangSmith launched in beta in mid-2023, Arize AI added LLM evaluation, Weights & Biases added LLM tracing), but the problem statement was clear: you cannot ship LLM applications responsibly without evaluation infrastructure. The teams that learned this lesson early in 2023 were better positioned for 2024.
2023 was the most exciting year in AI engineering that most of us have experienced. The field went from "follow the research papers" to "ship a production RAG system by Friday." The speed was disorienting. The number of things that changed between January and November was staggering. And we were just getting started.