This page collects the technical literature, tools, and frameworks that have proven useful in real production environments - curated for signal quality, not comprehensiveness. Everything here is attributed to its original source.
Primary sources only. Official documentation, release notes, and research reports from the major AI companies. Arranged newest-first within each provider.
Google's production-grade multimodal model with native tool use and significantly reduced latency vs Gemini 1.5.
Architecture and evaluation of Google's open model family optimised for production fine-tuning and edge deployment.
Full capability and safety evaluation of the model most teams default to. Read before making architecture decisions.
OpenAI's computer-use agent for automating web tasks. Reference architecture for enterprise AI automation pipelines.
Anthropic's extended thinking model - key architecture for production tasks requiring deep reasoning before output.
Live documentation of Claude models - context windows, API names, and deprecation timelines. Bookmark this, not blog posts.
Official model registry with context limits, pricing, and deprecation dates. The primary reference for any production migration.
Live documentation for all Gemini model variants - context windows, capabilities, and deprecation schedule.
xAI's Grok 3 benchmark performance and architecture release. Relevant for teams evaluating frontier model alternatives.
Perplexity's production AI assistant architecture - grounded search + LLM generation with live source citation.
Benchmark for what production AI looks like when it works - shows the gap between demo and deployment that most shipping teams live in.
Official docs for Perplexity's grounded search API - real-time web-sourced completions for production use cases.
The architecture decisions that determine whether retrieval-augmented generation is reliable or fragile. Chunking strategy, embedding models, hybrid search, reranking.
Late-interaction retrieval via ColBERT - contextual token-level matching that outperforms dense vector retrieval for complex queries.
Automated RAG evaluation - faithfulness, answer relevancy, and context precision metrics computable without ground truth.
Node parsers, retrievers, query engines - modular RAG primitives with production integrations across major LLM providers.
PostgreSQL extension for vector similarity search - keeps retrieval inside existing Postgres infrastructure, no new infra required.
Rust-based vector store with payload filtering, sparse vector support, and binary quantisation. Benchmarks well on production workloads.
End-to-end walkthrough of advanced RAG techniques including HyDE, multi-query, and reranking with working code.
Anthropic's recommended patterns for RAG implementation with Claude - chunking, citation, and faithfulness verification.
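The chunking strategies the resources above keep returning to reduce, at minimum, to a fixed-size window with overlap. A dependency-free sketch (function name and defaults are illustrative, not drawn from any listed library):

```python
def chunk_text(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping character windows.

    Overlap preserves context across chunk boundaries, so a sentence
    cut mid-way still appears whole in at least one chunk.
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    # max(..., 1) ensures text shorter than one window still yields a chunk.
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]
```

Production chunkers split on semantic boundaries (sentences, headings) rather than raw characters, but the window-plus-overlap shape is the same.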
What to instrument, what to alert on, and the difference between a model that's working and one that has quietly drifted.
Open-source LLM observability - traces, scores, prompt management, self-hostable. The reference tool for production tracing.
Open-source LLM tracing and evaluation with OTEL-compatible spans and embedding drift detection. Integrates with all major providers.
OTel-based instrumentation for LLMs - vendor-agnostic trace export to any OpenTelemetry backend.
ML and LLM monitoring with text descriptors, drift detection, and production dashboards. Open-source and self-hostable.
What to instrument, how to set up alerts, and the metrics that actually predict user-facing quality degradation.
Official specification for standardised LLM span attributes - model, token counts, prompt content. The standard your infra team needs.
Token budgeting, inference cost modelling, provider comparison, and the specific patterns that turn a controlled pilot into an uncontrolled API bill.
Unified LLM API proxy with budget limits, cost tracking, and model routing across 100+ providers. Drop-in for any OpenAI client.
Official guide to prompt prefix caching - up to 90% cost reduction on repeated system prompts. Essential for production Claude deployments.
OpenAI's automatic prompt caching for inputs over 1,024 tokens - 50% discount on cached input tokens.
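For Claude, the cache breakpoint is explicit: a `cache_control` marker on the system block. A sketch of the request payload - the shape follows Anthropic's prompt-caching docs at time of writing, so confirm field names against the current API reference rather than this page:

```python
def build_cached_request(system_prompt: str, user_message: str, model: str) -> dict:
    """Build an Anthropic Messages API payload with a cache breakpoint
    on the system prompt, so repeated calls reuse the cached prefix.

    Payload shape per Anthropic's prompt-caching documentation at time
    of writing; verify against the live API reference before relying on it.
    """
    return {
        "model": model,
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": system_prompt,
                # Marks everything up to and including this block as cacheable.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [{"role": "user", "content": user_message}],
    }
```

The savings only materialise when the cached prefix is byte-identical across calls - a timestamp interpolated into the system prompt silently defeats it.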
High-throughput LLM inference engine - PagedAttention reduces GPU memory cost per request. The standard for self-hosted inference.
Practical techniques from a real production deployment - caching, batching, model routing, and token budget enforcement.
Real-time quality-to-price comparison across 100+ models. Use before making model selection decisions on new tasks.
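The budget enforcement these tools provide can start as something as small as a counter in front of the client (class and thresholds are illustrative, not from any listed tool):

```python
class TokenBudget:
    """Hard cap on tokens consumed per billing window.

    Call charge() before each request: a False return means the request
    should be queued, routed to a cheaper model, or rejected - anything
    but silently billed.
    """

    def __init__(self, limit: int):
        self.limit = limit
        self.used = 0

    def charge(self, tokens: int) -> bool:
        if self.used + tokens > self.limit:
            return False
        self.used += tokens
        return True
```

The point is where the check sits: before the API call, per window, with an explicit policy for the over-budget path - that is what separates a controlled pilot from an uncontrolled bill.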
How to define "working" before you build, measure it after you ship, and distinguish accuracy degradation from expected variance.
Pytest-style LLM unit testing - 14+ metrics including G-Eval, hallucination, conversational coherence, and bias detection.
CLI-first prompt testing framework - red-teaming, regression testing, multi-provider comparison. Works in CI/CD pipelines.
OpenAI's official eval framework - run existing evals or contribute custom ones against any model, including GPT-4o.
Anthropic's recommended approach to using Claude as a judge for automated evaluation pipelines - includes prompt templates.
Practical walkthrough of eval methodologies from a practitioner - why BLEU/ROUGE fail for LLMs and what to use instead.
Official guide to using Gemini for automated evaluation and as an LLM judge in production evaluation pipelines.
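Every framework above reduces to the same loop: run cases, grade outputs, aggregate. A dependency-free sketch of that loop - graders here are deterministic functions, and an LLM judge slots in as just another grader:

```python
from typing import Callable


def run_eval(
    model_fn: Callable[[str], str],
    cases: list[tuple[str, Callable[[str], bool]]],
) -> float:
    """Run each (prompt, grader) case against model_fn; return pass rate.

    Illustrative harness shape, not any listed framework's API.
    """
    if not cases:
        raise ValueError("no eval cases")
    passed = sum(1 for prompt, grader in cases if grader(model_fn(prompt)))
    return passed / len(cases)
```

Defining the grader functions before building is the "define working first" discipline the section intro describes; the frameworks add metrics, reporting, and CI hooks on top of this loop.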
The patterns that separate a proof-of-concept from a system another engineer can maintain - fallbacks, versioning, rollback, and documentation that survives handover.
Structured LLM outputs via Pydantic - type-validated responses from OpenAI, Anthropic, and Cohere. Handles retry logic automatically.
Framework for building stateful, multi-actor applications with LLMs. The standard for agentic architectures in production.
Framework for programming - not prompting - foundation models. Automates prompt optimisation.
Call all LLM APIs using the OpenAI format. Essential infrastructure for multi-model failovers and cost tracking.
Output validation and structured data extraction with retry logic. Prevents malformed outputs from reaching production.
Google's official Python framework for building production agents with Gemini - orchestration, tools, and multi-agent composition.
Official SDK with streaming, tool use, prompt caching, and batching support. The starting point for any Claude production deployment.
Official SDK with structured outputs, function calling, async support, and streaming. Reference before using any wrapper library.
Real architecture decisions from shipping AI features - how to structure fallbacks, handle failures, and version prompts at scale.
Chip Huyen's 2025 analysis of what production agents look like - memory, planning, tool use, and where they still break.
Anthropic's practitioner guide to agent patterns - when to use workflows vs autonomous agents, and patterns that actually scale.
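The fallback structure these resources describe is, at its core, an ordered list of provider callables with per-provider retries. A minimal sketch under that assumption (names and retry policy are illustrative):

```python
import time
from typing import Callable, Optional


def call_with_fallback(
    providers: list[Callable[[str], str]],
    prompt: str,
    retries: int = 2,
    backoff_s: float = 0.0,
) -> str:
    """Try each provider in order, retrying transient failures,
    before falling through to the next one.
    """
    last_err: Optional[Exception] = None
    for call in providers:
        for attempt in range(retries):
            try:
                return call(prompt)
            except Exception as err:  # narrow to transient errors in real code
                last_err = err
                if backoff_s:
                    time.sleep(backoff_s * (attempt + 1))
    raise RuntimeError("all providers failed") from last_err
```

In practice the bare `except Exception` should be narrowed to timeouts and rate limits - retrying a 400 Bad Request across every provider just multiplies the failure.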
Data residency, audit trails, and the human-in-the-loop design decisions regulators are starting to require. India-first perspective included.
India's primary data protection law - data principal rights, processing obligations, and penalties. Required reading for any India-facing AI deployment.
Regulation classifying AI systems by risk tier - high-risk requirements and conformity assessment obligations. Enforcement begins Aug 2026.
Four-function governance model (Govern, Map, Measure, Manage) for enterprise AI risk. Reference framework for AI policy teams.
Public Responsible Scaling Policy (RSP) - evaluation thresholds and safety commitments tied to model capability levels. Shows how frontier labs think about deployment risk.
Official permitted use cases and restrictions - mandatory reading before deploying production applications on the OpenAI API.
Searchable log of real-world AI failure incidents - sourced for risk analysis, governance presentations, and post-mortem reference.
Devverse Labs does not endorse or derive commercial benefit from any resource listed here. Attribution is preserved as found in original sources. Links verified as of March 2026. Arranged newest-first within each section.