Scaling Infrastructure as Code with Terraform

Terraform has established itself as a leading tool for Infrastructure as Code (IaC), letting DevOps teams manage cloud resources in a declarative, versioned, and repeatable way. This guide shows how to scale your infrastructure efficiently with Terraform.

What is Terraform and why does it matter for scaling?

Terraform is an infrastructure tool developed by HashiCorp that lets you define infrastructure in declarative configuration files. It was open source through version 1.5; later releases are published under the Business Source License. Unlike imperative scripts, Terraform describes the desired state of your infrastructure and takes care of creating, modifying, or destroying resources to reach that state.

Key advantages of Terraform for scaling:

  • State management: keeps a detailed record of the current state of your infrastructure
  • Change planning: shows exactly which changes will be made before applying them
  • Parallelization: creates resources concurrently when they do not depend on each other
  • Reuse: lets you build reusable modules for common patterns
  • Multi-cloud: works with AWS, Azure, GCP, and more than 1,000 providers

Core concepts for scaling

1. Organizing Terraform code

How you structure your Terraform code is crucial for scaling. A recommended layout, using the paths that appear throughout this guide:

modules/
  vpc/
    main.tf
    variables.tf
    outputs.tf
  compute/
    main.tf
    load_balancer.tf
environments/
  production/
    main.tf

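2. Remote state management

For teams and production environments, a remote backend for storing state is essential. In the configuration below, the S3 bucket holds the state file and the DynamoDB table provides locking, so two runs cannot write the state at the same time: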

# backend.tf
terraform {
  backend "s3" {
    bucket         = "mi-empresa-terraform-state"
    key            = "environments/production/terraform.tfstate"
    region         = "us-west-2"
    dynamodb_table = "terraform-state-lock"
    encrypt        = true
  }
}
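
The backend block above assumes the bucket and the lock table already exist. A minimal sketch of how they could be created follows, applied once from a separate configuration before the backend is enabled. The resource labels are hypothetical; the bucket and table names are taken from the backend block. The S3 backend expects the lock table to have a string hash key named LockID.

```hcl
# Bootstrap resources for the remote backend (sketch; resource labels are
# hypothetical, names match the backend block above).
resource "aws_s3_bucket" "terraform_state" {
  bucket = "mi-empresa-terraform-state"

  # Guard against accidental deletion of the bucket that holds all state.
  lifecycle {
    prevent_destroy = true
  }
}

# Versioning keeps earlier state files so a bad write can be rolled back.
resource "aws_s3_bucket_versioning" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  versioning_configuration {
    status = "Enabled"
  }
}

# Lock table used by the S3 backend; the hash key must be named LockID.
resource "aws_dynamodb_table" "terraform_lock" {
  name         = "terraform-state-lock"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}
```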

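3. Versioning and locking providers

Declare version constraints to keep runs reproducible, and commit the .terraform.lock.hcl file that terraform init generates, since it records the exact provider versions selected: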

terraform {
  required_version = ">= 1.0"
  
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

Practical implementation: scaling a web application

Step 1: Base VPC module

Create a reusable module for the network:

# modules/vpc/main.tf
resource "aws_vpc" "main" {
  cidr_block           = var.vpc_cidr
  enable_dns_hostnames = true
  enable_dns_support   = true

  tags = {
    Name        = "${var.project_name}-vpc"
    Environment = var.environment
    ManagedBy   = "terraform"
  }
}

resource "aws_internet_gateway" "main" {
  vpc_id = aws_vpc.main.id

  tags = {
    Name        = "${var.project_name}-igw"
    Environment = var.environment
  }
}

resource "aws_subnet" "public" {
  count = length(var.availability_zones)

  vpc_id                  = aws_vpc.main.id
  cidr_block              = cidrsubnet(var.vpc_cidr, 8, count.index)
  availability_zone       = var.availability_zones[count.index]
  map_public_ip_on_launch = true

  tags = {
    Name        = "${var.project_name}-public-${count.index + 1}"
    Type        = "public"
    Environment = var.environment
  }
}

resource "aws_subnet" "private" {
  count = length(var.availability_zones)

  vpc_id            = aws_vpc.main.id
  cidr_block        = cidrsubnet(var.vpc_cidr, 8, count.index + 10)
  availability_zone = var.availability_zones[count.index]

  tags = {
    Name        = "${var.project_name}-private-${count.index + 1}"
    Type        = "private"
    Environment = var.environment
  }
}

resource "aws_route_table" "public" {
  vpc_id = aws_vpc.main.id

  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.main.id
  }

  tags = {
    Name = "${var.project_name}-public-rt"
  }
}

resource "aws_route_table_association" "public" {
  count = length(aws_subnet.public)

  subnet_id      = aws_subnet.public[count.index].id
  route_table_id = aws_route_table.public.id
}
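
To make the subnet arithmetic in this module concrete, here is a small sketch of what cidrsubnet produces for the default 10.0.0.0/16 range with three availability zones. The local names are hypothetical and exist only for illustration; the expressions mirror the ones used in the public and private subnet resources.

```hcl
locals {
  example_vpc_cidr = "10.0.0.0/16"

  # cidrsubnet(prefix, newbits, netnum): a /16 plus 8 new bits yields /24 networks.
  # Evaluates to ["10.0.0.0/24", "10.0.1.0/24", "10.0.2.0/24"]
  public_example = [for i in range(3) : cidrsubnet(local.example_vpc_cidr, 8, i)]

  # The private subnets start at network number 10.
  # Evaluates to ["10.0.10.0/24", "10.0.11.0/24", "10.0.12.0/24"]
  private_example = [for i in range(3) : cidrsubnet(local.example_vpc_cidr, 8, i + 10)]
}
```

Because the private range starts at offset 10, the two ranges would only collide if more than ten availability zones were passed in, which no current AWS region requires.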

Step 2: VPC module variables

# modules/vpc/variables.tf
variable "project_name" {
  description = "Project name"
  type        = string
}

variable "environment" {
  description = "Environment (dev, staging, production)"
  type        = string
}

variable "vpc_cidr" {
  description = "CIDR block for the VPC"
  type        = string
  default     = "10.0.0.0/16"
}

variable "availability_zones" {
  description = "List of availability zones"
  type        = list(string)
}

Step 3: Module outputs

# modules/vpc/outputs.tf
output "vpc_id" {
  description = "ID of the VPC"
  value       = aws_vpc.main.id
}

output "public_subnet_ids" {
  description = "IDs of the public subnets"
  value       = aws_subnet.public[*].id
}

output "private_subnet_ids" {
  description = "IDs of the private subnets"
  value       = aws_subnet.private[*].id
}

output "vpc_cidr_block" {
  description = "CIDR block of the VPC"
  value       = aws_vpc.main.cidr_block
}

Step 4: Auto Scaling module

# modules/compute/main.tf
resource "aws_launch_template" "app" {
  name_prefix   = "${var.project_name}-${var.environment}-"
  image_id      = var.ami_id
  instance_type = var.instance_type

  vpc_security_group_ids = [aws_security_group.app.id]

  user_data = base64encode(templatefile("${path.module}/user_data.sh", {
    app_name = var.project_name
  }))

  tag_specifications {
    resource_type = "instance"
    tags = {
      Name        = "${var.project_name}-${var.environment}"
      Environment = var.environment
      ManagedBy   = "terraform"
    }
  }

  lifecycle {
    create_before_destroy = true
  }
}

resource "aws_autoscaling_group" "app" {
  name                      = "${var.project_name}-${var.environment}-asg"
  vpc_zone_identifier       = var.subnet_ids
  target_group_arns         = [aws_lb_target_group.app.arn]
  health_check_type         = "ELB"
  health_check_grace_period = 300

  min_size         = var.min_instances
  max_size         = var.max_instances
  desired_capacity = var.desired_instances

  launch_template {
    id = aws_launch_template.app.id
    # Reference the concrete version number so that a template change alters
    # the ASG configuration and triggers the instance refresh below.
    version = aws_launch_template.app.latest_version
  }

  instance_refresh {
    strategy = "Rolling"
    preferences {
      min_healthy_percentage = 50
    }
  }

  tag {
    key                 = "Name"
    value               = "${var.project_name}-${var.environment}-asg"
    propagate_at_launch = false
  }
}

resource "aws_autoscaling_policy" "scale_up" {
  name                   = "${var.project_name}-scale-up"
  scaling_adjustment     = 2
  adjustment_type        = "ChangeInCapacity"
  cooldown               = 300
  autoscaling_group_name = aws_autoscaling_group.app.name
}

resource "aws_autoscaling_policy" "scale_down" {
  name                   = "${var.project_name}-scale-down"
  scaling_adjustment     = -1
  adjustment_type        = "ChangeInCapacity"
  cooldown               = 300
  autoscaling_group_name = aws_autoscaling_group.app.name
}

resource "aws_cloudwatch_metric_alarm" "high_cpu" {
  alarm_name          = "${var.project_name}-high-cpu"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "CPUUtilization"
  namespace           = "AWS/EC2"
  period              = 120
  statistic           = "Average"
  threshold           = 75
  alarm_description   = "This metric monitors ec2 cpu utilization"
  alarm_actions       = [aws_autoscaling_policy.scale_up.arn]

  dimensions = {
    AutoScalingGroupName = aws_autoscaling_group.app.name
  }
}

resource "aws_cloudwatch_metric_alarm" "low_cpu" {
  alarm_name          = "${var.project_name}-low-cpu"
  comparison_operator = "LessThanThreshold"
  evaluation_periods  = 2
  metric_name         = "CPUUtilization"
  namespace           = "AWS/EC2"
  period              = 120
  statistic           = "Average"
  threshold           = 25
  alarm_description   = "This metric monitors ec2 cpu utilization"
  alarm_actions       = [aws_autoscaling_policy.scale_down.arn]

  dimensions = {
    AutoScalingGroupName = aws_autoscaling_group.app.name
  }
}
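
The two step policies and their CloudWatch alarms above can also be expressed as a single target tracking policy. This is a different technique from the one shown in this guide, not a correction of it: the Auto Scaling service creates and manages the alarms itself and adjusts capacity to hold a metric near a target. A minimal sketch, assuming a 50% average CPU target (the resource label and the target value are illustrative):

```hcl
# Alternative to the scale_up/scale_down policies: target tracking on CPU.
resource "aws_autoscaling_policy" "cpu_target" {
  name                   = "${var.project_name}-cpu-target"
  policy_type            = "TargetTrackingScaling"
  autoscaling_group_name = aws_autoscaling_group.app.name

  target_tracking_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ASGAverageCPUUtilization"
    }

    # Assumed target; tune it for the workload.
    target_value = 50
  }
}
```

Use one approach or the other for a given metric; combining both on CPU can produce competing scaling decisions.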

Step 5: Load Balancer

# modules/compute/load_balancer.tf
resource "aws_lb" "app" {
  name               = "${var.project_name}-${var.environment}-alb"
  internal           = false
  load_balancer_type = "application"
  security_groups    = [aws_security_group.alb.id]
  subnets            = var.subnet_ids

  enable_deletion_protection = var.environment == "production"

  tags = {
    Environment = var.environment
    ManagedBy   = "terraform"
  }
}

resource "aws_lb_target_group" "app" {
  name     = "${var.project_name}-${var.environment}-tg"
  port     = 80
  protocol = "HTTP"
  vpc_id   = var.vpc_id

  health_check {
    enabled             = true
    healthy_threshold   = 2
    unhealthy_threshold = 2
    timeout             = 5
    interval            = 30
    path                = "/health"
    matcher             = "200"
    port                = "traffic-port"
    protocol            = "HTTP"
  }

  tags = {
    Environment = var.environment
  }
}

resource "aws_lb_listener" "app" {
  load_balancer_arn = aws_lb.app.arn
  port              = "80"
  protocol          = "HTTP"

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.app.arn
  }
}

resource "aws_security_group" "alb" {
  name_prefix = "${var.project_name}-alb-"
  vpc_id      = var.vpc_id

  ingress {
    description = "HTTP from anywhere"
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  ingress {
    description = "HTTPS from anywhere"
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }

  tags = {
    Name = "${var.project_name}-alb-sg"
  }

  lifecycle {
    create_before_destroy = true
  }
}

resource "aws_security_group" "app" {
  name_prefix = "${var.project_name}-app-"
  vpc_id      = var.vpc_id

  ingress {
    description     = "HTTP from ALB"
    from_port       = 80
    to_port         = 80
    protocol        = "tcp"
    security_groups = [aws_security_group.alb.id]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }

  tags = {
    Name = "${var.project_name}-app-sg"
  }

  lifecycle {
    create_before_destroy = true
  }
}
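
The compute module references several input variables, and the production configuration later reads a security_group_id output from it, but neither file is shown in this guide. A possible sketch of both follows. The variable and output names come from how the module is used; the types and the single default are assumptions.

```hcl
# modules/compute/variables.tf (sketch; types are assumptions)
variable "project_name" {
  type = string
}

variable "environment" {
  type = string
}

variable "vpc_id" {
  type = string
}

variable "subnet_ids" {
  type = list(string)
}

variable "ami_id" {
  type = string
}

variable "instance_type" {
  type    = string
  default = "t3.micro"
}

variable "min_instances" {
  type = number
}

variable "max_instances" {
  type = number
}

variable "desired_instances" {
  type = number
}

# modules/compute/outputs.tf (sketch)
# Consumed by the database security group in the production environment.
output "security_group_id" {
  description = "ID of the application security group"
  value       = aws_security_group.app.id
}
```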

Step 6: Production deployment

# environments/production/main.tf
provider "aws" {
  region = var.aws_region
  
  default_tags {
    tags = {
      Project     = var.project_name
      Environment = "production"
      ManagedBy   = "terraform"
      Team        = "devops"
    }
  }
}

module "vpc" {
  source = "../../modules/vpc"

  project_name       = var.project_name
  environment        = "production"
  vpc_cidr           = "10.0.0.0/16"
  availability_zones = ["us-west-2a", "us-west-2b", "us-west-2c"]
}

module "compute" {
  source = "../../modules/compute"

  project_name      = var.project_name
  environment       = "production"
  vpc_id            = module.vpc.vpc_id
  subnet_ids        = module.vpc.public_subnet_ids
  ami_id            = var.ami_id
  instance_type     = "t3.medium"
  min_instances     = 2
  max_instances     = 10
  desired_instances = 3
}

# RDS database
resource "aws_db_subnet_group" "main" {
  name       = "${var.project_name}-db-subnet-group"
  subnet_ids = module.vpc.private_subnet_ids

  tags = {
    Name = "${var.project_name} DB subnet group"
  }
}

resource "aws_db_instance" "main" {
  identifier     = "${var.project_name}-db"
  engine         = "postgres"
  engine_version = "14.9"
  instance_class = "db.t3.micro"

  allocated_storage     = 20
  max_allocated_storage = 100
  storage_type          = "gp2"
  storage_encrypted     = true

  db_name  = var.db_name
  username = var.db_username
  password = var.db_password

  db_subnet_group_name   = aws_db_subnet_group.main.name
  vpc_security_group_ids = [aws_security_group.db.id]

  backup_retention_period = 7
  backup_window           = "03:00-04:00"
  maintenance_window      = "Mon:04:00-Mon:05:00"

  skip_final_snapshot = false
  # A static name avoids a perpetual diff: timestamp() returns a new value
  # on every plan, so it would mark the instance as changed each time.
  final_snapshot_identifier = "${var.project_name}-db-final-snapshot"

  tags = {
    Name = "${var.project_name}-database"
  }
}

resource "aws_security_group" "db" {
  name_prefix = "${var.project_name}-db-"
  vpc_id      = module.vpc.vpc_id

  ingress {
    description     = "PostgreSQL from app servers"
    from_port       = 5432
    to_port         = 5432
    protocol        = "tcp"
    security_groups = [module.compute.security_group_id]
  }

  tags = {
    Name = "${var.project_name}-db-sg"
  }
}
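
The production configuration reads several root-level variables that are not declared anywhere in this guide. One way they might be declared is sketched below. The names come from the references in main.tf; the types, the region default, and the sensitive flags are assumptions.

```hcl
# environments/production/variables.tf (sketch)
variable "aws_region" {
  description = "AWS region for the production environment"
  type        = string
  default     = "us-west-2" # assumed; matches the availability zones used above
}

variable "project_name" {
  description = "Project name"
  type        = string
}

variable "ami_id" {
  description = "AMI used by the launch template"
  type        = string
}

variable "db_name" {
  description = "Name of the initial database"
  type        = string
}

variable "db_username" {
  description = "Master username for the database"
  type        = string
  sensitive   = true
}

variable "db_password" {
  description = "Master password for the database"
  type        = string
  sensitive   = true # hides the value in plan output; it is still stored in state
}
```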

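Best practices for scaling

1. Secrets management

Never hardcode credentials in your Terraform code. Use AWS Secrets Manager or environment variables. Keep in mind that values read through a data source are still written to the state file, so the state backend must be encrypted and access-controlled: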

data "aws_secretsmanager_secret_version" "db_credentials" {
  secret_id = "prod/database/credentials"
}

locals {
  db_credentials = jsondecode(data.aws_secretsmanager_secret_version.db_credentials.secret_string)
}

resource "aws_db_instance" "main" {
  username = local.db_credentials.username
  password = local.db_credentials.password
  # ... other parameters
}
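
Another option, sketched here as an alternative rather than as the approach this guide uses, is to let RDS generate and store the master password in Secrets Manager itself. With the manage_master_user_password argument of aws_db_instance, the password never passes through Terraform variables. The output name below is hypothetical.

```hcl
resource "aws_db_instance" "main" {
  # ... other parameters as in the production example

  username = var.db_username

  # RDS creates and manages the secret; this replaces the password argument.
  manage_master_user_password = true
}

# ARN of the secret that RDS manages, for applications that need to read it.
output "db_master_secret_arn" {
  value = aws_db_instance.main.master_user_secret[0].secret_arn
}
```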

2. Input validation

Validate input variables to catch mistakes before any resource is created:

variable "instance_type" {
  description = "EC2 instance type"
  type        = string
  default     = "t3.micro"

  validation {
    condition     = can(regex("^t3\\.", var.instance_type))
    error_message = "The instance type must belong to the t3 family."
  }
}

variable "environment" {
  description = "Deployment environment"
  type        = string

  validation {
    condition     = contains(["dev", "staging", "production"], var.environment)
    error_message = "The environment must be dev, staging, or production."
  }
}
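
The same technique could be applied to the VPC module inputs. The sketch below is not part of the module as shown earlier: it uses can(cidrhost(...)) as a common idiom for checking that a string parses as a CIDR block, and it assumes that at least two availability zones are wanted.

```hcl
variable "vpc_cidr" {
  description = "CIDR block for the VPC"
  type        = string
  default     = "10.0.0.0/16"

  validation {
    # cidrhost fails on malformed input, and can() turns that failure into false.
    condition     = can(cidrhost(var.vpc_cidr, 0))
    error_message = "The vpc_cidr value must be a valid CIDR block, for example 10.0.0.0/16."
  }
}

variable "availability_zones" {
  description = "List of availability zones"
  type        = list(string)

  validation {
    # Assumed requirement: spread subnets over two or more zones.
    condition     = length(var.availability_zones) >= 2
    error_message = "Provide at least two availability zones."
  }
}
```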

3. Consistent tagging

Implement a consistent tagging strategy:

data "aws_caller_identity" "current" {}

locals {
  common_tags = {
    Project     = var.project_name
    Environment = var.environment
    ManagedBy   = "terraform"
    Team        = var.team_name
    CostCenter  = var.cost_center
    CreatedBy   = data.aws_caller_identity.current.user_id
    # Avoid timestamp() in tags: it changes on every run, so every tagged
    # resource would show a diff on each plan.
  }
}

resource "aws_instance" "example" {
  # ... instance configuration

  tags = merge(local.common_tags, {
    Name = "${var.project_name}-${var.environment}-web"
    Type = "web-server"
  })
}

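4. Testing and validation

Write tests for your Terraform code with Terratest:

package test

import (
    "testing"

    "github.com/gruntwork-io/terratest/modules/terraform"
    "github.com/stretchr/testify/assert"
)

func TestTerraformVpcModule(t *testing.T) {
    terraformOptions := &terraform.Options{
        TerraformDir: "../examples/vpc",
        Vars: map[string]interface{}{
            "project_name": "test-project",
            "environment":  "test",
            // The VPC module has no default for availability_zones; this assumes
            // the example passes the variable through to the module.
            "availability_zones": []string{"us-west-2a", "us-west-2b"},
        },
    }

    // Destroy the test infrastructure even if an assertion fails.
    defer terraform.Destroy(t, terraformOptions)
    terraform.InitAndApply(t, terraformOptions)

    vpcId := terraform.Output(t, terraformOptions, "vpc_id")
    assert.NotEmpty(t, vpcId)
}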

Advanced scaling strategies

1. Blue/Green Deployments

resource "aws_autoscaling_group" "blue" {
  count = var.deployment_color == "blue" ? 1 : 0
  # ... ASG configuration
}

resource "aws_autoscaling_group" "green" {
  count = var.deployment_color == "green" ? 1 : 0
  # ... ASG configuration
}
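
The two groups above are toggled by a deployment_color variable that is not declared in this guide. A minimal sketch of how it might be declared, with a validation so that only the two expected values are accepted (the default is an assumption):

```hcl
variable "deployment_color" {
  description = "Which Auto Scaling group is live: blue or green"
  type        = string
  default     = "blue" # assumed starting color

  validation {
    condition     = contains(["blue", "green"], var.deployment_color)
    error_message = "The deployment_color value must be blue or green."
  }
}
```

Switching colors is then a matter of running terraform apply -var="deployment_color=green". Note that with this count-based toggle Terraform destroys the inactive group in the same apply, so both colors never serve traffic side by side.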

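2. Multi-region

# main.tf
# A module called with for_each cannot select a different provider
# configuration per instance, so each region gets an aliased provider
# and its own module block.
provider "aws" {
  alias  = "us_west_2"
  region = "us-west-2"
}

provider "aws" {
  alias  = "us_east_1"
  region = "us-east-1"
}

module "infrastructure_us_west_2" {
  source    = "./modules/regional-infrastructure"
  providers = { aws = aws.us_west_2 }

  environment = var.environment
}

module "infrastructure_us_east_1" {
  source    = "./modules/regional-infrastructure"
  providers = { aws = aws.us_east_1 }

  environment = var.environment
}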

3. Workspaces for multiple environments

# Shell commands for managing workspaces
terraform workspace new development
terraform workspace new staging
terraform workspace new production

# Use the current workspace in the configuration (HCL)
locals {
  environment = terraform.workspace
  
  config = {
    development = {
      instance_type = "t3.micro"
      min_size     = 1
    }
    staging = {
      instance_type = "t3.small"
      min_size     = 2
    }
    production = {
      instance_type = "t3.medium"
      min_size     = 3
    }
  }
}
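
The config map above only becomes useful once the entry for the current workspace is selected. A minimal sketch of that lookup follows. The local name workspace_config and the example resource are hypothetical, var.ami_id is assumed to be declared elsewhere, and the fallback to the development settings for unknown workspaces such as "default" is an assumption; failing fast may be preferable.

```hcl
locals {
  # try() returns the fallback when the workspace has no entry in local.config.
  workspace_config = try(local.config[terraform.workspace], local.config["development"])
}

resource "aws_launch_template" "workspace_example" {
  name_prefix   = "${terraform.workspace}-"
  image_id      = var.ami_id
  instance_type = local.workspace_config.instance_type
}
```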

Monitoring and observability

1. Custom CloudWatch dashboard

resource "aws_cloudwatch_dashboard" "main" {
  dashboard_name = "${var.project_name}-${var.environment}"

  dashboard_body = jsonencode({
    widgets = [
      {
        type   = "metric"
        width  = 12
        height = 6

        properties = {
          metrics = [
            ["AWS/ApplicationELB", "TargetResponseTime", "LoadBalancer", aws_lb.app.arn_suffix],
            ["AWS/ApplicationELB", "RequestCount", "LoadBalancer", aws_lb.app.arn_suffix],
          ]
          period = 300
          stat   = "Average"
          region = var.aws_region
          title  = "Application Load Balancer Metrics"
        }
      }
    ]
  })
}

2. Smart alerts

resource "aws_sns_topic" "alerts" {
  name = "${var.project_name}-alerts"
}

resource "aws_cloudwatch_metric_alarm" "high_error_rate" {
  alarm_name          = "${var.project_name}-high-error-rate"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "HTTPCode_Target_5XX_Count"
  namespace           = "AWS/ApplicationELB"
  period              = 300
  statistic           = "Sum"
  threshold           = 10
  alarm_description   = "This metric monitors application error rate"
  alarm_actions       = [aws_sns_topic.alerts.arn]

  dimensions = {
    LoadBalancer = aws_lb.app.arn_suffix
  }
}
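
An SNS topic delivers nothing until something subscribes to it. The sketch below adds an email subscription to the alerts topic; the alert_email variable is hypothetical and would need a value in each environment.

```hcl
variable "alert_email" {
  description = "Address that receives alarm notifications (hypothetical variable)"
  type        = string
}

resource "aws_sns_topic_subscription" "alerts_email" {
  topic_arn = aws_sns_topic.alerts.arn
  protocol  = "email"
  endpoint  = var.alert_email
}
```

Email subscriptions stay pending until the recipient confirms them through the link AWS sends, so the alarm will not reach anyone before that step is done.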

CI/CD with Terraform

1. GitLab CI pipeline

# .gitlab-ci.yml
stages:
  - validate
  - plan
  - apply

variables:
  TF_ROOT: ${CI_PROJECT_DIR}/terraform
  TF_ADDRESS: ${CI_API_V4_URL}/projects/${CI_PROJECT_ID}/terraform/state/${CI_ENVIRONMENT_NAME}

cache:
  key: "${CI_COMMIT_REF_SLUG}"
  paths:
    - ${TF_ROOT}/.terraform

before_script:
  - cd ${TF_ROOT}
  - terraform --version
  - terraform init

validate:
  stage: validate
  script:
    - terraform validate
    - terraform fmt -check

plan:
  stage: plan
  script:
    - terraform plan -out="planfile"
  artifacts:
    paths:
      - ${TF_ROOT}/planfile
    expire_in: 1 week

apply:
  stage: apply
  script:
    - terraform apply -input=false "planfile"
  dependencies:
    - plan
  when: manual
  only:
    - main

2. GitHub Actions

# .github/workflows/terraform.yml
name: 'Terraform'

on:
  push:
    branches: [ main ]
  pull_request:

jobs:
  terraform:
    name: 'Terraform'
    runs-on: ubuntu-latest
    environment: production

    defaults:
      run:
        shell: bash

    steps:
    - name: Checkout
      uses: actions/checkout@v3

    - name: Setup Terraform
      uses: hashicorp/setup-terraform@v2
      with:
        terraform_version: 1.5.0

    - name: Terraform Format
      id: fmt
      run: terraform fmt -check

    - name: Terraform Init
      id: init
      run: terraform init

    - name: Terraform Plan
      id: plan
      run: terraform plan -no-color -input=false
      continue-on-error: true

    - name: Terraform Plan Status
      if: steps.plan.outcome == 'failure'
      run: exit 1

    - name: Terraform Apply
      if: github.ref == 'refs/heads/main' && github.event_name == 'push'
      run: terraform apply -auto-approve -input=false

Troubleshooting and optimization

1. Common debugging

# Enable detailed logs
export TF_LOG=DEBUG
export TF_LOG_PATH=./terraform.log

# Import existing resources
terraform import aws_instance.example i-1234567890abcdef0

# Upgrade providers within the version constraints without changing infrastructure
terraform init -upgrade

# Validate the configuration without applying it
terraform validate
# Exit code 0 means no changes, 1 an error, 2 pending changes
terraform plan -detailed-exitcode

2. Performance optimization

# Parallelism is a CLI flag, not a configuration setting (the default is 10).
# Lower it when the provider API throttles requests in large environments.
terraform apply -parallelism=5

# Use explicit depends_on only when Terraform cannot infer the dependency
resource "aws_instance" "app" {
  # ... configuration

  depends_on = [
    aws_db_instance.main,
    aws_security_group.app
  ]
}

Conclusion

Terraform is a powerful tool for scaling infrastructure as code, but success depends on applying best practices from the start. Modularization, automated testing, sound state management, and CI/CD integration are fundamental to projects that last.

Remember that scaling is not only technical; it also involves processes, documentation, and team training. Start with simple implementations and move gradually toward more complex architectures as your team gains experience.

The initial investment in structuring your Terraform code well pays off when you need to manage complex infrastructure across multiple environments and regions.