AWS Lambda, Azure Functions, and When to Reach for Step Functions Instead

You write a Lambda. It works. You write another Lambda that calls the first one. Then a third. Then you add error handling, retry logic, timeouts, and a dead-letter queue. Six months later you have a tangle of SQS queues, EventBridge rules, and CloudWatch alarms that nobody fully understands. This is how serverless orchestration goes off the rails.

Lambda and Azure Functions give you compute. Step Functions gives you coordination. They solve different problems.

The function model

Lambda runs your code in response to events — S3 upload, API Gateway request, DynamoDB stream record. It runs for up to 15 minutes, then dies. Stateless, ephemeral, scales to zero.

import { APIGatewayProxyHandler } from "aws-lambda";

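// validateOrder and saveToDynamoDB are application helpers defined elsewhere.
export const handler: APIGatewayProxyHandler = async (event) => {
  // API Gateway passes a null body for empty requests, so guard before parsing.
  if (!event.body) {
    return { statusCode: 400, body: JSON.stringify({ error: "Missing request body" }) };
  }
  const order = JSON.parse(event.body);
  await validateOrder(order);
  await saveToDynamoDB(order);
  return { statusCode: 201, body: JSON.stringify({ id: order.id }) };
};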

That single-function case is a solved problem. The moment your logic spans multiple functions (validate, charge payment, update inventory, send email, with retries and compensating actions if anything fails), you’re not writing functions anymore. You’re wiring infrastructure.

Azure Functions works the same way. Triggers from Blob Storage, Service Bus, CosmosDB, HTTP. The main difference is bindings — instead of instantiating SDK clients, you declare them in the function signature:

[FunctionName("ProcessOrder")]
public static async Task<IActionResult> Run(
    [HttpTrigger(AuthorizationLevel.Function, "post")] HttpRequest req,
    [CosmosDB("orders", "items", Connection = "CosmosDB")] IAsyncCollector<Order> orders,
    ILogger log)
{
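    // Read the body asynchronously; synchronous reads of the request stream are disallowed by default.
    var order = await JsonSerializer.DeserializeAsync<Order>(req.Body);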
    await orders.AddAsync(order);
    return new OkResult();
}

IAsyncCollector<Order> handles the CosmosDB write. No SDK setup. Nice for simple CRUD, awkward when you need conditional branching or multi-step coordination.

Where chaining functions goes wrong

Take an order fulfillment pipeline: validate → charge → reserve inventory → send confirmation. With raw Lambdas, you have three options and none are great.

Put everything in one 15-minute Lambda. It works, but you’ve got a monolith in a serverless costume. A transient payment gateway failure retries the entire order from scratch.

Chain them through SQS. Each function publishes to a queue consumed by the next. Decoupled, but invisible — did step 3 fail? You’re grepping five CloudWatch log groups trying to trace a correlation ID.

Emit events through EventBridge. Fan-out works. Sequencing doesn’t. If step 2 must finish before step 3 starts, you’re building that yourself with flags in DynamoDB.

All three become operational nightmares as the workflow grows.
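To make the tracing problem in the queue-chained option concrete, here is a minimal in-memory sketch. Nothing in it is an AWS API: the Envelope type, the stage names, and the logGroups object standing in for separate CloudWatch log groups are all illustrative assumptions. The point is that every stage has to copy the correlation ID forward by hand, and finding one order means searching every stage’s log.

```typescript
// In-memory stand-in for a chain of queue-connected functions.
// All names here are hypothetical; this does not use the AWS SDK.
interface Order { id: string }
interface Envelope { correlationId: string; stage: string; order: Order }
type Stage = (msg: Envelope) => Envelope;

// One log array per stage, mimicking one log group per function.
const logGroups: Record<string, string[]> = {};

function makeStage(name: string): Stage {
  return (msg) => {
    const lines = logGroups[name] ?? [];
    lines.push(`${msg.correlationId} order=${msg.order.id}`);
    logGroups[name] = lines;
    // Every stage must forward the correlation ID itself; drop it and the trace breaks.
    return { correlationId: msg.correlationId, stage: name, order: msg.order };
  };
}

const pipeline: Stage[] = ["validate", "charge", "reserve", "confirm"].map(makeStage);

function runChain(order: Order, correlationId: string): Envelope {
  let msg: Envelope = { correlationId, stage: "api", order };
  for (const stage of pipeline) {
    msg = stage(msg); // in the real setup: publish to a queue, next function consumes
  }
  return msg;
}

// Tracing one order means scanning every stage's log for the same ID.
function trace(correlationId: string): string[] {
  return Object.keys(logGroups).filter((group) =>
    logGroups[group].some((line) => line.indexOf(correlationId + " ") === 0),
  );
}
```

Calling runChain for an order and then trace with its correlation ID returns the stages that logged it. A stage missing from that list is the only signal that something went wrong there.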

Step Functions

Step Functions is AWS’s orchestration layer. Define a state machine, hand it your Lambdas, and it manages execution, retries, error routing, parallel branches, and timeouts. It keeps state between steps so your functions don’t have to.

{
  "StartAt": "ValidateOrder",
  "States": {
    "ValidateOrder": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:validate-order",
      "Next": "ChargePayment",
      "Retry": [{ "ErrorEquals": ["States.ALL"], "MaxAttempts": 3 }]
    },
    "ChargePayment": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:charge-payment",
      "Next": "ReserveInventory",
      "Catch": [{
        "ErrorEquals": ["PaymentDeclined"],
        "Next": "CancelOrder"
      }]
    },
    "ReserveInventory": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:reserve-inventory",
      "Next": "SendConfirmation"
    },
    "SendConfirmation": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:send-confirmation",
      "End": true
    },
    "CancelOrder": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:cancel-order",
      "End": true
    }
  }
}

ValidateOrder retries up to 3 times on any error. ChargePayment catches PaymentDeclined and routes to CancelOrder. Each Lambda does one thing; the state machine owns the flow. Every execution is visible in the console: which step failed, with what input, how many retries.
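For intuition about what it means for the state machine to own the flow, here is a toy interpreter for a small subset of that definition: Task states with Next, End, and Catch only. It is a sketch, not how the Step Functions service works internally. The runMachine function, the handlers map, and the use of synchronous in-memory handlers in place of Lambdas are all assumptions made for illustration.

```typescript
// Toy interpreter for a subset of Amazon States Language (Task + Next/End/Catch).
// Retry, Choice, Parallel, and input/output processing are deliberately omitted.
interface CatchRule { ErrorEquals: string[]; Next: string }
interface TaskState {
  Type: "Task";
  Resource: string; // here a key into a local handlers map, not a real ARN
  Next?: string;
  End?: boolean;
  Catch?: CatchRule[];
}
interface Machine { StartAt: string; States: Record<string, TaskState> }
type Handler = (input: unknown) => unknown;

function findCatch(state: TaskState, errorName: string): CatchRule | undefined {
  for (const rule of state.Catch ?? []) {
    if (rule.ErrorEquals.indexOf(errorName) !== -1 ||
        rule.ErrorEquals.indexOf("States.ALL") !== -1) {
      return rule;
    }
  }
  return undefined;
}

function runMachine(
  machine: Machine,
  handlers: Record<string, Handler>,
  input: unknown,
): { visited: string[]; output: unknown } {
  const visited: string[] = [];
  let data: unknown = input;
  let current: string | undefined = machine.StartAt;
  while (current !== undefined) {
    const state: TaskState | undefined = machine.States[current];
    const handler: Handler | undefined = state ? handlers[state.Resource] : undefined;
    if (!state || !handler) throw new Error(`No state or handler for ${current}`);
    visited.push(current);
    try {
      data = handler(data); // the "Lambda" does one thing
      current = state.End ? undefined : state.Next;
    } catch (err) {
      // Placeholder error name for non-Error throws; an assumption of this sketch.
      const name = err instanceof Error ? err.name : "States.TaskFailed";
      const rule: CatchRule | undefined = findCatch(state, name);
      if (!rule) throw err; // unhandled error fails the whole execution
      current = rule.Next;  // error routing lives in the definition, not in the handler
    }
  }
  return { visited, output: data };
}

// The order workflow from above, with short handler keys instead of ARNs.
const orderMachine: Machine = {
  StartAt: "ValidateOrder",
  States: {
    ValidateOrder: { Type: "Task", Resource: "validate-order", Next: "ChargePayment" },
    ChargePayment: {
      Type: "Task", Resource: "charge-payment", Next: "ReserveInventory",
      Catch: [{ ErrorEquals: ["PaymentDeclined"], Next: "CancelOrder" }],
    },
    ReserveInventory: { Type: "Task", Resource: "reserve-inventory", Next: "SendConfirmation" },
    SendConfirmation: { Type: "Task", Resource: "send-confirmation", End: true },
    CancelOrder: { Type: "Task", Resource: "cancel-order", End: true },
  },
};
```

If the charge-payment handler throws an error whose name is PaymentDeclined, the visited list ends in CancelOrder instead of SendConfirmation, and none of the handlers contains any routing code.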

Beyond sequential steps, Step Functions handles parallel branches, choice states (if/else), wait states (pause for hours or until a timestamp), and map states (run the same step over every item in an array). You get retries with exponential backoff. You get a visual execution history. You write none of the orchestration code.
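As a rough illustration of the backoff schedule, the sketch below computes the wait before each retry from the three standard retrier fields. In Amazon States Language, IntervalSeconds defaults to 1, BackoffRate to 2.0, and MaxAttempts to 3 when omitted. The retryDelays helper is a hypothetical name, and the sketch ignores optional extras such as a maximum delay or jitter.

```typescript
// Fields mirror the Amazon States Language retrier; the helper itself is hypothetical.
interface Retrier {
  IntervalSeconds?: number; // wait before the first retry (default 1)
  BackoffRate?: number;     // multiplier applied after each retry (default 2.0)
  MaxAttempts?: number;     // maximum number of retries (default 3)
}

// Returns the wait, in seconds, before each retry attempt.
function retryDelays(retrier: Retrier = {}): number[] {
  const interval = retrier.IntervalSeconds ?? 1;
  const rate = retrier.BackoffRate ?? 2.0;
  const attempts = retrier.MaxAttempts ?? 3;
  const delays: number[] = [];
  for (let i = 0; i < attempts; i++) {
    delays.push(interval * Math.pow(rate, i));
  }
  return delays;
}
```

For the ValidateOrder retrier above, which sets only MaxAttempts to 3, this gives waits of 1, 2, and 4 seconds.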

The costs

Step Functions charges per state transition (this is the Standard workflow pricing model). Five steps = five transitions = ~$0.000125 per execution. That is negligible at low volume and about $125 at a million executions per month. For comparison, Lambda’s request charge is ~$0.20 per million invocations, before duration charges.
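The arithmetic is simple enough to sketch. Both prices below are assumptions derived from the figures in this article, not values read from a pricing API, so check the current AWS pricing pages before relying on them. The function names are hypothetical.

```typescript
// Assumed prices, derived from the figures in the text.
const PRICE_PER_TRANSITION_USD = 0.000125 / 5;   // ~$0.000025 per state transition
const PRICE_PER_LAMBDA_REQUEST_USD = 0.20 / 1e6; // ~$0.20 per million requests

// Orchestration cost: every state transition is billed.
function stepFunctionsCost(transitionsPerExecution: number, executions: number): number {
  return transitionsPerExecution * executions * PRICE_PER_TRANSITION_USD;
}

// Lambda request cost only; duration (GB-seconds) is billed separately.
function lambdaRequestCost(invocationsPerExecution: number, executions: number): number {
  return invocationsPerExecution * executions * PRICE_PER_LAMBDA_REQUEST_USD;
}
```

Under these assumptions, a five-step workflow at a million executions per month costs about $125 in transitions, against about $1 in Lambda request charges for the same five invocations. The Lambda duration charges apply either way.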

Latency hurts more. Each transition means the Step Functions service receives the completion, evaluates the state machine, invokes the next Lambda, and waits for it to boot and return. That’s 500ms-2s of orchestration overhead per step. If your functions run for 50ms, that overhead dominates the end-to-end time. If they run for 30 seconds each, it disappears into the noise.
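A quick way to check which regime a workflow is in is to compute the share of wall-clock time spent on orchestration. The sketch below uses the per-step overhead figure quoted above as an input rather than a measured value, and the function names are hypothetical.

```typescript
// Fraction of each step's wall-clock time spent on orchestration overhead.
function overheadShare(functionMs: number, overheadMs: number): number {
  return overheadMs / (functionMs + overheadMs);
}

// End-to-end time for a purely sequential workflow.
function endToEndMs(steps: number, functionMs: number, overheadMs: number): number {
  return steps * (functionMs + overheadMs);
}
```

With 50ms functions and 500ms of overhead per step, overheadShare is about 0.91, and a five-step workflow takes 2,750ms instead of 250ms. With 30-second functions the share drops below 0.02.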

When orchestration is not worth it

A single Lambda behind API Gateway — e.g. validate input, write to DynamoDB, return.
Two functions where the second is fire-and-forget — send an email, update a search index. SQS or EventBridge handles that.
Fan-out to independent consumers, where one event triggers five services. SNS or EventBridge handles that without an orchestrator.

Durable Functions: the Azure equivalent

Azure’s closest answer is Durable Functions — an extension that adds stateful workflows written as ordinary code:

[FunctionName("OrderWorkflow")]
public static async Task RunOrchestrator(
    [OrchestrationTrigger] IDurableOrchestrationContext context)
{
    var order = context.GetInput<Order>();

    await context.CallActivityAsync("ValidateOrder", order);
    await context.CallActivityAsync("ChargePayment", order);

    var inventoryResult = await context.CallActivityAsync<bool>(
        "ReserveInventory", order);

    if (!inventoryResult)
    {
        await context.CallActivityAsync("CancelOrder", order);
        return;
    }

    await context.CallActivityAsync("SendConfirmation", order);
}

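The runtime checkpoints after each CallActivityAsync. If the function host crashes, the orchestrator is replayed against its recorded history, so completed activities are not run again and execution effectively resumes from the last checkpoint. Because of that replay, orchestrator code has to be deterministic: no direct clock reads, random values, or I/O outside activities. You can unit test the orchestrator function like any other code, and complex logic (loops, conditionals, variables) is natural rather than JSON contortions. The trade-off: no visual execution history, no declarative audit trail in the console.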
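The checkpoint-and-replay idea can be modeled in a few lines. This is a simplified sketch for intuition, not the Durable Functions runtime or its API. ReplayContext, callActivity, the crashAfter knob, and the synchronous activities are all assumptions made so the example stays self-contained.

```typescript
// Simplified model of checkpoint-and-replay orchestration.
type Activity = (input: unknown) => unknown;
interface Order { id: string }

class ReplayContext {
  private cursor = 0;   // position in the recorded history during this run
  private executed = 0; // activities actually run (not replayed) during this run

  constructor(
    private readonly history: unknown[],                  // persisted activity results
    private readonly activities: Record<string, Activity>,
    private readonly crashAfter: number = Infinity,       // simulate a host crash
  ) {}

  callActivity<T>(name: string, input: unknown): T {
    if (this.cursor < this.history.length) {
      // Replay: return the recorded result without running the activity again.
      return this.history[this.cursor++] as T;
    }
    if (this.executed >= this.crashAfter) throw new Error("simulated host crash");
    const result = this.activities[name](input);
    this.history.push(result); // checkpoint the result
    this.cursor++;
    this.executed++;
    return result as T;
  }
}

// Same shape as the orchestrator above, written against the toy context.
function orderWorkflow(ctx: ReplayContext, order: Order): string {
  ctx.callActivity("ValidateOrder", order);
  ctx.callActivity("ChargePayment", order);
  const reserved = ctx.callActivity<boolean>("ReserveInventory", order);
  if (!reserved) {
    ctx.callActivity("CancelOrder", order);
    return "cancelled";
  }
  ctx.callActivity("SendConfirmation", order);
  return "confirmed";
}
```

Running orderWorkflow once with crashAfter set to 2 and then again with a fresh context over the same history array completes the workflow, and ChargePayment runs exactly once. This is also why the orchestrator must take the same path on every replay: the history is matched by position.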

Scenario	Tool
Single sync task (API handler, file processor)	Lambda or Azure Function
Fire-and-forget chaining	SQS, EventBridge, Service Bus
Multi-step workflow with retries, error paths	Step Functions or Durable Functions
Long-running process with pauses (hours/days)	Step Functions wait states or Durable Functions timers
Fan-out to independent consumers	SNS, EventBridge, Service Bus topics
Complex branching logic in orchestration	Durable Functions
Audit trail and visual execution history	Step Functions

The mistake isn’t choosing the wrong service. It’s not realizing you need orchestration until you’re four Lambdas deep hand-rolling retry logic and correlation ID propagation.