Building a Ragnarok Online Server in Elixir - Part 3

Ygor Castor

The Secret Sauce - Distributed Session Management​


If you've been following along, we've built our Account Server and Character Server. They work great independently, but there's a glaring problem we hinted at in Part 2: How do these servers know about each other?

Think about it. When a player logs in:

  1. They connect to the Account Server on port 6900
  2. After authentication, they disconnect and connect to the Character Server on port 6901
  3. After selecting a character, they disconnect again and connect to the Zone Server on port 5121

Each connection is a completely independent TCP socket. There's no shared memory, no common thread pool, nothing. So how does the Character Server know that the player who just connected is the same one who authenticated with the Account Server moments ago?

The Traditional Approaches (And Why They're Not Great)​


Most private server implementations solve this in one of a few ways:

The Database Approach​


Code:
-- Store sessions in a shared database
INSERT INTO login_sessions (account_id, login_id1, login_id2, ip_address, expires_at)
VALUES (2000000, 123456, 789012, '192.168.1.100', NOW() + INTERVAL '5 minutes');

Every server queries this table to validate sessions. Simple? Yes. Scalable? Not really. You're hitting the database for every single authentication check, and in an MMO, that's a LOT of queries.

The Redis/Memcached Approach​


Code:
# Store sessions in Redis
redis.setex(f"session:{account_id}", 300, json.dumps({
    "login_id1": login_id1,
    "login_id2": login_id2,
    "ip": client_ip
}))

Better for performance, but now you've added another infrastructure component to manage, monitor, and scale. Plus, what happens when Redis goes down?

The rAthena Approach: Inter-Server Communication Protocol​


This is what most Ragnarok emulators actually use. Let's look at how rAthena does it:

When a player successfully logs in, the login server creates an auth node and stores it locally:


Code:
// In login server - when authentication succeeds
struct auth_node* login_add_auth_node(struct login_session_data* sd, uint32 ip) {
    struct auth_node* node = &auth_db[sd->account_id];

    node->account_id = sd->account_id;
    node->login_id1 = sd->login_id1;  // Random token
    node->login_id2 = sd->login_id2;  // Another random token
    node->sex = sd->sex;
    node->ip = ip;
    node->clienttype = sd->clienttype;

    return node;
}

Now, when the player connects to the Character Server, the char server needs to verify this is legitimate. It sends packet 0x2712 to the login server:


Code:
// Character server - when player connects
if (session_isValid(login_fd)) {
    WFIFOHEAD(login_fd, 23);
    WFIFOW(login_fd, 0) = 0x2712;  // "Please verify this auth"
    WFIFOL(login_fd, 2) = sd->account_id;
    WFIFOL(login_fd, 6) = sd->login_id1;
    WFIFOL(login_fd, 10) = sd->login_id2;
    WFIFOB(login_fd, 14) = sd->sex;
    WFIFOL(login_fd, 15) = session_id;  // Who's asking
    WFIFOL(login_fd, 19) = sd->ip;
    WFIFOSET(login_fd, 23);
}

The login server receives this, checks its auth_db, and responds with packet 0x2713:


Code:
// Login server - handling verification request
case 0x2712: // Auth request from char-server
    struct auth_node* node = login_get_auth_node(account_id);

    if (node != nullptr &&
        node->account_id == account_id &&
        node->login_id1 == login_id1 &&
        node->login_id2 == login_id2 &&
        node->sex == sex) {
        // Authentication valid!
        WFIFOHEAD(fd, 21);
        WFIFOW(fd, 0) = 0x2713;  // Auth response
        WFIFOL(fd, 2) = account_id;
        WFIFOL(fd, 6) = login_id1;
        WFIFOL(fd, 10) = login_id2;
        WFIFOB(fd, 14) = sex;
        WFIFOB(fd, 15) = 0;  // 0 = success, 1 = failure
        WFIFOL(fd, 16) = request_id;
        WFIFOB(fd, 20) = node->clienttype;
        WFIFOSET(fd, 21);

        // Remove auth node - it can only be used once
        login_remove_auth_node(account_id);
    } else {
        // Send failure response...
    }
    break;

And the servers need to maintain these connections, handling failures:


Code:
// Periodic keepalive between servers
void inter_server_keepalive() {
    WFIFOHEAD(fd, 2);
    WFIFOW(fd, 0) = 0x2719;  // PING packet
    WFIFOSET(fd, 2);

    if (no_response_count > 3) {
        // Server is dead, reconnect
        do_reconnect();
        cleanup_orphaned_sessions();
    }
}

This approach works, but look at the complexity! You're essentially building a distributed system protocol from scratch, handling connection failures, message ordering, state synchronization, and more. The rAthena codebase has thousands of lines dedicated just to this inter-server communication.

Enter the BEAM: A Different Paradigm​


But we're using Elixir, which runs on the BEAM (Erlang's virtual machine), and the BEAM was literally built for this kind of problem. Ericsson created Erlang to run telephone switches, systems that need to maintain millions of concurrent sessions across distributed hardware with 99.9999999% uptime (that's nine nines!).

Instead of treating our servers as isolated processes that need external coordination, what if we treated them as nodes in a distributed system that inherently know about each other?


Code:
# On the Account Server
{:ok, _} = Aesir.SessionManager.create_session(account_id, session_data)

# On the Character Server - it just knows!
case Aesir.SessionManager.get_session(account_id) do
  {:ok, session} -> # Valid session, proceed
  {:error, :not_found} -> # Invalid, reject
end

No manual packet protocols. No database queries. No Redis. No explicit server-to-server communication code. Just distributed Erlang magic.

What We're Building​


In this part, we'll explore:

  1. Distributed Erlang/Elixir - How to make our servers aware of each other
  2. Mnesia - Erlang's built-in distributed database that lives in-memory
  3. Phoenix PubSub - Real-time event broadcasting between servers
  4. Session Management - Keeping track of who's logged in where
  5. Handling Failures - What happens when a node goes down?

By the end, you'll understand why the BEAM is such a powerful platform for building distributed game servers, and why what seems like magic is actually just good engineering.

A Quick Note on Terminology​


Before we dive in, let's clarify some terms:

  • Node: An instance of the BEAM running our code (could be Account Server, Character Server, etc.)
  • Cluster: Multiple nodes connected together
  • Process: A lightweight Erlang process (not an OS process, you can have millions of these)
  • GenServer: A generic server process that maintains state
  • PubSub: Publish/Subscribe messaging pattern
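
To make these terms concrete, here's a tiny IEx session (node names are just examples) showing a named node identifying itself and spawning a lightweight process:


Code:
# Start a named, clusterable node: iex --name demo@127.0.0.1 -S mix
Node.self()   # => :"[email protected]"
Node.list()   # => other connected nodes, e.g. [:"[email protected]"]

# Processes are cheap, BEAM-scheduled units of execution
pid = spawn(fn ->
  receive do
    msg -> IO.inspect(msg, label: "got")
  end
end)

send(pid, :hello)   # the spawned process prints: got: :hello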

Ready? Let's dive into the distributed world of the BEAM!

Mnesia: The Database That Comes With Erlang​


Before we dive into our implementation, let's talk about Mnesia. It's a database that comes built into Erlang/OTP, and it's unlike any database you've probably used before.

What Makes Mnesia Special?​


Most databases are separate services you connect to over a network. Mnesia is different: it runs inside your BEAM application, as part of your code. Think of it this way:


Code:
# Traditional database approach
defmodule Traditional do
  def get_session(account_id) do
    # Network call to external database
    {:ok, conn} = Postgrex.start_link(...)
    Postgrex.query!(conn, "SELECT * FROM sessions WHERE account_id = $1", [account_id])
    # Network latency, serialization overhead, connection pooling...
  end
end

# Mnesia approach  
defmodule WithMnesia do
  def get_session(account_id) do
    # Direct memory access - it's just a function call!
    :mnesia.dirty_read(:sessions, account_id)
    # Microseconds, not milliseconds
  end
end

But here's where it gets really interesting: Mnesia is distributed by default. When you have multiple BEAM nodes connected, Mnesia automatically replicates data between them:


Code:
# On Node A (Account Server)
:mnesia.dirty_write({:sessions, account_id, session_data})

# On Node B (Character Server) - the data is already there!
{:sessions, ^account_id, data} = :mnesia.dirty_read(:sessions, account_id)

No network protocols to write. No separate replication pipeline to operate. No cache invalidation bugs. The BEAM handles it all.

Why Is This Perfect for Game Servers?​


Think about what game servers need:

  1. Ultra-low latency - Players notice even 50ms delays
  2. High read throughput - Checking permissions, stats, inventory thousands of times per second
  3. Distributed state - Players move between servers (login β†’ char β†’ zone)
  4. Fault tolerance - If one server dies, others keep running
  5. Eventual consistency is fine - A few milliseconds of lag between servers is acceptable

Mnesia gives us all of this out of the box. But raw Mnesia has a pretty gnarly API from the Erlang days:


Code:
% Erlang-style Mnesia - not very Elixir-like
mnesia:create_table(sessions, 
    [{attributes, record_info(fields, session)},
     {ram_copies, [node()]},
     {type, set}]).

Enter Memento: Mnesia for Humans​


This is where Memento comes in. It's an Elixir wrapper that makes Mnesia actually pleasant to use:


Code:
# Define a table like a normal Elixir struct
defmodule Aesir.Session do
  use Memento.Table,
    attributes: [:account_id, :login_id1, :login_id2, :char_id, :ip, :connected_at],
    index: [:char_id],
    type: :set

  @type t :: %__MODULE__{
    account_id: integer(),
    login_id1: integer(),
    login_id2: integer(),
    char_id: integer() | nil,
    ip: tuple(),
    connected_at: DateTime.t()
  }
end

That's it! Now we can use it like any other Elixir module:


Code:
# Write a session
Memento.transaction! fn ->
  Memento.Query.write(%Aesir.Session{
    account_id: 2000000,
    login_id1: :rand.uniform(1_000_000),
    login_id2: :rand.uniform(1_000_000),
    ip: {192, 168, 1, 100},
    connected_at: DateTime.utc_now()
  })
end

# Read it back
Memento.transaction! fn ->
  Memento.Query.read(Aesir.Session, 2000000)
end

# Query with pattern matching
Memento.transaction! fn ->
  Memento.Query.select(Aesir.Session, {:==, :ip, {192, 168, 1, 100}})
end

But here's the killer feature. This automatically works across all connected nodes:


Code:
# Start three nodes
iex --name account@localhost -S mix
iex --name char@localhost -S mix  
iex --name zone@localhost -S mix

# Connect them
Node.connect(:"char@localhost")  # From account node
Node.connect(:"zone@localhost")  # From account node

# Tell Mnesia about the cluster
Memento.add_node(:"char@localhost")
Memento.add_node(:"zone@localhost")

# Create table replicated across all nodes
Memento.Table.create!(Aesir.Session, disc_copies: Node.list())

Now any node can read and write sessions, and they all stay in sync automatically!

The Node Discovery Problem​


But wait, there's a catch. While Mnesia automatically replicates data once nodes are connected, it doesn't magically discover other nodes on its own. In our examples above, we manually connected nodes with Node.connect/1. That's fine for development, but what about production?

Imagine you're running your game servers in a Kubernetes cluster or across multiple Docker containers. Node names change, IP addresses are dynamic, and servers can restart at any time. How do they find each other?

This is where most Mnesia tutorials stop: they assume you'll manually configure everything. But we're building a real game server, so let's solve this properly.

Enter libcluster: Automatic Node Discovery​


libcluster is a brilliant Elixir library that handles automatic node discovery and clustering. It supports multiple strategies: Kubernetes DNS, AWS EC2 tags, Docker Swarm, even simple UDP gossip protocols.

Let's see how we configured it in our config/config.exs:


Code:
config :libcluster,
  topologies: [
    aesir: [
      strategy: Cluster.Strategy.Epmd,
      config: [
        hosts: [:"[email protected]", :"[email protected]", :"[email protected]"]
      ]
    ]
  ]

This tells libcluster to use the EPMD (Erlang Port Mapper Daemon) strategy to discover nodes. In development, we hardcode the node names, but in production, you'd use something like:


Code:
# For Kubernetes
config :libcluster,
  topologies: [
    aesir: [
      strategy: Cluster.Strategy.Kubernetes.DNS,
      config: [
        service: "aesir-nodes",
        application_name: "aesir"
      ]
    ]
  ]

# Or generic DNS polling (e.g. ECS or any environment with a shared DNS name)
config :libcluster,
  topologies: [
    aesir: [
      strategy: Cluster.Strategy.DNSPoll,
      config: [
        query: "aesir.default.svc.cluster.local",
        node_basename: "aesir"
      ]
    ]
  ]

The MementoCluster Manager: Bringing It All Together​


Now here's where our architecture gets interesting. libcluster handles node discovery, but we still need to coordinate Mnesia initialization across the cluster. Enter our MementoCluster.Manager, a GenServer that handles the complex orchestration.

Here's how it works in apps/commons/lib/aesir/commons/memento_cluster/manager.ex:


Code:
def init(_opts) do
  Logger.info("[MementoCluster:#{node()}] Starting Memento cluster manager...")

  state = %__MODULE__{
    status: :initializing,
    cluster_nodes: [],
    retry_count: 0,
    max_retries: 5
  }

  # Initialize synchronously to ensure tables are ready before SessionManager starts
  case initialize_cluster() do
    :ok ->
      Logger.info("[MementoCluster:#{node()}] Cluster initialized successfully during init")
      Phoenix.PubSub.broadcast(Aesir.PubSub, "cluster:ready", %{node: node(), status: :ready})
      {:ok, %{state | status: :ready}}

    {:error, reason} ->
      Logger.error(
        "[MementoCluster:#{node()}] Failed to initialize cluster during init: #{inspect(reason)}"
      )

      # Still start but schedule retry
      send(self(), :initialize)
      {:ok, state}
  end
end

The manager tries to initialize the cluster synchronously during startup. If it fails (maybe other nodes aren't ready yet), it schedules a retry.
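
The retry itself is just a handle_info clause for the :initialize message we sent above. Here's a minimal sketch of how it could look (the actual module may back off differently):


Code:
def handle_info(:initialize, %__MODULE__{retry_count: retries, max_retries: max} = state) do
  case initialize_cluster() do
    :ok ->
      Phoenix.PubSub.broadcast(Aesir.PubSub, "cluster:ready", %{node: node(), status: :ready})
      {:noreply, %{state | status: :ready}}

    {:error, _reason} when retries < max ->
      # Other nodes may still be booting - wait a bit and try again
      Process.send_after(self(), :initialize, 2_000)
      {:noreply, %{state | retry_count: retries + 1}}

    {:error, reason} ->
      {:stop, {:cluster_init_failed, reason}, state}
  end
end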

Here's where the magic happens, in the cluster setup logic:


Code:
defp setup_cluster do
  if Config.auto_cluster?() do
    # First check if we're already connected to nodes (like when using libcluster)
    connected = Node.list()
    configured = Config.cluster_nodes()

    # Find configured nodes we're already connected to
    cluster_candidates =
      Enum.filter(configured, fn node ->
        node != node() and node in connected
      end)

    case cluster_candidates do
      [] ->
        # No connected nodes, try discovery
        cluster_nodes = discover_cluster_nodes()

        case cluster_nodes do
          [] ->
            Logger.info(
              "[MementoCluster:#{node()}] No cluster nodes found, initializing as primary"
            )
            create_schema()

          nodes ->
            Logger.info("[MementoCluster:#{node()}] Found cluster nodes: #{inspect(nodes)}")
            join_cluster(nodes)
        end

      candidates ->
        # We have connected nodes - check if any have Mnesia running
        running =
          Enum.filter(candidates, fn candidate ->
            case :rpc.call(candidate, :mnesia, :system_info, [:is_running], 5_000) do
              :yes -> true
              _ -> false
            end
          end)

        case running do
          [] ->
            Logger.info(
              "[MementoCluster:#{node()}] Connected nodes don't have Mnesia, initializing as primary"
            )
            create_schema()

          [head | _] ->
            Logger.info(
              "[MementoCluster:#{node()}] Found running Mnesia on #{head}, attempting to join"
            )
            join_cluster([head])
        end
    end
  else
    Logger.info("[MementoCluster:#{node()}] Auto-clustering disabled, initializing standalone")
    create_schema()
  end
end

This is the clever part: the manager first checks if libcluster has already connected us to other nodes. If so, it uses RPC calls to see if any of them already have Mnesia running. If yes, we join their cluster. If no, we become the primary node and create a new schema.
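
For completeness, joining an existing cluster mostly means telling Mnesia about the extra nodes and pulling table copies over. A minimal sketch of what join_cluster/1 can look like (not the project's exact code):


Code:
defp join_cluster(nodes) do
  # Point our local Mnesia at the nodes that already have a schema
  {:ok, _} = :mnesia.change_config(:extra_db_nodes, nodes)

  # Persist the schema locally so :disc_copies tables can live on this node
  :mnesia.change_table_copy_type(:schema, node(), :disc_copies)

  # Pull a replica of every configured table onto this node
  Enum.each(Config.tables(), fn {table, storage_type} ->
    :mnesia.add_table_copy(table, node(), storage_type)
  end)

  :ok
end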

Table Configuration: What Gets Replicated Where​


Our MementoCluster.Config module defines exactly which tables get created and where they're stored:


Code:
def tables do
  [
    {ServerStatus, :disc_copies},      # Persistent - server registry
    {Session, :disc_copies},           # Persistent - player sessions  
    {OnlineUser, :ram_copies},         # In-memory - real-time player list
    {CharacterLocation, :disc_copies}  # Persistent - where characters are
  ]
end

Notice the different storage types:

  • :disc_copies - Data lives in RAM but is also written to disk
  • :ram_copies - Data only in RAM, faster but lost on restart
  • :disc_only_copies - Data only on disk, slower but takes less RAM

This gives us fine-grained control over performance vs persistence trade-offs.
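
Turning that configuration into actual tables is a small loop. A sketch (the create_tables/0 name is hypothetical):


Code:
defp create_tables do
  Enum.each(Config.tables(), fn {table, storage_type} ->
    # Returns :ok, or an error tuple if another node created the table first
    Memento.Table.create(table, [{storage_type, [node()]}])
  end)
end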

The Session Schema: Distributed Session State​


Let's look at our actual session schema in apps/commons/lib/aesir/commons/inter_server/schemas/session.ex:


Code:
defmodule Aesir.Commons.InterServer.Schemas.Session do
  use Memento.Table,
    attributes: [
      :account_id,
      :login_id1,
      :login_id2, 
      :auth_code,
      :username,
      :state,
      :current_server,
      :current_char_id,
      :created_at,
      :last_activity
    ],
    index: [:username, :current_server],
    type: :set

  def new(account_id, login_id1, login_id2, auth_code, username) do
    now = DateTime.utc_now()

    %__MODULE__{
      account_id: account_id,
      login_id1: login_id1,
      login_id2: login_id2,
      auth_code: auth_code,
      username: username,
      state: :authenticating,
      current_server: :account_server,
      current_char_id: nil,
      created_at: now,
      last_activity: now
    }
  end

  def transition_to_char_server(session) do
    update_activity(%{session | state: :char_select, current_server: :char_server})
  end

  def transition_to_game(session, char_id) do
    update_activity(%{
      session
      | state: :in_game,
        current_server: :zone_server,
        current_char_id: char_id
    })
  end
end
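
The transitions above lean on an update_activity/1 helper that isn't shown; a minimal version is just a timestamp bump:


Code:
def update_activity(session) do
  %{session | last_activity: DateTime.utc_now()}
end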

This schema tracks the complete player journey:

  1. :authenticating on :account_server - just logged in
  2. :char_select on :char_server - picking a character
  3. :in_game on :zone_server - actually playing

And because it's a Memento table, this state is automatically replicated across all nodes in real-time!

The SessionManager: Your Distributed Session API​


Now that we have our clustering infrastructure and session schemas set up, we need a clean way to interact with them. Enter the SessionManager, a GenServer that provides a simple API for distributed session management.

Here's how we define it in apps/commons/lib/aesir/commons/session_manager.ex:


Code:
defmodule Aesir.Commons.SessionManager do
  @moduledoc """
  GenServer-based session manager for distributed player sessions.
  Uses Memento for distributed state management across the cluster.
  """
  use GenServer

  # The GenServer is registered under the module name
  @server_name __MODULE__

  def start_link(opts \\ []) do
    GenServer.start_link(__MODULE__, opts, name: @server_name)
  end

  # Client API

  def create_session(account_id, session_data) do
    GenServer.call(@server_name, {:create_session, account_id, session_data})
  end

  def validate_session(account_id, login_id1, login_id2) do
    GenServer.call(@server_name, {:validate_session, account_id, login_id1, login_id2})
  end

  def set_user_online(account_id, server_type, character_id \\ nil, map_name \\ nil) do
    GenServer.call(@server_name, {:set_user_online, account_id, server_type, character_id, map_name})
  end

The SessionManager wraps all our Memento operations in a clean, typed API. But more importantly, it handles the complex distributed transactions safely:


Code:
def handle_call({:validate_session, account_id, login_id1, login_id2}, _from, state) do
  result =
    Memento.transaction(fn ->
      case Memento.Query.read(Session, account_id) do
        nil ->
          {:error, :session_not_found}

        session ->
          if session.login_id1 == login_id1 and session.login_id2 == login_id2 do
            updated_session = Session.update_activity(session)
            Memento.Query.write(updated_session)
            {:ok, updated_session}
          else
            {:error, :invalid_credentials}
          end
      end
    end)

  # Handle transaction result...
end

This is critical: all our session operations are wrapped in Mnesia transactions, which means they're atomic across the entire cluster. Either the operation succeeds on all nodes, or it fails on all nodes. No partial states, no race conditions.
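
The elided "Handle transaction result" step is then just unwrapping the nested tuples that Memento.transaction/1 returns. A sketch (the real code may also log or emit telemetry):


Code:
case result do
  {:ok, {:ok, session}} -> {:reply, {:ok, session}, state}
  {:ok, {:error, reason}} -> {:reply, {:error, reason}, state}
  {:error, reason} -> {:reply, {:error, {:transaction_failed, reason}}, state}
end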

The Account Server: Creating Sessions​


Let's see how the Account Server uses the SessionManager. When a player successfully logs in, here's what happens in apps/account_server/lib/aesir/account_server.ex:


Code:
defp handle_successful_login(account, session_data) do
  auth_code = :rand.uniform(999_999_999)
  login_id1 = :rand.uniform(999_999_999)
  login_id2 = :rand.uniform(999_999_999)

  updated_session =
    Map.merge(session_data, %{
      account_id: account.id,
      username: account.userid,
      auth_code: auth_code,
      login_id1: login_id1,
      login_id2: login_id2,
      authenticated: true
    })

  session_data_for_cluster = %{
    login_id1: login_id1,
    login_id2: login_id2,
    auth_code: auth_code,
    username: account.userid
  }

  case SessionManager.create_session(account.id, session_data_for_cluster) do
    :ok ->
      Logger.info("Session created in cluster for account #{account.id}")
      SessionManager.set_user_online(account.id, :account_server)
      PubSub.broadcast_player_login(account.id, account.userid)

      sex_atom =
        case account.sex do
          "M" -> :male
          "F" -> :female
        end

      # Generate response with available character servers...
      case get_available_char_servers() do
        {:ok, char_servers} ->
          response = %AcAcceptLogin{
            login_id1: login_id1,
            aid: account.id,
            login_id2: login_id2,
            last_ip: {127, 0, 0, 1},
            sex: sex_atom,
            char_servers: char_servers
          }

          {:ok, updated_session, [response]}

        {:error, _reason} ->
          handle_failed_login(:no_char_servers, session_data)
      end

    {:error, reason} ->
      handle_failed_login(reason, session_data)
  end
end

Notice what's happening here:

  1. We generate random tokens (login_id1, login_id2, auth_code)
  2. We store the session in the distributed cluster via SessionManager.create_session/2
  3. We mark the user as online on the account server
  4. We broadcast a player login event via PubSub
  5. We query for available character servers and send them to the client

What makes this powerful is that every other node in the cluster immediately knows about this new session. No message passing, no manual synchronization; it's just there.
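
On the SessionManager side, create_session/2 boils down to writing a new Session record inside a transaction. A simplified sketch of the handler (the real implementation may track more state):


Code:
def handle_call({:create_session, account_id, data}, _from, state) do
  result =
    Memento.transaction(fn ->
      account_id
      |> Session.new(data.login_id1, data.login_id2, data.auth_code, data.username)
      |> Memento.Query.write()
    end)

  case result do
    {:ok, %Session{}} -> {:reply, :ok, state}
    {:error, reason} -> {:reply, {:error, reason}, state}
  end
end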

The Character Server: Validating Sessions​


Now when the player connects to the Character Server, they send the same tokens. Here's how we validate them in apps/char_server/lib/aesir/char_server/character_session.ex:


Code:
def validate_character_session(aid, login_id1, login_id2, sex) do
  case SessionManager.validate_session(aid, login_id1, login_id2) do
    {:ok, session} ->
      updated_session_data = %{
        account_id: aid,
        login_id1: login_id1,
        login_id2: login_id2,
        sex: sex,
        authenticated: true,
        username: session.username
      }

      SessionManager.set_user_online(aid, :char_server)

      Logger.info("Character session validated for account: #{aid}")
      {:ok, updated_session_data}

    {:error, reason} ->
      Logger.warning("Character session validation failed for account #{aid}: #{reason}")
      {:error, reason}
  end
end

This is the magic moment! The Character Server doesn't need to ask the Account Server "hey, is this user legit?" It already knows because the session data is right there in its local Mnesia tables.

The validate_character_session/4 flow:

  1. Reads the session from the local Memento table (via validate_session/3)
  2. Compares the provided tokens with the stored ones
  3. Updates the session's last_activity timestamp
  4. Marks the user as online on the character server (via set_user_online/2)
  5. Returns the validated session data

All of this happens in microseconds because it's just memory access, not network calls.

Session State Transitions​


Our system tracks the complete player journey through different server states. Here's how we handle state transitions:


Code:
# Session starts as :authenticating on :account_server
def transition_to_char_server(session) do
  update_activity(%{session | state: :char_select, current_server: :char_server})
end

def transition_to_game(session, char_id) do
  update_activity(%{
    session
    | state: :in_game,
      current_server: :zone_server,
      current_char_id: char_id
  })
end

This gives us incredible visibility into player state across the entire cluster. Any node can query:

  • "How many players are currently in character selection?"
  • "Which character is account 2000000 playing right now?"
  • "What server is this player currently on?"

Server Registry and Load Balancing​


The SessionManager also handles server registration and discovery. Here's how the Account Server picks which character servers to advertise to the client:


Code:
defp get_available_char_servers do
  case SessionManager.get_servers(:char_server) do
    [] ->
      {:error, :no_char_servers}

    servers ->
      online_servers =
        servers
        |> Enum.filter(fn server -> server.status == :online end)
        |> Enum.group_by(fn server -> server.metadata[:cluster_id] || "default" end)
        |> Enum.map(fn {_cluster_id, cluster_servers} ->
          # Pick the server with lowest player count
          best_server = Enum.min_by(cluster_servers, & &1.player_count)

          %AcAcceptLogin.ServerInfo{
            ip: best_server.ip,
            port: best_server.port,
            name: best_server.metadata[:name],
            users: best_server.player_count,
            # ...
          }
        end)

      {:ok, online_servers}
  end
end

This automatically load-balances players across available character servers, grouping them by cluster_id for multi-world support.
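
On the char server side, registering is, at its core, just another replicated write of a ServerStatus record. A sketch using the fields visible in the iex output further down (the real project wraps this in a SessionManager call):


Code:
alias Aesir.Commons.InterServer.Schemas.ServerStatus

Memento.transaction! fn ->
  Memento.Query.write(%ServerStatus{
    server_id: to_string(node()),
    server_type: :char_server,
    server_node: node(),
    status: :online,
    player_count: 0,
    max_players: 1000,
    ip: {127, 0, 0, 1},
    port: 6121,
    last_heartbeat: DateTime.utc_now(),
    metadata: %{name: "Aesir", cluster_id: "default"}
  })
end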


The Result: Seamless Multi-Server Authentication​


With this architecture, we've achieved something remarkable:

  • No external dependencies - No Redis, no separate database for sessions
  • Ultra-low latency - Session validation is just memory access
  • Automatic replication - All nodes stay in sync without explicit synchronization
  • Fault tolerance - If a node crashes, sessions persist on other nodes
  • Load balancing - Players are automatically distributed across available servers
  • Real-time visibility - Any node can query the complete player state

Compare this to the traditional approaches we discussed earlier. Instead of thousands of lines of inter-server protocol code, we have a simple, clean API that "just works" across the cluster.

Handling Failures: Automatic Cleanup and Recovery​


But what happens when things go wrong? One of the most impressive aspects of our system is how it handles failures automatically. Let's explore the cleanup mechanisms that keep our cluster healthy.

Session Expiration: Cleaning Up Stale Data​


The SessionManager runs periodic cleanup to remove expired sessions:


Code:
defp cleanup_expired_sessions do
  one_hour_ago = DateTime.add(DateTime.utc_now(), -3600, :second)

  Memento.transaction(fn ->
    expired_sessions =
      Memento.Query.select(Session, [
        {:<, :last_activity, one_hour_ago}
      ])

    Enum.each(expired_sessions, fn session ->
      Memento.Query.delete(Session, session.account_id)
      Memento.Query.delete(OnlineUser, session.account_id)
      Logger.info("Cleaned up expired session for account #{session.account_id}")
    end)
  end)
end

This runs every 5 minutes, automatically removing sessions that haven't been active for over an hour. Since it's a distributed transaction, any node can perform this cleanup and it affects the entire cluster.
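
The scheduling itself is the usual GenServer timer pattern; a sketch (interval value taken from the prose above):


Code:
@cleanup_interval :timer.minutes(5)

# The first :cleanup_expired_sessions message is scheduled the same way from init/1
def handle_info(:cleanup_expired_sessions, state) do
  cleanup_expired_sessions()
  Process.send_after(self(), :cleanup_expired_sessions, @cleanup_interval)
  {:noreply, state}
end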

Node Failure Detection: Heartbeat System​


Each server sends heartbeats every 10 seconds to update its last_heartbeat timestamp:


Code:
defp send_heartbeat do
  current_node = Node.self()

  Memento.transaction(fn ->
    servers = Memento.Query.select(ServerStatus, [{:==, :server_node, current_node}])

    Enum.each(servers, fn server ->
      updated_server = ServerStatus.update_heartbeat(server)
      Memento.Query.write(updated_server)
    end)
  end)
end
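
ServerStatus.update_heartbeat/1 mirrors Session.update_activity/1; it just bumps a timestamp (a minimal sketch):


Code:
def update_heartbeat(server_status) do
  %{server_status | last_heartbeat: DateTime.utc_now()}
end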

Dead Node Cleanup: Automatic Recovery​


Every 15 seconds, the system checks for dead nodes and cleans up their data:


Code:
defp cleanup_dead_nodes do
  timeout_seconds = 30
  now = DateTime.utc_now()

  Memento.transaction(fn ->
    all_servers = Memento.Query.select(ServerStatus, [])

    dead_servers =
      Enum.filter(all_servers, fn server ->
        DateTime.diff(now, server.last_heartbeat) > timeout_seconds &&
          server.status == :online
      end)

    Enum.each(dead_servers, fn server ->
      Logger.warning("Deleting dead server #{server.server_id} on node #{server.server_node}")
      Memento.Query.delete(ServerStatus, server.server_id)
      cleanup_node_data(server.server_node)
    end)
  end)
end

When a node is detected as dead (no heartbeat for 30+ seconds), the system automatically:

  1. Removes the server from the registry
  2. Cleans up orphaned data - sessions, online users, character locations
  3. Logs the cleanup for monitoring

Code:
defp cleanup_node_data(dead_node) do
  # Remove online users from dead node
  online_users = Memento.Query.select(OnlineUser, [{:==, :server_node, dead_node}])
  Enum.each(online_users, &Memento.Query.delete(OnlineUser, &1.account_id))

  # Remove sessions from dead node  
  sessions = Memento.Query.select(Session, [{:==, :node, dead_node}])
  Enum.each(sessions, &Memento.Query.delete(Session, &1.account_id))

  # Remove character locations from dead node
  char_locations = Memento.Query.select(CharacterLocation, [{:==, :node, dead_node}])
  Enum.each(char_locations, &Memento.Query.delete(CharacterLocation, &1.char_id))

  Logger.info("Completed cleanup for dead node: #{dead_node}")
end

What This Means in Practice​


Imagine you're running a production cluster:

  1. Character Server A crashes - Maybe out of memory, network partition, whatever
  2. Within 30 seconds, other nodes detect it's not sending heartbeats
  3. Automatically, the cluster removes its server registration and cleans up its data
  4. Players connected to other nodes continue playing normally
  5. Load balancer stops routing new players to the dead server
  6. When Character Server A restarts, it automatically rejoins the cluster and starts accepting new players

No manual intervention. No stale data. No cascading failures. The cluster self-heals.

The BEAM Advantage​


This automatic failure handling is what makes the BEAM special. In traditional architectures, you'd need:

  • External health checkers (like Kubernetes liveness probes)
  • Manual cleanup scripts
  • Complex orchestration for node replacement
  • Careful coordination to avoid split-brain scenarios

With Erlang/Elixir, this is all built-in. The same mechanisms that let Ericsson build telephone switches with 99.9999999% uptime (nine nines!) now protect your game server from failures.

You must be itching to see it working, right? Let's run it!

First, we need to run the account server as a named node so libcluster can find it.


Code:
RELEASE_COOKIE=imthecookie iex --name [email protected] -S mix run

You will see this:

[Screenshot: Account Server startup]

Since it's a new cluster, it bootstraps the cluster and creates the Mnesia tables. Now let's start the char-server:


Code:
RELEASE_COOKIE=imthecookie iex --name [email protected] -S mix run

[Screenshot: Char Server startup]

Aha, see how it goes? It detects the account server, and instead of bootstrapping a new cluster, it simply copies the schemas that were already generated, syncs the data, and it's done!

Let's see what happens when we request the server data:


Code:
iex([email protected])1> Aesir.Commons.SessionManager.get_servers
[
  %Aesir.Commons.InterServer.Schemas.ServerStatus{
    __meta__: Memento.Table,
    server_id: "[email protected]",
    server_type: :char_server,
    server_node: :"[email protected]",
    status: :online,
    player_count: 0,
    max_players: 1000,
    ip: {192, 168, 178, 101},
    port: 6121,
    last_heartbeat: ~U[2025-08-13 07:55:49.376615Z],
    metadata: %{name: "Aesir", new: false, type: 0, cluster_id: "default"}
  },
  %Aesir.Commons.InterServer.Schemas.ServerStatus{
    __meta__: Memento.Table,
    server_id: "[email protected]",
    server_type: :account_server,
    server_node: :"[email protected]",
    status: :online,
    player_count: 0,
    max_players: 1000,
    ip: {0, 0, 0, 0},
    port: 6900,
    last_heartbeat: ~U[2025-08-13 07:50:54.864225Z],
    metadata: %{}
  }
]

See? The data that was registered by the account server is right here!

Now, the moment y'all have been waiting for: let's connect our client to our server.


Ta-da! We have a working integration between Accounts and Char Server!

With this, the easy part is done and the real challenge begins: the Zone Server, where everything related to the game mechanics will be built. In Part 4 we will see:

  • Char to Zone Transition
  • Map Loading and the GAT format
  • Fast as Fuck Map representation with ETS

Previous Parts

Part 2 - Persistence, Char Server and Inter Server Communication

Part 1 - Account Server
