TL;DR
Erlang nodes find each other through EPMD on port 4369. Fusion creates three SSH tunnels so that each machine can reach the other's ports as if they were local:
- Reverse tunnel for the local node’s distribution port
- Forward tunnel for the remote node’s distribution port
- Reverse tunnel for EPMD itself
The remote BEAM joins your cluster without knowing it’s on a different machine. Part 2 of a 3-part series.
How Erlang Nodes Connect
Before understanding Fusion, you need to understand Erlang distribution. It’s simpler than it sounds.
Every machine running Erlang nodes has its own EPMD (Erlang Port Mapper Daemon) on port 4369. EPMD is a local registry - it only knows about nodes on its own machine. EPMDs never talk to each other.
When an Erlang/Elixir node starts with --sname or --name, it:
- Picks a random port for distribution traffic
- Registers that port with the local EPMD
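On the wire, that registration is a small EPMD ALIVE2_REQ packet. As a rough illustration of the format (not Fusion code; the tags and fields follow the Erlang distribution protocol, sketched here in Python):

```python
import struct

def alive2_req(name: str, port: int) -> bytes:
    """Sketch of EPMD's ALIVE2_REQ: how a node registers its
    distribution port with the local EPMD."""
    name_b = name.encode()
    body = struct.pack(
        ">BHBBHHH",
        120,          # ALIVE2_REQ tag ('x')
        port,         # distribution port this node listens on
        77,           # node type: 77 = normal (visible) node
        0,            # protocol: 0 = TCP/IPv4
        6, 5,         # highest/lowest supported dist protocol version
        len(name_b),  # node name length
    ) + name_b + struct.pack(">H", 0)  # no "extra" data
    return struct.pack(">H", len(body)) + body  # 2-byte length prefix

pkt = alive2_req("a", 38210)
```

The 2-byte length prefix frames every EPMD request; everything after it is the message itself.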
When node A on machine 1 wants to connect to node B on machine 2, node A’s runtime contacts machine 2’s EPMD directly:
Machine 1                             Machine 2
=========                             =========
EPMD (4369)                           EPMD (4369)
knows: A → port 38210                 knows: B → port 45892

Node A                                Node B (port 45892)
  |                                     |
  |--- "Where is B?" ---------------> EPMD on machine 2
  |<-- "Port 45892" -----------------   |
  |                                     |
  |--- connect ---------------------> Node B (port 45892)
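The "Where is B?" step is EPMD's PORT_PLEASE2_REQ. A sketch of that exchange's wire format in Python (illustration only, not part of Fusion):

```python
import struct

def port_please2_req(name: str) -> bytes:
    """Ask an EPMD for a node's distribution port (PORT_PLEASE2_REQ)."""
    body = bytes([122]) + name.encode()  # tag 122 ('z') + alive name
    return struct.pack(">H", len(body)) + body

def port_from_resp(resp: bytes) -> int:
    """Pull the port out of a PORT2_RESP (tag 119, result byte 0 = found)."""
    if resp[0] != 119 or resp[1] != 0:
        raise ValueError("node not registered")
    return struct.unpack(">H", resp[2:4])[0]
```

Node A's runtime does exactly this against machine 2's EPMD, then opens a TCP connection to the returned port.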
Three requirements for this to work:
- EPMD must be reachable on the remote machine (port 4369)
- The distribution port must be reachable on both machines
- The node names must resolve to the correct hosts
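The first two requirements can be sanity-checked with a plain TCP probe. A small helper for that (hypothetical, not part of Fusion):

```python
import socket

def reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Quick TCP reachability probe, e.g. for EPMD on 4369
    or a node's distribution port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

If `reachable(server, 4369)` is false from your laptop, ordinary Erlang distribution to that server cannot work — which is exactly the situation the next section describes.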
The Problem Across Networks
On a local network, this works out of the box. Across the internet or behind firewalls, it breaks.
Distribution ports are random. Firewalls block them. EPMD isn’t exposed publicly. You can’t just Node.connect(:"remote@server.com") from your laptop to a server in a data center.
Solutions like VPNs or exposed ports work, but they require infrastructure changes. Fusion takes a different approach: it tunnels everything over an SSH connection that already has access through the firewall.
Three Tunnels
Fusion creates exactly three SSH tunnels. Each one solves a specific connectivity gap.
LOCAL MACHINE                             REMOTE MACHINE (via SSH)
=============                             =======================

EPMD (4369)        <---reverse tunnel---- EPMD tunnel (random port)
(remote can query                             ^
 local EPMD)                                  |
                                    Remote BEAM reads this
                                    as its EPMD via ERL_EPMD_PORT

Local Node         <---reverse tunnel---- Remote-accessible port
(port X)                                      ^
(remote node can                              |
 reach local node)                  Remote node connects
                                    here to join cluster

Local BEAM         ----forward tunnel---> Remote Node (port Y)
can reach                                     ^
remote node                                   |
                                    Remote BEAM listens
                                    on this pinned port
Tunnel 1: Reverse - Local Node Distribution Port
The remote node needs to reach the local node’s distribution port. A reverse tunnel (ssh -R) binds the local port on the remote machine:
# From NodeManager.setup_tunnels/6
Ssh.cmd_port_tunnel(
  auth, remote,
  local_node.port,                                  # bind this port on remote
  %Spot{host: "localhost", port: local_node.port},  # forward to local
  :reverse
)
This generates: ssh -nNT -R 45892:localhost:45892 deploy@10.0.1.5
The remote machine can now reach the local node’s distribution port at localhost:45892.
Tunnel 2: Forward - Remote Node Distribution Port
The local node needs to reach the remote node’s distribution port. A forward tunnel (ssh -L) binds the remote port locally:
Ssh.cmd_port_tunnel(
  auth, remote,
  remote_node_port,                                 # bind locally
  %Spot{host: "localhost", port: remote_node_port}, # from remote
  :forward
)
This generates: ssh -nNT -4 -L 51234:localhost:51234 deploy@10.0.1.5
The local machine can now reach the remote node at localhost:51234.
Tunnel 3: Reverse - Local EPMD
The remote node needs to find the local EPMD to discover cluster members. Another reverse tunnel:
Ssh.cmd_port_tunnel(
  auth, remote,
  epmd_tunnel_port,                          # bind on remote
  %Spot{host: "localhost", port: epmd_port}, # forward to local EPMD
  :reverse
)
This generates: ssh -nNT -R 52000:localhost:4369 deploy@10.0.1.5
The remote machine can now reach the local EPMD at localhost:52000. Fusion tells the remote node to use this port instead of 4369.
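To see what the remote node learns from the tunneled EPMD, you can speak EPMD's NAMES_REQ to that port. A sketch of the request and the text response it returns (illustrative only; node name and ports are examples):

```python
import re
import struct

NAMES_REQ = struct.pack(">HB", 1, 110)  # length 1, tag 110 ('n')

def parse_names_resp(resp: bytes) -> dict:
    """NAMES_RESP is 4 bytes of EPMD port number, then one
    'name <node> at port <port>' text line per registered node."""
    nodes = {}
    for line in resp[4:].decode().splitlines():
        m = re.match(r"name (\S+) at port (\d+)", line)
        if m:
            nodes[m.group(1)] = int(m.group(2))
    return nodes
```

Querying port 52000 on the remote machine this way would list the nodes registered with the *local* EPMD — which is how the remote node discovers the rest of the cluster.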
Bootstrapping the Remote Node
With tunnels in place, Fusion starts a BEAM node on the remote machine via SSH. The command is carefully constructed:
defp build_remote_node_cmd(node_name, epmd_port, node_port) do
  cookie = Node.get_cookie()

  [
    "ERL_EPMD_PORT=#{epmd_port}",  # Use tunneled EPMD, not local
    @default_elixir_path,
    "--sname #{node_name}",        # Short name for distribution
    "--cookie #{cookie}",          # Must match local cluster
    "--erl \"-kernel inet_dist_listen_min #{node_port} inet_dist_listen_max #{node_port}\"",  # Pin distribution port
    "-e \"Process.sleep(:infinity)\""  # Keep the node alive
  ]
  |> Enum.join(" ")
end
Four critical details:
ERL_EPMD_PORT - Overrides the default EPMD port (4369) with the tunneled port. The remote node registers with and queries the local EPMD through the SSH tunnel.
--cookie - Must match the local node’s cookie. Erlang distribution requires matching cookies for authentication.
Pinned distribution port - inet_dist_listen_min and inet_dist_listen_max set to the same value. Normally Erlang picks a random port. Fusion pins it to the port the forward tunnel expects.
@localhost node name - All distribution traffic goes through SSH tunnels that bind on localhost. The remote node is named fusion_worker_123456@localhost, not fusion_worker@actual-hostname:
defp gen_remote_node_name(_host) do
  id = :rand.uniform(999_999) |> Integer.to_string() |> String.pad_leading(6, "0")
  :"fusion_worker_#{id}@localhost"
end
Using the actual hostname would bypass the tunnels entirely.
Waiting for Connection
The remote node takes a moment to start. Fusion retries Node.connect/1 in a loop:
defp do_wait_for_connection(node_name, deadline) do
  if System.monotonic_time(:millisecond) > deadline do
    {:error, :connect_timeout}
  else
    case Node.connect(node_name) do
      true ->
        :ok

      false ->
        Process.sleep(@connect_retry_interval)
        do_wait_for_connection(node_name, deadline)

      :ignored ->
        {:error, :local_node_not_alive}
    end
  end
end
Once connected, Fusion monitors the remote node for unexpected disconnections:
Node.monitor(remote_node_name, true)
If the remote goes down, handle_info({:nodedown, node}, state) updates the status.
The Full Picture
From Fusion.NodeManager.connect/2 to a working remote node:
- Query local EPMD for our distribution port
- Generate random ports for the remote node and EPMD tunnel
- Open three SSH tunnels (reverse, forward, reverse)
- Start a remote BEAM via SSH with pinned ports and shared cookie
- Retry Node.connect/1 until the remote node joins the cluster
- Monitor for disconnections
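Step 2's "generate random ports" has a classic pitfall: a port picked from a hard-coded range can collide with one already in use. One defensive approach (a sketch, not necessarily what Fusion does) is to let the OS hand out a free port:

```python
import socket

def pick_free_port() -> int:
    """Bind to port 0 and let the OS assign an unused TCP port."""
    with socket.socket() as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]

port = pick_free_port()
```

Note the inherent race: the port is free when picked but could be taken before the tunnel binds it, so retrying setup on a bind failure is still needed.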
All of this happens in do_connect/1 - about 30 lines of orchestration code.
What’s Next
The remote node is running and connected. But it only has Erlang/Elixir stdlib. When you call Fusion.run(remote_node, MyApp.Worker, :process, [data]), how does MyApp.Worker get there?
Part 3 covers bytecode pushing - how Fusion reads BEAM files to discover dependencies and ships compiled code to the remote node.