Originally posted as an answer to What is DAG in Spark, and how does it work?
As Rajagopal ParthaSarathi pointed out, a DAG is a directed acyclic graph. They are commonly used in computer systems for task execution.
In this context, a graph is a collection of nodes connected by edges. In the case of Hadoop and Spark, the nodes represent executable tasks, and the edges represent dependencies between tasks. Think of a DAG as a flow chart that tells the system which tasks to execute and in what order. The following is a simple example of an undirected graph of tasks.
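To make this concrete, here is a minimal sketch of how such a task graph could be represented in code. The task names (`map_1`, `map_2`, `reduce_1`) are hypothetical, chosen only to mirror the figure; in an undirected graph, every edge is stored symmetrically, so neither endpoint is "first".

```python
# Hypothetical undirected task graph: each edge appears in both
# directions, so the representation implies no execution order.
undirected_edges = {
    "map_1": {"reduce_1"},
    "map_2": {"reduce_1"},
    "reduce_1": {"map_1", "map_2"},
}

# Sanity check: symmetry means every edge is listed from both ends.
for node, neighbors in undirected_edges.items():
    for neighbor in neighbors:
        assert node in undirected_edges[neighbor]
```

Because the adjacency information is symmetric, nothing in this structure says whether the map tasks feed the reduce task or the reverse, which is exactly the limitation discussed next.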
This graph is undirected because its edges do not capture which node is the source and which is the destination. In other words, this graph does not tell us whether the reduce task should be feeding the map tasks or vice versa. The next graph shows a directed graph of tasks.
A directed graph gives an unambiguous direction for each edge. This means that we know that the map tasks feed into the reduce task, rather than the other way around. This property is essential for executing complex workflows since we need to know which tasks should be executed in which order.
Lastly, the graph is acyclic: it contains no cycles. A cycle exists when it is possible to follow edges and loop back to a previously visited node. Cycles are useful for expressing iteration and recursion, but they are a poor fit for large-scale distributed task scheduling, since a cycle leaves no well-defined order in which the tasks can finish. The following are two examples of graphs with cycles.
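To see why acyclicity matters, here is a small sketch of cycle detection with a depth-first search; the graphs are hypothetical examples, not anything Spark-specific. Revisiting a node that is still on the current DFS path means we have looped back to an ancestor, i.e. found a cycle.

```python
def has_cycle(graph):
    """Detect a cycle in a directed graph via depth-first search."""
    UNSEEN, ON_PATH, DONE = 0, 1, 2
    state = {node: UNSEEN for node in graph}

    def visit(node):
        state[node] = ON_PATH
        for nxt in graph[node]:
            if state[nxt] == ON_PATH:
                return True  # looped back to a node on the current path
            if state[nxt] == UNSEEN and visit(nxt):
                return True
        state[node] = DONE
        return False

    return any(state[node] == UNSEEN and visit(node) for node in graph)

# The map/reduce example is acyclic; the second graph loops a -> b -> a.
acyclic = {"map_1": ["reduce_1"], "map_2": ["reduce_1"], "reduce_1": []}
cyclic = {"a": ["b"], "b": ["a"]}
```

A scheduler can run a check like this (or rely on a topological sort failing) to reject workflows that could never terminate, which is why DAG-based systems insist on the "acyclic" part.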