Is there a name for the process of creating a DAG based on underlying data and code dependencies?

I am taking over a monorepo of data processing programs that run iteratively in stages (for example: stage 1 runs programs 1 through 30, then the data is verified; stage 2 starts at program 20 and runs to program 80, with inputs frozen at program 19's state if the initial input datasets haven't changed; this continues up to a final data product), and I will be converting the entire process to Python. The only package source I have for managing code dependencies is an internal conda mirror.

The current process has ~300 programs and dependencies such as "Program 40 is only run, and only writes to dataset Z, if program 23 has run, dataset X is marked as ready for parsing, program 37 is not included in this stage, and the run stage is < 3".
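For illustration, a rule like that could be written down as a small, parsable per-program manifest instead of being tracked by hand. This is only a sketch of what I have in mind; every field name below is invented for the example, not taken from any existing tool:

```python
# Hypothetical manifest for program 40; field names are illustrative.
PROGRAM_40 = {
    "name": "program_40",
    "reads": ["dataset_X"],                  # artifacts consumed
    "writes": ["dataset_Z"],                 # artifacts produced
    "requires_ran": ["program_23"],          # hard ordering prerequisite
    "requires_status": {"dataset_X": "ready_for_parsing"},
    "excluded_with": ["program_37"],         # skip if program 37 is in this stage
    "max_stage": 2,                          # only run while run stage < 3
}
```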

These dependencies are manually tracked. Programmers specify a status for their program from this master list for each iterative run, and they need to coordinate with each other to create the input list for any given stage. The starting and stopping points for each stage are determined by which inputs are "frozen", which programs are modified, which environment flags are set, and so on. Unsurprisingly, unexpected side effects are quite common when programmers fail to update the input list to match changes to their code.

I want to automate this process by replacing the manual input list with a DAG of which programs to run in a given stage, which I can pass to the job scheduler. I feel like I'm starting one layer below the DAG level, though: program Z doesn't actually care about running after program Y, just after whichever program modifies its input dataset variables or artifacts (config files, environment variables, flat datasets, log files, etc.).

As long as each programmer specifies exactly what inputs any particular program X needs in a parsable way, and as long as I have a starting linear list of program order (so that if programs X and Y both modify some variable, I know which should run first), I should be able to generate a DAG of the actual programs to run for any given stage of a run order. This also has the benefit that the DAG stays correct when a program's input variables change, or when some dependency is updated so that a program no longer directly depends on a previous one (which happens fairly often).
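As a rough sketch of that generation step (assuming per-program manifests with `reads`/`writes` fields like the hypothetical example above, and a list already sorted in the existing linear run order), Python's standard-library `graphlib` can handle the scheduling once the edges are derived:

```python
from graphlib import TopologicalSorter

def build_dag(manifests):
    """Derive program->program edges from artifact reads/writes.

    `manifests` is a list of dicts like the hypothetical entry above,
    ordered by the existing linear run order, which is used to break
    ties when several programs write the same artifact.
    """
    producer = {}  # artifact -> program that most recently wrote it
    graph = {m["name"]: set() for m in manifests}
    for m in manifests:
        for artifact in m.get("reads", []):
            if artifact in producer:
                # Depend on whichever program last wrote this artifact.
                graph[m["name"]].add(producer[artifact])
        for artifact in m.get("writes", []):
            if artifact in producer:
                # Successive writers of one artifact must also be ordered.
                graph[m["name"]].add(producer[artifact])
            producer[artifact] = m["name"]
    return graph

# graphlib consumes {node: {predecessors}} directly:
# order = list(TopologicalSorter(build_dag(manifests)).static_order())
```

The linear order only matters as a tiebreak here: walking the manifests in that order means a reader depends on the most recent writer of each artifact. (Write-after-read ordering and the status/stage conditions from the manifest would still need their own handling; this only covers the read/write edges.)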

I have to assume this is a fairly common issue, so before I try to reinvent the wheel: is there an existing tool, process, or design pattern that attempts to solve this problem? And ideally, does it have a way to bundle changes to data, configs, and code as dependencies?

The Python build tool Pants seemed very close to this, but I wasn't able to test it because of its installation requirements and its lack of support for conda.
