Elkhound API
Elkhound is an opinionated, data-centric workflow engine.
-
class elkhound.CSVDataFileSpec(code, name, extension='csv', flags=0, schema=None, dialect=<class 'csv.excel'>)
Specification of a CSV data file format.
-
is_csv() → bool
Returns: Whether the file is a CSV file.
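The default `dialect`, `csv.excel`, is the Excel-style dialect from Python's standard `csv` module. How Elkhound uses the dialect internally is not described here, so the sketch below relies only on the standard library to show what a custom dialect class looks like. `SemicolonDialect` and `roundtrip` are hypothetical names introduced for illustration.

```python
import csv
import io


class SemicolonDialect(csv.excel):
    """Hypothetical dialect: same as csv.excel, but semicolon-delimited."""
    delimiter = ";"


def roundtrip(rows, dialect):
    """Write rows with the given dialect, then parse the text back."""
    buffer = io.StringIO()
    csv.writer(buffer, dialect=dialect).writerows(rows)
    buffer.seek(0)
    return buffer.getvalue(), list(csv.reader(buffer, dialect=dialect))


# A field containing the delimiter is quoted on write and restored on read.
text, parsed = roundtrip([["id", "name"], ["1", "a;b"]], SemicolonDialect)
```

Judging from the signature alone, such a class could presumably be passed as `dialect=SemicolonDialect` when constructing a `CSVDataFileSpec`; that usage is an inference, not something this reference confirms.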
-
-
class elkhound.DataFileSpec(code, name, extension, flags=0)
Specification of a data file format.
-
is_binary() → bool
Returns: Whether the file is binary. If the file is gzipped, this refers to the underlying file (i.e. after unpacking).
-
is_csv() → bool
Returns: Whether the file is a CSV file.
-
is_directory() → bool
Returns: Whether the “file” is actually a directory. If so, the binary and gzipped flags should be ignored.
-
is_gzipped() → bool
Returns: Whether the file is compressed with gzip.
-
-
class elkhound.Flag
Flags that can be set for a data file specification.
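The members of `elkhound.Flag` are not listed in this reference. As a rough illustration of how bit flags can back predicates such as `is_binary()`, `is_gzipped()` and `is_directory()`, here is a minimal sketch; `SketchFlag`, `SketchSpec`, `describe` and all member names and values are hypothetical and are not Elkhound's actual definitions.

```python
import enum


class SketchFlag(enum.IntFlag):
    """Hypothetical flags; the real elkhound.Flag members may differ."""
    BINARY = 1
    GZIPPED = 2
    DIRECTORY = 4


class SketchSpec:
    """Hypothetical stand-in with the same constructor shape as DataFileSpec."""

    def __init__(self, code: int, name: str, extension: str, flags: int = 0) -> None:
        self.code = code
        self.name = name
        self.extension = extension
        self.flags = SketchFlag(flags)

    def is_binary(self) -> bool:
        return bool(self.flags & SketchFlag.BINARY)

    def is_gzipped(self) -> bool:
        return bool(self.flags & SketchFlag.GZIPPED)

    def is_directory(self) -> bool:
        return bool(self.flags & SketchFlag.DIRECTORY)


def describe(spec: SketchSpec) -> str:
    """Summarise how a file would be handled."""
    # For directories, the binary and gzipped flags are ignored.
    if spec.is_directory():
        return "directory"
    layers = ["gzipped"] if spec.is_gzipped() else []
    # "binary" describes the underlying file, i.e. after unpacking.
    layers.append("binary" if spec.is_binary() else "text")
    return " ".join(layers)
```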
-
class elkhound.Task
A Task describes how to produce output data files from input data files.
This is an abstract base class. Each subclass of Task should implement three methods: get_input_data_file_codes(), get_output_data_file_codes(), and run().
-
get_input_data_file_codes() → typing.List[int]
Returns the data file codes that the task can read. One data file can serve as an input for several tasks.
Returns: List of input data file codes.
-
get_output_data_file_codes() → typing.List[int]
Returns the data file codes that the task can write. The lowest output file code has to be greater than the highest input file code. At most one task registered in an engine can write a given data file.
Returns: List of output data file codes.
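The two constraints on output codes can be checked mechanically. The following is a minimal sketch of such a check, not Elkhound's own validation; the function name `check_task_codes` and the plain-dictionary representation of tasks are assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def check_task_codes(tasks: Dict[str, Tuple[List[int], List[int]]]) -> List[str]:
    """Return human-readable violations; an empty list means all is well.

    `tasks` maps a (hypothetical) task name to (input codes, output codes).
    """
    problems: List[str] = []
    writers: Dict[int, str] = {}
    for name, (inputs, outputs) in tasks.items():
        # Rule 1: the lowest output code must exceed the highest input code.
        if inputs and outputs and min(outputs) <= max(inputs):
            problems.append(
                f"{name}: lowest output {min(outputs)} is not greater "
                f"than highest input {max(inputs)}"
            )
        # Rule 2: at most one task may write a given data file.
        for code in outputs:
            if code in writers:
                problems.append(f"{name}: code {code} is already written by {writers[code]}")
            else:
                writers[code] = name
    return problems


ok = check_task_codes({"clean": ([10], [20]), "report": ([20], [30])})
bad = check_task_codes({"clean": ([10], [20]), "loop": ([25], [20, 30])})
```

In the second call, `loop` breaks both rules: its lowest output (20) does not exceed its highest input (25), and code 20 is already written by `clean`.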
-
run(input_files: typing.Dict[int, elkhound.file.DataFile], output_files: typing.Dict[int, elkhound.file.DataFile], context: typing.Dict[str, typing.Any])
Runs the task.
Parameters:
- input_files – Dictionary of DataFile objects, one for each input file code.
- output_files – Dictionary of DataFile objects, one for each output file code.
- context – Dictionary of additional information necessary to run the task.
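To make the three-method contract of `Task` concrete, here is a minimal sketch of a subclass. It runs without Elkhound installed, so it uses stand-ins: `SketchTask` mirrors the documented method signatures, and `SketchDataFile` assumes only a `path` attribute, since the real `elkhound.file.DataFile` interface is not described in this reference. `UppercaseTask`, `demo`, the codes 10 and 20, and the `suffix` context key are all hypothetical.

```python
import abc
import os
import tempfile
from typing import Any, Dict, List


class SketchDataFile:
    """Hypothetical stand-in for elkhound.file.DataFile; only a path is assumed."""

    def __init__(self, path: str) -> None:
        self.path = path


class SketchTask(abc.ABC):
    """Stand-in mirroring the three documented methods of elkhound.Task."""

    @abc.abstractmethod
    def get_input_data_file_codes(self) -> List[int]: ...

    @abc.abstractmethod
    def get_output_data_file_codes(self) -> List[int]: ...

    @abc.abstractmethod
    def run(self, input_files: Dict[int, SketchDataFile],
            output_files: Dict[int, SketchDataFile],
            context: Dict[str, Any]) -> None: ...


class UppercaseTask(SketchTask):
    """Reads data file 10 and writes its upper-cased content to data file 20."""

    def get_input_data_file_codes(self) -> List[int]:
        return [10]

    def get_output_data_file_codes(self) -> List[int]:
        return [20]  # greater than the highest input code, as required

    def run(self, input_files, output_files, context) -> None:
        with open(input_files[10].path) as source:
            text = source.read()
        with open(output_files[20].path, "w") as target:
            target.write(text.upper() + context.get("suffix", ""))


def demo() -> str:
    """Drive the task by hand, roughly as an engine might."""
    task = UppercaseTask()
    with tempfile.TemporaryDirectory() as workspace:
        inputs = {code: SketchDataFile(os.path.join(workspace, f"{code}.txt"))
                  for code in task.get_input_data_file_codes()}
        outputs = {code: SketchDataFile(os.path.join(workspace, f"{code}.txt"))
                   for code in task.get_output_data_file_codes()}
        with open(inputs[10].path, "w") as handle:
            handle.write("hello")
        task.run(inputs, outputs, {"suffix": "!"})
        with open(outputs[20].path) as handle:
            return handle.read()
```

In real use the subclass would derive from `elkhound.Task` and the engine, not a hand-written `demo`, would build the two dictionaries.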
-
-
class elkhound.Engine(timestamp: typing.Union[int, NoneType] = None)
Engine orchestrates the execution of tasks. In particular, the engine is responsible for:
- managing information about data files, tasks and workflows;
- versioning intermediate data files;
- sequencing tasks and supplying the right input files.
The engine configuration, which contains information about data files, tasks and workflows, can be read from a YAML file using the read() method.
-
expand_targets(targets: typing.List[str], dependencies: bool = False) → typing.List[int]
Resolves workflow names and optionally adds dependencies.
Parameters:
- targets – List of data file codes and workflow names to resolve.
- dependencies – Whether to add upstream tasks.
Returns: List of data file codes after expanding workflows and (optionally) adding dependencies.
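The behaviour described for `expand_targets()` can be approximated as follows. This is a sketch under stated assumptions, not the engine's implementation: `workflows` (name to data file codes) and `producers` (output code to the input codes of the task that writes it) are hypothetical structures, numeric targets are assumed to arrive as strings, and only upstream codes that some task produces are added.

```python
from typing import Dict, List, Set


def expand_targets_sketch(targets: List[str],
                          workflows: Dict[str, List[int]],
                          producers: Dict[int, List[int]],
                          dependencies: bool = False) -> List[int]:
    """Resolve workflow names to codes and optionally add upstream codes."""
    codes: Set[int] = set()
    for target in targets:
        if target in workflows:
            codes.update(workflows[target])   # workflow name: expand it
        else:
            codes.add(int(target))            # otherwise treat it as a code
    if dependencies:
        pending = list(codes)
        while pending:
            code = pending.pop()
            for upstream in producers.get(code, []):
                # Assumption: source files that no task produces are skipped.
                if upstream in producers and upstream not in codes:
                    codes.add(upstream)
                    pending.append(upstream)
    return sorted(codes)


workflows = {"daily": [30, 40]}
producers = {20: [10], 30: [20], 40: [20]}
plain = expand_targets_sketch(["daily"], workflows, producers)
full = expand_targets_sketch(["daily"], workflows, producers, dependencies=True)
```

Here `plain` contains only the workflow's own codes, while `full` also picks up code 20, which both workflow targets depend on; code 10 is left out because no task produces it.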
-
read(config_file_name)
Reads the complete engine configuration from a YAML file.
Parameters: config_file_name – Configuration file name.
-
run(workspace: str, targets: typing.List[int], context: typing.Dict[str, typing.Any] = None)
Runs the tasks that produce the specified targets (data file codes).
Parameters:
- workspace – Path to the workspace directory containing data files.
- targets – List of codes of the data files to produce.
- context – Dictionary of additional information necessary to run tasks.
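One consequence of the rule that every task's lowest output code exceeds its highest input code is that sorting tasks by lowest output code always yields a valid execution order. The sketch below illustrates that idea only; it is not how `Engine.run()` is known to work, and whether `run()` resolves upstream tasks itself is not stated in this reference. The function name and the dictionary representation of tasks are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def execution_order(tasks: Dict[str, Tuple[List[int], List[int]]],
                    targets: List[int]) -> List[str]:
    """Return names of the tasks needed for `targets`, in a runnable order.

    `tasks` maps a (hypothetical) task name to (input codes, output codes).
    """
    producer = {code: name
                for name, (_, outputs) in tasks.items()
                for code in outputs}
    needed: Set[str] = set()
    pending = list(targets)
    while pending:
        code = pending.pop()
        name = producer.get(code)
        if name is None or name in needed:
            continue                      # source file, or task already selected
        needed.add(name)
        pending.extend(tasks[name][0])    # walk upstream through the inputs
    # A producer's lowest output code is always below that of its consumers,
    # so ascending order of lowest output code respects every dependency.
    return sorted(needed, key=lambda name: min(tasks[name][1]))


tasks = {"stats": ([20], [40]), "report": ([20], [30]), "clean": ([10], [20])}
order = execution_order(tasks, [40, 30])
```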
-
elkhound.run_engine(timestamp: int = None, callback=None, logs: bool = True)
Sets up and runs an engine instance: reads config files, parses command-line arguments, registers the file specs and tasks found in the config, sets up logging, and then runs the engine.
Calling this function from the program's main function is recommended. If additional tweaking is needed between setting up the engine and running it, provide a callback function.
Parameters:
- timestamp – Optional timestamp of the run, as an integer in YYYYMMDDHHMMSS format.
- callback – Optional function to call just before running the engine. The callback receives two arguments: the configured engine instance, and a dictionary of the arguments that are about to be passed to the engine’s run() method.
- logs – Whether to configure logging.
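The `timestamp` format and the callback's two-argument shape can be sketched with the standard library alone. The helper names below are hypothetical. The dictionary key `context` is an assumption based on the parameter names of `Engine.run()`, and it is likewise assumed, not confirmed, that changes made to the dictionary are picked up by the engine.

```python
from datetime import datetime
from typing import Any, Dict


def make_timestamp(moment: datetime) -> int:
    """Encode a datetime as an integer in YYYYMMDDHHMMSS format."""
    return int(moment.strftime("%Y%m%d%H%M%S"))


def add_run_label(engine: Any, run_args: Dict[str, Any]) -> None:
    """Callback sketch: takes the engine and the pending run() arguments."""
    # Assumption: the dictionary is keyed by run()'s parameter names.
    context = run_args.get("context") or {}
    context["label"] = "nightly"
    run_args["context"] = context


stamp = make_timestamp(datetime(2024, 1, 31, 9, 5, 0))

# Simulate what the callback would receive; no real engine is involved.
pending_args = {"workspace": "/tmp/ws", "targets": [30], "context": None}
add_run_label(None, pending_args)
```

With Elkhound installed, these would presumably be wired together as `elkhound.run_engine(timestamp=stamp, callback=add_run_label)`, following the signature above.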