gen3_validator package

Submodules

gen3_validator.dict module

class gen3_validator.dict.DataDictionary(schema_path: str)

Bases: object

calculate_node_order()

Compute the node order by calling, in turn, the methods that gather node information, build node pairs, and derive the node order.

generate_node_lookup() dict

Generate a lookup dictionary for nodes, mapping node names to their categories and properties.

Returns:

A dictionary mapping node names to their category and properties.

Return type:

dict

get_all_node_pairs(excluded_nodes=['_definitions.yaml', '_terms.yaml', '_settings.yaml', 'program.yaml']) list

Retrieve all node pairs, excluding specified nodes.

Parameters:

excluded_nodes (list) – A list of node names to exclude.

Returns:

A list of node pairs.

Return type:

list

get_node_category(node_name: str) tuple

Retrieve the category and ID for a given node, excluding certain nodes.

Parameters:

node_name (str) – The name of the node.

Returns:

A tuple containing the node ID and its category, or None if the node is excluded.

Return type:

tuple

Retrieve the links and ID for a given node.

Parameters:

node_name (str) – The name of the node.

Returns:

A tuple containing the node ID and its links.

Return type:

tuple

get_node_order(edges: list) list

Determine the order of nodes based on their dependencies.

Parameters:

edges (list) – A list of tuples, where each tuple is a node pair (upstream, downstream).

Returns:

A list of nodes in topological order.

Return type:

list
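
As an illustration of the technique, here is a minimal sketch of deriving a topological order from (upstream, downstream) pairs with the standard-library graphlib module. The helper name topological_order and the node names are hypothetical, and the package's own implementation may differ.

```python
from graphlib import TopologicalSorter


def topological_order(edges):
    """Return nodes so that every upstream node precedes its downstream nodes."""
    sorter = TopologicalSorter()
    for upstream, downstream in edges:
        # graphlib expects add(node, *predecessors), so the downstream
        # node is registered with its upstream node as a predecessor.
        sorter.add(downstream, upstream)
    return list(sorter.static_order())


# Hypothetical node pairs, for illustration only.
edges = [("program", "project"), ("project", "subject"), ("subject", "sample")]
order = topological_order(edges)
print(order)
```

Submitting data in this order means a parent record always exists before any record that links to it.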

get_node_properties(node_name: str) tuple

Retrieve the properties for a given node.

Parameters:

node_name (str) – The name of the node.

Returns:

A tuple containing the node ID and its properties.

Return type:

tuple

get_nodes() list

Retrieve all node names from the schema.

Returns:

A list of node names.

Return type:

list

get_schema_version(schema: dict) str

Extract the version of the schema from the provided schema dictionary.

Parameters:

schema (dict) – The schema dictionary from which to extract the version.

Returns:

The version of the schema.

Return type:

str

parse_schema()

Read the Gen3 JSON schema bundle and split it into individual node schemas.

Data is stored in self.schema and self.schema_list.

read_json(path: str) dict

Read a JSON file and return its contents as a dictionary.

Parameters:

path (str) – The path to the JSON file.

Returns:

The contents of the JSON file.

Return type:

dict

return_schema(schema_id: str) dict

Retrieve the first dictionary from a list where the ‘id’ key matches the schema_id.

Parameters:

schema_id (str) – The value of the ‘id’ key to match.

Returns:

The dictionary that matches the schema_id, or None if not found.

Return type:

dict

schema_list_to_json(schema_list: list) dict

Convert a list of JSON schemas to a dictionary where each key is the schema id with ‘.yaml’ appended, and the value is the schema content.

Parameters:

schema_list (list) – A list of gen3 JSON schemas.

Returns:

A dictionary with schema ids as keys and schema contents as values.

Return type:

dict
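
The conversion described above can be sketched in a few lines. This is an assumption-labelled illustration, not the package's code: the function name schema_list_to_dict and the sample schemas are hypothetical.

```python
def schema_list_to_dict(schema_list):
    """Key each schema by its 'id' value with '.yaml' appended."""
    return {f"{schema['id']}.yaml": schema for schema in schema_list}


# Hypothetical, heavily trimmed node schemas.
schemas = [
    {"id": "subject", "category": "administrative"},
    {"id": "sample", "category": "biospecimen"},
]
result = schema_list_to_dict(schemas)
print(sorted(result))
```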

split_json() list

Split the schema into a list of individual node schemas.

Returns:

A list of node schemas.

Return type:

list

class gen3_validator.dict.PathInfo(path: List[str], steps: int)

Bases: object

Data structure representing a single path in a directed graph.

path

The sequence of node names (as strings) representing the path from the root to the destination node.

Type:

List[str]

steps

The number of steps (edges) in the path. This is typically len(path) - 1.

Type:

int

path: List[str]
steps: int
gen3_validator.dict.build_graph(edges: List[Tuple[str, str]], ignore_nodes: List[str] | None = None) Tuple[Dict[str, List[str]], Set[str], Set[str]]

Build an adjacency list representation of a directed graph from a list of edges.

Parameters:
  • edges (List[Tuple[str, str]]) – A list of (upstream, downstream) node pairs representing directed edges in the graph.

  • ignore_nodes (Optional[List[str]], optional) – A list of node names to ignore when building the graph. Edges involving these nodes are skipped. Defaults to None.

Returns:

  • graph (Dict[str, List[str]]) – The adjacency list representation of the graph, mapping each node to a list of its downstream neighbors.

  • all_nodes (Set[str]) – The set of all node names present in the graph (including both upstream and downstream nodes).

  • downstream_nodes (Set[str]) – The set of all nodes that appear as downstream nodes in any edge.

Examples

>>> edges = [('A', 'B'), ('B', 'C')]
>>> build_graph(edges)
({'A': ['B'], 'B': ['C']}, {'A', 'B', 'C'}, {'B', 'C'})
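
To show how ignore_nodes affects the result, the following sketch re-implements the documented behaviour in plain Python. The name build_graph_sketch is hypothetical and the code is an approximation of the description above, not the package's source.

```python
from typing import Dict, List, Optional, Set, Tuple


def build_graph_sketch(
    edges: List[Tuple[str, str]],
    ignore_nodes: Optional[List[str]] = None,
) -> Tuple[Dict[str, List[str]], Set[str], Set[str]]:
    ignored = set(ignore_nodes or [])
    graph: Dict[str, List[str]] = {}
    all_nodes: Set[str] = set()
    downstream_nodes: Set[str] = set()
    for upstream, downstream in edges:
        # Skip any edge that touches an ignored node.
        if upstream in ignored or downstream in ignored:
            continue
        graph.setdefault(upstream, []).append(downstream)
        all_nodes.update((upstream, downstream))
        downstream_nodes.add(downstream)
    return graph, all_nodes, downstream_nodes


graph, all_nodes, downstream_nodes = build_graph_sketch(
    [("A", "B"), ("B", "C"), ("B", "X")], ignore_nodes=["X"]
)
print(graph)
```

The edge ('B', 'X') is dropped, so 'X' appears in none of the three return values.
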
gen3_validator.dict.find_all_paths(graph: Dict[str, List[str]], start_node: str, ignore_nodes: List[str] | None = None) List[List[str]]

Find all possible acyclic paths starting from a given node in a directed graph.

Parameters:
  • graph (Dict[str, List[str]]) – The adjacency list representation of the graph.

  • start_node (str) – The node from which to start searching for paths.

  • ignore_nodes (Optional[List[str]], optional) – A list of node names to ignore during traversal. Defaults to None.

Returns:

A list of paths, where each path is a list of node names (strings) from the start_node to a destination node. Each path has at least two nodes (start and destination).

Return type:

List[List[str]]

Notes

  • Cycles are avoided: a node is not revisited in the same path.

  • Nodes in ignore_nodes are not included in any path.

Examples

>>> graph = {'A': ['B', 'C'], 'B': ['C'], 'C': []}
>>> find_all_paths(graph, 'A')
[['A', 'B'], ['A', 'B', 'C'], ['A', 'C']]
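
The traversal described in the notes can be sketched as a depth-first search that records every partial path it reaches. The name find_all_paths_sketch is hypothetical; this is one way to obtain the documented output, not necessarily how the package does it.

```python
def find_all_paths_sketch(graph, start_node, ignore_nodes=None):
    ignored = set(ignore_nodes or [])
    paths = []

    def walk(path):
        for neighbor in graph.get(path[-1], []):
            # Skip ignored nodes, and never revisit a node already on this path.
            if neighbor in ignored or neighbor in path:
                continue
            new_path = path + [neighbor]
            paths.append(new_path)  # every recorded path has at least two nodes
            walk(new_path)

    if start_node not in ignored:
        walk([start_node])
    return paths


acyclic = find_all_paths_sketch({"A": ["B", "C"], "B": ["C"], "C": []}, "A")
cyclic = find_all_paths_sketch({"A": ["B"], "B": ["A"]}, "A")
print(acyclic)
print(cyclic)
```

In the second call the edge from 'B' back to 'A' is not followed, which is what keeps the search finite on cyclic graphs.
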
gen3_validator.dict.find_root_node(all_nodes: Set[str], downstream_nodes: Set[str], ignore_nodes: List[str] | None = None, root_node: str | None = None) List[str]

Identify the root nodes of a directed graph.

A root node is defined as a node that does not appear as a downstream node in any edge, and is not in the ignore_nodes list. If a specific root_node is provided, only that node is returned.

Parameters:
  • all_nodes (Set[str]) – The set of all node names in the graph.

  • downstream_nodes (Set[str]) – The set of all nodes that appear as downstream nodes in any edge.

  • ignore_nodes (Optional[List[str]], optional) – A list of node names to ignore as possible roots. Defaults to None.

  • root_node (Optional[str], optional) – If provided, this node is returned as the only root node.

Returns:

A list of root node names.

Return type:

List[str]

Examples

>>> all_nodes = {'A', 'B', 'C'}
>>> downstream_nodes = {'B', 'C'}
>>> find_root_node(all_nodes, downstream_nodes)
['A']
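
The root definition above amounts to a set difference. A minimal sketch, with the hypothetical name find_root_nodes_sketch; sorting the result is an assumption made here for deterministic output and is not documented behaviour.

```python
def find_root_nodes_sketch(all_nodes, downstream_nodes, ignore_nodes=None, root_node=None):
    # An explicit root_node short-circuits the search.
    if root_node is not None:
        return [root_node]
    ignored = set(ignore_nodes or [])
    # Roots are nodes that never appear downstream and are not ignored.
    return sorted(set(all_nodes) - set(downstream_nodes) - ignored)


roots = find_root_nodes_sketch({"A", "B", "C", "D"}, {"B", "C"})
filtered = find_root_nodes_sketch({"A", "B", "C", "D"}, {"B", "C"}, ignore_nodes=["D"])
forced = find_root_nodes_sketch({"A", "B", "C"}, {"B", "C"}, root_node="B")
print(roots, filtered, forced)
```
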
gen3_validator.dict.get_min_node_path(edges: list, target_node: str, ignore_nodes: list = ['core_metadata_collection'], root_node: str | None = None) PathInfo

Find the shortest path from a root node (or specified root_node) to a target node in a directed graph.

Parameters:
  • edges (list of tuple) – List of (upstream, downstream) node pairs representing the directed edges of the graph.

  • target_node (str) – The destination node for which the shortest path is sought.

  • ignore_nodes (list, optional) – List of node names to ignore in the graph and in path traversal. Defaults to [“core_metadata_collection”].

  • root_node (Optional[str], optional) – If provided, only paths starting from this node are considered as root paths.

Returns:

The PathInfo object representing the shortest path from a root node to the target_node.

Return type:

PathInfo

Raises:

ValueError – If no path exists from any root node to the target_node.

Examples

>>> edges = [('A', 'B'), ('B', 'C')]
>>> get_min_node_path(edges, 'C')
PathInfo(path=['A', 'B', 'C'], steps=2)
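
One way to find such a shortest path is a breadth-first search started from every root at once. The sketch below is an assumption-labelled illustration: PathInfoSketch and shortest_path_sketch are hypothetical names, and the package may instead enumerate all paths and pick the one with the fewest steps.

```python
from collections import deque
from dataclasses import dataclass
from typing import List


@dataclass
class PathInfoSketch:
    path: List[str]
    steps: int


def shortest_path_sketch(edges, target_node,
                         ignore_nodes=("core_metadata_collection",),
                         root_node=None):
    ignored = set(ignore_nodes)
    graph, nodes, downstream = {}, set(), set()
    for up, down in edges:
        if up in ignored or down in ignored:
            continue
        graph.setdefault(up, []).append(down)
        nodes.update((up, down))
        downstream.add(down)
    roots = [root_node] if root_node is not None else sorted(nodes - downstream)
    # Breadth-first search: the first path that reaches the target is a shortest one.
    queue = deque([root] for root in roots)
    seen = set(roots)
    while queue:
        path = queue.popleft()
        if path[-1] == target_node:
            return PathInfoSketch(path=path, steps=len(path) - 1)
        for neighbor in graph.get(path[-1], []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    raise ValueError(f"No path from a root node to {target_node!r}")


chain = shortest_path_sketch([("A", "B"), ("B", "C")], "C")
shortcut = shortest_path_sketch([("A", "B"), ("B", "C"), ("A", "C")], "C")
print(chain, shortcut)
```
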
gen3_validator.dict.group_paths_by_destination(edges: list, ignore_nodes: list = ['core_metadata_collection'], root_node: str | None = None) Dict[str, List[PathInfo]]

Find and group all possible acyclic paths in a directed graph by their destination node.

For each destination node, all unique paths from any root node (or a specified root_node) to that destination are collected, ignoring any nodes in ignore_nodes.

Parameters:
  • edges (list of tuple) – List of (upstream, downstream) node pairs representing the directed edges of the graph.

  • ignore_nodes (list, optional) – List of node names to ignore in the graph and in path traversal. Defaults to [“core_metadata_collection”].

  • root_node (Optional[str], optional) – If provided, only paths starting from this node are considered as root paths.

Returns:

A dictionary mapping each destination node name to a list of PathInfo objects, each representing a unique path from a root node to that destination.

Return type:

Dict[str, List[PathInfo]]

Examples

>>> edges = [('A', 'B'), ('B', 'C')]
>>> group_paths_by_destination(edges)
{'B': [PathInfo(path=['A', 'B'], steps=1)], 'C': [PathInfo(path=['A', 'B', 'C'], steps=2)]}
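
When a node can be reached by more than one route, its entry holds several paths. The sketch below illustrates this on a diamond-shaped graph; group_paths_sketch is a hypothetical name, and plain dictionaries stand in for PathInfo objects to keep the example short.

```python
from collections import defaultdict


def group_paths_sketch(edges, ignore_nodes=("core_metadata_collection",)):
    ignored = set(ignore_nodes)
    graph, downstream = defaultdict(list), set()
    for up, down in edges:
        if up not in ignored and down not in ignored:
            graph[up].append(down)
            downstream.add(down)
    # Roots are upstream nodes that never appear downstream.
    roots = sorted(set(graph) - downstream)
    grouped = defaultdict(list)

    def walk(path):
        for nxt in graph.get(path[-1], []):
            if nxt not in path:  # avoid cycles
                new_path = path + [nxt]
                grouped[nxt].append({"path": new_path, "steps": len(new_path) - 1})
                walk(new_path)

    for root in roots:
        walk([root])
    return dict(grouped)


grouped = group_paths_sketch([("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")])
print(grouped["D"])
```

Here 'D' is reachable through both 'B' and 'C', so its list contains two paths of two steps each.
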

gen3_validator.linkage module

class gen3_validator.linkage.Linkage(root_node: List[str] | None = None)

Bases: object

generate_config(data_map, link_suffix: str = 's') dict

Generates a configuration dictionary for entities based on the data map.

This method creates a configuration dictionary where each key is an entity name and the value is a dictionary containing ‘primary_key’ and ‘foreign_key’ for that entity. The primary key is constructed using the entity name and the provided link suffix. The foreign key is determined by searching for a key in the data that contains a ‘submitter_id’.

Parameters:
  • data_map (dict) – A dictionary where each key is an entity name and the value is a list of data records for that entity.

  • link_suffix (str, optional) – A suffix to append to the primary key. Defaults to ‘s’.

Returns:

A configuration dictionary with primary and foreign keys for each entity.

Return type:

dict

get_foreign_keys(data_map: Dict[str, List[Dict[str, Any]]], config: Dict[str, Any]) dict

Extracts all foreign key values for each entity from the provided data map, using the foreign key field specified in the config for each entity.

Parameters:
  • data_map (Dict[str, List[Dict[str, Any]]]) – A dictionary where each key is an entity name (e.g., “sample”, “subject”), and each value is a list of records (dictionaries) for that entity. Each record should contain the foreign key field as specified in the config.

  • config (Dict[str, Any]) – A dictionary where each key is an entity name, and each value is a dictionary containing at least the key ‘foreign_key’, which specifies the field name in the records to use as the foreign key.

Returns:

A dictionary mapping each entity name to a list of extracted foreign key values. If an entity has no foreign key specified in the config, its value will be an empty list.

Return type:

Dict[str, List[Any]]

Raises:
  • KeyError – If an entity specified in the config is missing from the data_map.

  • Exception – If an unexpected error occurs during extraction for any entity.

Note

  • If a record’s foreign key field is missing, that record is skipped with a warning.

  • If the foreign key value is a dictionary containing a ‘submitter_id’, that value is used.

  • Otherwise, the value of the foreign key field is used directly.

  • If the foreign key field is None in the config, extraction is skipped for that entity.

Example:

data_map = {
    "sample": [
        {"subjects": {"submitter_id": "subject_1"}, ...},
        {"subjects": "subject_2", ...}
    ]
}
config = {
    "sample": {"foreign_key": "subjects", ...}
}
# Returns: {"sample": ["subject_1", "subject_2"]}
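
The extraction rules in the note can be sketched as follows. This is a stand-alone approximation under the assumptions stated in the comments; extract_keys_sketch is a hypothetical name and the code is not the package's implementation.

```python
import logging

logger = logging.getLogger(__name__)


def extract_keys_sketch(data_map, config, key_type="foreign_key"):
    extracted = {}
    for entity, entity_config in config.items():
        field = entity_config.get(key_type)
        if field is None:
            # No key configured: extraction is skipped for this entity.
            extracted[entity] = []
            continue
        values = []
        for record in data_map[entity]:  # KeyError if the entity is missing
            if field not in record:
                logger.warning("Record in %s has no field %r; skipped", entity, field)
                continue
            value = record[field]
            # Unwrap {"submitter_id": ...} values; use anything else directly.
            if isinstance(value, dict) and "submitter_id" in value:
                value = value["submitter_id"]
            values.append(value)
        extracted[entity] = values
    return extracted


data_map = {
    "sample": [
        {"subjects": {"submitter_id": "subject_1"}},
        {"subjects": "subject_2"},
        {"other_field": "no link here"},
    ]
}
config = {"sample": {"foreign_key": "subjects"}}
foreign_keys = extract_keys_sketch(data_map, config)
print(foreign_keys)
```
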
get_primary_keys(data_map: Dict[str, List[Dict[str, Any]]], config: Dict[str, Any]) dict

Extracts all primary key values for each entity from the provided data map, using the primary key field specified in the config for each entity.

Parameters:
  • data_map (Dict[str, List[Dict[str, Any]]]) – A dictionary where each key is an entity name (e.g., “sample”, “subject”), and each value is a list of records (dictionaries) for that entity. Each record should contain the primary key field as specified in the config.

  • config (Dict[str, Any]) – A dictionary where each key is an entity name, and each value is a dictionary containing at least the key ‘primary_key’, which specifies the field name in the records to use as the primary key.

Returns:

A dictionary mapping each entity name to a list of extracted primary key values. If an entity has no primary key specified in the config, its value will be an empty list.

Return type:

Dict[str, List[Any]]

Raises:
  • KeyError – If an entity specified in the config is missing from the data_map.

  • Exception – If an unexpected error occurs during extraction for any entity.

Note

  • If a record’s primary key field is missing, that record is skipped with a warning.

  • If the primary key value is a dictionary containing a ‘submitter_id’, that value is used.

  • Otherwise, the value of the primary key field is used directly.

  • If the primary key field is None in the config, extraction is skipped for that entity.

Example:

data_map = {
    "subject": [
        {"subjects": "subject_1", ...},
        {"subjects": {"submitter_id": "subject_2"}, ...}
    ]
}
config = {
    "subject": {"primary_key": "subjects", ...}
}
# Returns: {"subject": ["subject_1", "subject_2"]}

Validates the configuration map by checking the foreign key links between entities.

This method checks if the foreign key of each entity in the config map matches the primary key of any other entity. If a match is not found and the entity is not a root node, it records the broken link. Root nodes are allowed to have unmatched foreign keys.

Parameters:
  • config_map (Dict[str, Any]) – A dictionary containing the configuration of entities, where each key is an entity name and the value is a dictionary with ‘primary_key’ and ‘foreign_key’.

  • root_node (List[str], optional) – A list of root node names that are allowed to have unmatched foreign keys. Defaults to [‘subject’].

Returns:

A dictionary of entities with broken links and their foreign keys if any are found. Returns “valid” if no broken links are detected.

Return type:

dict

Raises:
  • KeyError – If a required key (‘primary_key’ or ‘foreign_key’) is missing in the config for any entity.

  • TypeError – If config_map is not a dictionary or its values are not dictionaries.
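
The link check described above can be sketched like this. The function name check_config_links_sketch and the example entities are hypothetical, and the sketch only mirrors the documented return values.

```python
def check_config_links_sketch(config_map, root_node=("subject",)):
    # Collect every primary key declared in the config.
    primary_keys = {cfg["primary_key"] for cfg in config_map.values()}
    broken = {
        entity: cfg["foreign_key"]
        for entity, cfg in config_map.items()
        # Root nodes may point at a key that no entity declares.
        if cfg["foreign_key"] not in primary_keys and entity not in root_node
    }
    return broken or "valid"


good_config = {
    "subject": {"primary_key": "subjects", "foreign_key": "projects"},
    "sample": {"primary_key": "samples", "foreign_key": "subjects"},
}
bad_config = dict(good_config, aliquot={"primary_key": "aliquots", "foreign_key": "specimens"})
print(check_config_links_sketch(good_config))
print(check_config_links_sketch(bad_config))
```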

Verifies the config, extracts primary and foreign key values from the data map, then checks the foreign key values against the primary key values.

First, validates the config map for correct foreign/primary key relationships. Then, for each entity, checks that all its foreign key values exist among the primary key values of any entity. Returns a dictionary mapping each entity to a list of invalid (unmatched) foreign key values.

Parameters:
  • data_map (Dict[str, List[Dict[str, Any]]]) – Contains the data for each entity.

  • config (Dict[str, Any]) – The entity linkage config.

  • root_node (List[str], optional) – List of root node names that are allowed to have unmatched foreign keys. Defaults to [‘subject’].

Returns:

Dictionary of entities and their validation results. If the config is invalid, returns the config validation result.

Return type:

Dict[str, List[str]]
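
The value-level check can be sketched as a membership test against the pooled primary key values. The name find_unmatched_foreign_keys is hypothetical, and treating root nodes as always valid at this stage is an assumption of this sketch.

```python
def find_unmatched_foreign_keys(primary_keys, foreign_keys, root_node=("subject",)):
    # Pool the primary key values of every entity.
    known = {value for values in primary_keys.values() for value in values}
    results = {}
    for entity, values in foreign_keys.items():
        if entity in root_node:
            results[entity] = []  # assumption: root nodes are not checked
        else:
            results[entity] = [value for value in values if value not in known]
    return results


# Hypothetical extracted key values.
primary = {"subject": ["subject_1", "subject_2"], "sample": ["sample_1"]}
foreign = {"subject": ["project_1"], "sample": ["subject_1", "subject_9"]}
unmatched = find_unmatched_foreign_keys(primary, foreign)
print(unmatched)
```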

gen3_validator.resolve_schema module

class gen3_validator.resolve_schema.ResolveSchema(schema_path: str)

Bases: DataDictionary

resolve_all_references() list

Resolve references in all schema node dictionaries using the resolved definitions schema.

Returns:

A list of resolved schema dictionaries, one for each node.

Return type:

list

resolve_references(schema: dict, reference: dict) dict

Recursively resolve all $ref references in a Gen3 JSON schema draft 4 node, using a reference schema that contains no references.

Parameters:
  • schema (dict) – The JSON node to resolve references in.

  • reference (dict) – The schema containing the reference definitions.

Returns:

The resolved JSON node with all references resolved.

Return type:

dict
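
Recursive reference resolution can be sketched as below. The sketch assumes each $ref is a string of the form <file>#/<key>[/<key>...] whose target in the reference schema is a dictionary; the name resolve_refs_sketch and the sample schemas are hypothetical, and the package's handling of edge cases may differ.

```python
def resolve_refs_sketch(node, reference):
    if isinstance(node, dict):
        ref = node.get("$ref")
        if isinstance(ref, str):
            # Walk the path after '#/' through the reference schema.
            target = reference
            for part in ref.split("#/", 1)[-1].split("/"):
                target = target[part]
            # Keep sibling keys, letting them override the referenced content.
            siblings = {
                key: resolve_refs_sketch(value, reference)
                for key, value in node.items()
                if key != "$ref"
            }
            return {**target, **siblings}
        return {key: resolve_refs_sketch(value, reference) for key, value in node.items()}
    if isinstance(node, list):
        return [resolve_refs_sketch(item, reference) for item in node]
    return node


reference = {"datetime": {"type": "string", "format": "date-time"}}
schema = {
    "properties": {
        "created": {"$ref": "_definitions.yaml#/datetime", "description": "Creation time"},
        "history": {"type": "array", "items": {"$ref": "_definitions.yaml#/datetime"}},
    }
}
resolved = resolve_refs_sketch(schema, reference)
print(resolved["properties"]["created"])
```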

resolve_schema()

Fully resolve and initialize all schema-related attributes for this instance.

This method performs the following steps:

  1. Reads and parses the raw schema from file.

  2. Extracts the definitions and terms schemas.

  3. Resolves references within the definitions schema using the terms schema.

  4. Resolves all references in each node schema using the resolved definitions.

  5. Converts the fully resolved node schemas into a JSON dictionary format.

After execution, the following instance attributes are set:

  • self.schema: Raw schema dictionary loaded from file.

  • self.schema_list: List of individual node schemas.

  • self.schema_def: Definitions schema dictionary.

  • self.schema_term: Terms schema dictionary.

  • self.schema_def_resolved: Definitions schema with references resolved.

  • self.schema_list_resolved: List of node schemas with all references resolved.

  • self.schema_resolved: Dictionary of resolved node schemas in JSON format.

Returns:

None

return_resolved_schema(schema_id: str) dict

Retrieve the first dictionary from the resolved schema list where the id key matches schema_id.

Parameters:

schema_id (str) – The value of the id key to match. May include or omit the .yaml extension.

Returns:

The dictionary that matches the schema_id, or None if not found.

Return type:

dict or None

gen3_validator.validate module

class gen3_validator.validate.Validate(data_map, resolved_schema)

Bases: object

The Validate class is responsible for validating data objects against a resolved JSON schema using the Draft4Validator from the jsonschema library. It provides methods to validate individual objects, manage validation results, and create key mappings for further processing.

Variables:
  • data_map (dict) – A dictionary of the data objects, where each key is an entity name and each value is a list of JSON objects, e.g. {‘sample’: [{id: 1, name: ‘sample1’}, {id: 2, name: ‘sample2’}]}

  • resolved_schema (dict) – The resolved gen3 JSON schema to validate against.

Methods:

  • __init__(data_map, resolved_schema): Initializes the Validate class with the provided data and schema, performs validation, and creates a key map.

  • validate_object(obj, idx, validator): Validates a single JSON object against a provided JSON schema validator and returns a list of validation results.

  • validate_schema(): Validates self.data_map against self.resolved_schema and returns the results.

  • make_keymap(): Generates a mapping of keys from the data_map for reference and lookup.

list_entities() list

Lists all entities present in the validation results.

Returns:

A list of entity names.

Return type:

list

list_index_by_entity(entity: str) list

Lists all index keys for a specified entity.

Parameters:

entity (str) – The name of the entity to list index keys for.

Returns:

A list of index keys for the specified entity.

Return type:

list

make_keymap() dict

Creates a dictionary that maps entities to their corresponding index keys.

Returns:

A dictionary where each key is an entity name and each value is a list of index keys for that entity.

Return type:

dict

pull_entity(entity: str, result_type: str = 'FAIL') list

Retrieves the validation results for a specified entity.

Parameters:
  • entity (str) – The name of the entity to retrieve validation results for.

  • result_type (str) – The type of validation result to return. One of “PASS”, “FAIL”, or “ALL”. Default is “FAIL”.

Returns:

A list of validation results for the specified entity, or None if no results are found.

Return type:

list

pull_index_of_entity(entity: str, index_key: int, result_type: str = 'FAIL', return_failed: bool = True) list

Retrieves the validation result for a specified entity and index key.

Parameters:
  • entity (str) – The name of the entity to retrieve validation results for.

  • index_key (int) – The index key of the validation result to retrieve.

  • result_type (str) – The type of validation result to return. One of “PASS”, “FAIL”, or “ALL”. Default is “FAIL”.

  • return_failed (bool) – Flag to determine if only failed results should be returned. Default is True.

Returns:

List of objects containing each validation result for the specified entity and index key, or None if not found. Each element in the list corresponds to a validation result for a specific property, while the index corresponds to the entry.

Return type:

list

validate_object(obj, idx, validator) list

Validates a single JSON object against a provided JSON schema validator.

Parameters:
  • obj (dict) – The JSON object to validate.

  • idx (int) – The index of the object in the dataset.

  • validator (Draft4Validator) – The JSON schema validator to use for validation.

Returns:

A list of dictionaries containing validation results and log messages.

Return type:

list

validate_schema() dict

Validates the data in self.data_map against the schemas in self.resolved_schema.

Returns:

A dictionary containing validation results for each entity.

Return type:

dict

class gen3_validator.validate.ValidateStats(validate_instance: Validate)

Bases: Validate

count_results_by_entity(entity: str, result_type: str = 'FAIL', print_results: bool = False) int

Counts the number of validation results for a specified entity. Each entry in the entity may produce more than one validation error, and every one of them is counted. For example, one entry in ‘sample’ may result in 5 validation errors. This function counts the total number of validation errors for the whole entity.

Parameters:
  • entity (str) – The name of the entity to count failed validation results for.

  • result_type (str) – The type of validation result to count. One of “PASS”, “FAIL”, or “ALL”. Default is “FAIL”.

  • print_results (bool) – Flag to print the results. Default is False.

Returns:

The number of failed validation results for the specified entity.

Return type:

int

count_results_by_index(entity: str, index_key: int, result_type: str = 'FAIL', print_results: bool = False)

Counts the number of validation results for a specified entity and index_key. For example, if row 1 (index 1) of the entity ‘sample’ has errors in 5 columns, the method returns 5.

Parameters:
  • entity (str) – The name of the entity to count validation results for.

  • index_key (int) – The key/index to count validation results for.

  • result_type (str) – The type of validation result to count. One of “PASS”, “FAIL”, or “ALL”. Default is “FAIL”.

  • print_results (bool) – Flag to print the results. Default is False.

Returns:

The number of validation results for the specified key/index.

Return type:

int

n_errors_per_entry(entity: str, index_key: int) int

Returns the number of validation errors for a given entity and index.

Parameters:
  • entity (str) – The name of the entity to check for validation errors.

  • index_key (int) – The index of the row to check for validation errors.

Returns:

The number of validation errors for the given entity and index.

Return type:

int

n_rows_with_errors(entity: str) int

Returns the number of rows that have validation errors for a given entity.

Parameters:

entity (str) – The name of the entity to check for validation errors.

Returns:

The number of rows with validation errors.

Return type:

int
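
The difference between counting errors and counting rows with errors can be sketched on a toy result set. The nested layout used here ({entity: {row_index: [result, ...]}} with a ‘validation_result’ field) is purely hypothetical and chosen for illustration; the package's internal result structure is not documented in this section.

```python
# Hypothetical layout: {entity: {row_index: [result, ...]}}
results = {
    "sample": {
        0: [
            {"validation_result": "FAIL", "invalid_key": "age"},
            {"validation_result": "FAIL", "invalid_key": "sex"},
        ],
        1: [{"validation_result": "PASS"}],
        2: [{"validation_result": "FAIL", "invalid_key": "age"}],
    }
}


def count_errors(results, entity):
    """Total number of failed results across all rows of an entity."""
    return sum(
        1
        for row in results[entity].values()
        for result in row
        if result["validation_result"] == "FAIL"
    )


def count_rows_with_errors(results, entity):
    """Number of rows that contain at least one failed result."""
    return sum(
        1
        for row in results[entity].values()
        if any(result["validation_result"] == "FAIL" for result in row)
    )


print(count_errors(results, "sample"), count_rows_with_errors(results, "sample"))
```

Row 0 contributes two errors but counts as a single row, so the two figures differ.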

summary_stats() DataFrame

Generates and prints a summary of validation statistics.

This method calculates the total number of validation errors across all entities and provides detailed statistics for each entity, including the number of rows with errors and the total number of errors per entity. The results are printed to the console and returned as a pandas DataFrame.

Returns:

A DataFrame containing the summary statistics with columns ‘entity’, ‘number_of_rows_with_errors’, and ‘number_of_errors_per_entity’.

Return type:

pandas.DataFrame

total_validation_errors() int

Calculates the total number of validation errors across all entities.

Returns:

The total number of validation errors.

Return type:

int

class gen3_validator.validate.ValidateSummary(validate_instance: Validate)

Bases: Validate

collapse_flatten_results_to_pd() DataFrame

Collapses the flattened validation results into a summarized pandas DataFrame.

This method groups the flattened validation results by ‘validation_error’ and aggregates other columns to provide a summary of the validation errors, including the count of occurrences for each error type.

Returns:

A DataFrame containing the collapsed summary of validation errors, sorted by entity, validation error, and count.

Return type:

pandas.DataFrame

flatten_validation_results(result_type: str = 'FAIL') dict

Flattens the validation results created when initializing the Validate class.

This method extracts all the validation results for each entity, each index row, and each entry in the index row. It effectively pulls all the entries for a particular entity, row, and column, where one row can produce validation errors in multiple columns.

Parameters:

result_type (str) – The type of validation result to filter by, default is “FAIL”.

Returns:

A dictionary containing flattened validation results with a unique GUID for each entry, along with the entity and other relevant validation details.

Return type:

dict

flattened_results_to_pd() DataFrame

Transforms the flattened validation results into a pandas DataFrame.

This function retrieves the flattened validation results stored in the instance and converts them into a pandas DataFrame. The DataFrame is then sorted by ‘entity’ and ‘row’ for organized analysis or processing.

Returns:

A DataFrame containing the sorted and indexed flattened validation results.

Return type:

pandas.DataFrame
