gen3_validator package¶
Subpackages¶
Submodules¶
gen3_validator.dict module¶
- class gen3_validator.dict.DataDictionary(schema_path: str)¶
Bases:
object
- calculate_node_order()¶
Call the methods to get node information, node pairs, and node order.
- generate_node_lookup() dict ¶
Generate a lookup dictionary for nodes, mapping node names to their categories and properties.
- Returns:
A dictionary mapping node names to their category and properties.
- Return type:
dict
- get_all_node_pairs(excluded_nodes=['_definitions.yaml', '_terms.yaml', '_settings.yaml', 'program.yaml']) list ¶
Retrieve all node pairs, excluding specified nodes.
- Parameters:
excluded_nodes (list) – A list of node names to exclude.
- Returns:
A list of node pairs.
- Return type:
list
- get_node_category(node_name: str) tuple ¶
Retrieve the category and ID for a given node, excluding certain nodes.
- Parameters:
node_name (str) – The name of the node.
- Returns:
A tuple containing the node ID and its category, or None if the node is excluded.
- Return type:
tuple
- get_node_link(node_name: str) tuple ¶
Retrieve the links and ID for a given node.
- Parameters:
node_name (str) – The name of the node.
- Returns:
A tuple containing the node ID and its links.
- Return type:
tuple
- get_node_order(edges: list) list ¶
Determine the order of nodes based on their dependencies.
- Parameters:
edges (list) – A list of tuples, where each tuple is a node pair (upstream, downstream).
- Returns:
A list of nodes in topological order.
- Return type:
list
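The ordering idea can be sketched with the standard-library graphlib module. This is a minimal illustration under stated assumptions, not the method's actual implementation: the helper name topological_order and the sample edges are hypothetical.

```python
from graphlib import TopologicalSorter


def topological_order(edges):
    """Return node names so that each upstream node precedes its downstream nodes."""
    sorter = TopologicalSorter()
    for upstream, downstream in edges:
        # add(node, *predecessors): the downstream node depends on the upstream one.
        sorter.add(downstream, upstream)
    return list(sorter.static_order())


# Hypothetical (upstream, downstream) pairs forming a simple chain.
edges = [("program", "project"), ("project", "subject"), ("subject", "sample")]
order = topological_order(edges)
```

A cycle in the edges would make static_order() raise graphlib.CycleError, which is one reason to order nodes only after the node pairs have been checked.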
- get_node_properties(node_name: str) tuple ¶
Retrieve the properties for a given node.
- Parameters:
node_name (str) – The name of the node.
- Returns:
A tuple containing the node ID and its properties.
- Return type:
tuple
- get_nodes() list ¶
Retrieve all node names from the schema.
- Returns:
A list of node names.
- Return type:
list
- get_schema_version(schema: dict) str ¶
Extract the version of the schema from the provided schema dictionary.
- Parameters:
schema (dict) – The schema dictionary from which to extract the version.
- Returns:
The version of the schema.
- Return type:
str
- parse_schema()¶
Read the Gen3 JSON schema file, then split it into individual node schemas.
Data is stored in self.schema and self.schema_list.
- read_json(path: str) dict ¶
Read a JSON file and return its contents as a dictionary.
- Parameters:
path (str) – The path to the JSON file.
- Returns:
The contents of the JSON file.
- Return type:
dict
- return_schema(schema_id: str) dict ¶
Retrieve the first dictionary from a list where the ‘id’ key matches the schema_id.
- Parameters:
schema_id (str) – The value of the ‘id’ key to match.
- Returns:
The dictionary that matches the schema_id, or None if not found.
- Return type:
dict
- schema_list_to_json(schema_list: list) dict ¶
Convert a list of JSON schemas to a dictionary where each key is the schema id with ‘.yaml’ appended, and the value is the schema content.
- Parameters:
schema_list (list) – A list of gen3 JSON schemas.
- Returns:
A dictionary with schema ids as keys and schema contents as values.
- Return type:
dict
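The conversion described above can be sketched in a few lines. This is an illustration that assumes each schema carries an 'id' key, as the description states; the helper name schema_list_to_dict and the sample schemas are hypothetical.

```python
def schema_list_to_dict(schema_list):
    """Key each schema by its id with '.yaml' appended."""
    return {f"{schema['id']}.yaml": schema for schema in schema_list}


# Hypothetical node schemas, reduced to two fields each.
schemas = [
    {"id": "subject", "category": "clinical"},
    {"id": "sample", "category": "biospecimen"},
]
by_file = schema_list_to_dict(schemas)
```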
- split_json() list ¶
Split the schema into a list of individual node schemas.
- Returns:
A list of node schemas.
- Return type:
list
- class gen3_validator.dict.PathInfo(path: List[str], steps: int)¶
Bases:
object
Data structure representing a single path in a directed graph.
- path¶
The sequence of node names (as strings) representing the path from the root to the destination node.
- Type:
List[str]
- steps¶
The number of steps (edges) in the path. This is typically len(path) - 1.
- Type:
int
- path: List[str]¶
- steps: int¶
- gen3_validator.dict.build_graph(edges: List[Tuple[str, str]], ignore_nodes: List[str] | None = None) Tuple[Dict[str, List[str]], Set[str], Set[str]] ¶
Build an adjacency list representation of a directed graph from a list of edges.
- Parameters:
edges (List[Tuple[str, str]]) – A list of (upstream, downstream) node pairs representing directed edges in the graph.
ignore_nodes (Optional[List[str]], optional) – A list of node names to ignore when building the graph. Edges involving these nodes are skipped. Defaults to None.
- Returns:
graph (Dict[str, List[str]]) – The adjacency list representation of the graph, mapping each node to a list of its downstream neighbors.
all_nodes (Set[str]) – The set of all node names present in the graph (including both upstream and downstream nodes).
downstream_nodes (Set[str]) – The set of all nodes that appear as downstream nodes in any edge.
Examples
>>> edges = [('A', 'B'), ('B', 'C')]
>>> build_graph(edges)
({'A': ['B'], 'B': ['C']}, {'A', 'B', 'C'}, {'B', 'C'})
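For readers who want to see the data structure take shape, here is a minimal standalone sketch of the same idea. It is written from the description above rather than from the package source, and the name build_graph_sketch is hypothetical.

```python
def build_graph_sketch(edges, ignore_nodes=None):
    """Build an adjacency list plus the sets of all nodes and downstream nodes."""
    ignored = set(ignore_nodes or [])
    graph = {}
    all_nodes = set()
    downstream_nodes = set()
    for upstream, downstream in edges:
        # Skip any edge that touches an ignored node.
        if upstream in ignored or downstream in ignored:
            continue
        graph.setdefault(upstream, []).append(downstream)
        all_nodes.update((upstream, downstream))
        downstream_nodes.add(downstream)
    return graph, all_nodes, downstream_nodes


graph, all_nodes, downstream_nodes = build_graph_sketch([("A", "B"), ("B", "C")])
```

In this sketch only nodes with outgoing edges become keys of the adjacency list, so a leaf such as 'C' appears in all_nodes but not in graph.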
- gen3_validator.dict.find_all_paths(graph: Dict[str, List[str]], start_node: str, ignore_nodes: List[str] | None = None) List[List[str]] ¶
Find all possible acyclic paths starting from a given node in a directed graph.
- Parameters:
graph (Dict[str, List[str]]) – The adjacency list representation of the graph.
start_node (str) – The node from which to start searching for paths.
ignore_nodes (Optional[List[str]], optional) – A list of node names to ignore during traversal. Defaults to None.
- Returns:
A list of paths, where each path is a list of node names (strings) from the start_node to a destination node. Each path has at least two nodes (start and destination).
- Return type:
List[List[str]]
Notes
Cycles are avoided: a node is not revisited in the same path.
Nodes in ignore_nodes are not included in any path.
Examples
>>> graph = {'A': ['B', 'C'], 'B': ['C'], 'C': []}
>>> find_all_paths(graph, 'A')
[['A', 'B'], ['A', 'B', 'C'], ['A', 'C']]
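The traversal described in the notes can be sketched as a depth-first search that records every prefix it reaches. This is a hedged illustration of the technique, not the package's code; find_all_paths_sketch is a hypothetical name.

```python
def find_all_paths_sketch(graph, start_node, ignore_nodes=None):
    """Collect every acyclic path of two or more nodes that starts at start_node."""
    ignored = set(ignore_nodes or [])
    paths = []

    def walk(node, path):
        for neighbor in graph.get(node, []):
            # Skip ignored nodes, and skip nodes already on this path (cycle guard).
            if neighbor in ignored or neighbor in path:
                continue
            new_path = path + [neighbor]
            paths.append(new_path)
            walk(neighbor, new_path)

    walk(start_node, [start_node])
    return paths


graph = {"A": ["B", "C"], "B": ["C"], "C": []}
paths = find_all_paths_sketch(graph, "A")
```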
- gen3_validator.dict.find_root_node(all_nodes: Set[str], downstream_nodes: Set[str], ignore_nodes: List[str] | None = None, root_node: str | None = None) List[str] ¶
Identify the root nodes of a directed graph.
A root node is defined as a node that does not appear as a downstream node in any edge, and is not in the ignore_nodes list. If a specific root_node is provided, only that node is returned.
- Parameters:
all_nodes (Set[str]) – The set of all node names in the graph.
downstream_nodes (Set[str]) – The set of all nodes that appear as downstream nodes in any edge.
ignore_nodes (Optional[List[str]], optional) – A list of node names to ignore as possible roots. Defaults to None.
root_node (Optional[str], optional) – If provided, this node is returned as the only root node.
- Returns:
A list of root node names.
- Return type:
List[str]
Examples
>>> all_nodes = {'A', 'B', 'C'}
>>> downstream_nodes = {'B', 'C'}
>>> find_root_node(all_nodes, downstream_nodes)
['A']
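The definition of a root node reduces to a set difference. The sketch below assumes that reading of the description; the name find_root_nodes_sketch is hypothetical, and sorting the result is a choice made here only to keep the output deterministic.

```python
def find_root_nodes_sketch(all_nodes, downstream_nodes, ignore_nodes=None, root_node=None):
    """Return nodes that are never downstream of an edge, unless a root is given."""
    if root_node is not None:
        return [root_node]
    ignored = set(ignore_nodes or [])
    # Roots are nodes that no edge points to, minus anything explicitly ignored.
    return sorted(all_nodes - downstream_nodes - ignored)


roots = find_root_nodes_sketch({"A", "B", "C"}, {"B", "C"})
```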
- gen3_validator.dict.get_min_node_path(edges: list, target_node: str, ignore_nodes: list = ['core_metadata_collection'], root_node: str | None = None) PathInfo ¶
Find the shortest path from a root node (or specified root_node) to a target node in a directed graph.
- Parameters:
edges (list of tuple) – List of (upstream, downstream) node pairs representing the directed edges of the graph.
target_node (str) – The destination node for which the shortest path is sought.
ignore_nodes (list, optional) – List of node names to ignore in the graph and in path traversal. Defaults to [“core_metadata_collection”].
root_node (Optional[str], optional) – If provided, only paths starting from this node are considered as root paths.
- Returns:
The PathInfo object representing the shortest path from a root node to the target_node.
- Return type:
PathInfo
- Raises:
ValueError – If no path exists from any root node to the target_node.
Examples
>>> edges = [('A', 'B'), ('B', 'C')]
>>> get_min_node_path(edges, 'C')
PathInfo(path=['A', 'B', 'C'], steps=2)
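One way to find such a shortest path is a breadth-first search started from every root at once. The sketch below uses that technique as an assumption; the package may instead enumerate all paths and pick the minimum. PathInfoSketch and min_path_sketch are hypothetical stand-ins.

```python
from collections import deque
from dataclasses import dataclass
from typing import List


@dataclass
class PathInfoSketch:
    path: List[str]
    steps: int


def min_path_sketch(edges, target_node, ignore_nodes=("core_metadata_collection",), root_node=None):
    """Breadth-first search from the root(s); the first hit on the target is shortest."""
    ignored = set(ignore_nodes)
    graph, nodes, downstream = {}, set(), set()
    for up, down in edges:
        if up in ignored or down in ignored:
            continue
        graph.setdefault(up, []).append(down)
        nodes.update((up, down))
        downstream.add(down)
    roots = [root_node] if root_node is not None else sorted(nodes - downstream)
    queue = deque([root] for root in roots)
    seen = set(roots)
    while queue:
        path = queue.popleft()
        # Assumption: a path needs at least two nodes, so a root is not its own target.
        if path[-1] == target_node and len(path) > 1:
            return PathInfoSketch(path=path, steps=len(path) - 1)
        for nxt in graph.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    raise ValueError(f"No path from any root node to {target_node!r}")


shortest = min_path_sketch([("A", "B"), ("B", "C"), ("A", "C")], "C")
```

Here the direct edge from 'A' to 'C' wins over the two-step route through 'B', so steps is 1.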
- gen3_validator.dict.group_paths_by_destination(edges: list, ignore_nodes: list = ['core_metadata_collection'], root_node: str | None = None) Dict[str, List[PathInfo]] ¶
Find and group all possible acyclic paths in a directed graph by their destination node.
For each destination node, all unique paths from any root node (or a specified root_node) to that destination are collected, ignoring any nodes in ignore_nodes.
- Parameters:
edges (list of tuple) – List of (upstream, downstream) node pairs representing the directed edges of the graph.
ignore_nodes (list, optional) – List of node names to ignore in the graph and in path traversal. Defaults to [“core_metadata_collection”].
root_node (Optional[str], optional) – If provided, only paths starting from this node are considered as root paths.
- Returns:
A dictionary mapping each destination node name to a list of PathInfo objects, each representing a unique path from a root node to that destination.
- Return type:
Dict[str, List[PathInfo]]
Examples
>>> edges = [('A', 'B'), ('B', 'C')]
>>> group_paths_by_destination(edges)
{'B': [PathInfo(path=['A', 'B'], steps=1)], 'C': [PathInfo(path=['A', 'B', 'C'], steps=2)]}
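The grouping step can be sketched by walking from each root and filing every path under its last node. This is an illustration only: group_paths_sketch is a hypothetical name, and plain (path, steps) tuples stand in for PathInfo objects.

```python
from collections import defaultdict


def group_paths_sketch(edges, ignore_nodes=("core_metadata_collection",)):
    """Map each destination node to the (path, steps) pairs that reach it from a root."""
    ignored = set(ignore_nodes)
    graph, nodes, downstream = {}, set(), set()
    for up, down in edges:
        if up in ignored or down in ignored:
            continue
        graph.setdefault(up, []).append(down)
        nodes.update((up, down))
        downstream.add(down)

    grouped = defaultdict(list)

    def walk(path):
        for nxt in graph.get(path[-1], []):
            if nxt in path:  # never revisit a node within one path
                continue
            new_path = path + [nxt]
            grouped[nxt].append((new_path, len(new_path) - 1))
            walk(new_path)

    for root in sorted(nodes - downstream):
        walk([root])
    return dict(grouped)


grouped = group_paths_sketch([("A", "B"), ("B", "C"), ("A", "C")])
```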
gen3_validator.linkage module¶
- class gen3_validator.linkage.Linkage(root_node: List[str] | None = None)¶
Bases:
object
- generate_config(data_map, link_suffix: str = 's') dict ¶
Generates a configuration dictionary for entities based on the data map.
This method creates a configuration dictionary where each key is an entity name and the value is a dictionary containing ‘primary_key’ and ‘foreign_key’ for that entity. The primary key is constructed using the entity name and the provided link suffix. The foreign key is determined by searching for a key in the data that contains a ‘submitter_id’.
- Parameters:
data_map (dict) – A dictionary where each key is an entity name and the value is a list of data records for that entity.
link_suffix (str, optional) – A suffix to append to the primary key. Defaults to ‘s’.
- Returns:
A configuration dictionary with primary and foreign keys for each entity.
- Return type:
dict
- get_foreign_keys(data_map: Dict[str, List[Dict[str, Any]]], config: Dict[str, Any]) dict ¶
Extracts all foreign key values for each entity from the provided data map, using the foreign key field specified in the config for each entity.
- Parameters:
data_map (Dict[str, List[Dict[str, Any]]]) – A dictionary where each key is an entity name (e.g., “sample”, “subject”), and each value is a list of records (dictionaries) for that entity. Each record should contain the foreign key field as specified in the config.
config (Dict[str, Any]) – A dictionary where each key is an entity name, and each value is a dictionary containing at least the key ‘foreign_key’, which specifies the field name in the records to use as the foreign key.
- Returns:
A dictionary mapping each entity name to a list of extracted foreign key values. If an entity has no foreign key specified in the config, its value will be an empty list.
- Return type:
Dict[str, List[Any]]
- Raises:
KeyError – If an entity specified in the config is missing from the data_map.
Exception – If an unexpected error occurs during extraction for any entity.
Note
If a record’s foreign key field is missing, that record is skipped with a warning.
If the foreign key value is a dictionary containing a ‘submitter_id’, that value is used.
Otherwise, the value of the foreign key field is used directly.
If the foreign key field is None in the config, extraction is skipped for that entity.
Example:
data_map = {
    "sample": [
        {"subjects": {"submitter_id": "subject_1"}, ...},
        {"subjects": "subject_2", ...}
    ]
}
config = {
    "sample": {"foreign_key": "subjects", ...}
}
# Returns: {"sample": ["subject_1", "subject_2"]}
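The extraction rules in the note can be sketched as a small standalone function. It follows the documented behaviour as read here and is not the package's code; extract_keys_sketch and its key_name parameter are hypothetical, and the warning for skipped records is left out.

```python
def extract_keys_sketch(data_map, config, key_name="foreign_key"):
    """Collect the configured key field from every record of every entity."""
    extracted = {}
    for entity, entity_config in config.items():
        records = data_map[entity]  # raises KeyError if the entity has no data
        field = entity_config.get(key_name)
        values = []
        if field is not None:
            for record in records:
                if field not in record:
                    continue  # record lacks the field, so it is skipped
                value = record[field]
                # A nested link object contributes its 'submitter_id'.
                if isinstance(value, dict) and "submitter_id" in value:
                    value = value["submitter_id"]
                values.append(value)
        extracted[entity] = values
    return extracted


data_map = {
    "sample": [
        {"subjects": {"submitter_id": "subject_1"}},
        {"subjects": "subject_2"},
        {"other_field": "no link here"},
    ]
}
config = {"sample": {"foreign_key": "subjects"}}
foreign_keys = extract_keys_sketch(data_map, config)
```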
- get_primary_keys(data_map: Dict[str, List[Dict[str, Any]]], config: Dict[str, Any]) dict ¶
Extracts all primary key values for each entity from the provided data map, using the primary key field specified in the config for each entity.
- Parameters:
data_map (Dict[str, List[Dict[str, Any]]]) – A dictionary where each key is an entity name (e.g., “sample”, “subject”), and each value is a list of records (dictionaries) for that entity. Each record should contain the primary key field as specified in the config.
config (Dict[str, Any]) – A dictionary where each key is an entity name, and each value is a dictionary containing at least the key ‘primary_key’, which specifies the field name in the records to use as the primary key.
- Returns:
A dictionary mapping each entity name to a list of extracted primary key values. If an entity has no primary key specified in the config, its value will be an empty list.
- Return type:
Dict[str, List[Any]]
- Raises:
KeyError – If an entity specified in the config is missing from the data_map.
Exception – If an unexpected error occurs during extraction for any entity.
Note
If a record’s primary key field is missing, that record is skipped with a warning.
If the primary key value is a dictionary containing a ‘submitter_id’, that value is used.
Otherwise, the value of the primary key field is used directly.
If the primary key field is None in the config, extraction is skipped for that entity.
Example:
data_map = {
    "subject": [
        {"subjects": "subject_1", ...},
        {"subjects": {"submitter_id": "subject_2"}, ...}
    ]
}
config = {
    "subject": {"primary_key": "subjects", ...}
}
# Returns: {"subject": ["subject_1", "subject_2"]}
- test_config_links(config_map: Dict[str, Any], root_node: List[str] | None = None) dict ¶
Validates the configuration map by checking the foreign key links between entities.
This method checks if the foreign key of each entity in the config map matches the primary key of any other entity. If a match is not found and the entity is not a root node, it records the broken link. Root nodes are allowed to have unmatched foreign keys.
- Parameters:
config_map (Dict[str, Any]) – A dictionary containing the configuration of entities, where each key is an entity name and the value is a dictionary with ‘primary_key’ and ‘foreign_key’.
root_node (List[str], optional) – A list of root node names that are allowed to have unmatched foreign keys. Defaults to [‘subject’].
- Returns:
A dictionary of entities with broken links and their foreign keys if any are found. Returns “valid” if no broken links are detected.
- Return type:
dict
- Raises:
KeyError – If a required key (‘primary_key’ or ‘foreign_key’) is missing in the config for any entity.
TypeError – If config_map is not a dictionary or its values are not dictionaries.
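The link check can be sketched as a comparison of each foreign key against the primary keys of the other entities. This is a minimal illustration of the description above; check_config_links_sketch and the sample config are hypothetical.

```python
def check_config_links_sketch(config_map, root_node=None):
    """Return entities whose foreign key matches no other entity's primary key."""
    roots = set(root_node or ["subject"])
    broken = {}
    for entity, cfg in config_map.items():
        other_primary_keys = {
            other_cfg["primary_key"]
            for other, other_cfg in config_map.items()
            if other != entity
        }
        # Root nodes may point outside the config, so they are never reported.
        if cfg["foreign_key"] not in other_primary_keys and entity not in roots:
            broken[entity] = cfg["foreign_key"]
    return broken or "valid"


config_map = {
    "subject": {"primary_key": "subjects", "foreign_key": "projects"},
    "sample": {"primary_key": "samples", "foreign_key": "subjects"},
    "aliquot": {"primary_key": "aliquots", "foreign_key": "specimens"},
}
report = check_config_links_sketch(config_map)
```

Only 'aliquot' is reported, because 'specimens' is not the primary key of any other entity, while the unmatched 'projects' link on the root node 'subject' is tolerated.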
- validate_links(data_map: Dict[str, List[Dict[str, Any]]], config: Dict[str, Any], root_node: List[str] | None = None) Dict[str, List[str]] ¶
Verifies the config, extracts primary and foreign key values from the data map, then checks each foreign key value against the primary key values.
First, validates the config map for correct foreign/primary key relationships. Then, for each entity, checks that all its foreign key values exist among the primary key values of any entity. Returns a dictionary mapping each entity to a list of invalid (unmatched) foreign key values.
- Parameters:
data_map (Dict[str, List[Dict[str, Any]]]) – Contains the data for each entity.
config (Dict[str, Any]) – The entity linkage config.
root_node (List[str], optional) – List of root node names that are allowed to have unmatched foreign keys. Defaults to [‘subject’].
- Returns:
Dictionary of entities and their validation results. If the config is invalid, returns the config validation result.
- Return type:
Dict[str, List[str]]
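The value-level check can be sketched as a membership test against the pooled primary keys. It assumes the keys have already been extracted into per-entity lists, and that root nodes are reported with an empty list; both are assumptions of this sketch. The name find_unmatched_foreign_keys is hypothetical.

```python
def find_unmatched_foreign_keys(primary_keys, foreign_keys, root_node=None):
    """List, per entity, the foreign key values that match no primary key value."""
    roots = set(root_node or ["subject"])
    # Pool the primary key values of every entity into one lookup set.
    known = {value for values in primary_keys.values() for value in values}
    return {
        entity: [] if entity in roots else [fk for fk in values if fk not in known]
        for entity, values in foreign_keys.items()
    }


primary_keys = {"subject": ["subject_1", "subject_2"], "sample": ["sample_1"]}
foreign_keys = {"subject": ["project_1"], "sample": ["subject_1", "subject_9"]}
unmatched = find_unmatched_foreign_keys(primary_keys, foreign_keys)
```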
gen3_validator.resolve_schema module¶
- class gen3_validator.resolve_schema.ResolveSchema(schema_path: str)¶
Bases:
DataDictionary
- resolve_all_references() list ¶
Resolve references in all schema node dictionaries using the resolved definitions schema.
- Returns:
A list of resolved schema dictionaries, one for each node.
- Return type:
list
- resolve_references(schema: dict, reference: dict) dict ¶
Recursively resolve all $ref references in a Gen3 JSON schema draft 4 node, using a reference schema that contains no references.
- Parameters:
schema (dict) – The JSON node to resolve references in.
reference (dict) – The schema containing the reference definitions.
- Returns:
The resolved JSON node with all references resolved.
- Return type:
dict
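Recursive resolution can be sketched as a walk over nested dictionaries and lists that swaps each $ref for the definition it names. The sketch assumes every reference is a string of the form file#/key (or #/key) and that it points at a dictionary; real Gen3 dictionaries may use other shapes. resolve_refs_sketch is a hypothetical name.

```python
def resolve_refs_sketch(node, reference):
    """Return a copy of node with each string '$ref' replaced by its definition."""
    if isinstance(node, list):
        return [resolve_refs_sketch(item, reference) for item in node]
    if not isinstance(node, dict):
        return node
    resolved = {}
    for key, value in node.items():
        if key == "$ref" and isinstance(value, str):
            # Follow the fragment after '#/' one key at a time.
            target = reference
            for part in value.split("#/", 1)[-1].split("/"):
                target = target[part]
            # Merge the definition's contents in place of the '$ref' entry.
            resolved.update(resolve_refs_sketch(target, reference))
        else:
            resolved[key] = resolve_refs_sketch(value, reference)
    return resolved


# Hypothetical definitions and node schema.
reference = {"uuid": {"type": "string", "pattern": "^[a-f0-9-]{36}$"}}
schema = {
    "properties": {
        "id": {"$ref": "_definitions.yaml#/uuid"},
        "name": {"type": "string"},
    }
}
resolved = resolve_refs_sketch(schema, reference)
```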
- resolve_schema()¶
Fully resolve and initialize all schema-related attributes for this instance.
This method performs the following steps:
Reads and parses the raw schema from file.
Extracts the definitions and terms schemas.
Resolves references within the definitions schema using the terms schema.
Resolves all references in each node schema using the resolved definitions.
Converts the fully resolved node schemas into a JSON dictionary format.
After execution, the following instance attributes are set:
self.schema: Raw schema dictionary loaded from file.
self.schema_list: List of individual node schemas.
self.schema_def: Definitions schema dictionary.
self.schema_term: Terms schema dictionary.
self.schema_def_resolved: Definitions schema with references resolved.
self.schema_list_resolved: List of node schemas with all references resolved.
self.schema_resolved: Dictionary of resolved node schemas in JSON format.
- Returns:
None
- return_resolved_schema(schema_id: str) dict ¶
Retrieve the first dictionary from the resolved schema list where the id key matches schema_id.
- Parameters:
schema_id (str) – The value of the id key to match. May include or omit the .yaml extension.
- Returns:
The dictionary that matches the schema_id, or None if not found.
- Return type:
dict or None
gen3_validator.validate module¶
- class gen3_validator.validate.Validate(data_map, resolved_schema)¶
Bases:
object
The Validate class is responsible for validating data objects against a resolved JSON schema using the Draft4Validator from the jsonschema library. It provides methods to validate individual objects, manage validation results, and create key mappings for further processing.
- Variables:
data_map (dict) – A dictionary of the data objects, where the key is the entity name, and the value is a list of json objects e.g. {‘sample’: [{id: 1, name: ‘sample1’}, {id: 2, name: ‘sample2’}]}
resolved_schema (dict) – The resolved gen3 JSON schema to validate against.
Methods:
__init__(data_map, resolved_schema): Initializes the Validate class with the provided data and schema, performs validation, and creates a key map.
validate_object(obj, idx, validator): Validates a single JSON object against a provided JSON schema validator and returns a list of validation results.
validate_schema(): Validates self.data_map against self.resolved_schema and returns the results.
make_keymap(): Generates a mapping of keys from the data_map for reference and lookup.
- list_entities() list ¶
Lists all entities present in the validation results.
- Returns:
A list of entity names.
- Return type:
list
- list_index_by_entity(entity: str) list ¶
Lists all index keys for a specified entity.
- Parameters:
entity (str) – The name of the entity to list index keys for.
- Returns:
A list of index keys for the specified entity.
- Return type:
list
- make_keymap() dict ¶
Creates a dictionary that maps entities to their corresponding index keys.
- Returns:
A dictionary where each key is an entity name and each value is a list of index keys for that entity.
- Return type:
dict
- pull_entity(entity: str, result_type: str = 'FAIL') list ¶
Retrieves the validation results for a specified entity.
- Parameters:
entity (str) – The name of the entity to retrieve validation results for.
result_type (str) – The type of validation result to return. Either [“PASS”, “FAIL”, “ALL”]. Default is “FAIL”.
- Returns:
A list of validation results for the specified entity, or None if no results are found.
- Return type:
list
- pull_index_of_entity(entity: str, index_key: int, result_type: str = 'FAIL', return_failed: bool = True) list ¶
Retrieves the validation result for a specified entity and index key.
- Parameters:
entity (str) – The name of the entity to retrieve validation results for.
index_key (int) – The index key of the validation result to retrieve.
result_type (str) – The type of validation result to return. Either [“PASS”, “FAIL”, “ALL”]. Default is “FAIL”.
return_failed (bool) – Flag to determine if only failed results should be returned. Default is True.
- Returns:
List of objects containing each validation result for the specified entity and index key, or None if not found. Each element in the list corresponds to a validation result for a specific property, while the index corresponds to the entry.
- Return type:
list
- validate_object(obj, idx, validator) list ¶
Validates a single JSON object against a provided JSON schema validator.
- Parameters:
obj (dict) – The JSON object to validate.
idx (int) – The index of the object in the dataset.
validator (Draft4Validator) – The JSON schema validator to use for validation.
- Returns:
A list of dictionaries containing validation results and log messages.
- Return type:
list
- validate_schema() dict ¶
Validates the data in self.data_map against the schemas in self.resolved_schema.
- Returns:
A dictionary containing validation results for each entity.
- Return type:
dict
- class gen3_validator.validate.ValidateStats(validate_instance: Validate)¶
Bases:
Validate
- count_results_by_entity(entity: str, result_type: str = 'FAIL', print_results: bool = False) int ¶
Counts the number of validation results for a specified entity. Each entry in the entity may produce more than one validation error, and every one is counted. For example, one entry in ‘sample’ may result in 5 validation errors. This function counts the total number of validation errors for the whole entity.
- Parameters:
entity (str) – The name of the entity to count validation results for.
result_type (str) – The type of validation result to count. Either [“PASS”, “FAIL”, “ALL”]. Default is “FAIL”.
print_results (bool) – Flag to print the results. Default is False.
- Returns:
The number of validation results of the requested type for the specified entity.
- Return type:
int
- count_results_by_index(entity: str, index_key: int, result_type: str = 'FAIL', print_results: bool = False)¶
Counts the number of validation results for a specified entity and index_key. For example, if row 1 (index 1) of the entity ‘sample’ has errors in 5 columns, the method returns 5.
- Parameters:
entity (str) – The name of the entity to count validation results for.
index_key (int) – The key/index to count validation results for.
result_type (str) – The type of validation result to count. Either [“PASS”, “FAIL”, “ALL”]. Default is “FAIL”.
print_results (bool) – Flag to print the results. Default is False.
- Returns:
The number of validation results for the specified key/index.
- Return type:
int
- n_errors_per_entry(entity: str, index_key: int) int ¶
Returns the number of validation errors for a given entity and index.
- Parameters:
entity (str) – The name of the entity to check for validation errors.
index_key (int) – The index of the row to check for validation errors.
- Returns:
The number of validation errors for the given entity and index.
- Return type:
int
- n_rows_with_errors(entity: str) int ¶
Returns the number of rows that have validation errors for a given entity.
- Parameters:
entity (str) – The name of the entity to check for validation errors.
- Returns:
The number of rows with validation errors.
- Return type:
int
- summary_stats() DataFrame ¶
Generates and prints a summary of validation statistics.
This method calculates the total number of validation errors across all entities and provides detailed statistics for each entity, including the number of rows with errors and the total number of errors per entity. The results are printed to the console and returned as a pandas DataFrame.
- Returns:
A DataFrame containing the summary statistics with columns ‘entity’, ‘number_of_rows_with_errors’, and ‘number_of_errors_per_entity’.
- Return type:
pandas.DataFrame
- total_validation_errors() int ¶
Calculates the total number of validation errors across all entities.
- Returns:
The total number of validation errors.
- Return type:
int
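The difference between counting rows with errors and counting errors can be shown with plain dictionaries. The layout below, results[entity][row_index] holding a list of per-property results with a 'validation_result' field, is a hypothetical stand-in chosen for illustration, not the package's internal format. The function names are hypothetical as well.

```python
# Hypothetical layout: results[entity][row_index] -> list of per-property results.
results = {
    "sample": {
        0: [{"validation_result": "FAIL"}, {"validation_result": "FAIL"}],
        1: [{"validation_result": "PASS"}],
    },
    "subject": {
        0: [{"validation_result": "FAIL"}],
    },
}


def count_errors(results, entity):
    """Total number of failed checks across all rows of one entity."""
    return sum(
        1
        for row in results[entity].values()
        for check in row
        if check["validation_result"] == "FAIL"
    )


def count_rows_with_errors(results, entity):
    """Number of rows of one entity that contain at least one failed check."""
    return sum(
        1
        for row in results[entity].values()
        if any(check["validation_result"] == "FAIL" for check in row)
    )


total_errors = sum(count_errors(results, entity) for entity in results)
```

In this data 'sample' has two errors but only one row with errors, which is the distinction the per-entity and per-row counters draw.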
- class gen3_validator.validate.ValidateSummary(validate_instance: Validate)¶
Bases:
Validate
- collapse_flatten_results_to_pd() DataFrame ¶
Collapses the flattened validation results into a summarized pandas DataFrame.
This method groups the flattened validation results by ‘validation_error’ and aggregates other columns to provide a summary of the validation errors, including the count of occurrences for each error type.
- Returns:
A DataFrame containing the collapsed summary of validation errors, sorted by entity, validation error, and count.
- Return type:
pandas.DataFrame
- flatten_validation_results(result_type: str = 'FAIL') dict ¶
Flattens the validation results created when initializing the Validate class.
This method extracts all the validation results for each entity, each index row, and each entry in the index row. It effectively pulls all the entries for a particular entity, row, and column, where one row can produce validation errors in multiple columns.
- Parameters:
result_type (str) – The type of validation result to filter by, default is “FAIL”.
- Returns:
A dictionary containing flattened validation results with a unique GUID for each entry, along with the entity and other relevant validation details.
- Return type:
dict
- flattened_results_to_pd() DataFrame ¶
Transforms the flattened validation results into a pandas DataFrame.
This function retrieves the flattened validation results stored in the instance and converts them into a pandas DataFrame. The DataFrame is then sorted by ‘entity’ and ‘row’ for organized analysis or processing.
- Returns:
A DataFrame containing the sorted and indexed flattened validation results.
- Return type:
pandas.DataFrame