olaf.pipeline.pipeline_component.concept_relation_extraction package

Submodules

olaf.pipeline.pipeline_component.concept_relation_extraction.agglomerative_clustering_concept_extraction module

class olaf.pipeline.pipeline_component.concept_relation_extraction.agglomerative_clustering_concept_extraction.AgglomerativeClusteringConceptExtraction(nb_clusters: int | None = None, metric: str | None = None, linkage: str | None = 'average', distance_threshold: float | None = None, embedding_model: str | None = None)[source]

Bases: PipelineComponent

Extract concept based candidate terms with agglomerative clustering.

Attributes

candidate_terms: List[CandidateTerm]

List of candidate terms to extract concepts from.

nb_clusters: int, optional

Number of clusters to find with the agglomerative clustering algorithm. It must be None if distance_threshold is not None, by default 2.

metric: str, optional

Metric used to compute the linkage. Can be “euclidean”, “l1”, “l2”, “manhattan”, “cosine”, or “precomputed”, by default “cosine”.

linkage: str, optional

Distance to use between sets of observation. The algorithm will merge the pairs of cluster that minimize this criterion. Can be “ward”, “complete”, “average”, “single”, by default “average”.

distance_threshold: float, optional

The linkage distance threshold at or above which clusters will not be merged. If not None, n_clusters must be None, by default None.

embedding_model: str, optional

Name of the embedding model to use. The list of available models can be found here : https://www.sbert.net/docs/pretrained_models.html, by default None.

check_resources() None[source]

Method to check that the component has access to all its required resources.

get_performance_report() Dict[str, Any][source]
A getter for the pipeline component performance report.

If the component has been optimised, it only returns the best performance. Otherwise, it returns the results obtained with the parameters set.

Returns

Dict[str, Any]

The pipeline component performance report.

optimise() None[source]

A method to optimise the pipeline component by tuning the options.

run(pipeline: Pipeline) None[source]

Execution of the agglomerative clustering algorithm on candidate terms embedded. Concepts creation and candidate terms purge.

Parameters

pipelinePipeline

The pipeline running.

olaf.pipeline.pipeline_component.concept_relation_extraction.agglomerative_clustering_relation_extraction module

class olaf.pipeline.pipeline_component.concept_relation_extraction.agglomerative_clustering_relation_extraction.AgglomerativeClusteringRelationExtraction(nb_clusters: int | None = None, metric: str | None = None, linkage: str | None = 'average', distance_threshold: float | None = None, embedding_model: str | None = None, concept_max_distance: int | None = None, scope: str | None = 'doc')[source]

Bases: PipelineComponent

Extract relation based on candidate terms with agglomerative clustering.

Attributes

candidate_relations: List[CandidateRelations], optional

List of candidate relations to extract relations from, by default None.

nb_clusters: int, optional

Number of clusters to find with the agglomerative clustering algorithm. It must be None if distance_threshold is not None, by default 2.

metric: str, optional

Metric used to compute the linkage. Can be “euclidean”, “l1”, “l2”, “manhattan”, “cosine”, or “precomputed”, by default cosine.

linkage: str, optional

Distance to use between sets of observation. The algorithm will merge the pairs of cluster that minimize this criterion. Can be “ward”, “complete”, “average”, “single”, by default “average”.

distance_threshold: float, optional

The linkage distance threshold at or above which clusters will not be merged. If not None, n_clusters must be None, by default None.

embedding_model: str, optional

Name of the embedding model to use. The list of available models can be found here : https://www.sbert.net/docs/pretrained_models.html, by default None.

concept_max_distance: int, optional

The maximum distance between the candidate term and the concept sought, by defautl 5.

scope: str, optional

Scope used to search concepts. Can be “doc” for the entire document or “sent” for the candidate term “sentence”, by default “doc”.

check_resources() None[source]

Method to check that the component has access to all its required resources.

get_performance_report() Dict[str, Any][source]
A getter for the pipeline component performance report.

If the component has been optimised, it only returns the best performance. Otherwise, it returns the results obtained with the parameters set.

Returns

Dict[str, Any]

The pipeline component performance report.

optimise() None[source]

A method to optimise the pipeline component by tuning the options.

run(pipeline: Pipeline) None[source]

Execution of the agglomerative clustering algorithm on candidate terms embedded. Relations creation and candidate terms purge.

Parameters

pipelinePipeline

The pipeline running.

olaf.pipeline.pipeline_component.concept_relation_extraction.candidate_terms_to_concepts module

class olaf.pipeline.pipeline_component.concept_relation_extraction.candidate_terms_to_concepts.CTsToConceptExtraction[source]

Bases: PipelineComponent

A pipeline component to create concepts directly from the candidate terms.

check_resources() None[source]

Method to check that the component has access to all its required resources.

get_performance_report() Dict[str, Any][source]
A getter for the pipeline component performance report.

If the component has been optimised, it only returns the best performance. Otherwise, it returns the results obtained with the parameters set.

Returns

Dict[str, Any]

The pipeline component performance report.

optimise() None[source]

A method to optimise the pipeline component by tuning the options.

run(pipeline: Pipeline) None[source]

Execution of the concept extraction directly from existing candidate terms. The pipeline candidate terms are consumed.

Parameters

pipelinePipeline

The pipeline running.

olaf.pipeline.pipeline_component.concept_relation_extraction.candidate_terms_to_relations module

class olaf.pipeline.pipeline_component.concept_relation_extraction.candidate_terms_to_relations.CTsToRelationExtraction(concept_max_distance: int | None = None, scope: str | None = 'doc')[source]

Bases: PipelineComponent

A pipeline component to create relations directly from the candidate terms.

Attributes

concept_max_distance: int, optional

The maximum distance between the candidate term and the concept sought, by default 5.

scope: str, optional

Scope used to search concepts. Can be “doc” for the entire document or “sent” for the candidate term “sentence”, by default “doc”.

check_resources() None[source]

Method to check that the component has access to all its required resources.

get_performance_report() Dict[str, Any][source]
A getter for the pipeline component performance report.

If the component has been optimised, it only returns the best performance. Otherwise, it returns the results obtained with the parameters set.

Returns

Dict[str, Any]

The pipeline component performance report.

optimise() None[source]

A method to optimise the pipeline component by tuning the options.

run(pipeline: Pipeline) None[source]

Execution of the relation extraction directly from existing candidate terms. Candidate terms are first converted into candidate relations. Then the candidate relations are converted into relations. The pipeline candidate terms are consumed.

Parameters

pipelinePipeline

The pipeline running.

olaf.pipeline.pipeline_component.concept_relation_extraction.concept_cooc_metarelation_extraction module

class olaf.pipeline.pipeline_component.concept_relation_extraction.concept_cooc_metarelation_extraction.ConceptCoocMetarelationExtraction(custom_metarelation_creation_metric: Callable[[int], bool] | None = None, window_size: int | None = None, threshold: int | None = None, scope: str | None = 'doc', metarelation_label: str | None = 'RELATED_TO', create_symmetric_metarelation: bool | None = False)[source]

Bases: PipelineComponent

A pipeline component to extract metarelations based on concept co-occurrence.

Attributes

metarelation_creation_metric: Callable[[int], bool], optional

The function to define based on the concept co-occurrence count whether or not to create a metarelation, by default co-occurrence count > self.threshold.

window_size: int, optional

The token window size to consider for concept co-occurrence. Minimum is 2, by default None.

threshold: int, optional

The co-occurrence minimum count threshold for metarelation construction, by default 0.

scope: str, optional

The corpus scope to consider. Either ‘doc’ or ‘sent’, by default ‘doc’.

metarelation_label: str, optional

The metarelation label to use, by default ‘RELATED_TO’.

create_symmetric_metarelation: bool, optional

Whether to create the symmetric metarelation, by default False. WARNING! this option can create a lot of metarelation that can easily be created in a later process.

check_resources() None[source]

Method to check that the component has access to all its required resources.

get_performance_report() Dict[str, Any][source]
A getter for the pipeline component performance report.

If the component has been optimised, it only returns the best performance. Otherwise, it returns the results obtained with the parameters set.

Returns

Dict[str, Any]

The pipeline component performance report.

optimise() None[source]

A method to optimise the pipeline component by tuning the options.

run(pipeline: Pipeline) None[source]

Execution of the metarelation extraction based on concept co-occurrence. Metarelations are created and added to the pipeline knowledge representation.

Parameters

pipelinePipeline

The pipeline running.

olaf.pipeline.pipeline_component.concept_relation_extraction.knowledge_based_concept_extraction module

class olaf.pipeline.pipeline_component.concept_relation_extraction.knowledge_based_concept_extraction.KnowledgeBasedConceptExtraction(knowledge_source: KnowledgeSource, group_ct_on_synonyms: bool | None = True)[source]

Bases: PipelineComponent

Pipeline component to extract concepts based on an external source of knowledge, e.g., a KG.

Attributes

knowledge_sourceKnowledgeSource

The source of knowledge to use for concept matching.

group_ct_on_synonyms: bool, optional

Wether or not to group the candidate terms on synonyms before proceeding to the concept matching with the external source of knowledge, by default True.

c_terms_texts_to_match(ct_group: Set[CandidateTerm]) Set[str][source]

Extract from a set of candidate terms the strings to use for concept matching.

Parameters

ct_groupSet[CandidateTerm]

The set of candidate terms.

Returns

Set[str]

The set of strings to use for concept matching.

check_resources() None[source]

Method to check that the component has access to all its required resources.

get_performance_report() Dict[str, Any][source]
A getter for the pipeline component performance report.

If the component has been optimised, it only returns the best performance. Otherwise, it returns the results obtained with the set parameters.

Returns

Dict[str, Any]

The pipeline component performance report.

optimise() None[source]

A method to optimise the pipeline component by tuning the options.

run(pipeline: Pipeline) None[source]

Method that is responsible for the execution of the component.

Parameters

pipelinePipeline

The pipeline running.

olaf.pipeline.pipeline_component.concept_relation_extraction.knowledge_based_relation_extraction module

class olaf.pipeline.pipeline_component.concept_relation_extraction.knowledge_based_relation_extraction.KnowledgeBasedRelationExtraction(knowledge_source: KnowledgeSource, group_ct_on_synonyms: bool | None = True, concept_max_distance: int | None = None, scope: str | None = 'doc')[source]

Bases: PipelineComponent

Pipeline component to extract relations based on an external source of knowledge, e.g., a KG. Candidate terms are converted into candidate relations. Then, candidate relations are validated as relations if their labels match the external source of knowledge.

Attributes

knowledge_sourceKnowledgeSource

The source of knowledge to use for relation matching.

group_ct_on_synonyms: bool, optional

Whether or not to group the candidate terms on synonyms before proceeding to the relation matching with the external source of knowledge, by default True.

concept_max_distance: int, optional

The maximum distance between the candidate term and the concept sought, by default 5.

scope: str

Scope used to search concepts. Can be “doc” for the entire document or “sent” for the candidate term “sentence”, by default “doc”.

c_terms_texts_to_match(cr_group: Set[CandidateRelation]) Set[str][source]

Extract from a set of candidate relations the strings to use for concept matching.

Parameters

cr_groupSet[CandidateRelation]

The set of candidate relations.

Returns

Set[str]

The set of strings to use for relation matching.

check_resources() None[source]

Method to check that the component has access to all its required resources.

get_performance_report() Dict[str, Any][source]
A getter for the pipeline component performance report.

If the component has been optimised, it only returns the best performance. Otherwise, it returns the results obtained with the set parameters.

Returns

Dict[str, Any]

The pipeline component performance report.

optimise() None[source]

A method to optimise the pipeline component by tuning the options.

run(pipeline: Pipeline) None[source]

Method that is responsible for the execution of the component.

Parameters

pipelinePipeline

The pipeline running.

olaf.pipeline.pipeline_component.concept_relation_extraction.llm_based_concept_extraction module

class olaf.pipeline.pipeline_component.concept_relation_extraction.llm_based_concept_extraction.LLMBasedConceptExtraction(prompt_template: Callable[[str], List[Dict[str, str]]] | None = None, llm_generator: LLMGenerator | None = None, doc_context_max_len: int | None = 4000)[source]

Bases: PipelineComponent

LLM based concept extraction.

Attributes

prompt_template: Callable[[str], List[Dict[str, str]]]

Prompt template used to give instructions and context to the LLM.

llm_generator: LLMGenerator

The LLM model used to generate the concepts.

doc_context_max_len: int

Maximum number of characters for the document context in the prompt.

check_resources() None[source]

Method to check that the component has access to all its required resources.

get_performance_report() Dict[str, Any][source]

A getter for the pipeline component performance report. If the component has been optimised, it only returns the best performance. Otherwise, it returns the results obtained with the set parameters.

Returns

Dict[str, Any]

The pipeline component performance report.

optimise(validation_terms: Set[str], option_values_map: Set[float]) None[source]

A method to optimise the pipeline component by tuning the configuration.

run(pipeline: Pipeline) None[source]

Method that is responsible for the execution of the component. Concepts are created and candidate terms are purged.

Parameters

pipeline: Pipeline

The pipeline to run the component with.

olaf.pipeline.pipeline_component.concept_relation_extraction.llm_based_relation_extraction module

class olaf.pipeline.pipeline_component.concept_relation_extraction.llm_based_relation_extraction.LLMBasedRelationExtraction(prompt_template: Callable[[str], List[Dict[str, str]]] | None = None, llm_generator: LLMGenerator | None = None, doc_context_max_len: int | None = 4000, concept_max_distance: int | None = None, scope: str | None = 'doc')[source]

Bases: PipelineComponent

LLM based relation extraction.

Attributes

prompt_template: Callable[[str], List[Dict[str, str]]], optional

Prompt template used to give instructions and context to the LLM, by default None.

llm_generator: LLMGenerator, optional

The LLM model used to generate the relation, by default None.

doc_context_max_len: int, optional

Maximum number of characters for the document context in the prompt, by default 4000.

concept_max_distance: int, optional

The maximum distance between the candidate term and the concept sought, by default 5.

scope: str, optional

Scope used to search concepts. Can be “doc” for the entire document or “sent” for the candidate term “sentence”, by default “doc”.

check_resources() None[source]

Method to check that the component has access to all its required resources.

get_performance_report() Dict[str, Any][source]

A getter for the pipeline component performance report. If the component has been optimised, it only returns the best performance. Otherwise, it returns the results obtained with the set parameters.

Returns

Dict[str, Any]

The pipeline component performance report.

optimise(validation_terms: Set[str], option_values_map: Set[float]) None[source]

A method to optimise the pipeline component by tuning the configuration.

run(pipeline: Pipeline) None[source]

Method that is responsible for the execution of the component. Relations are created and candidate terms are purged.

Parameters

pipeline: Pipeline

The pipeline to run the component with.

olaf.pipeline.pipeline_component.concept_relation_extraction.synonym_concept_extraction module

class olaf.pipeline.pipeline_component.concept_relation_extraction.synonym_concept_extraction.SynonymConceptExtraction[source]

Bases: PipelineComponent

Extract concepts based on synonyms grouping.

check_resources() None[source]

Method to check that the component has access to all its required resources.

get_performance_report() Dict[str, Any][source]
A getter for the pipeline component performance report.

If the component has been optimised, it only returns the best performance. Otherwise, it returns the results obtained with the parameters set.

Returns

Dict[str, Any]

The pipeline component performance report.

optimise() None[source]

A method to optimise the pipeline component by tuning the options.

run(pipeline: Pipeline) None[source]

Execution of the synonyms grouping for concept extraction on candidate terms. Concepts are created and candidate terms are purged.

Parameters

pipelinePipeline

The pipeline running.

olaf.pipeline.pipeline_component.concept_relation_extraction.synonym_relation_extraction module

class olaf.pipeline.pipeline_component.concept_relation_extraction.synonym_relation_extraction.SynonymRelationExtraction(concept_max_distance: int | None = None, scope: str | None = 'doc')[source]

Bases: PipelineComponent

Extract relations based on synonyms grouping.

Attributes

concept_max_distance: int, optional

The maximum distance between the candidate term and the concept sought, by default 5.

scope: str

Scope used to search concepts. Can be “doc” for the entire document or “sent” for the candidate term “sentence”, by default “doc”.

check_resources() None[source]

Method to check that the component has access to all its required resources.

get_performance_report() Dict[str, Any][source]
A getter for the pipeline component performance report.

If the component has been optimised, it only returns the best performance. Otherwise, it returns the results obtained with the parameters set.

Returns

Dict[str, Any]

The pipeline component performance report.

optimise() None[source]

A method to optimise the pipeline component by tuning the options.

run(pipeline: Pipeline) None[source]

Execution of the synonyms grouping for relation extraction on candidate terms. Candidate terms are converted into candidate relations. Candidate relations with same synonyms, source and destination concepts are grouped together as a new relation. Candidate terms are purged.

Parameters

pipelinePipeline

The pipeline running.

Module contents