olaf.pipeline.pipeline_component.concept_relation_extraction package¶
Submodules¶
olaf.pipeline.pipeline_component.concept_relation_extraction.agglomerative_clustering_concept_extraction module¶
- class olaf.pipeline.pipeline_component.concept_relation_extraction.agglomerative_clustering_concept_extraction.AgglomerativeClusteringConceptExtraction(nb_clusters: int | None = None, metric: str | None = None, linkage: str | None = 'average', distance_threshold: float | None = None, embedding_model: str | None = None)[source]¶
Bases:
PipelineComponent
Extract concept based candidate terms with agglomerative clustering.
Attributes¶
- candidate_terms: List[CandidateTerm]
List of candidate terms to extract concepts from.
- nb_clusters: int, optional
Number of clusters to find with the agglomerative clustering algorithm. It must be None if distance_threshold is not None, by default 2.
- metric: str, optional
Metric used to compute the linkage. Can be “euclidean”, “l1”, “l2”, “manhattan”, “cosine”, or “precomputed”, by default “cosine”.
- linkage: str, optional
Distance to use between sets of observation. The algorithm will merge the pairs of cluster that minimize this criterion. Can be “ward”, “complete”, “average”, “single”, by default “average”.
- distance_threshold: float, optional
The linkage distance threshold at or above which clusters will not be merged. If not None, n_clusters must be None, by default None.
- embedding_model: str, optional
Name of the embedding model to use. The list of available models can be found here : https://www.sbert.net/docs/pretrained_models.html, by default None.
- check_resources() None [source]¶
Method to check that the component has access to all its required resources.
- get_performance_report() Dict[str, Any] [source]¶
- A getter for the pipeline component performance report.
If the component has been optimised, it only returns the best performance. Otherwise, it returns the results obtained with the parameters set.
Returns¶
- Dict[str, Any]
The pipeline component performance report.
olaf.pipeline.pipeline_component.concept_relation_extraction.agglomerative_clustering_relation_extraction module¶
- class olaf.pipeline.pipeline_component.concept_relation_extraction.agglomerative_clustering_relation_extraction.AgglomerativeClusteringRelationExtraction(nb_clusters: int | None = None, metric: str | None = None, linkage: str | None = 'average', distance_threshold: float | None = None, embedding_model: str | None = None, concept_max_distance: int | None = None, scope: str | None = 'doc')[source]¶
Bases:
PipelineComponent
Extract relation based on candidate terms with agglomerative clustering.
Attributes¶
- candidate_relations: List[CandidateRelations], optional
List of candidate relations to extract relations from, by default None.
- nb_clusters: int, optional
Number of clusters to find with the agglomerative clustering algorithm. It must be None if distance_threshold is not None, by default 2.
- metric: str, optional
Metric used to compute the linkage. Can be “euclidean”, “l1”, “l2”, “manhattan”, “cosine”, or “precomputed”, by default cosine.
- linkage: str, optional
Distance to use between sets of observation. The algorithm will merge the pairs of cluster that minimize this criterion. Can be “ward”, “complete”, “average”, “single”, by default “average”.
- distance_threshold: float, optional
The linkage distance threshold at or above which clusters will not be merged. If not None, n_clusters must be None, by default None.
- embedding_model: str, optional
Name of the embedding model to use. The list of available models can be found here : https://www.sbert.net/docs/pretrained_models.html, by default None.
- concept_max_distance: int, optional
The maximum distance between the candidate term and the concept sought, by defautl 5.
- scope: str, optional
Scope used to search concepts. Can be “doc” for the entire document or “sent” for the candidate term “sentence”, by default “doc”.
- check_resources() None [source]¶
Method to check that the component has access to all its required resources.
- get_performance_report() Dict[str, Any] [source]¶
- A getter for the pipeline component performance report.
If the component has been optimised, it only returns the best performance. Otherwise, it returns the results obtained with the parameters set.
Returns¶
- Dict[str, Any]
The pipeline component performance report.
olaf.pipeline.pipeline_component.concept_relation_extraction.candidate_terms_to_concepts module¶
- class olaf.pipeline.pipeline_component.concept_relation_extraction.candidate_terms_to_concepts.CTsToConceptExtraction[source]¶
Bases:
PipelineComponent
A pipeline component to create concepts directly from the candidate terms.
- check_resources() None [source]¶
Method to check that the component has access to all its required resources.
- get_performance_report() Dict[str, Any] [source]¶
- A getter for the pipeline component performance report.
If the component has been optimised, it only returns the best performance. Otherwise, it returns the results obtained with the parameters set.
Returns¶
- Dict[str, Any]
The pipeline component performance report.
olaf.pipeline.pipeline_component.concept_relation_extraction.candidate_terms_to_relations module¶
- class olaf.pipeline.pipeline_component.concept_relation_extraction.candidate_terms_to_relations.CTsToRelationExtraction(concept_max_distance: int | None = None, scope: str | None = 'doc')[source]¶
Bases:
PipelineComponent
A pipeline component to create relations directly from the candidate terms.
Attributes¶
- concept_max_distance: int, optional
The maximum distance between the candidate term and the concept sought, by default 5.
- scope: str, optional
Scope used to search concepts. Can be “doc” for the entire document or “sent” for the candidate term “sentence”, by default “doc”.
- check_resources() None [source]¶
Method to check that the component has access to all its required resources.
- get_performance_report() Dict[str, Any] [source]¶
- A getter for the pipeline component performance report.
If the component has been optimised, it only returns the best performance. Otherwise, it returns the results obtained with the parameters set.
Returns¶
- Dict[str, Any]
The pipeline component performance report.
- run(pipeline: Pipeline) None [source]¶
Execution of the relation extraction directly from existing candidate terms. Candidate terms are first converted into candidate relations. Then the candidate relations are converted into relations. The pipeline candidate terms are consumed.
Parameters¶
- pipelinePipeline
The pipeline running.
olaf.pipeline.pipeline_component.concept_relation_extraction.concept_cooc_metarelation_extraction module¶
- class olaf.pipeline.pipeline_component.concept_relation_extraction.concept_cooc_metarelation_extraction.ConceptCoocMetarelationExtraction(custom_metarelation_creation_metric: Callable[[int], bool] | None = None, window_size: int | None = None, threshold: int | None = None, scope: str | None = 'doc', metarelation_label: str | None = 'RELATED_TO', create_symmetric_metarelation: bool | None = False)[source]¶
Bases:
PipelineComponent
A pipeline component to extract metarelations based on concept co-occurrence.
Attributes¶
- metarelation_creation_metric: Callable[[int], bool], optional
The function to define based on the concept co-occurrence count whether or not to create a metarelation, by default co-occurrence count > self.threshold.
- window_size: int, optional
The token window size to consider for concept co-occurrence. Minimum is 2, by default None.
- threshold: int, optional
The co-occurrence minimum count threshold for metarelation construction, by default 0.
- scope: str, optional
The corpus scope to consider. Either ‘doc’ or ‘sent’, by default ‘doc’.
- metarelation_label: str, optional
The metarelation label to use, by default ‘RELATED_TO’.
- create_symmetric_metarelation: bool, optional
Whether to create the symmetric metarelation, by default False. WARNING! this option can create a lot of metarelation that can easily be created in a later process.
- check_resources() None [source]¶
Method to check that the component has access to all its required resources.
- get_performance_report() Dict[str, Any] [source]¶
- A getter for the pipeline component performance report.
If the component has been optimised, it only returns the best performance. Otherwise, it returns the results obtained with the parameters set.
Returns¶
- Dict[str, Any]
The pipeline component performance report.
olaf.pipeline.pipeline_component.concept_relation_extraction.knowledge_based_concept_extraction module¶
- class olaf.pipeline.pipeline_component.concept_relation_extraction.knowledge_based_concept_extraction.KnowledgeBasedConceptExtraction(knowledge_source: KnowledgeSource, group_ct_on_synonyms: bool | None = True)[source]¶
Bases:
PipelineComponent
Pipeline component to extract concepts based on an external source of knowledge, e.g., a KG.
Attributes¶
- knowledge_sourceKnowledgeSource
The source of knowledge to use for concept matching.
- group_ct_on_synonyms: bool, optional
Wether or not to group the candidate terms on synonyms before proceeding to the concept matching with the external source of knowledge, by default True.
- c_terms_texts_to_match(ct_group: Set[CandidateTerm]) Set[str] [source]¶
Extract from a set of candidate terms the strings to use for concept matching.
Parameters¶
- ct_groupSet[CandidateTerm]
The set of candidate terms.
Returns¶
- Set[str]
The set of strings to use for concept matching.
- check_resources() None [source]¶
Method to check that the component has access to all its required resources.
- get_performance_report() Dict[str, Any] [source]¶
- A getter for the pipeline component performance report.
If the component has been optimised, it only returns the best performance. Otherwise, it returns the results obtained with the set parameters.
Returns¶
- Dict[str, Any]
The pipeline component performance report.
olaf.pipeline.pipeline_component.concept_relation_extraction.knowledge_based_relation_extraction module¶
- class olaf.pipeline.pipeline_component.concept_relation_extraction.knowledge_based_relation_extraction.KnowledgeBasedRelationExtraction(knowledge_source: KnowledgeSource, group_ct_on_synonyms: bool | None = True, concept_max_distance: int | None = None, scope: str | None = 'doc')[source]¶
Bases:
PipelineComponent
Pipeline component to extract relations based on an external source of knowledge, e.g., a KG. Candidate terms are converted into candidate relations. Then, candidate relations are validated as relations if their labels match the external source of knowledge.
Attributes¶
- knowledge_sourceKnowledgeSource
The source of knowledge to use for relation matching.
- group_ct_on_synonyms: bool, optional
Whether or not to group the candidate terms on synonyms before proceeding to the relation matching with the external source of knowledge, by default True.
- concept_max_distance: int, optional
The maximum distance between the candidate term and the concept sought, by default 5.
- scope: str
Scope used to search concepts. Can be “doc” for the entire document or “sent” for the candidate term “sentence”, by default “doc”.
- c_terms_texts_to_match(cr_group: Set[CandidateRelation]) Set[str] [source]¶
Extract from a set of candidate relations the strings to use for concept matching.
Parameters¶
- cr_groupSet[CandidateRelation]
The set of candidate relations.
Returns¶
- Set[str]
The set of strings to use for relation matching.
- check_resources() None [source]¶
Method to check that the component has access to all its required resources.
- get_performance_report() Dict[str, Any] [source]¶
- A getter for the pipeline component performance report.
If the component has been optimised, it only returns the best performance. Otherwise, it returns the results obtained with the set parameters.
Returns¶
- Dict[str, Any]
The pipeline component performance report.
olaf.pipeline.pipeline_component.concept_relation_extraction.llm_based_concept_extraction module¶
- class olaf.pipeline.pipeline_component.concept_relation_extraction.llm_based_concept_extraction.LLMBasedConceptExtraction(prompt_template: Callable[[str], List[Dict[str, str]]] | None = None, llm_generator: LLMGenerator | None = None, doc_context_max_len: int | None = 4000)[source]¶
Bases:
PipelineComponent
LLM based concept extraction.
Attributes¶
- prompt_template: Callable[[str], List[Dict[str, str]]]
Prompt template used to give instructions and context to the LLM.
- llm_generator: LLMGenerator
The LLM model used to generate the concepts.
- doc_context_max_len: int
Maximum number of characters for the document context in the prompt.
- check_resources() None [source]¶
Method to check that the component has access to all its required resources.
- get_performance_report() Dict[str, Any] [source]¶
A getter for the pipeline component performance report. If the component has been optimised, it only returns the best performance. Otherwise, it returns the results obtained with the set parameters.
Returns¶
- Dict[str, Any]
The pipeline component performance report.
olaf.pipeline.pipeline_component.concept_relation_extraction.llm_based_relation_extraction module¶
- class olaf.pipeline.pipeline_component.concept_relation_extraction.llm_based_relation_extraction.LLMBasedRelationExtraction(prompt_template: Callable[[str], List[Dict[str, str]]] | None = None, llm_generator: LLMGenerator | None = None, doc_context_max_len: int | None = 4000, concept_max_distance: int | None = None, scope: str | None = 'doc')[source]¶
Bases:
PipelineComponent
LLM based relation extraction.
Attributes¶
- prompt_template: Callable[[str], List[Dict[str, str]]], optional
Prompt template used to give instructions and context to the LLM, by default None.
- llm_generator: LLMGenerator, optional
The LLM model used to generate the relation, by default None.
- doc_context_max_len: int, optional
Maximum number of characters for the document context in the prompt, by default 4000.
- concept_max_distance: int, optional
The maximum distance between the candidate term and the concept sought, by default 5.
- scope: str, optional
Scope used to search concepts. Can be “doc” for the entire document or “sent” for the candidate term “sentence”, by default “doc”.
- check_resources() None [source]¶
Method to check that the component has access to all its required resources.
- get_performance_report() Dict[str, Any] [source]¶
A getter for the pipeline component performance report. If the component has been optimised, it only returns the best performance. Otherwise, it returns the results obtained with the set parameters.
Returns¶
- Dict[str, Any]
The pipeline component performance report.
olaf.pipeline.pipeline_component.concept_relation_extraction.synonym_concept_extraction module¶
- class olaf.pipeline.pipeline_component.concept_relation_extraction.synonym_concept_extraction.SynonymConceptExtraction[source]¶
Bases:
PipelineComponent
Extract concepts based on synonyms grouping.
- check_resources() None [source]¶
Method to check that the component has access to all its required resources.
- get_performance_report() Dict[str, Any] [source]¶
- A getter for the pipeline component performance report.
If the component has been optimised, it only returns the best performance. Otherwise, it returns the results obtained with the parameters set.
Returns¶
- Dict[str, Any]
The pipeline component performance report.
olaf.pipeline.pipeline_component.concept_relation_extraction.synonym_relation_extraction module¶
- class olaf.pipeline.pipeline_component.concept_relation_extraction.synonym_relation_extraction.SynonymRelationExtraction(concept_max_distance: int | None = None, scope: str | None = 'doc')[source]¶
Bases:
PipelineComponent
Extract relations based on synonyms grouping.
Attributes¶
- concept_max_distance: int, optional
The maximum distance between the candidate term and the concept sought, by default 5.
- scope: str
Scope used to search concepts. Can be “doc” for the entire document or “sent” for the candidate term “sentence”, by default “doc”.
- check_resources() None [source]¶
Method to check that the component has access to all its required resources.
- get_performance_report() Dict[str, Any] [source]¶
- A getter for the pipeline component performance report.
If the component has been optimised, it only returns the best performance. Otherwise, it returns the results obtained with the parameters set.
Returns¶
- Dict[str, Any]
The pipeline component performance report.
- run(pipeline: Pipeline) None [source]¶
Execution of the synonyms grouping for relation extraction on candidate terms. Candidate terms are converted into candidate relations. Candidate relations with same synonyms, source and destination concepts are grouped together as a new relation. Candidate terms are purged.
Parameters¶
- pipelinePipeline
The pipeline running.