olaf.pipeline.pipeline_component.term_extraction package

Submodules

olaf.pipeline.pipeline_component.term_extraction.c_value_term_extraction module

class olaf.pipeline.pipeline_component.term_extraction.c_value_term_extraction.CvalueTermExtraction(candidate_term_threshold: float | None = 0.0, max_term_token_length: int | None = 5, token_sequences_doc_attribute: str | None = None, c_value_threshold: float | None = None, cts_post_processing_functions: List[Callable[[Set[CandidateTerm]], Set[CandidateTerm]]] | None = None, stop_token_list: Set[str] | None = None)[source]

Bases: TermExtractionPipelineComponent

Extract candidate terms using C-value scores computed based on the corpus.

Attributes

cts_post_processing_functions: List[Callable[[Set[CandidateTerm]], Set[CandidateTerm]]], optional

A list of candidate term post processing functions to run after candidate term extraction and before assigning the extracted candidate terms to the pipeline, by default None.

_token_sequences_doc_attributestr, optional

The name of the spaCy doc custom attribute containing the sequences of tokens to form the corpus for the c-value computation. Default is None which default to the full doc.

_candidate_term_thresholdfloat, optional

The c-value score threshold below which terms will be ignored.

_c_value_thresholdfloat, optional

The threshold used during the c-value scores computation process, by defaut 0.0.

_max_term_token_lengthint, optional

The maximum number of tokens a term can have, by defaut 5.

check_resources() None[source]

Method to check that the component has access to all its required resources.

This pipeline component does not need any access to any external resource.

get_performance_report() Dict[str, Any][source]
A getter for the pipeline component performance report.

If the component has been optimised, it only returns the best performance. Otherwise, it returns the results obtained with the set parameters.

Returns

Dict[str, Any]

The pipeline component performance report.

optimise(validation_terms: Set[str], option_values_map: Set[float]) None[source]

A method to optimise the pipeline component by tuning the options.

run(pipeline: Pipeline) None[source]

Method that is responsible for the execution of the component.

Parameters

pipeline: Pipeline

The pipeline to run the component with.

olaf.pipeline.pipeline_component.term_extraction.llm_term_extraction module

class olaf.pipeline.pipeline_component.term_extraction.llm_term_extraction.LLMTermExtraction(prompt_template: Callable[[str], List[Dict[str, str]]] | None = None, llm_generator: LLMGenerator | None = None, cts_post_processing_functions: List[Callable[[Set[CandidateTerm]], Set[CandidateTerm]]] | None = None)[source]

Bases: TermExtractionPipelineComponent

Extract candidate terms using LLM based on the corpus.

Attributes

prompt_template: Callable[[str], List[Dict[str, str]]]

Prompt template used to give instructions and context to the LLM.

llm_generator: LLMGenerator

The LLM model used to generate the candidate terms.

cts_post_processing_functions: List[Callable[[Set[CandidateTerm]], Set[CandidateTerm]]], optional

A list of candidate term post processing functions to run after candidate term extraction and before assigning the extracted candidate terms to the pipeline. Default to None.

check_resources() None[source]

Method to check that the component has access to all its required resources.

get_performance_report() Dict[str, Any][source]

A getter for the pipeline component performance report. If the component has been optimised, it only returns the best performance. Otherwise, it returns the results obtained with the set parameters.

Returns

Dict[str, Any]

The pipeline component performance report.

optimise(validation_terms: Set[str], option_values_map: Set[float]) None[source]

A method to optimise the pipeline component by tuning the configuration.

run(pipeline: Pipeline) None[source]

Method that is responsible for the execution of the component.

Parameters

pipeline: Pipeline

The pipeline to run the component with.

olaf.pipeline.pipeline_component.term_extraction.manual_candidate_terms module

class olaf.pipeline.pipeline_component.term_extraction.manual_candidate_terms.ManualCandidateTermExtraction(cts_post_processing_functions: List[Callable[[Set[CandidateTerm]], Set[CandidateTerm]]] | None = None, ct_label_strings_map: Dict[str, Set[str]] | None = None, phrase_matcher: PhraseMatcher | None = None)[source]

Bases: TermExtractionPipelineComponent

A pipeline component to manually add candidate terms.

Attributes

cts_post_processing_functions: List[Callable[[Set[CandidateTerm]], Set[CandidateTerm]]], optional

A list of candidate term post processing functions to run after candidate term extraction and before assigning the extracted candidate terms to the pipeline, by default None.

ct_label_strings_map: Dict[str, Set[str]], optional

The mapping of candidate term label and their matching strings. Optional only if a custom spaCy phrase matcher is provided.

phrase_matcher: PhraseMatcher, optional

The spaCy phrase matcher for new candidate term corpus occurrence matching. Default to matching the label provided strings.

check_resources() None[source]

Method to check that the component has access to all its required resources.

get_performance_report() Dict[str, Any][source]
A getter for the pipeline component performance report.

If the component has been optimised, it only returns the best performance. Otherwise, it returns the results obtained with the parameters set.

Returns

Dict[str, Any]

The pipeline component performance report.

optimise() None[source]

A method to optimise the pipeline component by tuning the options.

run(pipeline: Pipeline) None[source]

Execution of the candidate term extraction based manually provided strings.

Parameters

pipelinePipeline

The pipeline running.

olaf.pipeline.pipeline_component.term_extraction.pos_term_extraction module

class olaf.pipeline.pipeline_component.term_extraction.pos_term_extraction.POSTermExtraction(span_processing: Callable[[Span], str] | None = None, cts_post_processing_functions: List[Callable[[Set[CandidateTerm]], Set[CandidateTerm]]] | None = None, pos_selection: List[str] | None = ['NOUN'], token_sequences_doc_attribute: str | None = None)[source]

Bases: TermExtractionPipelineComponent

Extract candidate terms with part-of-speech (POS) tagging.

Attributes

cts_post_processing_functions: List[Callable[[Set[CandidateTerm]], Set[CandidateTerm]]], optional

A list of candidate term post processing functions to run after candidate term extraction and before assigning the extracted candidate terms to the pipeline, by default None.

span_processing: Callable[[spacy.tokens.Span],str], optional

A function to process span, by default None.

_pos_selection: List[str]; optional

List of POS tags to select in the corpus, by default [“NOUN”].

_token_sequences_doc_attribute: str, optional

Attribute indicating which sequences to use for processing. If None, the entire doc is used.

check_resources() None[source]

Method to check that the component has access to all its required resources.

This pipeline component does not need any access to any external resource.

get_performance_report() Dict[str, Any][source]
A getter for the pipeline component performance report.

If the component has been optimised, it only returns the best performance. Otherwise, it returns the results obtained with the set parameters.

Returns

Dict[str, Any]

The pipeline component performance report.

optimise(validation_terms: Set[str], option_values_map: Set[float]) None[source]

A method to optimise the pipeline component by tuning the options.

run(pipeline: Pipeline) None[source]

Execution of the POS term extraction on the corpus. Pipeline candidate terms are updated.

Parameters

pipelinePipeline

The pipeline running.

olaf.pipeline.pipeline_component.term_extraction.term_extraction_schema module

class olaf.pipeline.pipeline_component.term_extraction.term_extraction_schema.TermExtractionPipelineComponent(cts_post_processing_functions: List[Callable[[Set[CandidateTerm]], Set[CandidateTerm]]] | None = None)[source]

Bases: PipelineComponent

A pipeline component schema for term extraction tasks.

Attributes

cts_post_processing_functions: List[Callable[[Set[CandidateTerm]], Set[CandidateTerm]]], optional

A list of candidate term post processing functions to run after candidate term extraction and before assigning the extracted candidate terms to the pipeline, by default None.

apply_post_processing(candidate_terms: Set[CandidateTerm]) Set[CandidateTerm][source]

Apply candidate terms post processing functions.

Parameters

candidate_termsSet[CandidateTerm]

The set of candidate terms to post process.

Returns

Set[CandidateTerm]

The post processed set of candidate terms.

abstract check_resources() None[source]

Method to check that the component has access to all its required resources.

abstract get_performance_report() Dict[str, Any][source]
A getter for the pipeline component performance report.

If the component has been optimised, it only returns the best performance. Otherwise, it returns the results obtained with the set parameters.

Returns

Dict[str, Any]

The pipeline component performance report.

abstract optimise() None[source]

A method to optimise the pipeline component by tuning the options.

abstract run(pipeline: Pipeline) None[source]

Method that is responsible for the execution of the component.

Parameters

pipelinePipeline

The pipeline running

olaf.pipeline.pipeline_component.term_extraction.tfidf_term_extraction module

class olaf.pipeline.pipeline_component.term_extraction.tfidf_term_extraction.TFIDFTermExtraction(token_sequence_preprocessing: Callable[[Span], Tuple[str]] | None = None, token_sequences_doc_attribute: str | None = None, cts_post_processing_functions: List[Callable[[Set[CandidateTerm]], Set[CandidateTerm]]] | None = None, max_term_token_length: int | None = None, tfidf_agg_type: str | None = 'MEAN', candidate_term_threshold: float | None = None, tfidf_vectorizer: TfidfVectorizer | None = None)[source]

Bases: TermExtractionPipelineComponent

Extract candidate terms using TF-IDF based scores computed on the corpus.

Attributes

cts_post_processing_functions: List[Callable[[Set[CandidateTerm]], Set[CandidateTerm]]], optional

A list of candidate term post processing functions to run after candidate term extraction and before assigning the extracted candidate terms to the pipeline, by default None.

token_sequence_preprocessingCallable[[spacy.tokens.span.Span],Tuple[str]], optional

By default None.

_token_sequences_doc_attributestr

The name of the spaCy doc custom attribute containing the sequences of tokens to form the corpus for the c-value computation. Default is None which default to the full doc.

_max_term_token_lengthint

The maximum number of tokens a term can have, by default 1.

tfidf_agg_typeUnion[“MEAN”, “MAX”]

The operation to use to aggregate TF-IDF values of candidate terms. can be “MEAN” to aggregate by mean values or “MAX” to aggregate by max values, by default “MEAN”.

candidate_term_thresholdfloat

The TF-IDF score threshold below which terms will be ignored, by default 0.0.

_ngram_rangeTuple[int, int]

The ngram range for the TF-IDF vectorizer.

_custom_tokenizerCallable[[str], List[str]]

Tokenizer for the TF-IDF vectorizer.

tfidf_vectorizersklearn.feature_extraction.text.TfidfVectorizer, optional

The TF-IDF vectorizer to compute TF-IDF scores.

check_resources() None[source]

Method to check that the component has access to all its required resources.

This pipeline component does not need any access to any external resource.

get_performance_report() Dict[str, Any][source]
A getter for the pipeline component performance report.

If the component has been optimised, it only returns the best performance. Otherwise, it returns the results obtained with the set parameters.

Returns

Dict[str, Any]

The pipeline component performance report.

optimise(validation_terms: Set[str], option_values_map: Set[float]) None[source]

A method to optimise the pipeline component by tuning the options.

run(pipeline: Pipeline) None[source]

Method that is responsible for the execution of the component.

Parameters

pipeline: Pipeline

The pipeline to run the component with.

Module contents