olaf.pipeline.pipeline_component.term_extraction package¶

Submodules¶

olaf.pipeline.pipeline_component.term_extraction.c_value_term_extraction module¶

class olaf.pipeline.pipeline_component.term_extraction.c_value_term_extraction.CvalueTermExtraction(candidate_term_threshold: float | None = 0.0, max_term_token_length: int | None = 5, token_sequences_doc_attribute: str | None = None, c_value_threshold: float | None = None, cts_post_processing_functions: List[Callable[[Set[CandidateTerm]], Set[CandidateTerm]]] | None = None, stop_token_list: Set[str] | None = None)[source]¶

Bases: TermExtractionPipelineComponent

Extract candidate terms using C-value scores computed based on the corpus.

Attributes¶

cts_post_processing_functions: List[Callable[[Set[CandidateTerm]], Set[CandidateTerm]]], optional: A list of candidate term post processing functions to run after candidate term extraction and before assigning the extracted candidate terms to the pipeline, by default None.
_token_sequences_doc_attributestr, optional: The name of the spaCy doc custom attribute containing the sequences of tokens to form the corpus for the c-value computation. Default is None which default to the full doc.
_candidate_term_thresholdfloat, optional: The c-value score threshold below which terms will be ignored.
_c_value_thresholdfloat, optional: The threshold used during the c-value scores computation process, by defaut 0.0.
_max_term_token_lengthint, optional: The maximum number of tokens a term can have, by defaut 5.

check_resources() → None[source]¶

Method to check that the component has access to all its required resources.

This pipeline component does not need any access to any external resource.

get_performance_report() → Dict[str, Any][source]¶

A getter for the pipeline component performance report.: If the component has been optimised, it only returns the best performance. Otherwise, it returns the results obtained with the set parameters.

Returns¶

Dict[str, Any]: The pipeline component performance report.

optimise(validation_terms: Set[str], option_values_map: Set[float]) → None[source]¶: A method to optimise the pipeline component by tuning the options.

run(pipeline: Pipeline) → None[source]¶

Method that is responsible for the execution of the component.

Parameters¶

pipeline: Pipeline: The pipeline to run the component with.

olaf.pipeline.pipeline_component.term_extraction.llm_term_extraction module¶

class olaf.pipeline.pipeline_component.term_extraction.llm_term_extraction.LLMTermExtraction(prompt_template: Callable[[str], List[Dict[str, str]]] | None = None, llm_generator: LLMGenerator | None = None, cts_post_processing_functions: List[Callable[[Set[CandidateTerm]], Set[CandidateTerm]]] | None = None)[source]¶

Bases: TermExtractionPipelineComponent

Extract candidate terms using LLM based on the corpus.

Attributes¶

prompt_template: Callable[[str], List[Dict[str, str]]]: Prompt template used to give instructions and context to the LLM.
llm_generator: LLMGenerator: The LLM model used to generate the candidate terms.
cts_post_processing_functions: List[Callable[[Set[CandidateTerm]], Set[CandidateTerm]]], optional: A list of candidate term post processing functions to run after candidate term extraction and before assigning the extracted candidate terms to the pipeline. Default to None.

check_resources() → None[source]¶: Method to check that the component has access to all its required resources.

get_performance_report() → Dict[str, Any][source]¶

A getter for the pipeline component performance report. If the component has been optimised, it only returns the best performance. Otherwise, it returns the results obtained with the set parameters.

Returns¶

Dict[str, Any]: The pipeline component performance report.

optimise(validation_terms: Set[str], option_values_map: Set[float]) → None[source]¶: A method to optimise the pipeline component by tuning the configuration.

run(pipeline: Pipeline) → None[source]¶

Method that is responsible for the execution of the component.

Parameters¶

pipeline: Pipeline: The pipeline to run the component with.

olaf.pipeline.pipeline_component.term_extraction.manual_candidate_terms module¶

class olaf.pipeline.pipeline_component.term_extraction.manual_candidate_terms.ManualCandidateTermExtraction(cts_post_processing_functions: List[Callable[[Set[CandidateTerm]], Set[CandidateTerm]]] | None = None, ct_label_strings_map: Dict[str, Set[str]] | None = None, phrase_matcher: PhraseMatcher | None = None)[source]¶

Bases: TermExtractionPipelineComponent

A pipeline component to manually add candidate terms.

Attributes¶

cts_post_processing_functions: List[Callable[[Set[CandidateTerm]], Set[CandidateTerm]]], optional: A list of candidate term post processing functions to run after candidate term extraction and before assigning the extracted candidate terms to the pipeline, by default None.
ct_label_strings_map: Dict[str, Set[str]], optional: The mapping of candidate term label and their matching strings. Optional only if a custom spaCy phrase matcher is provided.
phrase_matcher: PhraseMatcher, optional: The spaCy phrase matcher for new candidate term corpus occurrence matching. Default to matching the label provided strings.

check_resources() → None[source]¶: Method to check that the component has access to all its required resources.

get_performance_report() → Dict[str, Any][source]¶

A getter for the pipeline component performance report.: If the component has been optimised, it only returns the best performance. Otherwise, it returns the results obtained with the parameters set.

Returns¶

Dict[str, Any]: The pipeline component performance report.

optimise() → None[source]¶: A method to optimise the pipeline component by tuning the options.

run(pipeline: Pipeline) → None[source]¶

Execution of the candidate term extraction based manually provided strings.

Parameters¶

pipelinePipeline: The pipeline running.

olaf.pipeline.pipeline_component.term_extraction.pos_term_extraction module¶

class olaf.pipeline.pipeline_component.term_extraction.pos_term_extraction.POSTermExtraction(span_processing: Callable[[Span], str] | None = None, cts_post_processing_functions: List[Callable[[Set[CandidateTerm]], Set[CandidateTerm]]] | None = None, pos_selection: List[str] | None = ['NOUN'], token_sequences_doc_attribute: str | None = None)[source]¶

Bases: TermExtractionPipelineComponent

Extract candidate terms with part-of-speech (POS) tagging.

Attributes¶

cts_post_processing_functions: List[Callable[[Set[CandidateTerm]], Set[CandidateTerm]]], optional: A list of candidate term post processing functions to run after candidate term extraction and before assigning the extracted candidate terms to the pipeline, by default None.
span_processing: Callable[[spacy.tokens.Span],str], optional: A function to process span, by default None.
_pos_selection: List[str]; optional: List of POS tags to select in the corpus, by default [“NOUN”].
_token_sequences_doc_attribute: str, optional: Attribute indicating which sequences to use for processing. If None, the entire doc is used.

check_resources() → None[source]¶

Method to check that the component has access to all its required resources.

This pipeline component does not need any access to any external resource.

get_performance_report() → Dict[str, Any][source]¶

A getter for the pipeline component performance report.: If the component has been optimised, it only returns the best performance. Otherwise, it returns the results obtained with the set parameters.

Returns¶

Dict[str, Any]: The pipeline component performance report.

optimise(validation_terms: Set[str], option_values_map: Set[float]) → None[source]¶: A method to optimise the pipeline component by tuning the options.

run(pipeline: Pipeline) → None[source]¶

Execution of the POS term extraction on the corpus. Pipeline candidate terms are updated.

Parameters¶

pipelinePipeline: The pipeline running.

olaf.pipeline.pipeline_component.term_extraction.term_extraction_schema module¶

class olaf.pipeline.pipeline_component.term_extraction.term_extraction_schema.TermExtractionPipelineComponent(cts_post_processing_functions: List[Callable[[Set[CandidateTerm]], Set[CandidateTerm]]] | None = None)[source]¶

Bases: PipelineComponent

A pipeline component schema for term extraction tasks.

Attributes¶

cts_post_processing_functions: List[Callable[[Set[CandidateTerm]], Set[CandidateTerm]]], optional: A list of candidate term post processing functions to run after candidate term extraction and before assigning the extracted candidate terms to the pipeline, by default None.

apply_post_processing(candidate_terms: Set[CandidateTerm]) → Set[CandidateTerm][source]¶

Apply candidate terms post processing functions.

Parameters¶

candidate_termsSet[CandidateTerm]: The set of candidate terms to post process.

Returns¶

Set[CandidateTerm]: The post processed set of candidate terms.

abstract check_resources() → None[source]¶: Method to check that the component has access to all its required resources.

abstract get_performance_report() → Dict[str, Any][source]¶

A getter for the pipeline component performance report.: If the component has been optimised, it only returns the best performance. Otherwise, it returns the results obtained with the set parameters.

Returns¶

Dict[str, Any]: The pipeline component performance report.

abstract optimise() → None[source]¶: A method to optimise the pipeline component by tuning the options.

abstract run(pipeline: Pipeline) → None[source]¶

Method that is responsible for the execution of the component.

Parameters¶

pipelinePipeline: The pipeline running

olaf.pipeline.pipeline_component.term_extraction.tfidf_term_extraction module¶

class olaf.pipeline.pipeline_component.term_extraction.tfidf_term_extraction.TFIDFTermExtraction(token_sequence_preprocessing: Callable[[Span], Tuple[str]] | None = None, token_sequences_doc_attribute: str | None = None, cts_post_processing_functions: List[Callable[[Set[CandidateTerm]], Set[CandidateTerm]]] | None = None, max_term_token_length: int | None = None, tfidf_agg_type: str | None = 'MEAN', candidate_term_threshold: float | None = None, tfidf_vectorizer: TfidfVectorizer | None = None)[source]¶

Bases: TermExtractionPipelineComponent

Extract candidate terms using TF-IDF based scores computed on the corpus.

Attributes¶

cts_post_processing_functions: List[Callable[[Set[CandidateTerm]], Set[CandidateTerm]]], optional: A list of candidate term post processing functions to run after candidate term extraction and before assigning the extracted candidate terms to the pipeline, by default None.
token_sequence_preprocessingCallable[[spacy.tokens.span.Span],Tuple[str]], optional: By default None.
_token_sequences_doc_attributestr: The name of the spaCy doc custom attribute containing the sequences of tokens to form the corpus for the c-value computation. Default is None which default to the full doc.
_max_term_token_lengthint: The maximum number of tokens a term can have, by default 1.
tfidf_agg_typeUnion[“MEAN”, “MAX”]: The operation to use to aggregate TF-IDF values of candidate terms. can be “MEAN” to aggregate by mean values or “MAX” to aggregate by max values, by default “MEAN”.
candidate_term_thresholdfloat: The TF-IDF score threshold below which terms will be ignored, by default 0.0.
_ngram_rangeTuple[int, int]: The ngram range for the TF-IDF vectorizer.
_custom_tokenizerCallable[[str], List[str]]: Tokenizer for the TF-IDF vectorizer.
tfidf_vectorizersklearn.feature_extraction.text.TfidfVectorizer, optional: The TF-IDF vectorizer to compute TF-IDF scores.

check_resources() → None[source]¶

Method to check that the component has access to all its required resources.

This pipeline component does not need any access to any external resource.

get_performance_report() → Dict[str, Any][source]¶

A getter for the pipeline component performance report.: If the component has been optimised, it only returns the best performance. Otherwise, it returns the results obtained with the set parameters.

Returns¶

Dict[str, Any]: The pipeline component performance report.

optimise(validation_terms: Set[str], option_values_map: Set[float]) → None[source]¶: A method to optimise the pipeline component by tuning the options.

run(pipeline: Pipeline) → None[source]¶

Method that is responsible for the execution of the component.

Parameters¶

pipeline: Pipeline: The pipeline to run the component with.

olaf.pipeline.pipeline_component.term_extraction package¶

Submodules¶

olaf.pipeline.pipeline_component.term_extraction.c_value_term_extraction module¶

Attributes¶

Returns¶

Parameters¶

olaf.pipeline.pipeline_component.term_extraction.llm_term_extraction module¶

Attributes¶

Returns¶

Parameters¶

olaf.pipeline.pipeline_component.term_extraction.manual_candidate_terms module¶

Attributes¶

Returns¶

Parameters¶

olaf.pipeline.pipeline_component.term_extraction.pos_term_extraction module¶

Attributes¶

Returns¶

Parameters¶

olaf.pipeline.pipeline_component.term_extraction.term_extraction_schema module¶

Attributes¶

Parameters¶

Returns¶

Returns¶

Parameters¶

olaf.pipeline.pipeline_component.term_extraction.tfidf_term_extraction module¶

Attributes¶

Returns¶

Parameters¶

Module contents¶