olaf.pipeline.pipeline_component.term_extraction package¶
Submodules¶
olaf.pipeline.pipeline_component.term_extraction.c_value_term_extraction module¶
- class olaf.pipeline.pipeline_component.term_extraction.c_value_term_extraction.CvalueTermExtraction(candidate_term_threshold: float | None = 0.0, max_term_token_length: int | None = 5, token_sequences_doc_attribute: str | None = None, c_value_threshold: float | None = None, cts_post_processing_functions: List[Callable[[Set[CandidateTerm]], Set[CandidateTerm]]] | None = None, stop_token_list: Set[str] | None = None)[source]¶
Bases:
TermExtractionPipelineComponent
Extract candidate terms using C-value scores computed based on the corpus.
Attributes¶
- cts_post_processing_functions: List[Callable[[Set[CandidateTerm]], Set[CandidateTerm]]], optional
A list of candidate term post processing functions to run after candidate term extraction and before assigning the extracted candidate terms to the pipeline, by default None.
- _token_sequences_doc_attributestr, optional
The name of the spaCy doc custom attribute containing the sequences of tokens to form the corpus for the c-value computation. Default is None which default to the full doc.
- _candidate_term_thresholdfloat, optional
The c-value score threshold below which terms will be ignored.
- _c_value_thresholdfloat, optional
The threshold used during the c-value scores computation process, by defaut 0.0.
- _max_term_token_lengthint, optional
The maximum number of tokens a term can have, by defaut 5.
- check_resources() None [source]¶
Method to check that the component has access to all its required resources.
This pipeline component does not need any access to any external resource.
- get_performance_report() Dict[str, Any] [source]¶
- A getter for the pipeline component performance report.
If the component has been optimised, it only returns the best performance. Otherwise, it returns the results obtained with the set parameters.
Returns¶
- Dict[str, Any]
The pipeline component performance report.
olaf.pipeline.pipeline_component.term_extraction.llm_term_extraction module¶
- class olaf.pipeline.pipeline_component.term_extraction.llm_term_extraction.LLMTermExtraction(prompt_template: Callable[[str], List[Dict[str, str]]] | None = None, llm_generator: LLMGenerator | None = None, cts_post_processing_functions: List[Callable[[Set[CandidateTerm]], Set[CandidateTerm]]] | None = None)[source]¶
Bases:
TermExtractionPipelineComponent
Extract candidate terms using LLM based on the corpus.
Attributes¶
- prompt_template: Callable[[str], List[Dict[str, str]]]
Prompt template used to give instructions and context to the LLM.
- llm_generator: LLMGenerator
The LLM model used to generate the candidate terms.
- cts_post_processing_functions: List[Callable[[Set[CandidateTerm]], Set[CandidateTerm]]], optional
A list of candidate term post processing functions to run after candidate term extraction and before assigning the extracted candidate terms to the pipeline. Default to None.
- check_resources() None [source]¶
Method to check that the component has access to all its required resources.
- get_performance_report() Dict[str, Any] [source]¶
A getter for the pipeline component performance report. If the component has been optimised, it only returns the best performance. Otherwise, it returns the results obtained with the set parameters.
Returns¶
- Dict[str, Any]
The pipeline component performance report.
olaf.pipeline.pipeline_component.term_extraction.manual_candidate_terms module¶
- class olaf.pipeline.pipeline_component.term_extraction.manual_candidate_terms.ManualCandidateTermExtraction(cts_post_processing_functions: List[Callable[[Set[CandidateTerm]], Set[CandidateTerm]]] | None = None, ct_label_strings_map: Dict[str, Set[str]] | None = None, phrase_matcher: PhraseMatcher | None = None)[source]¶
Bases:
TermExtractionPipelineComponent
A pipeline component to manually add candidate terms.
Attributes¶
- cts_post_processing_functions: List[Callable[[Set[CandidateTerm]], Set[CandidateTerm]]], optional
A list of candidate term post processing functions to run after candidate term extraction and before assigning the extracted candidate terms to the pipeline, by default None.
- ct_label_strings_map: Dict[str, Set[str]], optional
The mapping of candidate term label and their matching strings. Optional only if a custom spaCy phrase matcher is provided.
- phrase_matcher: PhraseMatcher, optional
The spaCy phrase matcher for new candidate term corpus occurrence matching. Default to matching the label provided strings.
- check_resources() None [source]¶
Method to check that the component has access to all its required resources.
- get_performance_report() Dict[str, Any] [source]¶
- A getter for the pipeline component performance report.
If the component has been optimised, it only returns the best performance. Otherwise, it returns the results obtained with the parameters set.
Returns¶
- Dict[str, Any]
The pipeline component performance report.
olaf.pipeline.pipeline_component.term_extraction.pos_term_extraction module¶
- class olaf.pipeline.pipeline_component.term_extraction.pos_term_extraction.POSTermExtraction(span_processing: Callable[[Span], str] | None = None, cts_post_processing_functions: List[Callable[[Set[CandidateTerm]], Set[CandidateTerm]]] | None = None, pos_selection: List[str] | None = ['NOUN'], token_sequences_doc_attribute: str | None = None)[source]¶
Bases:
TermExtractionPipelineComponent
Extract candidate terms with part-of-speech (POS) tagging.
Attributes¶
- cts_post_processing_functions: List[Callable[[Set[CandidateTerm]], Set[CandidateTerm]]], optional
A list of candidate term post processing functions to run after candidate term extraction and before assigning the extracted candidate terms to the pipeline, by default None.
- span_processing: Callable[[spacy.tokens.Span],str], optional
A function to process span, by default None.
- _pos_selection: List[str]; optional
List of POS tags to select in the corpus, by default [“NOUN”].
- _token_sequences_doc_attribute: str, optional
Attribute indicating which sequences to use for processing. If None, the entire doc is used.
- check_resources() None [source]¶
Method to check that the component has access to all its required resources.
This pipeline component does not need any access to any external resource.
- get_performance_report() Dict[str, Any] [source]¶
- A getter for the pipeline component performance report.
If the component has been optimised, it only returns the best performance. Otherwise, it returns the results obtained with the set parameters.
Returns¶
- Dict[str, Any]
The pipeline component performance report.
olaf.pipeline.pipeline_component.term_extraction.term_extraction_schema module¶
- class olaf.pipeline.pipeline_component.term_extraction.term_extraction_schema.TermExtractionPipelineComponent(cts_post_processing_functions: List[Callable[[Set[CandidateTerm]], Set[CandidateTerm]]] | None = None)[source]¶
Bases:
PipelineComponent
A pipeline component schema for term extraction tasks.
Attributes¶
- cts_post_processing_functions: List[Callable[[Set[CandidateTerm]], Set[CandidateTerm]]], optional
A list of candidate term post processing functions to run after candidate term extraction and before assigning the extracted candidate terms to the pipeline, by default None.
- apply_post_processing(candidate_terms: Set[CandidateTerm]) Set[CandidateTerm] [source]¶
Apply candidate terms post processing functions.
Parameters¶
- candidate_termsSet[CandidateTerm]
The set of candidate terms to post process.
Returns¶
- Set[CandidateTerm]
The post processed set of candidate terms.
- abstract check_resources() None [source]¶
Method to check that the component has access to all its required resources.
- abstract get_performance_report() Dict[str, Any] [source]¶
- A getter for the pipeline component performance report.
If the component has been optimised, it only returns the best performance. Otherwise, it returns the results obtained with the set parameters.
Returns¶
- Dict[str, Any]
The pipeline component performance report.
olaf.pipeline.pipeline_component.term_extraction.tfidf_term_extraction module¶
- class olaf.pipeline.pipeline_component.term_extraction.tfidf_term_extraction.TFIDFTermExtraction(token_sequence_preprocessing: Callable[[Span], Tuple[str]] | None = None, token_sequences_doc_attribute: str | None = None, cts_post_processing_functions: List[Callable[[Set[CandidateTerm]], Set[CandidateTerm]]] | None = None, max_term_token_length: int | None = None, tfidf_agg_type: str | None = 'MEAN', candidate_term_threshold: float | None = None, tfidf_vectorizer: TfidfVectorizer | None = None)[source]¶
Bases:
TermExtractionPipelineComponent
Extract candidate terms using TF-IDF based scores computed on the corpus.
Attributes¶
- cts_post_processing_functions: List[Callable[[Set[CandidateTerm]], Set[CandidateTerm]]], optional
A list of candidate term post processing functions to run after candidate term extraction and before assigning the extracted candidate terms to the pipeline, by default None.
- token_sequence_preprocessingCallable[[spacy.tokens.span.Span],Tuple[str]], optional
By default None.
- _token_sequences_doc_attributestr
The name of the spaCy doc custom attribute containing the sequences of tokens to form the corpus for the c-value computation. Default is None which default to the full doc.
- _max_term_token_lengthint
The maximum number of tokens a term can have, by default 1.
- tfidf_agg_typeUnion[“MEAN”, “MAX”]
The operation to use to aggregate TF-IDF values of candidate terms. can be “MEAN” to aggregate by mean values or “MAX” to aggregate by max values, by default “MEAN”.
- candidate_term_thresholdfloat
The TF-IDF score threshold below which terms will be ignored, by default 0.0.
- _ngram_rangeTuple[int, int]
The ngram range for the TF-IDF vectorizer.
- _custom_tokenizerCallable[[str], List[str]]
Tokenizer for the TF-IDF vectorizer.
- tfidf_vectorizersklearn.feature_extraction.text.TfidfVectorizer, optional
The TF-IDF vectorizer to compute TF-IDF scores.
- check_resources() None [source]¶
Method to check that the component has access to all its required resources.
This pipeline component does not need any access to any external resource.
- get_performance_report() Dict[str, Any] [source]¶
- A getter for the pipeline component performance report.
If the component has been optimised, it only returns the best performance. Otherwise, it returns the results obtained with the set parameters.
Returns¶
- Dict[str, Any]
The pipeline component performance report.