olaf.pipeline.data_preprocessing package

Submodules

olaf.pipeline.data_preprocessing.data_preprocessing_schema module

class olaf.pipeline.data_preprocessing.data_preprocessing_schema.DataPreprocessing

Bases: ABC

Component specific to the data preprocessing. The sequence of data preprocessing tasks should result in a corpus object, i.e., a List[spacy.tokens.doc.Doc].

abstract run(pipeline: Pipeline) → None

Executes the component. Subclasses must implement this method.

Parameters

pipeline : Pipeline

The running pipeline. The type Any is used instead of Pipeline to avoid a circular import.
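The abstract interface above can be illustrated with a minimal sketch. It does not import OLAF: DataPreprocessing here is a stand-in that mirrors the documented signature, and FakePipeline, its corpus attribute, and LowercasePreprocessing are hypothetical names introduced only for illustration. Plain strings take the place of spaCy documents.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, List


class DataPreprocessing(ABC):
    """Stand-in mirroring the documented interface (not the OLAF class itself)."""

    @abstractmethod
    def run(self, pipeline: Any) -> None:
        """Execute the component against the running pipeline."""


@dataclass
class FakePipeline:
    """Hypothetical pipeline stand-in; the `corpus` attribute is an assumption."""

    corpus: List[str] = field(default_factory=list)


class LowercasePreprocessing(DataPreprocessing):
    """Hypothetical component that lowercases every document in the corpus."""

    def run(self, pipeline: Any) -> None:
        # Replace the corpus in place so later components see the processed documents.
        pipeline.corpus = [doc.lower() for doc in pipeline.corpus]


pipeline = FakePipeline(corpus=["Hello World", "OLAF Pipeline"])
LowercasePreprocessing().run(pipeline)
```

Because run is abstract, instantiating the base class directly raises TypeError; only subclasses that implement run can be created.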

olaf.pipeline.data_preprocessing.token_selector_data_preprocessing module

class olaf.pipeline.data_preprocessing.token_selector_data_preprocessing.TokenSelectorDataPreprocessing(selector: Callable[[Token], bool], token_sequence_doc_attribute: str | None = None)

Bases: DataPreprocessing

Preprocesses data with a token selector method.

Attributes

corpus: spacy.tokens.Doc

spaCy corpus to process.

token_selector: Callable[[spacy.tokens.Token], bool]

Callable function that implements the token selection criterion.

token_sequence_doc_attribute: str, Optional

Name of the spaCy doc attribute containing the selected tokens, by default “selected_tokens”.

run(pipeline: Pipeline) → None

Executes the component: preprocesses every document in the corpus using the token selector.

Parameters

pipeline : Pipeline

The running pipeline.
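The token selection idea can be sketched without spaCy or OLAF installed. Everything below is an assumption made for illustration: Token and Doc are simplified stand-ins for the spaCy classes, the extensions dictionary plays the role of a custom doc attribute, and select_tokens and is_content_word are hypothetical helpers rather than part of the library.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Token:
    """Stand-in for spacy.tokens.Token with only the fields this sketch needs."""

    text: str
    is_stop: bool = False


@dataclass
class Doc:
    """Stand-in for spacy.tokens.Doc; `extensions` mimics custom doc attributes."""

    tokens: List[Token]
    extensions: Dict[str, List[Token]] = field(default_factory=dict)


def select_tokens(
    corpus: List[Doc],
    selector: Callable[[Token], bool],
    attribute: str = "selected_tokens",
) -> None:
    """Store, on each document, the tokens for which the selector returns True."""
    for doc in corpus:
        doc.extensions[attribute] = [tok for tok in doc.tokens if selector(tok)]


def is_content_word(token: Token) -> bool:
    """Hypothetical selection criterion: alphabetic tokens that are not stop words."""
    return token.text.isalpha() and not token.is_stop


doc = Doc(tokens=[Token("the", is_stop=True), Token("ontology"), Token("42"), Token("learning")])
select_tokens([doc], is_content_word)
selected = [tok.text for tok in doc.extensions["selected_tokens"]]
```

With the real component, the selector would receive spacy.tokens.Token objects, so a criterion of this shape could rely on token attributes such as is_stop or is_alpha.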

Module contents