`Lbl2Vec`

class lbl2vec.lbl2vec.Lbl2Vec(keywords_list: List[List[str]], tagged_documents: Optional[List[TaggedDocument]] = None, label_names: Optional[List[str]] = None, epochs: int = 10, vector_size: int = 300, min_count: int = 50, window: int = 15, sample: float = 1e-05, negative: int = 5, workers: int = -1, doc2vec_model: Optional[Doc2Vec] = None, num_docs: Optional[int] = None, similarity_threshold: Optional[float] = None, similarity_threshold_offset: float = 0, min_num_docs: int = 1, clean_outliers: bool = False, verbose: bool = True)

Creates jointly embedded label, document and word vectors. Once the model is trained it contains document and label vectors.

Parameters

keywords_list (iterable list of lists with descriptive keywords of type str) – Each label requires at least one descriptive keyword, given as a list of str.
tagged_documents (iterable list of gensim.models.doc2vec.TaggedDocument elements, optional) – Input corpus. A plain list of elements works, but for larger corpora consider an iterable that streams the documents directly from disk or network. To train word and document vectors from scratch, this parameter must not be None and the doc2vec_model parameter must be None. To reuse the learned word and document vectors of a pretrained Doc2Vec model, this parameter must be None.
label_names (iterable list of str, optional) – Custom names can be defined for each label. Parameter values of label names and keywords of the same topic must have the same index. Default is to use generic label names.
epochs (int, optional) – Number of iterations (epochs) over the corpus.
vector_size (int, optional) – Dimensionality of the feature vectors.
min_count (int, optional) – Ignores all words with total frequency lower than this.
window (int, optional) – The maximum distance between the current and predicted word within a sentence.
sample (float, optional) – The threshold for configuring which higher-frequency words are randomly downsampled. The useful range is (0, 1e-5).
negative (int, optional) – If > 0, negative sampling is used, and the value specifies how many “noise words” are drawn (usually between 5 and 20). If set to 0, no negative sampling is used.
workers (int, optional) – The number of worker threads used to train the model. More threads lead to faster training. If set to -1, all available worker threads of the machine are used.
doc2vec_model (gensim.models.doc2vec.Doc2Vec, optional) – If given a pretrained Doc2Vec model, Lbl2Vec uses its word and document vectors to compute the label vectors. If this parameter is defined, tagged_documents has to be None. In order to get optimal Lbl2Vec results the given Doc2Vec model should be trained with the parameters “dbow_words=1” and “dm=0”.
num_docs (int, optional) – Maximum number of documents to calculate label embedding from. Default is all available documents.
similarity_threshold (float, default=None) – Only documents with a higher similarity to the respective description keywords than this threshold are used to calculate the label embeddings.
similarity_threshold_offset (float, default=0) – Sets the similarity threshold to n - similarity_threshold_offset, where n is the similarity of the keyphrase_vector to the most similar document_vector.
min_num_docs (int, optional) – Minimum number of documents used to calculate the label embedding. If the similarity threshold is chosen too restrictively, documents are added until this requirement is fulfilled. This value should be at least 1 so that the label embedding can be calculated. If it is < 1, it can happen that no document is selected for the label embedding calculation and therefore no label embedding is generated.
clean_outliers (boolean, optional) – Whether to clean outlier candidate documents for label embedding calculation. Setting to False can shorten the training time. However, a loss of accuracy in the calculation of the label vectors may be possible.
verbose (boolean, optional) – Whether to print status during training and prediction.
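
How similarity_threshold, similarity_threshold_offset, min_num_docs and num_docs interact can be sketched in plain Python. This is only one reading of the parameter descriptions above, not the library's implementation. The helpers `cosine` and `select_candidate_docs` are hypothetical, and the precedence rule (an explicit threshold wins, otherwise a positive offset is applied, otherwise all documents are candidates) is an assumption.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def select_candidate_docs(
    keyword_vector: Sequence[float],
    doc_vectors: Dict[str, Sequence[float]],
    similarity_threshold: Optional[float] = None,
    similarity_threshold_offset: float = 0.0,
    min_num_docs: int = 1,
    num_docs: Optional[int] = None,
) -> List[Tuple[str, float]]:
    """Hypothetical helper: choose the documents a label embedding is built from."""
    # Rank every document by its similarity to the keyword vector, best first.
    ranked = sorted(
        ((key, cosine(keyword_vector, vec)) for key, vec in doc_vectors.items()),
        key=lambda item: item[1],
        reverse=True,
    )
    if not ranked:
        return []

    # Assumption: an explicit threshold takes precedence; otherwise a positive
    # offset defines the threshold relative to the best-matching document.
    threshold = similarity_threshold
    if threshold is None and similarity_threshold_offset > 0:
        threshold = ranked[0][1] - similarity_threshold_offset

    if threshold is None:
        selected = ranked
    else:
        selected = [item for item in ranked if item[1] > threshold]

    # Add the next-best documents if the threshold was too restrictive.
    if len(selected) < min_num_docs:
        selected = ranked[:min_num_docs]

    # Cap the number of documents if a maximum is given.
    if num_docs is not None:
        selected = selected[:num_docs]
    return selected


# Toy vectors, made up for illustration.
keyword_vector = [1.0, 0.0]
doc_vectors = {
    "d1": [1.0, 0.0],
    "d2": [0.8, 0.6],
    "d3": [0.0, 1.0],
    "d4": [-1.0, 0.0],
}

by_threshold = select_candidate_docs(keyword_vector, doc_vectors, similarity_threshold=0.5)
backfilled = select_candidate_docs(
    keyword_vector, doc_vectors, similarity_threshold=0.99, min_num_docs=2
)
by_offset = select_candidate_docs(keyword_vector, doc_vectors, similarity_threshold_offset=0.3)
```

In this toy setup all three calls select `d1` and `d2`. The second call does so only because min_num_docs=2 adds a document after the threshold of 0.99 leaves just `d1`.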

add_lbl_thresholds(lbl_similarities_df: DataFrame, lbl_thresholds: List[Tuple[str, float]]) → DataFrame

Adds threshold column with the threshold value of the most similar classification label.

Parameters

lbl_similarities_df (pandas.DataFrame with first column of document keys, second column of most similar labels, third column of similarity scores of the document to the most similar label and the following columns with the respective labels and the similarity scores of the documents to the labels.) – This pandas.DataFrame type is returned by the predict_model_docs() and predict_new_docs() functions.
lbl_thresholds (list of tuples) – First tuple element consists of the label name and the second tuple element of the threshold value.

Returns

lbl_similarities_df

Return type

pandas.DataFrame with first column of document keys, second column of most similar labels, third column of similarity scores of the document to the most similar label, fourth column of the label threshold values and the following columns with the respective labels and the similarity scores of the documents to the labels.
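
The column layout described above can be illustrated without the library. In this sketch, plain dicts stand in for DataFrame rows. The function `add_threshold_column` and the column names (`doc_key`, `most_similar_label`, `highest_similarity_score`, `threshold`) are assumptions made for illustration, not the library's actual output.

```python
from typing import Dict, List, Tuple, Union

Row = Dict[str, Union[str, float]]


def add_threshold_column(rows: List[Row], lbl_thresholds: List[Tuple[str, float]]) -> List[Row]:
    """Hypothetical helper: insert the threshold of the most similar label as fourth column."""
    lookup = dict(lbl_thresholds)
    result = []
    for row in rows:
        items = list(row.items())
        # Keep the first three columns, then add the threshold,
        # then append the per-label similarity columns.
        new_row: Row = dict(items[:3])
        new_row["threshold"] = lookup[str(row["most_similar_label"])]
        new_row.update(items[3:])
        result.append(new_row)
    return result


# Toy prediction rows, made up for illustration.
rows: List[Row] = [
    {"doc_key": "doc_0", "most_similar_label": "sports",
     "highest_similarity_score": 0.62, "sports": 0.62, "politics": 0.11},
    {"doc_key": "doc_1", "most_similar_label": "politics",
     "highest_similarity_score": 0.35, "sports": 0.08, "politics": 0.35},
]
lbl_thresholds = [("sports", 0.5), ("politics", 0.4)]

with_thresholds = add_threshold_column(rows, lbl_thresholds)

# One possible use: keep only documents whose best score reaches the threshold.
confident = [
    row["doc_key"]
    for row in with_thresholds
    if row["highest_similarity_score"] >= row["threshold"]
]
```

Here only `doc_0` passes, because the best score of `doc_1` (0.35) is below the threshold of its most similar label (0.4).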

fit(): Trains the Lbl2Vec model which creates jointly embedded label, document and word vectors.

classmethod load(filepath: str) → object

Loads the Lbl2Vec model from disk.

Parameters: filepath (str) – Path of file.
Returns: lbl2vec_model
Return type: Lbl2Vec model loaded from disk.

predict_model_docs(doc_keys: Optional[Union[List[int], List[str]]] = None, multiprocessing: bool = False) → DataFrame

Computes, for each document used to train the Lbl2Vec model, its similarity score to each label.

Parameters

doc_keys (list of document keys, optional) – If None: return the similarity scores for all documents that are used to train the Lbl2Vec model. Else: only return the similarity scores of training documents with the given keys.
multiprocessing (boolean, optional) – Whether to use the ray multiprocessing library during prediction. If True, ray uses all available workers for prediction. If False, just use a single core for prediction.

Returns

labeled_docs

Return type

pandas.DataFrame with first column of document keys, second column of most similar labels, third column of similarity scores of the document to the most similar label and the following columns with the respective labels and the similarity scores of the documents to the labels. The similarity scores consist of cosine similarities and therefore have a value range of [-1,1].
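
The structure of this result can be sketched in plain Python: each document vector is compared with each label vector by cosine similarity, and the label with the highest score is reported first. The helper names and column names below are assumptions made for illustration. The library returns a pandas.DataFrame, not a list of dicts.

```python
import math
from typing import Dict, List, Sequence, Union


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors, always within [-1, 1]."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def label_documents(
    doc_vectors: Dict[str, Sequence[float]],
    label_vectors: Dict[str, Sequence[float]],
) -> List[Dict[str, Union[str, float]]]:
    """Hypothetical helper: one row per document, mirroring the described column order."""
    rows = []
    for doc_key, doc_vec in doc_vectors.items():
        scores = {label: cosine(doc_vec, lbl_vec) for label, lbl_vec in label_vectors.items()}
        best_label = max(scores, key=scores.get)
        rows.append(
            {
                "doc_key": doc_key,
                "most_similar_label": best_label,
                "highest_similarity_score": scores[best_label],
                **scores,  # one column per label
            }
        )
    return rows


# Toy vectors, made up for illustration.
label_vectors = {"sports": [1.0, 0.0], "politics": [0.0, 1.0]}
doc_vectors = {"doc_0": [3.0, 4.0], "doc_1": [-2.0, 0.0]}

labeled_docs = label_documents(doc_vectors, label_vectors)
```

In this example `doc_1` points in the opposite direction of the `sports` vector, so its `sports` score sits at the lower bound of the range.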

predict_new_docs(tagged_docs: List[TaggedDocument], multiprocessing: bool = False) → DataFrame

Computes, for each given new document that was not used to train the Lbl2Vec model, its similarity score to each label.

Parameters

tagged_docs (iterable list of gensim.models.doc2vec.TaggedDocument elements) – New documents that are not used to train the model.
multiprocessing (boolean, optional) – Whether to use the ray multiprocessing library during prediction. If True, ray uses all available workers for prediction. If False, just use a single core for prediction.

Returns

labeled_docs

Return type

pandas.DataFrame with first column of document keys, second column of most similar labels, third column of similarity scores of the document to the most similar label and the following columns with the respective labels and the similarity scores of the documents to the labels. The similarity scores consist of cosine similarities and therefore have a value range of [-1,1].

save(filepath: str)

Saves the Lbl2Vec model to disk.

Parameters: filepath (str) – Path of file.

`Lbl2TransformerVec`

class lbl2vec.lbl2transformervec.Lbl2TransformerVec(keywords_list: List[List[str]], documents: List[str], transformer_model: Union[SentenceTransformer, transformers.AutoModel] = SentenceTransformer('all-MiniLM-L6-v2'), label_names: Optional[List[str]] = None, similarity_threshold: Optional[float] = None, similarity_threshold_offset: float = 0, min_num_docs: int = 1, max_num_docs: Optional[int] = None, clean_outliers: bool = False, workers: int = -1, device: torch.device = torch.device('cpu'), verbose: bool = True)

Creates jointly embedded label and document vectors with transformer language models. Once the model is trained it contains document and label vectors.

Parameters

keywords_list (iterable list of lists with descriptive keywords of type str) – Each label requires at least one descriptive keyword, given as a list of str.
documents (iterable list of strings) – Iterable list of text documents.
transformer_model (Union[SentenceTransformer, transformers.AutoModel], default=SentenceTransformer('all-MiniLM-L6-v2')) – Transformer model used to embed the labels, documents and keywords.
label_names (iterable list of str, default=None) – Custom names can be defined for each label. Parameter values of label names and keywords of the same topic must have the same index. Default is to use generic label names.
similarity_threshold (float, default=None) – Only documents with a higher similarity to the respective description keywords than this threshold are used to calculate the label embeddings.
similarity_threshold_offset (float, default=0) – Sets the similarity threshold to n - similarity_threshold_offset, where n is the similarity of the keyphrase_vector to the most similar document_vector.
min_num_docs (int, default=1) – Minimum number of documents used to calculate the label embedding. If the similarity threshold is chosen too restrictively, documents are added until this requirement is fulfilled. This value should be at least 1 so that the label embedding can be calculated. If it is < 1, it can happen that no document is selected for the label embedding calculation and therefore no label embedding is generated.
max_num_docs (int, default=None) – Maximum number of documents to calculate label embedding from. Default is all available documents.
clean_outliers (boolean, default=False) – Whether to clean outlier candidate documents for label embedding calculation. Setting to False can shorten the training time. However, a loss of accuracy in the calculation of the label vectors may be possible.
workers (int, default=-1) – Number of worker threads used to train the model (more threads give faster training on multicore machines). Setting this parameter to -1 uses all available CPU cores. If a GPU is used, this parameter is ignored.
device (torch.device, default=torch.device('cpu')) – The device used for training the model. The default is the CPU. To use the CPU, set device to torch.device('cpu'). To use a GPU, specify e.g. torch.device('cuda:0').
verbose (boolean, default=True) – Whether to print status during training and prediction.
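
The index alignment between label_names and keywords_list can be sketched as a small check. The helper `pair_labels_with_keywords` is hypothetical, and the generic name pattern `label_<index>` is an assumption about what "generic label names" look like.

```python
from typing import Dict, List, Optional


def pair_labels_with_keywords(
    keywords_list: List[List[str]],
    label_names: Optional[List[str]] = None,
) -> Dict[str, List[str]]:
    """Hypothetical helper: map each label name to the keywords at the same index."""
    if label_names is None:
        # Assumption: generic names follow a "label_<index>" pattern.
        label_names = [f"label_{i}" for i in range(len(keywords_list))]
    if len(label_names) != len(keywords_list):
        raise ValueError("label_names and keywords_list must have the same length")
    if any(len(keywords) == 0 for keywords in keywords_list):
        raise ValueError("each label needs at least one descriptive keyword")
    return dict(zip(label_names, keywords_list))


# Toy keywords, made up for illustration.
keywords_list = [["football", "goal", "league"], ["election", "parliament"]]

named = pair_labels_with_keywords(keywords_list, label_names=["sports", "politics"])
generic = pair_labels_with_keywords(keywords_list)
```

With custom names, `sports` is paired with the first keyword list and `politics` with the second. Swapping the order of label_names without also swapping keywords_list would silently attach the wrong keywords to each label.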

add_lbl_thresholds(lbl_similarities_df: DataFrame, lbl_thresholds: List[Tuple[str, float]]) → DataFrame

Adds threshold column with the threshold value of the most similar classification label.

Parameters

lbl_similarities_df (pandas.DataFrame with first column of document keys, second column of most similar labels, third column of similarity scores of the document to the most similar label and the following columns with the respective labels and the similarity scores of the documents to the labels.) – This pandas.DataFrame type is returned by the predict_model_docs() and predict_new_docs() functions.
lbl_thresholds (list of tuples) – First tuple element consists of the label name and the second tuple element of the threshold value.

Returns

lbl_similarities_df

Return type

pandas.DataFrame with first column of document keys, second column of most similar labels, third column of similarity scores of the document to the most similar label, fourth column of the label threshold values and the following columns with the respective labels and the similarity scores of the documents to the labels.

fit(): Trains the Lbl2TransformerVec model which creates jointly embedded label and document vectors.

classmethod load(filepath: str) → object

Loads the Lbl2TransformerVec model from disk.

Parameters: filepath (str) – Path of file.
Returns: lbl2vec_model
Return type: Lbl2TransformerVec model loaded from disk.

predict_model_docs(doc_idxs: Optional[List[int]] = None) → DataFrame

Computes, for each document used to train the Lbl2TransformerVec model, its similarity score to each label.

Parameters: doc_idxs (list of document indices, default=None) – If None: return the similarity scores for all documents that are used to train the Lbl2TransformerVec model. Else: only return the similarity scores of training documents with the given indices.
Returns: labeled_docs
Return type: pandas.DataFrame with first column of document texts, second column of most similar labels, third column of similarity scores of the document to the most similar label and the following columns with the respective labels and the similarity scores of the documents to the labels. The similarity scores consist of cosine similarities and therefore have a value range of [-1,1].

predict_new_docs(documents: List[str], workers: int = -1, device: device = device(type='cpu')) → DataFrame

Computes, for each given new document that was not used to train the Lbl2TransformerVec model, its similarity score to each label.

Parameters

documents (iterable list of strings) – New documents that are not used to train the model.
workers (int, default=-1) – Number of worker threads used for prediction (more threads give faster processing on multicore machines). Setting this parameter to -1 uses all available CPU cores. If a GPU is used, this parameter is ignored.
device (torch.device, default=torch.device('cpu')) – The device used for prediction. The default is the CPU. To use the CPU, set device to torch.device('cpu'). To use a GPU, specify e.g. torch.device('cuda:0').

Returns

labeled_docs

Return type

pandas.DataFrame with first column of document texts, second column of most similar labels, third column of similarity scores of the document to the most similar label and the following columns with the respective labels and the similarity scores of the documents to the labels. The similarity scores consist of cosine similarities and therefore have a value range of [-1,1].

save(filepath: str)

Saves the Lbl2TransformerVec model to disk.

Parameters: filepath (str) – Path of file.
