Lbl2Vec

class lbl2vec.lbl2vec.Lbl2Vec(keywords_list: List[List[str]], tagged_documents: Optional[List[TaggedDocument]] = None, label_names: Optional[List[str]] = None, epochs: int = 10, vector_size: int = 300, min_count: int = 50, window: int = 15, sample: float = 1e-05, negative: int = 5, workers: int = -1, doc2vec_model: Optional[Doc2Vec] = None, num_docs: Optional[int] = None, similarity_threshold: Optional[float] = None, similarity_threshold_offset: float = 0, min_num_docs: int = 1, clean_outliers: bool = False, verbose: bool = True)

Creates jointly embedded label, document and word vectors. Once the model is trained it contains document and label vectors.

Parameters
  • keywords_list (iterable list of lists with descriptive keywords of type str.) – For each label at least one descriptive keyword has to be added as list of str.

  • tagged_documents (iterable list of gensim.models.doc2vec.TaggedDocument elements, optional) – Input corpus. This can be a plain list of elements, but for larger corpora consider an iterable that streams the documents directly from disk or network. To train word and document vectors from scratch, this parameter must not be None and the doc2vec_model parameter must be None. To load the learned word and document vectors of a pretrained Doc2Vec model instead, this parameter must be None.

  • label_names (iterable list of str, optional) – Custom names can be defined for each label. Parameter values of label names and keywords of the same topic must have the same index. Default is to use generic label names.

  • epochs (int, optional) – Number of iterations (epochs) over the corpus.

  • vector_size (int, optional) – Dimensionality of the feature vectors.

  • min_count (int, optional) – Ignores all words with total frequency lower than this.

  • window (int, optional) – The maximum distance between the current and predicted word within a sentence.

  • sample (float, optional) – The threshold for configuring which higher-frequency words are randomly downsampled, useful range is (0, 1e-5).

  • negative (int, optional) – If > 0, negative sampling will be used, the int for negative specifies how many “noise words” should be drawn (usually between 5-20). If set to 0, no negative sampling is used.

  • workers (int, optional) – The number of worker threads used to train the model. More threads lead to faster training on multicore machines. If set to -1, all available worker threads of the machine are used.

  • doc2vec_model (gensim.models.doc2vec.Doc2Vec, optional) – If given a pretrained Doc2Vec model, Lbl2Vec uses its word and document vectors to compute the label vectors. If this parameter is defined, tagged_documents has to be None. In order to get optimal Lbl2Vec results the given Doc2Vec model should be trained with the parameters “dbow_words=1” and “dm=0”.

  • num_docs (int, optional) – Maximum number of documents to calculate label embedding from. Default is all available documents.

  • similarity_threshold (float, default=None) – Only documents with a higher similarity to the respective description keywords than this threshold are used to calculate the label embeddings.

  • similarity_threshold_offset (float, default=0) – Sets the similarity threshold to n - similarity_threshold_offset, where n is the similarity of the keyphrase_vector to the most similar document_vector.

  • min_num_docs (int, optional) – Minimum number of documents that are used to calculate the label embedding. If the similarity threshold is chosen too restrictively, documents are added until this requirement is fulfilled. This value should be at least 1 so that the label embedding can be calculated. If this value is < 1, it can happen that no document is selected for the label embedding calculation and therefore no label embedding is generated.

  • clean_outliers (boolean, optional) – Whether to clean outlier candidate documents for label embedding calculation. Setting to False can shorten the training time. However, a loss of accuracy in the calculation of the label vectors may be possible.

  • verbose (boolean, optional) – Whether to print status during training and prediction.
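
The constraints on tagged_documents, doc2vec_model, keywords_list and label_names described above can be checked before training. The following is a minimal sketch, not part of the library: check_lbl2vec_inputs and train_from_scratch are hypothetical helper names, and min_count=2 is only an illustrative value for a small corpus.

```python
from typing import Optional, Sequence


def check_lbl2vec_inputs(
    keywords_list: Sequence[Sequence[str]],
    label_names: Optional[Sequence[str]] = None,
    tagged_documents: Optional[Sequence] = None,
    doc2vec_model: Optional[object] = None,
) -> None:
    """Hypothetical pre-flight check mirroring the documented constraints."""
    # Exactly one of tagged_documents / doc2vec_model may be given.
    if (tagged_documents is None) == (doc2vec_model is None):
        raise ValueError("pass exactly one of tagged_documents or doc2vec_model")
    # Every label needs at least one descriptive keyword.
    if any(len(keywords) == 0 for keywords in keywords_list):
        raise ValueError("each label needs at least one keyword")
    # label_names[i] must belong to keywords_list[i].
    if label_names is not None and len(label_names) != len(keywords_list):
        raise ValueError("label_names and keywords_list must have the same length")


def train_from_scratch(keywords_list, label_names, tagged_documents):
    """Train word, document and label vectors from a tagged corpus."""
    check_lbl2vec_inputs(keywords_list, label_names, tagged_documents=tagged_documents)
    # Imported lazily so the check above also works without lbl2vec installed.
    from lbl2vec import Lbl2Vec

    model = Lbl2Vec(
        keywords_list=keywords_list,
        tagged_documents=tagged_documents,
        label_names=label_names,
        min_count=2,  # illustrative only; the documented default is 50
    )
    model.fit()
    return model
```

To reuse a pretrained Doc2Vec model, the same pattern applies with doc2vec_model set and tagged_documents left as None.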

add_lbl_thresholds(lbl_similarities_df: DataFrame, lbl_thresholds: List[Tuple[str, float]]) DataFrame

Adds threshold column with the threshold value of the most similar classification label.

Parameters
  • lbl_similarities_df (pandas.DataFrame with first column of document keys, second column of most similar labels, third column of similarity scores of the document to the most similar label and the following columns with the respective labels and the similarity scores of the documents to the labels.) – This pandas.DataFrame type is returned by the predict_model_docs() and predict_new_docs() functions.

  • lbl_thresholds (list of tuples) – First tuple element consists of the label name and the second tuple element of the threshold value.

Returns

lbl_similarities_df

Return type

pandas.DataFrame with first column of document keys, second column of most similar labels, third column of similarity scores of the document to the most similar label, fourth column of the label threshold values and the following columns with the respective labels and the similarity scores of the documents to the labels.
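
To illustrate the resulting layout, here is a dependency-free sketch that models each DataFrame row as a dict. It is not the library's implementation: attach_thresholds and confident_rows are hypothetical helpers, and the column names doc_key, most_similar_label, highest_similarity_score and lbl_threshold are assumptions made for this example.

```python
from typing import Dict, List, Tuple

# Assumed names of the three leading columns of a prediction result.
LEADING = ("doc_key", "most_similar_label", "highest_similarity_score")


def attach_thresholds(
    rows: List[Dict[str, object]],
    lbl_thresholds: List[Tuple[str, float]],
) -> List[Dict[str, object]]:
    """Insert the threshold of each row's most similar label as fourth column."""
    threshold_by_label = dict(lbl_thresholds)
    result = []
    for row in rows:
        new_row = {name: row[name] for name in LEADING}
        new_row["lbl_threshold"] = threshold_by_label[row["most_similar_label"]]
        # Per-label similarity columns follow unchanged.
        new_row.update({k: v for k, v in row.items() if k not in LEADING})
        result.append(new_row)
    return result


def confident_rows(rows: List[Dict[str, object]]) -> List[Dict[str, object]]:
    """Keep rows whose best score reaches the threshold of the assigned label."""
    return [r for r in rows if r["highest_similarity_score"] >= r["lbl_threshold"]]
```

A typical use of the threshold column is the filter shown in confident_rows: documents whose best similarity score stays below the threshold of their assigned label can be treated as unclassified.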

fit()

Trains the Lbl2Vec model which creates jointly embedded label, document and word vectors.

classmethod load(filepath: str) object

Loads the Lbl2Vec model from disk.

Parameters

filepath (str) – Path of file.

Returns

lbl2vec_model

Return type

Lbl2Vec model loaded from disk.

predict_model_docs(doc_keys: Optional[Union[List[int], List[str]]] = None, multiprocessing: bool = False) DataFrame

Computes similarity scores of documents that are used to train the Lbl2Vec model to each label.

Parameters
  • doc_keys (list of document keys, optional) – If None: return the similarity scores for all documents that are used to train the Lbl2Vec model. Else: only return the similarity scores of training documents with the given keys.

  • multiprocessing (boolean, optional) – Whether to use the ray multiprocessing library during prediction. If True, ray uses all available workers for prediction. If False, just use a single core for prediction.

Returns

labeled_docs

Return type

pandas.DataFrame with first column of document keys, second column of most similar labels, third column of similarity scores of the document to the most similar label and the following columns with the respective labels and the similarity scores of the documents to the labels. The similarity scores consist of cosine similarities and therefore have a value range of [-1,1].
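
The cosine similarity scores in the result can be read as follows: 1 means the document vector and the label vector point in the same direction, 0 means they are orthogonal, and -1 means they point in opposite directions. The sketch below shows the calculation on plain Python lists; cosine_similarity and most_similar_label are hypothetical helpers for illustration, not functions of the library.

```python
import math
from typing import Dict, Sequence, Tuple


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors, in [-1, 1]."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar_label(
    doc_vector: Sequence[float],
    label_vectors: Dict[str, Sequence[float]],
) -> Tuple[str, float]:
    """Return the label whose vector is closest to the document vector."""
    scores = {
        label: cosine_similarity(doc_vector, vector)
        for label, vector in label_vectors.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]
```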

predict_new_docs(tagged_docs: List[TaggedDocument], multiprocessing: bool = False) DataFrame

Computes similarity scores of given new documents that are not used to train the Lbl2Vec model to each label.

Parameters
  • tagged_docs (iterable list of gensim.models.doc2vec.TaggedDocument elements) – New documents that are not used to train the model.

  • multiprocessing (boolean, optional) – Whether to use the ray multiprocessing library during prediction. If True, ray uses all available workers for prediction. If False, just use a single core for prediction.

Returns

labeled_docs

Return type

pandas.DataFrame with first column of document keys, second column of most similar labels, third column of similarity scores of the document to the most similar label and the following columns with the respective labels and the similarity scores of the documents to the labels. The similarity scores consist of cosine similarities and therefore have a value range of [-1,1].
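
New raw texts have to be converted to TaggedDocument elements before they can be passed to predict_new_docs(). A minimal sketch, assuming a fitted Lbl2Vec model: tokenize and label_new_texts are hypothetical helpers, and the regular-expression tokenizer is a deliberately simple stand-in. In practice the new documents should presumably be tokenized the same way as the training corpus.

```python
import re
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and keep runs of ASCII letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def label_new_texts(model, texts: List[str]):
    """Tag raw texts and score them against the labels of a fitted model."""
    # Imported lazily so tokenize() can be used without gensim installed.
    from gensim.models.doc2vec import TaggedDocument

    tagged_docs = [
        TaggedDocument(words=tokenize(text), tags=[str(i)])
        for i, text in enumerate(texts)
    ]
    return model.predict_new_docs(tagged_docs=tagged_docs)
```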

save(filepath: str)

Saves the Lbl2Vec model to disk.

Parameters

filepath (str) – Path of file.
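
Since save() is an instance method and load() is a class method, a round trip through disk can be sketched as below. save_and_reload is a hypothetical helper; it relies only on the two signatures documented here.

```python
def save_and_reload(model, filepath: str):
    """Persist a model, then load a fresh copy through its own class."""
    model.save(filepath)
    # load() is a classmethod, so it is called on the class, not the instance.
    return type(model).load(filepath)
```

For a fitted model this would be called as restored = save_and_reload(model, "model_file"), where the file name is a placeholder.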

Lbl2TransformerVec

class lbl2vec.lbl2transformervec.Lbl2TransformerVec(keywords_list: List[List[str]], documents: List[str], transformer_model: Union[SentenceTransformer, AutoModel] = SentenceTransformer('all-MiniLM-L6-v2'), label_names: Optional[List[str]] = None, similarity_threshold: Optional[float] = None, similarity_threshold_offset: float = 0, min_num_docs: int = 1, max_num_docs: Optional[int] = None, clean_outliers: bool = False, workers: int = -1, device: torch.device = torch.device('cpu'), verbose: bool = True)

Creates jointly embedded label and document vectors with transformer language models. Once the model is trained it contains document and label vectors.

Parameters
  • keywords_list (iterable list of lists with descriptive keywords of type str) – For each label at least one descriptive keyword has to be added as list of str.

  • documents (iterable list of strings) – Iterable list of text documents

  • transformer_model (Union[SentenceTransformer, transformers.AutoModel], default=SentenceTransformer('all-MiniLM-L6-v2')) – Transformer model used to embed the labels, documents and keywords.

  • label_names (iterable list of str, default=None) – Custom names can be defined for each label. Parameter values of label names and keywords of the same topic must have the same index. Default is to use generic label names.

  • similarity_threshold (float, default=None) – Only documents with a higher similarity to the respective description keywords than this threshold are used to calculate the label embeddings.

  • similarity_threshold_offset (float, default=0) – Sets the similarity threshold to n - similarity_threshold_offset, where n is the similarity of the keyphrase_vector to the most similar document_vector.

  • min_num_docs (int, default=1) – Minimum number of documents that are used to calculate the label embedding. If the similarity threshold is chosen too restrictively, documents are added until this requirement is fulfilled. This value should be at least 1 so that the label embedding can be calculated. If this value is < 1, it can happen that no document is selected for the label embedding calculation and therefore no label embedding is generated.

  • max_num_docs (int, default=None) – Maximum number of documents to calculate label embedding from. Default is all available documents.

  • clean_outliers (boolean, default=False) – Whether to clean outlier candidate documents for label embedding calculation. Setting to False can shorten the training time. However, a loss of accuracy in the calculation of the label vectors may be possible.

  • workers (int, default=-1) – Number of worker threads used to train the model (more threads mean faster training on multicore machines). Setting this parameter to -1 uses all available CPU cores. If a GPU is used, this parameter is ignored.

  • device (torch.device, default=torch.device('cpu')) – The device used for training the model. Default is the CPU device. To use the CPU, set device to torch.device('cpu'). To use a GPU, specify e.g. torch.device('cuda:0').

  • verbose (boolean, default=True) – Whether to print status during training and prediction.
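
How similarity_threshold, similarity_threshold_offset, min_num_docs and max_num_docs restrict the documents behind a label embedding can be sketched on a plain list of similarity scores. This is an illustration of the parameter descriptions above, not the library's code: offset_threshold and select_candidate_docs are hypothetical helpers, and letting max_num_docs win when it conflicts with min_num_docs is an assumption of this sketch.

```python
from typing import List, Optional, Sequence


def offset_threshold(
    similarities: Sequence[float],
    similarity_threshold_offset: float,
) -> float:
    """Threshold n - offset, with n the similarity of the most similar document."""
    return max(similarities) - similarity_threshold_offset


def select_candidate_docs(
    similarities: Sequence[float],
    similarity_threshold: Optional[float] = None,
    min_num_docs: int = 1,
    max_num_docs: Optional[int] = None,
) -> List[int]:
    """Return indices of the documents used for one label embedding."""
    # Rank document indices from most to least similar to the keywords.
    ranked = sorted(
        range(len(similarities)), key=lambda i: similarities[i], reverse=True
    )
    if similarity_threshold is None:
        selected = ranked
    else:
        # Keep only documents strictly above the threshold.
        selected = [i for i in ranked if similarities[i] > similarity_threshold]
    # A too restrictive threshold is relaxed by adding the next best documents.
    if len(selected) < min_num_docs:
        selected = ranked[:min_num_docs]
    # Cap the selection (assumption: the cap takes precedence).
    if max_num_docs is not None:
        selected = selected[:max_num_docs]
    return selected
```

With similarities [0.9, 0.2, 0.7, 0.4] and a threshold of 0.5, only the documents at indices 0 and 2 are kept. With a threshold of 0.95 and min_num_docs=2, no document passes the threshold, so the two most similar ones are added back.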

add_lbl_thresholds(lbl_similarities_df: DataFrame, lbl_thresholds: List[Tuple[str, float]]) DataFrame

Adds threshold column with the threshold value of the most similar classification label.

Parameters
  • lbl_similarities_df (pandas.DataFrame with first column of document keys, second column of most similar labels, third column of similarity scores of the document to the most similar label and the following columns with the respective labels and the similarity scores of the documents to the labels.) – This pandas.DataFrame type is returned by the predict_model_docs() and predict_new_docs() functions.

  • lbl_thresholds (list of tuples) – First tuple element consists of the label name and the second tuple element of the threshold value.

Returns

lbl_similarities_df

Return type

pandas.DataFrame with first column of document keys, second column of most similar labels, third column of similarity scores of the document to the most similar label, fourth column of the label threshold values and the following columns with the respective labels and the similarity scores of the documents to the labels.

fit()

Trains the Lbl2TransformerVec model which creates jointly embedded label and document vectors.
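
A minimal training sketch using the constructor parameters documented above. train_transformer_model and its use_gpu flag are hypothetical names for this example; the GPU is only selected if torch reports that CUDA is available.

```python
from typing import List, Optional


def train_transformer_model(
    keywords_list: List[List[str]],
    documents: List[str],
    label_names: Optional[List[str]] = None,
    use_gpu: bool = False,
):
    """Fit a Lbl2TransformerVec model on raw text documents."""
    # label_names[i] must belong to keywords_list[i].
    if label_names is not None and len(label_names) != len(keywords_list):
        raise ValueError("label_names and keywords_list must have the same length")
    # Imported lazily so the check above works without the libraries installed.
    import torch
    from lbl2vec import Lbl2TransformerVec

    if use_gpu and torch.cuda.is_available():
        device = torch.device("cuda:0")
    else:
        device = torch.device("cpu")

    model = Lbl2TransformerVec(
        keywords_list=keywords_list,
        documents=documents,
        label_names=label_names,
        device=device,
    )
    model.fit()
    return model
```

Unlike Lbl2Vec, the documents are passed as plain strings, so no tokenization or TaggedDocument wrapping is needed on the caller's side.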

classmethod load(filepath: str) object

Loads the Lbl2TransformerVec model from disk.

Parameters

filepath (str) – Path of file.

Returns

lbl2vec_model

Return type

Lbl2TransformerVec model loaded from disk.

predict_model_docs(doc_idxs: Optional[List[int]] = None) DataFrame

Computes similarity scores of documents that are used to train the Lbl2TransformerVec model to each label.

Parameters

doc_idxs (list of document indices, default=None) – If None: return the similarity scores for all documents that are used to train the Lbl2TransformerVec model. Else: only return the similarity scores of training documents with the given indices.

Returns

labeled_docs

Return type

pandas.DataFrame with first column of document texts, second column of most similar labels, third column of similarity scores of the document to the most similar label and the following columns with the respective labels and the similarity scores of the documents to the labels. The similarity scores consist of cosine similarities and therefore have a value range of [-1,1].

predict_new_docs(documents: List[str], workers: int = -1, device: torch.device = torch.device('cpu')) DataFrame

Computes similarity scores of given new documents that are not used to train the Lbl2TransformerVec model to each label.

Parameters
  • documents (iterable list of strings) – New documents that are not used to train the model.

  • workers (int, default=-1) – Number of worker threads used during prediction (more threads mean faster processing on multicore machines). Setting this parameter to -1 uses all available CPU cores. If a GPU is used, this parameter is ignored.

  • device (torch.device, default=torch.device('cpu')) – The device used during prediction. Default is the CPU device. To use the CPU, set device to torch.device('cpu'). To use a GPU, specify e.g. torch.device('cuda:0').

Returns

labeled_docs

Return type

pandas.DataFrame with first column of document texts, second column of most similar labels, third column of similarity scores of the document to the most similar label and the following columns with the respective labels and the similarity scores of the documents to the labels. The similarity scores consist of cosine similarities and therefore have a value range of [-1,1].

save(filepath: str)

Saves the Lbl2TransformerVec model to disk.

Parameters

filepath (str) – Path of file.