Reading the gensim Source Code

Introduction

Overall Flow

  1. Generate a corpus from the documents
  2. Build a topic model from the corpus
  3. Use the model to compute document similarity

Note

  • This walkthrough targets English text

1. Generating a Corpus from the Documents

We vectorize the documents to build a corpus.

# Configure logging
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

# Import the required modules
from gensim import corpora, models, similarities
from pprint import pprint 

# Documents from which the corpus is built
documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

# Stop words
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]

# Keep only words that appear more than once in the whole corpus
from collections import defaultdict
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

texts = [[token for token in text if frequency[token] > 1]
         for text in texts]

# Check the processed texts
from pprint import pprint   # pretty-printer
pprint(texts)
[['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]

# Map each word to an integer ID
dictionary = corpora.Dictionary(texts)
dictionary.save('/tmp/deerwester.dict')  # save to disk
print(dictionary)
print(dictionary.token2id)

So far, this follows the tutorial.
This is the preparation for building a corpus from the given documents:
the documents are split into words, and each word is assigned an ID.
English can be tokenized simply by splitting on whitespace, which makes things easy.

From corpora.Dictionary(texts) we can vectorize documents as bag-of-words and build the corpus. The vectorization uses the number of times each word appears within a single document.
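
As a quick check of the bag-of-words idea (this snippet is my own addition, but it only uses the dictionary built above), doc2bow returns (word_id, count) pairs, so a word that occurs twice in a document gets a count of 2:

# 'system' occurs twice in this document, so its (id, count) pair has count 2;
# words missing from the dictionary (stop words, words filtered as rare) are simply dropped
print(dictionary.doc2bow("System and human system engineering testing of EPS".lower().split()))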

Next, to understand what corpora.Dictionary(texts) actually does, let's look at gensim/dictionary.py at develop · piskvorky/gensim · GitHub.

class Dictionary(utils.SaveLoad, Mapping):
    """
    Dictionary encapsulates the mapping between normalized words and their integer ids.
    The main function is `doc2bow`, which converts a collection of words to its
    bag-of-words representation: a list of (word_id, word_frequency) 2-tuples.
    """
    def __init__(self, documents=None, prune_at=2000000):
        """
        If `documents` are given, use them to initialize Dictionary (see `add_documents()`).
        """
        self.token2id = {} # token -> tokenId
        self.id2token = {} # reverse mapping for token2id; only formed on request, to save memory
        self.dfs = {} # document frequencies: tokenId -> in how many documents this token appeared

        self.num_docs = 0 # number of documents processed
        self.num_pos = 0 # total number of corpus positions
        self.num_nnz = 0 # total number of non-zeroes in the BOW matrix

        if documents is not None:
            self.add_documents(documents, prune_at=prune_at)

    def add_documents(self, documents, prune_at=2000000):
        """
        Update dictionary from a collection of documents. Each document is a list
        of tokens = **tokenized and normalized** strings (either utf8 or unicode).
        This is a convenience wrapper for calling `doc2bow` on each document
        with `allow_update=True`, which also prunes infrequent words, keeping the
        total number of unique words <= `prune_at`. This is to save memory on very
        large inputs. To disable this pruning, set `prune_at=None`.
        >>> print(Dictionary(["máma mele maso".split(), "ema má máma".split()]))
        Dictionary(5 unique tokens)
        """
        for docno, document in enumerate(documents):
            # log progress & run a regular check for pruning, once every 10k docs
            if docno % 10000 == 0:
                if prune_at is not None and len(self) > prune_at:
                    self.filter_extremes(no_below=0, no_above=1.0, keep_n=prune_at)
                logger.info("adding document #%i to %s", docno, self)

            # update Dictionary with the document
            self.doc2bow(document, allow_update=True) # ignore the result, here we only care about updating token ids

        logger.info("built %s from %i documents (total %i corpus positions)",
                     self, self.num_docs, self.num_pos)


    def doc2bow(self, document, allow_update=False, return_missing=False):
        """
        Convert `document` (a list of words) into the bag-of-words format = list
        of `(token_id, token_count)` 2-tuples. Each word is assumed to be a
        **tokenized and normalized** string (either unicode or utf8-encoded). No further preprocessing
        is done on the words in `document`; apply tokenization, stemming etc. before
        calling this method.
        If `allow_update` is set, then also update dictionary in the process: create
        ids for new words. At the same time, update document frequencies -- for
        each word appearing in this document, increase its document frequency (`self.dfs`)
        by one.
        If `allow_update` is **not** set, this function is `const`, aka read-only.
        """
        if isinstance(document, string_types):
            raise TypeError("doc2bow expects an array of unicode tokens on input, not a single string")

        # Construct (word, frequency) mapping.
        counter = defaultdict(int)
        for w in document:
            counter[w if isinstance(w, unicode) else unicode(w, 'utf-8')] += 1

        token2id = self.token2id
        if allow_update or return_missing:
            missing = dict((w, freq) for w, freq in iteritems(counter) if w not in token2id)
            if allow_update:
                for w in missing:
                    # new id = number of ids made so far;
                    # NOTE this assumes there are no gaps in the id sequence!
                    token2id[w] = len(token2id)

        result = dict((token2id[w], freq) for w, freq in iteritems(counter) if w in token2id)

        if allow_update:
            self.num_docs += 1
            self.num_pos += sum(itervalues(counter))
            self.num_nnz += len(result)
            # increase document count for each unique token that appeared in the document
            dfs = self.dfs
            for tokenid in iterkeys(result):
                dfs[tokenid] = dfs.get(tokenid, 0) + 1

        # return tokenids, in ascending id order
        result = sorted(iteritems(result))
        if return_missing:
            return result, missing
        else:
            return result

The flow is Dictionary → add_documents → doc2bow. As the docstring says, doc2bow is the main piece here. Its job is to count how many times each word appears in a document.
To do that, there is also bookkeeping such as the word-to-ID mapping (self.token2id).
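
As a small aside (my own snippet, not from the tutorial), the same dictionary can also be built incrementally, which makes the Dictionary → add_documents → doc2bow flow visible from the outside:

# Build the dictionary incrementally; add_documents() calls doc2bow(..., allow_update=True) per document
d = corpora.Dictionary()      # start empty
d.add_documents(texts[:5])    # first five documents
d.add_documents(texts[5:])    # more documents can be added at any time
print(d.token2id)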

The document frequency of each word, i.e. how many documents it appears in, is stored in self.dfs (document frequencies). doc2bow takes an allow_update option so that it can also be used read-only. There is a fair amount going on, but you can see the counts being updated in the snippet below.

for tokenid in iterkeys(result):
    dfs[tokenid] = dfs.get(tokenid, 0) + 1
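
To illustrate the difference between the per-document counts returned by doc2bow and the document frequencies kept in self.dfs, here is a small check using the dictionary built earlier (my own snippet, not part of the original walkthrough):

print(dictionary.doc2bow("graph graph trees".split()))  # within-document counts: graph -> 2, trees -> 1
print(dictionary.dfs)  # tokenId -> number of documents in the corpus that contain the token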

Back to the tutorial.

new_doc = "Human computer interaction"  # a new document
new_vec = dictionary.doc2bow(new_doc.lower().split())  # vectorize it
corpus = [dictionary.doc2bow(text) for text in texts]  # build the corpus
corpora.MmCorpus.serialize('/tmp/deerwester.mm', corpus)  # save to disk

Given a new document, we can vectorize it with doc2bow. The corpus is also saved to disk in Matrix Market format.

2. Building a Topic Model from the Corpus

We now have a corpus. From here, let's follow the flow of building a TF-IDF model from it and then building LSI and LDA models on top. First we load the corpus saved to disk and build the TF-IDF model.

corpus = corpora.MmCorpus('/tmp/deerwester.mm')  # load the corpus
tfidf = models.TfidfModel(corpus)  # build the TF-IDF model

To describe TF-IDF briefly: it is a weighting scheme that gives more weight to words that appear often within a document, and less weight to words that appear across many different documents.
Let's look inside TfidfModel: gensim/tfidfmodel.py at develop · piskvorky/gensim · GitHub.
The IDF computation is done in df2idf.

def df2idf(docfreq, totaldocs, log_base=2.0, add=0.0):
    """
    Compute default inverse-document-frequency for a term with document frequency `doc_freq`::
      idf = add + log(totaldocs / doc_freq)
    """
    return add + math.log(1.0 * totaldocs / docfreq, log_base)

def precompute_idfs(wglobal, dfs, total_docs):
    """Precompute the inverse document frequency mapping for all terms."""
    # not strictly necessary and could be computed on the fly in TfidfModel__getitem__.
    # this method is here just to speed things up a little.
    return dict((termid, wglobal(df, total_docs))
                for termid, df in iteritems(dfs))
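
As a worked check of df2idf (the numbers are my own example, not from the post): with 9 documents in total and a term that appears in 3 of them, the default IDF is log2(9/3) ≈ 1.585.

import math
# Same formula as df2idf above with the defaults: idf = add + log(totaldocs / docfreq, log_base)
print(math.log(9.0 / 3.0, 2.0))  # ≈ 1.585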

Back to the tutorial.

doc_bow = [(0, 1), (2, 1)]  # define a new document, already in bag-of-words form
print(tfidf[doc_bow])  # compute its TF-IDF weights from the model

corpus_tfidf = tfidf[corpus]
for doc in corpus_tfidf:
    print(doc)

Here we use the model to compute TF-IDF weights for the document vectors. With TF-IDF computed from the corpus, we can now build topic models such as LSI and LDA on top of it.

lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2)  # build the LSI model
corpus_lsi = lsi[corpus_tfidf]  # bow->tfidf->fold-in-lsi
lsi.print_topics(2)
for doc in corpus_lsi: # the bow->tfidf and tfidf->lsi transformations actually happen here, on the fly
    print(doc)

Since the amount of text is tiny, the topics in the output don't make much sense.
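
The post mentions LDA as well; as a side note, here is a minimal sketch of building an LDA model on the same data (my own addition, not part of the original walkthrough; LDA is usually trained on the plain bag-of-words corpus rather than on TF-IDF):

lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2)  # LDA on the bag-of-words corpus
lda.print_topics(2)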

Let's take a brief look inside the LSI model: gensim/lsimodel.py at develop · piskvorky/gensim · GitHub. (Whew.)

class Projection(utils.SaveLoad):
    def __init__(self, m, k, docs=None, use_svdlibc=False, power_iters=P2_EXTRA_ITERS, extra_dims=P2_EXTRA_DIMS):
        """
        Construct the (U, S) projection from a corpus `docs`. The projection can
        be later updated by merging it with another Projection via `self.merge()`.
        This is the class taking care of the 'core math'; interfacing with corpora,
        splitting large corpora into chunks and merging them etc. is done through
        the higher-level `LsiModel` class.
        """
        self.m, self.k = m, k
        self.power_iters = power_iters
        self.extra_dims = extra_dims
        if docs is not None:
            # base case decomposition: given a job `docs`, compute its decomposition,
            # *in-core*.
            if not use_svdlibc:
                u, s = stochastic_svd(
                    docs, k, chunksize=sys.maxsize,
                    num_terms=m, power_iters=self.power_iters,
                    extra_dims=self.extra_dims)
            else:
                try:
                    import sparsesvd
                except ImportError:
                    raise ImportError("`sparsesvd` module requested but not found; run `easy_install sparsesvd`")
                logger.info("computing sparse SVD of %s matrix", str(docs.shape))
                if not scipy.sparse.issparse(docs):
                    docs = matutils.corpus2csc(docs)
                ut, s, vt = sparsesvd.sparsesvd(docs, k + 30)  # ask for extra factors, because for some reason SVDLIBC sometimes returns fewer factors than requested
                u = ut.T
                del ut, vt
                k = clip_spectrum(s**2, self.k)
            self.u = u[:, :k].copy()
            self.s = s[:k].copy()
        else:
            self.u, self.s = None, None

(snip)

class LsiModel(interfaces.TransformationABC):
    """
    Objects of this class allow building and maintaining a model for Latent
    Semantic Indexing (also known as Latent Semantic Analysis).
    The main methods are:
    1. constructor, which initializes the projection into latent topics space,
    2. the ``[]`` method, which returns representation of any input document in the
       latent space,
    3. `add_documents()` for incrementally updating the model with new documents.
    The left singular vectors are stored in `lsi.projection.u`, singular values
    in `lsi.projection.s`. Right singular vectors can be reconstructed from the output
    of `lsi[training_corpus]`, if needed. See also FAQ [2]_.
    Model persistency is achieved via its load/save methods.
    .. [2] https://github.com/piskvorky/gensim/wiki/Recipes-&-FAQ#q4-how-do-you-output-the-u-s-vt-matrices-of-lsi
    """
    def __init__(self, corpus=None, num_topics=200, id2word=None, chunksize=20000,
                 decay=1.0, distributed=False, onepass=True,
                 power_iters=P2_EXTRA_ITERS, extra_samples=P2_EXTRA_DIMS):
        """
        `num_topics` is the number of requested factors (latent dimensions).
        After the model has been trained, you can estimate topics for an
        arbitrary, unseen document, using the ``topics = self[document]`` dictionary
        notation. You can also add new training documents, with ``self.add_documents``,
        so that training can be stopped and resumed at any time, and the
        LSI transformation is available at any point.
        If you specify a `corpus`, it will be used to train the model. See the
        method `add_documents` for a description of the `chunksize` and `decay` parameters.
        Turn `onepass` off to force a multi-pass stochastic algorithm.
        `power_iters` and `extra_samples` affect the accuracy of the stochastic
        multi-pass algorithm, which is used either internally (`onepass=True`) or
        as the front-end algorithm (`onepass=False`). Increasing the number of
        power iterations improves accuracy, but lowers performance. See [3]_ for
        some hard numbers.
        Turn on `distributed` to enable distributed computing.
        Example:
        >>> lsi = LsiModel(corpus, num_topics=10)
        >>> print(lsi[doc_tfidf]) # project some document into LSI space
        >>> lsi.add_documents(corpus2) # update LSI on additional documents
        >>> print(lsi[doc_tfidf])
        .. [3] http://nlp.fi.muni.cz/~xrehurek/nips/rehurek_nips.pdf
        """
        self.id2word = id2word
        self.num_topics = int(num_topics)
        self.chunksize = int(chunksize)
        self.decay = float(decay)
        if distributed:
            if not onepass:
                logger.warning("forcing the one-pass algorithm for distributed LSA")
                onepass = True
        self.onepass = onepass
        self.extra_samples, self.power_iters = extra_samples, power_iters

        if corpus is None and self.id2word is None:
            raise ValueError('at least one of corpus/id2word must be specified, to establish input space dimensionality')

        if self.id2word is None:
            logger.warning("no word id mapping provided; initializing from corpus, assuming identity")
            self.id2word = utils.dict_from_corpus(corpus)
            self.num_terms = len(self.id2word)
        else:
            self.num_terms = 1 + max([-1] + self.id2word.keys())

        self.docs_processed = 0
        self.projection = Projection(self.num_terms, self.num_topics, power_iters=self.power_iters, extra_dims=self.extra_samples)

(snip)

def stochastic_svd(corpus, rank, num_terms, chunksize=20000, extra_dims=None,
                   power_iters=0, dtype=numpy.float64, eps=1e-6):
    """
    Run truncated Singular Value Decomposition (SVD) on a sparse input.
    Return (U, S): the left singular vectors and the singular values of the input
    data stream `corpus` [4]_. The corpus may be larger than RAM (iterator of vectors).
    This may return less than the requested number of top `rank` factors, in case
    the input itself is of lower rank. The `extra_dims` (oversampling) and especially
    `power_iters` (power iterations) parameters affect accuracy of the decomposition.
    This algorithm uses `2+power_iters` passes over the input data. In case you can only
    afford a single pass, set `onepass=True` in :class:`LsiModel` and avoid using
    this function directly.
    The decomposition algorithm is based on
    **Halko, Martinsson, Tropp. Finding structure with randomness, 2009.**
    .. [4] If `corpus` is a scipy.sparse matrix instead, it is assumed the whole
       corpus fits into core memory and a different (more efficient) code path is chosen.
    """

(snip)

The flow is LsiModel → Projection → stochastic_svd.
Inside LsiModel there is a distributed-computing mechanism for handling large-scale data. Hmm.
In Projection, you can choose whether the singular value decomposition uses stochastic_svd or sparsesvd. There is also a merge mechanism for combining projections when documents are added later.
stochastic_svd performs the singular value decomposition. This is where the dimensionality reduction happens, and that is what realizes LSI.
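
To make the SVD step concrete, here is a minimal sketch of truncated SVD on a tiny term-document matrix using numpy (my own illustration, not gensim code; gensim's stochastic_svd computes an approximation of the same U and S factors for inputs that may not fit in RAM):

import numpy as np

# Tiny term-document matrix: rows = terms, columns = documents (made-up values)
A = np.array([[1.0, 0.0, 0.0, 1.0],
              [0.0, 1.0, 1.0, 0.0],
              [1.0, 1.0, 0.0, 0.0]])

U, S, Vt = np.linalg.svd(A, full_matrices=False)  # full decomposition
k = 2
U_k, S_k = U[:, :k], S[:k]                        # keep only the top-k factors (dimensionality reduction)

# Fold a new document (a term-count vector) into the k-dimensional latent space
doc = np.array([1.0, 0.0, 1.0])
print(U_k.T @ doc)                                # its coordinates on the top-k left singular vectors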

3. Computing Document Similarity with the Model

We compute the vector of a new document in the LSI space.

doc = "Human computer interaction"
vec_bow = dictionary.doc2bow(doc.lower().split())
vec_lsi = lsi[vec_bow] 
print(vec_lsi)
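
The tutorial's next step is to build a similarity index and rank every corpus document against this query; a minimal sketch following that step (not shown in the original post):

index = similarities.MatrixSimilarity(lsi[corpus])   # index the whole corpus in LSI space
sims = index[vec_lsi]                                # cosine similarity of the query against each document
print(sorted(enumerate(sims), key=lambda item: -item[1]))  # most similar documents first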

Conclusion

That concludes this reading of the gensim source code. We only skimmed the surface of the source, but it was fun to get a glimpse inside a library that had been a black box.
I hope you enjoyed it too.