Progress is continuing on my blog mining and analysis software. I’m calling it “thoth”, after the Egyptian god of writing and knowledge. I’ve cracked a couple of difficult tasks, and the software now automatically performs semantic analysis on the blog posts, creating a network of the most frequent terms, cluster-coded by how the terms appear together (associate) in the posts. Here is the resulting semantic network based upon a search of all posts containing the term “iraq” dated January 15, 2008:

[Image: iraq semantic network]

The font sizes of the terms reflect how frequently each term appears in the posts; for example, “iraqi” is the most frequent term, and thus has the largest font size. The lines between the terms indicate association: when any given term appears, the terms connected to it also tend to appear, implying a collective mental association among them. The thicker the line, the stronger the connection. The different colors represent “clusters” of related concepts.
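To make the idea of line thickness concrete, here is a minimal sketch of one way an association strength between two terms could be computed: the cosine similarity of their post-occurrence vectors. This is an assumption for illustration, not thoth's actual weighting; the function name `cosine_cooccurrence` and the sample posts are hypothetical.

```python
import math
from itertools import combinations

def cosine_cooccurrence(posts, terms):
    """Return {(term_a, term_b): weight} for term pairs that share a post.

    Each post is a set of terms. The weight is the cosine similarity of the
    two terms' binary occurrence vectors (1 if the term is in the post).
    """
    # For each term, the set of post indices in which it appears.
    occurs = {t: {i for i, words in enumerate(posts) if t in words} for t in terms}
    edges = {}
    for a, b in combinations(sorted(terms), 2):
        if not occurs[a] or not occurs[b]:
            continue  # a term that never appears has no associations
        shared = len(occurs[a] & occurs[b])
        weight = shared / math.sqrt(len(occurs[a]) * len(occurs[b]))
        if weight > 0:
            edges[(a, b)] = weight
    return edges

# Hypothetical sample data: each post reduced to its set of terms.
posts = [
    {"iraqi", "troops", "baghdad"},
    {"iraqi", "troops", "surge"},
    {"iraqi", "baghdad"},
    {"surge", "petraeus"},
]
edges = cosine_cooccurrence(posts, {"iraqi", "troops", "baghdad", "surge", "petraeus"})
```

In a drawing, a larger weight would map to a thicker line, and pairs that never share a post get no line at all.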

The graph shows that the analysis isn’t perfect; a few terms appear separately that are obviously synonymous or part of a combination (e.g. “general” and “petraeus” instead of “general petraeus”), but I’m very pleased with the progress. The thoth software implements the following natural language processing (NLP) techniques:

  1. a custom stopword filter
  2. the Porter stemmer (not applied in this graph)
  3. statistical frequency analysis
  4. custom cosine-weighted co-occurrence network generation
  5. k-means centroid and single-link hierarchical clustering algorithms (k-means shown)
  6. spring-embedder graph generation
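
The first and third items in the list can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stopword list, the tokenizer, and the sample posts below are hypothetical, since the post does not give thoth's own.

```python
import re
from collections import Counter

# Hypothetical stopword list; the custom list thoth uses is not given here.
STOPWORDS = {"the", "a", "an", "in", "of", "and", "to", "is", "are", "on"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def term_frequencies(posts, stopwords=STOPWORDS):
    """Count how often each non-stopword term appears across all posts."""
    counts = Counter()
    for text in posts:
        counts.update(t for t in tokenize(text) if t not in stopwords)
    return counts

# Hypothetical sample posts.
sample_posts = [
    "The Iraqi government met in Baghdad.",
    "Troops in Baghdad and the Iraqi army are on patrol.",
]
freqs = term_frequencies(sample_posts)
top_terms = freqs.most_common(2)  # the most frequent terms would get the largest fonts
```

Counts like these would then feed the later steps: choosing which terms to draw, sizing their labels, and building the co-occurrence network.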

Development will continue to make the software more sensitive and useful for PR, brand, and market research applications.

