SpamBot

Revision as of 06:05, 6 February 2010 by Magsol (Talk | contribs)

Twitter SpamBot

Project Overview

The purpose of this project is to create a semi-intelligent Twitter bot that posts updates based on a sampling of the most recent updates from the public timeline. Techniques from lexical analysis, data mining, and machine learning will be employed for various (and most likely changing) purposes.

Description

In the first phases of this project, the focus will be on developing a generative model of the distribution of content in Twitter's public feed over an arbitrary period of time. The first iteration will use a 1st-order Markov model, from which words will be sampled to construct Twitter posts that reflect the aggregated public posts.

At each small increment of time, the public timeline will be queried and the 20 posts returned will be appended to a list. These posts will be preprocessed to include special START and STOP tokens at the beginning and end of each post, respectively; these special tokens will simplify the creation of the Markov chain.
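
The preprocessing step might look roughly like the following. This is a minimal Python sketch for illustration only (the project's own files are PHP); the token spellings `<START>` and `<STOP>` and the function name `tokenize_post` are assumptions, not names taken from the project.

```python
START = "<START>"  # assumed spelling of the special start token
STOP = "<STOP>"    # assumed spelling of the special stop token


def tokenize_post(text):
    """Split a post into words and wrap it in START/STOP tokens."""
    # str.split() with no argument splits on any run of whitespace.
    return [START] + text.split() + [STOP]


# Two made-up posts standing in for one batch from the public timeline.
posts = ["hello world", "good morning twitter"]
tokenized = [tokenize_post(p) for p in posts]
```

With the markers in place, the first word of every post becomes a transition out of START and the last word a transition into STOP, so the chain needs no special cases for post boundaries.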

In a 1st-order Markov chain, each successive word depends only on the word before it. Formally, for each word we compute:

P(word_i | word_{i-1})
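
One common way to estimate these probabilities is to count adjacent word pairs and divide each count by the total number of times the preceding word was seen. The Python sketch below assumes the posts are already tokenized with START and STOP markers; `build_model`, the token spellings, and the toy corpus are illustrative and are not taken from markov.php.

```python
from collections import Counter, defaultdict


def build_model(tokenized_posts):
    """Return {prev_word: {word: P(word | prev_word)}} from bigram counts."""
    counts = defaultdict(Counter)
    for tokens in tokenized_posts:
        # Pair every token with the token that follows it.
        for prev, word in zip(tokens, tokens[1:]):
            counts[prev][word] += 1
    model = {}
    for prev, followers in counts.items():
        total = sum(followers.values())
        model[prev] = {w: c / total for w, c in followers.items()}
    return model


# Made-up corpus of three tokenized posts.
corpus = [
    ["<START>", "i", "like", "tea", "<STOP>"],
    ["<START>", "i", "like", "coffee", "<STOP>"],
    ["<START>", "you", "like", "tea", "<STOP>"],
]
model = build_model(corpus)
```

In this toy corpus "like" is followed by "tea" twice and "coffee" once, so the estimate of P(tea | like) is 2/3.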

Status

Version 1.1 (last update: Nov 14)

  • Went live Nov 5, 2009
  • Two separate cronjobs: one to aggregate public posts, one to build the model from these posts
    • Public timeline aggregation runs every minute, querying the timeline twice to add 40 posts (API calls/hour: 120)
    • Model generation runs every 20 minutes, makes one post, and destroys the aggregated posts (API calls/hour: 3)
  • A new post is made by sampling from the 1st-order Markov model until the special STOP token is drawn
  • All public timeline posts are permanently logged for any needed later inspection
  • Logger outputs a variety of messages
  • Uses standard username/password authentication; future plans include adoption of OAuth
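
The sampling step described in the list above can be sketched as follows. This is an illustrative Python version, not the project's PHP code; `generate_post`, the token spellings, and the toy probabilities are all assumptions.

```python
import random


def generate_post(model, rng, start="<START>", stop="<STOP>"):
    """Walk the chain from start, sampling each next word, until stop."""
    words = []
    current = start
    while True:
        followers = model[current]
        choices = list(followers.keys())
        weights = list(followers.values())
        # Draw the next word in proportion to its transition probability.
        current = rng.choices(choices, weights=weights, k=1)[0]
        if current == stop:
            break
        words.append(current)
    return " ".join(words)


# Toy model in the form {prev_word: {word: probability}}; values are made up.
toy_model = {
    "<START>": {"hello": 1.0},
    "hello": {"world": 0.5, "twitter": 0.5},
    "world": {"<STOP>": 1.0},
    "twitter": {"<STOP>": 1.0},
}
post = generate_post(toy_model, random.Random(0))
```

Because every training post ends in STOP, every word in a model built this way has at least one follower, so the lookup `model[current]` should not fail for words the model has seen.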

Outstanding Issues

  • On occasion (about four times per day), a post exceeds 140 characters. This is likely due to the sparsity of the model itself and is not a bug.
  • Recently discovered that the model treats extra spaces between words as individual tokens. The fix is to trim each token to eliminate extraneous whitespace.
  • Need error-checking in case Twitter rate-limits API calls. Currently, the application hits the public timeline twice every minute (120 calls per hour) and posts once every 20 minutes (3 calls per hour), for a total of 123 API calls each hour. This is theoretically under the limit, but the possibility still needs to be handled.
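
The first two issues could be approached along the lines sketched below. This is illustrative Python rather than the project's PHP; `split_tokens` and `sample_within_limit` are hypothetical names, and resampling when a post is too long is only one possible policy, not something the project is known to do.

```python
MAX_LENGTH = 140  # post length limit cited in the list above


def split_tokens(text):
    """Split on any run of whitespace so no blank tokens are produced."""
    return text.split()


def sample_within_limit(sampler, max_attempts=10):
    """Call sampler() until it returns a post that fits, else return None."""
    for _ in range(max_attempts):
        post = sampler()
        if len(post) <= MAX_LENGTH:
            return post
    return None


# Splitting on a single space yields an empty token for a doubled space.
naive = "good  morning".split(" ")
clean = split_tokens("good  morning")

# Hypothetical sampler that returns one over-long post, then a short one.
candidates = iter(["x" * 200, "short post"])
result = sample_within_limit(lambda: next(candidates))
```

Capping the number of attempts keeps a cron-driven job from looping indefinitely if the model keeps producing long posts; the caller can skip that posting cycle when `None` comes back.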

Code

markov.php

update.php

format.php

publictimeline.php

post.php

logger.php

Future Work

Statistical Model

  • Higher-order or variable-order Markov model
  • Classification or regression in order to post on specific topics (LDA? Topic model? Bag of words?)
  • Persistent model storage between iterations
  • Variable posting times dependent on certain conditions
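
As a rough illustration of the higher-order idea in the list above, the context can be widened from one preceding word to a tuple of k preceding words. The Python sketch below only builds the counts; `build_order_k_counts`, the padding scheme, and the toy corpus are assumptions made for the example, not a planned design.

```python
from collections import Counter, defaultdict


def build_order_k_counts(tokenized_posts, k=2, start="<START>"):
    """Count which word follows each tuple of k preceding words."""
    counts = defaultdict(Counter)
    for tokens in tokenized_posts:
        # Pad with extra START tokens so the first real word has a full context.
        padded = [start] * (k - 1) + tokens
        for i in range(k, len(padded)):
            context = tuple(padded[i - k:i])
            counts[context][padded[i]] += 1
    return counts


# Made-up corpus; each post already carries its own START and STOP tokens.
corpus = [
    ["<START>", "i", "like", "tea", "<STOP>"],
    ["<START>", "you", "like", "coffee", "<STOP>"],
]
counts = build_order_k_counts(corpus, k=2)
```

With k=2 the contexts ("i", "like") and ("you", "like") are kept separate, whereas a 1st-order model would pool both under "like". The trade-off is that longer contexts are seen less often, which makes the sparsity mentioned under Outstanding Issues more pronounced.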

Functionality

  • Engage in conversations with users (additional @ replies)
  • Send DMs, perhaps when picking up a new follower
  • In general, respond to events (followed, @ replied, topic of interest in public feed, etc.)
  • Change image / layout / background / profile text based on certain conditions

Links

Live Project

Ruby HMM Implementation

PostMaster9000 (a similar project)