This is the first post of many that I hope to publish describing some of the technological underpinnings of the Fresh Comics ecosystem. Compiling a weekly database of new comic releases and trying to keep track of everything going on in a particular global retail niche is challenging, and I hope that these entries shed some light on how the pieces all fit together.
In a recent letter I sent out to comic shop retailers, I mentioned that my goal for 2013 is to get a better handle on events happening at local comic shops (LCS) around the world. With the relaunch of the Fresh Comics website last year, I added some emphasis to LCS events on the Upcoming Events page and the Fresh Comics Events Map. While I've been very happy with how these pages showcase upcoming events, it's been challenging to keep these lists current and populated. As of this blog entry, I have a grand total of fifteen upcoming events logged in the Fresh Comics database for the next five months. Not a great showing.
If I were solely focused on a local audience (such as the comic shops in the Chicagoland area), I would tackle this problem by subscribing to the mailing lists of the comic shops in the area and simply entering the events manually as I received new information. However, given the thousands of comic shops around the world, subscribing to print or e-mail mailing lists isn't a scalable solution. (Can you imagine my e-mail inbox each week? Madness!) In the past, I've requested event information from comic shops, but due to changes in my monetization strategy (original plan: charge retailers $5 to list an event in-app and on the website; new plan: stay tuned), I suspect that retailers have been hesitant to send me event information because they didn't want another invoice to pay. I've since changed my event-listing policy (events are now free), but it's been challenging to get the word out to busy folks who don't want to deal with another one-off method of advertising their shop.
Given the difficulty of keeping a steady stream of events flowing into the app, I decided to approach this problem from a technological perspective. One thing that works in my favor is the stellar uptake of social networking services that local shops have adopted to keep in touch with their customers. Since Twitter is one of the more popular services and the most transparent data-wise, I decided to see what I could build to "mine" event-related tweets from the mountains of messages that @FreshComicsApp receives. Of the stores in the Fresh Comics database, I found that several hundred (400 at the time of writing) published news via Twitter, and I followed each and every one of them.
Now, even with the best Twitter client, it's next to impossible to manually keep track of the flood of tweets that 400 businesses send out on a daily or hourly basis. Given that Fresh Comics is not my full-time job, I needed to automate the collection of these tweets while I was away. To this end, I expanded the Fresh Comics Django infrastructure to automatically collect all the tweets that had accumulated over a fixed period and to write those messages to my database for further processing. For each message, I log the sender's screen name (e.g. @comicshop), full name (e.g. "The Local Comic Shop"), the date the message was sent, and the contents of the message. The results look like this in my administrative interface:
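To make the logging step concrete, here's a minimal sketch of the record I store per tweet. The field names and the raw payload shape are illustrative only - the real implementation is a Django model, not this dataclass.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class CollectedTweet:
    # Illustrative fields, not the actual Fresh Comics schema.
    screen_name: str   # e.g. "@comicshop"
    full_name: str     # e.g. "The Local Comic Shop"
    sent_at: datetime  # when the message was sent
    text: str          # the contents of the message

def normalize(raw):
    """Map one raw payload (shape assumed here) onto the stored record."""
    return CollectedTweet(
        screen_name="@" + raw["screen_name"],
        full_name=raw["name"],
        sent_at=datetime.fromisoformat(raw["created_at"]),
        text=raw["text"],
    )
```

A periodic task can then call `normalize` on each message the collector fetched and save the result to the database.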
To give a sense of scale, I've collected almost 100,000 tweets since I turned on the system in September, and almost 20,000 for the month of December alone. As I mentioned above, even if I wanted to spend my whole day doing so, keeping up with this volume is infeasible.
So, given the mountains of data being collected, I moved on to the next problem: how do I filter the wheat (event information) from the chaff (everything else)? To solve this problem, I turned to my background in computer science and started dusting off my practical machine learning knowledge.
If you're unfamiliar with machine learning, as a field, it's artificial intelligence's statistical sibling. Whereas AI has traditionally focused on building symbolic systems that function using a traditional logical framework (deduction), machine learning is more of a data-driven approach that seeks to generate models of the world based on patterns found in actual data (induction). Sometimes a human can evaluate the generated model and understand the structure behind how and why it works (white box learners, e.g. decision trees), but often the generated models are incomprehensible and opaque "black boxes" of statistical calculations.
In addition to selecting a learning algorithm, it's also extremely important to break the problem down into simpler parts that the algorithm can evaluate (features). Text classification is a fairly common problem, so in this case, I followed others' examples by treating it as a multidimensional space partitioning problem.
As humans, we are accustomed to living in a three dimensional world. "Forward" & "back", "left" & "right", and "up" & "down" are very familiar concepts to us. (The pedantic reader will also add "before" & "after" to the three physical dimensions.) In text classification, we take this line of thinking and extend it a bit to describe text documents as vectors in a many-dimensional space.
For example take this sentence:
"The quick brown fox jumps over the lazy dog."
This sentence has 9 total words, 8 of which are unique ("the" appears twice). We can take this sentence and represent it as a point in 8-dimensional space by replacing the traditional X, Y & Z labels with the unique words and adding new dimensions beyond the original three. If we order our "dimensions" alphabetically:
(brown, dog, fox, jumps, lazy, over, quick, the)
The sentence above becomes the coordinates (or vector space model):
(1, 1, 1, 1, 1, 1, 1, 2)
If we are given another sentence:
"Give the dog a bone."
Our dimensions expand with the additional new terms:
(a, bone, brown, dog, fox, give, jumps, lazy, over, quick, the)
And the sentences' positions in the new multidimensional space become:
The quick brown fox jumps over the lazy dog.
(0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 2)
Give the dog a bone.
(1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1)
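This bookkeeping is mechanical enough to sketch in a few lines of Python. The `tokenize` and `vectorize` helpers here are my own illustration, not part of any particular library:

```python
import re

def tokenize(sentence):
    """Lowercase the sentence and pull out the words."""
    return re.findall(r"[a-z]+", sentence.lower())

def vectorize(sentences):
    """Build an alphabetical vocabulary (the 'dimensions'), then one
    term-count vector per sentence."""
    vocab = sorted({word for s in sentences for word in tokenize(s)})
    vectors = [[tokenize(s).count(word) for word in vocab] for s in sentences]
    return vocab, vectors

vocab, vectors = vectorize([
    "The quick brown fox jumps over the lazy dog.",
    "Give the dog a bone.",
])
# vocab   -> ['a', 'bone', 'brown', 'dog', 'fox', 'give', 'jumps', 'lazy', 'over', 'quick', 'the']
# vectors -> [[0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 2], [1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1]]
```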
With the sentences assigned locations in a multidimensional space, the job of the machine learning algorithm then becomes to determine how to divide that space into multiple mutually-exclusive regions. (See the Wikipedia article on Support Vector Machines for illustrations of partitioning regions.) In the example above, we may want to distinguish between descriptions and commands. Given the two examples above, a simple learner may produce a model like this:
if (sentence[bone] > 0) // the sentence contains the word "bone", so classify it as a command
This space partition would be a hyperplane perpendicular to the "bone" dimension, located between the planes defined by "bone = 0" and "bone = 1".
If all we were doing was classifying these two sentences, we would be finished. However, that isn't an interesting problem to solve. If we add another sentence to the mix:
"The dog jumps over the fox bone."
We find that our original model doesn't hold, and we have to retrain the algorithm with the new information to produce a better model:
if (sentence[give] > 0) // "give" now separates the command from both descriptions; "bone" no longer does
If we repeat this process often enough with diverse data, we will probably arrive at a point where the algorithm can train models that classify most data points correctly, but not all. Rather than letting the perfect be the enemy of the good, we accept this reality and try to produce models that are as useful as possible. I use the term "useful" instead of "true" in this context because truth and accuracy may not be helpful for a given problem. Consider distinguishing event-related tweets from all the others: if there are 99 "other" tweets for every single "event" tweet, a model that predicts that none of the tweets are event-related is extremely accurate in terms of correct classifications made - and completely useless. A model that is only 90% accurate, but has a false-positive rate of 50% (half the tweets classified as "event" are actually "other"), may be a far more useful tool depending upon the problem we're trying to solve (and as long as the false positives don't overwhelm us). I use this example to illustrate the point that raw model "correctness" may not be what we are maximizing given our specific problem.
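To put rough numbers on that imbalance example, here's a small sketch. The 10,000-message scale and Model B's exact counts are made up for illustration:

```python
def accuracy(tp, fp, tn, fn):
    """Fraction of all classifications that were correct."""
    return (tp + tn) / (tp + fp + tn + fn)

def precision(tp, fp):
    """Of the items classified positive, the fraction that truly are."""
    return tp / (tp + fp) if (tp + fp) else 0.0

# 99 "other" tweets per "event" tweet, scaled to 10,000 messages.
EVENTS, OTHERS = 100, 9900

# Model A: label everything "other" -- 99% accurate, zero events found.
acc_a = accuracy(tp=0, fp=0, tn=OTHERS, fn=EVENTS)  # 0.99

# Model B: flags 180 tweets as events, half of them wrongly -- much
# lower precision, but it actually surfaces 90 real events to review.
prec_b = precision(tp=90, fp=90)  # 0.5
```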
Getting back to the Fresh Comics Twitter event miner, I use the text classification techniques described above to "mine" event-related tweets from the tens of thousands of tweets that I've collected. Rather than writing all the code and infrastructure needed to break down the text, build the multidimensional space, and implement machine learning algorithms, I use a toolkit designed by folks smarter than me: RapidMiner. RapidMiner includes the functionality for most of the tasks described above - my main job was to implement a way to get data from my Twitter database into RapidMiner's analytics. Fortunately, there's already a standard format that RapidMiner can read: the Attribute-Relation File Format (ARFF). On my end, I implemented two web endpoints to support this process: one that packages all the tweets I've already classified for generating a model, and one that packages all the unclassified tweets to which the model RapidMiner generates will be applied (these are the tweets being "mined").
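As a rough illustration of what the first export endpoint produces, here's a minimal ARFF serializer for (text, label) pairs. The relation and attribute names are placeholders, not the exact format my endpoint emits:

```python
def to_arff(rows, relation="labeled_tweets"):
    """Serialize (text, label) pairs as a minimal ARFF document:
    a header declaring the attributes, then one data row per tweet."""
    lines = [
        "@RELATION " + relation,
        "",
        "@ATTRIBUTE text STRING",
        "@ATTRIBUTE class {event,other}",
        "",
        "@DATA",
    ]
    for text, label in rows:
        # ARFF string values are quoted; escape embedded single quotes.
        lines.append("'%s',%s" % (text.replace("'", "\\'"), label))
    return "\n".join(lines)
```

A Django view can return this string with a `text/plain` content type and RapidMiner can read it directly.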
RapidMiner implements a graphical programming environment where I've constructed a two-step process:
The first step uses the tweets already classified as "event" or "other" to train a model. If you look at the screenshot, this is contained in the yellow box with the name "Validation". It's called Validation because, in addition to generating the model, I also quantify its usefulness using a technique called cross-validation, where I hold some data back from the training process to test the new model. This allows me to estimate two parameters: recall & precision. In my case, I'm most interested in the precision: of the tweets classified as "event", how many are actual event tweets versus "fake" ones? As long as this process extracts a sufficient amount of event data for me to work with, I'm not terribly concerned with maximizing recall quite yet. If I found myself "mining" smaller and smaller numbers of events, then optimizing recall would become a higher priority.
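Cross-validation itself is simple to sketch: split the labeled data into k folds and hold each fold out once as the test set. This is an illustrative stand-in for RapidMiner's built-in Validation operator, not what it actually runs:

```python
def k_folds(examples, k=10):
    """Yield (train, test) splits for k-fold cross-validation: every
    example is held out exactly once, to test a model trained on the rest."""
    folds = [examples[i::k] for i in range(k)]
    for i in range(k):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, folds[i]
```

Averaging precision and recall across the k held-out folds gives a more honest estimate than testing on the training data itself.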
Once I have a model, RapidMiner fetches the uncategorized tweets from the database and applies the model to each tweet. Two important results are generated by this process: a classification and a confidence estimate. The classification is an easy thing to understand - is the tweet an "event" or an "other" message? The confidence estimate is the "certainty" that the algorithm has about the classification. Since machines are not capable of true certainty, we can think of the confidence estimate as a proxy for the distance (in the multidimensional space) between a given tweet's location and the boundary that separates "event" space from "other" space. A large confidence estimate (0.75 or higher) represents a tweet far away from the boundary, whereas a small one (anything close to 0.5) indicates a tweet that should probably be investigated to confirm its true classification.
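The triage rule those thresholds imply can be sketched like so - the function and threshold names are mine, not RapidMiner's:

```python
def triage(classified, sure=0.75):
    """Split model output into tweets to act on and tweets to review.
    `classified` holds (text, label, confidence) triples; the 0.75
    threshold mirrors the rule of thumb described above."""
    act_on, review = [], []
    for text, label, confidence in classified:
        if label == "event" and confidence >= sure:
            act_on.append(text)                      # likely a real event
        elif confidence < sure:
            review.append((text, label, confidence))  # near the boundary
    return act_on, review
```

High-confidence "other" tweets fall through both branches and are simply ignored, which is the whole point of the miner.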
RapidMiner does not include the ability to upload arbitrary ARFF files to websites, so my final bit of integration with my Django backend is a method for uploading ARFF files and applying the classifications for the unlabeled tweets to the tweets in my database. Once the new tweets have been labeled with their classification and confidence estimates, I can start sifting through the newly-identified tweets for items to add to the upcoming event calendar. Since many of these tweets will contain links to pages, this is still a manual process. However, since the miner has identified the most promising candidates, I can spend my limited time on the most useful tweets.
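The upload handler boils down to matching uploaded rows to stored tweets by ID. Here's the gist, with a plain dict standing in for the Django ORM:

```python
def apply_classifications(store, rows):
    """Write uploaded (tweet_id, label, confidence) rows back onto an
    id-keyed tweet store. Returns how many records were updated;
    unknown IDs are skipped rather than raising."""
    updated = 0
    for tweet_id, label, confidence in rows:
        record = store.get(tweet_id)
        if record is not None:
            record["label"] = label
            record["confidence"] = confidence
            updated += 1
    return updated
```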
As I mentioned before, I use pre-classified tweets to train the model that labels the new tweets. This is an ongoing iterative process. When I apply the classifications to new tweets, I review the classifications of all the newly-labeled events and the most uncertain "others" to refine the model.
The interface above is what I use to go over these events. While it appears odd in a regular web browser, the large text and buttons have been optimized for mobile use - otherwise known as the time I spend on the bus heading to work every morning. This lightweight interface helps me refine the Twitter miner during downtime, so that the next time I do a training & classification run, I'm working with better and better data to improve my generated models.
While this approach has significantly improved the effectiveness of Twitter as an input for event information, I still haven't automated away the need to manually input the event details into the other parts of the database. Given the variation in presentation (website, Facebook, PDFs of flyers), it's unlikely that I'll be able to implement a fully-automated solution. A similar problem that I've run into is related to how comic shop managers use Twitter. While it would be optimal for me if the managers tweeted out events weeks in advance, I found that many shops only tweet their event information a day or two in advance, which means that I end up missing events if I'm not reviewing the results of the miner once or twice a day. This presents another problem for me to solve, and perhaps a future entry for Fresh Tech. :-)
Have a Happy New Year!
Article image credit: ~Menace-of-the-Mils @ DeviantArt