SMiSC Open Response: September 2011

Saturday, September 24, 2011

Converting All Input to Text

I thought on this and decided maybe the whole system needs to be able to pick out memes by looking at everything it sees as text so that the semantic content can be extracted with a tool like ResearchCyc.

So how do we convert everything to text?

Text is already text. Lucky!
Audio can be converted with open-source speech recognition tools. Example: CMU Sphinx
Convert videos to text by sampling each frame through open-source OCR, and grabbing all text visible at each frame and noting it in some kind of markup so we might be able to piece together the subtitles in a paragraph and the signage in each frame or "movie set" as a separate (and potentially useful) idea/meme. (Does this mean we need to think of "place" as a factor in memes?)
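One way to picture the "piece together the subtitles" step is sketched below. It assumes some OCR tool has already produced one string of visible text per sampled frame; the function name `collapse_frames` and the sample data are made up for illustration and are not part of any OCR package.

```python
from itertools import groupby


def collapse_frames(frame_texts):
    """Merge runs of consecutive frames that show the same text.

    frame_texts: one string per sampled frame (hypothetical OCR output).
    Returns (first_frame, last_frame, text) spans, skipping frames with no text.
    """
    spans = []
    index = 0
    for text, run in groupby(frame_texts):
        length = len(list(run))
        if text.strip():
            spans.append((index, index + length - 1, text))
        index += length
    return spans


# Made-up OCR output: a sign visible for two frames, then a subtitle for three.
ocr_output = ["", "EXIT", "EXIT", "Hello there", "Hello there", "Hello there", ""]
spans = collapse_frames(ocr_output)
```

Each span could then be written out in whatever markup we settle on, with the frame range telling us how long a piece of text stayed on screen.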

This is how we can absorb everything the internet can dish out. Of course, this is all just a thought experiment until we piece it all together, and open-source packages vary greatly in capability (i.e. we might not be able to use the example packages or any other tools found on SourceForge, etc.).

Just how much data must TA 1 mine?

Here's a quote from the initial introduction to TA 1 technologies:

TA1 performers will develop automated and semi-automated operator support tools and techniques for the systematic and methodical use of social media at data scale and in a timely fashion

Since I have been working on TA 2 test systems with just 5k users, and finding roughly 185k posts per fake year at a posting rate of 0.1 posts/day, I wondered what the real world has in store for SMiSC. That is, what is "social media at data scale and in a timely fashion"?

Well here is "data scale" as of Feb 23, 2010: By The Numbers: Twitter Vs. Facebook Vs. Google Buzz

Updates/Posts
Facebook status updates: 700 per second
Twitter tweets: 600 per second
Buzz posts: 55 per second

1355 updates per second, discriminated, categorized, aggregated, and reported on. A "timely fashion" implies that it is okay to be "behind" by some time, but eventually the system must process everything. I figure the requirement for maximum delay is set up to give a report on any new/significant meme within our leaders' decision-making cycle so that leaders cannot be outfoxed by a rapidly-spreading strategic message.
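To put the per-second figures in daily terms, here is a small back-of-the-envelope sketch. The three rates are the ones quoted above; the `backlog` helper and the 1000 items/second processing rate in the usage note are assumptions for illustration only.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Rates quoted above, in updates per second.
rates_per_second = {"Facebook": 700, "Twitter": 600, "Buzz": 55}

total_per_second = sum(rates_per_second.values())
total_per_day = total_per_second * SECONDS_PER_DAY


def backlog(arrival_rate, processing_rate, seconds):
    """Items still unprocessed after `seconds` when arrivals outpace processing."""
    return max(0, (arrival_rate - processing_rate) * seconds)
```

For example, `backlog(total_per_second, 1000, 3600)` gives the number of updates a hypothetical 1000-per-second pipeline would be "behind" after one hour.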

Yikes.

Here's stuff just on Facebook (current): FB stats
Twitter doesn't seem to have a similar page.
Couldn't find one for Google+ either.

Created "TA 1 Work" Page

Pages in the Blogger system are for content that we need to keep around. I just discovered them today. We should convert some of our work to pages. I made a page for the TA 1 thoughts that I had overnight. I created a diagram this morning and have displayed it here on this page.

Friday, September 23, 2011

Generating NPCs for the TA 2 System

Well I thought about the 185k messages that had to have meaningful content. I thought, "What's the best way to get 5000 people talking to each other?" and couldn't come up with 5000 friends or 5000 gaming suckers like the Initiative discussed.

I decided maybe OpenCyc could help (though I have since found that ResearchCyc is necessary for Natural Language Processing). Here is how OpenCyc is described:

OpenCyc is the open source version of the Cyc technology, the world's largest and most complete general knowledge base and commonsense reasoning engine. OpenCyc can be used as the basis of a wide variety of intelligent applications such as:
rapid development of an ontology in a vertical area
email prioritizing, routing, summarization, and annotating
expert systems
games
to name just a few.

We could use OpenCyc to act as the "brains" of an "NPC" system, make 5k NPCs, and set them up as friends with one another in a "real world" that exists outside of the TA 2 platform itself, and make them babble for a simulated year.

I think we'd have to add some concepts to ResearchCyc (maybe they already exist) to allow each NPC to have interests so the NPC would discuss certain categories of things more often ("I like sports!") ("I like eggs! Free-range eggs!") and we could give them a news feed from some RSS setups that would let them act like they know things about "the world" that they could discuss on the "social media test platform."
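Setting Cyc aside, the "interests" idea can be sketched with nothing more than weighted random choice. Everything here (the `NPC` class, the topic names, the weights) is a hypothetical illustration, not something ResearchCyc or OpenCyc provides.

```python
import random
from dataclasses import dataclass


@dataclass
class NPC:
    name: str
    interests: dict  # topic -> relative weight (made-up values)

    def pick_topic(self, rng):
        """Choose what to talk about, favoring higher-weighted interests."""
        topic_names = list(self.interests)
        weights = [self.interests[t] for t in topic_names]
        return rng.choices(topic_names, weights=weights, k=1)[0]


rng = random.Random(2011)  # fixed seed so a run can be repeated
npc = NPC("npc-0001", {"sports": 5, "eggs": 3, "news": 1})
topics = [npc.pick_topic(rng) for _ in range(1000)]
```

An NPC built this way would bring up sports far more often than news, which is the behavior described above; the actual sentence generation would still have to come from somewhere else.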

Thursday, September 22, 2011

Test Data for a TA 2 Simulated Environment

I was thinking today about populating a test social network (see TA 2 Simulation Technology and Simulation Platforms). To me, that means creating a bunch of test data to drive that test network. Hmm. Social networks have the following coarse entity types:

Users
Posts
Friendships
Re-posts (+1 or "Share on XYZ social network" buttons)
Fan pages (for artists or movies or companies)
Discussions related to posts
Relationships between users and between fan pages
Relationships between posts (your friends are posting about XYZ idea too!)

So I can imagine large lists of these things in a test data database and a system for dipping into the lists for use in the simulated social network system. One thing that strikes me is that as the test data are touched, they might be "used up" for that simulation run (i.e. it makes no sense to have the same user join the test social network 200 times... so we'd need 2000-5000 test users and maybe more to account for people leaving the test social network by closing their accounts). That means the test database needs to have the facility to note the relationships between these entities as the simulation is run. The relationships need to mirror the test social network's capabilities. Example:

Simulation picks user to sign up (pick a random user from the list and then cross it off)
Simulation picks user to sign up (cannot be first user again unless the user comes with a pseudonym) (decide pseudonym and if not, pick a different random user and mark it used)
Simulation picks post for user 1 to contribute (pick a random post from the list and cross it off)
Simulation picks post for user 1 to contribute (cannot be first post again unless user 2 is a pseudonym of 1) (pick another random post...)
Simulation picks post for user 2 to contribute (might be either post 1 or 2 or some other post) (pick another random post)
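The "pick a random one and cross it off" steps above could look something like the sketch below. The `Pool` class and the `user-N` / `post-N` names are placeholders invented for this example.

```python
import random


class Pool:
    """A list of test items where each item can be drawn only once per run."""

    def __init__(self, items):
        self._available = list(items)
        self.used = []

    def draw(self, rng):
        """Pick a random unused item and cross it off."""
        if not self._available:
            raise LookupError("pool is used up")
        i = rng.randrange(len(self._available))
        item = self._available.pop(i)
        self.used.append(item)
        return item


rng = random.Random(0)
users = Pool(f"user-{n}" for n in range(5))
posts = Pool(f"post-{n}" for n in range(5))

first_user = users.draw(rng)   # sign up a user
second_user = users.draw(rng)  # cannot be the first user again
first_post = posts.draw(rng)   # a post for the first user to contribute
```

Pseudonyms are not handled here; they would need a rule for putting a user back into the pool under a different name.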

We'll want to reuse the test data. These "used-up" qualities mean that the test database needs to have an "at rest" state and a "simulation in progress" state so that we can clean out the relationships for another run. If we want repeatability, the test database itself needs a method for recreating the relationships in the same way every time we ask (perhaps a fixed order of events, like the numbers above).
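Here is a minimal sketch of the "at rest" versus "in progress" idea, assuming a seeded random number generator is an acceptable way to get repeatability. The function name, the 50/50 signup-or-post rule, and the event tuples are all assumptions for illustration.

```python
import random


def run_simulation(users, posts, seed, steps):
    """Replay a run from the at-rest lists; the inputs are never modified."""
    rng = random.Random(seed)
    free_users = list(users)  # working copies are the "in progress" state
    free_posts = list(posts)
    joined = []
    events = []
    for _ in range(steps):
        if free_users and (not joined or rng.random() < 0.5):
            user = free_users.pop(rng.randrange(len(free_users)))
            joined.append(user)
            events.append(("signup", user))
        elif free_posts and joined:
            author = rng.choice(joined)
            post = free_posts.pop(rng.randrange(len(free_posts)))
            events.append(("post", author, post))
    return events


users = [f"user-{n}" for n in range(10)]
posts = [f"post-{n}" for n in range(20)]
run_a = run_simulation(users, posts, seed=42, steps=15)
run_b = run_simulation(users, posts, seed=42, steps=15)
```

Because each run works on copies, "cleaning out the relationships" is just throwing the copies away, and asking for the same seed again recreates the same sequence of events.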

To get a handle on the size of this test database, consider users and posts. An attribute of a user or fan page is its posting frequency (posts per day). Given that each of those entities will have its own posting frequency, we need to understand the total number of posts that a 5000-user social network can generate in the course of t days. With m posters (users), f the average posting frequency (posts per user per day), and t the duration of the simulation (days), the simple equation for total posts in the system is p = mft. Also, some "real" discussions get more play than others, so we will need a lot of fake discussions....
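As a quick check of p = mft with the suggested numbers (5000 users, 0.1 posts per user per day, one year), a tiny sketch; the function name is just for illustration:

```python
def total_posts(m, f, t):
    """Total posts p = m * f * t, rounded to whole posts."""
    return round(m * f * t)


# 5000 users, 0.1 posts per user per day, 365 days
p = total_posts(m=5000, f=0.1, t=365)
```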

A whole test database... wow:

t days worth of activity (suggest 1 year)
m user entities posting at f posts/day (suggest 5k users per spec) (suppose average is 0.1 posts/day but has wide variation, thanks Gauss)
p posts (suggests 182,500 posts!!! all unique!! None can have junk data because we need to toss the meme detector at the posts!)
discussions with more or less play (again no junk data for the meme detector)
it's already huge and relationships between users, reposts still need to be treated; I think I need another article to write about all those things. (use some kind of connectedness term to show relationships/user.... I have over 200 social networking friends, for example) (posts visible to friends might be reacted to by friends... provoking posts etc but not to exceed the average posts per day for that specific user entity)
mechanisms for providing "random" event data
mechanisms for providing "repeatable" event data
mechanisms for relating entities that mirror the capabilities of the test social network
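The "wide variation, thanks Gauss" part of the list above could be sketched by drawing each user's posting frequency from a normal distribution. The standard deviation of 0.05 posts/day is a guess made up for this example, and clipping negative draws to zero nudges the average slightly above 0.1.

```python
import random


def sample_frequencies(m, mean, stddev, rng):
    """Draw one posting frequency (posts/day) per user, clipped at zero."""
    return [max(0.0, rng.gauss(mean, stddev)) for _ in range(m)]


rng = random.Random(1)  # fixed seed for a repeatable test database
freqs = sample_frequencies(m=5000, mean=0.1, stddev=0.05, rng=rng)

# Expected posts over a one-year simulation, summed over all users.
expected_posts = sum(f * 365 for f in freqs)
```

The total should land near the p = mft estimate for the same averages, while individual users range from nearly silent to quite chatty.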

It might be possible to build the whole test data system as a modified copy of the test social network system because the test social network system has specific rules for operations in the system (i.e. you must be a user before you can post) and a copy of the test social network system could enforce the same rules as the database is created, examined, copied, or reset.

Beta Meme Classification Taxonomy

This is my first shot at creating a meme classification system. I beg your indulgence for all the things I glossed over and the abstraction of some of these ideas. Since Richard Dawkins first introduced the meme, meme people have struggled to define it. Some definitions focus on the genetics version that Dawkins used, while others prefer the virus model. Because of this schism there has never been a clear definition of a meme, but to do this classification I had to settle on one.

Meme Definition

I am going to start with Dr. Finkelstein's definition of a meme. He has argued for the genetics definition but his final version is developed from a viral standpoint and is very short.

A meme is information which propagates, persists and has impact. (Finkelstein, 2008:15)

Based on this definition, any classification system has to address how a meme propagates, persists and has impact. My system will try to do this, though I think it is done through a combination of facets and chains.

T Reynolds Meme Taxonomy

In my previous post on classification, I suggested the biological taxonomy made a good model for a meme classification system. Since then I have reviewed the virus taxonomy; however, I found it lacking and decided to return to the biological taxonomy. The taxonomy consists of chains called:

Life, Domain, Kingdom, Phylum, Class, Order, Family, Genus and Species.

One look at memes, and you know they don't stand alone. Without going into Kantian arguments about knowledge, I think we know memes rank somewhere along the chain as a Domain. The Kingdom chain consists of the facets of genetic approach and virus approach. The genetic approach looks at epic cultural shifts. The virus or SARS approach looks at smaller shifts occurring over shorter time frames. From this point on I am developing the Viral Kingdom.

The next chain down in the biological taxonomy is Phylum. The most important part of Finkelstein's definition is the idea of propagation or replication. This is the delivery system of the meme. I therefore use the Phylum to look at the format of the meme, with facets being: Digital (non Internet), Internet (non Social Network), Social Network, Hard copy or non digital. The next chain is the Class, known as media type, made up of facets like: video, audio and print.

The next chains deal with how the meme makes its impact. The first of these is the Order, or Communication type. Here I use Thomas A. Sebeok's different types of instinctive messages to define my facets. These are the four different types of messages we send out. The first is the monologue, which is delivered without regard to the receiver... it's just broadcast. The next one is phatic communication, or conversations. The third is the Emotive message; these are sent to explain the condition of the sender to the receiver. (I have been made). The last is the Vocative and Imperative, or simply put, commands (Campbell, 1968:667). The sixth chain is Family, or Method. Here I use Roland Barthes' idea of symbols, signs and signals to define my facets (Adams 2005:88).

The final two chains are Genus/Style and Species/Target(s). The Genus/Style facets are made up of the many different styles one uses. Here is a short list: humor, academic, news, religious, political, anger. Finally we have the Species/Target, made up of the facets: Individual(s), Group, Organization, Government or Geographic area.

Example

Now Tony suggested we run our blogs through the classification system. I'll start with this blog.

Domain: Meme
Kingdom: Virus (designed to spread an idea)
Phylum/Format: Social Network (blog posting)
Class/Media Type: Print (It is made up of words)
Order/Communication Type: Monologue (Broadcast to all)
Family/Method: sign (I am defining a meme not making a meme)
Genus/Style: Academic (Formal research)
Species/Target: Group (those who read this blog: the SMiSC community)
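For anyone who wants to try the taxonomy in software, here is a minimal sketch of the chains as a record that rejects facets not named in this post. The class name, field names, and facet spellings are my own choices for illustration; Genus/Style is left unvalidated because the post gives only a short, open-ended list.

```python
from dataclasses import dataclass

# Facets named in the post for each chain (spellings chosen for this sketch).
FACETS = {
    "kingdom": {"Genetic", "Virus"},
    "phylum_format": {
        "Digital (non Internet)",
        "Internet (non Social Network)",
        "Social Network",
        "Hard copy (non digital)",
    },
    "class_media": {"Video", "Audio", "Print"},
    "order_communication": {"Monologue", "Phatic", "Emotive", "Vocative and Imperative"},
    "family_method": {"Symbol", "Sign", "Signal"},
    "species_target": {"Individual(s)", "Group", "Organization", "Government", "Geographic area"},
}


@dataclass(frozen=True)
class MemeClassification:
    kingdom: str
    phylum_format: str
    class_media: str
    order_communication: str
    family_method: str
    genus_style: str  # open-ended list, so not validated here
    species_target: str
    domain: str = "Meme"

    def __post_init__(self):
        # Reject any facet that is not listed for its chain.
        for chain, allowed in FACETS.items():
            value = getattr(self, chain)
            if value not in allowed:
                raise ValueError(f"{value!r} is not a facet of {chain}")


# The blog-post example above, expressed as a record.
this_blog = MemeClassification(
    kingdom="Virus",
    phylum_format="Social Network",
    class_media="Print",
    order_communication="Monologue",
    family_method="Sign",
    genus_style="Academic",
    species_target="Group",
)
```

A record like this makes the overlap between Format and Media type easy to spot, since both would have to be filled in for every meme.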

Some concerns

In this first attempt I stuck to the 8 chains of the biological taxonomy, so my system is a bit restrictive and probably needs one or two more chains. Specifically, something has to be done between the Format and Media type. There is too much overlap and confusion here.

Next, most memes that SMiSC wants to monitor will be of the Order Vocative and Imperative, which is often synchronous communication between sender and receiver. When dealing with social networks you're talking Twitter at best, but most likely IM systems built into a social network, such as Facebook IM. This means much of the important communication will be lost unless these systems are monitored.

Works Cited

Finkelstein, Richard
2008 Compendium and a Military Memetics Overview. Robotic Technology Inc.

Adams, Paul
2005 The Boundless Self: Communication in Physical and Virtual Spaces. Space, Place, and Society. Syracuse, N.Y.: Syracuse University Press.

Campbell, Joseph
1968 Creative Mythology. N.Y., N.Y.: Penguin Compass.

Monday, September 12, 2011

Meme Classification

The more I think and write on this subject, the more I am drawn to the conclusion that a classification system for memes needs to be developed. Without it SMiSC can neither spot a meme attack, defend against one, nor develop its own meme attack. Simultaneously, a definition of meme will also need to be developed.

There are many ways to classify something. In the purest sense, classification is the organization of knowledge (memes) into a systematic order. Most systems are hierarchical, dividing the knowledge into categories and subcategories consisting of facets (characteristics of something, in our case memes), arrays (horizontal links of facets), and chains (vertical connections of facets).

Modern classification schemes focus on facet analysis and synthesis, that is, breaking a subject up and reassembling it so as to identify the basic related facets (Chan 1994, pp. 259-261).

While these general principles come from information science, they hold true for most systems. However, when discussing memes, most people would tend to think more of the biological taxonomies. The main reason is that the two approaches to memes, whether genetic or viral, are biological. I'll not bore you with the history of the approach except to say the modern system is known as the evolutionary system. Here the lead facet and first array is called Kingdoms, with chains known as phylum, class, order, family, genus, and last, species. This system is extremely complex and often experiences great upheavals, either because designers seek to streamline or reorder the system or because something new is discovered, such as the recently discovered A. sediba, which will reorganize the chain Homo. (Biological classification, accessed September 12, 2011)

With this in mind, SMiSC needs to be thought of in terms of arrays and chains. For example, the first array might be seen as Culture, with facets of: Political, Religious, Business, Race. The next chain of arrays could then be Geography, let's say by country. A final chain could be made of facets of approach, such as: humor, patriotic, academic, news. Naturally, the real classification would include clear definitions to guide the process.

Now let's look at an example of a meme. “Pollocks are stupid” often carried in the joke “How do you sink a Polish submarine?...open the screen door.” The meme could be classified as follows.

Culture: Race

Geography: Poland

Approach: Humor

This also works when it comes to building a meme, as required by SMiSC. First a mission or meme idea would be developed: embarrass a foreign leader. With the classification system in place we could begin from the bottom and work up: start with the approach (a fake news story), then pick the geography (in this case, the country the target is from), and finally the type of story (a business one).

Leader A from country Y has secretly ordered all his accounts moved to Switzerland as rioters break into banks.

Regardless of what system is developed, the more memes that are classified, the smoother SMiSC projects will work. The system could then be plugged into a GUI that would be added to systems that identify, track, build, and launch memes.

"Biological Classification" http://en.wikipedia.org/wiki/Biological_classification accessed September 12, 2011

Chan, Lois Mae. (1994) Cataloging and Classification. McGraw Hill NY, NY.

Sunday, September 4, 2011

Persuasion Campaigns and Influence Ops

Persuasion campaigns and influence ops are the key operational aspects of the SMiSC initiative. The term persuasion campaign lends itself to an extended effort to change someone's (a large group, probably) mind, while the sibling term influence op speaks to a single/very few focused efforts. That is, a persuasion campaign could be what our friends in the marketing field call a marketing campaign, and an influence op could be a commercial or a series of commercials. To extend the marketing analogy: the idea is the brand.

To me, the TA 1 technologies are intended to be early versions of a complex campaign/operation system, with the abilities to manually or automatically construct, launch, oversee, and evaluate either a campaign or an operation. When such a system is deployed, one can imagine the existence of a console similar to the screen shown in the now-pulled Chinese military hacking video clip. However, the SMiSC consoles for campaigns and ops will probably be chock full of great ideas and catchy carrier mechanisms to project them. Instead of an IP address combo box and an attack button, the SMiSC consoles could have much more devious features:

List of ideas (all of which came through some bureaucracy of approvals) (brand! think brand!)

List of carrier mechanisms in which to embed the idea (also approved) (think commercial!)

A viewer for the combined idea and carrier

A simulator preview mode (using TA 3 technologies, the idea and carrier would be inserted into the TA 2 system for immediate-mode efficacy modeling)

The attack button could be in Chinese, too, just to make a new "tradition."

The operator could specify constraints on the initial placement of the attack (but come on, these things are intended to spread like wildfire, so who'd be kidding whom?)

The SMiSC console would really just be the GUI to the attack/injection system.

The attack/injection system itself will be covered in a forthcoming post. What we do with it has already been covered.