Thursday, September 22, 2011

Test Data for a TA 2 Simulated Environment

I was thinking today about populating a test social network (see TA 2 Simulation Technology and Simulation Platforms).  To me, that means creating a bunch of test data to drive that test network.  Hmm.  Social networks have the following coarse entity types:
  • Users
  • Posts
  • Friendships
  • Re-posts (+1 or "Share on XYZ social network" buttons)
  • Fan pages (for artists or movies or companies)
  • Discussions related to posts
  • Relationships between users and between fan pages
  • Relationships between posts (your friends are posting about XYZ idea too!)
So I can imagine large lists of these things in a test data database, plus a system for dipping into the lists on behalf of the simulated social network system.  One thing that strikes me is that as the test data are touched, they might be "used up" for that simulation run.  It makes no sense to have the same user join the test social network 200 times, so we'd need 2000-5000 test users, and maybe more to account for people leaving the test social network by closing their accounts.  That means the test database needs a facility for noting which entities have been used, and the relationships between them, as the simulation runs.  The relationships need to mirror the test social network's capabilities.  Example:
  1. Simulation picks a user to sign up (pick a random user from the list and cross it off).
  2. Simulation picks another user to sign up.  This cannot be the first user again unless that user comes back under a pseudonym (decide whether it is a pseudonym; if not, pick a different random user and mark it used).
  3. Simulation picks a post for user 1 to contribute (pick a random post from the list and cross it off).
  4. Simulation picks another post for user 1 to contribute.  This cannot be the first post again unless user 2 is a pseudonym of user 1 (pick another random post...).
  5. Simulation picks a post for user 2 to contribute.  This might be post 1, post 2, or some other post (pick another random post).
We'll want to reuse the test data.  These "used-up" qualities mean that the test database needs to have an "at rest" state and a "simulation in progress" state so that we can clean out the relationships for another run.  If we want repeatability, the test database itself needs a method for recreating the relationships in the same way every time we ask (perhaps a fixed order of events, like the numbers above).
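The "cross it off", "at rest", and "repeatable" ideas above can be sketched in a few lines of Python.  This is only a minimal sketch under my own assumptions: the class name EntityPool, its methods, and the use of a seeded random generator for repeatability are all hypothetical, not part of any existing test system.

```python
import random


class EntityPool:
    """Hypothetical pool of test entities that get 'used up' during a run."""

    def __init__(self, items, seed=None):
        self._items = list(items)  # the "at rest" data, never modified
        self._seed = seed          # a fixed seed gives a repeatable draw order
        self.reset()

    def reset(self):
        """Go back to the 'at rest' state so the data can be reused."""
        self._available = list(self._items)
        self._rng = random.Random(self._seed)

    def draw(self):
        """Pick a random unused item and cross it off the list."""
        if not self._available:
            raise LookupError("test data pool is exhausted")
        index = self._rng.randrange(len(self._available))
        return self._available.pop(index)

    def remaining(self):
        """Number of items not yet used in this run."""
        return len(self._available)


# Example: two sign-ups in one run never pick the same test user.
users = EntityPool([f"user{i}" for i in range(5000)], seed=2011)
first_user = users.draw()
second_user = users.draw()
```

Calling reset() and drawing again with the same seed replays the same sequence of picks, which is one way to get the "same way every time we ask" behavior.  Pseudonyms would need extra bookkeeping that this sketch leaves out.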

To get a handle on the size of this test database, consider users and posts.  One attribute of a user or fan page is its posting frequency (posts per day).  Given that each of those entities has its own posting frequency, we need to understand the total number of posts that a 5000-user social network can generate over the course of t days.  With m posters (users), f the average posting frequency (posts per user per day), and t the duration of the simulation (days), the simple equation for the total number of posts in the system is p = m × f × t.  Also, some "real" discussions get more play than others, so that means we will need a lot of fake discussions....
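As a quick check of that equation, here is a small Python sketch.  It plugs in the figures suggested further down in this post (5000 users, an average of 0.1 posts per day, one year).  The per-user sampling is my own assumption: the function names and the standard deviation of 0.05 are made up for illustration, and clipping negative draws to zero nudges the average slightly above 0.1.

```python
import random


def total_posts(m, f, t):
    """p = m * f * t: posters, average posts per poster per day, days."""
    return m * f * t


def sample_frequencies(m, mean, stddev, seed):
    """Draw one posting frequency per user from a Gaussian, clipped at zero."""
    rng = random.Random(seed)
    return [max(0.0, rng.gauss(mean, stddev)) for _ in range(m)]


# Simple estimate from the averages alone.
simple_estimate = total_posts(5000, 0.1, 365)

# Estimate with per-user variation (stddev of 0.05 is an assumed value).
frequencies = sample_frequencies(5000, mean=0.1, stddev=0.05, seed=1)
varied_estimate = sum(frequencies) * 365

print(round(simple_estimate), round(varied_estimate))
```

The two estimates should land close to each other; the second one just shows how the total could be built up from individual users once each has its own posting frequency.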

A whole test database... wow:
  • t days' worth of activity (suggest 1 year)
  • m user entities posting at f posts/day (suggest 5k users per spec; suppose the average is 0.1 posts/day with wide variation, thanks Gauss)
  • p posts (that suggests 5,000 × 0.1 × 365 = 182,500 posts!!! All unique!! None can have junk data because we need to throw the meme detector at the posts!)
  • discussions with more or less play (again no junk data for the meme detector)
  • it's already huge, and relationships between users and re-posts still need to be treated; I think I need to write another article about all those things.  (Use some kind of connectedness term to express relationships per user; I have over 200 social networking friends, for example.  Posts visible to friends might be reacted to by friends, provoking further posts, but not so many as to exceed the average posts per day for that specific user entity.)
  • mechanisms for providing "random" event data
  • mechanisms for providing "repeatable" event data
  • mechanisms for relating entities that mirror the capabilities of the test social network
It might be possible to build the whole test data system as a modified copy of the test social network system.  The test social network system has specific rules for operations in the system (e.g. you must be a user before you can post), and a copy of it could enforce the same rules as the database is created, examined, copied, or reset.
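To make the rule-enforcement idea concrete, here is a minimal Python sketch.  Everything in it is hypothetical: the class NetworkRules, the exception RuleViolation, and the two rules shown (no duplicate accounts, no posting before signing up) are stand-ins for whatever rules the real test social network would have.

```python
class RuleViolation(Exception):
    """Raised when an operation breaks a rule of the simulated network."""


class NetworkRules:
    """Hypothetical stand-in for the test social network's own rules."""

    def __init__(self):
        self.users = set()
        self.posts = []  # list of (author, text) pairs

    def sign_up(self, user):
        # Assumed rule: the same user cannot join twice.
        if user in self.users:
            raise RuleViolation(f"{user} already has an account")
        self.users.add(user)

    def post(self, user, text):
        # Rule from the post: you must be a user before you can post.
        if user not in self.users:
            raise RuleViolation(f"{user} must sign up before posting")
        self.posts.append((user, text))

    def reset(self):
        """Clear all relationships so the test data can be reused."""
        self.users.clear()
        self.posts.clear()


# Example: building test data through the rules keeps it consistent.
rules = NetworkRules()
rules.sign_up("user1")
rules.post("user1", "first test post")
```

If every change to the test database goes through an object like this, the data cannot drift into a state the real network would never allow.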
