Naughty Words & Content Source

A Quick Comparison of the Front Page to the Incoming Stream
With Regard to Swear Word Usage and Source of Content
(for the default subreddits only)

Raw Data

Naughty Words

Obviously, not all of these are naughty, but (for whatever reason) the naughtier ones caught my interest, so I ended up running with those a lot further. Tabulated below is the relative frequency of select phrases found in author (user) names and in submission titles, each broken down by Front Page vs. overall New Submissions.

(each # Front and # New cell reads: count, relative frequency)

                                   Author Name                              Submission Title
Phrase (conditions)                # Front       # New         Ratio (F/N)  # Front       # New         Ratio (F/N)
'reddit' (exact match)             2, 0.0001     0, 0.0000     na           0, 0.0000     1, 0.0000     na
'reddit' (in)                      169, 0.0043   266, 0.0047   0.9150       996, 0.0254   4105, 0.0726  0.3494
'bots' (in sequence)               35, 0.0009    44, 0.0008    1.1456       1, 0.0000     3, 0.0001     0.4801
'cute', 'funny', or 'interesting'  36, 0.0009    53, 0.0009    0.9782       232, 0.0059   552, 0.0098   0.6053
swear word (short list)            299, 0.0076   484, 0.0086   0.8897       544, 0.0138   826, 0.0146   0.9485
'assh' (in word)                   18, 0.0005    18, 0.0003    1.4402       32, 0.0008    58, 0.0010    0.7946
'fuck' (in)                        76, 0.0019    158, 0.0028   0.6927       188, 0.0048   261, 0.0046   1.0374
'shit' (in sequence)               73, 0.0019    94, 0.0017    1.1184       150, 0.0038   248, 0.0044   0.8711
fuck_shit (either/or)              146, 0.0037   247, 0.0044   0.8513       331, 0.0084   502, 0.0089   0.9496

So if that's not clear (and why wouldn't it be):
Phrase: regex searched for
Author: search done in author name
Submission Title: search done in title
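Front: occurrences on the front page (count, then share of all front page submissions)
New: occurrences in the new stream (count, then share of all new stream submissions)
Ratio (F/N): the Front share divided by the New share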
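
The Ratio column can be reproduced from the raw counts. What follows is only a minimal Python sketch, not the script actually used: the function name and the TOTALS_RATIO constant are made up for illustration. The overall totals are not listed on this page, but the 'assh' Author Name row (18 Front vs. 18 New, Ratio 1.4402) implies that the New Stream total is roughly 1.4402 times the Front Page total, and that back-calculated value is what the sketch assumes.

```python
# Assumption: Ratio (F/N) = (front_count / front_total) / (new_count / new_total).
# The totals are not given on this page; the 'assh' author row (18 vs. 18,
# ratio 1.4402) implies new_total / front_total is approximately 1.4402.
TOTALS_RATIO = 1.4402  # back-calculated and rounded, so results are approximate


def front_to_new_ratio(front_count, new_count, totals_ratio=TOTALS_RATIO):
    """Front Page share divided by New Stream share, or None for 'na'."""
    # The table shows 'na' in both rows where one of the counts is zero.
    if front_count == 0 or new_count == 0:
        return None
    return (front_count / new_count) * totals_ratio


if __name__ == "__main__":
    # swear word (short list), Author Name: table value 0.8897
    print(front_to_new_ratio(299, 484))
    # soundcloud, from the Domain table: table value 2.2133
    print(front_to_new_ratio(292, 190))
```

Because the constant is rounded to four places, results can differ from the table in the last digit.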

So, for the 'swear word' entry (the most interesting search IMO):
The regex is 'fuck|shit|assh[a|o]|bitch|dick|penis|(?<!ppy|pea)cock(?!y)|vagina'

  For the Author's Name:
    299 instances in the Front Page data, comprising 0.76% of the total
    484 instances in the New Stream data, comprising 0.86% of the total
    Ratio of 0.89 (under 1) means this type of post was less likely than the norm to hit the front page

  For the Submission Title:
    544 instances in the Front Page data, comprising 1.38% of the total
    826 instances in the New Stream data, comprising 1.46% of the total
    Ratio of 0.95 (again under 1) means this type of post was less likely than the norm to hit the front page
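
To see what the short-list regex does and does not catch, here is a small Python sketch. The pattern is copied exactly as quoted above; the has_swear helper and the sample strings are made up for illustration. Whether the original search ignored case is not stated, so no flags are assumed here.

```python
import re

# The pattern exactly as quoted above (no flags assumed, so it is case-sensitive).
SWEAR = re.compile(
    r'fuck|shit|assh[a|o]|bitch|dick|penis|(?<!ppy|pea)cock(?!y)|vagina'
)


def has_swear(text):
    """True if the short-list pattern matches anywhere in text."""
    return SWEAR.search(text) is not None


if __name__ == "__main__":
    # Hypothetical sample strings, not taken from the data set.
    for sample in ["asshole", "peacock", "cocky", "cockatoo", "dickens", "Dick"]:
        print(sample, has_swear(sample))
```

The lookbehind (?<!ppy|pea) and the lookahead (?!y) are what keep 'peacock' and 'cocky' out of the count, while something like 'cockatoo' or 'dickens' still gets counted. Also, inside square brackets the | is a literal character, so [a|o] matches 'a', 'o', or '|'; [ao] would mean the same thing for normal text.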

Perhaps the most interesting thing in all this is that a post appears to have a greater chance (if only a slightly greater chance, at a ratio of 1.0374) of hitting the front page if the title contains the word 'fuck'. Who would have fucking believed it?

Domain: Content Origination

Pretty much the same thing as before (OK, exactly the same thing as before), but tracking the domain (imgur, tumblr, etc.) where the content was originally posted (linked from).

Domain (original content source)  # Front   # New     Ratio (F/N)
self (reddit)                     15129     27043     0.8057
imgur                             11933     16643     1.0326
youtube                           3931      5145      1.1004
wiki                              634       558       1.6363
soundcloud                        292       190       2.2133
gfycat                            103       87        1.7050
tumblr                            93        66        2.0293
twitter                           21        18        1.6802

Obviously, self posts (text) and imgur are the most numerous. What is perhaps more surprising is the high ratio (conversion) of the SoundCloud and Tumblr material, both above 2. Since I'm not into music, I would never have expected this, as I consider reddit a text-based visual medium. But even the conversion (popularity among browsers) for YouTube videos surpasses both text and images.
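
For anyone wanting to repeat the domain tally, a rough Python sketch of the bucketing follows. It is not necessarily how the numbers above were produced. It assumes each submission comes with a domain string such as 'i.imgur.com' or 'self.pics', and that a simple substring test decides the bucket; the bucket and tally functions, the BUCKETS list, and the 'other' catch-all are all hypothetical.

```python
from collections import Counter

# Bucket names taken from the table above; matching by substring is an assumption.
BUCKETS = ['imgur', 'youtube', 'wiki', 'soundcloud', 'gfycat', 'tumblr', 'twitter']


def bucket(domain):
    """Map a domain string to one of the table's rows, or 'other'."""
    domain = domain.lower()
    if domain.startswith('self.'):  # assumed form of a text (self) post
        return 'self'
    for name in BUCKETS:
        if name in domain:
            return name
    return 'other'


def tally(domains):
    """Count how many submissions fall into each bucket."""
    return Counter(bucket(d) for d in domains)


if __name__ == "__main__":
    # Hypothetical sample, not taken from the data set.
    sample = ['i.imgur.com', 'imgur.com', 'self.pics', 'en.wikipedia.org']
    print(tally(sample))
```

Run once over the Front Page data and once over the New Stream data, the two tallies give the # Front and # New columns. A substring test this loose will also sweep in look-alike domains (anything containing 'wiki', for instance).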

© 2015 copyright Brett Paufler