Brett Code
Python 2.7.5
Stripping HTML of Unicode, UTF, and Special Characters
This is the code I came up with to strip random web content of unwanted garbage (HTML, Unicode, UTF, <, ?, and so on).
It may not be elegant (don't ask me how much time I wasted trying to be elegant), but it works.
#To be used to clean up Title and comment entries
def scrubHTML(text):
    #common garbage on the pages I'm looking at,
    #add more entries as needed, or delete entirely if overkill for your application
    text = text.replace(";quot;", " ")
    text = text.replace(";#039;", " ")
    text = text.replace("&", " ")

    #text to lower makes the remainder easier to scrub
    text = text.lower()

    #list of letters & spaces
    #This is ALL that's getting through
    #everything else is blocked
    #THIS IS THE MAIN WORKING PART
    l1 = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m"]
    l2 = ["n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", " "]
    letters = l1 + l2
    tempString = ""
    for bP in text:
        if bP in letters:
            tempString += bP

    #This is probably overkill, but computers are fast
    #Between the two routines, all leading, trailing, and double spaces are removed
    #(each pass halves a run of spaces, so 10 passes handle runs of up to 1024)
    for r in range(0,10):
        tempString = tempString.replace("  "," ")
    tempString = tempString.strip()
    return tempString
#end def scrubHTML(text):
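
For comparison, here is a shorter sketch of the same idea: keep only a-z and spaces, then squeeze the spaces down. This is not the code I actually ran; the names scrub_html_compact and ALLOWED are made up for illustration, and it assumes the same three "garbage" strings as above. Like the original, it only filters characters, so the letters inside tags (the "b" in a bold tag, say) still get through.

```python
import string

# Hypothetical names for illustration; not part of the original routine.
# Whitelist: lowercase ASCII letters plus the space character.
ALLOWED = set(string.ascii_lowercase + " ")

def scrub_html_compact(text):
    """Keep only a-z and spaces, then collapse runs of spaces."""
    # Same pre-cleaning strings as scrubHTML (assumed, adjust for your pages)
    for junk in (";quot;", ";#039;", "&"):
        text = text.replace(junk, " ")
    text = text.lower()
    # Drop every character that is not on the whitelist
    kept = "".join(ch for ch in text if ch in ALLOWED)
    # split() with no arguments breaks on any run of whitespace and
    # ignores leading/trailing whitespace, so no repeat loop is needed
    return " ".join(kept.split())

print(scrub_html_compact("  Hello,   World! 123 "))  # prints: hello world
```

The split-and-join step replaces both the ten-pass loop and the strip() call, and it handles a run of spaces of any length.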
Copyright © 2014 Brett Paufler