I was bored over the weekend and reading through [Empirical Study of Topic Modeling in Twitter](http://snap.stanford.edu/soma2010/papers/soma2010_12.pdf), and thought it would be interesting to try the same analysis on comments taken from the Tea Party’s Facebook page.
Step One - Getting The Data
I am pretty familiar with Facebook’s Graph API from work at a previous start-up. I used the Graph API to pull the post ids from the page, and then used FQL to pull the comments for each post:

    # Python 2 script (urllib.urlopen and print statements).
    import time
    import urllib

    import pandas as pd
    import simplejson

    token = 'YOUR_ACCESS_TOKEN'  # placeholder: supply a valid Graph API access token

    # Example of a single post id: '10153066534910779'
    FQL_COMMENT = """
    select text from comment where object_id in (%s) limit 2000
    """

    def _get_post_ids_by_page_id(page_name, access_token):
        """Fetch the page's posts (ids only) from the Graph API as parsed JSON."""
        url = 'https://graph.facebook.com/{page_name}?fields=posts.id&access_token={access_token}'.format(
            page_name=page_name, access_token=access_token)
        return simplejson.loads(urllib.urlopen(url).read())

    # Ids come back as '<page id>_<post id>'; keep only the post id part.
    post_ids = [post['id'].split("_")[1] for post in
                _get_post_ids_by_page_id(page_name='teapartypatriots',
                                         access_token=token)['posts']['data']]

    for post_id in post_ids:
        url = "https://graph.facebook.com/fql?access_token=%s&q=%s" % (token, (FQL_COMMENT % post_id))
        json_comment = simplejson.loads(urllib.urlopen(url).read())
        print len(json_comment['data'])
        # One CSV per post, so the comments can be processed offline later.
        df = pd.DataFrame(json_comment['data'])
        df.to_csv("csvs/{}.csv".format(post_id))
        print "csvs/{}.csv".format(post_id)
        time.sleep(1)  # pause between requests

I saved the data in csv files using pandas so I could do offline processing and not have to keep hitting Facebook’s API.
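For the offline step, the saved files have to be read back into a single list of comments. The following is a minimal sketch in Python 3 using only the standard library, not the code I actually ran. It assumes the files sit in a `csvs/` directory, are UTF-8 encoded, and have the `text` column that the FQL query above selects; `load_comments` is a hypothetical helper name.

```python
import csv
import glob
import os


def load_comments(csv_dir="csvs"):
    """Return every non-empty comment text found in the CSV files under csv_dir."""
    comments = []
    for path in sorted(glob.glob(os.path.join(csv_dir, "*.csv"))):
        # newline="" lets the csv module handle line breaks inside quoted comments.
        with open(path, newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle):
                # Assumption: comment bodies live in a column named "text".
                text = (row.get("text") or "").strip()
                if text:
                    comments.append(text)
    return comments
```

Calling `load_comments()` once gives the whole corpus in memory, so nothing further needs to touch the API.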
Step Two - Running LDA on the Corpus
After you have the data pulled and in csv files, you need to run LDA (Latent Dirichlet Allocation) on the corpus to discover latent topics. It is important to preprocess the corpus before you start LDA: remove stop words (common words in the English language), make all words lower-case, remove punctuation, etc.
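The preprocessing described above can be sketched roughly as follows. This is an illustration rather than the exact routine behind the output below: the stop-word list is a tiny hypothetical sample (a real run would use a much fuller one), and dropping digits and one-letter leftovers such as the "t" in "doesn't" is my own assumption.

```python
import re

# Tiny illustrative stop-word list; an assumption, not the list used for this post.
STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for",
    "in", "is", "it", "of", "on", "or", "the", "to", "with",
}


def preprocess(comment, stop_words=STOP_WORDS, min_length=2):
    """Lower-case a comment, strip punctuation, and drop stop words."""
    lowered = comment.lower()
    # Replace anything that is not a lower-case letter or whitespace with a space.
    letters_only = re.sub(r"[^a-z\s]", " ", lowered)
    tokens = letters_only.split()
    return [t for t in tokens if len(t) >= min_length and t not in stop_words]


print(preprocess("The deal with Iran, is it done?"))
# ['deal', 'iran', 'done']
```

Each comment becomes a list of tokens, which is the usual input shape for an LDA implementation.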
corpus = 1, words = 14020, K = 10, a = 0.500000, b = 0.500000
initial perplexity = 2378.337280
-1 p = 2378.290024
-2 p = 2378.187329
-3 p = 2377.986063
-4 p = 2377.657654
-5 p = 2376.859515
-6 p = 2375.187447
-7 p = 2371.556214
-8 p = 2365.740169
-9 p = 2355.670145
-10 p = 2340.404687
-11 p = 2321.435137
-12 p = 2299.308017
-13 p = 2273.851541
-14 p = 2245.997043
-15 p = 2220.681310
-- topic: 0 ( 71271 words)
people: 0.014901 ( 1166)
country: 0.012602 ( 986)
obama: 0.010175 ( 796)
would: 0.008885 ( 695)
like: 0.008885 ( 695)
america: 0.008054 ( 630)
iran: 0.007927 ( 620)
american: 0.007773 ( 608)
need: 0.006777 ( 530)
know: 0.006751 ( 528)
congress: 0.006700 ( 524)
deal: 0.006649 ( 520)
year: 0.006611 ( 517)
think: 0.006458 ( 505)
this: 0.006228 ( 487)
president: 0.006202 ( 485)
never: 0.005883 ( 460)
state: 0.005768 ( 451)
government: 0.005742 ( 449)
thing: 0.005078 ( 397)
done: 0.004810 ( 376)
going: 0.004784 ( 374)
failed: 0.004771 ( 373)
illegal: 0.004541 ( 355)
make: 0.004414 ( 345)
world: 0.004196 ( 328)
problem: 0.004196 ( 328)
muslim: 0.004158 ( 325)
many: 0.004081 ( 319)
agree: 0.004005 ( 313)
trump: 0.003877 ( 303)
stop: 0.003737 ( 292)
party: 0.003711 ( 290)
still: 0.003647 ( 285)
everything: 0.003596 ( 281)
want: 0.003583 ( 280)
take: 0.003583 ( 280)
democrat: 0.003570 ( 279)
http: 0.003558 ( 278)
there: 0.003558 ( 278)
wrong: 0.003545 ( 277)
house: 0.003519 ( 275)
keep: 0.003443 ( 269)
work: 0.003392 ( 265)
much: 0.003366 ( 263)
bush: 0.003315 ( 259)
anything: 0.003264 ( 255)
enough: 0.003200 ( 250)
must: 0.003149 ( 246)
even: 0.003111 ( 243)
-- topic: 1 ( 6002 words)
great: 0.004880 ( 63)
came: 0.003881 ( 50)
real: 0.002805 ( 36)
post: 0.002267 ( 29)
first: 0.002113 ( 27)
president: 0.002037 ( 26)
obama: 0.001960 ( 25)
country: 0.001883 ( 24)
economy: 0.001806 ( 23)
john: 0.001806 ( 23)
donald: 0.001652 ( 21)
action: 0.001499 ( 19)
iran: 0.001499 ( 19)
america: 0.001345 ( 17)
especially: 0.001345 ( 17)
either: 0.001345 ( 17)
http: 0.001345 ( 17)
year: 0.001268 ( 16)
light: 0.001191 ( 15)
trump: 0.001191 ( 15)
voted: 0.001114 ( 14)
southern: 0.001114 ( 14)
your: 0.001114 ( 14)
think: 0.001114 ( 14)
last: 0.001114 ( 14)
mean: 0.001114 ( 14)
question: 0.001038 ( 13)
paid: 0.001038 ( 13)
shown: 0.001038 ( 13)
play: 0.001038 ( 13)
consequence: 0.001038 ( 13)
list: 0.000961 ( 12)
there: 0.000961 ( 12)
necessary: 0.000961 ( 12)
like: 0.000961 ( 12)
weapon: 0.000961 ( 12)
international: 0.000961 ( 12)
education: 0.000961 ( 12)
right: 0.000884 ( 11)
could: 0.000884 ( 11)
care: 0.000884 ( 11)
history: 0.000884 ( 11)
based: 0.000884 ( 11)
youtube: 0.000884 ( 11)
impeachment: 0.000884 ( 11)
recognize: 0.000884 ( 11)
trade: 0.000884 ( 11)
allowed: 0.000884 ( 11)
contractor: 0.000807 ( 10)
senator: 0.000807 ( 10)
-- topic: 2 ( 576 words)
election: 0.000461 ( 3)
recession: 0.000461 ( 3)
plus: 0.000330 ( 2)
surly: 0.000330 ( 2)
football: 0.000330 ( 2)
sentence: 0.000330 ( 2)
polarize: 0.000330 ( 2)
iran: 0.000330 ( 2)
sound: 0.000330 ( 2)
fault: 0.000330 ( 2)
threat: 0.000330 ( 2)
payer: 0.000330 ( 2)
liable: 0.000330 ( 2)
making: 0.000330 ( 2)
greater: 0.000330 ( 2)
freddie: 0.000330 ( 2)
islamic: 0.000330 ( 2)
assumption: 0.000330 ( 2)
smith: 0.000330 ( 2)
drowning: 0.000330 ( 2)
deliverance: 0.000330 ( 2)
crybaby: 0.000330 ( 2)
cant: 0.000330 ( 2)
twitter: 0.000330 ( 2)
chelsea: 0.000330 ( 2)
unicorn: 0.000330 ( 2)
opportunity: 0.000330 ( 2)
explains: 0.000330 ( 2)
depression: 0.000330 ( 2)
crony: 0.000330 ( 2)
taught: 0.000198 ( 1)
monger: 0.000198 ( 1)
milsao: 0.000198 ( 1)
timeline: 0.000198 ( 1)
stonewalls: 0.000198 ( 1)
oboehner: 0.000198 ( 1)
reread: 0.000198 ( 1)
jewsnews: 0.000198 ( 1)
grain: 0.000198 ( 1)
desecrate: 0.000198 ( 1)
screening: 0.000198 ( 1)
flopped: 0.000198 ( 1)
newfeed: 0.000198 ( 1)
reaps: 0.000198 ( 1)
sending: 0.000198 ( 1)
challenge: 0.000198 ( 1)
standard: 0.000198 ( 1)
sour: 0.000198 ( 1)
enterprise: 0.000198 ( 1)
primeminster: 0.000198 ( 1)
-- topic: 3 ( 3570 words)
china: 0.002127 ( 22)
fund: 0.001654 ( 17)
poll: 0.001654 ( 17)
muslim: 0.001276 ( 13)
also: 0.001276 ( 13)
must: 0.001181 ( 12)
regulation: 0.001181 ( 12)
read: 0.001087 ( 11)
care: 0.000992 ( 10)
bringing: 0.000898 ( 9)
law: 0.000898 ( 9)
much: 0.000898 ( 9)
within: 0.000898 ( 9)
forced: 0.000898 ( 9)
time: 0.000803 ( 8)
hope: 0.000803 ( 8)
thank: 0.000803 ( 8)
sanction: 0.000803 ( 8)
american: 0.000803 ( 8)
forgotten: 0.000803 ( 8)
boehner: 0.000803 ( 8)
federal: 0.000709 ( 7)
result: 0.000709 ( 7)
running: 0.000709 ( 7)
elected: 0.000709 ( 7)
country: 0.000709 ( 7)
system: 0.000709 ( 7)
appropriate: 0.000709 ( 7)
wait: 0.000709 ( 7)
candidate: 0.000709 ( 7)
news: 0.000709 ( 7)
authority: 0.000709 ( 7)
speak: 0.000709 ( 7)
buddy: 0.000709 ( 7)
mean: 0.000709 ( 7)
couple: 0.000709 ( 7)
chance: 0.000709 ( 7)
business: 0.000709 ( 7)
pointing: 0.000614 ( 6)
somebody: 0.000614 ( 6)
ring: 0.000614 ( 6)
corporation: 0.000614 ( 6)
lot: 0.000614 ( 6)
leg: 0.000614 ( 6)
watch: 0.000614 ( 6)
east: 0.000614 ( 6)
represents: 0.000614 ( 6)
good: 0.000614 ( 6)
completely: 0.000614 ( 6)
student: 0.000614 ( 6)
-- topic: 4 ( 12657 words)
think: 0.008924 ( 175)
obama: 0.006483 ( 127)
want: 0.004652 ( 91)
people: 0.004398 ( 86)
idiot: 0.004042 ( 79)
great: 0.003839 ( 75)
failed: 0.003076 ( 60)
iran: 0.002822 ( 55)
come: 0.002720 ( 53)
become: 0.002568 ( 50)
email: 0.002517 ( 49)
like: 0.002466 ( 48)
action: 0.002364 ( 46)
others: 0.002364 ( 46)
another: 0.002263 ( 44)
past: 0.002263 ( 44)
treason: 0.002263 ( 44)
terrorist: 0.002110 ( 41)
country: 0.002110 ( 41)
nation: 0.002110 ( 41)
mean: 0.002110 ( 41)
article: 0.002008 ( 39)
also: 0.002008 ( 39)
everything: 0.001958 ( 38)
course: 0.001958 ( 38)
change: 0.001907 ( 37)
barack: 0.001805 ( 35)
corporation: 0.001805 ( 35)
democratic: 0.001805 ( 35)
industry: 0.001805 ( 35)
really: 0.001754 ( 34)
michael: 0.001754 ( 34)
enemy: 0.001754 ( 34)
president: 0.001703 ( 33)
time: 0.001703 ( 33)
soon: 0.001653 ( 32)
vote: 0.001602 ( 31)
blame: 0.001602 ( 31)
office: 0.001602 ( 31)
they: 0.001602 ( 31)
back: 0.001551 ( 30)
keep: 0.001500 ( 29)
sanction: 0.001500 ( 29)
that: 0.001500 ( 29)
research: 0.001500 ( 29)
leader: 0.001500 ( 29)
impeachment: 0.001500 ( 29)
control: 0.001500 ( 29)
still: 0.001449 ( 28)
afraid: 0.001449 ( 28)
-- topic: 5 ( 4904 words)
trying: 0.002140 ( 25)
family: 0.001721 ( 20)
stand: 0.001721 ( 20)
criminal: 0.001721 ( 20)
party: 0.001469 ( 17)
line: 0.001469 ( 17)
lying: 0.001385 ( 16)
brain: 0.001301 ( 15)
page: 0.001217 ( 14)
daughter: 0.001217 ( 14)
what: 0.001217 ( 14)
hillary: 0.001133 ( 13)
also: 0.001133 ( 13)
make: 0.001049 ( 12)
back: 0.001049 ( 12)
finally: 0.001049 ( 12)
without: 0.001049 ( 12)
send: 0.001049 ( 12)
doesn: 0.000965 ( 11)
doubt: 0.000965 ( 11)
perhaps: 0.000965 ( 11)
marriage: 0.000881 ( 10)
least: 0.000881 ( 10)
special: 0.000881 ( 10)
nut: 0.000797 ( 9)
blow: 0.000797 ( 9)
scare: 0.000797 ( 9)
coverage: 0.000797 ( 9)
question: 0.000797 ( 9)
colin: 0.000797 ( 9)
pack: 0.000713 ( 8)
exchange: 0.000713 ( 8)
haven: 0.000713 ( 8)
these: 0.000713 ( 8)
pray: 0.000713 ( 8)
something: 0.000713 ( 8)
rino: 0.000713 ( 8)
standard: 0.000713 ( 8)
grow: 0.000713 ( 8)
real: 0.000713 ( 8)
democracy: 0.000713 ( 8)
forward: 0.000713 ( 8)
representative: 0.000713 ( 8)
obama: 0.000713 ( 8)
constituent: 0.000630 ( 7)
last: 0.000630 ( 7)
building: 0.000630 ( 7)
pick: 0.000630 ( 7)
guarantee: 0.000630 ( 7)
cover: 0.000630 ( 7)
-- topic: 6 ( 1492 words)
smarter: 0.000647 ( 5)
waiting: 0.000647 ( 5)
donald: 0.000529 ( 4)
clearly: 0.000529 ( 4)
command: 0.000529 ( 4)
income: 0.000529 ( 4)
were: 0.000529 ( 4)
rawnsleyb: 0.000529 ( 4)
line: 0.000529 ( 4)
whatever: 0.000412 ( 3)
refused: 0.000412 ( 3)
ever: 0.000412 ( 3)
like: 0.000412 ( 3)
trusted: 0.000412 ( 3)
suggestion: 0.000412 ( 3)
obese: 0.000412 ( 3)
overthrow: 0.000412 ( 3)
sarah: 0.000412 ( 3)
asking: 0.000412 ( 3)
clerk: 0.000412 ( 3)
bullfrog: 0.000412 ( 3)
remains: 0.000412 ( 3)
list: 0.000412 ( 3)
capitalism: 0.000412 ( 3)
maybe: 0.000412 ( 3)
favorite: 0.000412 ( 3)
respect: 0.000412 ( 3)
regulation: 0.000412 ( 3)
laid: 0.000412 ( 3)
reject: 0.000412 ( 3)
sensible: 0.000412 ( 3)
susan: 0.000412 ( 3)
productive: 0.000412 ( 3)
didn: 0.000412 ( 3)
regret: 0.000412 ( 3)
live: 0.000412 ( 3)
norm: 0.000412 ( 3)
given: 0.000412 ( 3)
prosecution: 0.000412 ( 3)
security: 0.000412 ( 3)
magazine: 0.000412 ( 3)
think: 0.000412 ( 3)
worldwide: 0.000412 ( 3)
follower: 0.000412 ( 3)
enabled: 0.000412 ( 3)
opening: 0.000412 ( 3)
leading: 0.000412 ( 3)
newly: 0.000412 ( 3)
created: 0.000412 ( 3)
christopher: 0.000294 ( 2)
-- topic: 7 ( 45838 words)
obama: 0.022659 ( 1197)
american: 0.010833 ( 572)
america: 0.010398 ( 549)
deal: 0.009811 ( 518)
time: 0.008524 ( 450)
right: 0.008487 ( 448)
president: 0.007881 ( 416)
want: 0.007805 ( 412)
need: 0.007162 ( 378)
think: 0.006859 ( 362)
back: 0.006727 ( 355)
vote: 0.006651 ( 351)
people: 0.006519 ( 344)
like: 0.006519 ( 344)
office: 0.005837 ( 308)
what: 0.005648 ( 298)
well: 0.004929 ( 260)
could: 0.004910 ( 259)
first: 0.004683 ( 247)
nothing: 0.004683 ( 247)
look: 0.004362 ( 230)
country: 0.004305 ( 227)
come: 0.004286 ( 226)
muslim: 0.004210 ( 222)
trump: 0.004172 ( 220)
republican: 0.004116 ( 217)
take: 0.004040 ( 213)
good: 0.003775 ( 199)
going: 0.003718 ( 196)
they: 0.003624 ( 191)
illegal: 0.003548 ( 187)
problem: 0.003415 ( 180)
failed: 0.003321 ( 175)
terrorist: 0.003302 ( 174)
care: 0.003264 ( 172)
know: 0.003264 ( 172)
voted: 0.003094 ( 163)
business: 0.003056 ( 161)
traitor: 0.003056 ( 161)
citizen: 0.003037 ( 160)
took: 0.002923 ( 154)
fact: 0.002905 ( 153)
control: 0.002848 ( 150)
long: 0.002848 ( 150)
white: 0.002772 ( 146)
http: 0.002715 ( 143)
iran: 0.002696 ( 142)
year: 0.002677 ( 141)
leave: 0.002659 ( 140)
hope: 0.002640 ( 139)
-- topic: 8 ( 71 words)
cultery: 0.000212 ( 1)
require: 0.000212 ( 1)
birthed: 0.000212 ( 1)
imminent: 0.000212 ( 1)
understanding: 0.000212 ( 1)
major: 0.000212 ( 1)
divided: 0.000212 ( 1)
bale: 0.000212 ( 1)
throughout: 0.000212 ( 1)
intrusion: 0.000212 ( 1)
somehow: 0.000212 ( 1)
welcome: 0.000212 ( 1)
groom: 0.000212 ( 1)
seagle: 0.000212 ( 1)
boener: 0.000212 ( 1)
poor: 0.000212 ( 1)
regulation: 0.000212 ( 1)
bully: 0.000212 ( 1)
arrest: 0.000212 ( 1)
current: 0.000212 ( 1)
faux: 0.000212 ( 1)
pjnet: 0.000212 ( 1)
reposting: 0.000212 ( 1)
strictly: 0.000212 ( 1)
uninformed: 0.000212 ( 1)
saddens: 0.000212 ( 1)
chaney: 0.000212 ( 1)
transaction: 0.000212 ( 1)
trading: 0.000212 ( 1)
none: 0.000212 ( 1)
routinely: 0.000212 ( 1)
notarized: 0.000212 ( 1)
quick: 0.000212 ( 1)
orchestrated: 0.000212 ( 1)
manipulate: 0.000212 ( 1)
bullshi: 0.000212 ( 1)
saveamericasfreedomfighters: 0.000212 ( 1)
atomic: 0.000212 ( 1)
tragic: 0.000212 ( 1)
february: 0.000212 ( 1)
creveld: 0.000212 ( 1)
humane: 0.000212 ( 1)
uranian: 0.000212 ( 1)
koolaide: 0.000212 ( 1)
vehicle: 0.000212 ( 1)
dead: 0.000212 ( 1)
washy: 0.000212 ( 1)
ihope: 0.000212 ( 1)
trojan: 0.000212 ( 1)
crew: 0.000212 ( 1)
-- topic: 9 ( 28955 words)
obama: 0.010190 ( 366)
country: 0.008022 ( 288)
every: 0.007049 ( 253)
that: 0.006882 ( 247)
life: 0.005881 ( 211)
muslim: 0.005186 ( 186)
know: 0.005074 ( 182)
iran: 0.004963 ( 178)
would: 0.004908 ( 176)
like: 0.004741 ( 170)
time: 0.004630 ( 166)
america: 0.004630 ( 166)
actually: 0.004630 ( 166)
they: 0.004407 ( 158)
first: 0.004379 ( 157)
government: 0.003934 ( 141)
great: 0.003907 ( 140)
didn: 0.003768 ( 135)
deal: 0.003656 ( 131)
israel: 0.003629 ( 130)
show: 0.003573 ( 128)
issue: 0.003517 ( 126)
what: 0.003462 ( 124)
money: 0.003295 ( 118)
many: 0.003239 ( 116)
president: 0.003184 ( 114)
happen: 0.003184 ( 114)
people: 0.003156 ( 113)
failed: 0.003128 ( 112)
make: 0.003128 ( 112)
nuclear: 0.003017 ( 108)
made: 0.002989 ( 107)
another: 0.002961 ( 106)
said: 0.002961 ( 106)
thank: 0.002933 ( 105)
good: 0.002878 ( 103)
citizen: 0.002794 ( 100)
enough: 0.002683 ( 96)
long: 0.002655 ( 95)
agenda: 0.002655 ( 95)
conservative: 0.002655 ( 95)
class: 0.002628 ( 94)
constitution: 0.002628 ( 94)
hillary: 0.002600 ( 93)
house: 0.002600 ( 93)
without: 0.002572 ( 92)
even: 0.002544 ( 91)
business: 0.002544 ( 91)
feel: 0.002544 ( 91)
american: 0.002544 ( 91)
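The `p = ...` lines near the top of the output appear to be the corpus perplexity after each iteration, which drops as the topics fit the comments better. Perplexity is conventionally the exponential of the negative average per-token log-likelihood. The sketch below shows that standard definition only; it is not taken from the tool that produced the log, and `docs`, `theta` and `phi` are hypothetical names for the tokenized corpus, the document-topic weights and the topic-word weights.

```python
import math


def perplexity(docs, theta, phi):
    """Perplexity of docs under document-topic weights theta and topic-word weights phi.

    docs  : list of documents, each a list of integer word ids
    theta : theta[d][k] = probability of topic k in document d
    phi   : phi[k][w]   = probability of word w under topic k
    """
    log_likelihood = 0.0
    n_tokens = 0
    for d, doc in enumerate(docs):
        for w in doc:
            # Marginal probability of the word: sum over all topics.
            p_word = sum(theta[d][k] * phi[k][w] for k in range(len(phi)))
            log_likelihood += math.log(p_word)
            n_tokens += 1
    return math.exp(-log_likelihood / n_tokens)
```

A model that spreads probability uniformly over V words has a perplexity of exactly V, so if `words = 14020` in the log is the vocabulary size, any value well below 14020 means the model is doing better than guessing at random.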