I was bored over the weekend, reading through [Empirical Study of Topic Modeling in Twitter](http://snap.stanford.edu/soma2010/papers/soma2010_12.pdf), and thought it would be interesting to try the same analysis on comments taken from the Tea Party’s Facebook page.

Step One - Getting The Data

I am pretty familiar with Facebook’s Graph API from work at a previous start-up. I used the Graph API to pull the post IDs from the page, and then used FQL to pull the comments from each post:

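# Python 2 script: relies on urllib.urlopen and print statements.
import time
import urllib

import pandas as pd
import simplejson

# Placeholder: substitute a valid Graph API access token.
token = 'YOUR_ACCESS_TOKEN'

# FQL query that pulls up to 2000 comment texts for one post id,
# for example 10153066534910779.
FQL_COMMENT = """
    select text from comment where object_id in (%s) limit 2000
"""

def _get_post_ids_by_page_id(page_name, access_token):
    """Return the Graph API response listing the ids of a page's posts."""
    url = 'https://graph.facebook.com/{page_name}?fields=posts.id&access_token={access_token}'.format(
        page_name=page_name, access_token=access_token)
    return simplejson.loads(urllib.urlopen(url).read())


# Ids come back as "<page id>_<post id>"; keep only the post id part.
post_ids = [post['id'].split("_")[1] for post in
    _get_post_ids_by_page_id(page_name='teapartypatriots',
            access_token=token)['posts']['data']]

for post_id in post_ids:
    # URL-encode the query so its spaces and newlines survive the request.
    query = urllib.quote(FQL_COMMENT % post_id)
    url = "https://graph.facebook.com/fql?access_token=%s&q=%s" % (token, query)
    json_comment = simplejson.loads(urllib.urlopen(url).read())
    print len(json_comment['data'])
    # One CSV per post; the csvs/ directory must already exist.
    df = pd.DataFrame(json_comment['data'])
    df.to_csv("csvs/{}.csv".format(post_id), encoding='utf-8')
    print "csvs/{}.csv".format(post_id)
    time.sleep(1)  # pause between requests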

I saved the data in CSV files using pandas so I could process it offline and not have to keep hitting Facebook’s API.
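
For the offline step, the saved files need to be read back into a single list of comments. Here is a minimal sketch of one way to do that, written for Python 3 with only the standard library. The function name `load_comments` is hypothetical, and it assumes each file has a `text` column (the field selected by the FQL query) and was written as UTF-8:

```python
import csv
import glob
import os


def load_comments(csv_dir):
    """Collect the non-empty 'text' values from every CSV file in csv_dir."""
    comments = []
    for path in sorted(glob.glob(os.path.join(csv_dir, "*.csv"))):
        with open(path, newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle):
                # Assumption: the comment body lives in a column named "text".
                text = (row.get("text") or "").strip()
                if text:
                    comments.append(text)
    return comments
```

Calling `load_comments("csvs")` would then return every comment from the directory used above as one flat list.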

Step Two - Running LDA on the Corpus

After the data is pulled into CSV files, the next step is to run LDA (latent Dirichlet allocation) on the corpus to discover latent topics. It is important to preprocess the corpus before running LDA: remove stop words (very common words in the English language), lowercase everything, strip punctuation, and so on.
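
As a rough illustration of that preprocessing, here is a small Python 3 sketch. It is not the exact routine behind the results in this post: the stop-word list is a tiny stand-in, and the `preprocess` name and `min_length` cutoff are assumptions made for the example.

```python
import re

# Tiny illustrative stop-word list (an assumption); a real run would use a fuller one.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it",
              "for", "on", "with"}


def preprocess(comment, min_length=3):
    """Lowercase a comment, keep only runs of letters, and drop stop words
    and tokens shorter than min_length."""
    # Lowercasing first means the pattern only has to match a-z;
    # punctuation, digits and apostrophes act as separators.
    tokens = re.findall(r"[a-z]+", comment.lower())
    return [t for t in tokens if t not in STOP_WORDS and len(t) >= min_length]


print(preprocess("Obama's deal with Iran is WRONG!!!"))
# ['obama', 'deal', 'iran', 'wrong']
```

Splitting on anything that is not a letter turns "didn't" into "didn" plus a stray "t" that the length cutoff removes, so this kind of tokenization produces fragments such as "didn" and "doesn".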

corpus=1, words=14020, K=10, a=0.500000, b=0.500000
initial perplexity=2378.337280
-1 p=2378.290024
-2 p=2378.187329
-3 p=2377.986063
-4 p=2377.657654
-5 p=2376.859515
-6 p=2375.187447
-7 p=2371.556214
-8 p=2365.740169
-9 p=2355.670145
-10 p=2340.404687
-11 p=2321.435137
-12 p=2299.308017
-13 p=2273.851541
-14 p=2245.997043
-15 p=2220.681310

-- topic: 0 (71271 words)
people: 0.014901 (1166)
country: 0.012602 (986)
obama: 0.010175 (796)
would: 0.008885 (695)
like: 0.008885 (695)
america: 0.008054 (630)
iran: 0.007927 (620)
american: 0.007773 (608)
need: 0.006777 (530)
know: 0.006751 (528)
congress: 0.006700 (524)
deal: 0.006649 (520)
year: 0.006611 (517)
think: 0.006458 (505)
this: 0.006228 (487)
president: 0.006202 (485)
never: 0.005883 (460)
state: 0.005768 (451)
government: 0.005742 (449)
thing: 0.005078 (397)
done: 0.004810 (376)
going: 0.004784 (374)
failed: 0.004771 (373)
illegal: 0.004541 (355)
make: 0.004414 (345)
world: 0.004196 (328)
problem: 0.004196 (328)
muslim: 0.004158 (325)
many: 0.004081 (319)
agree: 0.004005 (313)
trump: 0.003877 (303)
stop: 0.003737 (292)
party: 0.003711 (290)
still: 0.003647 (285)
everything: 0.003596 (281)
want: 0.003583 (280)
take: 0.003583 (280)
democrat: 0.003570 (279)
http: 0.003558 (278)
there: 0.003558 (278)
wrong: 0.003545 (277)
house: 0.003519 (275)
keep: 0.003443 (269)
work: 0.003392 (265)
much: 0.003366 (263)
bush: 0.003315 (259)
anything: 0.003264 (255)
enough: 0.003200 (250)
must: 0.003149 (246)
even: 0.003111 (243)

-- topic: 1 (6002 words)
great: 0.004880 (63)
came: 0.003881 (50)
real: 0.002805 (36)
post: 0.002267 (29)
first: 0.002113 (27)
president: 0.002037 (26)
obama: 0.001960 (25)
country: 0.001883 (24)
economy: 0.001806 (23)
john: 0.001806 (23)
donald: 0.001652 (21)
action: 0.001499 (19)
iran: 0.001499 (19)
america: 0.001345 (17)
especially: 0.001345 (17)
either: 0.001345 (17)
http: 0.001345 (17)
year: 0.001268 (16)
light: 0.001191 (15)
trump: 0.001191 (15)
voted: 0.001114 (14)
southern: 0.001114 (14)
your: 0.001114 (14)
think: 0.001114 (14)
last: 0.001114 (14)
mean: 0.001114 (14)
question: 0.001038 (13)
paid: 0.001038 (13)
shown: 0.001038 (13)
play: 0.001038 (13)
consequence: 0.001038 (13)
list: 0.000961 (12)
there: 0.000961 (12)
necessary: 0.000961 (12)
like: 0.000961 (12)
weapon: 0.000961 (12)
international: 0.000961 (12)
education: 0.000961 (12)
right: 0.000884 (11)
could: 0.000884 (11)
care: 0.000884 (11)
history: 0.000884 (11)
based: 0.000884 (11)
youtube: 0.000884 (11)
impeachment: 0.000884 (11)
recognize: 0.000884 (11)
trade: 0.000884 (11)
allowed: 0.000884 (11)
contractor: 0.000807 (10)
senator: 0.000807 (10)

-- topic: 2 (576 words)
election: 0.000461 (3)
recession: 0.000461 (3)
plus: 0.000330 (2)
surly: 0.000330 (2)
football: 0.000330 (2)
sentence: 0.000330 (2)
polarize: 0.000330 (2)
iran: 0.000330 (2)
sound: 0.000330 (2)
fault: 0.000330 (2)
threat: 0.000330 (2)
payer: 0.000330 (2)
liable: 0.000330 (2)
making: 0.000330 (2)
greater: 0.000330 (2)
freddie: 0.000330 (2)
islamic: 0.000330 (2)
assumption: 0.000330 (2)
smith: 0.000330 (2)
drowning: 0.000330 (2)
deliverance: 0.000330 (2)
crybaby: 0.000330 (2)
cant: 0.000330 (2)
twitter: 0.000330 (2)
chelsea: 0.000330 (2)
unicorn: 0.000330 (2)
opportunity: 0.000330 (2)
explains: 0.000330 (2)
depression: 0.000330 (2)
crony: 0.000330 (2)
taught: 0.000198 (1)
monger: 0.000198 (1)
milsao: 0.000198 (1)
timeline: 0.000198 (1)
stonewalls: 0.000198 (1)
oboehner: 0.000198 (1)
reread: 0.000198 (1)
jewsnews: 0.000198 (1)
grain: 0.000198 (1)
desecrate: 0.000198 (1)
screening: 0.000198 (1)
flopped: 0.000198 (1)
newfeed: 0.000198 (1)
reaps: 0.000198 (1)
sending: 0.000198 (1)
challenge: 0.000198 (1)
standard: 0.000198 (1)
sour: 0.000198 (1)
enterprise: 0.000198 (1)
primeminster: 0.000198 (1)

-- topic: 3 (3570 words)
china: 0.002127 (22)
fund: 0.001654 (17)
poll: 0.001654 (17)
muslim: 0.001276 (13)
also: 0.001276 (13)
must: 0.001181 (12)
regulation: 0.001181 (12)
read: 0.001087 (11)
care: 0.000992 (10)
bringing: 0.000898 (9)
law: 0.000898 (9)
much: 0.000898 (9)
within: 0.000898 (9)
forced: 0.000898 (9)
time: 0.000803 (8)
hope: 0.000803 (8)
thank: 0.000803 (8)
sanction: 0.000803 (8)
american: 0.000803 (8)
forgotten: 0.000803 (8)
boehner: 0.000803 (8)
federal: 0.000709 (7)
result: 0.000709 (7)
running: 0.000709 (7)
elected: 0.000709 (7)
country: 0.000709 (7)
system: 0.000709 (7)
appropriate: 0.000709 (7)
wait: 0.000709 (7)
candidate: 0.000709 (7)
news: 0.000709 (7)
authority: 0.000709 (7)
speak: 0.000709 (7)
buddy: 0.000709 (7)
mean: 0.000709 (7)
couple: 0.000709 (7)
chance: 0.000709 (7)
business: 0.000709 (7)
pointing: 0.000614 (6)
somebody: 0.000614 (6)
ring: 0.000614 (6)
corporation: 0.000614 (6)
lot: 0.000614 (6)
leg: 0.000614 (6)
watch: 0.000614 (6)
east: 0.000614 (6)
represents: 0.000614 (6)
good: 0.000614 (6)
completely: 0.000614 (6)
student: 0.000614 (6)

-- topic: 4 (12657 words)
think: 0.008924 (175)
obama: 0.006483 (127)
want: 0.004652 (91)
people: 0.004398 (86)
idiot: 0.004042 (79)
great: 0.003839 (75)
failed: 0.003076 (60)
iran: 0.002822 (55)
come: 0.002720 (53)
become: 0.002568 (50)
email: 0.002517 (49)
like: 0.002466 (48)
action: 0.002364 (46)
others: 0.002364 (46)
another: 0.002263 (44)
past: 0.002263 (44)
treason: 0.002263 (44)
terrorist: 0.002110 (41)
country: 0.002110 (41)
nation: 0.002110 (41)
mean: 0.002110 (41)
article: 0.002008 (39)
also: 0.002008 (39)
everything: 0.001958 (38)
course: 0.001958 (38)
change: 0.001907 (37)
barack: 0.001805 (35)
corporation: 0.001805 (35)
democratic: 0.001805 (35)
industry: 0.001805 (35)
really: 0.001754 (34)
michael: 0.001754 (34)
enemy: 0.001754 (34)
president: 0.001703 (33)
time: 0.001703 (33)
soon: 0.001653 (32)
vote: 0.001602 (31)
blame: 0.001602 (31)
office: 0.001602 (31)
they: 0.001602 (31)
back: 0.001551 (30)
keep: 0.001500 (29)
sanction: 0.001500 (29)
that: 0.001500 (29)
research: 0.001500 (29)
leader: 0.001500 (29)
impeachment: 0.001500 (29)
control: 0.001500 (29)
still: 0.001449 (28)
afraid: 0.001449 (28)

-- topic: 5 (4904 words)
trying: 0.002140 (25)
family: 0.001721 (20)
stand: 0.001721 (20)
criminal: 0.001721 (20)
party: 0.001469 (17)
line: 0.001469 (17)
lying: 0.001385 (16)
brain: 0.001301 (15)
page: 0.001217 (14)
daughter: 0.001217 (14)
what: 0.001217 (14)
hillary: 0.001133 (13)
also: 0.001133 (13)
make: 0.001049 (12)
back: 0.001049 (12)
finally: 0.001049 (12)
without: 0.001049 (12)
send: 0.001049 (12)
doesn: 0.000965 (11)
doubt: 0.000965 (11)
perhaps: 0.000965 (11)
marriage: 0.000881 (10)
least: 0.000881 (10)
special: 0.000881 (10)
nut: 0.000797 (9)
blow: 0.000797 (9)
scare: 0.000797 (9)
coverage: 0.000797 (9)
question: 0.000797 (9)
colin: 0.000797 (9)
pack: 0.000713 (8)
exchange: 0.000713 (8)
haven: 0.000713 (8)
these: 0.000713 (8)
pray: 0.000713 (8)
something: 0.000713 (8)
rino: 0.000713 (8)
standard: 0.000713 (8)
grow: 0.000713 (8)
real: 0.000713 (8)
democracy: 0.000713 (8)
forward: 0.000713 (8)
representative: 0.000713 (8)
obama: 0.000713 (8)
constituent: 0.000630 (7)
last: 0.000630 (7)
building: 0.000630 (7)
pick: 0.000630 (7)
guarantee: 0.000630 (7)
cover: 0.000630 (7)

-- topic: 6 (1492 words)
smarter: 0.000647 (5)
waiting: 0.000647 (5)
donald: 0.000529 (4)
clearly: 0.000529 (4)
command: 0.000529 (4)
income: 0.000529 (4)
were: 0.000529 (4)
rawnsleyb: 0.000529 (4)
line: 0.000529 (4)
whatever: 0.000412 (3)
refused: 0.000412 (3)
ever: 0.000412 (3)
like: 0.000412 (3)
trusted: 0.000412 (3)
suggestion: 0.000412 (3)
obese: 0.000412 (3)
overthrow: 0.000412 (3)
sarah: 0.000412 (3)
asking: 0.000412 (3)
clerk: 0.000412 (3)
bullfrog: 0.000412 (3)
remains: 0.000412 (3)
list: 0.000412 (3)
capitalism: 0.000412 (3)
maybe: 0.000412 (3)
favorite: 0.000412 (3)
respect: 0.000412 (3)
regulation: 0.000412 (3)
laid: 0.000412 (3)
reject: 0.000412 (3)
sensible: 0.000412 (3)
susan: 0.000412 (3)
productive: 0.000412 (3)
didn: 0.000412 (3)
regret: 0.000412 (3)
live: 0.000412 (3)
norm: 0.000412 (3)
given: 0.000412 (3)
prosecution: 0.000412 (3)
security: 0.000412 (3)
magazine: 0.000412 (3)
think: 0.000412 (3)
worldwide: 0.000412 (3)
follower: 0.000412 (3)
enabled: 0.000412 (3)
opening: 0.000412 (3)
leading: 0.000412 (3)
newly: 0.000412 (3)
created: 0.000412 (3)
christopher: 0.000294 (2)

-- topic: 7 (45838 words)
obama: 0.022659 (1197)
american: 0.010833 (572)
america: 0.010398 (549)
deal: 0.009811 (518)
time: 0.008524 (450)
right: 0.008487 (448)
president: 0.007881 (416)
want: 0.007805 (412)
need: 0.007162 (378)
think: 0.006859 (362)
back: 0.006727 (355)
vote: 0.006651 (351)
people: 0.006519 (344)
like: 0.006519 (344)
office: 0.005837 (308)
what: 0.005648 (298)
well: 0.004929 (260)
could: 0.004910 (259)
first: 0.004683 (247)
nothing: 0.004683 (247)
look: 0.004362 (230)
country: 0.004305 (227)
come: 0.004286 (226)
muslim: 0.004210 (222)
trump: 0.004172 (220)
republican: 0.004116 (217)
take: 0.004040 (213)
good: 0.003775 (199)
going: 0.003718 (196)
they: 0.003624 (191)
illegal: 0.003548 (187)
problem: 0.003415 (180)
failed: 0.003321 (175)
terrorist: 0.003302 (174)
care: 0.003264 (172)
know: 0.003264 (172)
voted: 0.003094 (163)
business: 0.003056 (161)
traitor: 0.003056 (161)
citizen: 0.003037 (160)
took: 0.002923 (154)
fact: 0.002905 (153)
control: 0.002848 (150)
long: 0.002848 (150)
white: 0.002772 (146)
http: 0.002715 (143)
iran: 0.002696 (142)
year: 0.002677 (141)
leave: 0.002659 (140)
hope: 0.002640 (139)

-- topic: 8 (71 words)
cultery: 0.000212 (1)
require: 0.000212 (1)
birthed: 0.000212 (1)
imminent: 0.000212 (1)
understanding: 0.000212 (1)
major: 0.000212 (1)
divided: 0.000212 (1)
bale: 0.000212 (1)
throughout: 0.000212 (1)
intrusion: 0.000212 (1)
somehow: 0.000212 (1)
welcome: 0.000212 (1)
groom: 0.000212 (1)
seagle: 0.000212 (1)
boener: 0.000212 (1)
poor: 0.000212 (1)
regulation: 0.000212 (1)
bully: 0.000212 (1)
arrest: 0.000212 (1)
current: 0.000212 (1)
faux: 0.000212 (1)
pjnet: 0.000212 (1)
reposting: 0.000212 (1)
strictly: 0.000212 (1)
uninformed: 0.000212 (1)
saddens: 0.000212 (1)
chaney: 0.000212 (1)
transaction: 0.000212 (1)
trading: 0.000212 (1)
none: 0.000212 (1)
routinely: 0.000212 (1)
notarized: 0.000212 (1)
quick: 0.000212 (1)
orchestrated: 0.000212 (1)
manipulate: 0.000212 (1)
bullshi: 0.000212 (1)
saveamericasfreedomfighters: 0.000212 (1)
atomic: 0.000212 (1)
tragic: 0.000212 (1)
february: 0.000212 (1)
creveld: 0.000212 (1)
humane: 0.000212 (1)
uranian: 0.000212 (1)
koolaide: 0.000212 (1)
vehicle: 0.000212 (1)
dead: 0.000212 (1)
washy: 0.000212 (1)
ihope: 0.000212 (1)
trojan: 0.000212 (1)
crew: 0.000212 (1)

-- topic: 9 (28955 words)
obama: 0.010190 (366)
country: 0.008022 (288)
every: 0.007049 (253)
that: 0.006882 (247)
life: 0.005881 (211)
muslim: 0.005186 (186)
know: 0.005074 (182)
iran: 0.004963 (178)
would: 0.004908 (176)
like: 0.004741 (170)
time: 0.004630 (166)
america: 0.004630 (166)
actually: 0.004630 (166)
they: 0.004407 (158)
first: 0.004379 (157)
government: 0.003934 (141)
great: 0.003907 (140)
didn: 0.003768 (135)
deal: 0.003656 (131)
israel: 0.003629 (130)
show: 0.003573 (128)
issue: 0.003517 (126)
what: 0.003462 (124)
money: 0.003295 (118)
many: 0.003239 (116)
president: 0.003184 (114)
happen: 0.003184 (114)
people: 0.003156 (113)
failed: 0.003128 (112)
make: 0.003128 (112)
nuclear: 0.003017 (108)
made: 0.002989 (107)
another: 0.002961 (106)
said: 0.002961 (106)
thank: 0.002933 (105)
good: 0.002878 (103)
citizen: 0.002794 (100)
enough: 0.002683 (96)
long: 0.002655 (95)
agenda: 0.002655 (95)
conservative: 0.002655 (95)
class: 0.002628 (94)
constitution: 0.002628 (94)
hillary: 0.002600 (93)
house: 0.002600 (93)
without: 0.002572 (92)
even: 0.002544 (91)
business: 0.002544 (91)
feel: 0.002544 (91)
american: 0.002544 (91)
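
A note on reading these lists: the number in parentheses looks like the count of times the word was assigned to the topic, and the probability next to it appears to be that count smoothed by the `b` hyperparameter from the first line of the output. That is my inference from the figures, not something the output states. A minimal Python 3 sketch of the calculation, with a hypothetical function name:

```python
def smoothed_word_probability(word_count, topic_total, vocab_size, beta):
    """Estimate P(word | topic) with symmetric Dirichlet smoothing."""
    return (word_count + beta) / (topic_total + vocab_size * beta)


# Values copied from the output above: words=14020, b=0.5,
# topic 0 holds 71271 words and "people" was assigned to it 1166 times.
p = smoothed_word_probability(1166, 71271, 14020, 0.5)
print(round(p, 6))  # 0.014901
```

The same formula reproduces the 0.000212 printed for every word in topic 8, where each count is 1 and the topic holds only 71 words. It is also why a word that appears once still gets a non-zero probability.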