reddit scrape analytics project notes

03-09-2025

I had an idea to scrape and systematically analyze text from the r/conservative subreddit. I have a preliminary scraper written, and I am currently setting up Postgres and pgAdmin 4 on this machine (using an alt machine while my Mac is in the shop).

NICE. I set up a local db in Postgres and tested the connection in a Python script, which printed the line below. Truly a gorgeous sight to behold.
Connected to PostgreSQL successfully!
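For the record, the connection test was essentially this (a minimal sketch; the database name, user, and password here are placeholders, not my actual config):

import psycopg2

# placeholder connection details; mine live in config.py, not hardcoded
conn = psycopg2.connect(
    host="localhost",
    dbname="reddit_scrape",
    user="postgres",
    password="CHANGE_ME",
)
print("Connected to PostgreSQL successfully!")
conn.close()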

Now I will attempt to run the scraper with the db connection and start populating the db with the Reddit data. I am scraping the conservative subreddit specifically for posts about aid to Ukraine and US foreign policy vis-a-vis Russia-Ukraine.

Ouch. Something is wrong with my refresh token. I'll check the string and/or generate a new one in config.py. OK, refresh token fixed. Scraper is running:

Scraping posts and comments...

Sleeping for 10 minutes... 
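For my own future reference, the PRAW setup is roughly this (a sketch only; credentials come from config.py and the user agent string is a placeholder, not my real one):

import praw
import config  # local credentials file, not committed to the repo

reddit = praw.Reddit(
    client_id=config.CLIENT_ID,
    client_secret=config.CLIENT_SECRET,
    refresh_token=config.REFRESH_TOKEN,  # the token I had to regenerate
    user_agent="ukraine-aid-scraper by u/placeholder",
)

# iterate recent posts; keyword matching and comment storage omitted here
for post in reddit.subreddit("Conservative").new(limit=100):
    print("Checking post:", post.id, "-", post.title)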

Ok. It looks like my DB is populating! 26 rows in the comments table, 11 rows in the posts table. This is great. I think the next step is to set up an automated, periodic scrape while I figure out how I want to approach analytics. This could become a big dataset pretty fast, which is great for analytics but may cause issues in storage/management.

DBs are incredibly efficient, and after double-checking storage needs I am assured there is no reason to worry about hosting this db on my local machine. I set up a scheduled task on my Windows machine to run the scraper every 3 hours, then deleted the old 10-minute sleep functionality. The db is set up so the same post or comment is never collected twice, and I also made sure I store the data needed to reconstruct comment hierarchies later when I get into the analytics.
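The dedup and hierarchy pieces boil down to something like this (a sketch; the table and column names are my working assumptions about the schema, not the literal DDL):

INSERT_COMMENT = """
    INSERT INTO comments (comment_id, post_id, parent_id, body, score, created_utc)
    VALUES (%s, %s, %s, %s, %s, %s)
    ON CONFLICT (comment_id) DO NOTHING;
"""

def store_comment(cur, c):
    # ON CONFLICT skips comments already in the table; parent_id is what lets me
    # rebuild the comment tree later (t1_/t3_ prefixes kept as Reddit sends them)
    post_id = c.link_id.split("_", 1)[1]  # strip the t3_ prefix so it matches posts.post_id
    cur.execute(INSERT_COMMENT, (c.id, post_id, c.parent_id, c.body, c.score, c.created_utc))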

I am very happy with the progress I made on this project today. I think I'm going to give this project a rest, let the data bulk up for a few days, and focus on other projects. Here is the (private) github repo for today's work. If someone out there (trusted) wants access to it, feel free to reach out. 

Summary of progress: 

Next Steps: 

03-10-2025

I realized today while at work that a perhaps better approach to this project, for several reasons, would be to extend the post historical reachback limit and just grab a long history of posts from r/conservative that match the keywords. I would then have ample data to do some time series and/or NLP analytics without having to wait for my db to bulk up. While at work (manual labor, brainless job) I felt sad that I don't have a political data job. Hopefully a stronger portfolio will help me inch towards having a job in the discipline I've invested so much into studying. Ah...America.

I adapted a copy of the old continuous scraper to be a one-time massive scrape using Pushshift. I am unclear whether I will be able to use it, since the terms state the API is intended for users who are also moderators, and the API has apparently added several restrictions on use over the past few years.
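The request itself was roughly this (a sketch using the older public Pushshift submission-search endpoint and parameters; given the current restrictions, this is presumably why I get the 403 below):

import requests

resp = requests.get(
    "https://api.pushshift.io/reddit/search/submission",
    params={"subreddit": "Conservative", "q": "ukraine", "size": 100},
    timeout=30,
)
if resp.status_code != 200:
    print("Error fetching data:", resp.status_code)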

Error fetching data: 403

Ok, seems like Pushshift is a no-go. Back to a PRAW approach.

Alright, the new scraper is running. I am scraping back four years in time, so this may take a while to run. I will wait in anticipation and hope that my effort will be fruitful...
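The core loop looks roughly like this (a sketch; the keyword list, user agent, and the four-year cutoff arithmetic are illustrative, and the listing details differ a bit from the real script):

import time
import praw
import config

KEYWORDS = ("ukraine", "zelensky", "foreign aid")   # illustrative, not my exact list
CUTOFF = time.time() - 4 * 365 * 24 * 3600          # roughly four years back

reddit = praw.Reddit(
    client_id=config.CLIENT_ID,
    client_secret=config.CLIENT_SECRET,
    refresh_token=config.REFRESH_TOKEN,
    user_agent="ukraine-aid-backfill by u/placeholder",
)

for post in reddit.subreddit("Conservative").new(limit=None):
    if post.created_utc < CUTOFF:
        break  # older than the window I care about
    if any(kw in post.title.lower() for kw in KEYWORDS):
        print("✔ Matched post:", post.id, "-", post.title)
        # insert post and its comments into the db here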

Just checked the db in pgadmin4 and it is indeed populating with a bunch of data.  I am thrilled.

While this runs I am thinking about analytics approaches. I think that the score (a function of up and down votes) will be an interesting parameter to build a time series analysis with. Perhaps some statistical learning on the relationship between n-grams and scores as well. I also have some aggregated polling data on US support for aid to Ukraine that might be interesting to compare against trends in the conservative subreddit; the polling data can also be grouped to focus only on Republicans and Republican-leaning folks.
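The time series piece is probably as simple as this (a sketch of the daily mean score; the connection details and the assumption that created_utc is stored as epoch seconds are placeholders on my part):

import pandas as pd
import psycopg2

conn = psycopg2.connect(host="localhost", dbname="reddit_scrape", user="postgres", password="CHANGE_ME")
posts = pd.read_sql("SELECT created_utc, score FROM posts", conn)

# convert epoch seconds to datetimes; skip this if created_utc is already a timestamp column
posts["created_utc"] = pd.to_datetime(posts["created_utc"], unit="s")
daily_score = posts.set_index("created_utc")["score"].resample("D").mean()
print(daily_score.tail())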

Uh oh. Getting a few 429 errors (Reddit API request rate limit). Looks like it only happened for 3 specific posts. I can store those post IDs and go back for them later. Hopefully this doesn't happen too much more, as the script itself is still running.

OK, the rate limiting resulted in no more data being added to the db. I increased the time between requests enough to hopefully avoid this and am running the new script.
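Longer term, something like a retry-with-backoff wrapper plus a list of post IDs to revisit seems like the right shape (a sketch; depending on the PRAW/prawcore version, a 429 surfaces as a ResponseException with status 429, which is what I check for here):

import time
import prawcore

failed_ids = []  # post IDs to come back for later

def fetch_with_backoff(fetch_fn, post_id, max_tries=4):
    """Call fetch_fn(post_id); on a 429, sleep with an increasing delay and retry."""
    for attempt in range(max_tries):
        try:
            return fetch_fn(post_id)
        except prawcore.exceptions.ResponseException as e:
            if e.response.status_code != 429:
                raise
            wait = 60 * (attempt + 1)  # ROBOT NAP: 1, 2, 3... minutes
            print(f"429 on {post_id}, sleeping {wait}s")
            time.sleep(wait)
    failed_ids.append(post_id)  # give up for now
    return None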

Good god. I had several API request issues, then some other convoluted issues. But it appears my scraper is running as expected and is slowly moving backwards in time to collect older posts and comments from the sub. Since I had to add some janky sleeps (ROBOT NAP strikes again) to avoid all the request rate limits, this is going to have to run for quite some time, as I intend to go back four years in the sub history. I am going to leave it be, let it run, and periodically check the db to ensure data insertion continues to happen and continues to go back in time. I could probably implement a programmatic check, but I don't want to halt my precious scraper now that it is running, and I am, ultimately, a noob. Here is what the sequence of print statements looks like; I love the peace of mind a print statement brings me:
Checking post: 1j5dufg - Congress Should Ban School Accreditors From Enforcing DEI

Checking post: 1j5drpw - Is MAGA Wrong on Ukraine?

✔ Matched post: 1j5drpw - Is MAGA Wrong on Ukraine?

✔ Post inserted: 1j5drpw

✔ Storing comment: mggd3i5 (2025-03-07 03:56:00) 

Eh this scraper is going to take several days to get all the data. I guess that is ok and worth it. People like big ole data. Personally I think little data is cool too.

I am probably done working on this project until the scraper is complete. Here is the scraper I built today (private, message me if you want access and are trusted).

Thank god I took that course in SQL and relational DBs; there is no way I'd be able to store all these data without a trusty Postgres db. The psycopg2 package has no business going this hard, frankly.

I am also thinking a lot about my Columbia University peer Mahmoud being warrantlessly abducted under threat of having his permanent legal status revoked for being a student leader (and a very kind, thoughtful, and nonviolent one at that) in an anti-genocide student movement. When this university pitched itself to me as an institution that cared about social justice, I foolishly believed it. I think about dropping out all the time. I'm not convinced the cost is worth it for the skill acquisition; I could learn the same skills at a community college...all ethical and moral considerations aside. It is a bit disappointing how much the hiring market seems to care about completed (I have literally 2 classes left, I essentially got all the skills) and fancy degrees more so than simple skills. Having "must have a MA or PhD" in the hiring requirements is honestly a massive employer red flag for me. Merit and skill should be all that matter.


3-11-2025

The date header is a liberal hoax; it is actually still 3-10-2025. I lifted weights in the evening and then felt like coding.


Making this thing more efficient:

Here is a side by side: https://youtu.be/t36Vd1v-rbo  (these collapsible text boxes should allow video embeds) 

The slowpoke approach I initially took now looks laughable. Such is improvement. I had to slowpoke around in the sub before I could start sprinting! My batch insert scraper has done more in 25 minutes than my slowpoke scraper did in 3 hours. GAINZ.
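The batch insert itself is basically psycopg2's execute_values instead of one INSERT per row (a sketch; the table and column names are the same schema assumptions as before):

from psycopg2.extras import execute_values

def insert_comments_batch(cur, rows):
    # rows: list of (comment_id, post_id, parent_id, body, score, created_utc) tuples;
    # one round trip to Postgres instead of one INSERT per comment
    execute_values(
        cur,
        """
        INSERT INTO comments (comment_id, post_id, parent_id, body, score, created_utc)
        VALUES %s
        ON CONFLICT (comment_id) DO NOTHING
        """,
        rows,
    )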

This could still be a lot faster. Scraping 10 days of posts and comments in 20 minutes, when my goal is to scrape ~4 years' worth of data, is really nothing to gawk at. This will come out to approximately 2.5 days of non-stop run time. I'll need to think about whether this is acceptable or if I ought to try to make this even more efficient.

This would be a good time to work on understanding and implementing parallelization. This is honestly a big gap in my personal knowledge and experience, but I bet I could figure it out.
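The shape I'd explore is a thread pool over post IDs (just a sketch of the pattern; fetch_comments is a hypothetical placeholder, and I'd need to check how safe it is to share or duplicate a PRAW client across threads before trusting this):

from concurrent.futures import ThreadPoolExecutor

def fetch_comments(post_id):
    # placeholder: would pull the comment tree for post_id and return rows to insert
    return []

post_ids = ["1j5drpw", "1j5dufg"]  # example IDs from the log above
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(fetch_comments, post_ids))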

In any case, I must tuck in my disabled client.

I tucked in my disabled client. I noticed that at some point my Reddit query bot was not adding new posts to the db, only new comments. What was happening is that subreddit.new() has a default listing cap of about 1,000 items, and that cap still applies even if you pass limit=None.

I tried a custom pagination approach almost identical to what this person tried, looked around in other forums, and asked ChatGPT. I will not be able to sidestep this limit. While not the massive dataset my heart desires, my db is still alright to chug along with for this exercise project. I will also continue running the OG scraper continuously to snag new posts; I may alter it a bit for efficiency.
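For my own sanity, the cap is easy to confirm (a quick sketch; credentials and user agent are placeholders as before):

import praw
import config

reddit = praw.Reddit(
    client_id=config.CLIENT_ID,
    client_secret=config.CLIENT_SECRET,
    refresh_token=config.REFRESH_TOKEN,
    user_agent="listing-limit-check by u/placeholder",
)

# even with limit=None, Reddit's listing endpoints top out around 1,000 items
count = sum(1 for _ in reddit.subreddit("Conservative").new(limit=None))
print("posts returned by .new(limit=None):", count)  # ~1000, not the full sub history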

I did find an academic torrent of Reddit data, but using it would involve working with over 2 TB of compressed files. I have a BA in political science. I can code a little bit, but working with >2 TB is beyond my CS knowledge at present, and I'd rather not accidentally render my machine inoperable at the moment.

--- ACTUALLY 3-11-2025-----

Ran scraper.py (snags today's new posts and comments in the sub with the slow comment scrape, to keep the db up to date).

Going to run some rudimentary NLP and time series analytics. Torn between doing this in R or Python; I think I'll do R just because I don't yet have Jupyter Notebook on this device and will save about 3 minutes by just doing it in R.

Alrighty, I vizzed a few things. You can look at the vizzes on this page.

My housemate just now: "is there any sauce that doesn't work on a chicken nugget?"

Things I need to get better at: 

3-24-2025

I had a PTSD flare up triggered by a bad nightmare about my sexual assault that took me out of commission for the past week or so. One thing that I think is cool about this independent side project is that people who read these notes about the experience of the researcher can see the full, sometimes ugly, truth. I intend to be totally, perhaps embarrassingly, honest in these project notes.

All I could bring myself to do for this project since the 11th is run the scraper. One really sad impact of my PTSD is that it routinely creates these massive walls of anxiety around doing any work beyond what I must do to survive. Without the pressure of a boss or team relying on me to do this work, it's particularly easy to opt out of productivity and get trapped in anxiety pits. I routinely have to kiss my creative mind goodbye on periodic, temporary bases. I'm reaching out to services to try and get personal justice and support services for what happened to me, but navigating resources is a part-time job in and of itself, and I haven't found nor have I been able to afford the care I truly need. I deeply mourn the productive, creative, brilliant mind I was reliably able to exercise before I got sexually assaulted, and I hope to one day have it back. In the meantime, I'm going to give myself whatever space I need to try and relax when the trauma strikes.

I did, however, recently pick up another disabled care client who I enjoy helping out and the additional income helps me survive capitalism. So that is good. I might fiddle around with this project today, but in earnest will likely just try my best to keep resting. If anyone reading this resonates with anything here...I believe you and I'm sorry for whatever happened to you and how it has impacted you. I have hope we will feel normal again one day.  

3-25-2025

The trauma spike issue I discussed yesterday is feeling much less prominent today, which is wonderful. I had a great, 2 hour long conversation with my host son from my time as an au pair. I LOVE MY BOY. 

I want to do some more complex analytics on my dataset, and have been wanting to play around with network analysis. The idea I am going to implement today is a conversation cascade analysis, to glean some insights about how conversations amongst users grow and branch in the trees below posts about US aid to Ukraine.

I have some network analysis experience in R, but have long been interested in learning networkx in python. Today, I am going to implement prototypes of this analysis in R and Python - to analyze analytic tradeoffs and assess relative suitability for my specific dataset.

Python Prototype:
First I need to select a post to try the prototype with. I wrote a SQL query on my reddit scrape database that gave me a ranked list of the most volatile posts (parameter = standard deviation of comment scores, for posts with more than 50 comments).
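The query is along these lines (a sketch; the table and column names are my working schema assumptions, and at this point I ran it in the pgAdmin 4 query tool):

import psycopg2

VOLATILITY_QUERY = """
    SELECT p.post_id,
           p.title,
           COUNT(c.comment_id) AS n_comments,
           STDDEV(c.score)     AS score_sd
    FROM posts p
    JOIN comments c ON c.post_id = p.post_id
    GROUP BY p.post_id, p.title
    HAVING COUNT(c.comment_id) > 50
    ORDER BY score_sd DESC;
"""

conn = psycopg2.connect(host="localhost", dbname="reddit_scrape", user="postgres", password="CHANGE_ME")
cur = conn.cursor()
cur.execute(VOLATILITY_QUERY)
for post_id, title, n_comments, score_sd in cur.fetchall():
    print(post_id, n_comments, round(score_sd, 1), title)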

The most volatile post in my dataset had post ID = 1j88xyg

I'll rock with this one for my prototype, unless I find any issues with its suitability as I work with it.

heh. I had the idea to include both the most and least volatile in the prototype, to keep things spicy.

My least volatile was post ID = 1j1cc38
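The cascade itself is a directed tree rooted at the post, built from each comment's parent_id (a sketch with networkx; column names are the same schema assumptions, and the prefix stripping assumes I stored Reddit's raw t1_/t3_ parent_id values):

import networkx as nx
import psycopg2

POST_ID = "1j88xyg"  # the most volatile post from the query above
conn = psycopg2.connect(host="localhost", dbname="reddit_scrape", user="postgres", password="CHANGE_ME")
cur = conn.cursor()
cur.execute("SELECT comment_id, parent_id FROM comments WHERE post_id = %s", (POST_ID,))

G = nx.DiGraph()
G.add_node(POST_ID)  # the post is the root of the tree
for comment_id, parent_id in cur.fetchall():
    # Reddit parent_id values carry a type prefix (t3_ = post, t1_ = comment); strip it
    parent = parent_id.split("_", 1)[-1] if parent_id else POST_ID
    G.add_edge(parent, comment_id)

# max tree depth = longest root-to-leaf path from the post
depth = max(nx.shortest_path_length(G, source=POST_ID).values())
print("max tree depth:", depth)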

Referencing conda cheat sheet like an animal... : - )

Huh so the python prototype is kinda interesting.

The most contentious post was about Elon saying that the cyber attack on Tesla came from Ukrainian VPNs, and the least contentious was a sort of internal comms "letter to real conservatives" where the poster talks about recent criticism of Trump showing up in the subreddit and says it might be outside infiltration.


The most contentious (Elon/Ukraine) post had a network viz showing a max tree depth of 10 and a pretty sparse structure, while the least contentious (internal comms) post had a max tree depth of 19 and was *much* busier. I think this means the internal conservative comms post sparked a lot of dialogue while not being contentious, whereas the most contentious post had a much higher variance in scores amongst comments but didn't spark lengthy comment-tree dialogue. This makes intuitive sense.

I plotted the distribution of scores for the comments on each post. The most contentious (Elon) post was bimodal with one super popular comment, and the least contentious was essentially normal with a tall bar around 0.

That super popular comment on the contentious (elon ukraine) post was a user expressing skepticism that these were actually ukrainians, stating "my VPN can do that, so im not convinced..."

I'm logging off for the day to go workout and then do my care job.

I'll leave the R prototype for tomorrow and perhaps also do a writeup on today's results later tonight or tomorrow. I do indeed want someone willing to pay me to work to see these projects. :P





3-26-2025

While I still want to make the R prototype, the results from the Python prototype are sufficiently interesting that I'm going to craft a writeup of the results for my Substack. Working on this now.

It seems I had a scrape gap; I discovered it after I reran the frequency analysis in the Rmd. It appears I need to be using my batch insertion scraper, as I built a longer back-history scrape into that one. I'm actually not sure what I set as the stop-scraping criteria for the initial scraper. Don't really care to figure that out at this moment, but will add it to my todo list. Ah, I just checked: the initial scraper has a post keyword match limit of 100, while the batch insertion scraper has a limit of 1000. The batch insertion scraper also has a bunch of satisfying print statements that make the experience of running it in PowerShell much nicer, so I'll probably default to this one when manually scraping.

GORGEOUS. I fixed the scrape gap and now have complete time series data. I'm glad this was a pretty simple fix. Time series data gaps are noob behavior :P.

Heh, looks like I might still have a slight gap? Hmm. OK, it looks like this might not be a scraping issue; there might simply not have been much content on a few days in mid-May. It's still sus, but there are at least a few post datapoints in the gap region. Ugh. Welp, I actually cannot fill this gap if it's a scraper issue, given the scrape/pagination limits on PRAW, and, as we have mourned, I cannot get Pushshift access. I'm just going to move forward and will think about how to include this scrape gap in my report, shamefully... It's odd that there is at least one post datapoint in the gap, but I'm very skeptical this is a natural lack of data for those days (as in, the sub simply wasn't posting about Ukraine then, though that is possible). This is good motivation to prioritize setting up a cloud-run automated scraper, as the scheduler on my dinky backup machine is not being reliable.


I integrated the SQL query into my ipynb for grabbing the most and least contentious posts (with more than 50 comments), instead of running it with the pgAdmin 4 query tool.
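In the notebook that integration is just a pandas read of the ranking query (a sketch with the usual schema and connection assumptions):

import pandas as pd
import psycopg2

RANKING_QUERY = """
    SELECT p.post_id, STDDEV(c.score) AS score_sd
    FROM posts p
    JOIN comments c ON c.post_id = p.post_id
    GROUP BY p.post_id
    HAVING COUNT(*) > 50
    ORDER BY score_sd DESC;
"""

conn = psycopg2.connect(host="localhost", dbname="reddit_scrape", user="postgres", password="CHANGE_ME")
ranked = pd.read_sql_query(RANKING_QUERY, conn)

most_contentious = ranked.iloc[0]    # highest comment-score SD
least_contentious = ranked.iloc[-1]  # lowest comment-score SD
print(most_contentious["post_id"], least_contentious["post_id"])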

A rerun of the conversation cascade script resulted in a change to the most contentious post; the least contentious post remained the same. The max tree depth of the least contentious post is deeper now as well, since more discourse has happened in the comments since the rescrape. The comment tree viz for the least contentious post looks like a thumbprint.

Oh woah, this most contentious post is pretty interesting: it's a callout of Putin for "snubbing" Trump. And I was excited about the Elon one from yesterday.

The only issue I would have with reporting this result is that the post is nascent; ideally there should be more time allowed for posts to accumulate comments. Perhaps adding a requirement to the SQL query for how long the post needs to have been live is a good idea, or a minimum comment count. I'm going to try the time requirement and see what I get. Maybe 24 hours? OK, the 24-hour requirement returned the same results. I am also going to make the comment requirement >= 50.
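Concretely, the maturity filter is just an extra WHERE clause on the ranking query (a sketch that assumes posts.created_utc is a timestamp column; if it's stored as epoch seconds, the comparison would need EXTRACT(EPOCH FROM NOW()) - 86400 instead):

# ranking query with the 24-hour maturity filter and the >= 50 comment floor
RANKING_QUERY = """
    SELECT p.post_id, STDDEV(c.score) AS score_sd, COUNT(*) AS n_comments
    FROM posts p
    JOIN comments c ON c.post_id = p.post_id
    WHERE p.created_utc < NOW() - INTERVAL '24 hours'
    GROUP BY p.post_id
    HAVING COUNT(*) >= 50
    ORDER BY score_sd DESC;
"""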


Ah, gotta go. Will come back to this later and think about what to do. I'd like to produce a written report today; maybe just reporting on the least contentious post could be cool for today.



3-28-2025

I have working conversation cascade vizzes written and produced in R, as well as comment score plots. There are some interesting insights. Writing the write-up now.

I finished the write-up of the results! https://madisonraasch.substack.com/p/conversation-cascade-analytics-rconservative

Something I wanted to say in the report but didn't: "findings suggest that circle jerks aren't circles at all, but rather very deep trees."