This article is the first feature in the Projects of UpCode Series I, where UpCode Academy instructors lead teams of graduates to work on (hopefully) fun projects.
Interested in learning how to build projects like this? This one was built by our graduates! Have a look at our Python Development course for beginners and our Data Science Introduction course for engineers trying to start a career in Data Science.
Internet communities are a wonderful thing – they have the amazing ability to both unite and divide, depending on their focus. Where there are people, there will inevitably be a community. r/Singapore (r/Sg) and Eat Drink Man Woman (EDMW) are two such online communities in Singapore.
Given their popularity, there have been countless comparisons and discussions about the differences between these two forums. However, these have mostly been qualitative assessments based on members’ gut feelings.
Here’s the question: can we do better than that and quantitatively identify the differences?
We here at UpCode think so. Three of our graduates, Cheryl, Jones, and Jun Jie (all graduates of the Data Science Intro class), together with instructor Jackie, analyzed the differences between these two prominent local forums using data science techniques.
What are r/Sg and EDMW?
But before we begin: a little context. Reddit is a forum that sees popular usage internationally. Reddit has many sub-pages (or sub-reddits) that focus on specific things such as photography, anime, cooking or even specific countries, such as r/Sg. Here’s a random screen grab of a post on the Geylang Serai night market:
EDMW, on the other hand, is a subforum that belongs to HardwareZone forum. It boasts the highest activity, and the members can talk just about anything – current news, popular issues, etc. Here’s a random screen grab of news of a mining tycoon marrying someone:
If the screenshots are any indication, there are major differences in tone, topic, and language between the two forums. In fact, EDMW has been described as trashy and crass, whereas Reddit has been described as intellectual, sophisticated, and the Dunning-Kruger effect in a nutshell. What the two have in common is that members of both are equally judgemental and equally loyal to their respective communities.
We began with an initial exploration of the two forums, spending time getting acquainted with their respective cultures. From this, we formed three hypotheses about quantifiable differences between r/Singapore and EDMW.
Between these two forums, we hypothesized that there was a difference:
- Between the language used by the members
- In politeness/crassness of the members
- In the activity of the members
For this project, we used the Python programming language to scrape, parse, and process the text from both forums. More specifically, we used pushshift.io, an open API for Reddit data, to collect comments from r/Sg. HardwareZone did not have an API to call, so we used the BeautifulSoup library to scrape its comments ourselves.
Timewise, we scraped comments posted from 1st Jan 2018 to 29th Apr 2019 (when the project started).
It would have been impractical to scrape the entire forum, and we reasoned that the time window was representative enough. All in all, we scraped 1,191,003 comments from r/Sg and 1,717,873 from EDMW. We obtained comments, commenters, and the comments’ timestamps. To organize the scraped data into a structured form, we used the Pandas library.
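The time window and the three fields we kept (comment, commenter, timestamp) can be sketched in plain Python. This is a minimal illustration, not our project code: we used Pandas for this step, and the `Comment` class, the `in_window` and `from_epoch` helpers, the sample rows, and the choice of UTC are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Comment:
    """One scraped comment (hypothetical record layout)."""
    forum: str
    author: str
    body: str
    created: datetime


# Window from the article: 1 Jan 2018 to 29 Apr 2019.
# Assumption: the end is exclusive at 30 Apr so that all of 29 Apr is included.
WINDOW_START = datetime(2018, 1, 1, tzinfo=timezone.utc)
WINDOW_END = datetime(2019, 4, 30, tzinfo=timezone.utc)


def in_window(comment):
    """Return True if the comment was posted inside the study window."""
    return WINDOW_START <= comment.created < WINDOW_END


def from_epoch(forum, author, body, epoch_seconds):
    """Build a Comment from a Unix timestamp given in seconds."""
    created = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return Comment(forum, author, body, created)


# Made-up sample rows, purely for illustration.
sample = [
    Comment("r/Sg", "user_a", "Night market was packed.",
            datetime(2018, 1, 1, tzinfo=timezone.utc)),
    Comment("EDMW", "user_b", "Old thread.",
            datetime(2017, 12, 31, 23, 59, tzinfo=timezone.utc)),
    Comment("EDMW", "user_c", "Last day of scraping.",
            datetime(2019, 4, 29, 12, 0, tzinfo=timezone.utc)),
]
kept = [c for c in sample if in_window(c)]
```

Keeping every record in one shape, whichever forum it came from, is what makes the later per-forum comparisons straightforward.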
Here’s what we found.
#1 EDMWers write more simply than r/Sg
To test the first hypothesis, we used the Gunning Fog Index to calculate readability and compare the simplicity of comments between EDMWers and r/Sg. The Gunning Fog formula generates a grade level, typically between 0 and 20. The formula estimates the years of formal education the reader requires to understand the text on first reading.
As such, if a piece of text has a grade level readability score of 6 then this should be easily readable by those educated to 6th grade in the US schooling system, i.e. 11-12 year olds. But don’t worry, we’ll do the conversion to our local education system for you.
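For the curious, the Gunning Fog formula is 0.4 × (average words per sentence + percentage of complex words), where a complex word has three or more syllables. Below is a rough sketch of that formula, not the library code we ran. The syllable counter is a naive assumption (it counts vowel groups, so silent letters inflate the count), and the sketch skips the usual exclusions for proper nouns, compound words, and common suffixes. The function names are hypothetical.

```python
import re


def count_syllables(word):
    """Rough syllable estimate: count runs of consecutive vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def gunning_fog(text):
    """Approximate Gunning Fog grade level of an English text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    # "Complex" words are those with three or more (estimated) syllables.
    complex_words = [w for w in words if count_syllables(w) >= 3]
    avg_sentence_length = len(words) / len(sentences)
    percent_complex = 100 * len(complex_words) / len(words)
    return 0.4 * (avg_sentence_length + percent_complex)


score = gunning_fog("The cat sat. The dog ran.")
```

In the sample above there are six words in two sentences and no complex words, so the score is 0.4 × 3, or about 1.2. Short sentences and short words pull the grade level down, which is the effect we are measuring on forum comments.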
We used the Textastic Python library to calculate the readability scores of both EDMW and r/Sg comments. You can find the code and writeup here.
Upon analysis, we found the Gunning Fog score of EDMW to be roughly 6 and that of r/Sg to be roughly 8, which correspond to 6th grade (11 years old) and 8th grade (13 years old) respectively. Given that Singapore’s education system is more advanced than America’s, we figured that the local equivalent is probably about a year earlier.
Caveat: this metric works best when used on pure English text, but comments from EDMW and Reddit forums will definitely contain occasional Chinese characters (especially so for EDMW), emoticons, text emojis 🙂 and other unconventionally structured content ¯\_(ツ)_/¯.
tl;dr – EDMWers write like a 10- or 11-year-old, whereas r/Sg writes like a 12- or 13-year-old
#2 EDMW is more polite than r/Sg
This may come as a surprise to you (it came as a surprise to us), but hear us out first. We defined politeness as a function of how frequently profanities (or words commonly construed as profanities) occur in our scraped messages. As such, we prepared a list of interesting words to flag.
The most commonly flagged word in EDMW is “knn” (kanina), a popular Hokkien profanity, whereas “shit” is the most commonly flagged word in r/Sg. Interestingly, 4 of the 10 flagged words in EDMW are local slang, whereas none of those in r/Sg are. The lack of the usual profanities may also reflect the language moderation that is present in EDMW but not in r/Singapore.
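Counting flagged words can be sketched as follows. This is a minimal illustration under stated assumptions: the `FLAGGED` set contains only the two words named above rather than our full list, the tokenizer is a simple regular expression, and normalizing per 1,000 words is one possible choice that the write-up does not specify. The function names are hypothetical.

```python
import re
from collections import Counter

# Only the two words named in the article; the full list is not reproduced.
FLAGGED = {"knn", "shit"}


def flag_counts(comments, flagged=FLAGGED):
    """Return (Counter of flagged words, total number of tokens)."""
    counts = Counter()
    total_tokens = 0
    for body in comments:
        tokens = re.findall(r"[a-z']+", body.lower())
        total_tokens += len(tokens)
        counts.update(t for t in tokens if t in flagged)
    return counts, total_tokens


def flagged_per_thousand(comments, flagged=FLAGGED):
    """Flagged words per 1,000 tokens (assumed normalization)."""
    counts, total = flag_counts(comments, flagged)
    return 1000 * sum(counts.values()) / total if total else 0.0


# Made-up comments, purely for illustration.
sample = ["knn so hot today", "nothing to see", "Shit, shit happens"]
counts, total = flag_counts(sample)
rate = flagged_per_thousand(sample)
```

Normalizing by the number of words (or comments) matters here, because the two forums contributed very different numbers of comments and raw counts alone would favour the smaller sample.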
tl;dr – EDMWers swear less than r/Sg
#3 There are more Redditors than EDMWers, but EDMWers are far more active
We were also interested in the members’ activity, and wanted to assess which forum had a more active membership. In the timeframe that we set, we found that there were 34,756 unique commenters for r/Sg whereas there were 15,510 unique commenters in EDMW. If each of them got into a fight, EDMWers would have to fend off two people from Reddit on average.
Out of curiosity, we also built a leaderboard to identify who the most active members were in the two forums. We decided to censor part of the names so that it would not reveal everything, but enough such that members of the respective communities can recognize who they were. It’s a good chance to celebrate their contributions to their respective forums, after all.
At first glance, we see that even though there were twice as many unique Redditors commenting in r/Sg, EDMWers posted 50-100% more comments. Activity-wise, the top commenter from EDMW commented every 14 minutes on average, whereas the top commenter from r/Sg commented every 70 minutes on average.
tl;dr – There are twice as many unique Redditors as EDMWers, but EDMWers are far more active
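The activity figures boil down to a few simple computations, sketched below. The function names are hypothetical, and the sketch assumes that "every N minutes on average" means the length of the scraping window divided by a member's comment count, which may not match exactly how the leaderboard was computed.

```python
from collections import Counter
from datetime import datetime


def unique_commenters(authors):
    """Number of distinct usernames in a list with one entry per comment."""
    return len(set(authors))


def top_commenters(authors, n=10):
    """The n most active usernames with their comment counts."""
    return Counter(authors).most_common(n)


def minutes_between_comments(comment_count, start, end):
    """Average gap in minutes if the comments were spread evenly over the window."""
    total_minutes = (end - start).total_seconds() / 60
    return total_minutes / comment_count


# Ratio of unique commenters reported in the article (r/Sg to EDMW).
ratio = 34756 / 15510

# Length of the scraping window in minutes (1 Jan 2018 up to the end of 29 Apr 2019).
window_minutes = (datetime(2019, 4, 30) - datetime(2018, 1, 1)).total_seconds() / 60
```

The ratio works out to about 2.2, which is where "two Redditors per EDMWer" comes from. Under the even-spread assumption, the window is 484 days long, so one comment every 14 minutes would imply roughly 49,800 comments over the period, and one every 70 minutes roughly 10,000.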
Interesting insights using word clouds
We also generated a word cloud to map the frequency of words appearing in all of the messages. A word cloud visualizes the prominence of keywords in bodies of text. We thought it’d be interesting to identify the words that feature most frequently in the respective forums.
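Under the hood, a word cloud is a word-frequency table with the font size scaled by count. The sketch below shows only that counting step. The tiny `STOPWORDS` set, the regular-expression tokenizer, and the function name are assumptions for illustration; a real run would use a much fuller stopword list and pass the frequencies to a rendering library.

```python
import re
from collections import Counter

# Deliberately tiny stopword list, for illustration only.
STOPWORDS = {
    "the", "a", "an", "and", "or", "to", "of", "in",
    "is", "it", "i", "you", "that", "this", "for", "on",
}


def word_frequencies(comments, stopwords=STOPWORDS, top_n=50):
    """Return the top_n most frequent non-stopword tokens with their counts."""
    counts = Counter()
    for body in comments:
        tokens = re.findall(r"[a-z']+", body.lower())
        # Skip stopwords and single-character leftovers.
        counts.update(t for t in tokens if t not in stopwords and len(t) > 1)
    return counts.most_common(top_n)


# Made-up comments, purely for illustration.
sample = ["I think the food is good", "Think again la", "the food la"]
frequencies = dict(word_frequencies(sample))
```

Removing stopwords first is what lets words like "think" or "la" stand out instead of "the" and "is".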
r/Sg Redditors’ comments are less assertive, as shown by the prevalence of the words “think”, “believe”, and “try”. These commonly appear in comments where the user says “I think…”, “I believe…”, and “I try…”. They’re also a thankful and polite bunch, as shown by the occurrence of “thank”. Overall, quite positive.
There are several insights that we can observe from EDMW’s word cloud. Firstly, there is a higher occurrence of Singlish, e.g., la, siaa, one, etc. One unusual word also stands out in the word cloud: “china”. This reflects the sentiment of a lot of threads, where EDMWers express their disdain for people from China.
We successfully scraped comments from both forums and used Python to test our hypotheses about these two communities. We believe that analyses of this sort add to our appreciation of the cultural treasures that are EDMW and r/Sg.
We did a lot more analysis than we could cover here. You can check out how we did it in our GitHub repository, and possibly extend our work.
If you want to build projects like this, you can do so by attending our Python Development course for beginners and Data Science Introduction course for engineers trying to start a career in Data Science.