Data, tools and main features
The data was retrieved with the search terms "climate change" AND "Thunberg" from the following news sources. Counts are given as (news articles, opinion pieces):
- BBC News UK (27, 0)
- Fox News (28, 3)
- The New York Times (5, 2)
- Al-Jazeera English (10, 2)
- ABC News Australia (16, 0)
- Total: 86 news articles, 7 opinions
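The search above could be expressed against NewsAPI's `/v2/everything` endpoint roughly as follows. This is a minimal sketch, not the project's actual retrieval code: the helper names `build_search_params` and `build_search_url`, the `language` and `pageSize` choices, and any source identifier or API key passed in are assumptions. No request is sent; the code only builds the URL.

```python
from urllib.parse import urlencode

NEWSAPI_EVERYTHING = "https://newsapi.org/v2/everything"
# Search terms as stated in this document
SEARCH_QUERY = '"climate change" AND "Thunberg"'


def build_search_params(sources, api_key, page_size=100):
    """Return the query parameters for one search (hypothetical helper)."""
    return {
        "q": SEARCH_QUERY,
        "sources": ",".join(sources),  # comma-separated NewsAPI source ids
        "language": "en",              # assumption: English-language results only
        "pageSize": page_size,         # assumption: largest page size
        "apiKey": api_key,
    }


def build_search_url(sources, api_key, page_size=100):
    """Return the full request URL without sending anything."""
    params = build_search_params(sources, api_key, page_size)
    return NEWSAPI_EVERYTHING + "?" + urlencode(params)
```

For example, `build_search_url(["bbc-news"], "YOUR_API_KEY")` yields a URL that can be fetched with any HTTP client; the source id `bbc-news` is an assumed value and should be checked against NewsAPI's source listing.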
Tools used
- Python 3.7: for gluing everything together + standard library tools
- Jupyter Notebook: prototyping, testing tools and results
- Pandas: Data handling & storage operations (CSV, no database)
- NewsAPI: Searching for news & retrieval of metadata
- Newspaper3k: Fetching news content & initial preprocessing
- NLTK: General NLP module (spaCy was also tested, but not adopted)
- Vader: Sentiment analysis
- Afinn-corpus: Negative word recognition
- Penn-Treebank-corpus: POS tagging
- Scikit-learn: To provide Latent Dirichlet Allocation (LDA) for topic modeling
- Flask: Web framework
- Plotly: Creating graphs
- GitHub: Version control and bridge to deployment
Preprocessing
- Newspaper3k: from HTML to text and initial preprocessing
- Manually created lists: removing code, links, picture galleries, copyright & social media links
- Stopword removal: NLTK stopword list + manually created list
- Punctuation removal: Python standard library string.punctuation + manually created list
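The last two preprocessing steps can be sketched with the standard library alone. This is an illustration under stated assumptions, not the project's code: `STOPWORDS` is a tiny stand-in for the NLTK stopword list, and `EXTRA_STOPWORDS`, `EXTRA_PUNCTUATION` and the function name `preprocess` are hypothetical examples of the manually created lists.

```python
import string

# Stand-in for the NLTK English stopword list (assumption: a few sample words)
STOPWORDS = {"the", "a", "an", "is", "to", "of", "and", "in", "on", "for"}
# Hypothetical manual additions
EXTRA_STOPWORDS = {"said"}
# Hypothetical manual punctuation list: curly quotes and the ellipsis character
EXTRA_PUNCTUATION = "\u201c\u201d\u2018\u2019\u2026"


def preprocess(text):
    """Lowercase, strip punctuation, split on whitespace, drop stopwords."""
    # Translation table that deletes every listed punctuation character
    table = str.maketrans("", "", string.punctuation + EXTRA_PUNCTUATION)
    stripped = text.lower().translate(table)
    stop = STOPWORDS | EXTRA_STOPWORDS
    return [token for token in stripped.split() if token not in stop]
```

Calling `preprocess('The activist said: "Act now!"')` returns `['activist', 'act', 'now']`. Note that deleting punctuation outright joins hyphenated words and contractions; replacing punctuation with spaces instead is a reasonable alternative.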
Most frequent words in corpus
All article text concatenated, after stopword removal. The 20 most frequent terms are shown.
| Rank | Word | Frequency |
|------|------|-----------|
| 1 | year | 250 |
| 2 | people | 213 |
| 3 | trump | 205 |
| 4 | world | 183 |
| 5 | time | 173 |
| 6 | president | 142 |
| 7 | us | 141 |
| 8 | one | 130 |
| 9 | new | 124 |
| 10 | also | 122 |
| 11 | madrid | 112 |
| 12 | activist | 106 |
| 13 | countries | 99 |
| 14 | un | 97 |
| 15 | first | 96 |
| 16 | many | 91 |
| 17 | global | 91 |
| 18 | conference | 90 |
| 19 | action | 88 |
| 20 | think | 80 |
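A frequency ranking like the one above can be produced with `collections.Counter`. This is a minimal sketch, assuming each article has already been reduced to a list of lowercase tokens with stopwords removed; the function name `most_frequent` is hypothetical. Counting over all token lists gives the same result as concatenating the text first.

```python
from collections import Counter


def most_frequent(documents, n=20):
    """Return the n most common tokens across all documents as (word, count) pairs."""
    counts = Counter()
    for tokens in documents:
        counts.update(tokens)  # add this document's token counts to the total
    return counts.most_common(n)
```

For instance, `most_frequent([["year", "people", "year"], ["trump", "year", "people"]], n=2)` returns `[('year', 3), ('people', 2)]`.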