Topic Modelling
- Tech Stack: Python 3.x, json, pandas, matplotlib, nltk, wordcloud, gensim
- GitHub URL: Project Link
This project uses the Latent Dirichlet Allocation (LDA) algorithm to perform topic modelling on a corpus of documents. The aim is to identify latent topics in the corpus and explore how those topics are distributed across the documents.
This notebook loads news-article data from a JSON file, then analyses and visualizes it. The code reads the first 10,000 lines of the file and parses each one with json.loads() into a list of dictionaries. It then extracts each article's category, builds a frequency distribution of the categories, and plots a bar chart and a pie chart of that distribution. The code also groups the articles by category and plots a line chart showing the trend of publication dates within each category. Finally, it defines a function that cleans the text by removing URLs, HTML tags, punctuation, and stop words, and tokenizing the text.
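The loading, counting, and cleaning steps described above might look roughly like the sketch below. It assumes one JSON object per line, and the field names `category` and `headline` are hypothetical. A small inline stop-word set stands in for nltk's English list so the example runs without downloading corpora, and whitespace splitting stands in for whichever tokenizer the notebook uses.

```python
import json
import re
import string
from collections import Counter
from itertools import islice

# Stand-in for nltk.corpus.stopwords.words("english"); assumed for this sketch.
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "to", "in", "is", "for", "on", "at"}

TAG_RE = re.compile(r"<[^>]+>")                # HTML tags such as <b> or <a href=...>
URL_RE = re.compile(r"https?://\S+|www\.\S+")  # bare URLs
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def load_records(lines, limit=10_000):
    """Parse at most `limit` lines, each holding one JSON object."""
    return [json.loads(line) for line in islice(lines, limit) if line.strip()]


def clean_text(text):
    """Remove HTML tags, URLs, punctuation, and stop words; return tokens."""
    text = TAG_RE.sub(" ", text)   # tags first, so URLs inside attributes go with them
    text = URL_RE.sub(" ", text)
    text = text.lower().translate(PUNCT_TABLE)
    return [token for token in text.split() if token not in STOP_WORDS]


# Sample lines invented for illustration; a real run would pass an open file handle.
sample_lines = [
    '{"category": "POLITICS", "headline": "Senate votes on the <b>budget</b> bill"}',
    '{"category": "SPORTS", "headline": "Team wins at home, see https://example.com/recap"}',
    '{"category": "POLITICS", "headline": "Campaign rally draws a crowd"}',
]

records = load_records(sample_lines)
category_counts = Counter(record["category"] for record in records)
tokens = [clean_text(record["headline"]) for record in records]

print(category_counts.most_common())
print(tokens)
```

With a file on disk, the same function can be used as `with open(path) as f: records = load_records(f)`. The resulting `category_counts` can be passed to pandas or matplotlib for the bar and pie charts.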