Research
My research centers on a broad range of areas, including Large Language Models (LLMs), LLM Agents, Human–LLM Interaction, Large Multimodal Models (LMMs), NLP for Social Good, NLP for Low-Resource Languages, AI in Healthcare, Vision–Language Models (VLMs), Trustworthy AI, and Multimodal Agents. I am deeply involved in both the theoretical exploration and the practical application of these technologies across diverse real-world domains. Below are some of the research areas I have worked on or am currently exploring.
1. Large Language Models for Social Media Analysis
Sentiment analysis in the context of Bangladesh’s elections plays a crucial role in understanding voter perceptions and public opinion about different political parties. By analyzing social media platforms such as Facebook, X (formerly Twitter), and online newspapers, it becomes possible to capture citizens’ emotions, whether positive, negative, or neutral, towards electoral campaigns and political agendas. Hate speech detection, on the other hand, focuses on identifying and classifying language that promotes violence or discrimination against people based on race, religion, gender, or sexual orientation. Both tasks are highly important in today’s digital world, where online communication can significantly influence society. Large Language Models (LLMs) such as Gemini 1.5 Pro and GPT-3.5 Turbo have transformed natural language processing (NLP). They support prompting techniques like Zero-Shot Learning (ZSL) and Few-Shot Learning (FSL), which are especially useful for sentiment analysis and hate speech detection. In ZSL, the model can classify text into categories such as positive, negative, neutral, or hateful without being trained on task-specific labeled data, relying instead on its broad pretraining knowledge. In FSL, the model learns from only a few labeled examples, quickly adapting to the task with minimal data. These prompting techniques are particularly valuable for low-resource languages like Bangla, where large labeled datasets are often unavailable, making ZSL and FSL powerful alternatives to traditional methods.
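The difference between the two prompting styles can be sketched in a few lines. This is a minimal illustration of zero-shot versus few-shot prompt construction for Bangla sentiment classification; the label set, wording, and example sentences are my own illustrative choices, not the exact prompts used in the papers below.

```python
# Illustrative zero-shot vs. few-shot prompt builders for Bangla sentiment
# classification. Labels and phrasing are assumptions for demonstration only.

LABELS = ["positive", "negative", "neutral"]

def zero_shot_prompt(text: str) -> str:
    """Zero-shot: the model relies only on its broad pretraining knowledge."""
    return (
        "Classify the sentiment of the following Bangla text as "
        f"{', '.join(LABELS)}.\n"
        f"Text: {text}\n"
        "Sentiment:"
    )

def few_shot_prompt(text: str, examples: list[tuple[str, str]]) -> str:
    """Few-shot: a handful of labeled demonstrations precede the query."""
    demos = "\n".join(f"Text: {t}\nSentiment: {s}" for t, s in examples)
    return (
        "Classify the sentiment of the following Bangla text as "
        f"{', '.join(LABELS)}.\n\n"
        f"{demos}\n\n"
        f"Text: {text}\nSentiment:"
    )

examples = [("চমৎকার উদ্যোগ!", "positive"), ("খুবই হতাশাজনক।", "negative")]
print(zero_shot_prompt("নির্বাচন নিয়ে মানুষের মধ্যে আগ্রহ বাড়ছে।"))
print(few_shot_prompt("নির্বাচন নিয়ে মানুষের মধ্যে আগ্রহ বাড়ছে।", examples))
```

Either string would then be sent to an LLM such as Gemini 1.5 Pro or GPT-3.5 Turbo; only the presence of labeled demonstrations distinguishes FSL from ZSL.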
Motamot: A Dataset for Revealing the Supremacy of Large Language Models over Transformer Models in Bengali Political Sentiment Analysis
Authors: Fatema Tuj Johora Faria, Mukaffi Bin Moin, Rabeya Islam Mumu, Md Mahabubul Alam Abir, Abrar Nawar Alfy, Mohammad Shafiul Alam
Conference: Published in the IEEE Region 10 Symposium (TENSYMP 2024)
View Paper
Investigating the Predominance of Large Language Models in Low-Resource Bangla Language Over Transformer Models for Hate Speech Detection: A Comparative Analysis
Authors: Fatema Tuj Johora Faria, Laith H. Baniata, Sangwoo Kang
Journal: Published in MDPI Mathematics (Q1)
View Paper
2. Vision–Language Models for Medical Visual Question Answering
Medical Visual Question Answering (MedVQA) lies at the intersection of computer vision, natural language processing, and clinical decision-making, aiming to generate accurate responses from medical images paired with complex inquiries. Traditional approaches in MedVQA often rely on supervised learning with limited annotated datasets, making them prone to overfitting and limiting their generalization across diverse medical cases. Zero-shot learning offers a way to bypass large-scale annotation, but it frequently struggles with complex reasoning, producing direct answers without revealing the underlying logic. This lack of transparency is particularly concerning in medical applications, where understanding the reasoning behind a diagnosis is as crucial as the answer itself. To overcome these challenges, a chain-of-thought prompting framework is employed to guide vision–language models to perform stepwise reasoning. By decomposing the problem, analyzing both visual and textual information sequentially, and generating an explicit reasoning path, this approach enhances interpretability, trustworthiness, and the clinical relevance of model predictions.
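A core practical step in this setup is separating the model's reasoning trace from its final answer so both can be shown to clinicians. The sketch below is a hypothetical illustration: the prompt wording and the "Final answer:" output marker are assumptions about the response format, not a fixed behavior of any particular model.

```python
# Hypothetical chain-of-thought MedVQA prompt plus a parser that splits the
# stepwise rationale from the final answer. The "Final answer:" marker is an
# assumed output convention, enforced here only through the prompt text.

def cot_vqa_prompt(question: str) -> str:
    return (
        f"Question about the attached medical image: {question}\n"
        "Let's reason step by step about the visual findings, then state the "
        "conclusion on a line beginning with 'Final answer:'."
    )

def split_reasoning(response: str) -> tuple[str, str]:
    """Return (reasoning, answer); keep the full text if no marker is found."""
    marker = "Final answer:"
    if marker in response:
        reasoning, _, answer = response.partition(marker)
        return reasoning.strip(), answer.strip()
    return response.strip(), ""

mock_response = ("Step 1: The opacity sits in the right lower lobe.\n"
                 "Step 2: Its borders suggest consolidation.\n"
                 "Final answer: pneumonia")
reasoning, answer = split_reasoning(mock_response)
print(answer)  # pneumonia
```

Exposing `reasoning` alongside `answer` is what makes the prediction auditable: a clinician can reject the answer when the intermediate steps do not match the image.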
Analyzing Diagnostic Reasoning of Vision–Language Models via Zero-Shot Chain-of-Thought Prompting in Medical Visual Question Answering
Authors: Fatema Tuj Johora Faria, Laith H. Baniata, Ahyoung Choi, Sangwoo Kang
Journal: Published in MDPI Mathematics (Q1)
View Paper
3. Large Language Models for Mental Health Advice Generation
Bangla text generation for mental health advice aims to provide empathetic, culturally relevant guidance to the Bangladeshi population, addressing sensitive issues such as sexual abuse, miscarriage, divorce, and self-harm ideation. Traditional approaches face challenges due to limited domain-specific datasets that reflect the country’s unique linguistic and societal context, making general-purpose models less effective. Previous research has explored mental health text generation in other languages such as English, Hindi, and Chinese, but there has been no domain-specific dataset or dedicated study for Bangla. Zero-shot learning allows models to generate responses without extensive training, but it often produces generic or contextually inappropriate advice, limiting its usefulness in mental health applications. To overcome these limitations, chain-of-thought prompting with advanced LLMs such as GPT-4o Mini, Claude 3.7 Sonnet, and Gemini 2.5 Pro can be applied. By guiding models to reason step by step, analyze context, and incorporate societal nuances, this approach enhances the relevance, empathy, and interpretability of generated advice.
MindSpeak-Bangla: A Domain-Specific Dataset for Automatic Chain-of-Thought Adaptation in Mental Health Support for Low-Resource Bengali Language Settings
Authors: Fatema Tuj Johora Faria, Mukaffi Bin Moin, Md. Mahfuzur Rahman, Khan Hasib, Md. Jakir Hossen, M. F. Mridha
Journal: Under Review in IEEE Open Journal of the Computer Society (Q1)
4. Large Multimodal Models for Remote Sensing Imaging
Remote Sensing Visual Question Answering (RSVQA) extends the capabilities of traditional computer vision and natural language processing by enabling models to answer complex, natural-language questions about geospatial data. It plays a critical role in applications such as environmental monitoring, urban planning, disaster response, and resource management, where accurate interpretation of spatial and contextual information is essential. Traditional approaches and zero-shot learning methods often fall short in RSVQA because they tend to generate direct answers without explicitly reasoning over spatial relationships, contextual cues, or multi-step dependencies in satellite imagery, which can result in incorrect or superficial responses for complex queries that require layered understanding. To overcome these limitations, chain-of-thought prompting guides large multimodal models to reason step by step, breaking down problems into interpretable intermediate steps that reflect spatial and contextual analysis. Integrating self-consistency with chain-of-thought prompting further enhances reliability by generating multiple reasoning paths and selecting the most consistent answer, reducing errors from individual reasoning chains and improving model robustness, interpretability, and confidence in geospatial decision-making. Using proprietary large multimodal models such as GPT‑4o, Grok 3, Gemini 2.5 Pro, and Claude 3.7 Sonnet, RSVQA can advance beyond simple answer prediction toward more explainable and trustworthy analysis of Earth observation data.
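The self-consistency step reduces, at its core, to a majority vote over independently sampled reasoning paths. The following minimal sketch mocks the sampled final answers as plain strings; in practice each would come from a separate temperature-sampled chain-of-thought completion of a large multimodal model.

```python
# Minimal self-consistency sketch: sample several chain-of-thought paths
# (mocked here as final-answer strings) and keep the most frequent answer.
# Real use would draw each path from an LMM with temperature > 0.

from collections import Counter

def self_consistent_answer(sampled_answers: list[str]) -> str:
    """Majority vote over the final answers of independent reasoning paths."""
    counts = Counter(a.strip().lower() for a in sampled_answers)
    return counts.most_common(1)[0][0]

# Five mocked reasoning paths for a question like "How many airstrips are visible?"
paths = ["two", "two", "three", "Two", "two"]
print(self_consistent_answer(paths))  # two
```

Because a single chain can go wrong at any intermediate step, aggregating over several chains trades extra inference cost for robustness, which is exactly the benefit described above.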
Towards Robust Chain-of-Thought Prompting with Self-Consistency for Remote Sensing VQA: An Empirical Study Across Large Multimodal Models
Authors: Fatema Tuj Johora Faria, Laith H. Baniata, Ahyoung Choi, Sangwoo Kang
Journal: Under Review in MDPI Mathematics (Q1)
5. Large Language Models for Natural Language Inference (NLI)
Natural Language Inference (NLI) is an important task in natural language processing (NLP) that examines the relationship between two sentences to determine if one sentence (the premise) supports, contradicts, or is unrelated to another sentence (the hypothesis). This capability is crucial for various applications such as answering questions, retrieving information, and creating chatbots, as it enhances computers' understanding of human language. NLI consists of three main categories: entailment, where the truth of the hypothesis can be inferred from the premise; contradiction, where both cannot be true simultaneously; and neutral, where the hypothesis's truth is independent of the premise. Understanding NLI is particularly vital for languages like Bangla, which differ significantly from more widely studied languages like English, as it can improve how NLP models process and interpret Bangla text. With the rise of digital communication, there is an increasing demand for technologies that comprehend Bangla, and NLI can enhance chatbots, virtual assistants, and translation services to better understand user queries and respond accurately. Large Language Models (LLMs) are trained on extensive amounts of text data, enabling them to learn various language patterns and meanings, which is especially beneficial for NLI where understanding subtle connections between sentences is key.
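The three relations are easiest to see on concrete sentence pairs. The examples and prompt wording below are invented for illustration; they are not items from any benchmark dataset.

```python
# Illustrative premise-hypothesis pairs for the three NLI relations, plus a
# simple zero-shot prompt framing NLI as three-way classification. All
# sentences and the prompt template are invented examples.

NLI_EXAMPLES = [
    ("A man is playing a guitar on stage.",  # premise
     "A man is performing music.",           # hypothesis follows from premise
     "entailment"),
    ("A man is playing a guitar on stage.",
     "The stage is empty.",                  # cannot both be true
     "contradiction"),
    ("A man is playing a guitar on stage.",
     "The concert is sold out.",             # truth is independent
     "neutral"),
]

def nli_prompt(premise: str, hypothesis: str) -> str:
    return (f"Premise: {premise}\nHypothesis: {hypothesis}\n"
            "Relation (entailment / contradiction / neutral):")

for premise, hypothesis, label in NLI_EXAMPLES:
    print(nli_prompt(premise, hypothesis), "->", label)
```

The same template applies directly to Bangla premise-hypothesis pairs; only the sentences change, which is what makes prompt-based NLI attractive when labeled Bangla data is scarce.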
Unraveling the Dominance of Large Language Models Over Transformer Models for Bangla Natural Language Inference: A Comprehensive Study
Authors: Fatema Tuj Johora Faria, Mukaffi Bin Moin, Asif Iftekher Fahim, Pronay Debnath, Faisal Muhammad Shah
Conference: Presented at the 4th International Conference on Computing and Communication Networks (ICCCNet-2024)
View Paper
6. Multimodal Deep Learning
Multimodal deep learning is a method that improves understanding by combining images and text. This approach uses three main techniques: early fusion, late fusion, and intermediate fusion. In early fusion, raw images and text are combined into a single input before the model processes them. This allows the model to learn a shared representation, but it can also make it sensitive to noise from either the images or the text. Late fusion works differently. Here, images and text are processed separately using different models. The results are combined later on. This method is flexible and allows each model to be optimized independently, but it might miss important connections between the two modalities that could improve performance. Intermediate fusion is a middle ground. It combines features from images and text at different stages of processing. This way, it keeps the unique qualities of each type of data while also sharing useful information between them. A major challenge in using multimodal deep learning for the Bangla language is the lack of annotated datasets that pair images and text. Most existing datasets are not diverse enough, which can lead to models that don't work well in different situations. There is still a significant need for high-quality, labeled image-text datasets in Bangla.
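The three fusion strategies can be contrasted in a toy sketch. The "encoders" and "classifier" below are stand-in functions over made-up feature vectors, assumed purely for illustration; a real system would use CNN or transformer features and learned heads.

```python
# Toy contrast of early, late, and intermediate fusion. The encoders and the
# scoring head are stand-ins; only the *placement* of the fusion step matters.

def image_encoder(pixels):       # stand-in for a vision backbone
    return [sum(pixels) / len(pixels), max(pixels)]

def text_encoder(tokens):        # stand-in for a text backbone
    return [float(len(tokens)), sum(len(t) for t in tokens) / len(tokens)]

def classify(features):          # stand-in scoring head (e.g. fake-news score)
    return sum(features) / len(features)

pixels, tokens = [0.2, 0.8, 0.5], ["fake", "news", "claim"]

# Early fusion: concatenate modality features before a single classifier.
early = classify(image_encoder(pixels) + text_encoder(tokens))

# Late fusion: classify each modality separately, then combine the scores.
late = (classify(image_encoder(pixels)) + classify(text_encoder(tokens))) / 2

# Intermediate fusion: mix partial representations, then continue processing.
mixed = [i * t for i, t in zip(image_encoder(pixels), text_encoder(tokens))]
intermediate = classify(mixed)

print(early, late, intermediate)
```

The trade-off described above is visible in the structure: early fusion exposes the classifier to raw cross-modal noise, late fusion never lets the modalities interact before scoring, and intermediate fusion mixes partial representations in between.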
MultiBanFakeDetect: Integrating Advanced Fusion Techniques for Multimodal Detection of Bangla Fake News in Under-Resourced Contexts
Authors: Fatema Tuj Johora Faria, Mukaffi Bin Moin, Zayeed Hasan, Md Arafat Alam Khandaker, Niful Islam, Khan Md Hasib, M. F. Mridha
Journal: Published in International Journal of Information Management Data Insights (Q1)
View Paper
SentimentFormer: A Transformer-Based Multi-Modal Fusion Framework for Enhanced Sentiment Analysis of Memes in Under-Resourced Bangla Language
Authors: Fatema Tuj Johora Faria, Laith H. Baniata, Mohammad H. Baniata, Mohannad A. Khair, Ahmed Ibrahim Bani Ata, Chayut Bunterngchit, Sangwoo Kang
Journal: Published in MDPI Electronics (Q2)
View Paper
Uddessho: An Extensive Benchmark Dataset for Multimodal Author Intent Classification in Low-Resource Bangla Language
Authors: Fatema Tuj Johora Faria, Mukaffi Bin Moin, Md. Mahfuzur Rahman, Md Morshed Alam Shanto, Asif Iftekher Fahim, Md. Moinul Hoque
Conference: Published in 18th International Conference on Information Technology and Applications (ICITA 2024)
View Paper
BanglaCalamityMMD: A Comprehensive Benchmark Dataset for Multimodal Disaster Identification in the Low-Resource Bangla Language
Authors: Fatema Tuj Johora Faria, Mukaffi Bin Moin, Busra Kamal Rafa, Swarnajit Saha, Md. Mahfuzur Rahman, Khan Md Hasib, M. F. Mridha
Journal: Accepted for Publication in the International Journal of Disaster Risk Reduction (Q1)
BanglaMemeEvidence: A Multimodal Benchmark Dataset for Explanatory Evidence Detection in Bengali Memes
Authors: Fatema Tuj Johora Faria, Mukaffi Bin Moin, Asif Iftekher Fahim, Pronay Debnath, Faisal Muhammad Shah
Conference: Under Review in 2025 9th International Conference on Vision, Image and Signal Processing (ICVISP 2025)
7. Bengali Language Generation with Large Language Models
Text generation tasks involve creating human-like text using models, especially in Natural Language Processing (NLP). One important aspect is paraphrase generation, where large language models (LLMs) can understand context and semantics to create different expressions of the same idea. By using few-shot learning techniques, these models can be trained with just a little data to improve linguistic diversity, which is useful for educational content and creative writing in Bengali. Another key task is reading comprehension, where LLMs need to understand and generate text to answer questions or summarize information. This ability is enhanced by fine-tuning LLMs with Bengali datasets, making them better for educational tools in languages with fewer resources. Additionally, generating formal documents like applications or reports can be challenging, as it requires maintaining the right tone and structure. Here, Retrieval-Augmented Generation (RAG) techniques help LLMs use external information effectively, resulting in more relevant and organized outputs, which is crucial in professional settings. In the mental health domain, LLMs can provide empathetic and contextually appropriate advice. By training these models on specialized datasets, they can produce responses that resonate culturally with Bengali speakers, ensuring the advice is relatable and effective. However, developing LLMs for low-resource languages like Bangla comes with challenges, such as limited training data.
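The RAG idea mentioned above can be sketched in miniature: score documents against the query, then prepend the best match to the prompt. This is a hedged sketch only; the word-overlap retriever and the document snippets are stand-ins for the dense-embedding retrieval and real corpora a production pipeline would use.

```python
# Minimal RAG sketch: retrieve the document with the highest word overlap
# with the query, then build a grounded prompt. Real pipelines use dense
# embeddings and an actual LLM call; both are mocked away here.

def retrieve(query: str, docs: list[str]) -> str:
    """Pick the document sharing the most words with the query."""
    query_words = set(query.lower().split())
    return max(docs, key=lambda d: len(query_words & set(d.lower().split())))

def rag_prompt(query: str, docs: list[str]) -> str:
    context = retrieve(query, docs)
    return f"Context: {context}\nQuestion: {query}\nAnswer in formal Bangla:"

docs = [
    "Template and tone guidelines for formal leave applications.",
    "Glossary of common Bangla paraphrase patterns.",
]
print(rag_prompt("How do I write a formal leave application?", docs))
```

Grounding the prompt in retrieved guidelines is what keeps generated formal documents on-tone and well-structured, rather than relying on the model's parametric memory alone.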
Enhancing Bangla NLP Tasks with LLMs: A Study on Few-Shot Learning, RAG, and Fine-Tuning Techniques
Authors: Saidur Rahman Sujon, Ahmadul Karim Chowdhury, Fatema Tuj Johora Faria, Mukaffi Bin Moin, Faisal Muhammad Shah
Conference: Under Review in 2025 28th International Conference on Computer and Information Technology (ICCIT 2025)
8. Explainable AI in Medical Image Analysis
Medical image analysis is important for diagnosing and treating diseases, especially in eye care and cancer treatment. This field relies on advanced machine learning methods, most notably convolutional neural networks (CNNs), to study medical images. However, these models can be complex and hard to understand, which is a problem in clinical settings where knowing how decisions are made is crucial. To address this, explainable AI techniques are developed to clarify how CNNs classify different eye conditions from retinal images. Additionally, new segmentation models are introduced to accurately identify blood vessels in these images, which helps doctors assess vascular health. By combining classification and segmentation, eye doctors can make better decisions and provide timely care, ultimately improving patient outcomes. Segmentation is vital in medical imaging because it separates images into meaningful regions, allowing healthcare providers to focus on specific features for accurate diagnosis and treatment. For lung and colon cancer detection, explainable AI techniques also enhance understanding by showing how certain image features affect predictions, which helps build trust among healthcare professionals and improves communication with patients about their diagnoses.
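One simple, model-agnostic explanation technique in this family is occlusion-based saliency: mask one region of the image at a time and record how much the prediction score drops. The sketch below uses a made-up 3x3 "image" and a stand-in classifier; a real pipeline would occlude patches of fundus or histopathology images and re-run a CNN.

```python
# Toy occlusion-based explanation. The 3x3 image and the scoring function are
# invented stand-ins: the score drop after masking a region approximates how
# important that region was to the model's prediction.

def model_score(image):
    """Stand-in classifier: responds strongly to the bright center pixel."""
    return image[1][1] * 0.9 + sum(sum(row) for row in image) * 0.01

def occlusion_map(image):
    base = model_score(image)
    heat = [[0.0] * 3 for _ in range(3)]
    for r in range(3):
        for c in range(3):
            occluded = [row[:] for row in image]
            occluded[r][c] = 0.0                        # mask one "region"
            heat[r][c] = base - model_score(occluded)   # drop = importance
    return heat

img = [[0.1, 0.1, 0.1],
       [0.1, 0.9, 0.1],
       [0.1, 0.1, 0.1]]
heat = occlusion_map(img)
print(heat[1][1])  # the center region dominates the importance map
```

Overlaying such a heat map on the original scan is what lets a clinician check that the model attended to the lesion rather than to an artifact.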
Explainable Convolutional Neural Networks for Retinal Fundus Classification and Cutting-Edge Segmentation Models for Retinal Blood Vessels from Fundus Images
Authors: Fatema Tuj Johora Faria, Mukaffi Bin Moin, Pronay Debnath, Asif Iftekher Fahim, Faisal Muhammad Shah
Journal: Under Review in Journal of Visual Communication and Image Representation (Q1)
View Paper
Exploring Explainable AI Techniques for Improved Interpretability in Lung and Colon Cancer Classification
Authors: Mukaffi Bin Moin, Fatema Tuj Johora Faria, Swarnajit Saha, Busra Kamal Rafa, Mohammad Shafiul Alam
Conference: Presented at the 4th International Conference on Computing and Communication Networks (ICCCNet-2024)
View Paper
9. Machine Translation and Regional Dialect Detection
Machine Translation (MT) is a part of natural language processing (NLP) that helps automatically translate text from one language to another. A major improvement in MT comes from Transformer models, which make translations faster and better. These models can read entire sentences at once, making them great for translating complex sentences. Text classification is another important NLP task that involves assigning predefined categories or labels to text data. However, many languages, especially low-resource languages, lack enough linguistic resources like annotated corpora and dictionaries to develop advanced NLP applications. This is often the case for languages spoken by marginalized communities. In Bangladesh, there are many regional dialects of Bangla that can differ greatly in vocabulary, pronunciation, and syntax. For instance, the daily conversational dialects in regions like Sylhet, Noakhali, and Mymensingh have unique expressions and phrases that are distinct from Standard Bangla. Dialect Machine Translation (DMT) aims to translate these regional dialects into Standard Bangla, the official language, but it faces challenges like the wide variability in dialects, which can create confusion in translation models. Additionally, there is often a scarcity of datasets specifically designed for dialect translations, making it hard to create strong models. Dialects also carry cultural meanings that may not always translate well, so it’s essential to handle local expressions carefully.
Vashantor: A Large-scale Multilingual Benchmark Dataset for Automated Translation of Bangla Regional Dialects to Bangla Language
Authors: Fatema Tuj Johora Faria, Mukaffi Bin Moin, Ahmed Al Wase, Mehidi Ahmmed, Md Rabius Sani, Tashreef Muhammad
Journal: Under Review in Array (Q1)
View PaperBanglaDialect-Synth: An Approach for Synthetic Corpus Expansion of Bangla Regional Dialects Through Few-Shot Learning with Large Language Models
Authors: (Ongoing Work)
Conference: (Ongoing Work)
10. Large Language Models for Bangla Medical Question Answering
Developing a medical question-answering system in low-resource languages like Bangla presents unique challenges because of limited datasets and pre-trained models. Using closed-source Large Language Models such as Claude 4, GPT-4.1, and Gemini 2.5 Pro with zero-shot chain-of-thought prompting allows the system to elicit step-by-step reasoning through a generic trigger phrase rather than hand-crafted, task-specific exemplars. These models can process Bangla medical texts, clinical notes, and patient records while handling complex medical terminology and code-switching with English terms. With zero-shot chain-of-thought prompting, the system can answer various question types, including factoid, list, causal, temporal, and unanswerable questions, efficiently and accurately. This approach can support clinical decision-making, improve patient education, and enhance healthcare accessibility in rural areas by delivering contextually relevant responses directly in Bangla.
BanglaMedQA: A Comprehensive Dataset for Adapting Zero-Shot Chain-of-Thought Reasoning in Bengali Medical Question Answering
Authors: (Ongoing Work)
Conference: (Ongoing Work)
11. Generative Adversarial Networks in Agriculture
Generative Adversarial Networks (GANs) have revolutionized the field of machine learning, offering groundbreaking applications across various industries, particularly in agriculture. One significant area of focus is the detection of potato diseases, where gathering images of infected crops can be challenging due to limited access to diseased samples at different stages. GANs provide an innovative solution by generating synthetic data that mimics real-world conditions, significantly enhancing the ability to train and improve machine learning models. They can generate a wide variety of diseased potato images, increasing the size and diversity of training datasets, which enables models to better generalize and identify potato diseases. By producing realistic images reflecting natural disease patterns, GANs assist researchers and farmers in developing more accurate diagnostic tools. Explainable AI is also pivotal in building trust with agricultural professionals by offering transparency into how the models make decisions; visual explanations for disease classification foster confidence among the farmers and researchers who rely on these technologies. Instance segmentation, in turn, identifies and delineates individual potato plants or specific infected areas within each plant at the pixel level. By accurately separating diseased regions from healthy parts of the plant, it enables a detailed analysis of disease severity and spread.
PotatoGANs: Utilizing Generative Adversarial Networks, Instance Segmentation, and Explainable AI for Enhanced Potato Disease Identification and Classification
Authors: Fatema Tuj Johora Faria, Mukaffi Bin Moin, Mohammad Shafiul Alam, Ahmed Al Wase, Md. Rabius Sani, Khan Md Hasib
Conference: Under Review in 11th IEEE International Conference on Sustainable Technology and Engineering
View Paper
12. Computer Vision Applications in Agriculture
Disease classification in agriculture is crucial for modern farming, particularly concerning food security and sustainable practices, as the increasing global population drives the demand for high-yield, healthy crops. Disease outbreaks can severely impact crop production, resulting in financial losses for farmers and threatening food supply chains; thus, timely and accurate classification of plant diseases is essential for implementing effective management strategies, reducing losses, and ensuring the sustainability of agricultural systems. For example, potatoes, a staple food crop and significant source of carbohydrates, are susceptible to various diseases such as Black Scurf and Common Scab. Advancements in machine learning and deep learning, particularly through computer vision, have made convolutional neural networks (CNNs) powerful tools for image classification tasks, including disease detection in crops. CNNs excel at analyzing visual data, automatically learning spatial hierarchies of features from images; thus, for potato disease classification, a CNN can be trained on labeled images of healthy and diseased potatoes, identifying patterns associated with each disease for accurate classification. Hybrid models integrating Long Short-Term Memory (LSTM), Gated Recurrent Units (GRU), and Bidirectional LSTM (Bi-LSTM) architectures enhance classification by capturing spatial and temporal dependencies within the data.
Classification of Potato Disease with Digital Image Processing Technique: A Hybrid Deep Learning Framework
Authors: Fatema Tuj Johora Faria, Mukaffi Bin Moin, Ahmed Al Wase, Md Rabius Sani, Khan Md Hasib, Mohammad Shafiul Alam
Conference: 2023 IEEE 13th Annual Computing and Communication Workshop and Conference (CCWC)
View Paper