Responsible AI for Voice Diagnostics · PhD project (Jan 2024 – Present)
Responsible AI for Voice Diagnostics (RAIVD) is a project funded by NWO (Dutch Research Council). My PhD is embedded within this project, focusing on speech assessment for L2 speakers using large language and speech models.
What I work on
Designing and fine-tuning large language and speech models (SpeechLLMs) using supervised, multi-task, and preference-based training approaches for automated pronunciation and fluency assessment.
Building retrieval-augmented generation (RAG) pipelines for pedagogical feedback systems, grounding LLM outputs with reference material.
Implementing confidence-based speech scoring components, comparing alignment-based and alignment-free methods and evaluating system-level trade-offs.
Establishing evaluation and benchmarking frameworks to validate model improvements and guide deployment decisions.
Automated L2 speech assessment can assign proficiency labels, but often lacks interpretability. We propose a rubric-guided SpeechLLM for multi-aspect, multi-granular assessment, trained with a hybrid objective combining supervised fine-tuning and Bounded Direct Preference Optimization. The model jointly predicts ordinal labels at the sentence level (accuracy, fluency, prosody) and accuracy labels at the word/phoneme level, and generates a natural-language rationale in the same response. On SpeechOcean762, our approach matches or outperforms single-granularity models while remaining competitive with prior approaches. We analyze rationale reliability along two axes: self-consistency with model predictions and alignment with ground-truth labels, using sentiment consistency (plausibility) and mention-based agreement (faithfulness). Rationales are plausible at the sentence level, but faithfulness degrades at the word/phoneme level: references are sparse and weakly aligned with token-level labels.
@misc{parikh2026finetunedspeechllmjointmultigranular,
  title={{A Finetuned SpeechLLM for Joint Multi-Granular L2 Assessment and Natural-Language Rationales}},
  author={Parikh, Aditya Kamlesh and Tejedor-Garcia, Cristian and Cucchiarini, Catia and Strik, Helmer},
  year={2026},
  archiveprefix={arXiv},
  primaryclass={cs.CL},
  url={https://arxiv.org/abs/2606.09470}
}
LREC
Rubric-Guided Fine-tuning of SpeechLLMs for Multi-Aspect, Multi-Rater L2 Reading-Speech Assessment
Aditya Kamlesh Parikh, Cristian Tejedor-García, Catia Cucchiarini, and Helmer Strik
In Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026), May 2026
Reliable and interpretable automated assessment of second-language (L2) speech remains a central challenge, as large speech-language models (SpeechLLMs) often struggle to align with the nuanced variability of human raters. To address this, we introduce a rubric-guided reasoning framework that explicitly encodes multi-aspect human assessment criteria: accuracy, fluency, and prosody, while calibrating model uncertainty to capture natural rating variability. We fine-tune the Qwen2-Audio-7B-Instruct model using multi-rater human judgments and develop an uncertainty-calibrated regression approach supported by conformal calibration for interpretable confidence intervals. Our Gaussian uncertainty modeling and conformal calibration approach achieves the strongest alignment with human ratings, outperforming regression and classification baselines. The model reliably assesses fluency and prosody while highlighting the inherent difficulty of assessing accuracy. Together, these results demonstrate that rubric-guided, uncertainty-calibrated reasoning offers a principled path toward trustworthy and explainable SpeechLLM-based speech assessment.
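To make the idea of conformal calibration on top of Gaussian predictions more concrete, here is a minimal sketch of split conformal calibration with normalized residuals. It is a generic textbook variant, not necessarily the procedure used in the paper: the function names, the score definition |y - mu| / sigma, and the toy calibration values are all assumptions made for illustration.

```python
import math

def conformal_quantile(cal_mu, cal_sigma, cal_y, alpha=0.1):
    """Return the scaling factor q for intervals with ~(1 - alpha) coverage.

    Assumes the model outputs a mean and a standard deviation per utterance
    (hypothetical setup). The nonconformity score is the residual normalized
    by the predicted standard deviation.
    """
    scores = sorted(abs(y - m) / s for m, s, y in zip(cal_mu, cal_sigma, cal_y))
    n = len(scores)
    # Finite-sample corrected rank used in split conformal prediction.
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for the requested coverage level.
        return math.inf
    return scores[k - 1]

def predict_interval(mu, sigma, q):
    """Calibrated interval around a new Gaussian prediction."""
    return mu - q * sigma, mu + q * sigma

# Toy calibration set (made-up numbers, not data from the paper).
cal_mu    = [7.0, 5.0, 8.0, 6.0, 9.0, 4.0, 7.0, 6.0, 8.0]
cal_sigma = [1.0, 2.0, 1.0, 0.5, 1.0, 2.0, 1.0, 0.5, 1.0]
cal_y     = [8.0, 6.0, 8.0, 6.5, 7.0, 5.0, 7.5, 6.0, 9.0]

q = conformal_quantile(cal_mu, cal_sigma, cal_y, alpha=0.1)
low, high = predict_interval(mu=6.0, sigma=0.5, q=q)
print(f"q = {q}, interval = [{low}, {high}]")
```

Scaling the interval by the predicted standard deviation lets utterances the model is less sure about receive wider intervals, which is one way to reflect rater variability in the reported confidence range.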
@inproceedings{parikh-etal-2026-rubric,
  title={Rubric-Guided Fine-tuning of SpeechLLMs for Multi-Aspect, Multi-Rater L2 Reading-Speech Assessment},
  author={Parikh, Aditya Kamlesh and Tejedor-García, Cristian and Cucchiarini, Catia and Strik, Helmer},
  booktitle={Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)},
  month=may,
  year={2026},
  pages={10255--10265},
  address={Palma, Mallorca, Spain},
  publisher={European Language Resources Association (ELRA)},
  doi={10.63317/4dgvijh3226x}
}
2025
Interspeech
Evaluating Logit-Based GOP Scores for Mispronunciation Detection
Aditya Kamlesh Parikh, Cristian Tejedor-Garcia, Catia Cucchiarini, and Helmer Strik
Pronunciation assessment relies on goodness of pronunciation (GOP) scores, traditionally derived from softmax-based posterior probabilities. However, posterior probabilities may suffer from overconfidence and poor phoneme separation, limiting their effectiveness. This study compares logit-based GOP scores with probability-based GOP scores for mispronunciation detection. We conducted our experiment on two L2 English speech datasets spoken by Dutch and Mandarin speakers, assessing classification performance and correlation with human ratings. Logit-based methods outperform probability-based GOP in classification, but their effectiveness depends on dataset characteristics. The maximum logit GOP shows the strongest alignment with human perception, while a combination of different GOP scores balances probability and logit features. The findings suggest that hybrid GOP methods incorporating uncertainty modeling and phoneme-specific weighting improve pronunciation assessment.
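As a rough illustration of the contrast described in this abstract, the sketch below computes a probability-based GOP (mean log posterior of the target phoneme over a segment's frames) and a maximum-logit GOP (largest raw target logit in the segment). The exact definitions in the paper may differ; the function names, the toy logits, and the three-phoneme inventory are assumptions.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one frame's phoneme logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def gop_probability(frame_logits, target):
    """Probability-based GOP: mean log posterior of the target phoneme."""
    log_posteriors = [math.log(softmax(frame)[target]) for frame in frame_logits]
    return sum(log_posteriors) / len(log_posteriors)

def gop_max_logit(frame_logits, target):
    """Logit-based GOP: highest raw target logit within the segment."""
    return max(frame[target] for frame in frame_logits)

# Toy segment: 3 frames, 3 phoneme classes (made-up values).
segment = [
    [4.0, 1.0, 0.5],
    [3.5, 2.0, 0.0],
    [5.0, 0.5, 0.5],
]

for phoneme_index in range(3):
    print(
        phoneme_index,
        round(gop_probability(segment, phoneme_index), 3),
        gop_max_logit(segment, phoneme_index),
    )
```

Because the softmax normalizes each frame, the probability-based score depends on how the competing phonemes score, whereas the raw logit does not. That difference is one reason the two families of scores can disagree.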
@inproceedings{parikh25b_interspeech,
  title={{Evaluating Logit-Based GOP Scores for Mispronunciation Detection}},
  author={Parikh, Aditya Kamlesh and Tejedor-Garcia, Cristian and Cucchiarini, Catia and Strik, Helmer},
  year={2025},
  booktitle={{Interspeech 2025}},
  pages={2405--2409},
  doi={10.21437/Interspeech.2025-1012},
  issn={2958-1796}
}
Interspeech
Enhancing GOP in CTC-Based Mispronunciation Detection with Phonological Knowledge
Aditya Kamlesh Parikh, Cristian Tejedor-Garcia, Catia Cucchiarini, and Helmer Strik
Computer-Assisted Pronunciation Training (CAPT) systems employ automatic measures of pronunciation quality, such as the goodness of pronunciation (GOP) metric. GOP relies on forced alignments, which are prone to labeling and segmentation errors due to acoustic variability. While alignment-free methods address these challenges, they are computationally expensive and scale poorly with phoneme sequence length and inventory size. To enhance efficiency, we introduce a substitution-aware alignment-free GOP that restricts phoneme substitutions based on phoneme clusters and common learner errors. We evaluated our GOP on two L2 English speech datasets, one with child speech, My Pronunciation Coach (MPC), and SpeechOcean762, which includes child and adult speech. We compared RPS (restricted phoneme substitutions) and UPS (unrestricted phoneme substitutions) setups within alignment-free methods, which outperformed the baseline. We discuss our results and outline avenues for future research.
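The efficiency argument behind restricted phoneme substitutions can be illustrated with a small sketch that only counts candidate substitutions per canonical phoneme. It does not implement the CTC-based GOP itself. The inventory, the clusters, and the learner-error table below are hypothetical placeholders, not the ones used in the paper.

```python
# Hypothetical toy inventory, clusters, and learner-error table.
INVENTORY = ["IY", "IH", "EH", "AE", "TH", "S", "T", "DH", "Z", "D"]
CLUSTERS = [
    {"IY", "IH", "EH", "AE"},  # vowels
    {"TH", "S", "T"},          # voiceless consonants
    {"DH", "Z", "D"},          # voiced consonants
]
LEARNER_ERRORS = {"TH": {"D"}, "DH": {"T"}}

def unrestricted_substitutions(phoneme):
    """UPS: any other phoneme in the inventory may replace the target."""
    return set(INVENTORY) - {phoneme}

def restricted_substitutions(phoneme):
    """RPS: only same-cluster phonemes and listed learner errors."""
    candidates = set()
    for cluster in CLUSTERS:
        if phoneme in cluster:
            candidates |= cluster
    candidates |= LEARNER_ERRORS.get(phoneme, set())
    candidates.discard(phoneme)
    return candidates

canonical = ["DH", "IH", "S"]
ups_total = sum(len(unrestricted_substitutions(p)) for p in canonical)
rps_total = sum(len(restricted_substitutions(p)) for p in canonical)
print(f"UPS candidates: {ups_total}, RPS candidates: {rps_total}")
```

With a realistic inventory of several dozen phonemes, the gap between the two counts grows with both inventory size and sequence length, which is the scaling issue the restriction is meant to address.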
@inproceedings{parikh25_interspeech,
  title={{Enhancing GOP in CTC-Based Mispronunciation Detection with Phonological Knowledge}},
  author={Parikh, Aditya Kamlesh and Tejedor-Garcia, Cristian and Cucchiarini, Catia and Strik, Helmer},
  year={2025},
  booktitle={{Interspeech 2025}},
  pages={5068--5072},
  doi={10.21437/Interspeech.2025-829},
  issn={2958-1796}
}
SLaTE
Zero-Shot Speech LLMs for Multi-Aspect Evaluation of L2 Speech: Challenges and Opportunities
Aditya Kamlesh Parikh, Cristian Tejedor Garcia, Catia Cucchiarini, and Helmer Strik
In 10th Workshop on Speech and Language Technology in Education (SLaTE), 2025
An accurate assessment of L2 English pronunciation is crucial for language learning, as it provides personalized feedback and ensures a fair evaluation of individual progress. However, automated scoring remains challenging due to the complexity of sentence-level fluency, prosody, and completeness. This paper evaluates the zero-shot performance of Qwen2-Audio-7B-Instruct, an instruction-tuned speech-LLM, on 5,000 Speechocean762 utterances. The model generates rubric-aligned scores for accuracy, fluency, prosody, and completeness, showing strong agreement with human ratings within a ±2 tolerance, especially for high-quality speech. However, it tends to overpredict scores for low-quality speech and lacks precision in error detection. These findings demonstrate the strong potential of speech LLMs in scalable pronunciation assessment and suggest future improvements through enhanced prompting, calibration, and phonetic integration to advance Computer-Assisted Pronunciation Training.
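Agreement within a tolerance, as mentioned in this abstract, can be computed in a few lines. The sketch below is a generic illustration with made-up scores: the function names are assumptions, and the mean signed error is added here only to show how overprediction on low-rated speech could be made visible.

```python
def agreement_within(predicted, human, tolerance=2):
    """Fraction of utterances whose predicted score is within the tolerance."""
    if len(predicted) != len(human):
        raise ValueError("score lists must have the same length")
    hits = sum(abs(p - h) <= tolerance for p, h in zip(predicted, human))
    return hits / len(predicted)

def mean_signed_error(predicted, human):
    """Positive values mean the model scores higher than the human raters."""
    return sum(p - h for p, h in zip(predicted, human)) / len(predicted)

# Made-up scores on a 0-10 scale (not data from the paper).
human_scores = [9, 8, 3, 10, 2]
model_scores = [8, 9, 6, 10, 5]

print(agreement_within(model_scores, human_scores, tolerance=2))

# Restrict to utterances the human raters scored low (threshold is arbitrary).
low_pairs = [(p, h) for p, h in zip(model_scores, human_scores) if h <= 3]
low_bias = mean_signed_error([p for p, _ in low_pairs], [h for _, h in low_pairs])
print(low_bias)
```

A tolerance-based metric is lenient by design, so reporting it alongside a signed error per score band helps separate overall agreement from systematic over- or under-scoring.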
@inproceedings{parikh25_slate,
  title={{Zero-Shot Speech LLMs for Multi-Aspect Evaluation of L2 Speech: Challenges and Opportunities}},
  author={Parikh, Aditya Kamlesh and {Tejedor Garcia}, Cristian and Cucchiarini, Catia and Strik, Helmer},
  year={2025},
  booktitle={{10th Workshop on Speech and Language Technology in Education (SLaTE)}},
  pages={11--15},
  doi={10.21437/SLaTE.2025-3},
  issn={2311-4975}
}