Applied Sciences (Switzerland), vol. 15, no. 20, 2025 (SCI-Expanded)
Large language models (LLMs) are playing an increasingly important role in data science applications. In this study, the performance of LLMs in generating code and designing solutions for data science tasks is systematically evaluated on a set of real-world tasks from the Kaggle platform. Models from different LLM families were tested both under default settings and under configurations with hyperparameter tuning (HPT) applied. In addition, the effects of few-shot prompting (FSP) and Tree of Thought (ToT) strategies on code generation were compared. Alongside technical metrics such as accuracy, F1 score, Root Mean Squared Error (RMSE), execution time, and peak memory consumption, LLM outputs were also evaluated against Kaggle user-submitted solutions, leaderboard scores, and two established AutoML frameworks (auto-sklearn and AutoGluon). The findings suggest that, with effective prompting strategies and HPT, the models can deliver competitive results on certain tasks. The ability of some LLMs to suggest appropriate algorithms shows that they can be viewed not only as code generators but also as systems capable of designing machine learning (ML) solutions. This study presents a comprehensive analysis of how strategic decisions such as prompting method, tuning approach, and algorithm selection affect the design of LLM-based data science systems, offering insights for future hybrid human–LLM systems.
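To illustrate the kind of evaluation described above, the following is a minimal sketch of how a harness might record execution time, peak memory consumption, and task metrics (accuracy, F1, RMSE) for a generated solution. The function and variable names (run_with_metrics, solution_fn) are illustrative assumptions, not the paper's actual evaluation code.

```python
import time
import tracemalloc

import numpy as np
from sklearn.metrics import accuracy_score, f1_score, mean_squared_error


def run_with_metrics(solution_fn, *args, **kwargs):
    """Run a candidate solution and record wall-clock time and peak memory.

    Returns the solution's result, elapsed seconds, and peak memory in MB.
    """
    tracemalloc.start()
    start = time.perf_counter()
    result = solution_fn(*args, **kwargs)          # e.g., train model and predict
    elapsed = time.perf_counter() - start
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, elapsed, peak_bytes / 1e6


def score_classification(y_true, y_pred):
    """Classification metrics used in the comparison (accuracy and F1)."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred, average="weighted"),
    }


def score_regression(y_true, y_pred):
    """Regression metric used in the comparison (RMSE)."""
    return {"rmse": float(np.sqrt(mean_squared_error(y_true, y_pred)))}
```

Such a wrapper could be applied uniformly to LLM-generated scripts and to AutoML baselines (e.g., auto-sklearn or AutoGluon pipelines) so that runtime and memory figures are collected under the same conditions as the accuracy-based metrics.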