Department of Computer Science, Caleb University, Lagos, Nigeria.
International Journal of Science and Research Archive, 2025, 17(02), 005–013
Article DOI: 10.30574/ijsra.2025.17.2.2948
Received on 23 September 2025; revised on 28 October 2025; accepted on 31 October 2025
Traditional aerial scene classification models rely heavily on large, labeled datasets and supervised learning, which limits their ability to generalize to new or rare scene types. In this work, we explore a zero-shot approach to aerial scene understanding by leveraging Contrastive Language-Image Pretraining (CLIP), a vision-language model trained on a vast collection of image-text pairs. Instead of retraining or fine-tuning the model, we use carefully designed natural language prompts to describe scene categories of interest and classify aerial images based on cosine similarity in a shared semantic embedding space. This method enables flexible and scalable scene classification without requiring additional annotation or retraining. Through prompt engineering, we introduce both generic and domain-specific textual descriptions to maximize classification accuracy. Experiments conducted on benchmark aerial datasets demonstrate that the proposed approach effectively distinguishes between complex and visually similar scenes, even in scenarios with limited or no prior class examples. This work highlights the potential of vision-language models for rapid, adaptable, and annotation-free classification in aerial surveillance applications.
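The core mechanism described above, encoding category prompts and an image into a shared space, then classifying by cosine similarity, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the prompts are hypothetical examples of the prompt-engineering step, and random vectors stand in for the embeddings that CLIP's image and text encoders would actually produce.

```python
import numpy as np

def cosine_classify(image_emb, text_embs):
    """Return the index of the prompt whose embedding is most
    cosine-similar to the image embedding."""
    # L2-normalize so that dot products equal cosine similarities
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = txt @ img
    return int(np.argmax(sims)), sims

# Hypothetical natural-language prompts for aerial scene categories
prompts = [
    "a satellite photo of a dense residential area",
    "an aerial view of farmland with crop fields",
    "an overhead image of an airport runway",
]

# Mock embeddings standing in for CLIP's encoders; the "image" is
# constructed to lie close to the second prompt's embedding.
rng = np.random.default_rng(0)
text_embs = rng.normal(size=(3, 512))
image_emb = text_embs[1] + 0.1 * rng.normal(size=512)

idx, sims = cosine_classify(image_emb, text_embs)
print(prompts[idx])  # the farmland prompt wins
```

In a real pipeline the two `rng.normal` calls would be replaced by forward passes through CLIP's image and text encoders; the similarity-and-argmax step shown here is unchanged.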
Zero-shot learning; Vision-language models; Remote sensing; Semantic embedding; Large language models
Chukwudi Anthony Udemba, Adekunle Adeoye Eludire and Ayorinde Peters Oduroye. Zero-shot aerial scene classification using CLIP and prompt engineering. International Journal of Science and Research Archive, 2025, 17(02), 005–013. Article DOI: https://doi.org/10.30574/ijsra.2025.17.2.2948.
Copyright © 2025. The author(s) retain the copyright of this article, which is published under the terms of the Creative Commons Attribution License 4.0.







