THE EVOLUTION OF LARGE MODEL INFERENCE ARCHITECTURES: FROM CENTRALIZED CLOUDS TO DECENTRALIZED ON-DEVICE INTELLIGENCE

ZeYi Luo

doi:10.61784/jcsee3146

Authors

ZeYi Luo (Corresponding Author) Yuan’an No. 1 Senior High School, Yichang 444200, Hubei, China.

Keywords:

Large Language Models (LLMs), AI inference, Cloud computing, Edge computing, On-device AI, Distributed systems, Model optimization, Multi-Access Edge Computing (MEC)

Abstract

The proliferation of large-scale AI models, particularly Large Language Models (LLMs), has made inference a critical and resource-intensive workload. This survey provides a comprehensive review of the historical evolution of inference architectures, charting a distinct trajectory from centralized, cloud-native paradigms to fully decentralized, on-device intelligence. We systematically analyze four key architectural epochs: (1) Device-Cloud, (2) Device-Edge-Cloud, (3) Device-Edge, and (4) pure On-Device inference. For each paradigm, we conduct an in-depth examination of its dominant systems, key enabling technologies, and the inherent advantages and limitations that catalyzed the transition to the subsequent stage. Our analysis reveals that this evolution is driven by a persistent set of trade-offs between computational power, latency, data privacy, cost, and energy efficiency. This paper concludes that the future of AI inference lies not in a single monolithic architecture but in a heterogeneous "compute continuum," where workloads are dynamically orchestrated across a spectrum of resources to meet diverse application demands.
