DeepPast
(Last update: 14 June 2025)
DeepPast is one of BDSL’s flagship artificial intelligence (AI) research projects, designed to enable dynamic exploration of vast and complex digital archives using transformer-based large language models (LLMs) and vision-language models (VLMs). At its core is BDSL’s modular AI architecture, which deliberately separates knowledge bases from language processing. DeepPast operates as a compound system composed of domain-specific and task-specific components tailored to the recurring demands of humanities research. It is also an inter-university collaboration among the University of Hong Kong, the Hong Kong University of Science and Technology, Hong Kong Baptist University, and the National University of Singapore.
To identify common research patterns from the ground up, DeepPast draws on five distinct case studies, each highlighting a different facet of AI-assisted historical and cultural inquiry. Each begins with an extensive and rigorous preprocessing stage that introduces structure into unstructured primary and secondary sources, which are then converted into the machine-friendly formats, such as vector embeddings and knowledge graphs, that most effectively enhance AI capabilities. For materials subject to commercial licensing restrictions, processing is limited to vectorization and graph construction, ensuring compliance while preserving access to semantic and stylistic features as well as metadata retrieval.
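As a minimal illustration of the vector-embedding side of this pipeline, retrieval over embedded documents reduces to nearest-neighbor search under cosine similarity. The sketch below uses toy three-dimensional vectors as stand-ins for real model output; the names (`nearest`, `doc_a`, and so on) are hypothetical, not part of the DeepPast codebase:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def nearest(query, index):
    """Return the document id whose embedding is closest to the query."""
    return max(index, key=lambda doc_id: cosine_similarity(query, index[doc_id]))

# Toy embeddings standing in for the output of a real encoder model.
index = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.1, 0.8, 0.1],
}
print(nearest([0.85, 0.2, 0.0], index))  # -> doc_a
```

In production such indexes live in a vector database rather than a Python dictionary, but the retrieval logic is the same.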
Case Study I: Confucian Networks in Medieval Korea
(Entity Recognition in Machine-Readable Documents)
Javier Cha (The University of Hong Kong)
| Challenges | Confucian schools in medieval Korea were generally not formally defined, making it difficult to trace intellectual networks, patron–client ties, and master–disciple relationships. Records are scattered, and much of the knowledge was passed down informally, which makes reconstruction through traditional methods both labor-intensive and incomplete. |
| Proposed Solutions | This case study applies LLM-assisted named entity recognition (NER) to extract, organize, and visualize Confucian networks. By using graph databases and Python libraries for network analysis, it helps historians measure network cohesion, follow intellectual lineages, and link individuals, institutions, and literary movements with greater ease and reliability. |
| License(s) | ~98% released under the Korean Open Government License |
| Sources | 1.2 billion characters of literary Sinitic text (3.6 GB); ~10% is directly relevant, while the remainder supports reinforcement learning and fine-tuning. |
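The network measures described above can be sketched in a few lines of plain Python. The master–disciple ties below are illustrative toy data, not output of the actual NER pipeline:

```python
from collections import defaultdict

# Toy master–disciple ties, as might be extracted via LLM-assisted NER.
edges = [
    ("Yi Saek", "Chŏng Mong-ju"),
    ("Yi Saek", "Chŏng To-jŏn"),
    ("Chŏng Mong-ju", "Kil Chae"),
]

# Build an undirected adjacency structure.
graph = defaultdict(set)
for master, disciple in edges:
    graph[master].add(disciple)
    graph[disciple].add(master)

n = len(graph)
# Density: share of possible ties actually realized, a simple cohesion measure.
density = 2 * len(edges) / (n * (n - 1))
hub = max(graph, key=lambda person: len(graph[person]))
print(f"{n} figures, density {density:.2f}, most connected: {hub}")
```

In practice this data would live in a graph database and be analyzed with a dedicated network library, but the cohesion and centrality calculations reduce to operations like these.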


Case Study II: AI-Enhanced Reading of Manchu
(Text Recognition & Translation of a Low-Resource Language)
Yan Hon Michael Chung (Hong Kong University of Science and Technology as of July 1, 2025)
| Challenges | Historical Manchu texts remain largely undigitized, and their transcription and translation still depend on manual effort. These obstacles make it especially difficult to trace names, places, and events across documents. |
| Proposed Solutions | This case study introduces an integrated workflow that combines optical text recognition, machine translation, and data ontologies to make Manchu texts more accessible. Fine-tuned LLMs and VLMs are used to extract accurate Unicode text from scanned documents, followed by text-to-text translation from Manchu into Chinese and English, with careful attention given to historical nuance. The resulting texts are then structured into vector databases and knowledge graphs to support efficient retrieval and analysis. |
| License(s) | CrossAsia Standard License Agreement (Berlin State Library); educational & research use permitted (Library of Congress); CC 4.0 (Harvard-Yenching) |
| Sources | ~900 Manchu-language items |
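Once recognition and translation are complete, structuring the output for retrieval can begin with something as simple as inverting the records into an entity index, the most basic building block of the knowledge graphs described above. The records and entity names below are toy placeholders, not actual project data:

```python
# Toy records standing in for OCR and machine-translation output.
records = [
    {"id": "m001", "english": "Nurhaci founded the Later Jin.",
     "entities": ["Nurhaci", "Later Jin"]},
    {"id": "m002", "english": "Hong Taiji renamed the state Qing.",
     "entities": ["Hong Taiji", "Qing"]},
]

# Invert the records into an entity -> documents index for efficient lookup.
index = {}
for rec in records:
    for entity in rec["entities"]:
        index.setdefault(entity, []).append(rec["id"])

print(index["Nurhaci"])  # -> ['m001']
```

A full knowledge graph adds typed relations between entities, but entity-to-document lookup of this kind is what makes tracing a name across ~900 items tractable.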

Case Study III: Hong Kong during the Second World War
(Image-to-Image Processing for Spatial History)
Chi Man Kwong (Hong Kong Baptist University)
| Challenges | Research on Hong Kong during the Second World War (1939–1945) is hindered by fragmented and largely undigitized records. Primary sources such as phone books and intelligence reports are dispersed across repositories in Hong Kong, the UK, Australia, the US, and Japan, much of the material in handwritten or typewritten form. Georeferencing historical maps and aerial photographs remains a labor-intensive and error-prone process. |
| Proposed Solutions | This study uses multimodal AI to accelerate data management, organization, and spatial analysis. Automated feature extraction converts handwritten and typed documents into searchable datasets. AI-assisted georeferencing improves the alignment of historical maps and auto-corrects aerial imagery. In addition, AI-powered photo recoloring enhances outreach efforts and enriches our visual understanding of urban geography. Looking ahead, the project aims to train models capable of identifying the locations of wartime photographs, using aerial imagery to reconstruct the spatial dynamics of the conflict with new levels of precision. |
| License(s) | open license |
| Sources | archives and repositories in Hong Kong, the UK, the USA, Japan, and Australia |
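Whether control points linking a scanned map to modern coordinates are identified by hand or predicted by a model, alignment then reduces to fitting a transform. Below is a minimal sketch that fits an exact affine transform through three control points; the pixel and longitude/latitude values are invented for illustration, not real georeferencing data:

```python
def affine_from_controls(src, dst):
    """Fit x' = a*x + b*y + c, y' = d*x + e*y + f from three control points."""
    (x1, y1), (x2, y2), (x3, y3) = src
    base = [[x1, y1, 1.0], [x2, y2, 1.0], [x3, y3, 1.0]]

    def det3(m):
        return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
                - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
                + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

    def solve(targets):
        # Cramer's rule: replace one column at a time with the target vector.
        d0 = det3(base)
        coeffs = []
        for j in range(3):
            m = [row[:] for row in base]
            for i in range(3):
                m[i][j] = targets[i]
            coeffs.append(det3(m) / d0)
        return coeffs

    return solve([p[0] for p in dst]), solve([p[1] for p in dst])

def apply_affine(params, point):
    (a, b, c), (d, e, f) = params
    x, y = point
    return a * x + b * y + c, d * x + e * y + f

# Pixel corners of a scanned map sheet mapped to toy longitude/latitude values.
params = affine_from_controls(
    src=[(0, 0), (1000, 0), (0, 1000)],
    dst=[(114.10, 22.30), (114.20, 22.30), (114.10, 22.20)],
)
print(apply_affine(params, (500, 500)))  # sheet centre -> roughly (114.15, 22.25)
```

Real georeferencing tools fit more flexible transforms from many (possibly noisy) control points via least squares, but the principle is the same: the hard, labor-intensive part is finding the control points, which is exactly where AI assistance helps.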
Case Study IV: Tracing Martial Arts Lineages Across Time
(Multimodal AI for Audiovisual and Sensor Data)
Yumeng Hou (National University of Singapore)
| Challenges | Because martial arts were rooted in oral traditions and often veiled in secrecy, written records about them are sparse, scattered, and incomplete. When available, martial arts manuals tend to rely heavily on illustrations rather than text. |
| Proposed Solutions | Building on prior research in ontology-based knowledge representation and multimodal archives, this study combines knowledge graphs, deep learning–driven motion analysis, and OCR and natural language processing (NLP) tools customized for Chinese historical texts. This modular, AI-supported system connects textual, visual, and embodied data. The result is a research framework that renders martial traditions more searchable, analyzable, and reconstructable—opening new possibilities for understanding and interpreting embodied practices. |
| License(s) | open access (Martial Arts Living Archive), partially under copyright (dictionaries) |
| Sources | Martial Arts Living Archive, jointly created by International Guoshu Association, City University of Hong Kong, and EPFL |
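As one small example of the motion-analysis component, joint angles can be derived from the 2D keypoints that pose-estimation models emit for each video frame. The keypoints below are toy values, not project data:

```python
import math

def joint_angle(a, b, c):
    """Angle (degrees) at joint b formed by keypoints a-b-c, e.g. shoulder-elbow-wrist."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    norm = math.hypot(*v1) * math.hypot(*v2)
    return math.degrees(math.acos(dot / norm))

# Toy 2D keypoints, as a pose-estimation model might emit per frame.
shoulder, elbow, wrist = (0.0, 0.0), (1.0, 0.0), (1.0, 1.0)
print(round(joint_angle(shoulder, elbow, wrist)))  # -> 90
```

Sequences of such angles over time form the motion signatures that can then be linked, through the ontology, to the textual and visual records of a given technique.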
Case Study V: Geocities and the First Browser War
(Born-Digital Artifacts at Scale)
Javier Cha (The University of Hong Kong)
| Challenges | The GeoCities data dump is a vast collection of HTML documents, stylesheets, scripts, images, audio, and video from one of the most popular personal site-building services of the late 1990s and early 2000s. Though rich in historical value, its scale and structural complexity make it extremely difficult to explore and analyze using established historical methods. Gaining meaningful insights into the evolution of early web design, the spread of scripting languages and dynamic webpages, and the browser wars requires extensive tagging and indexing, a task that is prohibitively time-consuming if carried out manually. |
| Proposed Solutions | This study applies algorithmic reading to computationally explore GeoCities. Using LLM-assisted NER, computer vision, and clustering algorithms, this study automates content classification and tracks design and coding trends. The result is a deeper understanding of GeoCities, one of the most iconic platforms of the Web 1.0 era, revealing the aesthetics, technologies, and user behaviors that helped shape early digital culture. |
| License(s) | per the terms of the Internet Archive and the Archive Team |
| Sources | 38 million webpages (900 GB compressed) |
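A crude version of the feature extraction behind such clustering can be built with the standard library alone: per-page tag frequencies serve as a design fingerprint, separating, say, table-based layouts from CSS-era pages. The sample pages below are invented for illustration:

```python
from collections import Counter
from html.parser import HTMLParser

class TagCounter(HTMLParser):
    """Counts tag frequencies, a crude design fingerprint of a page."""
    def __init__(self):
        super().__init__()
        self.tags = Counter()

    def handle_starttag(self, tag, attrs):
        self.tags[tag] += 1

def fingerprint(html):
    parser = TagCounter()
    parser.feed(html)
    return parser.tags

# Toy pages standing in for archived GeoCities documents.
table_era = "<html><body><table><tr><td>home</td></tr></table></body></html>"
css_era = "<html><head><style>p{}</style></head><body><p>home</p></body></html>"

print(fingerprint(table_era)["table"])   # table-based layout signal
print("style" in fingerprint(css_era))   # CSS adoption signal
```

Fingerprints like these, combined with embeddings from computer-vision models, are the kind of feature vectors that clustering algorithms group to track design and coding trends across millions of pages.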
Local and Power-Efficient AI

BDSL was early to recognize the risks of over-reliance on OpenAI’s closed-source models, which remain officially unavailable in Hong Kong outside of Microsoft’s Azure cloud subscription services. The use of energy-efficient, locally available hardware and a modular system design keeps overhead manageable and system maintenance sustainable.
To avoid dependence on any single ecosystem, and in recognition of geopolitical complexities, BDSL uses NVIDIA hardware and the CUDA library for convenience but prioritizes developing relatively small, task-specific models that remain hardware- and library-neutral. For the time being, our AI systems are deployed on Apple hardware using MLX because of its low power draw, relative affordability, and ready availability in Hong Kong. We continue to monitor other energy-efficient and sustainable solutions currently under development.