COFUND PhD position – Multi-disciplinary knowledge management of Coastal Area publications
ABG-128023 | Thesis topic
20/01/2025 | European Union funding
- Data science (storage, security, measurement, analysis)
- Computer science
- Digital
Topic description
Title of the thesis project: Multi-disciplinary knowledge management of Coastal Area publications
Scientific context
Coastal areas are dynamic interfaces between natural systems and human societies, concentrating critical environmental and economic challenges. These regions face multiple pressures, including rising human populations, biodiversity collapse, sea level rise, extreme climate events, and shifting socio-economic and urban contexts (Barbier, 2015; Merkens et al., 2016; Neumann et al., 2015; Newton et al., 2020; Nichols et al., 2021; Spalding et al., 2014). Understanding coastal dynamics is essential for predicting the impacts of ongoing environmental upheavals on ecosystems and communities and for devising effective adaptation and remediation strategies.
The interdisciplinary nature of coastal systems presents a significant challenge in achieving a comprehensive understanding of their functioning (Cloern et al., 2016; Glavovic et al., 2015). This requires synthesizing and integrating knowledge from diverse fields and bridging conceptual gaps across disciplines. However, key concepts within one discipline often remain disconnected or unaligned with those from others. Consequently, current interdisciplinary efforts either focus narrowly on specific areas and interactions or result in an overly simplistic understanding of the broader system.
At the same time, the scientific literature on coastal areas is extensive, comprising nearly 70,000 publications. This project is part of a broader effort to structure and synthesize this vast body of knowledge by leveraging natural language processing (NLP), deep learning (Lane et al., 2019), and lexical statistical methods (Miner et al., 2012; Mendes et al., 2019) to analyze the entire corpus of publication abstracts. Building on existing work, a PhD student co-supervised by the project PIs has developed a neural network algorithm to identify coastal entities and their interactions at the document level (Delaunay et al., 2024). This algorithm adapts the ARDI method (Actors, Resources, Dynamics, and Interactions; Michel, 2011), originally used for participatory interdisciplinary modeling, to extract and connect relevant entities within individual abstracts using deep learning techniques.
Scientific objectives
The next challenge is to scale this understanding from the level of abstracts to a corpus-wide perspective, constructing a comprehensive knowledge map. Achieving this goal requires addressing disciplinary silos, where concepts may be expressed differently across fields or entirely absent in certain domains. For example, the concept of coastal area representation (how people perceive the area as beautiful, safe, or interesting) is central to the humanities and influences societal behavior, but is rarely linked to ecological or geomorphological studies.
The project will address this challenge by hiring a PhD student to develop a novel algorithm that synthesizes document-level information into a global knowledge map. This algorithm will start from the functioning information detected at the level of abstracts, position it into a common framework, and link it to the other pieces of information.
The resulting knowledge map will reveal both well-established and underexplored connections between entities, as well as the interdisciplinary relationships underlying them.
It is expected to highlight knowledge gaps and uncover previously unrecognized patterns or regularities. Crucially, this will have a dual impact. In the domain of coastal systems, it will yield a more advanced and better integrated understanding. On a much broader scale, it will improve our understanding of how to link concepts across widely diverse scientific disciplines that have previously lacked alignment.
Methodology
The project will proceed through the following steps:
- Ontology construction: A coastal systems ontology will be created from the corpus of abstracts (Cimiano, 2016; Asim et al., 2018).
- Ontology refinement: Disciplinary biases will be addressed by uniquely identifying each entity and linking it to external knowledge bases.
- Interaction classification: Interactions between entities will be categorized into a streamlined functional interaction ontology to reduce complexity.
- Relational graph development: A graph-based approach (e.g., relational graph techniques) will be used to model the connections between entities and across disciplines.
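As a rough illustration of how the steps above could fit together, the sketch below merges document-level triples into a single corpus-wide graph and flags the connections supported by more than one discipline. It is a minimal sketch under stated assumptions, not the project's algorithm: the triple format, the function names `build_knowledge_map` and `cross_disciplinary_edges`, and all example entities are hypothetical, and entities are assumed to have already been normalized to unique identifiers.

```python
from collections import defaultdict

# Hypothetical document-level output, one list of triples per abstract:
# (source entity, interaction, target entity, discipline of the abstract).
# All names and values are illustrative, not taken from the project corpus.
doc_triples = {
    "abstract_1": [("tourism", "depends_on", "beach", "social sciences")],
    "abstract_2": [("storm surge", "erodes", "beach", "geomorphology")],
    "abstract_3": [("tourism", "depends_on", "beach", "economics")],
}


def build_knowledge_map(doc_triples):
    """Merge per-abstract triples into one graph keyed by (source, interaction, target)."""
    edges = defaultdict(lambda: {"count": 0, "disciplines": set(), "documents": set()})
    for doc_id, triples in doc_triples.items():
        for source, interaction, target, discipline in triples:
            edge = edges[(source, interaction, target)]
            edge["count"] += 1                   # number of occurrences of this triple
            edge["disciplines"].add(discipline)  # disciplines that report it
            edge["documents"].add(doc_id)        # provenance for later validation
    return dict(edges)


def cross_disciplinary_edges(knowledge_map):
    """Return the edges supported by abstracts from more than one discipline."""
    return {key: edge for key, edge in knowledge_map.items() if len(edge["disciplines"]) > 1}


knowledge_map = build_knowledge_map(doc_triples)
bridges = cross_disciplinary_edges(knowledge_map)
```

Keeping the set of source documents on each edge is a deliberate choice here: it lets a disciplinary expert trace any connection in the map back to the abstracts that produced it.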
Challenges
Two main challenges are anticipated in this project: the scale of the corpus (see for instance Wang et al., 2020) and its interdisciplinary nature (Augenstein et al., 2017; Beltagy et al., 2019; Gábor et al., 2018).
- Scale of the corpus: With approximately 70,000 abstracts, the corpus to analyze is huge and composed of short texts. Natural Language Processing (NLP) on a large corpus of short texts presents unique challenges, particularly in validating the proposed solutions (Le & Mikolov, 2014). The diversity and volume of data complicate comprehensive model evaluation, making it difficult to detect systematic biases or errors. Moreover, short texts often provide limited context, which can significantly impact the accuracy of semantic or thematic analyses. The inherent variability in a large number of brief texts can obscure subtle trends or important relationships, requiring sophisticated statistical and sampling approaches for reliable validation. Finally, balancing generalization and specificity becomes crucial, as models must be robust enough to handle corpus diversity while capturing the nuanced characteristics of each individual short text.
To tackle the challenges of processing large corpora of short texts in NLP, a comprehensive and multi-pronged approach is proposed. First, the algorithm will be tested on smaller sub-corpora and specific targeted questions, adhering to best practices in model evaluation (Devlin et al., 2018). Input from a panel of disciplinary experts will play a crucial role in validating results and refining methods, in line with the collaborative approach employed by Wang et al. (2020) for the CORD-19 data set.
The methodology also includes implementing stratified cross-validation to ensure representative test subsets (Sechidis et al., 2011), conducting progressive evaluations to pinpoint performance degradation, and performing qualitative error analysis to understand limitations (Ribeiro et al., 2020). Benchmarking against established methods using standard datasets (Agirre et al., 2015) and conducting robustness tests will further evaluate the algorithm’s reliability.
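To make the idea of representative test subsets concrete, the following sketch assigns abstracts to folds so that each discipline is spread as evenly as possible. It is a simplified illustration that assumes a single discipline label per abstract; the function name `stratified_folds` and the example labels are hypothetical, and an abstract carrying several labels would need a more elaborate scheme.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds, seed=0):
    """Assign each item index to a fold so that every label is spread evenly across folds."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    folds = [[] for _ in range(n_folds)]
    offset = 0  # continue the round-robin across labels to keep fold sizes balanced
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)  # randomize which items of a label land in which fold
        for position, index in enumerate(indices):
            folds[(offset + position) % n_folds].append(index)
        offset += len(indices)
    return folds


# Hypothetical discipline labels for ten abstracts.
labels = ["ecology"] * 6 + ["sociology"] * 3 + ["economics"] * 1
folds = stratified_folds(labels, n_folds=3)
```

Each fold can then serve in turn as the held-out test set, so that a rare discipline is not concentrated in a single evaluation split.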
To enhance effectiveness, domain-specific evaluation metrics will be developed (Reimers & Gurevych, 2019), and the algorithm will undergo iterative refinement based on expert feedback. Interactive visualizations, adhering to best practices in NLP data visualization (Liu et al., 2019), will be designed to facilitate the interpretation of results. Lastly, long-term evaluation will ensure the algorithm’s stability and consistency over time, particularly as the corpus grows.
This structured and adaptive framework aims to address the complexities inherent in NLP applications on large and heterogeneous datasets effectively.
- Interdisciplinary complexity: Natural Language Processing on interdisciplinary corpora presents significant challenges due to the diverse nature of the content. The variability in terminology, writing styles, and conceptual frameworks across disciplines can lead to ambiguity and misinterpretation by NLP models (Augenstein et al., 2017). Domain-specific jargon and technical language often have different meanings in various fields, making it difficult for general-purpose NLP tools to accurately process the text (Beltagy et al., 2019). Additionally, the interconnectedness of concepts across disciplines can create complex semantic relationships that are challenging to capture and represent computationally (Gábor et al., 2018). The lack of standardized vocabularies and ontologies across different fields further complicates the task of entity recognition and relation extraction (Wang et al., 2020). Moreover, the varying levels of abstraction and specificity in different disciplines can affect the performance of text classification and topic modeling algorithms (Jurgens et al., 2018). These challenges necessitate the development of more robust and adaptable NLP techniques that can effectively handle the heterogeneity and complexity of interdisciplinary corpora.
To address the challenges of NLP on interdisciplinary corpora, a multi-faceted approach is proposed. Ontology construction and refinement with expert input will be crucial in mitigating terminological and conceptual ambiguities (Pesquita et al., 2014). The inherent redundancies in large corpora are expected to resolve some ambiguities automatically, while statistical analyses may reveal new insights into interdisciplinary connections (Yan et al., 2012).
Additionally, employing domain adaptation techniques can help models better handle discipline-specific nuances (Beltagy et al., 2019). Implementing a hierarchical classification approach can capture both broad interdisciplinary themes and specific domain concepts (Silla & Freitas, 2011). Utilizing transfer learning methods can leverage knowledge from well-resourced domains to improve performance in less-studied interdisciplinary areas (Ruder et al., 2019).
Furthermore, incorporating active learning strategies can efficiently utilize expert knowledge to refine models iteratively (Settles, 2009). Developing interdisciplinary benchmarks and evaluation metrics will be essential for assessing model performance across diverse domains (Wang et al., 2020). Finally, employing interpretable AI techniques can provide insights into model decision-making, facilitating better alignment with expert knowledge and interdisciplinary understanding (Ribeiro et al., 2016).
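One common form of active learning is uncertainty sampling, in which the items the model is least sure about are sent to experts first. The sketch below illustrates that idea only; it is not a description of the strategy the project will adopt. The function names, the annotation budget, and the predicted probabilities are all hypothetical.

```python
import math


def entropy(probabilities):
    """Shannon entropy (in nats) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)


def select_for_annotation(predictions, budget):
    """Return the ids of the `budget` items whose predictions are most uncertain."""
    ranked = sorted(
        predictions,
        key=lambda item_id: entropy(predictions[item_id]),
        reverse=True,  # highest entropy (most uncertain) first
    )
    return ranked[:budget]


# Hypothetical class probabilities predicted for four abstracts.
predictions = {
    "abstract_a": [0.98, 0.01, 0.01],
    "abstract_b": [0.34, 0.33, 0.33],
    "abstract_c": [0.70, 0.20, 0.10],
    "abstract_d": [0.50, 0.50, 0.00],
}
queue = select_for_annotation(predictions, budget=2)
```

With a fixed expert budget, ranking by entropy concentrates annotation effort on the abstracts where a label is most likely to change the model.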
Expected Outcomes
The project aims to produce an interdisciplinary knowledge map of coastal systems that bridges disciplinary divides, connects entities across domains, and reveals overlooked relationships. This map will advance understanding of the interactions between ecological, social, and technical systems, identify blind spots, and generate new hypotheses for research.
The identification of knowledge gaps may guide future research, fostering interdisciplinary collaborations to address underexplored interactions or blind spots. The map could also enhance the accessibility of scientific knowledge by synthesizing complex interdisciplinary data into actionable insights. For example, connecting societal concepts like tourism attractiveness to ecological data could inform strategies for sustainable coastal development. Similarly, revealing overlooked connections between disciplines might inspire novel solutions to pressing coastal issues, such as biodiversity loss or climate adaptation.
The project’s methods and results are expected to set a precedent for applying NLP and deep learning to complex social-ecological systems, broadening its impact beyond coastal studies. Together, these contributions position the project to both advance scientific understanding and deliver tangible benefits to stakeholders managing vulnerable coastal environments.
Finally, the resulting knowledge map will serve as a decision-support tool for coastal management, helping policymakers identify critical areas of intervention and prioritize resources.
Start date:
Type of funding
Funding details
Presentation of host institution and host laboratory
Since its creation in 1993, La Rochelle University has been on a path of differentiation.
Thirty years later, as the university landscape recomposes itself, it continues to assert an original proposition, based on a strong identity and bold projects, in a human-scale establishment located in an exceptional setting.
Anchored in a region with highly distinctive coastal features, La Rochelle University has turned this singularity into a veritable signature, in the service of a new model. Its research addresses the societal challenges related to Smart Urban Coastal Sustainability (SmUCS).
The new recruit will join the Littoral, Environment and Society Laboratory (LIENSs) and the Laboratory of Informatics Image Interaction (L3I).
Cotutelle: University of Helsinki (UH), Finland. Learning Language Lab.
Institution awarding the doctorate
Candidate profile
Research Field
Computer science
Education Level
Master Degree or equivalent