This is something I deal with every day, and it is often underestimated. Our simulations generate large, multidimensional datasets. Each of them is linked to specific physical parameters, computational settings, and convergence criteria. The first challenge is simply keeping track of what has already been calculated.
Without proper infrastructure, duplicate calculations, missing metadata, and results that cannot be reproduced after six months quickly become problems. That is why we created OSCARpes, a structured database designed to index, deduplicate, and provide one-step photoemission results with full provenance. It may sound more like engineering than physics, but without this layer, science cannot scale.
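To illustrate the kind of bookkeeping this involves, here is a minimal sketch of how simulation runs can be indexed and deduplicated by hashing their canonical parameter sets, so a repeated calculation is caught before any compute time is spent. This is not the actual OSCARpes implementation; the table layout, field names, and the use of SQLite here are purely illustrative assumptions.

    # Minimal sketch (not the actual OSCARpes code): each calculation is keyed
    # by a hash of its canonicalized input parameters, and basic provenance
    # (parameters, code version, timestamp, result location) is stored with it.
    import hashlib
    import json
    import sqlite3
    from datetime import datetime, timezone

    def parameter_key(params: dict) -> str:
        """Hash of the canonical (sorted, compact) parameter set."""
        canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def open_index(path: str = "runs.sqlite") -> sqlite3.Connection:
        con = sqlite3.connect(path)
        con.execute(
            """CREATE TABLE IF NOT EXISTS runs (
                   key TEXT PRIMARY KEY,       -- hash of the input parameters
                   params TEXT NOT NULL,       -- full parameter set (provenance)
                   code_version TEXT NOT NULL, -- which code/version produced it
                   created_utc TEXT NOT NULL,  -- when the run was registered
                   result_path TEXT            -- where the spectrum is stored
               )"""
        )
        return con

    def register_run(con, params: dict, code_version: str, result_path: str) -> bool:
        """Insert a run; return False if an identical parameter set is already indexed."""
        key = parameter_key(params)
        try:
            con.execute(
                "INSERT INTO runs VALUES (?, ?, ?, ?, ?)",
                (key, json.dumps(params, sort_keys=True), code_version,
                 datetime.now(timezone.utc).isoformat(), result_path),
            )
            con.commit()
            return True
        except sqlite3.IntegrityError:
            return False  # duplicate calculation detected

    if __name__ == "__main__":
        con = open_index(":memory:")
        # Hypothetical parameter names for a one-step photoemission run.
        params = {"material": "example", "photon_energy_eV": 21.2,
                  "polarization": "p", "k_grid": [64, 64]}
        print(register_run(con, params, "one-step-code-vX", "spectra/run001.h5"))  # True
        print(register_run(con, params, "one-step-code-vX", "spectra/run002.h5"))  # False

The point of such a layer is simply that identical parameter sets map to identical keys, so duplicates are rejected at registration time and every stored result keeps a machine-readable record of how it was produced.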
Long-term storage is another challenge. Academic groups often rely on local servers or institutional clusters, where long-term availability is not always guaranteed. When a PhD student leaves, important data can effectively disappear.
The field is gradually moving towards the FAIR principles (making data findable, accessible, interoperable, and reusable), but in practice adoption remains slow, partly because FAIR data require additional work upfront and do not always lead directly to publications.
Sharing is perhaps the most cultural of these challenges. Condensed matter physics still largely operates in a mode where data are shared only when an article is published, if at all. But if we want AI-based approaches to truly work, we need large, well-curated, and accessible datasets. Simulations from one group can save another group months of computation, but only if the data are structured and reusable.
These issues may not sound particularly attractive, but without a high-quality data infrastructure, such approaches cannot be developed in the long term. Physics and algorithms are advancing rapidly, and the way we work with data must keep pace.
Do you see differences in approaches to research data and Open Science between countries, for example, between the Czech Republic, Europe more broadly, and Tunisia?
Yes, the differences are quite visible. At the European level, EOSC is becoming a key environment for publishing, finding, and reusing research data across countries and disciplines. For our field, projects such as PaNOSC are also important, as they bring major synchrotron and neutron facilities into this ecosystem.
In materials science, Germany is very advanced. FAIRmat, one of the NFDI consortia, represents the condensed matter physics and chemical physics communities and builds on NOMAD, one of the largest data infrastructures for computational materials science. In my research environment, this is exactly the type of infrastructure that makes FAIR data practically usable.
In the Czech Republic, things are beginning to move significantly. EOSC CZ is helping to build a national infrastructure for FAIR data, and within Open Science II, a specialized repository called DANTEc is being developed for materials science and technology. Discipline-specific solutions like this can greatly help researchers put FAIR principles into practice.
In Tunisia, connections to global research infrastructure are gradually being strengthened, for example, through the adoption of persistent identifiers and cooperation with DataCite. The country has a strong expert community, and linking it with European platforms such as NOMAD or EOSC could significantly accelerate further development.
Even where infrastructure already exists, however, adoption is still slow. Established workflows do not change overnight. Platforms are being built, but the real change is cultural — and that takes time in every country.
You are a member of the EOSC CZ working group focused on metadata and physical sciences. How important are high-quality metadata for data reusability and interdisciplinary collaboration?