Wikidata reaches milestone in mission to open vast data set to AI developers

"To vector embed a large, massively multilingual, multicultural, and dynamic dataset is a hard challenge."

Jasper Hamill

05 Dec 2024 — 3 min read

Wikidata, the structured data backbone of Wikipedia, has made a huge step forward in its mission to open up all its information to developers working on AI and machine learning.

A team from Wikimedia Deutschland, developers of Wikidata, used the DataStax AI platform to ingest, process, and vector embed more than 10 million entries in under three days.

This procedure cleans up the information and turns the processed data into numerical vectors using embedding techniques. It is a common process in machine learning or AI workflows, in which text, images or other unstructured data is mapped into vector spaces to enable similarity search, clustering, or other computational tasks.
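To make the idea concrete, here is a minimal sketch of embedding and similarity search. It is not the pipeline Wikimedia Deutschland or DataStax used: the word-count `embed` function is a toy stand-in for a learned embedding model, and the three entries are illustrative descriptions rather than records pulled from Wikidata.

```python
import math
from collections import Counter


def embed(text, vocabulary):
    """Toy embedding: count how often each vocabulary word appears in the text.

    Assumption: real systems use a learned neural embedding model instead.
    """
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocabulary]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query, entries, vocabulary):
    """Rank entries by similarity to the query, most similar first."""
    query_vector = embed(query, vocabulary)
    scored = [
        (cosine_similarity(query_vector, embed(entry, vocabulary)), entry)
        for entry in entries
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical entries for illustration only.
entries = [
    "douglas adams english writer and humorist",
    "berlin capital city of germany",
    "mount everest highest mountain on earth",
]
vocabulary = sorted({word for entry in entries for word in entry.split()})

results = search("capital of germany", entries, vocabulary)
best_score, best_entry = results[0]
print(best_entry)
```

A learned model would place related concepts close together even when they share no words, which simple word counts cannot do. The ranking step, comparing a query vector against stored vectors, is the same in principle.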

Wikidata acts as a linked open data platform for all versions of Wikipedia. It is the world's largest collaborative knowledge graph, containing open, editable data in more than 300 languages.

So far, a global community of more than 24,000 volunteers has contributed over 114 million entries, which are used by thousands of open-source software developers.

Vectors of complexity

The shared goal of Wikimedia Deutschland and DataStax is to provide this data as an open dataset of the world’s knowledge and make it available to the open-source AI and ML communities. However, vector embedding such a huge dataset is not easy. It requires resource-intensive embedding processes, because traditional linear read/write operations cannot keep up with the hundreds of thousands of updates Wikidata makes every single day.

"To vector embed a large, massively multilingual, multicultural, and dynamic dataset is a hard challenge, especially for low-resource, low-capacity open source developers," said Dr. Jonathan Fraine, Chief Technology Officer, Wikimedia Deutschland.

"With DataStax’s collaboration, there is a chance that the world can soon access large subsets of Wikidata’s data for their AI/ML applications through an easier-to-access method.

"Although only available in English for now, DataStax’s solution provided a valuable initial experiment ~10x faster than our previous, on-premise GPU solution. This near-real-time speed will permit us to experiment at scale and speed by testing the integration of large subsets in a vector database aligned with the frequent updates of Wikidata."

Wikimedia Deutschland plans to make Wikidata’s data easily accessible via advanced vector search. It intends to expand the functionality with fully multilingual models, such as those from Jina AI, through DataStax’s API portal, enabling semantic search in up to 100 of the languages represented on Wikidata.

All the vectorized data is available under a free CC0 Creative Commons licence.

“Our cooperation with DataStax and its approach has unlocked new capabilities and streamlined our processes, which will allow us to deliver faster and more accurate insights to our community,” said Lydia Pintscher, Portfolio Lead for Wikidata, Wikimedia Deutschland.

“DataStax offers a combination of scalability, ease of use, and advanced embedding models that supports and encourages the development of AI applications for the public good with open and high-quality data.”

DataStax describes itself as "a leading AI platform that helps companies and developers create more accurate AI applications with 60% reduced development time". Wikimedia Deutschland used the DataStax AI Platform, which is built with NVIDIA AI tech including NeMo Retriever and NIM microservices.

Ed Anuff, Chief Product Officer, DataStax, said: “We’re thrilled to see Wikimedia Deutschland improving accessibility to the world’s largest knowledge graph with our AI platform. The open source community is crucial as it can bring more common good and many new ideas and innovations to the digital world.”

Wikimedia Deutschland and DataStax plan to expand upon these initial projects by exploring capabilities such as GraphRAG to enhance search reliability, and by adding support for hundreds more languages to improve accessibility.
