Next Priorities

Cataloging: We’re developing a flow that lets users drop files onto the UI, where they are analyzed to extract metadata automatically, without the file necessarily ever being uploaded to a server. In addition, we’re building a quick, easy-to-use AI to create and update metadata records for datasets.

Duplication: An unfortunate consequence of data sharing and modification is that often multiple duplicates of a dataset will exist in various parts of an organisation, with no clue as to which is the definitive version or even if there is one. Magda will have new features that allow for duplicate datasets to be detected and aggregated.
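As a rough illustration of the simplest case, exact duplicates can be found by grouping datasets on a hash of their content. This is a minimal sketch only: `group_duplicates` and the idea of holding dataset content as bytes keyed by id are assumptions made for the example, not Magda's design, and a content hash will not catch copies that have been modified.

```python
import hashlib
from collections import defaultdict


def group_duplicates(datasets: dict[str, bytes]) -> list[list[str]]:
    """Group dataset ids whose content is byte-for-byte identical.

    `datasets` maps a (hypothetical) dataset id to its raw content.
    """
    by_digest: dict[str, list[str]] = defaultdict(list)
    for dataset_id, content in datasets.items():
        digest = hashlib.sha256(content).hexdigest()
        by_digest[digest].append(dataset_id)
    # Keep only groups with more than one member, i.e. actual duplicates.
    return [sorted(ids) for ids in by_digest.values() if len(ids) > 1]
```

Detecting near-duplicates (for example a copy with a few rows edited) would need a fuzzier comparison than this.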

Storage: Magda currently operates purely on metadata and doesn’t store the underlying data. While this isn’t as restrictive as it sounds, it does prevent us from adding future features that could provide a great deal of value to users, like APIification or secure data sharing.

APIification: The ability to turn well-formatted datasets into RESTful APIs - this is something that’s already offered by CKAN and is depended upon by existing users. This may mean replicating how CKAN does it - ingesting the data into a single database and running queries on it - or adopting an AWS Lambda-esque approach where we spin up containers on demand to serve requests.
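The "ingest into a database and query it" option can be sketched roughly as follows. This is an illustration under stated assumptions, not how CKAN or Magda implements it: it uses an in-memory SQLite database, stores every column as text, and the helper names `ingest_csv` and `query` are hypothetical. A real service would put an HTTP layer in front of `query`.

```python
import csv
import io
import sqlite3


def _quote(identifier: str) -> str:
    # Quote an SQL identifier, escaping any embedded double quotes.
    return '"' + identifier.replace('"', '""') + '"'


def ingest_csv(conn: sqlite3.Connection, table: str, csv_text: str) -> list[str]:
    """Load a CSV with a header row into a new table; returns the column names."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    columns = ", ".join(f"{_quote(name)} TEXT" for name in header)
    conn.execute(f"CREATE TABLE {_quote(table)} ({columns})")
    placeholders = ", ".join("?" for _ in header)
    conn.executemany(f"INSERT INTO {_quote(table)} VALUES ({placeholders})", reader)
    return header


def query(conn: sqlite3.Connection, table: str, filters: dict[str, str]) -> list[dict[str, str]]:
    """Return rows where every column in `filters` equals the given value."""
    sql = f"SELECT * FROM {_quote(table)}"
    if filters:
        sql += " WHERE " + " AND ".join(f"{_quote(col)} = ?" for col in filters)
    cursor = conn.execute(sql, list(filters.values()))
    names = [description[0] for description in cursor.description]
    return [dict(zip(names, row)) for row in cursor.fetchall()]
```

For example, after ingesting a CSV as the table `stations`, a request such as `?state=VIC` could be answered with `query(conn, "stations", {"state": "VIC"})`.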

Longer-Term Ideas

Exposing Usage to Custodians: Currently data custodians publish data but have very little idea of how much take-up there is. We want to make it extremely easy for them to see how much their data is being viewed/downloaded.

Subjective Data Usefulness/Usability/Ease-of-Use/Interest: Many datasets rate well on the 5-star scale (say, a CC-licensed CSV) but are subjectively not as useful as other datasets. Ideally, in addition to applying objective measures of quality like the 5-star scale, Magda would also measure datasets subjectively, taking into account how useful, usable and relevant a dataset is to the average user. When an internet user searches the web via Google, they don’t just expect the most relevant results to come back; they also expect the highest-quality results, which Google determines via a number of measures, some of which take quite a subjective view of what a high-quality page is. We’d like to develop something similar for Magda: a search for specific data should always return that data, while more general searches should return the most useful and interesting datasets that fit the criteria. Some ideas we’re considering:

Dataset Feedback/Collaboration V2: Closing the loop between data users and custodians by providing a Quora-style Q&A interface where users can ask questions about the data or report problems with it. Data custodians (as identified in the dataset’s metadata) are then notified and invited to join the discussion with the user. Naturally this will also involve building moderation features for administrators and the ability to keep some Q&As private.

Layering: Allowing for parts of an aspect to be changed in such a way that the next time the dataset is harvested the changes aren’t overwritten.
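One way to picture layering is to keep local edits in a separate overlay and merge them over whatever the harvester last produced, so a re-harvest replaces only the base. The sketch below assumes an aspect is a plain JSON-like dictionary; `apply_layer` and the field names are hypothetical and not part of Magda's registry API.

```python
from typing import Any


def apply_layer(harvested: dict[str, Any], overrides: dict[str, Any]) -> dict[str, Any]:
    """Return the harvested aspect with local overrides applied on top.

    Nested dictionaries are merged key by key; any other value in
    `overrides` replaces the harvested value outright.
    """
    merged = dict(harvested)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = apply_layer(merged[key], value)
        else:
            merged[key] = value
    return merged
```

Because the overrides are stored separately and never written into the harvested copy, applying the same overlay after the next harvest yields the edited fields again alongside any freshly harvested ones.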

GraphQL API: The structure of the registry is effectively a graph, and hence lends itself well to being queried via GraphQL.

Saved Search: Just as on eBay you can search for something you want to buy and be emailed when new items are listed, on Magda you should be able to save a search for data and be notified when new datasets that match it are posted.
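The matching step could look something like the following. It is a deliberately naive sketch: the `SavedSearch` type, the `matching_searches` helper and whole-word keyword matching are all assumptions for illustration, whereas a real implementation would presumably reuse Magda's search engine to decide what matches.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SavedSearch:
    email: str                # who to notify (hypothetical field)
    keywords: frozenset[str]  # lower-case words that must all be present


def matching_searches(title: str, description: str, searches: list[SavedSearch]) -> list[str]:
    """Return the emails of saved searches whose keywords all appear in the new dataset."""
    words = set(f"{title} {description}".lower().split())
    return [search.email for search in searches if search.keywords <= words]
```

Running this whenever a dataset is added gives the list of users to notify about it.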