Quisquiliae

Self-publishing of scientific datasets

by Catonano, Wed 28 February 2018

Publishing datasets

So, there's a previous episode to this article. It's this one.

Long story short, I participated in a grassroots initiative in the region where I was born.

It was about collecting some spreadsheets published by a governmental agency, "massaging" them, and turning them into a nicely formatted CSV file.
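To give an idea of what such "massaging" can involve, here is a minimal sketch in Python. It is not the code from the initiative: the function names, the semicolon delimiter and the decimal-comma convention are all assumptions about what a typical agency export might look like.

```python
import csv
import io


def clean_header(name: str) -> str:
    """Lowercase a column title and join its words with underscores."""
    return "_".join(name.strip().lower().split())


def clean_cell(value: str) -> str:
    """Trim whitespace; turn a decimal comma (e.g. '12,5') into a decimal point."""
    value = value.strip()
    candidate = value.replace(",", ".", 1)
    try:
        float(candidate)
    except ValueError:
        # Not a number: keep the trimmed text as it is.
        return value
    return candidate


def massage(raw_text: str, delimiter: str = ";") -> str:
    """Return a comma-separated CSV built from a messy delimited export.

    Assumption: the first non-empty row holds the column titles.
    """
    reader = csv.reader(io.StringIO(raw_text), delimiter=delimiter)
    # Drop rows that are empty or contain only blank cells.
    rows = [row for row in reader if any(cell.strip() for cell in row)]
    if not rows:
        return ""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([clean_header(h) for h in rows[0]])
    for row in rows[1:]:
        writer.writerow([clean_cell(c) for c in row])
    return out.getvalue()
```

Calling `massage(text)` on the text of an exported sheet returns the cleaned CSV as a string, which can then be written to a file and published next to the original source.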

The next step was to connect them to dictionaries to make them linked data. For some time a SPARQL endpoint was available, because a professor at the local university was involved and they hosted the dataset on a machine in their lab.
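As a rough illustration of the step from CSV to linked data, the sketch below turns each row into N-Triples statements using only the Python standard library. Everything in it is an assumption made for the example: the `http://example.org/` namespace is a placeholder, the function names are hypothetical, and a real project would map columns to terms from established dictionaries rather than minting a property per column.

```python
import csv
import io
from typing import List
from urllib.parse import quote

# Placeholder namespace, assumed for illustration only.
BASE = "http://example.org/"


def literal(value: str) -> str:
    """Format a plain N-Triples string literal, escaping special characters."""
    escaped = (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def csv_to_ntriples(csv_text: str, key_column: str) -> List[str]:
    """Emit one triple per non-empty cell; the key column names the subject."""
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = f"<{BASE}resource/{quote(row[key_column], safe='')}>"
        for column, value in row.items():
            # Skip the key itself, empty cells, and cells beyond the header.
            if column is None or column == key_column or not value:
                continue
            predicate = f"<{BASE}property/{quote(column, safe='')}>"
            triples.append(f"{subject} {predicate} {literal(value)} .")
    return triples
```

The resulting lines can be saved as a `.nt` file, which is a static file like any other and can be served without running a SPARQL endpoint.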

Here's the code.

So this endeavour had me thinking...

Can I publish the dataset I produced on my own?

In my dreams, I host a bunch of datasets on my home server and they are accessible on the internet.

I don't offer a SPARQL endpoint because my home server is weak.

But my collection of datasets covers a bunch of interesting phenomena going on in my region.

I am no academic. But the code I use to produce these datasets is released under the GPL and it's perfectly inspectable.

The same goes for the source files I use and for the files I produce.

I even adhere to reproducible science initiatives, the main one being Guix.

So any scientist can take my data, run the software I run, and inspect the code, and so verify that my massaging of the data is correct.

Let's say that instead of publishing only a collection of posts in a blog, I could publish a collection of datasets, curated by me.

My name could become a brand in the field, I could band together with others in doing this, and this could become something.

Over time, the aspects of life covered by our datasets could become diverse and wide, and we could be the source of some important scientific studies or data-based products.

I'm thinking about, for example, real estate values.

Real estate values are affected by many variables, and there is a scientific literature about how and why each variable affects the value of a property.

Air pollution, proximity of industrial settings, presence of mass transit, presence of public green areas, schools, whatever.

There are scientific articles about this, proposing different models.

With the right datasets available, each article could be extended to any region and the data presented in a proper form.

Maybe some app could spring out of this: an interactive data-based product.

For the final consumer, or citizen, or for the government. Or for both.

So some data analysis could be available for a given region, based on the models that the scientific literature proposes.

Any scientific article could become a data-based product.

In fact, the initiative to collect pollution data came about when I had an exchange with one of the authors of this article, Vincenza Chiarazzo.

Her article used those data and I asked her where she had found them.

That's how I decided to try to make the fetching of those data less incidental and more systematic.

This happened in 2014!

Now let's come to something more recent: how are real estate values affected by energy consumption profiles?

I already mentioned this article.

Can people get more appropriate value estimates for real estate properties?

Can we exert pressure on polluters (values going down for entire areas)?

Can we exert pressure on the markets and the governments about the lack of services, for example mass transit?

What could be born if reproducible science (based on Free Software) and open linked data were to meet and mate?

How, exactly? With which tools?

What is Data World?

And what is Datahub.io?

Datahub.io seems to fit well with what I am thinking. It seems the perfect tool to share a dataset and help it make its way in the world.

But can I trust a centralized service?

I was already burnt by social networks. Won't I be burnt by a social network dedicated to data geeks?

My hope was that I could use tools better suited to grassroots efforts, maybe peer-to-peer or federated.

GNUnet? Dat? IPFS? I don't know, really.