Tuesday, January 29, 2013

The law of small numbers

I knew about the law of large numbers (and its violations), but I had never reflected on the law of small numbers.

You can learn about it by following this link. It is mostly about the Poisson distribution, which is indeed ubiquitous in hydrology as well. So reading this R-related post is certainly interesting and useful for us too.
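To make the idea concrete, here is a minimal R sketch of my own (not taken from the linked post): the "law of small numbers" says that counts of rare events, i.e. a binomial with many trials and a small success probability, are well approximated by a Poisson distribution, a situation we often meet with rare hydrological events.

```r
# Law of small numbers: Binomial(n, p) with large n and small p
# is approximately Poisson(lambda), with lambda = n * p
set.seed(42)
n      <- 10000    # many independent trials
p      <- 0.0003   # small probability of a "success" (a rare event)
lambda <- n * p    # Poisson rate implied by the binomial (here 3)

counts <- rbinom(100000, size = n, prob = p)   # simulated binomial counts

# Compare empirical frequencies with the Poisson probability mass function
k <- 0:10
cbind(k,
      binomial = as.numeric(table(factor(counts, levels = k))) / length(counts),
      poisson  = dpois(k, lambda))
```

The two columns agree closely, which is the whole point: for rare events you can often forget the binomial details and work with the single Poisson parameter.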


No code No paper

This is entirely from Simply Statistics » R, and I completely agree with it. It applies in the very same way to the hydrological literature.

"I think it has been beat to death that the incentives in academia lean heavily toward producing papers and less toward producing/maintaining software. There are people that are way, way more knowledgeable than me about building and maintaining software. For example, Titus Brown hit a lot of the key issues in his interview. The open source community is also filled with advocates and researchers who know way more about this than I do.


This post is more about my views on changing the perspective of code/software in the data analysis community. I have been frustrated often with statisticians and computer scientists who write papers where they develop new methods and seem to demonstrate that those methods blow away all their competitors. But then no software is available to actually test and see if that is true. Even worse, sometimes I just want to use their method to solve a problem in our pipeline, but I have to code it from scratch!

I have also had several cases where I emailed the authors for their software and they said it “wasn’t fit for distribution” or they “don’t have code” or the “code can only be run on our machines”. I totally understand the first and last, my code isn’t always pretty (I have zero formal training in computer science so messy code is actually the most likely scenario) but I always say, “I’ll take whatever you got and I’m willing to hack it out to make it work”. I often still am turned down.

So I have a new policy when evaluating CV’s of candidates for jobs, or when I’m reading a paper as a referee. If the paper is about a new statistical method or machine learning algorithm and there is no software available for that method – I simply mentally cross it off the CV. If I’m reading a data analysis and there isn’t code that reproduces their analysis – I mentally cross it off. In my mind, new methods/analyses without software are just vapor ware. Now, you’d definitely have to cross a few papers off my CV, based on this principle. I do that. But I’m trying really hard going forward to make sure nothing gets crossed off.

In a future post I’ll talk about the new issue I’m struggling with – maintaining all that software I’m creating."

Wednesday, January 23, 2013

Object Modelling System Resources

As readers know from previous posts, we (my collaborators, my students, and I) use OMS3 (and we will use it even more in the future, embedding all of our modelling efforts in it), in collaboration with OMS3 developer-in-chief Olaf David and others. Any involvement with OMS should start with browsing the OMS3 web site and the information available there (for instance, but not only, this).
To use it, first download the console, then read the installation notes and the console FAQ (we will provide a brief description of its use soon), which remain the main sources of information about the tool.

However, during the BioMA summer school, Olaf, Jim Ascough, Jack Carlson and Giuseppe Formetta provided some further material, which you can finally find below.

JGrasstools uses OMS3 (even if a version older than 3.1), and one can also find relevant information by browsing their site.

Other examples of using the OMS3 console and scripting will follow soon.

Monday, January 21, 2013

PostgreSQL your data

Science is a matter of hypotheses and data. Hypotheses become formal models, and then you have to acquire data to prove them (a big word indeed) in the feeble light of statistics. At the beginning you collect data all over your hard disk (I assume the data were digitised). After a few months you are submerged by them. You throw them away and restart from the beginning.
Fortunately, some institutions store their data in databases, so the reboot is relatively easy. However, this covers only the primary data sets, not the data that you yourself produce by running your models and doing your inferences.
So, sooner or later, you have to face the reality that you should store your stuff in a more ordered way and build your own database. This opens various questions. Is it really necessary to use database software (c'mon, learning yet another tool!)? Obviously not: a database, in its general meaning, can be just an ordered set of data, so you could simply use your filesystem for it (I say it, but I do not really believe it). However, you then have to remember where the data are and use the search utility of your operating system to find what you are looking for (assuming you documented every step you made in a searchable way).
Databases help with that, and they usually offer a query language (typically SQL) that lets you find and select the data you need, again and again. So, at a certain moment, one has to take seriously the idea of using a database.
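As a taste of what this buys you, here is a minimal sketch in R (the database name "hydro", the table "discharge" and the credentials are all hypothetical, invented for illustration) that stores a small set of measurements in PostgreSQL and selects them back with SQL, via the DBI interface of the RPostgreSQL package:

```r
# A minimal sketch (all names hypothetical): store and retrieve
# discharge measurements in PostgreSQL from R, via DBI/RPostgreSQL
library(RPostgreSQL)

con <- dbConnect(PostgreSQL(), dbname = "hydro",
                 host = "localhost", user = "me", password = "secret")

# Store a data frame of measurements as a table
discharge <- data.frame(
  station = c("Adige", "Adige", "Brenta"),
  day     = as.Date(c("2013-01-01", "2013-01-02", "2013-01-01")),
  q_m3s   = c(153.2, 160.8, 42.1))
dbWriteTable(con, "discharge", discharge, row.names = FALSE)

# ...and find it again whenever needed, with an SQL query
high <- dbGetQuery(con,
  "SELECT station, day, q_m3s FROM discharge
   WHERE q_m3s > 100 ORDER BY day")
print(high)

dbDisconnect(con)
```

The point is that the query, not your memory of the filesystem, does the searching.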
Nowadays there exist many free and open-source database solutions (besides, obviously, the commercial ones by Oracle, IBM and others). Among the most widespread I cite MySQL, PostgreSQL and H2. Each one is a valid choice, with different characteristics.
In recent years we have focused our attention on PostgreSQL for its completeness and for having been the first to include a way to manage geographic (geometric) data such as shapefiles^*. This is actually done by an extension called PostGIS, developed by the same Refractions guys who also promote uDig.
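Just to give a flavour of what PostGIS adds, here is a hypothetical sketch (the table "basins", its geometry column "geom" and the database name are invented for illustration): spatial operations become ordinary SQL, callable from R like any other query:

```r
# A sketch of a PostGIS spatial query from R; the table "basins",
# its geometry column "geom" and the database name are hypothetical
library(RPostgreSQL)

con <- dbConnect(PostgreSQL(), dbname = "hydro")

# ST_Area is a PostGIS function: here it returns areas in the units
# of the projection (assumed metric), converted to square kilometres
basin_areas <- dbGetQuery(con, "
  SELECT name, ST_Area(geom) / 1.0e6 AS area_km2
  FROM basins
  ORDER BY area_km2 DESC")
print(basin_areas)

dbDisconnect(con)
```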
Alban De Lavenne, a Ph.D. student at Rennes Agrocampus who spent a few months among us, gave a talk about how he uses PostgreSQL to support his research. His presentation is, as usual, on SlideShare (I am working to provide the data to run his examples).

The first step is certainly to install PostGIS. The first time, I (being a Mac guy) used the KyngChaos instructions for installing PostgreSQL. However, I noticed that nowadays there are various other possibilities, listed on the main PostgreSQL page.
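After the installation, PostGIS has to be enabled on each database that needs it. On PostgreSQL 9.1 or later this is a single SQL command; a minimal sketch from R, assuming an existing database called "hydro" (a hypothetical name):

```r
# A sketch: enable PostGIS on an existing database (PostgreSQL >= 9.1)
# and check that it answers; the database name "hydro" is hypothetical
library(RPostgreSQL)

con <- dbConnect(PostgreSQL(), dbname = "hydro")

dbSendQuery(con, "CREATE EXTENSION IF NOT EXISTS postgis;")
print(dbGetQuery(con, "SELECT postgis_full_version();"))

dbDisconnect(con)
```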

Alban's instructions and suggestions follow the installation and cover some typical hydrological problems. For a complete understanding, the tutorial on the PostgreSQL site can certainly help. Around the web one can also find other video tutorials, such as this one provided by David Fetter, or this comprehensive set of screencasts on iTunes U by Selena Deckelmann and others.

Obviously I am open to any contribution to improve this post.

^* - Recently PostgreSQL/PostGIS acquired the capability to store and manage "raster data" and images, which makes it even more appealing.

Monday, January 14, 2013

Alpine Convention

The Alpine Convention is an international treaty between the Alpine countries (Austria, France, Germany, Italy, Liechtenstein, Monaco, Slovenia and Switzerland) as well as the EU, aimed at promoting sustainable development in the Alpine area and at protecting the interests of the people living within it. It embraces the environmental, social, economic and cultural dimensions.
The following slides (in Italian), from a couple of seminars that Andrea Bianchini gave here in Trento last week, clarify the structure of the Convention and try to delineate the way the Water Platform works. Here is the Wikipedia page.

For 2013 and 2014, under the Italian Presidency of the Convention, I will be the president of the Water Platform. The official mandate of the platform for 2013-2014 has already been stipulated among the participants (a synthesis is available on SlideShare): you can read it in Italian (the official version) or in an unofficial English version (English is not among the languages of the Alpine regions).
The official website of the platform is here.