The Minor Planet Center (MPC) provides an interesting web service for downloading data from its databases.
After that, I had to convert the XML file (quite big: about 131000 asteroids) into a CSV file.
I used another Python utility called xml2csv as follows (once for every file):
xml2csv --input part1.xml --output part1.csv --tag property
Finally, I concatenated all CSV files together.
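The concatenation step can be sketched in Python as well. This is only a minimal sketch, not the exact procedure I used: it assumes that every partial CSV file starts with the same header row, and the function name concat_csv is made up for illustration.

```python
import csv


def concat_csv(parts, output):
    """Concatenate CSV files that share a header, writing the header only once.

    Assumption: every input file begins with the same header row.
    Returns the number of data rows written.
    """
    header = None
    rows_written = 0
    with open(output, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        for part in parts:
            with open(part, newline="", encoding="utf-8") as src:
                reader = csv.reader(src)
                part_header = next(reader, None)
                if part_header is None:
                    continue  # skip empty files
                if header is None:
                    # First non-empty file: remember and emit its header.
                    header = part_header
                    writer.writerow(header)
                elif part_header != header:
                    raise ValueError(f"{part}: header differs from the first file")
                for row in reader:
                    writer.writerow(row)
                    rows_written += 1
    return rows_written
```

A call such as concat_csv(["part1.csv", "part2.csv"], "all.csv") would then produce a single file with one header line followed by all the data rows.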
I used a data mining package called Weka, developed by the University of Waikato in New Zealand.
The distribution of asteroids is like this:
Within a set of 131072 asteroids, we can easily see the top three groups:
- S-type: 111565
- C-type: 13207
- E-type: 5671
Let's see how we can recognize these three groups based on some physical parameters chosen among color indexes.
Panstarrs parameter distribution
These graphs, made with the ggplot2 package for R, confirm that the panstarrs parameters make it possible to discriminate between S-type, C-type and E-type asteroids.
In fact, every single distribution is made up mostly of asteroids belonging to the same taxonomic class.
We can also visually display the covariance matrix as a "heatmap".
I found a very interesting link that explains how to do this:
ggplot2 : Quick correlation matrix heatmap - R software and data visualization
This is the result:
Finally, let's go back to Weka and perform:
- cluster analysis
- logistic model
- The S-type asteroids were associated with cluster 0
- The C-type asteroids were associated with cluster 1
- The E-type asteroids were associated with cluster 2
- All the other types were mainly attributed to cluster 0, with the exception of the V-type asteroids, which were grouped in the same cluster as the E-type asteroids. Cluster 1 consists entirely of C-type asteroids.
The three cluster centroids are as follows:
The Weka Logistic model
First of all, we must establish a baseline for the performance we expect to get (the ZeroR model).
There are 111565 S-type asteroids in a set of 131072 asteroids, so always predicting S-type already gives about 85% accuracy: any "true" model must do much better than that.
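The baseline arithmetic can be checked with a few lines of Python. This is just a sketch of what a majority-class predictor amounts to, not Weka's own ZeroR code; the function name zero_r_accuracy is hypothetical, and the counts are the ones listed in the distribution above.

```python
def zero_r_accuracy(class_counts, total=None):
    """Return the majority class and the accuracy of always predicting it."""
    if total is None:
        total = sum(class_counts.values())
    majority_class = max(class_counts, key=class_counts.get)
    return majority_class, class_counts[majority_class] / total


# Counts of the three largest groups; the total includes the smaller types too.
counts = {"S": 111565, "C": 13207, "E": 5671}
label, accuracy = zero_r_accuracy(counts, total=131072)
print(f"ZeroR always predicts {label}-type: accuracy {accuracy:.3f}")
```

The accuracy comes out at roughly 0.851, which is the figure any real model has to beat.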
After running the logistic model with 10-fold cross-validation, I got these results:
As expected, the performance is very good not only for S-type asteroids but also for C-type asteroids (precision and recall = 1) and E-type asteroids (precision = 0.928, recall = 1), but the model fails to predict the other, less numerous types.
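To get a feeling for what a precision of 0.928 with a recall of 1 means for the E-type class, we can solve precision = TP / (TP + FP) for the number of false positives. This is a rough estimate under two assumptions: that the reported figures refer to the whole set (in cross-validation every asteroid is tested exactly once), and that the precision is rounded to three decimals, so the result is only approximate. The function name is made up for illustration.

```python
def implied_false_positives(true_positives, precision):
    """Solve precision = TP / (TP + FP) for FP."""
    return true_positives / precision - true_positives


e_type_total = 5671   # E-type count from the distribution above
recall = 1.0          # reported recall for E-type
precision = 0.928     # reported precision for E-type (rounded value)

# With recall = 1, every E-type asteroid is a true positive.
true_positives = recall * e_type_total
false_positives = implied_false_positives(true_positives, precision)
print(f"about {false_positives:.0f} asteroids of other types labelled as E-type")
```

So, under these assumptions, roughly 440 asteroids of other types end up labelled as E-type.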