Where did you aggregate all this info? Is the min/max data based on converting known habitats aka Range to discrete points / well defined geographical areas and further translating those areas to min/max values? Or just a composite reported from various sources?
Speaking of synonyms, how are you deduping? Are you tracking common names as well?
Interesting project!
Common names are tracked elsewhere in the DB. This is just one tiny part of it.
Honestly, if I had to choose the hardest part of the DB, it's finding pollination data for different species. There's no central collection, and some kinds of info are much harder to find than others. E.g. whether a plant has hermaphroditic flowers is comparatively common info; whether it's functionally dioecious (and if so, what kind of dioecy) is less common; whether the plant will self-fertilize or is parthenocarpic is less common still; and probably the least common is whether a plant that isn't self-fertile is nonetheless self-compatible. For far too many species I'm having to flag data as "best guesses based on relatives", which I don't like doing.
Habitat and range info is originally from PFAF and the sister site Useful Tropical Plants; I made donations to each of them for use of their data (of course, they just took their habitat info from various books and papers, so it's not really "their" data to begin with - but I also wanted to support their good aggregation work, and plan to support them more in the future - and encourage others to as well, as it makes projects like this possible). However, as I've been going through curating species, some entries have been changed to be more accurate. Altitudes come from a variety of sources, including but not limited to habitat descriptions. One weakness I forgot to mention above is that sometimes a plant has different altitude ranges in different areas; my code is only equipped to deal with a single altitude range.
The hardest part in the above program was defining geographic boundaries; I found a couple hundred boundary files (and use them), but boundaries for most places aren't so readily available. So I use the Geonames hierarchy; if an object in the database (with its own lat, lon, and alt coordinates) says it belongs to some higher-order region, then the program adds that object's lat/lon/alt to that region. Lat/lon/alt entries are grouped into grid points, 360 in latitude by 720 in longitude (i.e. half-degree cells), in order to map to the IPCC climate data (I use climate data from the past 20 years, although I could see arguments for using older data instead).
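A minimal sketch of that gridding step, not the actual code. It assumes row 0 starts at 90°S and column 0 at 180°W (the real climate files may be ordered differently), and `parents` is a hypothetical name-to-parent dictionary standing in for the Geonames hierarchy.

```python
N_LAT, N_LON = 360, 720  # half-degree cells, as described above


def grid_index(lat, lon):
    """Map a lat/lon in degrees to a (row, col) cell on a 360 x 720 grid.

    Assumption: row 0 starts at 90S and col 0 at 180W.
    """
    row = min(int((lat + 90.0) * N_LAT / 180.0), N_LAT - 1)  # clamp the north pole
    col = int((lon + 180.0) * N_LON / 360.0) % N_LON         # 180E wraps to 180W
    return row, col


def add_place(region_cells, place, parents):
    """Add a place's grid cell and altitude to every region above it.

    `place` is (name, lat, lon, alt). `parents` maps a name to its parent
    region's name (hypothetical stand-in for the Geonames hierarchy).
    """
    name, lat, lon, alt = place
    cell = grid_index(lat, lon)
    region = parents.get(name)
    while region is not None:
        region_cells.setdefault(region, {}).setdefault(cell, []).append(alt)
        region = parents.get(region)
```

Each region ends up with a dictionary of grid cells, and each cell keeps the altitudes of the places that fell into it, so regions without a boundary file still get a usable footprint.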
Once I have grid points for each place - be it an entire country, state/province, county, city, other place, etc. - the program parses the range text and attempts to match up direction adjectives (north, northwest, west, etc.) with nouns (aka placenames) - e.g. "northern and central Kenya" maps to "north Kenya" and "central Kenya". If there's no adjective it uses the entire locale stated. Northwest / southwest / northeast / southeast are interpreted as being anywhere to one side of the place's centre on both axes (aka quadrants). North / south / east / west have to be at least a certain distance away from the centre on their defining axis (so rather than each taking up about half of the locale, they take up more like a quarter of it). The adjective "central" describes about a third of the place in question. The exact amounts depend on the shape. Of course, if a species' range description says "China", but they really mean "southeast China"... hey, the program isn't psychic!
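Roughly, the adjective matching and region slicing could look like the sketch below. The cut-off values are guesses picked to give "about a quarter" and "about a third" of a rectangular bounding box; the real amounts depend on the place's shape. It also only handles single-word placenames and ignores places that straddle the antimeridian.

```python
import re

# Adjective forms mapped to a canonical direction (illustrative, not exhaustive).
DIRECTIONS = {"central": "central"}
for base in ("north", "south", "east", "west",
             "northwest", "northeast", "southwest", "southeast"):
    DIRECTIONS[base] = base
    DIRECTIONS[base + "ern"] = base
COMPOUND = {"northwest", "northeast", "southwest", "southeast"}


def parse_range(text, known_places):
    """Pair each direction adjective with the next known placename.

    Returns (direction, place) tuples; direction is None when the place
    has no adjective in front of it. Single-word placenames only.
    """
    results, pending = [], []
    for tok in re.findall(r"[A-Za-z]+", text):
        key = tok.lower()
        if key in DIRECTIONS:
            pending.append(DIRECTIONS[key])
        elif tok in known_places:
            results.extend((d, tok) for d in (pending or [None]))
            pending = []
    return results


def select_cells(cells, direction, cardinal_cut=0.5, central_cut=0.58):
    """Keep the (lat, lon) cells of a place that match a direction.

    Offsets are scaled to -1..1 from the bounding-box centre. Both cut
    values are assumptions, not the project's real thresholds.
    """
    if direction is None or not cells:
        return list(cells)
    lats = [c[0] for c in cells]
    lons = [c[1] for c in cells]
    mid_lat = (min(lats) + max(lats)) / 2
    mid_lon = (min(lons) + max(lons)) / 2
    half_lat = (max(lats) - min(lats)) / 2 or 1.0
    half_lon = (max(lons) - min(lons)) / 2 or 1.0
    cut = 0.0 if direction in COMPOUND else cardinal_cut  # quadrants start at the centre
    kept = []
    for lat, lon in cells:
        dy = (lat - mid_lat) / half_lat
        dx = (lon - mid_lon) / half_lon
        if direction == "central":
            ok = abs(dx) <= central_cut and abs(dy) <= central_cut
        else:
            ok = True
            if "north" in direction:
                ok = ok and dy > cut
            if "south" in direction:
                ok = ok and dy < -cut
            if "east" in direction:
                ok = ok and dx > cut
            if "west" in direction:
                ok = ok and dx < -cut
        if ok:
            kept.append((lat, lon))
    return kept
```

For example, `parse_range("northern and central Kenya", {"Kenya"})` yields one pair for "north" and one for "central", and each pair is then passed to `select_cells` along with Kenya's cells.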
By mapping the species to locales, we now have a list of gridpoints for each species. The program then averages the data from the gridpoints (in each category, and across all 12 months), for all of them that fall within the species' altitude range (with the aforementioned exception that if none fall within the altitude range, it uses all datapoints present and applies a lapse rate to adjust for altitude). It then returns, for each category, the average for the lowest-value month (so for example, for rainfall, average per-day rainfall in the driest month), the average of the whole year (in the same example, average per-day rainfall across the year), and the average for the highest-value month (e.g., the per-day rainfall for the wettest month).
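As a sketch of that averaging step for a single category, under a few assumptions: cells are averaged month by month first and the lowest and highest months are picked afterwards; the fallback adjusts values to the midpoint of the species' altitude range; and the lapse rate is passed in by the caller (0.0065 per metre is the standard-atmosphere temperature figure, and it should stay 0 for categories like rainfall).

```python
def summarize(cells, alt_min, alt_max, lapse_per_m=0.0):
    """Return (lowest-month, annual, highest-month) averages for one category.

    `cells` is a list of (altitude_m, [12 monthly values]) pairs.
    `lapse_per_m` is how much the value drops per metre climbed; it is only
    used when no cell lies inside the altitude range. Adjusting to the
    midpoint of the range is an assumption of this sketch.
    """
    if not cells:
        raise ValueError("no grid cells for this species")
    in_range = [months for alt, months in cells if alt_min <= alt <= alt_max]
    if in_range:
        series = in_range
    else:
        target = (alt_min + alt_max) / 2
        series = [[v - lapse_per_m * (target - alt) for v in months]
                  for alt, months in cells]
    # Average all cells for each month, then pick the extreme months.
    monthly = [sum(s[i] for s in series) / len(series) for i in range(12)]
    return min(monthly), sum(monthly) / 12, max(monthly)
```

Called with temperature data and `lapse_per_m=0.0065`, a species recorded only above the available datapoints gets correspondingly cooler figures instead of no figures at all.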
Oh, and lastly, species duplication: I use two authoritative species list registries (unfortunately, I don't remember which ones off the top of my head), which come with synonym lists. Unfortunately, the synonym lists are only within each given genus, so for example you'll see some rheedias in there and things like that. But I'll work out the glaring examples of that eventually.
The policy is to always use current names, even if they're annoying and nobody uses them - for example, Rosenbergiodendron formosum rather than Randia formosa.
My purpose for making the DB is to serve my needs, and it hasn't really been designed with the general public in mind, as there's some data that I don't have the rights to share (and so I've been writing a number of things in Icelandic). But I figured that this aspect of it might be nice to share with people.