Mea culpa

December 7th, 2009

This morning, taking a nice long bath, I realised there could be a problem with my parsing of the temperature records. This turns out to be the case.

Following the format specification, I’d jumped to column 25, skipped any spaces or tabs, read the digits, read a dot (silently aborting if not present), and then optionally read another digit. That produced records for 285057 station-years.

What I didn’t do, and what my checks had failed to spot, was handle negative Fahrenheit temperatures. I guess I’m not used to winters being colder than salted ice. A minus sign is neither whitespace nor a digit, so it fell through both of those tests and then silently aborted at the dot check, since a minus sign isn’t a dot either! Net result: no records from any station where the temperature dips below 0.0F. Oops. I’ve fixed that (check for a minus sign before the digits, and negate the value read if one is present) and also put in a message on stderr if that abort is triggered.
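For the record, the fixed field-reading logic amounts to something like this – a simplified sketch rather than the exact code, with NAN standing in for however the real program signals the abort:

#include <math.h>
#include <stdio.h>

/* Sketch of the fixed temperature-field parser. p points at the
   field (column 25 onwards); returns NAN on a malformed field. */
static double read_temp(const char *p)
{
    double value = 0.0;
    int negative = 0;

    while (*p == ' ' || *p == '\t')          /* skip spaces and tabs */
        p++;
    if (*p == '-') {                         /* the missing case */
        negative = 1;
        p++;
    }
    while (*p >= '0' && *p <= '9')           /* whole degrees */
        value = value * 10.0 + (*p++ - '0');
    if (*p++ != '.') {                       /* no dot: abort, noisily now */
        fprintf(stderr, "malformed temperature field\n");
        return NAN;
    }
    if (*p >= '0' && *p <= '9')              /* optional tenths digit */
        value += (*p - '0') / 10.0;
    return negative ? -value : value;
}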

I’ve run through all the years, with no messages on stderr, and there are now 394264 station-years of data produced. Code will follow when I figure a good way to post it.

I’ve updated the Proposition 02 results; the fix has actually increased the margin, from 97.7% to 98.2% of stations whose warmest ten years are not all post-1997.

Proposition 02: False

December 6th, 2009

Some more horrific Perl, but it did the job…

rmw42@pandora:~/NOAA$ cat Average.txt | ./warmest 10
32 stations had 10 of their 10 warmest years post 1997
1360 stations did not have 10 of their 10 warmest years post 1997
11347 stations rejected for having insufficient data

So, I make that 98% of weather stations (1360 of 1392, strictly 97.7%) finding that the warmest ten years in their history were not post-1997. That’s quite shocking, really. I think it’s safe to say that the statement is completely and utterly false – if 98% of weather stations active for the last 24 years don’t show the last 12 as containing their ten warmest, by what measure can we claim those were the warmest years?

I want to test this to see if there’s any pattern to the stations used, whether requiring good data integrity skews the results, and whether rejecting so many stations (~90% of the total) was necessary – but I think the reasons I gave on Friday are sound. It doesn’t matter a damn to me if a station had good data during WW2 – if it hasn’t been active for 60 years, it can’t tell me how warm 2007 was! And surely a station giving only one temperature reading per year – yes, there are some like that – is hopeless?
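For reference, the per-station test in warmest boils down to something like this – sketched in C rather than the actual horrific Perl, and with a simple minimum-years test standing in for the real rejection rule:

#include <stdlib.h>

struct yeartemp { int year; double temp; };

/* Sort warmest years first. */
static int by_temp_desc(const void *a, const void *b)
{
    const struct yeartemp *x = a, *y = b;
    return (x->temp < y->temp) - (x->temp > y->temp);
}

/* Classify one station's year/temperature list (assumes top <= n):
   -1 = rejected for insufficient data,
    1 = all of the top warmest years are post-1997,
    0 = they are not. */
static int classify(struct yeartemp *yt, int n, int top)
{
    int i, post = 0;

    if (n < 24)                  /* stand-in for the rejection rule */
        return -1;
    qsort(yt, n, sizeof *yt, by_temp_desc);
    for (i = 0; i < top; i++)
        if (yt[i].year > 1997)
            post++;
    return post == top;
}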

As Adam and Jamie might say: Myth Busted!

Updated 2009/12/7 09:23:

rmw42@pandora:~/NOAA$ cat Average.txt | ./warmest 10
44 stations had 10 of their 10 warmest years post 1997
2460 stations did not have 10 of their 10 warmest years post 1997
13614 stations rejected for having insufficient data

Station locations against time

December 6th, 2009

I don’t know why I chose Perl; it’s a hateful language. Turns out it wasn’t quite so simple – ‘sort’ works lexicographically (i.e. as text), so “117” is less than “20” because 1<2. So it was a matter of converting the lines to numbers and sorting on those, using delightful syntax like

@values = sort {$a <=> $b} @values;

and then splitting cases on odd/even list lengths for the median, and again for the upper and lower quartiles. Still, in the end it gives me the results I’m looking for – a list of comma-separated values for year, number of stations, mean distance, median distance, lower quartile distance and upper quartile distance. Or "year,0,,,," if there are no stations. That should be just what I need to paste into Excel and produce graphs…
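The same numeric-comparator trick and the same split cases look like this in C – one common quartile convention among several, and a sketch rather than the actual Perl:

#include <stdlib.h>

/* Numeric comparator for qsort. */
static int by_value(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Midpoint of the sorted slice v[lo..hi]: an odd-length slice has a
   middle element, an even-length one averages the two middle ones. */
static double mid(const double *v, int lo, int hi)
{
    int n = hi - lo + 1;
    return n % 2 ? v[lo + n / 2]
                 : (v[lo + n / 2 - 1] + v[lo + n / 2]) / 2.0;
}

/* Sorts v[0..n-1] numerically (n >= 2) and fills in the quartiles,
   excluding the median from each half when n is odd. */
static void quartiles(double *v, int n, double *q1, double *q2, double *q3)
{
    qsort(v, n, sizeof *v, by_value);
    *q2 = mid(v, 0, n - 1);                  /* median */
    *q1 = mid(v, 0, n / 2 - 1);              /* lower half */
    *q3 = mid(v, (n + 1) / 2, n - 1);        /* upper half */
}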

Data ahoy!

December 6th, 2009

I’ve finally uploaded the early years’ weather data (about 350mb) to my shell account, which took about three hours this morning. I’ve processed it, so I now have a complete set of averages – about 280000 station-years.

I can use this with the country station list/distances to determine the set of stations in each year – something like

join -t',' -o1.3 AfricaStations.txt Average-1969.txt | sort -n

will get me the distances (3rd field in the first file, so 1.3) for all the stations active in 1969, sorted numerically into order of distance. Then I just need to get the order statistics and graph the result to get the answers for Proposition 01.

For Proposition 02, I can use the same station-year average temperature list, sorted by station, to extract and check the warmest years. I think that will need some Perl…

Station locations

December 6th, 2009

Excellent news – the six-digit part of the station codes is geographic and the list is sorted. Africa is in the range 600000-689999, so I can just cut out the lines from the middle of the file – same for Europe, etc.

Ocean distance from station

December 5th, 2009

It only needs a small change to the map-drawing Java to take the NOAA list of stations and calculate their distances to the nearest ocean. There are 30800 stations, so it’ll take a while, but I should then have a map from station ID to distance-to-ocean. Yay!

Distance to ocean map

December 5th, 2009

Well, the DTED collection took rather longer than it should have – having to dig around for all the flyspeck islands in the Pacific. I’m still not sure I’ve got them all, but it’ll do for now.

I’ve updated the map-drawing code to calculate Great Circle distances between points, and therefore calculate the nearest ocean to each point of land:

(The image encodes the actual values obtained: (red*16) + (green & 15) = distance in km. The red channel alone is probably sufficient given the accuracies involved.)

This is only rough, as it is working at the level of 1-degree x 1-degree cells, so distances are only accurate to around +/-50km. This can be improved by using the data in the DTED cells, rather than just their existence, and should get to well under 1km errors in the temperate latitudes, but for a quick test it’ll suffice.
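The great-circle calculation itself is nothing exotic; in C (the Java is the same shape) a haversine version looks like this – a sketch of the kind of formula involved, not a claim about the exact variant the map code uses:

#include <math.h>

#define EARTH_RADIUS_KM 6371.0

/* Great-circle distance in km between two lat/lon points given in
   degrees, via the haversine formula. */
static double great_circle_km(double lat1, double lon1,
                              double lat2, double lon2)
{
    const double rad = 3.14159265358979323846 / 180.0;
    double dlat = (lat2 - lat1) * rad / 2.0;
    double dlon = (lon2 - lon1) * rad / 2.0;
    double a = sin(dlat) * sin(dlat)
             + cos(lat1 * rad) * cos(lat2 * rad) * sin(dlon) * sin(dlon);
    return 2.0 * EARTH_RADIUS_KM * asin(sqrt(a));
}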

Temperature records

December 5th, 2009

The NOAA data is awkwardly arranged. It’s in (literally) thousands of GZip files, one per year per station, stored in yearly TAR files. About 3gb compressed, so probably double that uncompressed. My PC chokes on them as the virus checker sees an archive file and decides to look inside, so I’ve had to download them using a shell account on a Linux box. I’ve now got the years 1929-1959 and 1970-1973 on my PC, and 1960-1969 and 1974-2009 on the shell account.

I’ve written a C program (“annual”) to parse the temperature records and calculate the mean (and standard deviation), and a shell script to de-TAR a year’s data to a temp directory and to do

cat $f | gzip -d | tail -n+2 | ./annual $1 >> Average-$1.txt

for each station record ($f). The ‘tail’ call strips off the first line (column headers), and $1 passes through the year given to the shell script. All this results in a comma-separated text file containing station ID, year, mean temperature, number of samples, and the standard deviation.
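Stripped of the fixed-column parsing and the station/year bookkeeping, the statistics half of annual is just a single-pass accumulation – simplified here to read one value per line:

#include <math.h>
#include <stdio.h>

/* Accumulate mean and (sample) standard deviation in one pass over
   a stream of temperatures, one per line. */
int main(void)
{
    double t, sum = 0.0, sumsq = 0.0;
    long n = 0;

    while (scanf("%lf", &t) == 1) {
        sum += t;
        sumsq += t * t;
        n++;
    }
    if (n > 0) {
        double mean = sum / n;
        double sd = n > 1 ? sqrt((sumsq - n * mean * mean) / (n - 1))
                          : 0.0;
        printf("%f,%ld,%f\n", mean, n, sd);   /* mean, samples, sd */
    }
    return 0;
}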

Now I just need another shell script to loop through this lot, and to deal with the files on my home PC as well as the ones on the Linux box… If I ‘nice’ everything, hopefully nobody will notice it running all night!

Data collection

December 5th, 2009

Yesterday and this morning I’ve collected a huge number of DTED files from here.

The terrain information is made available for free by the USGS, but is generally provided on physical media as it’s so large – multiple DVDs for level 1 DTED. Of course, there’s no such thing as a free lunch: the files are grouped into 10-degree blocks and then arranged by country, which makes them somewhat awkward to find – or a fun geography quiz, if you prefer!

I’ve also cobbled together a bit of Java to produce a map showing which terrain cells I’ve got. Sadly, this shows I’ve got some more work to do:

Proof of concept

December 4th, 2009

I’ve downloaded the data for station 723150-03812, Asheville Municipal Airport in North Carolina. This is the one the NCDC/NOAA use for their sample data and is, I guess, their local airport.

Throwing the full 1948-2009 data – about 26000 records – into an Excel spreadsheet, using the SUMIF and COUNTIF functions to pull out the relevant days, and sorting the results shows that the ten warmest years in Asheville are (from lowest to highest): 1974, 1980, 2001, 1999, 2002, 1991, 2007, 2009, 1998, 1990

2009 is an aberration as there are a bunch of cold days to come before the end of the year, but it’s clear that there have been equally warm years in recent decades.

This surprised me; I expected at least eight or nine of the ten warmest to be recent years if the effect were clear-cut – particularly given all the fuss people have made about airport locations for weather stations, the El Niño in ’98, and the satellite data showing warming throughout the 1990s.

It remains to be seen how typical (or not) Asheville is…