Posts Tagged ‘data’

GHCN data

Monday, December 14th, 2009

I’ve been looking at the GHCN data as a possible replacement for (or addition to) the GSOD series. It’s mostly a set of monthly averages, though there is a 1.7 GB lump of “daily” data (which, glancing at extracts, appears to have a huge number of missing entries).

The GHCN does appear to have better coverage of Africa, and of 1934, so it will be interesting to see how that affects the results.

Mea culpa

Monday, December 7th, 2009

This morning, taking a nice long bath, I realised there could be a problem with my parsing of the temperature records. This turns out to be the case.

Following the format specification, I’d jumped to column 25, skipped any spaces or tabs, read the digits, read a dot (silently aborting if not present), and then optionally read another digit. That produced records for 285057 station-years.

What I didn’t do, and what my checks had failed to spot, was handle negative Fahrenheit temperatures. I guess I’m not used to winters being colder than salted ice. A minus sign is neither whitespace nor a digit, so it fell through those tests, and the parse then silently aborted at the dot check! Net result: no records from any station where the temperature ever dips below 0.0°F. Oops. I’ve fixed that (check for a minus sign, and negate the value read if one is present) and also added a message on stderr whenever that abort is triggered.
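For illustration, the corrected logic amounts to something like this (sketched in Perl for brevity – the actual parser is C, and parse_temp is just a stand-in name):

# Corrected parse, as a Perl sketch: start at column 25, allow
# leading spaces/tabs and an optional minus sign, then digits,
# a dot, and an optional final digit.
sub parse_temp {
    my ($line) = @_;
    my $field = substr($line, 24);          # column 25, 0-indexed
    if ($field =~ /^[ \t]*(-?\d+\.\d?)/) {
        return $1 + 0;                      # already negative if signed
    }
    warn "unparseable temperature field: $field\n";   # the new stderr message
    return undef;
}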

I’ve run through all the years with no messages on stderr, and there are now 394264 station-years of data produced. Code will follow when I figure out a good way to post it.

I’ve updated the Proposition 02 results, though the fix has actually increased the margin, from 97.7% to 98.2% of stations whose ten warmest years are not all post-1997.

Station locations against time

Sunday, December 6th, 2009

I don’t know why I chose Perl – it’s a hateful language. It turns out the job wasn’t quite so simple: ‘sort’ works lexicographically (i.e. as text), so “117” is less than “20” because 1 < 2. The fix is to convert lines to numbers and sort on those, using delightful syntax like

@values = sort {$a <=> $b} @values;

and then splitting cases for odd/even list lengths for the median, and again for the upper/lower quartiles. Still, in the end it gives me the results I’m looking for – a line of comma-separated values for year, number of stations, mean distance, median distance, lower quartile distance and upper quartile distance; or “year,0,,,,” if there are no stations. That should be just what I need to paste into Excel and produce graphs…
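For the record, the odd/even handling boils down to something like this sketch (taking the quartiles as medians of the lower and upper halves is one convention among several – the exact choice barely matters at these sample sizes):

# Median of a numerically sorted list, splitting the odd/even cases.
sub median {
    my @v = @_;
    my $n = scalar @v;
    return $n % 2 ? $v[($n - 1) / 2]
                  : ($v[$n / 2 - 1] + $v[$n / 2]) / 2;
}

@values = sort { $a <=> $b } @values;   # numeric, not lexicographic
my $n = scalar @values;
my ($med, $lq, $uq) = ('', '', '');     # empty fields if no stations
if ($n > 1) {
    $med = median(@values);
    # Quartiles as medians of the halves, excluding the middle
    # element when the count is odd.
    $lq = median(@values[0 .. int($n / 2) - 1]);
    $uq = median(@values[$n - int($n / 2) .. $n - 1]);
}
elsif ($n == 1) {
    ($med, $lq, $uq) = ($values[0]) x 3;
}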

Data ahoy!

Sunday, December 6th, 2009

I’ve finally uploaded the early years’ weather data (about 350 MB) to my shell account, which took about three hours this morning. I’ve processed it, and so I now have a complete set of averages – about 280000 station-years.

I can use this with the country station list/distances to determine the set of stations in each year – something like

join -t',' -o 1.3 AfricaStations.txt Average-1969.txt | sort -n

will get me the distances (third field of the first file, hence 1.3) for all the stations active in 1969, in order of distance (the -n keeps the sort numeric rather than lexicographic). Then I just need to take the order statistics and graph the result to get the answers for Proposition 01.

For Proposition 02, I can use the same station-year average temperature list, sorted by station, to extract and check the warmest years. I think that will need some Perl…
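Here’s a first stab at that Perl – a sketch only: it groups in a hash rather than relying on the sort order, and treating “warmest ten years” as the ten highest annual means is my reading of the proposition:

#!/usr/bin/perl
# Sketch: read "station,year,mean,..." lines, then for each station
# with at least ten years of data, find its ten warmest years and
# check whether any of them falls in 1997 or earlier.
use strict;
use warnings;

my %temps;    # station => { year => mean }
while (<>) {
    chomp;
    my ($station, $year, $mean) = split /,/;
    $temps{$station}{$year} = $mean;
}

my ($total, $not_all_post97) = (0, 0);
for my $station (keys %temps) {
    my $years = $temps{$station};
    next if keys %$years < 10;    # need at least ten years to test
    my @warmest = (sort { $years->{$b} <=> $years->{$a} } keys %$years)[0 .. 9];
    $total++;
    $not_all_post97++ if grep { $_ <= 1997 } @warmest;
}
printf "%.1f%% of %d stations do not have all ten warmest years post-1997\n",
    100 * $not_all_post97 / $total, $total if $total;

Run over the whole set of Average-*.txt files, that should reproduce the percentages above.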

Station locations

Sunday, December 6th, 2009

Excellent news – the six-digit part of the station codes is geographic and the list is sorted. Africa is in the range 600000-689999, so I can just cut out the lines from the middle of the file – same for Europe, etc.
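Since the list is sorted I could find the line numbers by hand, but a range test is just as easy – something like this one-liner (stations.txt standing in for whatever the list file is actually called):

perl -ne 'print if /^(\d{6})/ and $1 >= 600000 and $1 <= 689999' stations.txt > AfricaStations.txt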

Ocean distance from station

Saturday, December 5th, 2009

Only needs a small change to the map-drawing Java to take the NOAA list of stations and calculate their distances to the nearest ocean. There are 30800 of those so it’ll take a while, but I should then have a map of station ID to distance-to-ocean. Yay!

Distance to ocean map

Saturday, December 5th, 2009

Well, the DTED collection took rather longer than it should have – I had to dig around for all the flyspeck islands in the Pacific. I’m still not sure I’ve got them all, but it’ll do for now.

I’ve updated the map-drawing code to calculate Great Circle distances between points, and therefore calculate the nearest ocean to each point of land:

(The image encodes the actual values obtained: (red × 16) + (green & 15) = distance in km. The red channel alone is probably sufficient given the accuracies involved.)

This is only rough, as it works at the level of 1° × 1° cells, so distances are only accurate to around ±50 km. This can be improved by using the data within the DTED cells, rather than just their existence, which should get the errors well under 1 km in the temperate latitudes – but for a quick test it’ll suffice.
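The great-circle step itself is nothing exotic; for reference, here is an equivalent sketch in Perl using the core Math::Trig module (the real map code is Java, and distance_km is just an illustrative name):

use Math::Trig qw(deg2rad great_circle_distance);

# great_circle_distance takes spherical coordinates: theta is the
# longitude and phi is the colatitude (90 degrees minus latitude),
# both in radians; the last argument is the sphere's radius.
sub distance_km {
    my ($lat0, $lon0, $lat1, $lon1) = @_;
    return great_circle_distance(
        deg2rad($lon0), deg2rad(90 - $lat0),
        deg2rad($lon1), deg2rad(90 - $lat1),
        6371,                                 # mean Earth radius, km
    );
}

print distance_km(51.5, -0.1, 48.9, 2.3), "\n";   # London-Paris, ~340 km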

Temperature records

Saturday, December 5th, 2009

The NOAA data is awkwardly arranged. It’s in (literally) thousands of GZip files, one per year per station, stored in yearly TAR files – about 3 GB compressed, so probably double that uncompressed. My PC chokes on them, as the virus checker sees an archive file and decides to look inside, so I’ve had to download them using a shell account on a Linux box. I’ve now got the years 1929-1959 and 1970-1973 on my PC, and 1960-1969 and 1974-2009 on the shell account.

I’ve written a C program (“annual”) to parse the temperature records and calculate the mean (and standard deviation), and a shell script to de-TAR a year’s data to a temp directory and to do

cat "$f" | gzip -d | tail -n +2 | ./annual "$1" >> "Average-$1.txt"

for each station record ($f). The ‘tail’ call strips off the first line (column headers), and $1 passes through the year given to the shell script. All this results in a comma-separated text file containing station ID, year, mean temperature, number of samples, and the standard deviation.

Now I just need another shell script to loop through the lot, and to deal with the files on my home PC as well as the ones on the Linux box… If I ‘nice’ everything, hopefully nobody will notice it running all night!
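The driving loop itself is trivial – a sketch (in Perl here, though plain shell would do just as well; gsod_$year.tar and do-year.sh are stand-in names for the yearly archives and the de-TAR-and-average script above):

#!/usr/bin/perl
use strict;
use warnings;

# Run the per-year script for every year whose archive is present,
# under nice so it can churn away overnight unnoticed.
for my $year (1929 .. 2009) {
    next unless -e "gsod_$year.tar";    # stand-in archive name
    system('nice', '-n', '19', './do-year.sh', $year) == 0
        or warn "year $year failed: $?\n";
}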

Data collection

Saturday, December 5th, 2009

Yesterday and this morning I’ve collected a huge number of DTED files from here.

The terrain information is made available for free from the USGS, but is generally provided on physical media as it’s so large – multiple DVDs for Level 1 DTED. Of course, there’s no such thing as a free lunch: the files are grouped into 10-degree blocks and then arranged by country, which makes them somewhat awkward to find – or a fun geography quiz, if you prefer!

I’ve also cobbled together a bit of Java to produce a map showing which terrain cells I’ve got. Sadly, this shows I’ve got some more work to do: