Client
Pydap can be used as a client to inspect and retrieve data from any dataset stored on a DAP server. Remote datasets can be introspected before any data is retrieved, so the user can download only the variables of interest, restricted for example to a given area and period.
Gridded data
A quick example. Suppose we want to access data from the COADS climatology, available (among other places) at the location http://test.pydap.org/coads.nc. We simply need to open this location using the dap.client.open function:
>>> from dap.client import open
>>> dataset = open('http://test.pydap.org/coads.nc')
Now, you should know that pydap defines a few special types of objects in dap.dtypes. The first one, the type of our variable dataset in this example, is DatasetType. This is just a fancy dictionary, as you can see:
>>> print dataset.keys()
['UWND', 'WSPD', 'SST', 'VWND', 'SLP', 'AIRT', 'SPEH', 'COADSX', 'COADSY', 'TIME']
The values associated with each of these keys are also objects from dap.dtypes. Let's inspect one of them:
>>> print type(dataset['TIME'])
<class 'dap.dtypes.ArrayType'>
An ArrayType is one of the most fundamental pydap types. As you'll see, you can work with these array-like objects as you would work with any other multi-dimensional array. But first, let's inspect it a little more:
>>> time = dataset.TIME # dataset is a "lazy" dictionary
>>> print time.type
Float64
>>> print time.shape
(12,)
>>> print time.dimensions
('TIME',)
>>> print time.units
['hour since 0000-01-01 00:00:00']
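The attribute-style access used above (dataset.TIME as a shorthand for dataset['TIME']) can be pictured with a small dictionary subclass. This is only a sketch of the idea, not pydap's DatasetType; the class name AttrDict and the placeholder values are hypothetical.

```python
class AttrDict(dict):
    """Dictionary whose keys can also be read as attributes (illustrative only)."""

    def __getattr__(self, name):
        # __getattr__ is only called when normal attribute lookup fails,
        # so dict methods such as keys() keep working.
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name)


# Placeholder values stand in for the variable objects a real dataset holds.
dataset = AttrDict(TIME='time variable', SST='sst variable')
same_object = dataset.TIME is dataset['TIME']
```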
A fundamental concept of the DAP is that until this point no data has been downloaded — only metadata. Data is only downloaded when a variable is sliced:
>>> print time[:2]
[ 366. 1096.485]
Here, when we requested the first two values from the time object, the client quickly requested the data from the server in the background; only the two values were requested. This is what makes pydap (or any other DAP client) so useful: it retrieves the data in a transparent and efficient way, giving the impression that the client has the whole dataset stored locally.
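To see why only two values travel over the network, it helps to look at how a slice can be expressed as a DAP2 hyperslab, written [start:stride:stop] with an inclusive stop. The following is a minimal sketch of that translation, not pydap's own code; the function name hyperslab is hypothetical, and only non-empty slices with a positive step are handled.

```python
def hyperslab(index, shape):
    """Translate a Python index into a DAP2 hyperslab such as '[0:1:1]'.

    DAP2 writes each dimension as [start:stride:stop], with an inclusive stop.
    """
    if not isinstance(index, tuple):
        index = (index,)
    if len(index) > len(shape):
        raise IndexError('too many indices')
    parts = []
    for axis, length in enumerate(shape):
        # Dimensions that are not indexed are requested in full.
        key = index[axis] if axis < len(index) else slice(None)
        if isinstance(key, slice):
            start, stop, step = key.indices(length)
        else:
            # A single integer selects one element; negative values
            # count from the end of the dimension.
            start = key + length if key < 0 else key
            stop, step = start + 1, 1
        if step < 1 or start >= stop:
            raise ValueError('only non-empty slices with a positive step are handled')
        # Python's stop is exclusive, DAP2's is inclusive.
        parts.append('[%d:%d:%d]' % (start, step, stop - 1))
    return ''.join(parts)


# The time[:2] request on a variable of shape (12,).
time_request = 'TIME' + hyperslab(slice(None, 2), (12,))
```

For the time[:2] request this yields TIME[0:1:1], the kind of expression a client appends to the dataset URL when it asks the server for data.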
Another type of object present in this dataset is the GridType:
>>> print type(dataset.SST)
<class 'dap.dtypes.GridType'>
A GridType works pretty much like an ArrayType; the major difference is that it has an attribute called maps that holds the axes on which the data is defined:
>>> print dataset.SST.dimensions
('TIME', 'COADSY', 'COADSX')
>>> print dataset.SST.maps
{'TIME': <dap.dtypes.ArrayType object at ...>, 'COADSY': <dap.dtypes.ArrayType object at ...>, 'COADSX': <dap.dtypes.ArrayType object at ...>}
The data is contained in the array attribute:
>>> print dataset.SST.array
<dap.dtypes.ArrayType object at ...>
And we can access data either by slicing the object or its array:
>>> print dataset.SST.array[-1,20,:4]
[[[ 3.8144443 3.7166667 3.58500004 3.68599987]]]
>>> print dataset.SST[-1,20,:4]
[[[ 3.8144443 3.7166667 3.58500004 3.68599987]]]
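Because the maps hold the coordinate values of each axis, they can be used to turn a region of interest into index ranges before any data is sliced. Here is a minimal sketch, assuming an axis sorted in ascending order; axis_slice is a hypothetical helper, and the latitude values are made up rather than read from coads.nc.

```python
from bisect import bisect_left, bisect_right


def axis_slice(axis, lower, upper):
    """Return the slice covering lower <= value <= upper on an ascending axis."""
    return slice(bisect_left(axis, lower), bisect_right(axis, upper))


# Made-up latitude axis: 90 cell centres from -89 to 89 in steps of 2 degrees.
latitudes = [-89.0 + 2.0 * i for i in range(90)]
tropics = axis_slice(latitudes, -10.0, 10.0)
selected = latitudes[tropics]
```

With a real grid, the axis values would presumably come from slicing one of the maps, and the resulting slice object would then be used when indexing the grid, so that only the region of interest is requested.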
Sequential data
Another important pydap type is the SequenceType; these objects hold sequential instances of other variables inside them. An example will make this more clear; let's open the CSV file test.csv (you can open it in your browser to see how it looks):
>>> dataset = open('http://test.pydap.org/test.csv')
>>> print dataset.keys()
['test']
>>> seq = dataset['test']
>>> print type(seq)
<class 'dap.dtypes.SequenceType'>
A Sequence also works like a dictionary, containing additional variables:
>>> print seq.keys()
['id', 'lat', 'lon']
In this case, each of these variables is of type BaseType. Normally, they would hold a single value, but since they are inside a SequenceType they hold a series of records — you can think of them as columns in a database.
We can retrieve data from a Sequence using two different syntaxes. If we only want to download a single variable inside the Sequence, it's easier to slice it directly:
>>> print seq['id'][:]
[1, 2, 3, 4, 5]
Note that we cannot index Sequences; we have to retrieve everything using the [:] slice. Technically the DAP allows indexing Sequences; this is a (current) limitation of pydap since few servers support this.
We can also treat a Sequence as a typical Python iterable; in this case, it will return a series of objects of type StructureType. A Structure is also a dict-like object that holds other variables:
>>> for struct in seq:
...     print struct['lat'].data, struct['lon'].data
10.1000003815 103.0
10.1999998093 93.0
10.3000001907 83.0
10.3999996185 73.0
10.5 63.0
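The two access patterns (one column at a time, or one record at a time) can be pictured with a toy class that stores named columns and yields a record per row when iterated. This is only an illustration of the idea, not pydap's SequenceType; the class name ColumnSequence is hypothetical, and the values are the ones from test.csv shown above, rounded.

```python
class ColumnSequence:
    """Toy sequence: stored as named columns, iterated as records."""

    def __init__(self, columns):
        self.columns = dict(columns)

    def keys(self):
        return list(self.columns)

    def __getitem__(self, name):
        # Column access: all values of a single variable.
        return self.columns[name]

    def __iter__(self):
        # Record access: one dict per row, combining every column.
        names = self.keys()
        for values in zip(*(self.columns[name] for name in names)):
            yield dict(zip(names, values))


seq = ColumnSequence({
    'id': [1, 2, 3, 4, 5],
    'lat': [10.1, 10.2, 10.3, 10.4, 10.5],
    'lon': [103.0, 93.0, 83.0, 73.0, 63.0],
})
ids = seq['id'][:]
records = list(seq)
```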
Filtering Sequences
A great advantage of Sequences is that they can be filtered on the server side using a SQL-like syntax. Pydap has two different syntaxes for filtering Sequences: the sure way, and the fun way.
The sure way requires you to know a little bit about how the DAP works. If you want to retrieve values where lon is smaller than 100, for example, you would do this:
>>> filtered_seq = seq.filter('%s<100' % seq.lon.id)
>>> for struct in filtered_seq:
...     print struct['lat'].data, struct['lon'].data
10.1999998093 93.0
10.3000001907 83.0
10.3999996185 73.0
10.5 63.0
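Behind the scenes, a filter like this ends up in the query string of a DAP data request: the variables to retrieve (the projection) come first, followed by one selection clause per filter, all joined with &. The sketch below shows that pattern under a few assumptions: constraint_url is a hypothetical helper, seq.lon.id is assumed to evaluate to 'test.lon' (sequence name, dot, variable name), and no percent-encoding is applied, which a real client may need to do.

```python
def constraint_url(base, projection, selections=()):
    """Build a DAP2 data request: <base>.dods?<projection>&<selection>&..."""
    query = '&'.join([','.join(projection)] + list(selections))
    return '%s.dods?%s' % (base, query)


# Request the lat column, keeping only records where lon is below 100.
url = constraint_url('http://test.pydap.org/test.csv',
                     ['test.lat'], ['test.lon<100'])
```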
The filter() method accepts any number of filters, each of which must be defined using the variables' ids; that's why we used seq.lon.id to build it. There's an easier way to filter Sequences though: just filter it like you would filter any other Python iterable. You could use a list comprehension, for example:
>>> filtered_seq = [struct for struct in seq if struct['lon'] < 100]
>>> for struct in filtered_seq:
...     print struct['lat'].data, struct['lon'].data
10.1999998093 93.0
10.3000001907 83.0
10.3999996185 73.0
10.5 63.0
Typically, this would download all the data (at the struct for struct in seq part) and then filter it on the client side. But pydap is smart enough to inspect the frame stack, extract the appropriate source code, parse it, build a server-side filter and use it while retrieving the data. Here's the proof, when we add debugging and run it non-interactively:
dataset = open('http://test.opendap.org/test.csv', verbose=1)
http://localhost:8080/test.csv.dds
http://localhost:8080/test.csv.das
filtered_seq = [struct for struct in dataset.test if struct['lon'] < 100]
http://localhost:8080/test.csv.dods?test.id&test.lon<100
http://localhost:8080/test.csv.dods?test.lat&test.lon<100
http://localhost:8080/test.csv.dods?test.lon&test.lon<100
This method also works with generator expressions. The nice thing about it is that if the server-side filter cannot be built, the data will still be filtered on the client side by the list comprehension or generator expression. This means that it will always work, even though sometimes it's not as efficient as possible.
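The parsing step can be illustrated with Python's ast module. The sketch below is not pydap's implementation: it takes the source of a comprehension as a string instead of inspecting the frame stack, it only understands conditions of the form struct['name'] <op> number, and it assumes Python 3.9 or later (where a subscript key is stored directly as a constant node). The function name selections_from_source and its None return value for unsupported expressions are hypothetical choices.

```python
import ast

# Python comparison nodes mapped to DAP2 relational operators.
OPERATORS = {ast.Lt: '<', ast.LtE: '<=', ast.Gt: '>',
             ast.GtE: '>=', ast.Eq: '=', ast.NotEq: '!='}


def selections_from_source(source, sequence_id):
    """Turn the `if` clauses of a comprehension into DAP selection strings.

    Returns None when the expression is too complex, signalling that the
    caller should fall back to client-side filtering.
    """
    node = ast.parse(source.strip(), mode='eval').body
    if not isinstance(node, (ast.ListComp, ast.GeneratorExp)):
        return None
    if len(node.generators) != 1:
        return None
    selections = []
    for test in node.generators[0].ifs:
        # Accept only: <name>['<key>'] <operator> <numeric literal>
        simple = (isinstance(test, ast.Compare)
                  and len(test.ops) == 1
                  and type(test.ops[0]) in OPERATORS
                  and isinstance(test.left, ast.Subscript)
                  and isinstance(test.left.slice, ast.Constant)
                  and isinstance(test.left.slice.value, str)
                  and isinstance(test.comparators[0], ast.Constant)
                  and isinstance(test.comparators[0].value, (int, float)))
        if not simple:
            return None
        selections.append('%s.%s%s%r' % (
            sequence_id, test.left.slice.value,
            OPERATORS[type(test.ops[0])], test.comparators[0].value))
    return selections


clauses = selections_from_source(
    "[struct for struct in seq if struct['lon'] < 100]", 'test')
```

When the function returns None, nothing is lost: the comprehension itself still applies the condition to every record, which mirrors the client-side fallback described above.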