This tutorial demonstrates how to retrieve data from the the CUAHSI-HIS system using PyHIS. It will cover the basics of data access, available convinience functions and some simple scripts using the toolkit.
This tutorial requires PyHIS to be installed on your computer (http://pypi.python.org/pypi/pyhis).
Screenshots and examples in this tutorial are shown using the IPython (http://ipython.org/) interactive python prompt for convinience but IPython is not required to use PyHIS. A Python code editor or IDE is recommended for writing scripts or viewing code. This tutorial follows the convention below. A basic familiarity with Python is useful but the examples have been chosen for readibility and a non python user should be able to follow the tutorial.
Code Snippet to be run in IPython or Python prompt:
a = 5
b = 9
print a + b
# Output:
# In [1]: a = 5
# In [2]: b = 9
# In [3]: print a + b
# 14
The line starting with a # are comment lines. The section following # Output: demonstrates the expected output of the code snippet. You can follow along by cutting and pasting the each code snippet into the IPython prompt.
PyHIS is available on the Python Package Index and can be installed on most systems using either easy_install pyhis or pip install pyhis. PyHIS uses several scientific python plotting and analysis libraries. Although easy_install and pip will attempt to install these automatically it is better to use a scientific python distribution like PythonXY (http://www.pythonxy.com, Windows Only)) or the Enthought Python Distribution - EPD Free (http://enthought.com/products/epd_free.php, Windows/Linux/Mac OS X) prior to installing PyHIS.
To start off lets import pyhis into python:
import pyhis
# Output:
#
# In [1]: import pyhis
# cache initialized with database: sqlite:////var/folders/j0/3qhzgkzn7xgdzh0488y4_b340000gn/T/pyhis_cache.db
This command makes pyhis and all of its abilities available to python. If you look at the output above you can see that by default PyHIS has created a sqlite database to automatically cache (store locally) any data and metadata you access with PyHIS.
If you would like to store the data somewhere else you can provide a path and name for your database:
pyhis.cache.init_cache(cache_database_uri='mycache.db')
# Output:
#
# In [2]: pyhis.cache.init_cache(cache_database_uri='mycache.db')
# cache initialized with database: sqlite:///mycache.db
This stores downloaded data in the current directory in a sqlite file called mycache.db. You can also turn off caching for individual services or use different database backends. This is not covered here.
Accessing CUAHSI-HIS data services requires knowing the WSDL (Web Service Definition Language) of the service you are interested in. This is a very complicated way of saying you need to now the special web link that provides the data. WSDL’s look like this http://river.sdsc.edu/wateroneflow/NWIS/DailyValues.asmx?WSDL . Clicking on that link takes you to a page full of gobbledy gook XML that is impossible to understand. This is what PyHIS needs to make the magic happen.
CUAHSI-HIS maintains a list of registered HIS data services on at HIS Central (http://hiscentral.cuahsi.org/). These can be viewed from within PyHIS by using the following function:
list_of_services = pyhis.his_central.services()
len(list_of_services) #number of available services
list_of_services #to see the list
# Output:
# In [3]: list_of_services = pyhis.his_central.services()
# In [4]: len(list_of_services)
# 76
# In [5]: list_of_services
# [(u'USGS: NWIS Daily Values',
# 'http://river.sdsc.edu/wateroneflow/NWIS/DailyValues.asmx?WSDL',
# 'NWISDV'),
# (u'USGS: NWIS Instantaneous Irregular Data',
# 'http://river.sdsc.edu/wateroneflow/NWIS/Data.asmx?WSDL',
# 'NWISIID'),
# (u'USGS: NWIS Unit Values',
# 'http://river.sdsc.edu/wateroneflow/NWIS/UnitValues.asmx?WSDL',
# 'NWISUV'),
# (u'EPA: EPA STORET',
# 'http://river.sdsc.edu/wateroneflow/EPA/cuahsi_1_0.asmx?WSDL',
# 'EPA'),
# (u'Chesapeake Bay Information Management System: Chesapeake Bay Information Management System',
# 'http://eddy.ccny.cuny.edu/CIMS/cuahsi_1_1.asmx?WSDL',
# 'CIMS'),
# ...
# ]
The above command returns a lust of all available services. You can also request a list of services with data available within a bounding box:
list_of_texas_services = pyhis.his_central.services(xmin=-106.7, ymin=25.5, xmax=-93.4, ymax=36.6)
len(list_of_texas_services) #number of services available within bounding box
list_of_services #to see the list
# Output:
# In [6]: list_of_texas_services = pyhis.his_central.services()
# In [7]: len(list_of_texas_services)
# 27
# In [8]: list_of_texas_services
# [(u'USGS: NWIS Daily Values',
# 'http://river.sdsc.edu/wateroneflow/NWIS/DailyValues.asmx?WSDL',
# 'NWISDV'),
# (u'USGS: NWIS Instantaneous Irregular Data',
# 'http://river.sdsc.edu/wateroneflow/NWIS/Data.asmx?WSDL',
# 'NWISIID'),
# (u'USGS: NWIS Unit Values',
# 'http://river.sdsc.edu/wateroneflow/NWIS/UnitValues.asmx?WSDL',
# 'NWISUV'),
# (u'EPA: EPA STORET',
# 'http://river.sdsc.edu/wateroneflow/EPA/cuahsi_1_0.asmx?WSDL',
# 'EPA'),
# (u'USGS: NWIS Ground Water Level',
# 'http://river.sdsc.edu/wateroneflow/NWIS/Groundwater.asmx?WSDL',
# 'NWISGW'),
# (u'Texas Instream Flow Program: Texas Instream Flow, Lower Sabine',
# 'http://his.crwr.utexas.edu/SabineBio/cuahsi_1_0.asmx?WSDL',
# 'TIFP_LowerSabine'),
# (u'Texas Instream Flow Program: Texas Instream Flow, Lower San Antonio',
# 'http://his.crwr.utexas.edu/SanAntonioBio/cuahsi_1_0.asmx?WSDL',
# 'TIFP_LowerSanAntonio'),
# ...
# ]
The services a returned as a list of tuples each containing the name of the service, the WSDL to access the service and the network name. PyHIS is not limited to the services registered at HIS Central, any WaterOneFlow compliant data service can be accessed as long as your know the WSDL url.
Once you have the WSDL of a service you want to access data from, the service needs to be initialized in pyhis. For the following several examples we will use the USGS NWIS Unit Values service.
Initialize the service:
wsdl_url = 'http://river.sdsc.edu/wateroneflow/NWIS/UnitValues.asmx?WSDL'
nwisuv = pyhis.Service(wsdl_url)
nwisuv #lets see what this is ...
# Output:
#
#
# In [9]: wsdl_url = 'http://river.sdsc.edu/wateroneflow/NWIS/UnitValues.asmx?WSDL'
# In [10]: nwisuv = pyhis.Service(wsdl_url)
# In [11]: nwisuv
# <Service: http://river.sdsc.edu/wateroneflow/NWIS/UnitValues.asmx?WSDL>
This creates a pyhis Service object called nwisuv that knows everything it needs to about the USGS NWIS service.
Now that we have a pyhis Service object nwisuv, lets see what functions are available. If you are using IPython, type nwisuv. and press the tab button:
In [12]: nwisuv.
nwisuv.default_network nwisuv.get_site nwisuv.sites
nwisuv.description nwisuv.get_sites_within_polygon nwisuv.sites_array
nwisuv.generate_sites_array nwisuv.get_sites_within_radius_r nwisuv.suds_client
nwisuv.get_all_sites nwisuv.get_sites_within_shapefile nwisuv.url
You will see a list of functions and attributes attached to the nwisuv object. Lets see what some of these do (Note: Some of these are internal functions that will be hidden in furture versions of PyHIS):
nwisuv.url
# Output: 'http://river.sdsc.edu/wateroneflow/NWIS/UnitValues.asmx?WSDL'
# i.e. the original url we specified
nwisuv.sites
# Out: making GetSites query...
# Out:
{u'01010000': <Site: St. John River at Ninemile Bridge, Maine [01010000]>,
u'01010070': <Site: Big Black River near Depot Mtn, Maine [01010070]>,
u'01010500': <Site: St. John River at Dickey, Maine [01010500]>,
u'01011000': <Site: Allagash River near Allagash, Maine [01011000]>,
u'01011500': <Site: St. Francis River near Connors, New Brunswick [01011500]>,
u'01013500': <Site: Fish River near Fort Kent, Maine [01013500]>,
u'01014000': <Site: St. John River below Fish R, at Fort Kent, Maine [01014000]>,
u'01015800': <Site: Aroostook River near Masardis, Maine [01015800]>,
u'01017000': <Site: Aroostook River at Washburn, Maine [01017000]>,
u'01017060': <Site: Hardwood Brook below Glidden Brk nr Caribou, Mai [01017060]>,
u'01017290': <Site: Little Madawaska River at Caribou, Maine [01017290]>,
...
}
len(nwisuv.sites)
# Out: 11758
Note: The first time you type nwisuv.sites pyhis has to connect to the USGS NWIS Unit Values service and download the list of sites by making a web service request. Hence you will see a message saying ‘making GetSites query’. Retrieving and parsing this data can take a few minutes for some of the larger datasets like the USGS. The next time you use nwisuv.sites or any function that needs the list of sites, PyHIS uses its automated local cache. so the response is immediate. This is true even if you close the Python window and open a new prompt. PyHIS has options to force refreshing of the cache on demand.
Lets find some sites within the Austin, TX area. PyHIS has three convinience functions you can use to narrow down the list of sites you are interested in. These are: get_sites_within_polygon, get_sites_within_shapefile and get_sites_within_radius_r. These do basically what you would expect from the name:
travis_sites = nwis_uv.get_sites_within_shapefile('travis_county.shp')
len(travis_sites)
# Out: 27
travis_sites
# Out: {'08154500/agency=TX071': <Site: LCRA Lk Travis nr Austin, TX [08154500/agency=TX071]>,
# '08154700': <Site: Bull Ck at Loop 360 nr Austin, TX [08154700]>,
# '08154900/agency=TX071': <Site: LCRA Lk Austin at Austin, TX [08154900/agency=TX071]>,
# '08155200': <Site: Barton Ck at SH 71 nr Oak Hill, TX [08155200]>,
# '08155240': <Site: Barton Ck at Lost Ck Blvd nr Austin, TX [08155240]>,
# '08155300': <Site: Barton Ck at Loop 360, Austin, TX [08155300]>,
# '08155400': <Site: Barton Ck abv Barton Spgs at Austin, TX [08155400]>,
# '08155500': <Site: Barton Spgs at Austin, TX [08155500]>,
# '08155541': <Site: W Bouldin Ck at Oltorf Rd, Austin, TX [08155541]>,
# '08156675': <Site: Shoal Ck at Silverway Dr, Austin, TX [08156675]>,
# '08156800': <Site: Shoal Ck at W 12th St, Austin, TX [08156800]>,
# '08156910': <Site: Waller Ck at Koenig Lane, Austin, TX [08156910]>,
# '08158000': <Site: Colorado Rv at Austin, TX [08158000]>,
# '08158030': <Site: Boggy Ck at Manor Rd, Austin, TX [08158030]>,
# '08158035': <Site: Boggy Ck at Webberville Rd, Austin, TX [08158035]>,
# '08158045': <Site: Ft Br Boggy Ck at Manor Rd, Austin, TX [08158045]>,
# '08158200': <Site: Walnut Ck at Dessau Rd, Austin, TX [08158200]>,
# '08158380': <Site: Little Walnut Ck at Georgian Dr, Austin, TX [08158380]>,
# '08158600': <Site: Walnut Ck at Webberville Rd, Austin, TX [08158600]>,
# '08158827': <Site: Onion Ck at Twin Creeks Rd nr Manchaca, TX [08158827]>,
# '08158840': <Site: Slaughter Ck at FM 1826 nr Austin, TX [08158840]>,
# '08158860': <Site: Slaughter Ck at FM 2304 nr Austin, TX [08158860]>,
# '08158920': <Site: Williamson Ck at Oak Hill, TX [08158920]>,
# '08158927': <Site: Kincheon Br at William Cannon Blvd, Austin, TX [08158927]>,
# '08158930': <Site: Williamson Ck at Manchaca Rd, Austin, TX [08158930]>,
# '08158970': <Site: Williamson Ck at Jimmy Clay Rd, Austin, TX [08158970]>,
# '08159000': <Site: Onion Ck at US Hwy 183, Austin, TX [08159000]>}
Lets look at one of the sites:
nwisuv.sites['08158000']
# Out: <Site: Colorado Rv at Austin, TX [08158000]>
This is a PyHIS Site object describing the USGS Gage 08158000 on the Colorado River at Austin TX. Lets see what we can find out about this gage:
mysite = nwisuv.sites['08158000']
mysite. #hit tab
# Out: mysite.code mysite.latitude mysite.network mysite.timeseries
mysite.dataframe mysite.longitude mysite.service
mysite.id mysite.name mysite.site_info_response
Lets look at some of these:
mysite.name
# Out: u'Colorado Rv at Austin, TX'
mysite.code
# Out: u'08158000'
mysite.latitude
# Out: 30.244653701782227
mysite.longitude
# Out: -97.69445037841797
mysite.service
# Out: <Service: http://river.sdsc.edu/wateroneflow/NWIS/UnitValues.asmx?WSDL>
mysite.network
# Out: u'NWISUV'
So the PyHIS site object has pretty much all the metadata you should need about the site you are accessing.
Patience, we are there:
mysite.timeseries
#Out: making GetSiteInfo request for "NWISUV:08158000"...
#Out: {00060: <TimeSeries: Discharge, cubic feet per second (2011-07-18 07:51:46 - 2011-11-15 07:51:46)>,
00065: <TimeSeries: Gage height, feet (2011-07-18 07:51:46 - 2011-11-15 07:51:46)>}
Ah hah! This USGS gage has two available time series datasets; Discharge and Gage height. Also summarized are the available date ranges of the data. 00060 and 00065 are the parameter codes for these timeseries. In the background PyHIS has made a GetSiteInfo webservice request to get the data.
Note: In practice, time period and value counts are not very reliable from service to service depending on how they have been generated by the data provider. It is better to treat them as estimates.
Lets look at some of the metadata about the discharge timeseries:
mydischarge = mysite.timeseries['00060']
mydischarge. #press tab
# Out: mydischarge.begin_datetime mydischarge.quality_control_level mydischarge.value_count
mydischarge.data mydischarge.quantity mydischarge.variable
mydischarge.end_datetime mydischarge.series
mydischarge.method mydischarge.site
mydischarge.value_count
# Out: 11520
So a time
Lets get the discharge data for the entire time period:
mydata = mydischarge.data
# --- OR ---
# mydata = mysite.timeseries['00060'].data #i.e. You can string all the commands together.
#
# Out: making timeseries request for "NWISUV:08158000:00060 (None - None)"...
/Users/dharhas/work/pyhis/pyhis/waterml.py:128: UserWarning: Unit conversion not available for 00060: UNKNOWN [cfs]
(variable_code, quantity, unit_code))
/Users/dharhas/work/pyhis/pyhis/cache.py:816: UserWarning: value_count (11520) doesn't match number of values (11415) for Colorado Rv at Austin, TX:00060
cached_timeseries.variable.code))
Data has been downloaded and placed a pandas time series object. Pandas (http://pandas.sourceforge.net/) is a powerful Python data analysis toolkit. The data has also been cached. Next time this particular data is requested by default only new data will be retrieved from the USGS service. Previously retrieved data will be read from the local cache.
Lets look at the data:
mydata
# Out:
# 2011-07-18 09:00:00 126.0
# 2011-07-18 09:15:00 126.0
# 2011-07-18 09:30:00 139.0
# 2011-07-18 09:45:00 216.0
# 2011-07-18 10:00:00 372.0
# 2011-07-18 10:15:00 553.0
# 2011-07-18 10:30:00 714.0
# ...
# 2011-11-15 08:00:00 101.0
# 2011-11-15 08:15:00 105.0
# 2011-11-15 08:30:00 105.0
# 2011-11-15 08:45:00 108.0
# 2011-11-15 09:00:00 108.0
# 2011-11-15 09:15:00 108.0
# 2011-11-15 09:30:00 108.0
# 2011-11-15 09:45:00 108.0
# length: 11415
mydata. #press tab
# Out: mydata.
# Display all 131 possibilities? (y or n)
# mydata.T mydata.data mydata.map mydata.shift
# mydata.add mydata.describe mydata.max mydata.size
# mydata.all mydata.diagonal mydata.mean mydata.skew
# mydata.any mydata.diff mydata.median mydata.sort
# mydata.append mydata.div mydata.merge mydata.sort_index
# mydata.apply mydata.dot mydata.min mydata.sortlevel
# mydata.applymap mydata.drop mydata.mul mydata.squeeze
# mydata.argmax mydata.dropna mydata.nbytes mydata.std
# mydata.argmin mydata.dtype mydata.ndim mydata.strides
# mydata.argsort mydata.dump mydata.newbyteorder mydata.sub
# mydata.asOf mydata.dumps mydata.nonzero mydata.sum
# mydata.asfreq mydata.fill mydata.order mydata.swapaxes
# mydata.asof mydata.fillna mydata.plot mydata.swaplevel
# mydata.astype mydata.first_valid_index mydata.prod mydata.take
# mydata.autocorr mydata.flags mydata.ptp mydata.toCSV
# mydata.base mydata.flat mydata.put mydata.toDict
# mydata.byteswap mydata.flatten mydata.quantile mydata.toString
# mydata.choose mydata.fromValue mydata.ravel mydata.to_csv
# mydata.clip mydata.get mydata.real mydata.to_dict
# mydata.clip_lower mydata.getfield mydata.reindex mydata.to_sparse
# mydata.clip_upper mydata.groupby mydata.reindex_like mydata.tofile
# mydata.combine mydata.hist mydata.rename mydata.tolist
# mydata.combineFirst mydata.imag mydata.repeat mydata.tostring
# mydata.combine_first mydata.index mydata.reshape mydata.trace
# mydata.compress mydata.interpolate mydata.resize mydata.transpose
# mydata.conj mydata.item mydata.round mydata.truncate
# mydata.conjugate mydata.itemset mydata.save mydata.unstack
# mydata.copy mydata.itemsize mydata.searchsorted mydata.valid
# mydata.corr mydata.iteritems mydata.select mydata.values
# mydata.count mydata.ix mydata.setasflat mydata.var
# mydata.ctypes mydata.keys mydata.setfield mydata.view
# mydata.cumprod mydata.last_valid_index mydata.setflags mydata.weekday
# mydata.cumsum mydata.load mydata.shape
As you can see there is a pretty long List of functions available for a pandas timeseries object. Lets try a few:
mydata.min()
# Out: 52.0
mydata.max()
# Out: 2850.0
mydata.median()
# Out: 548.0
mydata.std()
# Out: 774.01376212573985
# or just look at all the basic stats
mydata.describe()
# Out: count 11415.0
mean 872.066579063
std 774.013762126
min 52.0
10% 108.0
50% 548.0
90% 2030.0
max 2850.0
mydata.cumsum()
# Out: 2011-07-18 09:00:00 126.0
2011-07-18 09:15:00 252.0
2011-07-18 09:30:00 391.0
2011-07-18 09:45:00 607.0
2011-07-18 10:00:00 979.0
2011-07-18 10:15:00 1532.0
2011-07-18 10:30:00 2246.0
...
2011-11-15 08:30:00 9954100.0
2011-11-15 08:45:00 9954208.0
2011-11-15 09:00:00 9954316.0
2011-11-15 09:15:00 9954424.0
2011-11-15 09:30:00 9954532.0
2011-11-15 09:45:00 9954640.0
length: 11415
# Plot data
mydata.plot()
# Plot Cumulative Sum
mydata.cumsum().plot()
# save data to csv file
mydata.to_csv('mydata.csv')
Look at tut_example1.py:
"""
Example script that returns a list of USGS gages within Travis county area whose discharge is below
20th percentile.
"""
#Initialize pyhis
import pyhis
#Initialize some other python modules for plotting etc
import matplotlib.pyplot as plt
# Connect to the USGS NWIS Unit values and Daily values data services.
# The daily values service will be used to calculate historical percentiles
# and the unit values service provides 15 mn data
nwis_dv = pyhis.Service('http://river.sdsc.edu/wateroneflow/NWIS/DailyValues.asmx?WSDL')
nwis_uv = pyhis.Service('http://river.sdsc.edu/wateroneflow/NWIS/UnitValues.asmx?WSDL')
# get sites within Travis county
travis_sites = nwis_uv.get_sites_within_shapefile('travis_county.shp')
# target percentile (this could be made into a input parameter)
target_percentile = 0.2 #(i.e. 20%)
# start with an empty list
list_of_critical_gages = []
# loop through the sites in Travis County
for site in travis_sites:
print 'processing site: ', site
#check if there is discharge data atthis site
if '00060' in nwis_uv.sites[site].timeseries:
print 'Discharge data available'
# calculate the target quantile from historical daily value data
threshold = nwis_dv.sites[site].timeseries['00060/DataType=Average'].data.quantile(target_percentile)
print 'threshold value: ' , threshold
#if the latest downloaded data is below threshold then add to list of critical gages
if nwis_uv.sites[site].timeseries['00060'].data[-1] < threshold:
print 'adding gage to critical gage list: ', site
list_of_critical_gages.append(nwis_uv.sites[site])
#plot discharge at the critical gages
plt.figure()
for gage in list_of_critical_gages:
gage.timeseries['00060'].data.plot(label=gage.name)
plt.legend()
plt.ylabel = 'Discharge (cfs)'
plt.title('Critical Gages in Travis County')
plt.savefig('travis_county_critical_gages.png')