Getting started with silx.io
============================

This tutorial explains how to read data files using the :meth:`silx.io.open` function.

The target audience is developers without prior knowledge of the *h5py* library.

If you are already familiar with *h5py*, you just need to know that
the :meth:`silx.io.open` function returns objects that mimic *h5py* file objects,
and that the main supported file formats are:

  - HDF5
  - all formats supported by the *FabIO* library
  - SPEC data files

Familiarity with the python *dictionary* type and the numpy *ndarray* type
is a prerequisite for this tutorial.


Background
----------

In the past, reading multiple data formats required learning multiple
libraries. The *FabIO* library was designed to read images in many formats, but not to read
more heterogeneous formats, such as *HDF5* or *SPEC*.

To read *SPEC* data files in Python, a common solution was to use the *PyMca* module
:mod:`PyMca5.PyMcaIO.specfilewrapper`.
Regarding HDF5 files, the de facto standard for reading them in Python is the
*h5py* library.

*silx* tries to address this situation by providing a unified way to read all
data formats supported at the ESRF. Today, HDF5 is the preferred format for
storing data at many scientific institutions, including most synchrotrons.
So it was decided to provide tools for reading data that mimic the *h5py* library's API.


Definitions
-----------

HDF5
++++

The *HDF5* format is a *hierarchical data format*, designed to store and
organize large amounts of data.

An HDF5 file contains a number of *datasets*, which are multidimensional arrays
of a homogeneous type.

These datasets are stored in container structures
called *groups*. Groups can themselves be stored in other groups, defining
a hierarchical tree structure.

Both datasets and groups may have *attributes* attached to them. Attributes are
used to document the object. They are similar to datasets in several ways
(data container of homogeneous type), but they are typically much smaller.

A common analogy is to compare an HDF5 file to a filesystem.
Groups are analogous to directories, while datasets are analogous to files,
and attributes are analogous to file metadata (creation date, last modification...).

.. image:: img/silx_view_edf.png
    :width: 400px


h5py
++++

The *h5py* library is a Pythonic interface to the `HDF5`_ binary data format.

It exposes an HDF5 group as a python object that resembles a python
dictionary, and an HDF5 dataset or attribute as an object that resembles a
numpy array.
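
The dict-like / array-like behavior can be sketched with plain *h5py*. This is
a minimal example, not part of the silx API; the file path and the
group/dataset names below are made up for illustration:

```python
import os
import tempfile

import h5py
import numpy as np

# Create a small HDF5 file to illustrate the dict-like / array-like API.
# The path and the names "measurement" and "Phi" are illustrative only.
path = os.path.join(tempfile.mkdtemp(), "example.h5")
with h5py.File(path, "w") as f:
    grp = f.create_group("measurement")              # group: dict-like container
    grp.create_dataset("Phi", data=np.arange(5.0))   # dataset: array-like

with h5py.File(path, "r") as f:
    print(list(f.keys()))            # the member names at the root
    print(f["measurement/Phi"][:])   # the dataset values, as a numpy array
```

Since silx.io objects mimic this API, the reading half of this sketch works
the same way on a file object returned by :meth:`silx.io.open`.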

API description
---------------

All main objects, File, Group and Dataset, share the following attributes:

 - :attr:`attrs`: Attributes, as a dictionary of metadata for the group or dataset.
 - :attr:`basename`: String giving the basename of this group or dataset.
 - :attr:`name`: String giving the full path to this group or dataset, relative
   to the root group (file).
 - :attr:`file`: File object at the root of the tree structure containing this
   group or dataset.
 - :attr:`parent`: Group object containing this group or dataset.
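
These shared attributes can be explored with a short sketch, written here with
plain *h5py* (whose objects expose the same attributes, except
:attr:`basename`, which is silx-specific); the file path and member names are
made up:

```python
import os
import tempfile

import h5py

# Sketch of the shared attributes, using h5py directly.
# "instrument", "Delta" and the "units" attribute are illustrative names.
path = os.path.join(tempfile.mkdtemp(), "attrs_demo.h5")
with h5py.File(path, "w") as f:
    dset = f.create_group("instrument").create_dataset("Delta", data=0.0)
    dset.attrs["units"] = "degrees"   # attach a metadata attribute

    print(dset.name)          # full path: '/instrument/Delta'
    print(dset.parent.name)   # containing group: '/instrument'
    print(dict(dset.attrs))   # metadata: {'units': 'degrees'}
```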

File object
+++++++++++

The API of the file objects returned by the :meth:`silx.io.open`
function tries to be as close as possible to the API of the :class:`h5py.File`
objects used to read HDF5 data.

An h5py file is a group with just a few extra attributes and methods.

The objects defined in `silx.io` implement a subset of these attributes and methods:

 - :attr:`filename`: Name of the file on disk.
 - :attr:`mode`: String indicating if the file is open in read mode ("r")
   or write mode ("w"). :meth:`silx.io.open` always returns objects in read mode.
 - :meth:`close`: Close this file. All open objects will become invalid.

The :attr:`parent` of a file is `None`, and its :attr:`name` is an empty string.

Group object
++++++++++++

Group objects behave like python dictionaries.

You can iterate over a group's :meth:`keys`, which are the names of the objects
encapsulated by the group (datasets and sub-groups). The :meth:`values` method
returns an iterator over the encapsulated objects. The :meth:`items` method returns
an iterator over `(name, value)` pairs.

Groups provide a :meth:`get` method that retrieves an item, or information about an item.
Like standard python dictionaries, a `default` parameter can be used to specify
a value to be returned if the given name is not a member of the group.
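
A sketch of :meth:`get` with a default, written with plain *h5py* (silx.io
groups provide the same method); the file and member names are made up:

```python
import os
import tempfile

import h5py

# Group.get behaves like dict.get: it returns a default instead of raising
# when the name is not a member. "title" and the file path are illustrative.
path = os.path.join(tempfile.mkdtemp(), "get_demo.h5")
with h5py.File(path, "w") as f:
    f.create_dataset("title", data=b"ascan phi 0.61 1.61 20 1")

with h5py.File(path, "r") as f:
    title = f.get("title")                        # the dataset object
    missing = f.get("not_a_member", default=-1)   # -1: name not in the group
    print(title, missing)
```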

Two methods are provided to recursively visit all members of a group, :meth:`visit`
and :meth:`visititems`. The former takes as argument a *callable* with the signature
``callable(name) -> None or return value``. The latter takes as argument a *callable*
with the signature ``callable(name, object) -> None or return value`` (``object`` being
a group or dataset instance).
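
A sketch of :meth:`visititems`, written with plain *h5py* (silx.io groups
provide the same two methods); the file and member names are made up:

```python
import os
import tempfile

import h5py

# visititems walks the whole tree, calling the callable once per member.
# The callable returns None here, so the walk is never cut short.
path = os.path.join(tempfile.mkdtemp(), "visit_demo.h5")
with h5py.File(path, "w") as f:
    f.create_group("instrument").create_dataset("Delta", data=0.0)
    f.create_dataset("title", data=b"scan")

visited = []
with h5py.File(path, "r") as f:
    f.visititems(lambda name, obj: visited.append(name))
print(sorted(visited))   # ['instrument', 'instrument/Delta', 'title']
```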

Example
-------

Accessing data
++++++++++++++

In this first example, we open a SPEC data file and print some information about it.

.. code-block:: python

    >>> import silx.io
    >>> sf = silx.io.open("data/CuZnO_2.spec")
    >>> sf
    <silx.io.spech5.SpecH5 at 0x7f00d0760f90>
    >>> print(sf.keys())
    ['1.1', '2.1', '3.1', '4.1', '5.1', '6.1', '7.1', ...]
    >>> print(sf["1.1"])
    <silx.io.spech5.ScanGroup object at 0x7f00d0715b90>


We just opened a file, keeping a reference to the file object as ``sf``.
We then printed the names of all items contained in this root group. We can see
that all these items are groups. Let's look at what is inside these groups, and
find datasets:


.. code-block:: python

    >>> grp = sf["2.1"]
    >>> for name in grp:
    ...     item = grp[name]
    ...     print("Found item " + name)
    ...     if silx.io.is_dataset(item):
    ...         print("'%s' is a dataset.\n" % name)
    ...     elif silx.io.is_group(item):
    ...         print("'%s' is a group.\n" % name)
    ...
    Found item title
    'title' is a dataset.

    Found item start_time
    'start_time' is a dataset.

    Found item instrument
    'instrument' is a group.

    Found item measurement
    'measurement' is a group.

    Found item sample
    'sample' is a group.

We could have replaced the first three lines with this single line,
by iterating over the ``(name, item)`` pairs returned by the group method :meth:`items`:

.. code-block:: python

    >>> for name, item in sf["2.1"].items():
    ...

In addition to :meth:`silx.io.is_group` and :meth:`silx.io.is_dataset`,
you can also use :meth:`silx.io.is_file` and :meth:`silx.io.is_softlink`.


Let's look at a dataset:

.. code-block:: python

    >>> print(sf["2.1/title"])
    <HDF5-like dataset "title": shape (), type "|S29">

As you can see, printing a dataset does not print the data itself; it only prints a
representation of the dataset object. The information printed tells us that the
object is similar to a numpy array, with a *shape* and a *type*.

In this case, we are dealing with a scalar dataset, so we can use the same syntax as
in numpy to access the scalar value, ``result = dset[()]``:

.. code-block:: python

    >>> print(sf["2.1/title"][()])
    2  ascan  phi 0.61 1.61  20 1

Similarly, you need to use numpy slicing to access values in a numeric array:

.. code-block:: python

    >>> print(sf["2.1/measurement/Phi"])
    <HDF5-like dataset "Phi": shape (21,), type "<f4">
    >>> print(sf["2.1/measurement/Phi"][0:10])
    [ 0.61000001  0.66000003  0.70999998  0.75999999  0.81        0.86000001
      0.91000003  0.95999998  1.00999999  1.05999994]
    >>> entire_phi_array = sf["2.1/measurement/Phi"][:]

Here we could read the entire array by slicing it with ``[:]``, because we know
it is a 1D array. For a 2D array, the slicing argument would have been ``[:, :]``.
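
The 2D case can be sketched with plain *h5py* (the same numpy slicing syntax
applies to silx.io datasets); the file path and dataset name are made up:

```python
import os
import tempfile

import h5py
import numpy as np

# Slicing a 2D dataset with numpy syntax. "image" is an illustrative name.
path = os.path.join(tempfile.mkdtemp(), "slice_demo.h5")
with h5py.File(path, "w") as f:
    f.create_dataset("image", data=np.arange(12).reshape(3, 4))

with h5py.File(path, "r") as f:
    full = f["image"][:, :]       # the whole 2D array
    row = f["image"][0, :]        # the first row, a 1D array
    block = f["image"][1:3, 0:2]  # a 2x2 sub-block
print(full.shape, row.shape, block.shape)   # (3, 4) (4,) (2, 2)
```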

For a dataset of unknown dimensionality (including scalar datasets), the
``Ellipsis`` object (represented by ``...``) can be used to slice the object.

.. code-block:: python

    >>> print(sf["2.1/title"][...])
    2  ascan  phi 0.61 1.61  20 1
    >>> print(sf["2.1/measurement/Phi"][...])
    [ 0.61000001  0.66000003  0.70999998  0.75999999  0.81        0.86000001
      0.91000003  0.95999998  1.00999999  1.05999994  1.11000001  1.15999997
      1.21000004  1.25999999  1.30999994  1.36000001  1.40999997  1.46000004
      1.50999999  1.55999994  1.61000001]

To read more about the usage of ``Ellipsis`` to slice arrays, see
`Indexing numpy arrays <http://scipy-cookbook.readthedocs.io/items/Indexing.html?highlight=indexing#Multidimensional-slices>`_
in the SciPy Cookbook.

Note that slicing a scalar dataset with ``[()]`` is not strictly equivalent to
slicing with ``[...]``. The former gives you the actual scalar value in
the dataset, while the latter always gives you an array object, which happens to
be 0D in the case of a scalar.

.. code-block:: python

    >>> sf["2.1/instrument/positioners/Delta"][()]
    0.0
    >>> sf["2.1/instrument/positioners/Delta"][...]
    array(0.0, dtype=float32)

Closing the file
++++++++++++++++

You should always make sure to close the files that you opened. The simplest way
to close a file is to call its :meth:`close` method.

.. code-block:: python

    import silx.io
    sf = silx.io.open("data/CuZnO_2.spec")

    # read the information you need...
    maxPhi = sf["2.1/measurement/Phi"][...].max()

    sf.close()

The drawback of this approach is that, if an error is raised while processing
the file, the program may never reach the ``sf.close()`` line.
Files left open can cause various issues in the rest of your program,
such as wasted memory or the inability to reopen the file when you need it.

The best way to ensure the file is always properly closed is to use the file
object as a context manager:

.. code-block:: python

    import silx.io

    with silx.io.open("data/CuZnO_2.spec") as sf:
        # read the information you need...
        maxPhi = sf["2.1/measurement/Phi"][...].max()


Additional resources
--------------------

- `h5py documentation <http://docs.h5py.org/en/latest/>`_
- `Formats supported by FabIO <http://www.silx.org/doc/fabio/dev/getting_started.html#list-of-file-formats-that-fabio-can-read-and-write>`_
- `Spec file h5py-like structure <http://www.silx.org/doc/silx/dev/modules/io/spech5.html#api-description>`_
- `HDF5 format documentation <https://support.hdfgroup.org/HDF5/>`_