Calculating hashes of IMAS data
IMAS-Python can calculate hashes of IMAS data. Wikipedia defines a hash function as follows:
A hash function is any function that can be used to map data of arbitrary size to fixed-size values, […]. The values returned by a hash function are called hash values, hash codes, hash digests, digests, or simply hashes.
IMAS-Python uses the XXH3 hash function from the xxHash project. This is a non-cryptographic hash function that returns 64-bit hashes.
Use cases
Hashes of IMAS data are most useful as checksums: when the hashes of two IDSs match, it is very likely that they contain identical data. [1] This helps to verify data integrity and to detect whether data has been accidentally corrupted or altered.
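To give a rough sense of how likely an accidental match is, the sketch below evaluates the standard birthday-bound approximation for a 64-bit hash. It assumes an idealized hash whose outputs are uniformly distributed, which is only an approximation of XXH3, and the function name `collision_probability` is hypothetical, not part of IMAS-Python.

```python
import math

HASH_BITS = 64  # hash size stated above for XXH3 as used by IMAS-Python


def collision_probability(n_items, bits=HASH_BITS):
    """Approximate chance that at least two of n_items distinct inputs collide.

    Birthday-bound approximation for an idealized, uniformly distributed hash:
    p = 1 - exp(-n * (n - 1) / 2**(bits + 1))
    """
    return -math.expm1(-n_items * (n_items - 1) / 2 ** (bits + 1))


pair = collision_probability(2)         # two different IDSs: 2**-64, about 5.4e-20
million = collision_probability(10**6)  # any collision among a million IDSs: about 2.7e-8

print(f"two items:       {pair:.3e}")
print(f"one million items: {million:.3e}")
```

Under this assumption, a matching hash is strong evidence of identical data for accidental corruption. It is not a guarantee against deliberate tampering, because the hash is non-cryptographic.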
Exercise 1: Calculate some hashes
In this exercise we will use imas.util.calc_hash() to calculate hashes of some IDSs. Use bytes.hex() to show the hash in a more readable hexadecimal format.
1. Create an empty equilibrium IDS and print its hash.
2. Now fill ids_properties.homogeneous_time and print the hash. Did it change?
3. Resize the time_slice Array of Structures to size 2. Calculate the hash of time_slice[0] and time_slice[1]. What do you notice?
4. Resize time_slice[0].profiles_2d to size 1. For convenience, you can create a variable p2d = time_slice[0].profiles_2d[0].
5. Fill p2d.r = [[1., 2.]] and p2d.z = p2d.r, then calculate their hashes. What do you notice?
6. del p2d.z and calculate the hash of p2d. Then set p2d.z = p2d.r and del p2d.r. What do you notice?
import imas
# 1. Create IDS
eq = imas.IDSFactory().equilibrium()
print(imas.util.calc_hash(eq).hex(' ', 2)) # 2d06 8005 38d3 94c2
# 2. Update homogeneous_time
eq.ids_properties.homogeneous_time = 0
print(imas.util.calc_hash(eq).hex(' ', 2)) # 3b9b 9297 56a2 42fd
# Yes: the hash changed (significantly!). This was expected, because the data is no
# longer the same
# 3. Resize time_slice
eq.time_slice.resize(2)
print(imas.util.calc_hash(eq.time_slice[0]).hex(' ', 2)) # 2d06 8005 38d3 94c2
print(imas.util.calc_hash(eq.time_slice[1]).hex(' ', 2)) # 2d06 8005 38d3 94c2
# What do you notice?
#
# The hashes of both time_slice[0] and time_slice[1] are identical, because both
# contain no data.
#
# The hashes are also identical to the empty IDS hash from step 1. An IDS, or a
# structure within an IDS, that has no fields filled will always have this hash value.
# 4. Resize profiles_2d
eq.time_slice[0].profiles_2d.resize(1)
p2d = eq.time_slice[0].profiles_2d[0]
# 5. Fill data
p2d.r = [[1., 2.]]
p2d.z = p2d.r
print(imas.util.calc_hash(p2d.r).hex(' ', 2)) # 352b a6a6 b40c 708d
print(imas.util.calc_hash(p2d.z).hex(' ', 2)) # 352b a6a6 b40c 708d
# These hashes are identical, because they contain the same data
# 6. Only r or z
del p2d.z
print(imas.util.calc_hash(p2d).hex(' ', 2)) # 0dcb ddaa 78ea 83a3
p2d.z = p2d.r
del p2d.r
print(imas.util.calc_hash(p2d).hex(' ', 2)) # f86b 8ea8 9652 3768
# Although the data inside `r` and `z` is identical, we get different hashes because the
# data is in a different attribute.
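The solution formats every hash with bytes.hex(' ', 2). As a small standalone illustration using only the standard library (the separator arguments require Python 3.8 or newer), the snippet below shows what those arguments do. The digest bytes are copied from the empty-IDS hash printed in step 1; they are not computed here.

```python
# Eight bytes (64 bits), copied from the hash printed for the empty IDS in step 1.
digest = bytes.fromhex("2d06800538d394c2")

plain = digest.hex()          # no separator: one continuous hex string
grouped = digest.hex(" ", 2)  # a space after every 2 bytes (4 hex digits)

print(len(digest) * 8)  # number of bits in the digest
print(plain)
print(grouped)
```

Grouping the digits makes it easier to compare two hashes by eye, which is why the solution uses it.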
Properties of IMAS-Python’s hashes
The implementation of the hash function has the following properties:
Only fields that are filled are included in the hash.
If a newer version of the Data Dictionary introduces additional data fields, then this won’t affect the hash of your data.
As long as there are no Non Backwards Compatible changes in the Data Dictionary for the filled fields, the data hashes should not change.
The ids_properties/version_put structure is not included in the hash. This means that the precise Access Layer version, Data Dictionary version or high-level interface that was used to store the data does not affect the hash of the data.
Hashes are different for ND arrays with different shapes that share the same underlying data.
For example, the following arrays are stored the same way in your RAM, but they result in different hashes:
array1 = [1, 2]
array2 = [[1, 2]]
array3 = [[1], [2]]
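One way a hash can tell such arrays apart is by mixing the shape into the hashed bytes. The sketch below is a toy illustration of that idea only: it uses BLAKE2b from the standard library as a stand-in for XXH3, and the function `toy_array_hash` and its byte layout are assumptions for illustration. The actual algorithm is specified in the documentation of imas.util.calc_hash().

```python
import hashlib
import struct


def toy_array_hash(shape, flat_values):
    """Toy 64-bit digest over (ndim, shape, data). NOT the IMAS-Python algorithm."""
    h = hashlib.blake2b(digest_size=8)  # stand-in for XXH3, also 8 bytes
    h.update(struct.pack("<B", len(shape)))                       # number of dimensions
    h.update(struct.pack(f"<{len(shape)}q", *shape))              # size of each dimension
    h.update(struct.pack(f"<{len(flat_values)}q", *flat_values))  # flattened data
    return h.digest()


flat = [1, 2]  # the same flattened data for all three arrays

hash1 = toy_array_hash((2,), flat)    # array1 = [1, 2]
hash2 = toy_array_hash((1, 2), flat)  # array2 = [[1, 2]]
hash3 = toy_array_hash((2, 1), flat)  # array3 = [[1], [2]]

for digest in (hash1, hash2, hash3):
    print(digest.hex(" ", 2))
```

Because the shape is part of the input, the three digests differ even though the data bytes are identical.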
Technical details and specification
You can find the technical details, and a specification for calculating the hashes, in
the documentation of imas.util.calc_hash().