Using dask¶

dask is a Python package built upon the scientific stack to enable scaling of Python from interactive sessions up to multi-core and multi-node environments.

Of particular relevance to SEGY-SAK is that xarray.Dataset loads naturally into dask.

Imports and Setup¶

Here we import the plotting tools and numpy, and set up the dask.Client, which will automatically start a LocalCluster. Printing the client returns details about the dashboard link and resources.

In [1]:

import warnings

warnings.filterwarnings("ignore")

In [2]:

import numpy as np
from segysak import open_seisnc, segy

import matplotlib.pyplot as plt

%matplotlib inline

In [3]:

from dask.distributed import Client

client = Client()
client

Out[3]:

Client

Client-34f5802c-a544-11ef-88e2-b5f27fe4abd9

Connection method: Cluster object	Cluster type: distributed.LocalCluster
Dashboard: http://127.0.0.1:8787/status

Cluster Info

LocalCluster

be4189e6

Dashboard: http://127.0.0.1:8787/status	Workers: 4
Total threads: 4	Total memory: 15.61 GiB
Status: running	Using processes: True

Scheduler Info

Scheduler

Scheduler-9011cbbf-fb8a-4662-afc0-a1ba541ac1c4

Comm: tcp://127.0.0.1:44647	Workers: 4
Dashboard: http://127.0.0.1:8787/status	Total threads: 4
Started: Just now	Total memory: 15.61 GiB

Workers

Worker: 0

Comm: tcp://127.0.0.1:42857	Total threads: 1
Dashboard: http://127.0.0.1:36189/status	Memory: 3.90 GiB
Nanny: tcp://127.0.0.1:37025
Local directory: /tmp/dask-scratch-space/worker-9fepqgzt

Worker: 1

Comm: tcp://127.0.0.1:45883	Total threads: 1
Dashboard: http://127.0.0.1:44565/status	Memory: 3.90 GiB
Nanny: tcp://127.0.0.1:40325
Local directory: /tmp/dask-scratch-space/worker-xqq1u9cr

Worker: 2

Comm: tcp://127.0.0.1:37657	Total threads: 1
Dashboard: http://127.0.0.1:45221/status	Memory: 3.90 GiB
Nanny: tcp://127.0.0.1:44979
Local directory: /tmp/dask-scratch-space/worker-9mzpod7o

Worker: 3

Comm: tcp://127.0.0.1:39375	Total threads: 1
Dashboard: http://127.0.0.1:46029/status	Memory: 3.90 GiB
Nanny: tcp://127.0.0.1:38499
Local directory: /tmp/dask-scratch-space/worker-ztbo6i8y
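Instead of relying on the defaults, the size of the local cluster can be set when the client is created. The following is a minimal sketch only: the worker counts, processes=False (worker threads inside the current process rather than separate worker processes) and dashboard_address=None (no dashboard) are assumptions chosen to keep the example lightweight, not settings used in this notebook.

```python
from dask.distributed import Client

# Assumed settings for illustration: two single-threaded workers running
# as threads in this process, with the dashboard switched off.
with Client(
    n_workers=2,
    threads_per_worker=1,
    processes=False,
    dashboard_address=None,
) as client:
    # Ask the scheduler how many workers it knows about.
    n_workers = len(client.scheduler_info()["workers"])
    # Submit a trivial task to confirm the cluster executes work.
    total = client.submit(sum, [1, 2, 3]).result()

print(n_workers, total)
```

Using the client as a context manager shuts the local cluster down again when the block exits, which is convenient in scripts. In an interactive session you would normally keep the client open, as in the cell above.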

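We can also scale the cluster to be a bit smaller. Scaling is requested asynchronously, so the summary printed straight afterwards may still list the original workers.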

In [4]:

client.cluster.scale(2, memory="0.5gb")
client

Out[4]:

Client

Client-34f5802c-a544-11ef-88e2-b5f27fe4abd9

Connection method: Cluster object	Cluster type: distributed.LocalCluster
Dashboard: http://127.0.0.1:8787/status

Cluster Info

LocalCluster

be4189e6

Dashboard: http://127.0.0.1:8787/status	Workers: 4
Total threads: 4	Total memory: 15.61 GiB
Status: running	Using processes: True

Scheduler Info

Scheduler

Scheduler-9011cbbf-fb8a-4662-afc0-a1ba541ac1c4

Comm: tcp://127.0.0.1:44647	Workers: 4
Dashboard: http://127.0.0.1:8787/status	Total threads: 4
Started: Just now	Total memory: 15.61 GiB

Workers

Worker: 0

Comm: tcp://127.0.0.1:42857	Total threads: 1
Dashboard: http://127.0.0.1:36189/status	Memory: 3.90 GiB
Nanny: tcp://127.0.0.1:37025
Local directory: /tmp/dask-scratch-space/worker-9fepqgzt

Worker: 1

Comm: tcp://127.0.0.1:45883	Total threads: 1
Dashboard: http://127.0.0.1:44565/status	Memory: 3.90 GiB
Nanny: tcp://127.0.0.1:40325
Local directory: /tmp/dask-scratch-space/worker-xqq1u9cr

Worker: 2

Comm: tcp://127.0.0.1:37657	Total threads: 1
Dashboard: http://127.0.0.1:45221/status	Memory: 3.90 GiB
Nanny: tcp://127.0.0.1:44979
Local directory: /tmp/dask-scratch-space/worker-9mzpod7o

Worker: 3

Comm: tcp://127.0.0.1:39375	Total threads: 1
Dashboard: http://127.0.0.1:46029/status	Memory: 3.90 GiB
Nanny: tcp://127.0.0.1:38499
Local directory: /tmp/dask-scratch-space/worker-ztbo6i8y

Lazy loading from SEISNC using chunking¶

If your data is in SEG-Y, it must be converted to SEISNC before it can be used with dask. If you do this with the CLI, it only needs to happen once.

In [5]:

segy_file = "data/volve10r12-full-twt-sub3d.sgy"
seisnc_file = "data/volve10r12-full-twt-sub3d.seisnc"
segy.segy_converter(segy_file, seisnc_file, iline=189, xline=193, cdp_x=181, cdp_y=185)

header_loaded
is_3d
Fast direction is CROSSLINE_3D

By specifying the chunks argument to the open_seisnc command we can ask dask to fetch the data in chunks of size n. In this example the iline dimension will be chunked in groups of 100. The valid arguments to chunks depend on the dataset, but any dimension can be used.

Even though the dataset is about 40 MB in size (as reported below), it hasn't yet been loaded into memory, nor will dask load it entirely unless the operation demands it.

In [6]:

seisnc = open_seisnc("data/volve10r12-full-twt-sub3d.seisnc", chunks={"iline": 100})
seisnc.seis.humanbytes

Out[6]:

'40.05 MB'

Let's see what our dataset looks like. Note that the variables are dask.array objects, which means they are references to the on-disk data. The dimensions must be loaded so dask knows how to manage your dataset.

In [7]:

seisnc

Out[7]:

<xarray.Dataset> Size: 42MB
Dimensions:  (iline: 61, xline: 202, twt: 850)
Coordinates:
  * iline    (iline) int16 122B 10090 10091 10092 10093 ... 10148 10149 10150
  * xline    (xline) int16 404B 2150 2151 2152 2153 2154 ... 2348 2349 2350 2351
  * twt      (twt) float64 7kB 4.0 8.0 12.0 16.0 ... 3.392e+03 3.396e+03 3.4e+03
    cdp_x    (iline, xline) float32 49kB dask.array<chunksize=(61, 202), meta=np.ndarray>
    cdp_y    (iline, xline) float32 49kB dask.array<chunksize=(61, 202), meta=np.ndarray>
Data variables:
    data     (iline, xline, twt) float32 42MB dask.array<chunksize=(61, 202, 850), meta=np.ndarray>
Attributes: (12/17)
    sample_rate:         4.0
    text:                C 1 SEGY OUTPUT FROM Petrel 2017.2 Saturday, June 06...
    measurement_system:  m
    source_file:         volve10r12-full-twt-sub3d.sgy
    percentiles:         [-6.97198262e+00 -6.52054033e+00 -1.49142619e+00 -5....
    coord_scalar:        -100.0
    ...                  ...
    srd:                 None
    datatype:            None
    coord_scaled:        None
    dimensions:          None
    vert_dimension:      None
    vert_domain:         None
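To get a feel for how a chunks dictionary splits a dimension, the idea can be tried on a small synthetic dataset. This is a sketch under assumptions: the array shape and values are made up, and xarray's own Dataset.chunk method stands in for opening a SEISNC file with open_seisnc.

```python
import numpy as np
import xarray as xr

# Hypothetical stand-in for a seismic cube: 250 inlines, 20 crosslines, 30 samples.
ds = xr.Dataset(
    {"data": (("iline", "xline", "twt"), np.zeros((250, 20, 30), dtype="float32"))},
    coords={
        "iline": np.arange(250),
        "xline": np.arange(20),
        "twt": np.arange(30) * 4.0,
    },
)

# Chunk only the iline dimension in groups of 100; other dimensions stay whole.
chunked = ds.chunk({"iline": 100})

# Each tuple lists the chunk lengths along one dimension.
print(chunked["data"].chunks)
```

The 250 inlines are split into blocks of 100, 100 and 50, while xline and twt each remain a single block. In the dataset above there are only 61 inlines, so a chunk size of 100 leaves the whole cube in one chunk, which is why the reported chunksize equals the full shape.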

Operations on SEISNC using `dask`¶

In this simple example we calculate the mean of the entire cube. If you check the dashboard (when running this example yourself), you can see the task graph and task stream execution.

In [8]:

mean = seisnc.data.mean()
mean

Out[8]:

<xarray.DataArray 'data' ()> Size: 4B
dask.array<mean_agg-aggregate, shape=(), dtype=float32, chunksize=(), chunktype=numpy.ndarray>

Whoa-oh, the mean is what? Yeah, dask won't calculate anything until you ask it to. This means you can string computations together into a task graph for lazy evaluation. To get the mean, try this:

In [9]:

mean.compute().values

Out[9]:

array(-7.317369e-05, dtype=float32)
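The same lazy behaviour can be seen without any seismic data. The sketch below uses a plain dask.array filled with ones as a stand-in for the cube; the shape and chunk sizes are assumptions chosen for illustration.

```python
import dask.array as da

# Hypothetical cube of ones, chunked along the first (iline-like) axis.
cube = da.ones((61, 202, 850), chunks=(10, 202, 850), dtype="float32")

# Chaining operations only extends the task graph; nothing is computed yet.
lazy_mean = (cube * 2 + 1).mean()
print(type(lazy_mean).__name__)

# compute() executes the graph and returns a concrete value.
result = float(lazy_mean.compute())
print(result)
```

Because every element is 1, the scaled cube is 3 everywhere and so is its mean, which makes it easy to check that the deferred computation did what was asked.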

Plotting with `dask`¶

The lazy loading of data means we can plot what we want using xarray-style slicing, and dask will fetch only the data we need.

In [10]:

fig, axs = plt.subplots(nrows=2, ncols=3, figsize=(20, 10))

iline = seisnc.sel(iline=10100).transpose("twt", "xline").data
xline = seisnc.sel(xline=2349).transpose("twt", "iline").data
zslice = seisnc.sel(twt=2900, method="nearest").transpose("iline", "xline").data

q = iline.quantile([0, 0.001, 0.5, 0.999, 1]).values
rq = np.max(np.abs([q[1], q[-2]]))

iline.plot(robust=True, ax=axs[0, 0], yincrease=False)
xline.plot(robust=True, ax=axs[0, 1], yincrease=False)
zslice.plot(robust=True, ax=axs[0, 2])

imshow_kwargs = dict(
    cmap="seismic", aspect="auto", vmin=-rq, vmax=rq, interpolation="bicubic"
)

axs[1, 0].imshow(iline.values, **imshow_kwargs)
axs[1, 0].set_title("iline")
axs[1, 1].imshow(xline.values, **imshow_kwargs)
axs[1, 1].set_title("xline")
axs[1, 2].imshow(zslice.values, origin="lower", **imshow_kwargs)
axs[1, 2].set_title("twt")

Out[10]:

Text(0.5, 1.0, 'twt')

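[Figure: iline 10100, xline 2349 and the slice nearest twt=2900, plotted with xarray (top row) and with matplotlib imshow (bottom row, titled "iline", "xline" and "twt").]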
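The colour limits used for the imshow panels come from the 0.1% and 99.9% quantiles of the section, taken as a symmetric range so that a few extreme amplitudes do not wash out the display. A minimal numpy sketch of that idea follows; the random section and the injected spike are made-up data for illustration.

```python
import numpy as np

# Hypothetical section: normally distributed amplitudes plus one large spike.
rng = np.random.default_rng(0)
section = rng.normal(size=(850, 202)).astype("float32")
section[0, 0] = 500.0

# Take the 0.1% and 99.9% quantiles and use the larger magnitude
# as a symmetric clip value around zero.
q = np.quantile(section, [0.001, 0.999])
rq = float(np.max(np.abs(q)))

print(rq, float(np.abs(section).max()))
```

The clip value stays close to the bulk of the amplitudes even though the largest sample is 500. Passing vmin=-rq and vmax=rq to imshow, as in the cell above, then centres the colour map on zero.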
