SEG-Y to Vector DataFrames and Back¶
The connection of segysak to xarray greatly simplifies the process of vectorising 3D SEG-Y data and returning it to SEG-Y. To do this, one can use the close relationship between pandas and xarray.
Loading Data¶
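We start by loading the data with xr.open_dataset, passing the byte locations of the inline and crossline headers (dim_byte_fields) and of the CDP X/Y headers (extra_byte_fields). For this example we will use the Volve example sub-cube.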
import pathlib
import xarray as xr
from IPython.display import display
volve_3d_path = pathlib.Path("data/volve10r12-full-twt-sub3d.sgy")
print("3D", volve_3d_path.exists())
volve_3d = xr.open_dataset(volve_3d_path, dim_byte_fields={'iline': 5, 'xline': 21}, extra_byte_fields={'cdp_x': 73, 'cdp_y': 77})
3D True
Vectorisation¶
Once the data is loaded it can be converted to a pandas.DataFrame directly from the loaded Dataset. The DataFrame is multi-indexed and contains a column for each variable in the originally loaded dataset. This includes the seismic amplitude as data and the cdp_x and cdp_y locations. If you require smaller volumes from the input data, you can use xarray selection methods prior to conversion to a DataFrame.
volve_3d_df = volve_3d.to_dataframe()
display(volve_3d_df)
iline | xline | samples | cdp_x | cdp_y | data |
---|---|---|---|---|---|
10090 | 2150 | 4.0 | 43640052 | 647744704 | 0.020575 |
| | 8.0 | 43640052 | 647744704 | 0.022041 |
| | 12.0 | 43640052 | 647744704 | 0.019659 |
| | 16.0 | 43640052 | 647744704 | 0.025421 |
| | 20.0 | 43640052 | 647744704 | 0.025436 |
... | ... | ... | ... | ... | ... |
10150 | 2351 | 3384.0 | 43414413 | 647878266 | 0.000000 |
| | 3388.0 | 43414413 | 647878266 | 0.000000 |
| | 3392.0 | 43414413 | 647878266 | 0.000000 |
| | 3396.0 | 43414413 | 647878266 | 0.000000 |
| | 3400.0 | 43414413 | 647878266 | 0.000000 |
10473700 rows × 3 columns
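The selection step mentioned above can be sketched as follows. This is a minimal illustration on a small synthetic cube rather than the Volve data; the amplitude values and the chosen ranges are made up, and only the dimension names (iline, xline, samples) are taken from the dataset shown here.

```python
import numpy as np
import xarray as xr

# Small synthetic stand-in for the loaded cube (values are made up).
ilines = np.arange(10090, 10094)      # 4 inlines
xlines = np.arange(2150, 2155)        # 5 crosslines
samples = np.arange(4.0, 44.0, 4.0)   # 10 samples: 4.0 ... 40.0
rng = np.random.default_rng(0)
amplitudes = rng.normal(size=(ilines.size, xlines.size, samples.size))

cube = xr.Dataset(
    {"data": (("iline", "xline", "samples"), amplitudes.astype("float32"))},
    coords={"iline": ilines, "xline": xlines, "samples": samples},
)

# Label-based selection with .sel; slices include both end points.
subset = cube.sel(iline=slice(10090, 10091), samples=slice(8.0, 20.0))

# Only the selected sub-volume is converted to a DataFrame.
subset_df = subset.to_dataframe()
print(subset_df.shape)
```

Selecting before calling to_dataframe keeps the DataFrame small, because every sample of every trace becomes one row.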
We can remove the multi-index by resetting the index of the DataFrame, which turns the iline, xline and samples levels into ordinary columns. Vectorised workflows such as machine learning can then be applied directly to the DataFrame.
volve_3d_df_reindex = volve_3d_df.reset_index()
display(volve_3d_df_reindex)
| iline | xline | samples | cdp_x | cdp_y | data |
---|---|---|---|---|---|---|
0 | 10090 | 2150 | 4.0 | 43640052 | 647744704 | 0.020575 |
1 | 10090 | 2150 | 8.0 | 43640052 | 647744704 | 0.022041 |
2 | 10090 | 2150 | 12.0 | 43640052 | 647744704 | 0.019659 |
3 | 10090 | 2150 | 16.0 | 43640052 | 647744704 | 0.025421 |
4 | 10090 | 2150 | 20.0 | 43640052 | 647744704 | 0.025436 |
... | ... | ... | ... | ... | ... | ... |
10473695 | 10150 | 2351 | 3384.0 | 43414413 | 647878266 | 0.000000 |
10473696 | 10150 | 2351 | 3388.0 | 43414413 | 647878266 | 0.000000 |
10473697 | 10150 | 2351 | 3392.0 | 43414413 | 647878266 | 0.000000 |
10473698 | 10150 | 2351 | 3396.0 | 43414413 | 647878266 | 0.000000 |
10473699 | 10150 | 2351 | 3400.0 | 43414413 | 647878266 | 0.000000 |
10473700 rows × 6 columns
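As a rough sketch of what a vectorised workflow on the flat table might look like, the example below computes a column-wise attribute and a per-trace RMS amplitude with plain pandas. The table is a small synthetic one with the same column names as above; the derived column names (abs_amp, sq, rms) are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Small synthetic flat table with the same column layout (values are made up).
index = pd.MultiIndex.from_product(
    [[10090, 10091], [2150, 2151, 2152], [4.0, 8.0, 12.0, 16.0]],
    names=["iline", "xline", "samples"],
)
flat = pd.DataFrame(
    {"data": np.linspace(-1.0, 1.0, len(index))}, index=index
).reset_index()

# Column-wise operation applied to every sample at once.
flat["abs_amp"] = flat["data"].abs()

# Per-trace attribute: group the samples of each (iline, xline) trace
# and take the root-mean-square amplitude.
trace_rms = (
    flat.assign(sq=flat["data"] ** 2)
    .groupby(["iline", "xline"])["sq"]
    .mean()
    .pow(0.5)
    .rename("rms")
    .reset_index()
)
print(trace_rms)
```

The same pattern applies to any function that expects one row per observation, which is why the flat layout is convenient for machine learning libraries.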
Return to Xarray¶
It is possible to return the DataFrame to a Dataset for output to SEG-Y. To do this, the multi-index must first be restored by setting iline, xline and samples as the index again. pandas then provides the to_xarray method.
volve_3d_df_multi = volve_3d_df_reindex.set_index(["iline", "xline", "samples"])
display(volve_3d_df_multi)
volve_3d_ds = volve_3d_df_multi.to_xarray()
display(volve_3d_ds)
iline | xline | samples | cdp_x | cdp_y | data |
---|---|---|---|---|---|
10090 | 2150 | 4.0 | 43640052 | 647744704 | 0.020575 |
| | 8.0 | 43640052 | 647744704 | 0.022041 |
| | 12.0 | 43640052 | 647744704 | 0.019659 |
| | 16.0 | 43640052 | 647744704 | 0.025421 |
| | 20.0 | 43640052 | 647744704 | 0.025436 |
... | ... | ... | ... | ... | ... |
10150 | 2351 | 3384.0 | 43414413 | 647878266 | 0.000000 |
| | 3388.0 | 43414413 | 647878266 | 0.000000 |
| | 3392.0 | 43414413 | 647878266 | 0.000000 |
| | 3396.0 | 43414413 | 647878266 | 0.000000 |
| | 3400.0 | 43414413 | 647878266 | 0.000000 |
10473700 rows × 3 columns
<xarray.Dataset> Size: 126MB
Dimensions:  (iline: 61, xline: 202, samples: 850)
Coordinates:
  * iline    (iline) int16 122B 10090 10091 10092 10093 ... 10148 10149 10150
  * xline    (xline) int16 404B 2150 2151 2152 2153 2154 ... 2348 2349 2350 2351
  * samples  (samples) float32 3kB 4.0 8.0 12.0 ... 3.392e+03 3.396e+03 3.4e+03
Data variables:
    cdp_x    (iline, xline, samples) int32 42MB 43640052 43640052 ... 43414413
    cdp_y    (iline, xline, samples) int32 42MB 647744704 ... 647878266
    data     (iline, xline, samples) float32 42MB 0.02057 0.02204 ... 0.0 0.0
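The round trip can be checked on a small synthetic cube, as in the sketch below. The cube and its values are made up for illustration. The rows are shuffled before conversion to mimic a workflow that changes row order; the sketch relies on to_xarray placing values by their index labels rather than by row position.

```python
import numpy as np
import xarray as xr

# Small synthetic cube (values are made up).
original = xr.Dataset(
    {
        "data": (
            ("iline", "xline", "samples"),
            np.arange(24, dtype="float32").reshape(2, 3, 4),
        )
    },
    coords={
        "iline": [10090, 10091],
        "xline": [2150, 2151, 2152],
        "samples": [4.0, 8.0, 12.0, 16.0],
    },
)

# Flatten to one row per sample, then shuffle the rows.
flat = original.to_dataframe().reset_index()
shuffled = flat.sample(frac=1.0, random_state=0)

# Restore the multi-index and convert back to a Dataset.
restored = shuffled.set_index(["iline", "xline", "samples"]).to_xarray()

print(restored["data"].dims)
print(np.array_equal(restored["data"].values, original["data"].values))
```

Note that the index levels must uniquely identify each row; duplicated (iline, xline, samples) combinations cannot be mapped back onto a regular grid.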
The resulting dataset requires some changes to make it compatible again for export to SEG-Y.
Firstly, the attributes need to be set. The simplest way is to copy these from the original SEG-Y input. Otherwise they can be set manually. segysak specifically needs the sample_rate and the coord_scalar attributes.
volve_3d_ds.attrs = volve_3d.attrs
display(volve_3d_ds.attrs)
{'seisnc': '{"coord_scalar": -100.0, "coord_scaled": false}'}
The cdp_x and cdp_y positions were broadcast along the vertical axis during the round trip, so they must be reduced to 2D along the vertical dimension "samples" and set as coordinates.
volve_3d_ds["cdp_x"]
<xarray.DataArray 'cdp_x' (iline: 61, xline: 202, samples: 850)> Size: 42MB
array([[[43640052, 43640052, 43640052, ..., 43640052, 43640052, 43640052],
        [43638839, 43638839, 43638839, ..., 43638839, 43638839, 43638839],
        [43637626, 43637626, 43637626, ..., 43637626, 43637626, 43637626],
        ...,
        [43398692, 43398692, 43398692, ..., 43398692, 43398692, 43398692],
        [43397480, 43397480, 43397480, ..., 43397480, 43397480, 43397480],
        [43396267, 43396267, 43396267, ..., 43396267, 43396267, 43396267]],

       [[43640354, 43640354, 43640354, ..., 43640354, 43640354, 43640354],
        [43639141, 43639141, 43639141, ..., 43639141, 43639141, 43639141],
        [43637928, 43637928, 43637928, ..., 43637928, 43637928, 43637928],
        ...
        [43416536, 43416536, 43416536, ..., 43416536, 43416536, 43416536],
        [43415323, 43415323, 43415323, ..., 43415323, 43415323, 43415323],
        [43414110, 43414110, 43414110, ..., 43414110, 43414110, 43414110]],

       [[43658198, 43658198, 43658198, ..., 43658198, 43658198, 43658198],
        [43656985, 43656985, 43656985, ..., 43656985, 43656985, 43656985],
        [43655772, 43655772, 43655772, ..., 43655772, 43655772, 43655772],
        ...,
        [43416839, 43416839, 43416839, ..., 43416839, 43416839, 43416839],
        [43415626, 43415626, 43415626, ..., 43415626, 43415626, 43415626],
        [43414413, 43414413, 43414413, ..., 43414413, 43414413, 43414413]]],
      dtype=int32)
Coordinates:
  * iline    (iline) int16 122B 10090 10091 10092 10093 ... 10148 10149 10150
  * xline    (xline) int16 404B 2150 2151 2152 2153 2154 ... 2348 2349 2350 2351
  * samples  (samples) float32 3kB 4.0 8.0 12.0 ... 3.392e+03 3.396e+03 3.4e+03
volve_3d_ds["cdp_x"] = volve_3d_ds["cdp_x"].mean(dim=["samples"])
volve_3d_ds["cdp_y"] = volve_3d_ds["cdp_y"].mean(dim=["samples"])
volve_3d_ds = volve_3d_ds.set_coords(["cdp_x", "cdp_y"])
volve_3d_ds
<xarray.Dataset> Size: 42MB
Dimensions:  (iline: 61, xline: 202, samples: 850)
Coordinates:
    cdp_x    (iline, xline) float64 99kB 4.364e+07 4.364e+07 ... 4.341e+07
    cdp_y    (iline, xline) float64 99kB 6.477e+08 6.477e+08 ... 6.479e+08
  * iline    (iline) int16 122B 10090 10091 10092 10093 ... 10148 10149 10150
  * xline    (xline) int16 404B 2150 2151 2152 2153 2154 ... 2348 2349 2350 2351
  * samples  (samples) float32 3kB 4.0 8.0 12.0 ... 3.392e+03 3.396e+03 3.4e+03
Data variables:
    data     (iline, xline, samples) float32 42MB 0.02057 0.02204 ... 0.0 0.0
Attributes:
    seisnc:   {"coord_scalar": -100.0, "coord_scaled": false}
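Taking the mean along samples converts the integer header values to float64, as the output above shows. If the original integer dtype matters, one possible alternative (an assumption on our part, not something the segysak documentation prescribes) is to take the first sample instead, which is equivalent here because each position is simply repeated along the vertical axis. A minimal sketch on synthetic values:

```python
import numpy as np
import xarray as xr

# cdp_x repeated along samples, as it appears after to_xarray() (synthetic).
cdp_x_2d = np.array([[43640052, 43638839], [43640354, 43639141]], dtype="int32")
n_samples = 5
ds = xr.Dataset(
    {
        "cdp_x": (
            ("iline", "xline", "samples"),
            np.repeat(cdp_x_2d[:, :, None], n_samples, axis=2),
        ),
        "data": (
            ("iline", "xline", "samples"),
            np.zeros((2, 2, n_samples), dtype="float32"),
        ),
    },
    coords={
        "iline": [10090, 10091],
        "xline": [2150, 2151],
        "samples": np.arange(4.0, 24.0, 4.0),
    },
)

# The mean reduces to 2D but promotes the values to float64.
by_mean = ds["cdp_x"].mean(dim="samples")

# Taking the first sample reduces to 2D and keeps int32;
# drop=True removes the leftover scalar "samples" coordinate.
by_first = ds["cdp_x"].isel(samples=0, drop=True)
print(by_mean.dtype, by_first.dtype)

ds["cdp_x"] = by_first
ds = ds.set_coords("cdp_x")
```

Either reduction yields the same positions; the choice only affects the dtype stored in the dataset.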
Afterwards, use the to_segy method as normal to return to SEG-Y.
volve_3d_ds.seisio.to_segy("data/test.segy", iline=189, xline=193, trace_header_map={'cdp_x':181, 'cdp_y':185})
Very large datasets¶
If you have a very large dataset (SEG-Y file), it may be possible to use ds.to_dask_dataframe(), which returns a dask DataFrame that can perform operations, including the writing of data, lazily.