SEG-Y to Vector DataFrames and Back¶
The connection of segysak to xarray greatly simplifies the process of vectorising SEG-Y 3D data and returning it to SEG-Y. To do this, we rely on the close relationship between pandas and xarray, each of which can convert to and from the other's data structures.
Loading Data¶
We start by loading the data as normal with xr.open_dataset, passing the byte locations of the inline, crossline and CDP coordinate headers. For this example we will use the Volve example sub-cube.
import pathlib
import xarray as xr
from IPython.display import display
volve_3d_path = pathlib.Path("data/volve10r12-full-twt-sub3d.sgy")
print("3D", volve_3d_path.exists())
volve_3d = xr.open_dataset(volve_3d_path, dim_byte_fields={'iline': 5, 'xline': 21}, extra_byte_fields={'cdp_x': 73, 'cdp_y': 77})
3D True
Vectorisation¶
Once the data is loaded it can be converted to a pandas.DataFrame directly from the loaded Dataset. The DataFrame has a multi-index (iline, xline, samples) and contains a column for each variable in the originally loaded dataset. This includes the seismic amplitude as data and the cdp_x and cdp_y locations. If you require smaller volumes from the input data, you can use xarray selection methods prior to conversion to a DataFrame.
volve_3d_df = volve_3d.to_dataframe()
display(volve_3d_df)
| iline | xline | samples | cdp_x | cdp_y | data |
|---|---|---|---|---|---|
| 10090 | 2150 | 4.0 | 43640052 | 647744704 | 0.020575 |
| | | 8.0 | 43640052 | 647744704 | 0.022041 |
| | | 12.0 | 43640052 | 647744704 | 0.019659 |
| | | 16.0 | 43640052 | 647744704 | 0.025421 |
| | | 20.0 | 43640052 | 647744704 | 0.025436 |
| ... | ... | ... | ... | ... | ... |
| 10150 | 2351 | 3384.0 | 43414413 | 647878266 | 0.000000 |
| | | 3388.0 | 43414413 | 647878266 | 0.000000 |
| | | 3392.0 | 43414413 | 647878266 | 0.000000 |
| | | 3396.0 | 43414413 | 647878266 | 0.000000 |
| | | 3400.0 | 43414413 | 647878266 | 0.000000 |
10473700 rows × 3 columns
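As a minimal sketch of selecting a smaller volume before conversion, the example below builds a small synthetic cube with the same dimension names (iline, xline, samples) and subsets it with the standard xarray .sel method. The names cube and sub and all values are made up for illustration; they are not part of the Volve dataset.

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for the loaded cube: 4 ilines x 5 xlines x 6 samples.
ilines = np.arange(10090, 10094)
xlines = np.arange(2150, 2155)
samples = np.arange(4.0, 28.0, 4.0)  # 4, 8, ..., 24
amplitudes = np.random.default_rng(0).normal(size=(4, 5, 6)).astype("float32")

cube = xr.Dataset(
    {"data": (("iline", "xline", "samples"), amplitudes)},
    coords={"iline": ilines, "xline": xlines, "samples": samples},
)

# Label-based selection; slices on coordinate values include both end points.
sub = cube.sel(iline=slice(10090, 10091), samples=slice(8, 16))

# Only the selected sub-volume is flattened into rows.
sub_df = sub.to_dataframe()
print(sub_df.shape)
```

Selecting first keeps the DataFrame small, since every sample of every trace becomes one row.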
We can remove the multi-index by resetting the index of the DataFrame, which turns iline, xline and samples into ordinary columns. Vectorised workflows such as machine learning can then be easily applied to the DataFrame.
volve_3d_df_reindex = volve_3d_df.reset_index()
display(volve_3d_df_reindex)
| iline | xline | samples | cdp_x | cdp_y | data | |
|---|---|---|---|---|---|---|
| 0 | 10090 | 2150 | 4.0 | 43640052 | 647744704 | 0.020575 | 
| 1 | 10090 | 2150 | 8.0 | 43640052 | 647744704 | 0.022041 | 
| 2 | 10090 | 2150 | 12.0 | 43640052 | 647744704 | 0.019659 | 
| 3 | 10090 | 2150 | 16.0 | 43640052 | 647744704 | 0.025421 | 
| 4 | 10090 | 2150 | 20.0 | 43640052 | 647744704 | 0.025436 | 
| ... | ... | ... | ... | ... | ... | ... | 
| 10473695 | 10150 | 2351 | 3384.0 | 43414413 | 647878266 | 0.000000 | 
| 10473696 | 10150 | 2351 | 3388.0 | 43414413 | 647878266 | 0.000000 | 
| 10473697 | 10150 | 2351 | 3392.0 | 43414413 | 647878266 | 0.000000 | 
| 10473698 | 10150 | 2351 | 3396.0 | 43414413 | 647878266 | 0.000000 | 
| 10473699 | 10150 | 2351 | 3400.0 | 43414413 | 647878266 | 0.000000 | 
10473700 rows × 6 columns
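To illustrate the kind of vectorised operation a flat table allows, here is a small sketch that normalises each trace by its RMS amplitude using a pandas groupby. The table flat and the column data_rms_norm are hypothetical, with made-up values laid out like the reset-index DataFrame above.

```python
import numpy as np
import pandas as pd

# Tiny flat table in the same layout as the reset-index DataFrame (made-up values).
flat = pd.DataFrame(
    {
        "iline": [10090] * 3 + [10091] * 3,
        "xline": [2150] * 6,
        "samples": [4.0, 8.0, 12.0] * 2,
        "data": [3.0, -4.0, 0.0, 1.0, 1.0, 1.0],
    }
)

# Per-trace RMS amplitude, broadcast back onto every sample of that trace.
trace_rms = flat.groupby(["iline", "xline"])["data"].transform(
    lambda s: np.sqrt((s**2).mean())
)

# New column holding the RMS-normalised amplitude.
flat["data_rms_norm"] = flat["data"] / trace_rms
print(flat)
```

Because iline and xline are plain columns, each trace is simply one group, and the result stays aligned row by row with the original samples.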
Return to Xarray¶
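It is possible to return the DataFrame to a Dataset for output to SEG-Y. To do this the multi-index must first be restored with set_index, using the same dimension columns (iline, xline, samples). Afterward, the pandas to_xarray method converts the DataFrame back to a Dataset.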
volve_3d_df_multi = volve_3d_df_reindex.set_index(["iline", "xline", "samples"])
display(volve_3d_df_multi)
volve_3d_ds = volve_3d_df_multi.to_xarray()
display(volve_3d_ds)
| iline | xline | samples | cdp_x | cdp_y | data |
|---|---|---|---|---|---|
| 10090 | 2150 | 4.0 | 43640052 | 647744704 | 0.020575 |
| | | 8.0 | 43640052 | 647744704 | 0.022041 |
| | | 12.0 | 43640052 | 647744704 | 0.019659 |
| | | 16.0 | 43640052 | 647744704 | 0.025421 |
| | | 20.0 | 43640052 | 647744704 | 0.025436 |
| ... | ... | ... | ... | ... | ... |
| 10150 | 2351 | 3384.0 | 43414413 | 647878266 | 0.000000 |
| | | 3388.0 | 43414413 | 647878266 | 0.000000 |
| | | 3392.0 | 43414413 | 647878266 | 0.000000 |
| | | 3396.0 | 43414413 | 647878266 | 0.000000 |
| | | 3400.0 | 43414413 | 647878266 | 0.000000 |
10473700 rows × 3 columns
<xarray.Dataset> Size: 126MB
Dimensions:  (iline: 61, xline: 202, samples: 850)
Coordinates:
  * iline    (iline) int16 122B 10090 10091 10092 10093 ... 10148 10149 10150
  * xline    (xline) int16 404B 2150 2151 2152 2153 2154 ... 2348 2349 2350 2351
  * samples  (samples) float32 3kB 4.0 8.0 12.0 ... 3.392e+03 3.396e+03 3.4e+03
Data variables:
    cdp_x    (iline, xline, samples) int32 42MB 43640052 43640052 ... 43414413
    cdp_y    (iline, xline, samples) int32 42MB 647744704 ... 647878266
    data     (iline, xline, samples) float32 42MB 0.02057 0.02204 ... 0.0 0.0
The resulting dataset requires some changes to make it compatible again for export to SEG-Y.
Firstly, the attributes need to be set. The simplest way is to copy these from the original SEG-Y input. Otherwise they can be set manually. segysak specifically needs the sample_rate and the coord_scalar attributes.
volve_3d_ds.attrs = volve_3d.attrs
display(volve_3d_ds.attrs)
{'seisnc': '{"coord_scalar": -100.0, "coord_scaled": false}'}
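If the original dataset is not available, the attributes can be written by hand. The sketch below assumes that the metadata lives in a JSON string under the seisnc key, as in the output displayed above; the exact set of keys segysak expects is not confirmed here and should be checked against the segysak documentation. The dataset ds is an empty placeholder.

```python
import json

import xarray as xr

# Placeholder dataset standing in for the one rebuilt from the DataFrame.
ds = xr.Dataset()

# Assumption: segysak stores its metadata as a JSON string under "seisnc",
# mirroring the attributes displayed for the original file.
ds.attrs["seisnc"] = json.dumps({"coord_scalar": -100.0, "coord_scaled": False})

print(ds.attrs)
```

When copying from an existing dataset instead, dict(volve_3d.attrs) makes a shallow copy, so later changes to one dataset's attributes do not affect the other.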
The cdp_x and cdp_y positions were broadcast along the vertical axis during the round trip. They must be reduced back to 2D (iline, xline) along the vertical dimension "samples" and set as coordinates.
volve_3d_ds["cdp_x"]
<xarray.DataArray 'cdp_x' (iline: 61, xline: 202, samples: 850)> Size: 42MB
array([[[43640052, 43640052, 43640052, ..., 43640052, 43640052,
         43640052],
        [43638839, 43638839, 43638839, ..., 43638839, 43638839,
         43638839],
        [43637626, 43637626, 43637626, ..., 43637626, 43637626,
         43637626],
        ...,
        [43398692, 43398692, 43398692, ..., 43398692, 43398692,
         43398692],
        [43397480, 43397480, 43397480, ..., 43397480, 43397480,
         43397480],
        [43396267, 43396267, 43396267, ..., 43396267, 43396267,
         43396267]],
       [[43640354, 43640354, 43640354, ..., 43640354, 43640354,
         43640354],
        [43639141, 43639141, 43639141, ..., 43639141, 43639141,
         43639141],
        [43637928, 43637928, 43637928, ..., 43637928, 43637928,
         43637928],
...
        [43416536, 43416536, 43416536, ..., 43416536, 43416536,
         43416536],
        [43415323, 43415323, 43415323, ..., 43415323, 43415323,
         43415323],
        [43414110, 43414110, 43414110, ..., 43414110, 43414110,
         43414110]],
       [[43658198, 43658198, 43658198, ..., 43658198, 43658198,
         43658198],
        [43656985, 43656985, 43656985, ..., 43656985, 43656985,
         43656985],
        [43655772, 43655772, 43655772, ..., 43655772, 43655772,
         43655772],
        ...,
        [43416839, 43416839, 43416839, ..., 43416839, 43416839,
         43416839],
        [43415626, 43415626, 43415626, ..., 43415626, 43415626,
         43415626],
        [43414413, 43414413, 43414413, ..., 43414413, 43414413,
         43414413]]], dtype=int32)
Coordinates:
  * iline    (iline) int16 122B 10090 10091 10092 10093 ... 10148 10149 10150
  * xline    (xline) int16 404B 2150 2151 2152 2153 2154 ... 2348 2349 2350 2351
  * samples  (samples) float32 3kB 4.0 8.0 12.0 ... 3.392e+03 3.396e+03 3.4e+03
volve_3d_ds["cdp_x"] = volve_3d_ds["cdp_x"].mean(dim=["samples"])
volve_3d_ds["cdp_y"] = volve_3d_ds["cdp_y"].mean(dim=["samples"])
volve_3d_ds = volve_3d_ds.set_coords(["cdp_x", "cdp_y"])
volve_3d_ds
<xarray.Dataset> Size: 42MB
Dimensions:  (iline: 61, xline: 202, samples: 850)
Coordinates:
    cdp_x    (iline, xline) float64 99kB 4.364e+07 4.364e+07 ... 4.341e+07
    cdp_y    (iline, xline) float64 99kB 6.477e+08 6.477e+08 ... 6.479e+08
  * iline    (iline) int16 122B 10090 10091 10092 10093 ... 10148 10149 10150
  * xline    (xline) int16 404B 2150 2151 2152 2153 2154 ... 2348 2349 2350 2351
  * samples  (samples) float32 3kB 4.0 8.0 12.0 ... 3.392e+03 3.396e+03 3.4e+03
Data variables:
    data     (iline, xline, samples) float32 42MB 0.02057 0.02204 ... 0.0 0.0
Attributes:
    seisnc:   {"coord_scalar": -100.0, "coord_scaled": false}
Afterwards, use the to_segy method as normal to return to SEG-Y.
volve_3d_ds.seisio.to_segy("data/test.segy", iline=189, xline=193, trace_header_map={'cdp_x':181, 'cdp_y':185})
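The whole round trip can be sketched on a small synthetic cube, stopping short of the SEG-Y export. All names and values here (orig, flat, back, the coordinate numbers) are made up for illustration. One deliberate difference from the steps above: instead of mean, the sketch takes the first sample with isel(samples=0, drop=True). This is an alternative that relies on cdp_x being constant along samples, and it keeps the integer dtype, whereas mean returns float64.

```python
import numpy as np
import xarray as xr

ilines = np.arange(3)
xlines = np.arange(4)
samples = np.arange(0.0, 20.0, 4.0)  # 5 samples

# Made-up integer CDP X positions, one per (iline, xline) trace.
cdp_x2d = (1000 + 10 * ilines[:, None] + xlines[None, :]).astype("int32")
amplitudes = np.random.default_rng(1).normal(size=(3, 4, 5)).astype("float32")

orig = xr.Dataset(
    {"data": (("iline", "xline", "samples"), amplitudes)},
    coords={
        "iline": ilines,
        "xline": xlines,
        "samples": samples,
        "cdp_x": (("iline", "xline"), cdp_x2d),
    },
    attrs={"seisnc": '{"coord_scalar": -100.0, "coord_scaled": false}'},
)

# Dataset -> flat DataFrame; cdp_x is repeated for every sample of a trace.
flat = orig.to_dataframe().reset_index()
flat["data"] = flat["data"] * 2.0  # placeholder for any vectorised processing

# Flat DataFrame -> Dataset; cdp_x comes back as a 3D data variable.
back = flat.set_index(["iline", "xline", "samples"]).to_xarray()
back.attrs = dict(orig.attrs)  # copy so the datasets do not share one dict

# Collapse cdp_x back to 2D and register it as a coordinate again.
back["cdp_x"] = back["cdp_x"].isel(samples=0, drop=True)
back = back.set_coords("cdp_x")
print(back)
```

If the positions could differ along samples, for example after filtering rows, mean or an explicit consistency check would be the safer choice.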
Very large datasets¶
If you have a very large dataset (SEG-Y file), it may be possible to use ds.to_dask_dataframe(), which returns a dask DataFrame whose operations, including writing data, are performed lazily.