Configuration
=============
x Pre-configured values in the C sources, or shall everything be specified in the configuration? [ full-config, less error prone if everything is in one place ]
- The default header size can be computed from the defined 'struct', but that's it (see the sketch below).
- Do not try to compute maximum packet size, etc.
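  - A minimal sketch of that one exception (the field names below are hypothetical, not the actual ROOF packet header): the default header size comes from the compiled struct, everything else stays in the configuration.

        #include <stdint.h>

        /* Hypothetical packet header layout; only its size is used as a
         * built-in default, all other values come from the configuration. */
        struct roof_packet_header {
            uint32_t frame_number;      /* frame the fragment belongs to            */
            uint32_t fragment_offset;   /* byte offset of the fragment in the frame */
        };

        /* Default used only when the configuration does not override it. */
        #define ROOF_DEFAULT_HEADER_SIZE (sizeof(struct roof_packet_header))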
x Drop all computable parameters from the config [ with a few exceptions ]
- It should be possible to run the network receiver without configuring the rest of the ROOF.
- It should be possible to keep the dataset size both in the network config (to avoid rewriting) and in the ROOF hardware configuration, but the two values must match.
- n_streams vs. n_modules: multiple streams per module may be needed in the future.
- samples_per_rotations vs. sample_rate / image_rate: unclear what happens with switch rate control, etc.
x How precisely should we verify configuration consistency? [ implement a JSON schema at some point ]
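  - A sketch of the kind of cross-field check meant here (the config structures and field names are assumptions, not the actual ROOF configuration); a JSON schema would later cover the structural part, while checks like these stay in code:

        #include <stdbool.h>
        #include <stdio.h>

        /* Hypothetical configuration fragments; the real ROOF config differs. */
        struct roof_net_config { int n_streams; long dataset_size; };
        struct roof_hw_config  { int n_modules; long dataset_size; };

        static bool roof_config_consistent(const struct roof_net_config *net,
                                           const struct roof_hw_config *hw)
        {
            /* The dataset size may be kept in both places to avoid rewriting,
             * but then the two values have to match. */
            if (net->dataset_size != hw->dataset_size) {
                fprintf(stderr, "dataset size mismatch: net=%ld hw=%ld\n",
                        net->dataset_size, hw->dataset_size);
                return false;
            }

            /* Assumption: multiple streams per module are allowed, but the
             * streams should still map evenly onto the modules. */
            if (hw->n_modules <= 0 || net->n_streams % hw->n_modules != 0) {
                fprintf(stderr, "n_streams=%d does not map onto n_modules=%d\n",
                        net->n_streams, hw->n_modules);
                return false;
            }

            return true;
        }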
x Propagate broken frames by default, or drop them and just mark the missing frames with metadata? [ ingest when necessary ]
- I.e. provide a filter that removes broken frames, or one that generates replacements when necessary (so the standard UFO filters can ingest an uninterrupted flow)? A sketch of both options follows below.
- How to handle partially broken frames?
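  - A sketch of the two filter options above (the frame descriptor and all names are hypothetical, and placing the missing data "at the end" is a simplification):

        #include <stdbool.h>
        #include <stddef.h>
        #include <string.h>

        /* Hypothetical frame descriptor: broken frames are propagated together
         * with metadata instead of being silently dropped by the receiver. */
        struct roof_frame {
            void  *data;
            size_t size;
            bool   broken;         /* at least one fragment is missing */
            size_t missing_bytes;  /* how much of the frame is missing */
        };

        /* Option A: drop broken frames entirely (returns true to keep a frame). */
        static bool filter_drop_broken(const struct roof_frame *f)
        {
            return !f->broken;
        }

        /* Option B: patch broken frames (zero-fill the missing part) so the
         * standard UFO filters see an uninterrupted flow of complete frames. */
        static void filter_patch_broken(struct roof_frame *f)
        {
            if (f->broken) {
                memset((char *) f->data + (f->size - f->missing_bytes), 0,
                       f->missing_bytes);
                f->broken = false;
            }
        }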
x How to handle data planes? [ metadata passes through processors, but not reductors ]
- Metadata seems to be preserved while passing through the standard UFO filters, so this is the easiest way out.
- We can build a stand-alone subgraph for each plane, but this likely involves a full data copy.
- This is probably OK for the simple use case with only two instances ("raw storage + visualization"), but could be too challenging for multi-plane analysis.

Overloaded Mode
===============
x How/when do we stop capturing in fixed-frame mode? [ After building the required number of frames ]
- After capturing the theoretical number of packets required for reconstruction (or plus one frame to account for half-built frames)
  * The initial packets are likely broken, as our experiments with LibVMA show... so we will likely end up with fewer than needed...
  * As a work-around, we could _always_ skip the first few frames to allow the system to become responsive.
  * Overall, this seems to be the less reliable approach.
=> After the required number of packets has actually been built (not just captured).
  * Unlike the previous case, this may never finish if processing is the bottleneck, as buffers will be overwritten...
  * An alternative is to stop capturing frames once the buffers are exhausted...
x Do we pause receiving when the buffer is exhausted, or do we start overwriting (the original UFO variant overwrites)? [ stop ]
- For streaming, overwriting is better, as it is preferable to skip older frames rather than newer ones.
=> But for capturing a fixed number of frames, we need to stop streaming (in the overloaded case) or the frames will be continuously overwritten.
  * This is the more repeatable way to handle the frames; something has to be optimized anyway if we are too slow in streaming mode, but in
    this mode we can reliably get the first frames even if event-building is overloaded.
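  - The stop-vs-overwrite decision could be captured in a small policy helper; a sketch with assumed names:

        #include <stdbool.h>

        /* Hypothetical acquisition modes. */
        typedef enum { ROOF_MODE_STREAMING, ROOF_MODE_FIXED_FRAMES } roof_mode_t;

        /* What to do when the receive buffers are exhausted:
         *  - streaming: overwrite, dropping old frames is preferable to dropping new ones;
         *  - fixed number of frames: stop receiving, so the first frames survive
         *    even if event-building is overloaded. */
        static bool roof_overwrite_on_full(roof_mode_t mode)
        {
            return mode == ROOF_MODE_STREAMING;
        }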
x Do we always skip the first frames to keep the system responsive, or do we process them normally? [ Not now, TBD later if necessary ]
- Technically, skipping the first frames could allow faster stabilization.
- It could only cause problems if streaming is started by some trigger and the first frames are really important.
- This would be mandatory in fixed-frame mode if stopping is independent of reconstruction.

Data Flow Model
===============
x MP-RQ vs recvfrom_zcopy vs SocketXtreme
=> MP-RQ is the lowest-overhead method (it seems to remove one memcpy) and its requirements fit our use case.
x How to abstract read/LibVMA read access?
=> Return a pointer to the buffer, the number of packets, and the padding between packets.
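  - A sketch of what such an abstraction could look like (the names are assumptions, not an existing ROOF or LibVMA API); both the plain socket and the LibVMA/MP-RQ backends would fill the same descriptor:

        #include <stddef.h>
        #include <stdint.h>

        /* Hypothetical descriptor returned by the read abstraction: a pointer into
         * the receive buffer plus enough information to iterate over the packets. */
        struct roof_read_chunk {
            uint8_t *data;         /* start of the first packet                */
            size_t   n_packets;    /* number of packets available in the chunk */
            size_t   packet_size;  /* size of each packet                      */
            size_t   stride;       /* distance between packet starts (padding) */
        };

        /* Backend-specific implementation (recvfrom, recvfrom_zcopy, MP-RQ, ...)
         * fills 'chunk'; returns 0 on success, -1 if no data is available yet. */
        int roof_read(int stream_id, struct roof_read_chunk *chunk);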
x Possible Models: Buffer independent streams or buffer sinograms
- Sinogram-buffer (push-model)
  * Sinograms are buffered. The readers write directly to the appropriate place in the sinogram buffers and increment a fragment counter
    for each buffer.
  * Readers pause on the first item outside the current buffer and wait until the buffer start is advanced.
  * After receiving a fragment for a new sinogram, the reader informs the controller about the number of missing fragments in the buffer,
    e.g. by counting 'missing' fragments in an atomic array (on top of the 'completed' one).
  * The controller advances the buffer start once 'missing' rises above the specified threshold.
- Stream-buffers (pull-model)
  * Data is buffered directly in the independent receive buffers, so there is no memcpy in the receiving part of the code.
  * A master builder thread determines the sinogram to build (the maximum across the buffers).
  * Builder threads skip to the required sinogram and start copying data until a missing fragment is detected.
  * On a missing fragment, a new sinogram is determined and the operation is restarted (when to skip a large amount?).
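  - The core data structures of the two models could look roughly as follows (a sketch with assumed names, not a decided layout):

        #include <stdatomic.h>
        #include <stddef.h>
        #include <stdint.h>

        /* Push model: a ring of sinogram slots. Receivers write fragments directly
         * into 'data' and update the counters; the controller advances 'start'
         * once 'missing' of the oldest slot exceeds the configured threshold. */
        struct sinogram_slot {
            uint8_t     *data;       /* pre-allocated sinogram memory */
            atomic_uint  completed;  /* fragments already written     */
            atomic_uint  missing;    /* fragments reported missing    */
        };

        struct sinogram_ring {
            struct sinogram_slot *slots;
            size_t                n_slots;
            atomic_size_t         start;   /* index of the oldest active slot */
        };

        /* Pull model: one independent receive buffer per stream; builder threads
         * later pick a sinogram and copy its fragments out of these buffers. */
        struct stream_buffer {
            uint8_t *packets;      /* raw packets as delivered by the NIC */
            size_t   n_packets;
            size_t   packet_size;
            size_t   read_pos;     /* builder position within the buffer  */
        };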
x Locking with global buffer (No problems if buffering independent streams)
- We can synchronize "advance buffer start" and "copy out" with locks, as these are low-frequency events, but we need to ensure that
  copying fragments stays lock-less.
- In the 'counting' scenario this is problematic:
  - Different threads write to different parts of the buffer. If the buffer start moves, it is OK to finish writing the old fragment; it
    will be overwritten by the same thread with new data later. No problem here.
  - But how to handle fragment counting? Atomics are fine for concurrent threads, but if we move the buffer, the fragment count must not
    be increased. This means we need to execute the 'if' and the 'inc' atomically (increase only if the buffer has not moved, and the move
    could happen between the 'if' and the 'inc'). This is the main problem; a compare-and-swap sketch follows after this block.
  - We can push the increments to a pipe and let the main thread (which also advances the start of the buffer) read from the pipe and
    increase the counts (or ignore them if the buffer has moved). Is this faster than locking, or do Linux pipes perform locking or
    kernel/user-space switches anyway?
? Can we find alternative ways without locks/pipes? E.g. using hashes? Really large counting arrays?
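  - One lock-free possibility (an idea, not a decided design): tag each per-sinogram counter with a generation number and update both with a single compare-and-swap, so the increment silently fails if the buffer slot was advanced in between:

        #include <stdatomic.h>
        #include <stdbool.h>
        #include <stdint.h>

        /* Generation (upper 32 bit) and fragment count (lower 32 bit) packed into
         * one atomic word, so "check generation" and "increment count" happen in
         * a single compare-and-swap. */
        typedef _Atomic uint64_t slot_state_t;

        #define SLOT_GEN(v)          ((uint32_t) ((v) >> 32))
        #define SLOT_COUNT(v)        ((uint32_t) ((v) & 0xffffffffu))
        #define SLOT_PACK(gen, cnt)  (((uint64_t) (gen) << 32) | (uint64_t) (cnt))

        /* Receiver thread: count a completed fragment, but only if the slot still
         * belongs to the expected generation (i.e. the buffer start has not moved). */
        static bool slot_count_fragment(slot_state_t *slot, uint32_t expected_gen)
        {
            uint64_t old = atomic_load(slot);
            for (;;) {
                if (SLOT_GEN(old) != expected_gen)
                    return false;                  /* buffer advanced, drop the count */
                uint64_t desired = SLOT_PACK(expected_gen, SLOT_COUNT(old) + 1);
                if (atomic_compare_exchange_weak(slot, &old, desired))
                    return true;                   /* 'if' and 'inc' done atomically */
                /* CAS failed: 'old' now holds the current value, retry. */
            }
        }

        /* Controller thread: advance the buffer start by bumping the generation,
         * which resets the count and invalidates in-flight increments. */
        static void slot_advance(slot_state_t *slot, uint32_t new_gen)
        {
            atomic_store(slot, SLOT_PACK(new_gen, 0));
        }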
? Performance considerations
- Sinogram-buffer advantage
  - The sinogram-buffer works fine with re-ordered data streams. The stream-buffer approach can handle re-ordered streams in theory, but
    with a large performance penalty and increased complexity. In practice, however, there is little reason why re-ordering should happen,
    and experiments with the existing switch do not show any re-ordering.
  - There are always uncached reads. However, with the sinogram-buffer it is an uncached copy of a large image, while with the
    stream-buffer many small images are accessed uncached.
? With the sinogram-buffer we can prepare data in advance and only a single memcpy will be required. With the stream-buffer all steps are
  performed only once the buffer is available. But does this have any impact on performance?
? With the stream-buffer, building _seems_ more complex and requires more synchronization (but only if we can find a simple method to avoid
  locking in the sinogram-buffer scenario).
=> Stream-buffer advantage
- With MP-RQ we can use Mellanox ring-buffer (and probably Mellanox in general) as stream-buffer.
  - The stream-buffer incurs significantly less load during the reading phase. If building overloads the system, with this approach we can
    buffer as much data as memory permits and process it later. This is not possible with the sinogram-buffer approach. But we still have
    the socket buffers...
  - The stream-buffer removes one 'memcpy' unless zero-copy SocketXtreme is used. We also need to store the raw data with fastwriter, and
    the new external sinogram buffer could be used to store data for fastwriter as well, removing the need for a memcpy there. I.e. this is
    solved with SocketXtreme; otherwise there is either a performance penalty or additional complexity here.
? The stream-buffer simplifies lock management: we do not need to provide additional pipes to reduce the amount of required locking. Or is
  there a simple solution?
? Single-thread builder vs multi-thread builder?
? LRO (Large Receive Offload). Can we use it? How does it compare with LibVMA?

Dependencies
============
x C11 threads vs pthreads vs glib threads
* C11 threads are only supported starting with glibc 2.28, while Ubuntu 18.04 ships 2.27. There are work-arounds, but there is little
  advantage over pthreads.
* We will still try to use the atomics from C11.
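* A minimal example of that combination, pthreads for threading plus <stdatomic.h> for counters (the counter itself is just a placeholder):

        #include <pthread.h>
        #include <stdatomic.h>
        #include <stdio.h>

        static atomic_uint packets_received;

        static void *receiver(void *arg)
        {
            (void) arg;
            for (int i = 0; i < 1000; i++)
                atomic_fetch_add(&packets_received, 1);  /* lock-free counting */
            return NULL;
        }

        int main(void)
        {
            pthread_t threads[4];
            for (int i = 0; i < 4; i++)
                pthread_create(&threads[i], NULL, receiver, NULL);
            for (int i = 0; i < 4; i++)
                pthread_join(threads[i], NULL);
            printf("%u packets counted\n", atomic_load(&packets_received));
            return 0;
        }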