author     Suren A. Chilingaryan <csa@suren.me>    2020-09-03 03:00:30 +0200
committer  Suren A. Chilingaryan <csa@suren.me>    2020-09-03 03:00:30 +0200
commit     5172421d248250b4ab3b69eb57fd83656e23a4da (patch)
tree       a499d9f1dd0b74b754816884a59927b3171656fc /docs
parent     7b2e6168b049be9e7852b2d364d897592eff69fc (diff)
This is unfinished work implementing out-of-UFO network servers (HEAD, master)
Diffstat (limited to 'docs')
-rw-r--r--  docs/architecture.txt   98
-rw-r--r--  docs/mellanox.txt       88
-rw-r--r--  docs/todo.txt            7
3 files changed, 193 insertions, 0 deletions
diff --git a/docs/architecture.txt b/docs/architecture.txt
index 2b00ab3..e38da42 100644
--- a/docs/architecture.txt
+++ b/docs/architecture.txt
@@ -16,3 +16,101 @@ Configuration
- Metadata seems to be preserved while passing through standard UFO filters. So, this is the easiest way out.
- We can build a stand-alone subgraph for each plane, but this likely involves a full data copy.
- This is probably OK for the simple use case: "raw storage + visualization" as the only two instances, but could be too challenging for multi-plane analysis.
+
+
+Overloaded Mode
+===============
+ x How/when do we stop capturing in fixed-frame mode? [ After building the required number of frames ]
+ - After capturing the theoretical number of packets required for reconstruction (or plus one frame to account for a half-received frame)
+ * Initial packets are likely broken, as our experiments with LibVMA show... Likely we will receive fewer than expected...
+ * As a work-around, we can _always_ skip the first few frames to allow the system to become responsive.
+ * Overall, this seems to be the less reliable approach.
+ => After the initial building of the required number of frames.
+ * Unlike the previous case, this may never finish if processing is the bottleneck, as buffers will be overwritten...
+ * An alternative is to stop capturing frames if the buffers are exhausted...
+ x Do we pause receiving when the buffer is exhausted, or do we start overwriting (the original UFO variant overwrites)? [ stop; see the sketch at the end of this section ]
+ - For streaming, overwriting is better, as it is preferable to skip older frames rather than newer ones.
+ => But for capturing a fixed number of frames, we need to stop streaming (in the overloaded case) or the frames will be continuously overwritten.
+ * This is a more repeatable way to handle the frames, and something has to be optimized anyway if we are too slow in streaming mode,
+ but in this mode we can reliably get the first frames even if event-building is overloaded.
+ x Do we always skip the first frames to keep the system responsive, or do we process them normally? [ Not now, TBD later if necessary ]
+ - Technically, skipping the first frames could allow faster stabilization.
+ - Could only cause problems if streaming is started by some trigger and the first frames are really important.
+ - This would be mandatory in fixed-frame mode if stopping independently of reconstruction.
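+
+   A minimal sketch of the decisions above (fixed-frame mode stops once the required number of frames is built, and
+   receiving stops rather than overwriting when buffers are exhausted); all names here are hypothetical, not existing code:
+
+       #include <stddef.h>
+
+       /* Hypothetical stop condition checked by the receiving loop. */
+       static int roof_should_stop(size_t frames_built, size_t frames_requested, int buffers_exhausted) {
+           if (frames_requested && (frames_built >= frames_requested)) return 1;   /* fixed-frame mode is done */
+           if (buffers_exhausted) return 1;                                        /* stop instead of overwriting */
+           return 0;
+       }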
+
+Data Flow Model
+===============
+ x MP-RQ vs recvfrom_zcopy vs SocketXtreme
+ => MP-RQ is the lowest-overhead method (it seems to remove one memcpy) and its requirements fit our use-case
+
+ x How to abstract standard read / LibVMA read access?
+ => Return a pointer to the buffer, the number of packets, and the padding between packets (see the sketch below).
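+
+   A possible shape for this abstraction (only a sketch; the type and function names below are not an existing API,
+   they just illustrate the "pointer + packet count + padding" idea):
+
+       #include <stddef.h>
+       #include <stdint.h>
+
+       /* Result of one read: 'n_packets' packets, each 'packet_size' bytes of payload,
+        * starting at 'data' and separated by 'stride' bytes (packet_size + padding). */
+       typedef struct {
+           uint8_t *data;          /* pointer into the receive/ring buffer  */
+           size_t   n_packets;     /* number of packets available           */
+           size_t   packet_size;   /* payload size of a single packet       */
+           size_t   stride;        /* distance between consecutive packets  */
+       } roof_read_t;
+
+       /* Implemented once per backend: plain recvfrom(), recvfrom_zcopy, MP-RQ, ... */
+       int roof_backend_read(void *backend_ctx, roof_read_t *out);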
+
+ x Possible models: buffer independent streams or buffer sinograms (see the sketch after this list)
+ - Sinogram-buffer (push-model)
+ * Sinograms are buffered. The readers write directly to the appropriate place in the sinogram buffers and increment a fragment
+ counter for each buffer.
+ * Readers are paused on the first item outside of the current buffer and wait until the buffer start is advanced.
+ * After receiving a fragment for a new sinogram, the reader informs the controller about the number of missing fragments in the buffer.
+ E.g. counting 'missing' fragments using an atomic array (on top of the completed ones).
+ * The controller advances the buffer start after 'missing' is increased above the specified threshold.
+ - Stream-buffers (pull-model)
+ * Data is buffered directly in the independent receive-buffers. So, there is no memcpy in the receiving part of the code.
+ * The master builder thread determines the sinogram to build (the maximum amongst the buffers).
+ * Builder threads skip to the required sinogram and start copying data until a missing fragment is detected.
+ * On a missing fragment, a new sinogram is determined and the operation is restarted (when do we skip a large amount?).
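+
+   A minimal sketch of the sinogram-buffer bookkeeping described above (the names and the buffer count are
+   illustrative only, not an existing interface):
+
+       #include <stdatomic.h>
+       #include <stdint.h>
+
+       #define ROOF_SINO_BUFFERS 64                /* example: ring of in-flight sinograms */
+
+       typedef struct {
+           uint8_t    *data;                       /* pixels of one buffered sinogram               */
+           atomic_uint completed;                  /* fragments already written by reader threads   */
+           atomic_uint missing;                    /* fragments reported missing (drives advancing) */
+       } roof_sino_slot_t;
+
+       typedef struct {
+           roof_sino_slot_t slots[ROOF_SINO_BUFFERS];
+           atomic_ulong     first_sino;            /* id of the oldest sinogram still buffered      */
+       } roof_sino_ring_t;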
+
+ x Locking with the global buffer (no problems if buffering independent streams)
+ - We can synchronize "advance of the buffer start" and "copy out" with locks, as these are low-frequency events, but we need to ensure that
+ the per-fragment writes are lock-less.
+ - In the 'counting' scenario this is problematic:
+ - Different threads write to different parts of the buffer. If the buffer start moves, it is OK to finish the old fragment. It will be
+ rewritten by the same thread with new data later. No problems here.
+ - But how to handle fragment counting? Atomics are fine for concurrent threads. But if we move the buffer, the fragment count should not
+ be increased. This means we need to execute 'if' and 'inc' atomically (increase only if the buffer has not moved, and the move could happen between
+ 'if' and 'inc'). This is the main problem.
+ - We can push increments to a pipe and use the main thread (which also advances the start of the buffer) to read from the pipe and
+ increase the counts (or ignore them if the buffer has moved). Is this faster than locking (or do Linux pipes perform locking or kernel/user-space
+ switches anyway)?
+ ? Can we find alternative ways without locks/pipes? E.g. using hashes? Really large counting arrays? (one lock-free possibility is sketched below)
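+
+   One lock-free possibility for the 'if' + 'inc' problem is to pack a buffer generation (e.g. the id of the sinogram
+   currently owning the slot) together with the fragment count into a single 64-bit atomic and update it with
+   compare-and-swap. This is only a sketch of the idea, not existing code:
+
+       #include <stdatomic.h>
+       #include <stdint.h>
+
+       /* The sinogram id ('generation') and the fragment count share one 64-bit word,
+        * so "increment only if the buffer has not moved" becomes a single CAS. */
+       #define SLOT(gen, cnt)  (((uint64_t)(gen) << 32) | (uint32_t)(cnt))
+       #define SLOT_GEN(s)     ((uint32_t)((s) >> 32))
+       #define SLOT_CNT(s)     ((uint32_t)(s))
+
+       /* Returns the new count, or 0 if the slot meanwhile belongs to another sinogram. */
+       static uint32_t count_fragment(_Atomic uint64_t *slot, uint32_t sinogram_id) {
+           uint64_t old = atomic_load(slot);
+           for (;;) {
+               if (SLOT_GEN(old) != sinogram_id) return 0;           /* the buffer start has moved on */
+               uint64_t upd = SLOT(sinogram_id, SLOT_CNT(old) + 1);
+               if (atomic_compare_exchange_weak(slot, &old, upd)) return SLOT_CNT(upd);
+           }
+       }
+
+       /* Controller side: advancing the buffer start re-tags the slot and resets the count. */
+       static void reuse_slot(_Atomic uint64_t *slot, uint32_t new_sinogram_id) {
+           atomic_store(slot, SLOT(new_sinogram_id, 0));
+       }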
+
+ ? Performance considerations
+ - Sinogram-buffer advantage
+ - Sinogram-buffer works fine with re-ordered data streams. The stream-buffer approach can handle re-ordered streams in theory, but
+ with a large performance penalty and increased complexity. In fact, there is little reason why re-ordering should happen, and experiments
+ with the existing switch don't show any re-ordering.
+ - There are always uncached reads. However, with the sinogram-buffer it is an uncached copy of a large image, while with the stream-buffer
+ many small images are accessed uncached.
+ ? With the sinogram-buffer we can prepare data in advance and only a single memcpy will be required. With the stream-buffer all steps are performed
+ only once the buffer is available. But does it have any impact on performance?
+ ? With the stream-buffer, building _seems_ more complex and requires more synchronization (but only if we can find a simple method to avoid
+ locking in the sinogram-buffer scenario).
+
+ => Stream-buffer advantage
+ - With MP-RQ we can use the Mellanox ring-buffer (and probably Mellanox in general) as the stream-buffer.
+ - The stream-buffer incurs significantly less load during the reading phase. If building overloads the system, with this approach we can
+ buffer as much data as memory permits and process it later. This is not possible with the sinogram-buffer approach. But we still have
+ socket buffers...
+ - The stream-buffer removes one 'memcpy' unless zero-copy SocketXtreme is used. We also need to store raw data with fastwriter, and the
+ new external sinogram buffer could be used to store data for fastwriter as well, hence removing the necessity to memcpy there.
+ I.e. this is solved with SocketXtreme; otherwise there is either a performance penalty or additional complexity here.
+ ? The stream-buffer simplifies lock management. We don't need to provide additional pipes to reduce the amount of required locking. Or is
+ there a simple solution?
+
+ ? Single-thread builder vs multi-thread builder?
+
+ ? LRO (Large Receive Offload). Can we use it? How does it compare with LibVMA?
+
+
+
+
+
+
+Dependencies
+============
+ x C11 threads vs pthreads vs glib threads
+ * C11 threads are only supported starting with glibc 2.28. Ubuntu 18.04 ships 2.27. There are work-arounds, but there is
+ little advantage over pthreads.
+ * We will still try to use C11 atomics (see the example below).
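+
+   A minimal example of the intended combination, purely illustrative: pthreads for thread management,
+   C11 <stdatomic.h> for counters shared between threads.
+
+       #include <pthread.h>
+       #include <stdatomic.h>
+       #include <stdio.h>
+
+       static atomic_ulong packets_received;       /* shared counter, updated lock-free */
+
+       static void *receiver(void *arg) {
+           (void)arg;
+           atomic_fetch_add(&packets_received, 1); /* C11 atomic increment */
+           return NULL;
+       }
+
+       int main(void) {
+           pthread_t thr;
+           pthread_create(&thr, NULL, receiver, NULL);
+           pthread_join(thr, NULL);
+           printf("%lu packets\n", atomic_load(&packets_received));
+           return 0;
+       }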
+
+
diff --git a/docs/mellanox.txt b/docs/mellanox.txt
new file mode 100644
index 0000000..ed20048
--- /dev/null
+++ b/docs/mellanox.txt
@@ -0,0 +1,88 @@
+Terminology
+===========
+ - Send/Receive Queues
+ QP (Queue Pair): Combines RQ and SQ. Generally irrelevant for the following
+ RQ (Receive Queue):
+ SQ (Send Queue):
+ CQ (Completion Queue): Completed operations reported here
+ EQ (Event Queue): Completions generate events (at specified rate) which in turn generate IRQs
+ WR/WQ (Work Request / Work Queue): These are basically buffers (SG-lists) which should be either sent or used for data reception
+ *QE (* Queue Entry): e.g. WQE, CQE, EQE
+
+ Flow: WQE --submit work--> WQ --execute--> SQ/RQ --on completion--> CQ --signal--> EQ --> IRQ
+ * Completion Event Moderation: Reduce the number of reported events (EQ)
+
+ - Offloads
+ RSS (Receive Side Scaling): Distribute load across CPU cores
+ LRO (Large Receive Offload): Group packets and deliver them to user-space as a single large grouped packet [ 'ethtool -k' shows if LRO is on/off ]
+
+ - Various
+ AEV (Asynchronous Event): Errors, etc.
+ SRQ (Shared Receive Queue):
+ ICM (Interconnect Context Memory): Address Translation Tables, Control Objects, User Access Region (registers)
+ MPT (Memory Protection Table):
+ RMP (Receive Memory Pool):
+ TIR (Transport Interface Receive):
+ RQT (RQ Table):
+ MCG (Multicast Group):
+
+Driver
+======
+ - Network packets are streamed to ring buffers (with all Ethernet, IP, UDP/TCP headers).
+ The number of ring buffers depends on the VMA_RING_ALLOCATION parameter:
+ 0 - per network interface
+ 1 - per IP
+ => 10 - per socket
+ 20 - per thread (which was used to create the socket)
+ 30 - per core
+ 31 - per core (with some affinity of threads to cores)
+
+ - The memory for ring buffers is allocated based on VMA_MEM_ALLOC_TYPE:
+ 0 - malloc (this will be very slow if large buffers are requested)
+ 1 - contiguous
+ => 2 - HugePages
+
+ - The number of buffers is controlled with VMA_RX_BUFS (this is the total across all rings)
+ * Each buffer is VMA_MTU bytes
+ * Recommended: VMA_RX_BUFS ~ #rings * VMA_RX_WRE (the number of WREs allocated on all interfaces); see the example below
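+
+ An example configuration along these lines (the values are illustrative only, the exact parameter names should be
+ double-checked against the LibVMA documentation, and 'roof-server' is a hypothetical binary name):
+
+     VMA_RING_ALLOCATION=10      # ring buffer per socket
+     VMA_MEM_ALLOC_TYPE=2        # HugePages
+     VMA_MTU=9000                # per-packet buffer size, should match the interface MTU
+     VMA_RX_WRE=16000
+     VMA_RX_BUFS=32000           # ~ #rings * VMA_RX_WRE
+     LD_PRELOAD=libvma.so ./roof-server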
+
+LibVMA
+======
+ There are 3 interfaces:
+ - MP-RQ (Multi-packet Receive Queue): vma_cyclic_buffer_read
+ This is useful for processing data streams when the packet size stays constant and the packet flow doesn't change
+ drastically over time. Requires ConnectX-5 or newer.
+
+ * Use 'vma_add_ring_profile' to configure the size of the ring buffer (specifies the buffer size & the packet size)
+ * Set the per-socket SO_VMA_RING_ALLOC_LOGIC using setsockopt
+ * Call 'vma_cyclic_buffer_read' to access the raw ring buffer, specifying the minimum and maximum number of packets to return
+
+ * The returned 'completion' structure references the position in the ring buffer. Packets in the ring buffer
+ include all headers (Ethernet - 14 bytes, IP - 20 bytes, UDP - 8 bytes); see the payload sketch below.
+ * New packets are meanwhile written to the remaining part of the ring buffer (until the linear end of the
+ buffer; consequently, the returned data is not overwritten).
+ * The buffer is rewound only on a call to 'vma_cyclic_buffer_read'. Fewer than the specified minimum number of
+ packets can be returned if we are currently near the end of the buffer and there is not enough space to fulfill
+ the minimum requirement.
+
+ * To ensure enough space for the follow-up packets, the buffer size and the min/max packet counts have to be
+ kept in sync. It should never happen that space for only a few packets is left when the end of the buffer is
+ close.
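+
+ A sketch of how the header sizes above could be used to locate payloads inside the region returned by one cyclic
+ read; the fixed per-packet stride and the 'handle_payload' consumer are assumptions for illustration only:
+
+     #include <stddef.h>
+     #include <stdint.h>
+
+     /* Header sizes as listed above (assuming no VLAN tags and no IP options). */
+     enum { ETH_HDR = 14, IP_HDR = 20, UDP_HDR = 8, PKT_HDRS = ETH_HDR + IP_HDR + UDP_HDR };  /* 42 bytes */
+
+     void handle_payload(const uint8_t *payload, size_t size);   /* hypothetical consumer, defined elsewhere */
+
+     /* 'base' and 'n' come from the returned completion, 'stride' is the per-packet slot size of the ring. */
+     static void process_packets(const uint8_t *base, size_t n, size_t stride, size_t payload_size) {
+         for (size_t i = 0; i < n; i++) {
+             const uint8_t *payload = base + i * stride + PKT_HDRS;   /* skip Ethernet + IP + UDP headers */
+             handle_payload(payload, payload_size);
+         }
+     }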
+
+ - SocketXtreme: socketxtreme_poll
+ A more complex interface allowing more control over processing, particularly for packets with varying size.
+ Requires ConnectX-5 or newer.
+
+ * Get the ring buffers associated with a socket using 'get_socket_rings_num' and 'get_socket_rings_fds'
+ * Get ready completions on the specified ring buffer with 'socketxtreme_poll' (pass 'fd' returned with 'get_socket_rings_fds')
+ * Two types of completions: 'VMA_SOCKETXTREME_NEW_CONNECTION_ACCEPTED' and 'VMA_SOCKETXTREME_PACKET'.
+ * For the second type, process an associated list of buffers and keep reference counting with 'socketxtreme_ref_vma_buf',
+ 'socketxtreme_free_vma_buf'.
+ * Clean/unreference received packets with socketxtreme_free_vma_packets
+
+ - Zero Copy: recvfrom_zcopy
+ The simplest interface, working also with ConnectX-3 cards. The packets are still written to ring-buffers and the data is not copied out
+ of the ring buffers; this interface provides a way to get pointers to the locations in the ring buffer. There is a slight overhead compared
+ to the MP-RQ approach for preparing the list of packet pointers (a rough usage sketch follows below).
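+
+ A rough sketch of this interface (reconstructed from memory of LibVMA's vma_extra.h; the exact structure fields,
+ flags and header path should be verified against that header before use):
+
+     #include <mellanox/vma_extra.h>   /* LibVMA extra API (assumed header location) */
+     #include <stdio.h>
+
+     static void zcopy_read(int sock) {
+         struct vma_api_t *vma = vma_get_api();
+         if (!vma) return;                                  /* not running under LibVMA */
+
+         char buf[4096];                                    /* receives packet descriptors, not payload */
+         int flags = MSG_VMA_ZCOPY_FORCE;
+         int ret = vma->recvfrom_zcopy(sock, buf, sizeof(buf), &flags, NULL, NULL);
+         if (ret <= 0) return;
+
+         if (flags & MSG_VMA_ZCOPY) {
+             struct vma_packets_t *pkts = (struct vma_packets_t *)buf;
+             if (pkts->n_packet_num > 0) {
+                 struct vma_packet_t *p = &pkts->pkts[0];
+                 /* p->iov[0].iov_base points directly into the LibVMA ring buffer */
+                 printf("first packet: %zu byte(s) in first fragment\n", p->iov[0].iov_len);
+             }
+             vma->free_packets(sock, pkts->pkts, pkts->n_packet_num);   /* release ring-buffer space */
+         }
+     }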
+
+
diff --git a/docs/todo.txt b/docs/todo.txt
index a9ab4c8..07258f3 100644
--- a/docs/todo.txt
+++ b/docs/todo.txt
@@ -11,6 +11,13 @@ Main
- Try UFO visualization filter
- "Reconstructed data storage" and "Visualization + raw data storage" modes. Implement stand-alone 'roof-converter' filter.
+Network
+=======
+ - Implement MP-RQ and corresponding abstractions
+ - LRO (Large Receive Offload). Can we use it? How does it compare with LibVMA?
+ - Check that we can pre-allocate big enough buffers with LibVMA to receive the required number of sinograms without losses
+
+
If necessary
===========
- Task 'roof-ingest-missing' to ingest zero-padded broken frames (and include get_writer())