  1. 28 Jun, 2023 2 commits
    • ZIL: Fix another use-after-free. · a9d6b069
      Alexander Motin authored
      lwb->lwb_issued_txg cannot be accessed after lwb_state is set to
      LWB_STATE_FLUSH_DONE and zl_lock is dropped, since the lwb may be
      freed by zil_sync().  We must save the txg number before that.

      This is similar to 55b1842f, but as far as I can see the bug is not
      new: it has existed for quite a while and simply was not triggered
      due to a smaller race window.  A sketch of the pattern follows this
      entry.
      Reviewed-by: Allan Jude <allan@klarasystems.com>
      Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
      Signed-off-by: Alexander Motin <mav@FreeBSD.org>
      Sponsored by:	iXsystems, Inc.
      Closes #14988
      Closes #14999
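      The fix is an instance of a general save-before-unlock pattern.
      Below is a minimal sketch using stand-in names (struct lwb_like,
      issued_txg, zl_lock_like), not the actual zil.c code: copy any
      field you still need out of the object before the state change
      that lets another thread free it.

      #include <pthread.h>
      #include <stdint.h>
      #include <stdio.h>

      #define STATE_FLUSH_DONE 3   /* stand-in for LWB_STATE_FLUSH_DONE */

      struct lwb_like {                 /* stand-in for struct lwb */
              uint64_t issued_txg;      /* stand-in for lwb_issued_txg */
              int state;
      };

      static pthread_mutex_t zl_lock_like = PTHREAD_MUTEX_INITIALIZER;

      static void
      flush_done(struct lwb_like *lwb)
      {
              pthread_mutex_lock(&zl_lock_like);
              /* Save the txg while the lwb is still guaranteed alive. */
              uint64_t txg = lwb->issued_txg;
              lwb->state = STATE_FLUSH_DONE;  /* lwb may be freed now */
              pthread_mutex_unlock(&zl_lock_like);

              /* Use only the saved copy from here on. */
              printf("flushed lwb from txg %llu\n",
                  (unsigned long long)txg);
      }

      int
      main(void)
      {
              struct lwb_like lwb = { .issued_txg = 42, .state = 0 };
              flush_done(&lwb);
              return (0);
      }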
    • Use big transactions for small recordsize writes. · b0cbc1aa
      Alexander Motin authored
      
      When ZFS appends to files in chunks bigger than the recordsize, it
      borrows a buffer from the ARC and fills it before opening a
      transaction.  This is supposed to help in case of page faults, by
      not holding a transaction open indefinitely.  The problem appears
      when the recordsize is set lower than the default 128KB: since each
      block is committed in a separate transaction, per-transaction
      overhead becomes significant, and, even worse, the active use of
      per-dataset and per-pool locks to protect space-use accounting for
      each transaction badly hurts the code's SMP scalability.  The same
      transaction size limitation applies to file rewrites, but without
      even the excuse of buffer borrowing.

      To address the issue, disable the borrowing mechanism if the
      recordsize is smaller than the default and the write request is 4x
      bigger than it.  In that case writes of up to 32MB are executed in
      a single transaction, which dramatically reduces overhead and lock
      contention (see the sketch after this entry).  Since the borrowing
      mechanism is not used for file rewrites, and it was never used by
      zvols, which seem to work fine, I don't think this change should
      create significant problems, partially because, in addition to the
      borrowing mechanism, pre-faulting is also used.

      My tests with 4/8 threads writing several files at the same time on
      datasets with a 32KB recordsize in 1MB requests show a 25-35%
      reduction of CPU usage by the user threads.  I would measure it in
      GB/s, but at that block size we are now limited by the lock
      contention of the single write issue taskqueue, which is a separate
      problem we are going to work on.
      Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
      Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
      Signed-off-by: Alexander Motin <mav@FreeBSD.org>
      Sponsored by:	iXsystems, Inc.
      Closes #14964 
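      A minimal sketch of the chunking policy described above, with
      hypothetical names (write_chunk_size() is not the actual DMU
      code): skip borrowing and commit up to 32MB per transaction when
      the recordsize is below the 128KB default and the request is at
      least 4x the recordsize.

      #include <stdint.h>
      #include <stdio.h>

      #define DEFAULT_RECORDSIZE (128 * 1024)
      #define MAX_TX_WRITE (32 * 1024 * 1024)

      /* Bytes to commit in the next transaction of this write. */
      static uint64_t
      write_chunk_size(uint64_t blksz, uint64_t resid)
      {
              if (blksz < DEFAULT_RECORDSIZE && resid >= 4 * blksz) {
                      /* Big-transaction path: no borrowing, up to 32MB. */
                      uint64_t n = resid > MAX_TX_WRITE ?
                          MAX_TX_WRITE : resid;
                      return (n - n % blksz);   /* whole blocks only */
              }
              return (blksz < resid ? blksz : resid); /* one block per tx */
      }

      int
      main(void)
      {
              /*
               * A 1MB request at 32KB recordsize: one 1MB transaction
               * instead of 32 single-block transactions.
               */
              printf("%llu\n", (unsigned long long)
                  write_chunk_size(32 * 1024, 1024 * 1024));
              return (0);
      }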
  2. 27 Jun, 2023 2 commits
    • Laevos authored · bc9d0084
    • Another set of vdev queue optimizations. · 8469b5aa
      Alexander Motin authored
      
      Switch the FIFO queues (SYNC/TRIM) and the active queue of the vdev
      queue from time-sorted AVL trees to simple lists.  AVL trees are
      too expensive for such a simple task.  To change I/O priority
      without searching through the trees, add an io_queue_state field to
      struct zio.

      To avoid checking the number of queued I/Os for each priority, add
      a vq_cqueued bitmap to struct vdev_queue and update it when adding
      or removing I/Os.  Make vq_cactive a separate array instead of a
      struct vdev_queue_class member.  Together these avoid many cache
      misses when looking for work in vdev_queue_class_to_issue() (see
      the sketch after this entry).

      Introduce a deadline of ~0.5s for LBA-sorted queues.  Before this I
      saw some I/Os waiting in a queue for up to 8 seconds and possibly
      more due to starvation.  With this change I no longer see it.  I
      had to make the comparison function slightly more complicated, but
      since it uses all the same cache lines the difference is minimal.
      For sequential I/Os the new code in vdev_queue_io_to_issue() often
      uses the simpler avl_first(), falling back to avl_find() and
      avl_nearest() only when needed.

      Arrange the members in struct zio so that only one cache line is
      accessed when searching through vdev queues.  While there, remove
      io_alloc_node, reusing io_queue_node instead.  The two are never
      used at the same time.

      Remove the zfs_vdev_aggregate_trim parameter.  It has been disabled
      for the 4 years since it was implemented, while time was still
      wasted maintaining the offset-sorted tree of TRIM requests.  Just
      remove the tree.

      Remove locking from txg_all_lists_empty().  It is racy by design,
      while its two pairs of lock/unlock operations took noticeable time
      under the vdev queue lock.

      With these changes, in my tests with volblocksize=4KB I measure a
      vdev queue lock spin time reduction of 50% on read and 75% on
      write.
      Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
      Signed-off-by: Alexander Motin <mav@FreeBSD.org>
      Sponsored by:	iXsystems, Inc.
      Closes #14925 
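      A minimal sketch of the two data-structure ideas above, with
      invented names: per-priority simple lists instead of AVL trees,
      plus a bitmap of non-empty classes so the issue path can find work
      without touching every class's cache lines.

      #include <stdint.h>
      #include <stdio.h>
      #include <strings.h>              /* ffs() */
      #include <sys/queue.h>            /* TAILQ_*: the "simple lists" */

      #define NPRIO 8                   /* number of priority classes */

      struct io {                       /* stand-in for struct zio */
              int prio;
              TAILQ_ENTRY(io) link;     /* stand-in for io_queue_node */
      };
      TAILQ_HEAD(io_list, io);

      struct queue {                    /* stand-in for struct vdev_queue */
              uint32_t cqueued;         /* bit p set => class p non-empty */
              struct io_list classes[NPRIO];
      };

      static void
      enqueue(struct queue *q, struct io *io)
      {
              TAILQ_INSERT_TAIL(&q->classes[io->prio], io, link);
              q->cqueued |= 1u << io->prio;
      }

      /* Lower numbers are assumed to mean higher priority here. */
      static struct io *
      dequeue_highest(struct queue *q)
      {
              if (q->cqueued == 0)
                      return (NULL);
              int p = ffs((int)q->cqueued) - 1; /* lowest set bit */
              struct io *io = TAILQ_FIRST(&q->classes[p]);
              TAILQ_REMOVE(&q->classes[p], io, link);
              if (TAILQ_EMPTY(&q->classes[p]))
                      q->cqueued &= ~(1u << p);
              return (io);
      }

      int
      main(void)
      {
              struct queue q = { .cqueued = 0 };
              for (int p = 0; p < NPRIO; p++)
                      TAILQ_INIT(&q.classes[p]);
              struct io a = { .prio = 3 }, b = { .prio = 1 };
              enqueue(&q, &a);
              enqueue(&q, &b);
              printf("first prio: %d\n", dequeue_highest(&q)->prio);
              return (0);
      }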
  3. 26 Jun, 2023 1 commit
    • Add a delay to tearing down threads. · 35a6247c
      Rich Ercolani authored
      
      It's been observed that in certain workloads (zvol-related being a
      big one), ZFS will end up spending a large amount of time spinning
      up taskqs only to tear them down again almost immediately, then
      spin them up again...

      I noticed this when I looked at what my mostly-idle system was
      doing and wondered how on earth taskq creation/destruction was
      taking up so much time...

      So I added a configurable delay so that a taskq is not torn down
      the first time it is noticed idle (see the sketch after this
      entry).  The total number of threads at steady state went up, but
      the amount of time burned tearing threads down and spinning them
      back up almost vanished.
      Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
      Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
      Closes #14938 
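      A minimal sketch of the idea, with invented names rather than the
      actual SPL taskq code: an idle worker waits one configurable grace
      period and exits only if no work arrived in the meantime.

      #include <pthread.h>
      #include <stdbool.h>
      #include <stdio.h>
      #include <time.h>

      struct worker {
              pthread_mutex_t lock;
              pthread_cond_t cv;        /* signaled when work is queued */
              int nqueued;              /* pending work items */
              unsigned grace_ms;        /* configurable teardown delay */
      };

      /* Returns true if the idle thread should exit, false otherwise. */
      static bool
      idle_should_exit(struct worker *w)
      {
              struct timespec ts;
              clock_gettime(CLOCK_REALTIME, &ts);
              ts.tv_sec += w->grace_ms / 1000;
              ts.tv_nsec += (long)(w->grace_ms % 1000) * 1000000L;
              if (ts.tv_nsec >= 1000000000L) {
                      ts.tv_sec++;
                      ts.tv_nsec -= 1000000000L;
              }
              pthread_mutex_lock(&w->lock);
              while (w->nqueued == 0) {
                      /* Sleep at most grace_ms before really exiting. */
                      if (pthread_cond_timedwait(&w->cv, &w->lock,
                          &ts) != 0) {
                              pthread_mutex_unlock(&w->lock);
                              return (true);  /* still idle: tear down */
                      }
              }
              pthread_mutex_unlock(&w->lock);
              return (false);           /* work arrived: stay alive */
      }

      int
      main(void)
      {
              static struct worker w = {
                      .lock = PTHREAD_MUTEX_INITIALIZER,
                      .cv = PTHREAD_COND_INITIALIZER,
                      .nqueued = 0,
                      .grace_ms = 100,
              };
              printf("exit: %s\n", idle_should_exit(&w) ? "yes" : "no");
              return (0);
      }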
  4. 15 Jun, 2023 2 commits
    • Shorten arcstat_quiescence sleep time · 10e36e17
      George Amanakis authored
      
      With the latest L2ARC fixes, 2 seconds is too long to wait for
      quiescence of arcstats like l2_size. Shorten this interval to avoid
      having the persistent L2ARC tests in ZTS prematurely terminated.
      Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
      Signed-off-by: George Amanakis <gamanakis@gmail.com>
      Closes #14981 
    • Remove ARC/ZIO physdone callbacks. · ccec7fbe
      Alexander Motin authored
      
      Those callbacks were introduced many years ago as part of a bigger
      patch to smooth the write throttling within a txg.  They allow
      accounting for the completion of individual physical writes within
      a logical one, improving cases when some of the physical writes
      complete much sooner than others, gradually opening the write
      throttle.

      A few years after that, ZFS got allocation throttling, which works
      at the level of logical writes, limiting the number of writes
      queued to vdevs at any point and so limiting the latency
      distribution between the physical writes, especially between writes
      of multiple copies.  The scheduling deadline I proposed in #14925
      should further reduce the latency distribution.  Memory sizes grown
      over the past 10 years should also reduce the importance of the
      smoothing.

      While the use of the physdone callback may still in theory provide
      somewhat smoother throttling, there are cases where we simply can
      not afford it.  Since dirty data accounting is protected by a
      pool-wide lock, in the case of 6-wide RAIDZ, for example, it
      requires us to take that lock 8 times per logical block write,
      creating huge lock contention (see the sketch after this entry).

      My tests of this patch show a radical reduction of lock spinning
      time on workloads where smaller blocks are written to RAIDZ pools,
      with each of the disks receiving 8-16KB chunks but the total rate
      reaching 100K+ blocks per second.  At the same time, attempts to
      measure any write time fluctuations didn't show anything
      noticeable.

      While there, also remove the io_child_count/io_parent_count
      counters.  They are used only for a couple of assertions that can
      be avoided.
      Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
      Signed-off-by: Alexander Motin <mav@FreeBSD.org>
      Sponsored by:	iXsystems, Inc.
      Closes #14948 
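      A minimal sketch of the contention being removed, with invented
      names: accounting under a pool-wide lock once per physical write
      multiplies lock traffic by the number of physical I/Os per logical
      block, while accounting once per logical write takes the lock a
      single time.

      #include <pthread.h>
      #include <stdint.h>
      #include <stdio.h>

      static pthread_mutex_t dp_lock = PTHREAD_MUTEX_INITIALIZER;
      static uint64_t dirty_bytes;      /* pool-wide dirty accounting */

      /* Old scheme: called from every physical-write done callback. */
      static void
      physdone_account(uint64_t bytes, int nphys)
      {
              for (int i = 0; i < nphys; i++) { /* nphys lock round-trips */
                      pthread_mutex_lock(&dp_lock);
                      dirty_bytes -= bytes / nphys;
                      pthread_mutex_unlock(&dp_lock);
              }
      }

      /* New scheme: called once when the logical write completes. */
      static void
      logical_done_account(uint64_t bytes)
      {
              pthread_mutex_lock(&dp_lock);   /* one lock round-trip */
              dirty_bytes -= bytes;
              pthread_mutex_unlock(&dp_lock);
      }

      int
      main(void)
      {
              dirty_bytes = 2 * 131072;
              physdone_account(131072, 8);  /* e.g. the 6-wide RAIDZ case */
              logical_done_account(131072);
              printf("%llu\n", (unsigned long long)dirty_bytes);
              return (0);
      }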
  5. 10 Jun, 2023 1 commit
    • Fix the L2ARC write size calculating logic (2) · feff9dfe
      George Amanakis authored
      While commit bcd53210 adjusts the write size based on the size of
      the log block, this happens after comparing the unadjusted write
      size to the evicted (target) size.
      
      In this case l2ad_hand will exceed l2ad_evict and violate an assertion
      at the end of l2arc_write_buffers().
      
      Fix this by adding the max log block size to the allocated size of
      the buffer to be committed before comparing the result to the
      target size (see the sketch after this entry).
      
      Also reset the l2arc_trim_ahead ZFS module variable when the adjusted
      write size exceeds the size of the L2ARC device.
      Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
      Signed-off-by: George Amanakis <gamanakis@gmail.com>
      Closes #14936
      Closes #14954 
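      A minimal sketch of the corrected check, with invented names and a
      placeholder constant: the maximum log block size is added to the
      bytes already staged before comparing against the eviction target,
      so the write hand can never pass the evict hand.

      #include <stdbool.h>
      #include <stdint.h>
      #include <stdio.h>

      #define MAX_LOG_BLKSZ (64 * 1024) /* placeholder, not the real one */

      /* May we add another buffer of size psize to this L2ARC write? */
      static bool
      l2arc_fits(uint64_t staged, uint64_t psize, uint64_t target)
      {
              /* Account for the trailing log block up front. */
              return (staged + psize + MAX_LOG_BLKSZ <= target);
      }

      int
      main(void)
      {
              /* 60K staged + 8K buffer + 64K log block > 128K target. */
              printf("%s\n", l2arc_fits(60 * 1024, 8 * 1024,
                  128 * 1024) ? "fits" : "stop");
              return (0);
      }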