1. 30 Mar, 2021 1 commit
    • Update META · 9ac82cab
      Brian Behlendorf authored
      
      Increase the version to 2.1.99 to indicate the master branch is
      newer than the 2.1.x release.  This ensures packages built from
      the master branch are considered to be newer than the last release.
      Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
  2. 29 Mar, 2021 1 commit
  3. 26 Mar, 2021 5 commits
  4. 22 Mar, 2021 2 commits
  5. 20 Mar, 2021 10 commits
    • Split dmu_zfetch() speculation and execution parts · 891568c9
      Alexander Motin authored
      
      To make better predictions on parallel workloads, dmu_zfetch() should
      be called as early as possible to reduce possible request reordering.
      In particular, it should be called before dmu_buf_hold_array_by_dnode()
      calls dbuf_hold(), which may sleep waiting for indirect blocks, waking
      up multiple threads at the same time on completion, which can
      significantly reorder the requests and make the stream look random.
      But we should not issue prefetch requests before the on-demand ones,
      since they may get to the disks first despite the I/O scheduler,
      increasing on-demand request latency.
      
      This patch splits dmu_zfetch() into two functions: dmu_zfetch_prepare()
      and dmu_zfetch_run().  The first can be executed as early as needed.
      It only updates statistics and makes predictions without issuing any
      I/Os.  The I/O issuance is handled by dmu_zfetch_run(), which can be
      called later when all on-demand I/Os are already issued.  It even
      tracks the activity of other concurrent threads, issuing the prefetch
      only when _all_ on-demand requests are issued.
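
      As a rough illustration only (the argument lists are approximate and
      issue_demand_reads() is a made-up placeholder, not an OpenZFS
      function), the intended two-phase call ordering looks like this:

          /* Sketch of the new prepare/run flow, not the actual code. */
          static void
          read_range(dnode_t *dn, uint64_t blkid, uint64_t nblks)
          {
                  /* Predict early, before anything can sleep or reorder. */
                  zstream_t *zs = dmu_zfetch_prepare(&dn->dn_zfetch,
                      blkid, nblks, B_TRUE, B_FALSE);

                  /* Issue the on-demand reads (may block on indirects). */
                  issue_demand_reads(dn, blkid, nblks);

                  /* Only after the demand I/Os are out, fire the prefetch. */
                  if (zs != NULL)
                          dmu_zfetch_run(zs, B_TRUE, B_FALSE);
          }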
      
      For many years this has been a problem for storage servers handling
      deep request queues from their clients: they had to either serialize
      consecutive reads to keep the ZFS prefetcher usable, or execute the
      incoming requests as-is and get almost no prefetch from ZFS, relying
      only on sufficiently deep prefetch by the clients.  The benefits of
      those approaches varied, but neither was perfect.  With this patch,
      deep-queue sequential read benchmarks with CrystalDiskMark from
      Windows via iSCSI to a FreeBSD target show much better throughput
      with an almost 100% prefetcher hit rate, compared to almost zero
      before.
      
      While there, I also removed the per-stream zs_lock as useless, since
      it is completely covered by the parent zf_lock.  I also reused the
      zs_blocks refcount to track the zf_stream linkage of the stream,
      since I believe the previous zs_fetch == NULL check in
      dmu_zfetch_stream_done() was racy.
      
      Delete prefetch streams when they reach the end of their files.  This
      saves up to 1KB of RAM per file and reduces searches through the
      stream list.
      
      Block data prefetch (speculation and indirect block prefetch are
      still done, since they are cheaper) if all dbufs of the stream are
      already in the DMU cache.  The first cache miss immediately fires
      all the prefetch that would have been done for the stream by that
      time.  This saves some CPU time if the same files, fitting within
      the DMU cache capacity, are read over and over.
      Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
      Reviewed-by: Adam Moss <c@yotes.com>
      Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
      Signed-off-by: Alexander Motin <mav@FreeBSD.org>
      Sponsored-By: iXsystems, Inc.
      Closes #11652
    • Fix zfs_get_data access to files with wrong generation · 296a4a36
      Chunwei Chen authored
      
      If a TX_WRITE record is created on a file, and the file is later
      deleted and a new directory is created on the same object id, it is
      possible that when zil_commit happens, zfs_get_data will be called
      on the new directory.  This may result in a panic as it tries to
      take a range lock.

      This patch fixes the issue by recording the generation number during
      zfs_log_write, so zfs_get_data can check whether the object is still
      valid.
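
      A rough sketch of the idea (simplified, not the actual patch; the
      zfs_get_data_check_gen() wrapper below is a made-up name):

          /*
           * The generation recorded at zfs_log_write() time is compared
           * with the generation of the object found at zil_commit() time,
           * so a recycled object id is detected instead of panicking.
           */
          static int
          zfs_get_data_check_gen(zfsvfs_t *zfsvfs, uint64_t object,
              uint64_t gen, znode_t **zpp)
          {
                  znode_t *zp;
                  int error = zfs_zget(zfsvfs, object, &zp);
                  if (error != 0)
                          return (error);
                  if (zp->z_gen != gen) {
                          /* Object id was reused by another object. */
                          zfs_zrele_async(zp);
                          return (SET_ERROR(ENOENT));
                  }
                  *zpp = zp;
                  return (0);
          }
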
      Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
      Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
      Closes #10593
      Closes #11682
    • Fix regression in POSIX mode behavior · 66e6d3f1
      Andrew authored
      Commit 235a8565 introduced a regression in the evaluation of POSIX
      modes that require group DENY entries in the internal ZFS ACL. An
      example of such a POSIX mode is 007. When write_implies_delete_child
      is set, ACE_WRITE_DATA is added to `wanted_dirperms` prior to calling
      zfs_zaccess_common(). This occurs in zfs_zaccess_delete().
      
      Unfortunately, when zfs_zaccess_aces_check() hits this particular
      DENY ACE, zfs_groupmember() is checked to determine whether access
      should be denied, and since zfs_groupmember() always returns B_TRUE
      on Linux, this check fails, ultimately resulting in EPERM being
      returned.
      Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
      Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
      Signed-off-by: Andrew Walker <awalker@ixsystems.com>
      Closes #11760
    • ZTS: New test for kernel panic induced by redacted send · c2385075
      Palash Gandhi authored
      This change adds a new test that covers a fix for a bug in the binary
      search in the redacted send resume logic that caused a kernel panic.
      The bug was fixed in https://github.com/openzfs/zfs/pull/11297.
      Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
      Co-authored-by: John Kennedy <john.kennedy@delphix.com>
      Signed-off-by: Palash Gandhi <palash.gandhi@delphix.com>
      Closes #11764
    • Allow setting bootfs property on pools with indirect vdevs · cd5b8128
      Martin Matuška authored
      
      The FreeBSD boot loader relies on the bootfs property and is capable
      of booting from removed (indirect) vdevs.
      
      Reviewed-by: Eric van Gyzen
      Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
      Signed-off-by: Martin Matuska <mm@FreeBSD.org>
      Closes #11763
    • Fix typo in zgenhostid.8 · 0ab84bff
      Ryan Moeller authored
      Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
      Reviewed-by: George Melikov <mail@gmelikov.ru>
      Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
      Closes #11770
    • Removing old code for k(un)map_atomic · f52124dc
      Brian Atkinson authored
      
      It used to be required to pass an enum km_type to kmap_atomic() and
      kunmap_atomic(); however, this is no longer necessary and the
      zfs_k(un)map_atomic wrappers removed these arguments.  This is
      confusing in the ABD code, as the struct abd_iter member iter_km no
      longer exists and the wrapper macros simply compile them out.
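
      For illustration only (not the exact OpenZFS definitions), the
      wrappers go from carrying a dead argument to dropping it entirely:

          /* Before: an unused km_type argument had to be passed along. */
          /*   #define zfs_kmap_atomic(page, km)     kmap_atomic(page)   */
          /*   #define zfs_kunmap_atomic(addr, km)   kunmap_atomic(addr) */

          /* After: the wrappers take only what the kernel API needs. */
          #define zfs_kmap_atomic(page)     kmap_atomic(page)
          #define zfs_kunmap_atomic(addr)   kunmap_atomic(addr)
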
      Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
      Reviewed-by: Adam Moss <c@yotes.com>
      Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
      Closes #11768
    • Initialize metaslab range trees in metaslab_init · 793c958f
      Serapheim Dimitropoulos authored
      
      = Motivation
      
      We've noticed several zloop crashes within Delphix generated
      due to the following sequence of events:
      
      - A device gets expanded and new metaslabs are allocated for
        it. These metaslabs go through `metaslab_init()` but haven't
        gone through `metaslab_sync_done()` yet. This means that the
        only range tree that's actually set is `ms_allocatable`.
        All the others are NULL.
      
      - A vdev initialization is issued and `vdev_initialize_thread`
        starts processing one of these new metaslabs of the expanded
        vdev.
      
      - As part of `vdev_initialize_calculate_progress()` we call
        into `metaslab_load()` and `metaslab_load_impl()`, which
        in turn tries to dereference the metaslab's range trees that
        are still NULL, and therefore we crash.
      
      The same failure can come up from the `vdev_trim` code paths.
      
      = This Patch
      
      We considered the following solutions to deal with this issue:
      
      [A] Add logic to `vdev_initialize/trim` to skip those new
          metaslabs. We decided against this as it would be good
          to avoid exposing this lower-level detail to higher-level
          operations.
      
      [B] Have `metaslab_load_impl()` return early for new metaslabs
          and thus never touch those range_trees that are NULL at
          that time. This seemed more of a work-around for the bug
          and not a clear-cut solution.
      
      [C] Refactor our logic so all metaslabs have their range_trees
          created at the time of their creation in `metaslab_init()`
          (a rough sketch follows below).
      
      In this patch we decided to go with [C] because:
      
      (1) It doesn't expose more metaslab details to higher level
          operations such as vdev initialize and trim.
      
      (2) The current behavior of creating the range trees lazily
          in `metaslab_sync_done()` is unnecessarily complicated.
      
      (3) Always initializing the metaslab range_trees makes other
          parts of the codebase cleaner. For example, we used to
          use `ms_freed` as the reference value for knowing whether
          all the range_trees have been initialized. Now we no
          longer need to do that check in most places (and in the
          few that we do we use the `ms_new` boolean field now
          which is more readable).
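
      A rough sketch of what [C] amounts to (simplified; the real
      metaslab_init() also creates the allocating and defer tree arrays
      and uses the vdev's actual segment type and shift):

          /* All range trees now exist as soon as the metaslab does. */
          ms->ms_allocatable = range_tree_create(NULL, RANGE_SEG64,
              NULL, 0, 0);
          ms->ms_freeing = range_tree_create(NULL, RANGE_SEG64, NULL, 0, 0);
          ms->ms_freed = range_tree_create(NULL, RANGE_SEG64, NULL, 0, 0);
          ms->ms_checkpointing = range_tree_create(NULL, RANGE_SEG64,
              NULL, 0, 0);
          ms->ms_new = B_TRUE;    /* replaces the old ms_freed-based check */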
      
      = Side Changes
      
      Probably due to a mismerge we set `ms_loaded` to `B_TRUE` twice
      in `metaslab_load_impl()`. In this patch we remove the extraneous
      assignment.
      Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
      Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
      Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
      Closes #11737
    • Linux 5.12 update: bio_max_segs() replaces BIO_MAX_PAGES · ffd6978e
      Coleman Kane authored
      
      The BIO_MAX_PAGES macro is being retired in favor of a bio_max_segs()
      function that implements the typical MIN(x,y) logic used throughout
      the kernel for bounding the allocation; the new implementation is
      also intended to be signed-safe (which the former was not).
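
      A hedged sketch of a typical compat shim for this change (the
      HAVE_BIO_MAX_SEGS configure macro name is an assumption here):

          #ifndef HAVE_BIO_MAX_SEGS
          /* Older kernels: emulate the helper with the classic bound. */
          #define bio_max_segs(nr_segs)   MIN((nr_segs), BIO_MAX_PAGES)
          #endif
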
      Reviewed-by: Tony Hutter <hutter2@llnl.gov>
      Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
      Signed-off-by: Coleman Kane <ckane@colemankane.org>
      Closes #11765
    • Linux 5.12 compat: idmapped mounts · e2a82961
      Coleman Kane authored
      
      In Linux 5.12, the filesystem API was modified to support idmapped
      mounts by adding a "struct user_namespace *" parameter to a number
      of functions and VFS handlers. This change adds the needed autoconf
      macros to detect the new interfaces and updates the code
      appropriately.  This change does not add support for idmapped
      mounts; instead it preserves the existing behavior by passing the
      initial user namespace where needed.  A subsequent commit will be
      required to add support for idmapped mounts.
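
      For illustration only (the configure macro below is a placeholder
      for one of the new autoconf checks), the resulting pattern in the
      VFS glue looks something like this:

          #ifdef HAVE_IOPS_SETATTR_USERNS
          /* Linux 5.12+: the handler gains a user namespace argument. */
          static int zpl_setattr(struct user_namespace *user_ns,
              struct dentry *dentry, struct iattr *ia);
          #else
          static int zpl_setattr(struct dentry *dentry, struct iattr *ia);
          #endif

          /*
           * Since idmapped mounts are not supported yet, &init_user_ns is
           * passed wherever the new kernel interfaces require a namespace.
           */
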
      Reviewed-by: Tony Hutter <hutter2@llnl.gov>
      Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
      Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
      Signed-off-by: Coleman Kane <ckane@colemankane.org>
      Closes #11712
  6. 19 Mar, 2021 1 commit
    • Clean up RAIDZ/DRAID ereport code · 330c6c05
      Matthew Ahrens authored
      
      The RAIDZ and DRAID code is responsible for reporting checksum errors on
      their child vdevs.  Checksum errors represent events where a disk
      returned data or parity that should have been correct, but was not.  In
      other words, these are instances of silent data corruption.  The
      checksum errors show up in the vdev stats (and thus `zpool status`'s
      CKSUM column), and in the event log (`zpool events`).
      
      Note, this is in contrast with the more common "noisy" errors where a
      disk goes offline, in which case ZFS knows that the disk is bad and
      doesn't try to read it, or the device returns an error on the requested
      read or write operation.
      
      RAIDZ/DRAID generate checksum errors via three code paths:
      
      1. When RAIDZ/DRAID reconstructs a damaged block, checksum errors are
      reported on any children whose data was not used during the
      reconstruction.  This is handled in `raidz_reconstruct()`.  This is the
      most common type of RAIDZ/DRAID checksum error.
      
      2. When RAIDZ/DRAID is not able to reconstruct a damaged block, that
      means that the data has been lost.  The zio fails and an error is
      returned to the consumer (e.g. the read(2) system call).  This would
      happen if, for example, three different disks in a RAIDZ2 group are
      silently damaged.  Since the damage is silent, it isn't possible to know
      which three disks are damaged, so a checksum error is reported against
      every child that returned data or parity for this read.  (For DRAID,
      typically only one "group" of children is involved in each io.)  This
      case is handled in `vdev_raidz_cksum_finish()`. This is the next most
      common type of RAIDZ/DRAID checksum error.
      
      3. If RAIDZ/DRAID is not able to reconstruct a damaged block (like in
      case 2), but there happens to be additional copies of this block due to
      "ditto blocks" (i.e. multiple DVA's in this blkptr_t), and one of those
      copies is good, then RAIDZ/DRAID compares each sector of the data or
      parity that it retrieved with the good data from the other DVA, and if
      they differ then it reports a checksum error on this child.  This
      differs from case 2 in that the checksum error is reported on only the
      subset of children that actually have bad data or parity.  This case
      happens very rarely, since normally only metadata has ditto blocks.  If
      the silent damage is extensive, there will be many instances of case 2,
      and the pool will likely be unrecoverable.
      
      The code for handling case 3 is considerably more complicated than the
      other cases, for two reasons:
      
      1. It needs to run after the main raidz read logic has completed.
      The data read by RAIDZ needs to be preserved until after the
      alternate DVA has been read, which necessitates refcounts and
      callbacks managed by the non-raidz-specific zio layer.
      
      2. It's nontrivial to map the sections of data read by RAIDZ to the
      correct data.  For example, the correct data does not include the parity
      information, so the parity must be recalculated based on the correct
      data, and then compared to the parity that was read from the RAIDZ
      children.
      
      Due to the complexity of case 3, the rareness of hitting it, and the
      minimal benefit it provides above case 2, this commit removes the code
      for case 3.  These types of errors will now be handled the same as case
      2, i.e. the checksum error will be reported against all children that
      returned data or parity.
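
      Illustrative pseudocode for the case 2 behavior that now also covers
      case 3 (field and helper names are approximate, not the exact code):

          /*
           * Reconstruction failed: we cannot tell which child was silently
           * damaged, so report a checksum error against every child that
           * returned data or parity for this read.
           */
          for (int c = 0; c < rr->rr_cols; c++) {
                  raidz_col_t *rc = &rr->rr_col[c];
                  if (rc->rc_tried && rc->rc_error == 0)
                          vdev_raidz_checksum_error(zio, rc, rc->rc_abd);
          }
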
      Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
      Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
      Closes #11735
  7. 18 Mar, 2021 3 commits
  8. 16 Mar, 2021 5 commits
  9. 13 Mar, 2021 4 commits
  10. 12 Mar, 2021 8 commits