- 30 Jun, 2023 7 commits
-
-
Rich Ercolani authored
The DDT is really inefficient on 4K and larger vdevs, because it always allocates 4K blocks, and while compression could save us somewhat at ashift 9, that stops being true at larger ashifts. So let's change the default to 32 KiB, which seems like a reasonable compromise between improved space savings and inflated write sizes for DDT updates.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes #14654
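As a rough standalone illustration of the space argument above (plain C, not OpenZFS code): on an ashift=12 vdev every allocation rounds up to 4K sectors, so a 4K DDT block gains nothing from compression, while a larger block amortizes the rounding.

```c
#include <stdio.h>

/* Round a logical size up to the vdev's minimum allocation, 1 << ashift. */
static unsigned long long
alloc_size(unsigned long long lsize, int ashift)
{
	unsigned long long amin = 1ULL << ashift;
	return ((lsize + amin - 1) / amin * amin);
}

int
main(void)
{
	/* At ashift=9 a compressed 4K block can still shrink on disk;
	 * at ashift=12 it always occupies at least one full 4K sector. */
	printf("4K block at ashift=12:  %llu bytes\n", alloc_size(4096, 12));
	printf("32K block at ashift=12: %llu bytes\n", alloc_size(32768, 12));
	return (0);
}
```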
-
Rob N authored
The previous comment wondered if this case could happen; it turns out that it really can't. This block can only be entered if dde_type and dde_class are "real"; that only happens when a DDT entry has been previously synced to a DDT store, that is, it was created on a previous txg. Since it's gone through that sync, its dde_refcount must be >0.

ddt_addref() is called from brt_pending_apply(), which is called at the beginning of spa_sync(), before pending DMU writes/frees are issued. Freeing a dedup block is the only thing that can decrement dde_refcount, so there's no way for it to drop to zero before applying the clone bumps it.

Further, even if it _could_ go to zero, it wouldn't be necessary to fill the entry from the block: the phys content is not cleared until the free is issued, which happens when the refcount goes to zero, when the last real free comes through. The cloned block should be identical to what's in the phys already, so the fill would be a no-op anyway.

I've replaced this with an assertion because this is all very dependent on the ordering in which BRT and DDT changes are applied, and that might change in the future.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-By: Klara, Inc.
Closes #15004
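A minimal sketch of the described change (the fill helper name is hypothetical; dde_refcount and the ordering argument come from the commit text):

```c
/* Before: defensive fill for a case believed to be unreachable. */
if (dde->dde_refcount == 0)
	ddt_fill_from_bp(dde, bp);	/* hypothetical helper name */

/* After: encode the invariant instead, so that if the ordering of
 * BRT and DDT application ever changes, we fail loudly rather than
 * silently doing a redundant fill. */
VERIFY3U(dde->dde_refcount, >, 0);
```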
-
Alexander Motin authored
Since the zl_suspend read in zil_commit() is not protected by any lock, it is possible for new ZIL writes to be in progress while zil_destroy(), called by zil_suspend(), frees them. This patch closes the race by taking zl_issuer_lock in zil_suspend() and adding a second zl_suspend check to zil_get_commit_list(), protected by the lock. This allows all already-queued transactions to be logged normally, while blocking any new ones, which instead call txg_wait_synced() for their TXGs.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14979
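The locking pattern, as a simplified sketch (not the verbatim patch): set zl_suspend under zl_issuer_lock, and re-check it in zil_get_commit_list() under the same lock.

```c
/* zil_suspend(): block new commits before destroying the log. */
mutex_enter(&zilog->zl_issuer_lock);
zilog->zl_suspend++;
mutex_exit(&zilog->zl_issuer_lock);

/* zil_get_commit_list(), called with zl_issuer_lock held: the second
 * check added by this patch. Already-queued itxs were picked up by
 * earlier calls; new callers fall through to txg_wait_synced(). */
if (zilog->zl_suspend > 0)
	return;
```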
-
Alexander Motin authored
- Pack struct zio_prop by 4 bytes, from 84 to 80.
- Skip new child ZIO locking while linking to the parent. The newly allocated ZIO is not externally visible yet, so nobody should care.
- Skip io_bp_copy writes when not used (write && non-debug).
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14985
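A sketch of the child-linking change (simplified; list and cache names follow the usual zio code shape but should be treated as illustrative):

```c
zio_link_t *zl = kmem_cache_alloc(zio_link_cache, KM_SLEEP);
zl->zl_parent = pio;
zl->zl_child = cio;

/* The parent is externally visible, so its lock is still needed. */
mutex_enter(&pio->io_lock);
list_insert_head(&pio->io_child_list, zl);
mutex_exit(&pio->io_lock);

/* No mutex_enter(&cio->io_lock): the freshly allocated child has not
 * been published anywhere yet, so no one else can race on it. */
list_insert_head(&cio->io_parent_list, zl);
```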
-
Alexander Motin authored
The scan process may skip blocks based on their birth time, DVA, etc. Traditionally those blocks were accounted as issued, which caused reporting of hugely over-inflated numbers having nothing to do with actual disk I/O. This change utilizes a never-used field in struct dsl_scan_phys to account such skipped bytes, allowing us to report how much data was actually scrubbed/resilvered and what the actual I/O speed is. While formally this is an on-disk format change, it should be compatible both ways, so it should not need a feature flag. This should partially address the same issue as c85ac731, but from a different perspective, complementing it.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Akash B <akash-b@hpe.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15007
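A sketch of the accounting split (the skip predicate and the name of the repurposed field are illustrative; the commit only says a previously unused dsl_scan_phys field is used for skipped bytes):

```c
if (dsl_scan_block_skippable(scn, bp)) {	/* hypothetical predicate */
	/* Never read from disk: do not count it as issued I/O. */
	scn->scn_phys.scn_skipped += BP_GET_ASIZE(bp);
	return (0);
}
scn->scn_phys.scn_examined += BP_GET_ASIZE(bp);
/* ... issue the actual scrub/resilver read ... */
```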
-
Arshad Hussain authored
This patch changes the "size" passed to snprintf() from a hard-coded (open-ended) value to sizeof(errbuf), bringing it in line with the rest of the code wherever 'errbuf' is used.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Arshad Hussain <arshad.hussain@aeoncomputing.com>
Closes #15003
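The pattern this standardizes on, as a small self-contained example: deriving the bound from the buffer itself means it can never drift out of sync with the declaration.

```c
#include <stdio.h>

int
main(void)
{
	const char *path = "tank/home";	/* example input */
	char errbuf[1024];

	/* Before: hard-coded bound, breaks silently if errbuf is resized. */
	snprintf(errbuf, 1024, "cannot open '%s'", path);

	/* After: the bound always matches the buffer declaration. */
	snprintf(errbuf, sizeof (errbuf), "cannot open '%s'", path);
	printf("%s\n", errbuf);
	return (0);
}
```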
-
Alexander Motin authored
The previous code was checking zfs_is_namespace_prop() only for the last property on the list. If that one was not a namespace property, remount wasn't called. To fix that, move zfs_is_namespace_prop() inside the loop and remount if at least one of the properties is a namespace property.
Reviewed-by: Umer Saleem <usaleem@ixsystems.com>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15000
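A simplified sketch of the fix (the remount helper name is hypothetical): track whether any processed property is a namespace property instead of testing only the loop variable after the loop ends.

```c
boolean_t do_remount = B_FALSE;

for (nvpair_t *pair = nvlist_next_nvpair(props, NULL);
    pair != NULL; pair = nvlist_next_nvpair(props, pair)) {
	/* ... apply the property ... */
	if (zfs_is_namespace_prop(zfs_name_to_prop(nvpair_name(pair))))
		do_remount = B_TRUE;
}

if (do_remount)
	zfs_remount_fs(zfsvfs);	/* hypothetical helper */
```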
-
- 29 Jun, 2023 3 commits
-
-
vimproved authored
The issue this is designed to work around is only applicable to glibc, since it's caused by glibc's pthread_cancel() implementation using dlopen on libgcc_s.so.1 (and therefore not triggering dracut to include it in the initramfs). This commit adds an extra condition to the workaround that tests for glibc via "ldconfig -p | grep -qF 'libc.so.6'" (which should only be present on glibc systems).
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Violet Purcell <vimproved@inventati.org>
Closes #14992
-
Yuri Pankov authored
Consistently get the proper default value for autotrim. Currently, only the kernel module is built with IN_FREEBSD_BASE, and libzfs gets the wrong default value, leading to confusion and incorrect output when the autotrim value was not set explicitly.
Reviewed-by: Warner Losh <imp@bsdimp.com>
Signed-off-by: Yuri Pankov <yuripv@FreeBSD.org>
Closes #15016
-
Mateusz Piotrowski authored
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Mateusz Piotrowski <0mp@FreeBSD.org>
Sponsored-by: Klara Inc.
Closes #15014
-
- 28 Jun, 2023 2 commits
-
-
Alexander Motin authored
lwb->lwb_issued_txg can not be accessed after lwb_state is set to LWB_STATE_FLUSH_DONE and zl_lock is dropped, since the lwb may be freed by zil_sync(). We must save the txg number before that. This is similar to 55b1842f, but as far as I can see the bug is not new; it existed for quite a while and just was not triggered due to a smaller race window.
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14988
Closes #14999
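The shape of the fix, as a sketch (the consumer of the saved txg is hypothetical): copy what you need from the lwb before publishing LWB_STATE_FLUSH_DONE and dropping zl_lock.

```c
/* Copy while the lwb is still guaranteed to be alive. */
uint64_t txg = lwb->lwb_issued_txg;

lwb->lwb_state = LWB_STATE_FLUSH_DONE;
mutex_exit(&zilog->zl_lock);

/* zil_sync() may free the lwb at any point now; use only the copy. */
zil_notify_flush_done(zilog, txg);	/* hypothetical consumer */
```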
-
Alexander Motin authored
When ZFS appends files in chunks bigger than recordsize, it borrows a buffer from the ARC and fills it before opening a transaction. This is supposed to help in case of page faults, so that the transaction is not held open indefinitely. The problem appears when recordsize is set lower than the default 128KB. Since each block is committed in a separate transaction, per-transaction overhead becomes significant, and what is even worse, the active use of per-dataset and per-pool locks to protect space-use accounting for each transaction badly hurts the code's SMP scalability. The same transaction size limitation applies in case of file rewrite, but without even the excuse of buffer borrowing.

To address the issue, disable the borrowing mechanism if recordsize is smaller than the default and the write request is 4x bigger than it. In that case, writes of up to 32MB are executed in a single transaction, which dramatically reduces overhead and lock contention. Since the borrowing mechanism is not used for file rewrites, and it was never used by zvols, which seem to work fine, I don't think this change should create significant problems, partially because, in addition to the borrowing mechanism, pre-faulting is also used.

My tests with 4/8 threads writing several files at the same time on datasets with 32KB recordsize in 1MB requests show a reduction of CPU usage by the user threads by 25-35%. I would measure it in GB/s, but at that block size we are now limited by the lock contention of the single write issue taskqueue, which is a separate problem we are going to work on.
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14964
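A sketch of the heuristic (simplified; SPA_OLD_MAXBLOCKSIZE is the 128KB default the text refers to, while the surrounding logic is illustrative):

```c
uint64_t blksz = zp->z_blksz;

if (blksz < SPA_OLD_MAXBLOCKSIZE && n >= 4 * blksz) {
	/* Small recordsize, large request: skip ARC buffer borrowing
	 * and push the whole (pre-faulted) write through one tx. */
} else {
	/* Borrow an ARC buf and commit one transaction per block. */
}
```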
-
- 27 Jun, 2023 2 commits
-
-
Laevos authored
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Laevos <5572812+Laevos@users.noreply.github.com>
Closes #15011
-
Alexander Motin authored
Switch the FIFO queues (SYNC/TRIM) and the active queue of the vdev queue from time-sorted AVL-trees to simple lists. AVL-trees are too expensive for such a simple task. To change I/O priority without searching through the trees, add an io_queue_state field to struct zio. To avoid checking the number of queued I/Os for each priority, add a vq_cqueued bitmap to struct vdev_queue and update it when adding/removing I/Os. Make vq_cactive a separate array instead of a struct vdev_queue_class member. Together those allow avoiding lots of cache misses when looking for work in vdev_queue_class_to_issue().

Introduce a deadline of ~0.5s for the LBA-sorted queues. Before this I saw some I/Os waiting in a queue for up to 8 seconds and possibly more due to starvation. With this change I no longer see it. I had to make the comparison function slightly more complicated, but since it uses all the same cache lines the difference is minimal. For sequential I/Os the new code in vdev_queue_io_to_issue() actually often uses the simpler avl_first(), falling back to avl_find() and avl_nearest() only when needed.

Arrange the members in struct zio to access only one cache line when searching through the vdev queues. While there, remove io_alloc_node, reusing io_queue_node instead; those two are never used at the same time.

Remove the zfs_vdev_aggregate_trim parameter. It has been disabled for the 4 years since it was implemented, while still wasting time maintaining the offset-sorted tree of TRIM requests. Just remove the tree.

Remove locking from txg_all_lists_empty(). It is racy by design, while 2 pairs of lock/unlock take noticeable time under the vdev queue lock.

With these changes, in my tests with volblocksize=4KB I measure a 50% reduction of vdev queue lock spin time on read and 75% on write.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14925
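The vq_cqueued idea, as an illustrative fragment (field layout simplified): one bit per priority class, so the issue path can skip empty classes without touching their queues.

```c
/* On enqueue: mark the class as having work. */
vq->vq_cqueued |= 1U << zio->io_priority;

/* On dequeue: clear the bit once the class queue drains. */
if (list_is_empty(&vq->vq_class[p].vqc_list))	/* illustrative layout */
	vq->vq_cqueued &= ~(1U << p);

/* vdev_queue_class_to_issue() can then test bits instead of queues. */
```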
-
- 26 Jun, 2023 1 commit
-
-
Rich Ercolani authored
It's been observed that in certain workloads (zvol-related being a big one), ZFS will end up spending a large amount of time spinning up taskqs only to tear them down again almost immediately, then spin them up again... I noticed this when I looked at what my mostly-idle system was doing and wondered how on earth taskq creation/destruction was taking up so much time. So I added a configurable delay to avoid tearing down taskqs the first time they are noticed to be idle. The total number of threads at steady state went up, but the amount of time burned tearing down and spinning up new ones almost vanished.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes #14938
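The shape of the idea as a hedged sketch (the loop is illustrative, not the actual SPL taskq code; spl_taskq_thread_timeout_ms is the kind of tunable the description implies):

```c
for (;;) {
	taskq_ent_t *t = taskq_next_ent(tq);	/* hypothetical helper */
	if (t == NULL) {
		/* Idle: linger for a grace period so a bursty workload
		 * can reuse this thread instead of respawning one. */
		if (cv_timedwait(&tq->tq_cv, &tq->tq_lock,
		    ddi_get_lbolt() +
		    MSEC_TO_TICK(spl_taskq_thread_timeout_ms)) == -1)
			break;	/* still idle after the delay: exit */
		continue;
	}
	/* ... execute t and loop ... */
}
```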
-
- 18 Jun, 2023 1 commit
-
-
Alexander Motin authored
482da24e missed arc_buf_destroy() calls on log parse errors, possibly leaking up to 128KB of memory per dataset during ZIL replay.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14987
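The leak pattern and fix, sketched (the parse call is a stand-in): every early return taken on a log-parse error must release the ARC buffer it holds.

```c
arc_buf_t *abuf = NULL;

error = zil_read_log_data(zilog, lr, &abuf);	/* illustrative call */
if (error != 0) {
	/* The missed cleanup: also destroy the buffer on error paths. */
	if (abuf != NULL)
		arc_buf_destroy(abuf, FTAG);
	return (error);
}
```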
-
- 15 Jun, 2023 2 commits
-
-
George Amanakis authored
With the latest L2ARC fixes, 2 seconds is too long to wait for quiescence of arcstats like l2_size. Shorten this interval to avoid having the persistent L2ARC tests in ZTS prematurely terminated.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #14981
-
Alexander Motin authored
Those callbacks were introduced many years ago as part of a bigger patch to smooth the write throttling within a txg. They allowed accounting for the completion of individual physical writes within a logical one, improving cases where some of the physical writes complete much sooner than others, gradually opening the write throttle.

A few years after that, ZFS got allocation throttling, working at the level of logical writes and limiting the number of writes queued to vdevs at any point, and so limiting the latency distribution between the physical writes and especially writes of multiple copies. The scheduling deadline I proposed in #14925 should further reduce the latency distribution. Memory sizes grown over the past 10 years should also reduce the importance of the smoothing.

While the use of the physdone callback may still in theory provide somewhat smoother throttling, there are cases where we simply can not afford it. Since dirty data accounting is protected by a pool-wide lock, in the case of a 6-wide RAIDZ, for example, it requires us to take it 8 times per logical block write, creating huge lock contention.

My tests of this patch show a radical reduction of lock spinning time on workloads where smaller blocks are written to RAIDZ pools, with each of the disks receiving 8-16KB chunks but the total rate reaching 100K+ blocks per second. At the same time, attempts to measure any write time fluctuations didn't show anything noticeable.

While there, also remove the io_child_count/io_parent_count counters. They are used only for a couple of assertions that can be avoided.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14948
-
- 14 Jun, 2023 3 commits
-
-
Brian Behlendorf authored
On FreeBSD 14 this test runs slowly in the CI environment and is killed by the 10 minute timeout. Skip the test on FreeBSD until the slowdown is resolved.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #14961
-
Alexander Motin authored
With a large number of tracked references, list searches under the lock become too expensive, creating enormous lock contention. In my tests with ZFS_DEBUG enabled, this change increases write throughput with 32KB blocks from ~1.2GB/s to ~7.5GB/s.
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14970
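One plausible shape of such a fix (illustrative; the commit text only says linear list searches under the lock were the problem): index tracked references by holder so removal becomes a logarithmic lookup.

```c
reference_t search, *ref;

search.ref_holder = holder;

mutex_enter(&rc->rc_mtx);
ref = avl_find(&rc->rc_tree, &search, NULL);	/* O(log n), not O(n) */
ASSERT3P(ref, !=, NULL);
avl_remove(&rc->rc_tree, ref);
mutex_exit(&rc->rc_mtx);
```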
-
George Amanakis authored
If this is not done, and the pool has an ashift other than the default (at the moment 9), then the following happens:

1) vdev_alloc() assigns the ashift of the pool to the L2ARC device, but upon export it is not stored anywhere;
2) at the first import, vdev_open() sees a vdev_ashift() of 0 and assigns the logical_ashift, which is 9;
3) reading the contents of the L2ARC, including the header, fails;
4) L2ARC buffers are not restored in ARC.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #14313
Closes #14963
-
- 10 Jun, 2023 1 commit
-
-
George Amanakis authored
While commit bcd53210 adjusts the write size based on the size of the log block, this happens after comparing the unadjusted write size to the evicted (target) size. In that case l2ad_hand will exceed l2ad_evict and violate an assertion at the end of l2arc_write_buffers(). Fix this by adding the max log block size to the allocated size of the buffer to be committed before comparing the result to the target size. Also reset the l2arc_trim_ahead ZFS module variable when the adjusted write size exceeds the size of the L2ARC device.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #14936
Closes #14954
-
- 09 Jun, 2023 5 commits
-
-
Alexander Motin authored
It was a vdev-level read cache, designed to aggregate many small reads by speculatively issuing bigger reads instead and caching the result. But since it has almost no idea about what is going on, with the exception of the ZIO_FLAG_DONT_CACHE flag set by higher layers, it was found to do more harm than good, for which reason it was disabled for the past 12 years. These days we have much better instruments to enlarge the I/Os, such as speculative and prescient prefetches, the I/O scheduler, I/O aggregation, etc. Besides the dead code removal, this removes one extra mutex lock/unlock per write inside vdev_cache_write(), which was not otherwise disabled and still tried to do some work.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14953
-
Brian Behlendorf authored
Until the ASSERT which is occasionally hit while running checkpoint_discard_busy is resolved, skip this test case.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #12053
Closes #14952
-
Alexander Motin authored
- Do not report L2ARC as FAULTED in the presence of in-flight writes.
- Report read and write I/Os, bytes, and errors.
- Remove a few numbers not important to the average user.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #12304
Closes #14946
-
Alexander Motin authored
... instead of list_head() + list_remove(). On FreeBSD the list functions are not inlined, so in addition to more compact code this also saves another function call.
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14955
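The transformation in question, as a sketch (the list name is illustrative):

```c
zio_t *zio;

/* Before: two list operations (two function calls on FreeBSD). */
zio = list_head(&vq->vq_active_list);
if (zio != NULL)
	list_remove(&vq->vq_active_list, zio);

/* After: one call with the same semantics, NULL when empty. */
zio = list_remove_head(&vq->vq_active_list);
```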
-
Alexander Motin authored
We are not allowed to access the lwb after setting the LWB_STATE_FLUSH_DONE state and dropping zl_lock, since it may be freed by zil_sync(). To free the itxs and waiters after dropping the lock, we need to move the lwb_itxs and lwb_waiters list elements to local storage.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14957
Closes #14959
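A simplified sketch of moving the lists to local storage before the lock is dropped (field offsets follow the ZIL structures as described):

```c
list_t itxs, waiters;

list_create(&itxs, sizeof (itx_t), offsetof(itx_t, itx_node));
list_create(&waiters, sizeof (zil_commit_waiter_t),
    offsetof(zil_commit_waiter_t, zcw_node));

/* Still holding zl_lock: splice out everything we must touch later. */
list_move_tail(&itxs, &lwb->lwb_itxs);
list_move_tail(&waiters, &lwb->lwb_waiters);
lwb->lwb_state = LWB_STATE_FLUSH_DONE;
mutex_exit(&zilog->zl_lock);

/* The lwb may already be freed; process only the local lists now. */
```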
-
- 07 Jun, 2023 2 commits
-
-
Rich Ercolani authored
This reverts commit 79b20949 since it doesn't work with the systemd version shipped with RHEL7-based systems.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes #14943
Closes #14945
-
Brian Behlendorf authored
When a kmem cache is exhausted and needs to be expanded, a new slab is allocated. KM_SLEEP callers can block and wait for the allocation, but KM_NOSLEEP callers were incorrectly allowed to block as well. Resolve this by attempting an emergency allocation as a best effort. This may fail, but that's fine since any KM_NOSLEEP consumer is required to handle an allocation failure.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Adam Moss <c@yotes.com>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
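A sketch of the decision point (function names follow the SPL allocator's style but should be treated as illustrative):

```c
if (flags & KM_NOSLEEP) {
	/* Must not block: best-effort emergency object. May fail, and
	 * that's fine; KM_NOSLEEP callers must handle NULL anyway. */
	rc = spl_emergency_alloc(skc, flags, &obj);
} else {
	/* KM_SLEEP: blocking for the new slab is allowed. */
	rc = spl_cache_grow_wait(skc, flags, &obj);	/* illustrative */
}
```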
-
- 06 Jun, 2023 1 commit
-
-
George Amanakis authored
l2arc_write_size() should return the write size after adjusting for trim and overhead of the L2ARC log blocks. Also take into account the allocated size of log blocks when deciding when to stop writing buffers to L2ARC.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #14939
-
- 05 Jun, 2023 4 commits
-
-
Rob Norris authored
This is more-or-less like `zfs send`, but specifying the snapshot by its objset id for situations where it can't be referenced any other way.
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: WHR <msl0000023508@gmail.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-By: Klara, Inc.
Closes #14642
-
Rob Norris authored
There's no particular reason this function should be kernel-only, and I want to use it (indirectly) from zdb. I've moved it to zfs_znode.c because libzpool does not compile in zfs_vfsops.c, and this at least matches the header it's imported from.
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: WHR <msl0000023508@gmail.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-By: Klara, Inc.
Closes #14642
-
Alexander Motin authored
There are two places where we need to add/remove several references with the semantics of zfs_refcount_(add|remove). But when debug/tracing is disabled, it is a crime to run multiple atomic_inc() calls in a loop, especially under a congested pool-wide allocator lock. The new functions introduced here implement the same semantics as the loop, but without the overhead in production builds.
Reviewed-by: Rich Ercolani <rincebrain@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14934
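A sketch matching the described semantics (the production variant collapses the loop into one atomic; the debug variant keeps per-holder tracking):

```c
#ifndef	ZFS_DEBUG
/* Production: one atomic add instead of `number` atomic increments. */
#define	zfs_refcount_add_few(rc, number, holder) \
	atomic_add_64(&(rc)->rc_count, number)
#else
void
zfs_refcount_add_few(zfs_refcount_t *rc, uint64_t number, const void *holder)
{
	/* Debug: keep the tracked, one-reference-at-a-time semantics. */
	for (; number > 0; number--)
		(void) zfs_refcount_add(rc, holder);
}
#endif
```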
-
Brian Behlendorf authored
Update the META file to reflect compatibility with the 6.3 kernel.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
-
- 02 Jun, 2023 2 commits
-
-
Graham Perrin authored
Make the section heading more generic (the section relates to ZFS files as well as ZFS volumes). Swapping to a ZFS volume is prone to deadlock. Remove the related instruction and direct readers to the OpenZFS FAQ. Related, but not linked from within the manual page: <https://openzfs.github.io/openzfs-docs/Project%20and%20Community/FAQ.html#using-a-zvol-for-a-swap-device-on-linux> (Using a zvol for a swap device on Linux).
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Graham Perrin <grahamperrin@freebsd.org>
Issue #7734
Closes #14756
-
Alexander Motin authored
There seems to be no reason for ZIL blocks to be limited to 128KB other than that the replay code is written in such a way. This change does not increase the limit yet, it just removes the artificial limitation. The avoided extra memcpy() may save us a second during replay.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14910
-
- 01 Jun, 2023 4 commits
-
-
Val Packett authored
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Felix Dörre <felix@dogcraft.de>
Signed-off-by: Val Packett <val@packett.cool>
Closes #14834
-
Val Packett authored
There's usually no requirement that a user be logged in when changing their password, so let's not be surprising here. We need to use the fetch_lazy mechanism for the old password to avoid a double prompt for it, so that mechanism is now generalized a bit.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Felix Dörre <felix@dogcraft.de>
Signed-off-by: Val Packett <val@packett.cool>
Closes #14834
-
Val Packett authored
Instead of a fixed >=1000 check, allow the configuration to override the minimum UID and add a maximum one as well. While here, add the UID range check to the authenticate method as well, and fix the return in the chauthtok method (it seems very wrong to report success when we've done absolutely nothing).
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Felix Dörre <felix@dogcraft.de>
Signed-off-by: Val Packett <val@packett.cool>
Closes #14834
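A hedged sketch of the range check (option names and the error code are illustrative, not the module's actual configuration keys):

```c
uid_t uid = pwd->pw_uid;

/* uid_max == 0 means "no upper bound" in this sketch. */
if (uid < cfg->uid_min || (cfg->uid_max != 0 && uid > cfg->uid_max))
	return (PAM_SERVICE_ERR);	/* outside the managed range */
```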
-
Val Packett authored
Probably not always a good idea, but it's nice to have the option. It is a workaround for FreeBSD calling the PAM session end earlier than the last process is actually done touching the mount, for example.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Felix Dörre <felix@dogcraft.de>
Signed-off-by: Val Packett <val@packett.cool>
Closes #14834
-