- 01 Jan, 2024 1 commit
-
-
Umer Saleem authored
For block cloning, if we mmap the cloned file and write from the map into the file, it triggers a panic in dbuf_redirty() on Linux. The same scenario causes data corruption on FreeBSD. Both issues are fixed by PR#15656 and PR#15665. It would be good to add a test for this scenario to ZTS. The test program and issue were produced by @robn. Signed-off-by:
Umer Saleem <usaleem@ixsystems.com>
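The scenario described above can be sketched in user space as follows. This is only an illustrative reproduction attempt, not the ZTS test program: the helper names `clone_or_copy` and `write_file_from_own_map` are hypothetical, and `os.copy_file_range()` is merely a call that a filesystem may serve with block cloning when it supports it.

```python
import mmap
import os
import shutil


def clone_or_copy(src: str, dst: str) -> None:
    """Copy src to dst, preferring copy_file_range() where available."""
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        remaining = os.fstat(fin.fileno()).st_size
        try:
            while remaining > 0:
                n = os.copy_file_range(fin.fileno(), fout.fileno(), remaining)
                if n == 0:
                    raise OSError("short copy")
                remaining -= n
            return
        except (AttributeError, OSError):
            pass  # not supported here; fall back to a plain copy below
    shutil.copyfile(src, dst)


def write_file_from_own_map(path: str, chunk: int = 4096) -> None:
    """Write a file's own mmap()'ed content back to the same offsets."""
    fd = os.open(path, os.O_RDWR)
    try:
        size = os.fstat(fd).st_size
        with mmap.mmap(fd, size) as m, memoryview(m) as view:
            for off in range(0, size, chunk):
                # Pass a slice of the mapping itself as the write source,
                # so the kernel reads from the mapped pages.
                with view[off:off + chunk] as part:
                    os.pwrite(fd, part, off)
    finally:
        os.close(fd)
```

On a healthy filesystem the file content is unchanged after `write_file_from_own_map()`; comparing it against the original is the check.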
-
- 27 Dec, 2023 1 commit
-
-
Mark Johnston authored
- Mark some parameters to zpool_power*() as unused. - Add a stub zpool_disk_wait(). Fixes: a9520e6e ("zpool: Add slot power control, print power status") Signed-off-by:
Mark Johnston <markj@FreeBSD.org> Reviewed-by:
Alexander Motin <mav@FreeBSD.org> Reviewed-by:
Tony Hutter <hutter2@llnl.gov>
-
- 26 Dec, 2023 1 commit
-
-
Pawel Jakub Dawidek authored
The tests mostly focus on various corner cases. They take a long time to run, so for the common.run runfile we randomly select a hundred tests. To run all the bclone tests, use the bclone.run runfile. Reviewed-by:
Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by:
Pawel Jakub Dawidek <pawel@dawidek.net> Closes #15631
-
- 21 Dec, 2023 4 commits
-
-
Brian Behlendorf authored
On some systems we already have blkdev_get_by_path() with 4 args but still the old FMODE_EXCL and not BLK_OPEN_EXCL defined. The vdev_bdev_mode() function was added to handle this case but there was no generic way to specify exclusive access. Reviewed-by:
Brian Atkinson <batkinson@lanl.gov> Signed-off-by:
Brian Behlendorf <behlendorf1@llnl.gov> Closes #15692
-
chrisperedun authored
While 763ca47f closes the situation of block cloning creating unencrypted records in encrypted datasets, existing data still causes panic on read. Setting zfs_recover bypasses this but at the cost of potentially ignoring more serious issues. Reviewed-by:
Alexander Motin <mav@FreeBSD.org> Reviewed-by:
Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by:
Chris Peredun <chris.peredun@ixsystems.com> Closes #15677
-
Alexander Motin authored
Track history in the context of bursts, not individual log blocks. This avoids blowing away all the history with a single large burst of many blocks, and at the same time allows optimizations covering multiple blocks in a burst and even the predicted following burst. For each burst, account its optimal block size and minimal first block size. Use those statistics from the last 8 bursts to predict the first block size of the next burst. Remove the predefined set of block sizes. Allocate any size we see fit, in multiples of 4KB, as required by ZIL now. With compression enabled by default, ZFS already writes pretty random block sizes, so this should not surprise the space allocator any more. Reviewed-by:
Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by:
Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15635
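The idea of keeping per-burst statistics for the last 8 bursts can be illustrated with a small model. This is a sketch under stated assumptions, not the ZIL code: the class `BurstHistory` and its methods are hypothetical, and the prediction policy (take the largest recorded first-block size) is a placeholder chosen for simplicity, since the commit message does not spell out the actual heuristic.

```python
from collections import deque

ZIL_BLOCK_ALIGN = 4096  # block sizes are multiples of 4KB, per the message
HISTORY_BURSTS = 8      # number of past bursts considered, per the message


def round_up(size: int, align: int = ZIL_BLOCK_ALIGN) -> int:
    """Round size up to the next multiple of align."""
    return -(-size // align) * align


class BurstHistory:
    """Remember the minimal first block size of recent bursts."""

    def __init__(self) -> None:
        # Old entries fall off automatically once 8 bursts are stored.
        self.first_sizes = deque(maxlen=HISTORY_BURSTS)

    def record_burst(self, min_first_block: int) -> None:
        self.first_sizes.append(round_up(min_first_block))

    def predict_first_block(self, default: int = ZIL_BLOCK_ALIGN) -> int:
        if not self.first_sizes:
            return default
        # Hypothetical policy: be large enough for any recent burst.
        return max(self.first_sizes)
```

Because the history is bounded, one unusually large burst influences the prediction only until eight newer bursts have been recorded.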
-
Tony Hutter authored
Add `zpool` flags to control the slot power to drives. This assumes your SAS or NVMe enclosure supports slot power control via sysfs. The new `--power` flag is added to `zpool offline|online|clear`:
    zpool offline --power <pool> <device>    Turn off device slot power
    zpool online --power <pool> <device>     Turn on device slot power
    zpool clear --power <pool> [device]      Turn on device slot power
If the ZPOOL_AUTO_POWER_ON_SLOT env var is set, then the '--power' option is automatically implied for `zpool online` and `zpool clear` and does not need to be passed. zpool status also gets a --power option to print the slot power status. Reviewed-by:
Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by:
Mart Frauenlob <AllKind@fastest.cc> Signed-off-by:
Tony Hutter <hutter2@llnl.gov> Closes #15662
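The interaction between the explicit flag and the environment variable can be sketched as a small command builder. The function `zpool_power_args` is hypothetical, and treating any non-empty value of ZPOOL_AUTO_POWER_ON_SLOT as "set" is an assumption; the real tool may parse the value more strictly.

```python
import os


def zpool_power_args(action, pool, device=None, power=False, env=None):
    """Build an argv list for zpool offline|online|clear with slot power."""
    env = os.environ if env is None else env
    if action not in ("offline", "online", "clear"):
        raise ValueError("--power applies to offline, online and clear")
    if device is None and action != "clear":
        raise ValueError("a device is required for " + action)
    # Assumption: any non-empty value counts as the variable being set.
    auto = bool(env.get("ZPOOL_AUTO_POWER_ON_SLOT"))
    # The variable implies --power only for online and clear.
    implied = auto and action in ("online", "clear")
    args = ["zpool", action]
    if power or implied:
        args.append("--power")
    args.append(pool)
    if device is not None:
        args.append(device)
    return args
```

Note that the variable never implies `--power` for `zpool offline`, so powering a slot off always takes an explicit flag.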
-
- 20 Dec, 2023 5 commits
-
-
Rob N authored
We are finding that as customers get larger and faster machines (hundreds of cores, large NVMe-backed pools) they keep hitting relatively low performance ceilings. Our profiling work almost always finds that they're running into bottlenecks on the SPA IO taskqs. Unfortunately there's often little we can advise at that point, because there are very few ways to change behaviour without patching. This commit adds two load-time parameters `zio_taskq_read` and `zio_taskq_write` that can configure the READ and WRITE IO taskqs directly. This achieves two goals: it gives operators (and those that support them) a way to tune things without requiring a custom build of OpenZFS, which is often not possible, and it lets us easily try different config variations in a variety of environments to inform the development of better defaults for these kinds of systems. Because tuning the IO taskqs really requires a fairly deep understanding of how IO in ZFS works, and generally isn't needed without a pretty serious workload and an ability to identify bottlenecks, only minimal documentation is provided. It's expected that anyone using this is going to have the source code there as well. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by:
Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by:
Rob Norris <rob.norris@klarasystems.com> Closes #15675
-
Rob Norris authored
6.7 changes the shrinker API such that shrinkers must be allocated dynamically by the kernel. To accommodate this, this commit reworks spl_register_shrinker() to do something similar against earlier kernels. Reviewed-by:
Tony Hutter <hutter2@llnl.gov> Reviewed-by:
Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by:
Rob Norris <robn@despairlabs.com> Sponsored-by: https://github.com/sponsors/robn Closes #15681
-
Rob Norris authored
In 6.7 the superblock shrinker member s_shrink has changed from being an embedded struct to a pointer. Detect this, and don't take a reference if it already is one. Reviewed-by:
Tony Hutter <hutter2@llnl.gov> Reviewed-by:
Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by:
Rob Norris <robn@despairlabs.com> Sponsored-by: https://github.com/sponsors/robn Closes #15681
-
Rob Norris authored
6.6 made i_ctime inaccessible; 6.7 has done the same for i_atime and i_mtime. This extends the method used for ctime in b37f2934 to atime and mtime as well. Reviewed-by:
Tony Hutter <hutter2@llnl.gov> Reviewed-by:
Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by:
Rob Norris <robn@despairlabs.com> Sponsored-by: https://github.com/sponsors/robn Closes #15681
-
Rob Norris authored
6.7 changed the names of the time members in struct inode, so we can't assign back to it because we don't know its name. In practice this doesn't matter though - if we're missing current_time(), then we must be on <4.9, and we know our fallback will need to return timespec. Reviewed-by:
Tony Hutter <hutter2@llnl.gov> Reviewed-by:
Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by:
Rob Norris <robn@despairlabs.com> Sponsored-by: https://github.com/sponsors/robn Closes #15681
-
- 15 Dec, 2023 2 commits
-
-
Umer Saleem authored
PR#15634 removes the 128K into 2x68K LWB split optimization, since it was found to cause an LWB buffer overflow while trying to write a 128KB TX_CLONE_RANGE record with 1022 block pointers into a 68KB buffer when the ZIL spans multiple VDEVs. This commit adds a test for this particular scenario by writing a maximum-sized TX_CLONE_RANGE record with 1022 block pointers into a 68KB buffer, with two SLOG devices. Reviewed-by:
Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by:
Alexander Motin <mav@FreeBSD.org> Reviewed-by:
Ameer Hamza <ahamza@ixsystems.com> Signed-off-by:
Umer Saleem <usaleem@ixsystems.com> Closes #15672
-
Alexander Motin authored
When ZFS overwrites a whole block, it does not bother to read the old content from disk. It is a good optimization, but if the buffer fill fails due to a page fault or something else, the buffer ends up corrupted, neither keeping the old content nor getting the new one. On FreeBSD this is additionally complicated by page faults being blocked by the VFS layer, always returning EFAULT on an attempt to write from an mmap()'ed but not yet cached address range. Normally it is not a big problem, since after the original failure VFS will retry the write after reading the required data. The problem becomes worse in the specific case when somebody tries to write into a file its own mmap()'ed content from the same location. In that situation the only copy of the data gets corrupted on the page fault, and the following retries only cement the status quo. Block cloning makes this issue easier to reproduce, since it does not read the old data, unlike a traditional file copy, which may work by chance. This patch provides the fill status to dmu_buf_fill_done(), which in case of error can destroy the corrupted buffer as if no write happened. One more complication in the case of block cloning is that if an error is possible during fill, dmu_buf_will_fill() must read the data via fall-back to dmu_buf_will_dirty(). This is required so that, in case of error, the buffer can be restored to its state after the cloning, not before it, which is what would happen if we just called dbuf_undirty(). Reviewed-by:
Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by:
Rob Norris <robn@despairlabs.com> Signed-off-by:
Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15665
-
- 12 Dec, 2023 3 commits
-
-
Alexander Motin authored
Block cloning normally creates a dirty record without dr_data. But if the block is read after cloning, it is moved into the DB_CACHED state and receives the data buffer. If after that we call dbuf_unoverride() to convert the dirty record into a normal write, we should give it the data buffer from the dbuf and release one. Reviewed-by:
Kay Pedersen <mail@mkwg.de> Reviewed-by:
Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by:
Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15654 Closes #15656
-
Alexander Motin authored
In some cases dbuf_assign_arcbuf() may be called on a block that was recently cloned. If that happened in the current TXG, we must undo the block cloning first, since the single dirty record per TXG can't and shouldn't mean both cloning and overwrite at the same time. Reviewed-by:
Kay Pedersen <mail@mkwg.de> Reviewed-by:
Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by:
Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15653
-
Brian Behlendorf authored
Align the raidz_expand_005_pos test with the raidz_expand_004_pos test and only verify no errors were reported. Allow scrub repair IO. Reviewed-by:
Don Brady <don.brady@klarasystems.com> Signed-off-by:
Brian Behlendorf <behlendorf1@llnl.gov> Closes #15663
-
- 11 Dec, 2023 3 commits
-
-
Chunwei Chen authored
While evicting dbufs of a dnode, a marker node is added to the AVL tree. The marker node should be inserted in the AVL tree ahead of the dbuf it is trying to delete. The blkid and level are used to ensure this. However, this could go wrong if there is another dbuf with the same blkid and level in DB_EVICTING state but not yet removed from the AVL tree. dbuf_compare() could fail to give the right location or could cause confusion and trigger ASSERTs. To ensure that the marker is inserted before the dbuf being deleted, use the pointer value of the original dbuf for comparison. Reviewed-by:
Brian Behlendorf <behlendorf1@llnl.gov> Co-authored-by:
Sanjeev Bagewadi <sanjeev.bagewadi@nutanix.com> Signed-off-by:
Chunwei Chen <david.chen@nutanix.com> Closes #12482 Closes #15643
-
Tony Hutter authored
Add a test for the dirty dnode SEEK_HOLE/SEEK_DATA bug described in https://github.com/openzfs/zfs/issues/15526 The bug was fixed in https://github.com/openzfs/zfs/pull/15571 and was backported to 2.2.2 and 2.1.14. This test case is just to make sure it does not come back. seekflood.c originally written by Rob Norris. Reviewed-by:
Graham Perrin <grahamperrin@freebsd.org> Reviewed-by:
Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by:
Rob Norris <robn@despairlabs.com> Signed-off-by:
Tony Hutter <hutter2@llnl.gov> Closes #15608
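The kind of check such a regression test performs can be sketched from user space: write data, then immediately ask where the first data byte is while the file is still dirty. This is an illustrative probe only, not seekflood.c; the helper names are hypothetical, and `os.SEEK_DATA` is assumed to be available (Linux, FreeBSD).

```python
import errno
import os


def first_data_offset(path):
    """Return the offset of the first data byte, or None if there is none."""
    fd = os.open(path, os.O_RDONLY)
    try:
        try:
            return os.lseek(fd, 0, os.SEEK_DATA)
        except OSError as e:
            # ENXIO means no data at or after the requested offset.
            if e.errno == errno.ENXIO:
                return None
            raise
    finally:
        os.close(fd)


def write_then_probe(path, payload):
    """Write payload and probe for data right away, before any sync."""
    with open(path, "wb") as f:
        f.write(payload)
    return first_data_offset(path)
```

For a file that was just written from offset 0, the probe should report offset 0; a result of `None` or a later offset would mean freshly written data was reported as a hole.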
-
Brian Behlendorf authored
The zpool_import_status.ksh test case was not being run because it was not included in the Makefile.am. Reviewed-by:
George Melikov <mail@gmelikov.ru> Signed-off-by:
Brian Behlendorf <behlendorf1@llnl.gov> Closes #15655
-
- 09 Dec, 2023 5 commits
-
-
Brian Behlendorf authored
The io_uring test fails on CentOS 9 with the following fio error. Disable the test for the benefit of the CI until this can be fully investigated. This basic test passes as expected on newer kernels. Reviewed-by:
Tony Hutter <hutter2@llnl.gov> Signed-off-by:
Brian Behlendorf <behlendorf1@llnl.gov> Closes #15636
-
Alexander Motin authored
dmu_assign_arcbuf_by_dnode() should drop dn_struct_rwlock lock in case dbuf_hold() failed. I don't have reproduction for this, but it looks inconsistent with dmu_buf_hold_noread_by_dnode() and co. Reviewed-by:
Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by:
Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15644
-
Ameer Hamza authored
Reviewed-by:
Kay Pedersen <mail@mkwg.de> Reviewed-by:
Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by:
Alexander Motin <mav@FreeBSD.org> Signed-off-by:
Ameer Hamza <ahamza@ixsystems.com> Closes #15614
-
Ameer Hamza authored
Reviewed-by:
Kay Pedersen <mail@mkwg.de> Reviewed-by:
Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by:
Alexander Motin <mav@FreeBSD.org> Signed-off-by:
Ameer Hamza <ahamza@ixsystems.com> Closes #15614
-
Mauricio Faria de Oliveira authored
Replace ENCLO_US_RE with ENCLO_SU_RE in the name of the variable. Note this changes the user-visible string in zed.rc, thus might break current users with the wrong string, but it's ~2 months since zfs-2.2.0 tag is out, thus should not be widespread yet. Mechanical change:
    $ grep -rl ZED_POWER_OFF_ENCLOUSRE_SLOT_ON_FAULT
    cmd/zed/zed.d/zed.rc
    cmd/zed/zed.d/statechange-slot_off.sh
    $ sed -i 's/ZED_POWER_OFF_ENCLOUSRE_SLOT_ON_FAULT/<linebreak>
    ZED_POWER_OFF_ENCLOSURE_SLOT_ON_FAULT/g' \
        cmd/zed/zed.d/zed.rc \
        cmd/zed/zed.d/statechange-slot_off.sh
    $ grep -rl ZED_POWER_OFF_ENCLOUSRE_SLOT_ON_FAULT
    $
Fixes 11fbcacf ("zed: Add zedlet to power off slot when drive is faulted") Reviewed-by:
Tony Hutter <hutter2@llnl.gov> Reviewed-by:
Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by:
Mauricio Faria de Oliveira <mfo@canonical.com> Closes #15651
-
- 07 Dec, 2023 4 commits
-
-
Rob N authored
Just silencing a warning. It's totally fine for a hostid to not be there. Reported-by: Coverity (CID-1573336) Reviewed-by:
Alexander Motin <mav@FreeBSD.org> Reviewed-by:
Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by:
Ameer Hamza <ahamza@ixsystems.com> Signed-off-by:
Rob Norris <robn@despairlabs.com> Closes #15650
-
Rob N authored
Reported-by: Coverity (CID-1573333) Reviewed-by:
Alexander Motin <mav@FreeBSD.org> Reviewed-by:
Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by:
Ameer Hamza <ahamza@ixsystems.com> Signed-off-by:
Rob Norris <robn@despairlabs.com> Closes #15649
-
Rob N authored
Coverity noticed that sometimes we ignore the return, and sometimes we don't. It's not wrong, and I like consistent style, so here we are. Reported-by: Coverity (CID-1564584) Reported-by: Coverity (CID-1564585) Reported-by: Coverity (CID-1564586) Reported-by: Coverity (CID-1564587) Reported-by: Coverity (CID-1564588) Reviewed-by:
Alexander Motin <mav@FreeBSD.org> Reviewed-by:
Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by:
Rob Norris <robn@despairlabs.com> Closes #15647
-
Mark Johnston authored
Otherwise the field is left uninitialized, leading to a possible kernel memory disclosure to userspace or to the network. Use the same initialization value we use in zfsctl_common_getattr(). Reported-by: KMSAN Sponsored-by: The FreeBSD Foundation Reviewed-by:
Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by:
Ed Maste <emaste@FreeBSD.org> Signed-off-by:
Mark Johnston <markj@FreeBSD.org> Closes #15639
-
- 06 Dec, 2023 4 commits
-
-
Alexander Motin authored
Without this patch, on a pool of 60 vdevs with ZFS_DEBUG enabled, clone takes much more time than copy, while heavily trashing dbgmsg for no good reason, repeatedly dumping all vdevs' BRTs again and again, even unmodified ones. I am generally not sure this dumping is not excessive, but decided to keep it for now, just restricting its scope to something more reasonable. Reviewed-by:
Kay Pedersen <mail@mkwg.de> Reviewed-by:
Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by:
Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15625
-
Alexander Motin authored
To improve 128KB block write performance in case of multiple VDEVs, ZIL used to split those writes into two 64KB ones. Unfortunately it was found to cause LWB buffer overflow, trying to write a maximum-sized 128KB TX_CLONE_RANGE record with 1022 block pointers into a 68KB buffer, since unlike TX_WRITE, ZIL code can't split it. This is a minimally-invasive temporary block cloning fix until the following more invasive prediction code refactoring. Reviewed-by:
Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by:
Ameer Hamza <ahamza@ixsystems.com> Signed-off-by:
Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15634
-
Alexander Motin authored
Block pointers are not encrypted in TX_WRITE and TX_CLONE_RANGE records, so we can dump them, which may be useful for debugging. Related to #15543. Reviewed-by:
Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by:
Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15629
-
Shengqi Chen authored
Since Linux 6.2, the implementation of flush_dcache_page on riscv references GPL-only symbol `PageHuge`, breaking the build of zfs. This patch uses existing mechanism to override flush_dcache_page, removing the call to `PageHuge`. According to comments in kernel, it is only used to do some check against HugeTLB pages, which only exist in userspace. ZFS uses flush_dcache_page only on kernel pages, thus this patch will not introduce any behaviour change. See also: torvalds/linux@d33deda, openzfs/zfs@589f59b Reviewed-by:
Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by:
Shengqi Chen <harry-chen@outlook.com> Closes #14974 Closes #15627
-
- 05 Dec, 2023 5 commits
-
-
Don Brady authored
Detail the import progress of log spacemaps as they can take a very long time. Also grab the spa_note() messages, as they provide insight into what is happening. Sponsored-By: OpenDrives Inc. Sponsored-By: Klara Inc. Reviewed-by:
Alexander Motin <mav@FreeBSD.org> Reviewed-by:
Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by:
Don Brady <don.brady@klarasystems.com> Co-authored-by:
Allan Jude <allan@klarasystems.com> Closes #15539
-
Shengqi Chen authored
My merged pull request #15557 fixes compilation of sha2 kernels on arm v5/6. However, the compiler guards only allow sha256/512_armv7_impl to be used when __ARM_ARCH > 6. This patch enables these ASM kernels on all arm architectures. Some compiler guards are adjusted accordingly to avoid the unnecessary compilation of SIMD (e.g., neon, armv8ce) kernels on old architectures. Reviewed-by:
Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by:
Shengqi Chen <harry-chen@outlook.com> Closes #15623
-
Rob N authored
Several zpool commands (status, list, iostat) have modes that present some information, sleep a while, present the current state, sleep, etc. Some of those had ways to invoke them that when piped would appear to do nothing for a while, because non-terminals are block-buffered, not line-buffered, by default. Fix this by forcing a flush before sleeping. In particular, all of these buffered: - zpool status <pool> <interval> - zpool iostat -y<m> <pool> <interval> - zpool list <pool> <interval> Reviewed-by:
Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by:
Rob Norris <robn@despairlabs.com> Closes #15593
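The fix pattern, flushing buffered output before each sleep so that a pipe reader sees every report promptly, can be sketched like this. The function `report_periodically` and its parameters are hypothetical and stand in for the interval loops of the zpool commands; they are not the actual implementation.

```python
import sys
import time


def report_periodically(get_status, interval, count,
                        out=sys.stdout, sleep=time.sleep):
    """Print a status line count times, flushing before every sleep."""
    for i in range(count):
        out.write(get_status() + "\n")
        # When out is a pipe it is block-buffered by default, so without
        # this flush the reader would see nothing until the buffer fills.
        out.flush()
        if i + 1 < count:
            sleep(interval)
```

For example, `report_periodically(lambda: "pool: ONLINE", 5, 3)` piped into another program delivers each line as soon as it is produced instead of all at once at exit.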
-
oromenahar authored
When two datasets share the same master encryption key, it is safe to clone encrypted blocks. Currently only snapshots and clones of a dataset share with it the same encryption key. Added a test for: - Clone from encrypted sibling to encrypted sibling with non encrypted parent - Clone from encrypted parent to inherited encrypted child - Clone from child to sibling with encrypted parent - Clone from snapshot to the original datasets - Clone from foreign snapshot to a foreign dataset - Cloning from non-encrypted to encrypted datasets - Cloning from encrypted to non-encrypted datasets Reviewed-by:
Alexander Motin <mav@FreeBSD.org> Reviewed-by:
Brian Behlendorf <behlendorf1@llnl.gov> Original-patch-by:
Pawel Jakub Dawidek <pawel@dawidek.net> Signed-off-by:
Kay Pedersen <mail@mkwg.de> Closes #15544
-
Alexander Motin authored
ZIL claim can not handle block pointers cloned from the future, since they are not yet allocated at that point. It may happen either if the block was just written when it was cloned, or if the pool was frozen or somehow else rewound on import. Handle it from two sides: prevent cloning of blocks with physical birth time from not yet synced or frozen TXG, and abort ZIL claim if we still detect such blocks due to rewind or something else. While there, assert that any cloned blocks we claim are really allocated by calling metaslab_check_free(). Reviewed-by:
Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by:
Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15617
-
- 01 Dec, 2023 2 commits
-
-
Dex Wood authored
This commit adds the zed_notify_ntfy() function and hooks it into zed_notify(). This will allow ZED to send notifications to ntfy.sh or a self-hosted Ntfy service, which can be received on a desktop or mobile device. It is configured with ZED_NTFY_TOPIC, ZED_NTFY_URL, and ZED_NTFY_ACCESS_TOKEN variables in zed.rc. Reviewed-by: @classabbyamp Reviewed-by:
Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by:
Dex Wood <slash2314@gmail.com> Closes #15584
-
Alexander Motin authored
zil_claim_clone_range() takes references on cloned blocks before ZIL replay. Later zil_free_clone_range() drops them after replay or on dataset destroy. The total balance is neutral. It means we do not need to do anything (drop the references) for the not-yet-implemented TX_CLONE_RANGE replay for ZVOLs. This is a logical follow-up to #15603. Reviewed-by:
Kay Pedersen <mail@mkwg.de> Reviewed-by:
Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by:
Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15612
-