- 16 Jan, 2024 1 commit
-
-
Umer Saleem authored
For block cloning, if we mmap the cloned file and write from the map into the file, it triggers a panic in dbuf_redirty() on Linux. The same scenario causes data corruption on FreeBSD. Both these issues are fixed under PR#15656 and PR#15665. It would be good to add a test for this scenario in ZTS. The test program and issue was produced by @robn. Reviewed-by:
Pawel Jakub Dawidek <pawel@dawidek.net> Reviewed-by:
Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by:
Alexander Motin <mav@FreeBSD.org> Reviewed-by:
Ameer Hamza <ahamza@ixsystems.com> Signed-off-by:
Umer Saleem <usaleem@ixsystems.com> Closes #15717
-
- 12 Jan, 2024 10 commits
-
-
Brian Behlendorf authored
Drop the no_memory() call from zpool_in_use() when reading the label fails and instead return the error to the caller. This prevents a misleading "internal error: out of memory" error when the label can't be read. This will result in is_spare() returning B_FALSE instead of aborting, which is already safely handled. Furthermore, on Linux it's possible for EREMOTEIO to returned by an NVMe device if the device has been low-level formatted and not rescanned. In this case we want to fallback to the legacy scanning method and read any of the labels we can. Reviewed-by:
Brian Atkinson <batkinson@lanl.gov> Signed-off-by:
Brian Behlendorf <behlendorf1@llnl.gov> Issue #13538 Closes #15747
-
Benjamin Sherman authored
This change provides rpm spec macros to sign the zfs and spl kmods as the final step after the %install scriptlet. This is needed since the find-debuginfo.sh script strips out debug symbols plus signatures. Kernel module signing only occurs when the required files are present as typically required in the Linux source tree: - certs/signing_key.pem - certs/signing_key.x509 The method for overriding the default __spec_install_post macro is inspired by (and largely copied from) the Fedora kernel.spec. Reviewed-by:
Tony Hutter <hutter2@llnl.gov> Reviewed-by:
Tino Reichardt <milky-zfs@mcmilk.de> Signed-off-by:
Benjamin Sherman <benjamin@holyarmy.org> Closes #15744
-
Mark Johnston authored
For FreeBSD sysctls, we don't want the extra newline, since the sysctl(8) utility will format strings appropriately. Reviewed-by:
Rob Norris <robn@despairlabs.com> Reviewed-by:
Alexander Motin <mav@FreeBSD.org> Reported-by:
Peter Holm <pho@FreeBSD.org> Signed-off-by:
Mark Johnston <markj@FreeBSD.org> Closes #15719
-
Mark Johnston authored
sbuf_cpy() resets the sbuf state, which is wrong for sbufs allocated by sbuf_new_for_sysctl(). In particular, this code triggers an assertion failure in sbuf_clear(). Simplify by just using sysctl_handle_string() for both reading and setting the tunable. Fixes: 6930ecbb ("spa: make read/write queues configurable") Reviewed-by:
Rob Norris <robn@despairlabs.com> Reviewed-by:
Alexander Motin <mav@FreeBSD.org> Reported-by:
Peter Holm <pho@FreeBSD.org> Signed-off-by:
Mark Johnston <markj@FreeBSD.org> Closes #15719
-
Rich Ercolani authored
Profiling zdb -vvvvv on datasets with a lot of zstd blocks, we find ourselves spending quite a lot of time on malloc/free, because we allocate a 16M abd each call, and never free it, so we're leaking 16M per call as well. This seems sub-optimal. So let's just keep the buffer around and reuse it. Reviewed-by:
Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by:
Rob Norris <robn@despairlabs.com> Signed-off-by:
Rich Ercolani <rincebrain@gmail.com> Closes #15721
-
Stefan Lendl authored
When running zfs share -a resetting the exports.d/zfs.exports makes sense the get a clean state. Truncating was also called with zfs mount which would not populate the file again. Add test to verify shares persist after mount -a. Reviewed-by:
Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by:
Stefan Lendl <s.lendl@proxmox.com> Closes #15607 Closes #15660
-
Brian Behlendorf authored
Reviewed-by:
Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by:
Pawel Jakub Dawidek <pawel@dawidek.net> Closes #15749
-
Rich Ercolani authored
zdb -R with :d tries to use gzip decompression 9 times per size. There's absolutely no reason for that, they're all the same decompressor. Reviewed-by:
Brian Atkinson <batkinson@lanl.gov> Reviewed-by:
Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by:
Rich Ercolani <rincebrain@gmail.com> Closes #15726
-
Mark Johnston authored
In general, VOPs must not load the "z_log" field until having called zfs_enter_verify_zp(). Reviewed-by:
Brian Atkinson <batkinson@lanl.gov> Reviewed-by:
Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by:
Mark Johnston <markj@FreeBSD.org> Closes #15752
-
Mark Johnston authored
We need to wait until after having done a zfs_enter() to load some fields from the zfsvfs structure. Otherwise a use-after-free is possible in the face of a concurrent rollback. Other functions in this file are careful to avoid this bug, I believe this is the only instance. Reviewed-by:
Brian Atkinson <batkinson@lanl.gov> Reviewed-by:
Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by:
Mark Johnston <markj@FreeBSD.org> Closes #15752
-
- 09 Jan, 2024 7 commits
-
-
gofaster authored
This commit adds the zed_notify_gotify() function and hooks it into zed_notify(). This will allow ZED to send notifications to a self-hosted Gotify service, which can be received on a desktop or mobile device. It is configured with ZED_GOTIFY_URL, ZED_GOTIFY_APPTOKEN and ZED_GOTIFY_PRIORITY variables in zed.rc. Reviewed-by:
Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by:
gofaster <felix.gofaster@gmail.com> Closes #15693
-
Alexander Motin authored
Two block pointers in livelist pointing to the same location may be caused not only by dedup, but also by block cloning. We should not assert D bit set in them. Two block pointers in livelist pointing to the same location may have different logical birth time in case of dedup or cloning. We should assert identical physical birth time instead. Assert identical physical block size between pointers in addition to checksum, since that is what checksums are calculated on. Reviewed-by:
Matthew Ahrens <mahrens@delphix.com> Reviewed-by:
Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by:
Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15732
-
Alexander Motin authored
- Fail if source block is smaller than destination. We can only grow blocks, not shrink them. - Fail if we do not have full znode range lock. In that case grow is not even called. We should improve zfs_rangelock_cb() somehow to know when cloning needs to grow the block size unlike write. - Fail of we tried to resize, but failed. There are many reasons for it to fail that we can not predict at this level, so be ready for them. Unlike write, that may proceed after growth failure, block cloning can't and must return error. This fixes assertion inside dmu_brt_clone() when it sees different number of blocks held in destination than it got block pointers. Builds without ZFS_DEBUG returned EXDEV, so are not affected much. Reviewed-by:
Pawel Jakub Dawidek <pawel@dawidek.net> Reviewed-by:
Brian Atkinson <batkinson@lanl.gov> Reviewed-by:
Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by:
Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15724 Closes #15735
-
Kent Ross authored
This function decompresses to two buffers and then compares them to check whether the (opaque) decompression process filled the whole buffer. Previously it began with lbuf uninitialized and lbuf2 filled with pseudorandom data. This neither guarantees that any bytes not written by the compressor would be different, nor seems incredibly sound otherwise! After these changes, instead of filling one buffer with generated pseudorandom data we overwrite each buffer with completely different data. This should remove the possibility of low-probability failures, as well as make the process simpler and cheaper. Reviewed-by:
Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by:
Rich Ercolani <rincebrain@gmail.com> Signed-off-by:
Kent Ross <k@mad.cash> Closes #15733
-
Jose Luis Duran authored
Reviewed-by:
Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by:
Jose Luis Duran <jlduran@gmail.com> Closes #15727
-
Alexander Motin authored
While picking parts from #14909 I've missed Linux tracing specific ones, that went unnoticed in default configurations, but breaks the build in some. Reviewed-by:
Ameer Hamza <ahamza@ixsystems.com> Reviewed-by:
Brian Atkinson <batkinson@lanl.gov> Reviewed-by:
Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by:
Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15730
-
Shengqi Chen authored
This patch adds check for `kernel_neon_*` symbols on arm and arm64 platforms to address the following issues: 1. Linux 6.2+ on arm64 has exported them with `EXPORT_SYMBOL_GPL`, so license compatibility must be checked before use. 2. On both arm and arm64, the definitions of these symbols are guarded by `CONFIG_KERNEL_MODE_NEON`, but their declarations are still present. Checking in configuration phase only leads to MODPOST errors (undefined references). Reviewed-by:
Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by:
Shengqi Chen <harry-chen@outlook.com> Closes #15711 Closes #14555 Closes: #15401
-
- 27 Dec, 2023 1 commit
-
-
Mark Johnston authored
- Mark some parameters to zpool_power*() as unused. - Add a stub zpool_disk_wait(). Fixes: a9520e6e ("zpool: Add slot power control, print power status") Signed-off-by:
Mark Johnston <markj@FreeBSD.org> Reviewed-by:
Alexander Motin <mav@FreeBSD.org> Reviewed-by:
Tony Hutter <hutter2@llnl.gov>
-
- 26 Dec, 2023 1 commit
-
-
Pawel Jakub Dawidek authored
The test mostly focus on testing various corner cases. The tests take a long time to run, so for the common.run runfile we randomly select a hundred tests. To run all the bclone tests, bclone.run runfile should be used. Reviewed-by:
Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by:
Pawel Jakub Dawidek <pawel@dawidek.net> Closes #15631
-
- 21 Dec, 2023 4 commits
-
-
Brian Behlendorf authored
On some systems we already have blkdev_get_by_path() with 4 args but still the old FMODE_EXCL and not BLK_OPEN_EXCL defined. The vdev_bdev_mode() function was added to handle this case but there was no generic way to specify exclusive access. Reviewed-by:
Brian Atkinson <batkinson@lanl.gov> Signed-off-by:
Brian Behlendorf <behlendorf1@llnl.gov> Closes #15692
-
chrisperedun authored
While 763ca47f closes the situation of block cloning creating unencrypted records in encrypted datasets, existing data still causes panic on read. Setting zfs_recover bypasses this but at the cost of potentially ignoring more serious issues. Reviewed-by:
Alexander Motin <mav@FreeBSD.org> Reviewed-by:
Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by:
Chris Peredun <chris.peredun@ixsystems.com> Closes #15677
-
Alexander Motin authored
Track history in context of bursts, not individual log blocks. It allows to not blow away all the history by single large burst of many block, and same time allows optimizations covering multiple blocks in a burst and even predicted following burst. For each burst account its optimal block size and minimal first block size. Use that statistics from the last 8 bursts to predict first block size of the next burst. Remove predefined set of block sizes. Allocate any size we see fit, multiple of 4KB, as required by ZIL now. With compression enabled by default, ZFS already writes pretty random block sizes, so this should not surprise space allocator any more. Reviewed-by:
Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by:
Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15635
-
Tony Hutter authored
Add `zpool` flags to control the slot power to drives. This assumes your SAS or NVMe enclosure supports slot power control via sysfs. The new `--power` flag is added to `zpool offline|online|clear`: zpool offline --power <pool> <device> Turn off device slot power zpool online --power <pool> <device> Turn on device slot power zpool clear --power <pool> [device] Turn on device slot power If the ZPOOL_AUTO_POWER_ON_SLOT env var is set, then the '--power' option is automatically implied for `zpool online` and `zpool clear` and does not need to be passed. zpool status also gets a --power option to print the slot power status. Reviewed-by:
Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by:
Mart Frauenlob <AllKind@fastest.cc> Signed-off-by:
Tony Hutter <hutter2@llnl.gov> Closes #15662
-
- 20 Dec, 2023 5 commits
-
-
Rob N authored
We are finding that as customers get larger and faster machines (hundreds of cores, large NVMe-backed pools) they keep hitting relatively low performance ceilings. Our profiling work almost always finds that they're running into bottlenecks on the SPA IO taskqs. Unfortunately there's often little we can advise at that point, because there's very few ways to change behaviour without patching. This commit adds two load-time parameters `zio_taskq_read` and `zio_taskq_write` that can configure the READ and WRITE IO taskqs directly. This achieves two goals: it gives operators (and those that support them) a way to tune things without requiring a custom build of OpenZFS, which is often not possible, and it lets us easily try different config variations in a variety of environments to inform the development of better defaults for these kind of systems. Because tuning the IO taskqs really requires a fairly deep understanding of how IO in ZFS works, and generally isn't needed without a pretty serious workload and an ability to identify bottlenecks, only minimal documentation is provided. Its expected that anyone using this is going to have the source code there as well. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by:
Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by:
Rob Norris <rob.norris@klarasystems.com> Closes #15675
-
Rob Norris authored
6.7 changes the shrinker API such that shrinkers must be allocated dynamically by the kernel. To accomodate this, this commit reworks spl_register_shrinker() to do something similar against earlier kernels. Reviewed-by:
Tony Hutter <hutter2@llnl.gov> Reviewed-by:
Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by:
Rob Norris <robn@despairlabs.com> Sponsored-by: https://github.com/sponsors/robn Closes #15681
-
Rob Norris authored
In 6.7 the superblock shrinker member s_shrink has changed from being an embedded struct to a pointer. Detect this, and don't take a reference if it already is one. Reviewed-by:
Tony Hutter <hutter2@llnl.gov> Reviewed-by:
Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by:
Rob Norris <robn@despairlabs.com> Sponsored-by: https://github.com/sponsors/robn Closes #15681
-
Rob Norris authored
6.6 made i_ctime inaccessible; 6.7 has done the same for i_atime and i_mtime. This extends the method used for ctime in b37f2934 to atime and mtime as well. Reviewed-by:
Tony Hutter <hutter2@llnl.gov> Reviewed-by:
Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by:
Rob Norris <robn@despairlabs.com> Sponsored-by: https://github.com/sponsors/robn Closes #15681
-
Rob Norris authored
6.7 changed the names of the time members in struct inode, so we can't assign back to it because we don't know its name. In practice this doesn't matter though - if we're missing current_time(), then we must be on <4.9, and we know our fallback will need to return timespec. Reviewed-by:
Tony Hutter <hutter2@llnl.gov> Reviewed-by:
Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by:
Rob Norris <robn@despairlabs.com> Sponsored-by: https://github.com/sponsors/robn Closes #15681
-
- 15 Dec, 2023 2 commits
-
-
Umer Saleem authored
PR#15634 removes 128K into 2x68K LWB split optimization, since it was found to cause LWB buffer overflow while trying to write 128KB TX_CLONE_RANGE record with 1022 block pointers into 68KB buffer, with multiple VDEVs ZIL. This commit adds a test for this particular scenario by writing maximum sizes TX_CLONE_RANE record with 1022 block pointers into 68KB buffer, with two SLOG devices. Reviewed-by:
Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by:
Alexander Motin <mav@FreeBSD.org> Reviewed-by:
Ameer Hamza <ahamza@ixsystems.com> Signed-off-by:
Umer Saleem <usaleem@ixsystems.com> Closes #15672
-
Alexander Motin authored
When ZFS overwrites a whole block, it does not bother to read the old content from disk. It is a good optimization, but if the buffer fill fails due to page fault or something else, the buffer ends up corrupted, neither keeping old content, nor getting the new one. On FreeBSD this is additionally complicated by page faults being blocked by VFS layer, always returning EFAULT on attempt to write from mmap()'ed but not yet cached address range. Normally it is not a big problem, since after original failure VFS will retry the write after reading the required data. The problem becomes worse in specific case when somebody tries to write into a file its own mmap()'ed content from the same location. In that situation the only copy of the data is getting corrupted on the page fault and the following retries only fixate the status quo. Block cloning makes this issue easier to reproduce, since it does not read the old data, unlike traditional file copy, that may work by chance. This patch provides the fill status to dmu_buf_fill_done(), that in case of error can destroy the corrupted buffer as if no write happened. One more complication in case of block cloning is that if error is possible during fill, dmu_buf_will_fill() must read the data via fall-back to dmu_buf_will_dirty(). It is required to allow in case of error restoring the buffer to a state after the cloning, not not before it, that would happen if we just call dbuf_undirty(). Reviewed-by:
Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by:
Rob Norris <robn@despairlabs.com> Signed-off-by:
Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15665
-
- 12 Dec, 2023 3 commits
-
-
Alexander Motin authored
Block cloning normally creates dirty record without dr_data. But if the block is read after cloning, it is moved into DB_CACHED state and receives the data buffer. If after that we call dbuf_unoverride() to convert the dirty record into normal write, we should give it the data buffer from dbuf and release one. Reviewed-by:
Kay Pedersen <mail@mkwg.de> Reviewed-by:
Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by:
Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15654 Closes #15656
-
Alexander Motin authored
In some cases dbuf_assign_arcbuf() may be called on a block that was recently cloned. If it happened in current TXG we must undo the block cloning first, since the only one dirty record per TXG can't and shouldn't mean both cloning and overwrite same time. Reviewed-by:
Kay Pedersen <mail@mkwg.de> Reviewed-by:
Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by:
Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15653
-
Brian Behlendorf authored
Align the raidz_expand_005_pos test with the raidz_expand_004_pos test and only verify no errors were reported. Allow scrub repair IO. Reviewed-by:
Don Brady <don.brady@klarasystems.com> Signed-off-by:
Brian Behlendorf <behlendorf1@llnl.gov> Closes #15663
-
- 11 Dec, 2023 3 commits
-
-
Chunwei Chen authored
While evicting dbufs of a dnode, a marker node is added to the AVL. The marker node should be inserted in AVL tree ahead of the dbuf its trying to delete. The blkid and level is used to ensure this. However, this could go wrong there's another dbufs with the same blkid and level in DB_EVICTING state but not yet removed from AVL tree. dbuf_compare() could fail to give the right location or could cause confusion and trigger ASSERTs. To ensure that the marker is inserted before the deleting dbuf, use the pointer value of the original dbuf for comparision. Reviewed-by:
Brian Behlendorf <behlendorf1@llnl.gov> Co-authored-by:
Sanjeev Bagewadi <sanjeev.bagewadi@nutanix.com> Signed-off-by:
Chunwei Chen <david.chen@nutanix.com> Closes #12482 Closes #15643
-
Tony Hutter authored
Add a test for the dirty dnode SEEK_HOLE/SEEK_DATA bug described in https://github.com/openzfs/zfs/issues/15526 The bug was fixed in https://github.com/openzfs/zfs/pull/15571 and was backported to 2.2.2 and 2.1.14. This test case is just to make sure it does not come back. seekflood.c originally written by Rob Norris. Reviewed-by:
Graham Perrin <grahamperrin@freebsd.org> Reviewed-by:
Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by:
Rob Norris <robn@despairlabs.com> Signed-off-by:
Tony Hutter <hutter2@llnl.gov> Closes #15608
-
Brian Behlendorf authored
The zpool_import_status.ksh test case was not being run because it was not included in the Makefile.am. Reviewed-by:
George Melikov <mail@gmelikov.ru> Signed-off-by:
Brian Behlendorf <behlendorf1@llnl.gov> Closes #15655
-
- 09 Dec, 2023 3 commits
-
-
Brian Behlendorf authored
The io_uring test fails on CentOS 9 with the following fio error. Disable the test for the benefit of the CI until this can be fully investigated. This basic test passes as expected on newer kernels. Reviewed-by:
Tony Hutter <hutter2@llnl.gov> Signed-off-by:
Brian Behlendorf <behlendorf1@llnl.gov> Closes #15636
-
Alexander Motin authored
dmu_assign_arcbuf_by_dnode() should drop dn_struct_rwlock lock in case dbuf_hold() failed. I don't have reproduction for this, but it looks inconsistent with dmu_buf_hold_noread_by_dnode() and co. Reviewed-by:
Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by:
Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15644
-
Ameer Hamza authored
Reviewed-by:
Kay Pedersen <mail@mkwg.de> Reviewed-by:
Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by:
Alexander Motin <mav@FreeBSD.org> Signed-off-by:
Ameer Hamza <ahamza@ixsystems.com> Closes #15614
-