pax_global_header00006660000000000000000000000064147646726270014537gustar00rootroot0000000000000052 comment=0ec7597b60e5e6c2c33dfea057ac38bf157acffb vdo-8.3.1.1/000077500000000000000000000000001476467262700124775ustar00rootroot00000000000000vdo-8.3.1.1/CONTRIBUTORS.txt000066400000000000000000000053051476467262700152000ustar00rootroot00000000000000The Red Hat VDO Team: Principal Engineer/Lead Architect: J. corwin Coburn Primary Authors: Joseph Chapman Sweet Tea Dorminy *Thomas Jaskiewicz Bruce Johnston Susan McGhee Ken Raeburn Michael Sclafani Matthew Sakai Joseph Shimkus John Wiele Support, Testing, Documentation, and other things too numerous to mention: Chung Chung : Bryan Gurney *Simon J. Hernandez Jakub Krysl Marek Suchanek Project Management & Technical Direction: Jered Floyd Louis Imershein Dennis Keefe Andrew Walsh *former team members Other Contributors: Ji-Hyeon Gim : Updates for FC26/Kernel 4.13 Vojtech Trefny Getting correct size of partitions Achilles Gaikwad Bash completion for the vdo and vdostats commands Jin-young Kwon Adding vdo --version command, and documentation fixes Francisco Vilmar Cardoso Ruviaro Typo corrections in vdo and uds Yukari Chiba User tools support for riscv64 Yang Huang User tools support for loongarch64 VDO was originally created at Permabit Technology Corporation, and was subsequently acquired and open-sourced by Red Hat. Former Members of the Permabit VDO Team: Engineers: Mark Amidon David Buckle Jacky Chu Joel Hoff Dimitri Kountourogianni Alexis Layton Michael Lee Rich Macchi Dave Paniriti Karl Ramm Hooman Vassef Assar Westurlund Support, Testing, Documentation, etc. Carl Alexander Mike Chu Mark Iskra Farid Jahanmir Francesca Koulikov Erik Lattimore Jennifer Levine Randy Long Steve Looby Uche Onyekwuluje Catherine Powell Jeff Pozz Sarmad Sada John Schmidt Omri Schwarz Jay Splaine John Welle Mary-Anne Wolf Devon Yablonski Robert Zupko Interns: Ari Entlich Lori Monteleone Project Management & Technical Direction: Michael Fortson Other Past Permabit Contributors (for early work on the index): James Clough Dave Golombek Albert Lin Edwin Olson Dave Pinkney Rich Brennan And Very Special Thanks To: Norman Margolis, who started the whole thing vdo-8.3.1.1/COPYING000066400000000000000000000355021476467262700135370ustar00rootroot00000000000000 GNU GENERAL PUBLIC LICENSE Version 2, June 1991 Copyright (C) 1989, 1991 Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed. Preamble The licenses for most software are designed to take away your freedom to share and change it. By contrast, the GNU General Public License is intended to guarantee your freedom to share and change free software--to make sure the software is free for all its users. This General Public License applies to most of the Free Software Foundation's software and to any other program whose authors commit to using it. (Some other Free Software Foundation software is covered by the GNU Lesser General Public License instead.) You can apply it to your programs, too. When we speak of free software, we are referring to freedom, not price. 
Our General Public Licenses are designed to make sure that you have the freedom to distribute copies of free software (and charge for this service if you wish), that you receive source code or can get it if you want it, that you can change the software or use pieces of it in new free programs; and that you know you can do these things. To protect your rights, we need to make restrictions that forbid anyone to deny you these rights or to ask you to surrender the rights. These restrictions translate to certain responsibilities for you if you distribute copies of the software, or if you modify it. For example, if you distribute copies of such a program, whether gratis or for a fee, you must give the recipients all the rights that you have. You must make sure that they, too, receive or can get the source code. And you must show them these terms so they know their rights. We protect your rights with two steps: (1) copyright the software, and (2) offer you this license which gives you legal permission to copy, distribute and/or modify the software. Also, for each author's protection and ours, we want to make certain that everyone understands that there is no warranty for this free software. If the software is modified by someone else and passed on, we want its recipients to know that what they have is not the original, so that any problems introduced by others will not reflect on the original authors' reputations. Finally, any free program is threatened constantly by software patents. We wish to avoid the danger that redistributors of a free program will individually obtain patent licenses, in effect making the program proprietary. To prevent this, we have made it clear that any patent must be licensed for everyone's free use or not licensed at all. The precise terms and conditions for copying, distribution and modification follow. GNU GENERAL PUBLIC LICENSE TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION 0. This License applies to any program or other work which contains a notice placed by the copyright holder saying it may be distributed under the terms of this General Public License. The "Program", below, refers to any such program or work, and a "work based on the Program" means either the Program or any derivative work under copyright law: that is to say, a work containing the Program or a portion of it, either verbatim or with modifications and/or translated into another language. (Hereinafter, translation is included without limitation in the term "modification".) Each licensee is addressed as "you". Activities other than copying, distribution and modification are not covered by this License; they are outside its scope. The act of running the Program is not restricted, and the output from the Program is covered only if its contents constitute a work based on the Program (independent of having been made by running the Program). Whether that is true depends on what the Program does. 1. You may copy and distribute verbatim copies of the Program's source code as you receive it, in any medium, provided that you conspicuously and appropriately publish on each copy an appropriate copyright notice and disclaimer of warranty; keep intact all the notices that refer to this License and to the absence of any warranty; and give any other recipients of the Program a copy of this License along with the Program. You may charge a fee for the physical act of transferring a copy, and you may at your option offer warranty protection in exchange for a fee. 2. 
You may modify your copy or copies of the Program or any portion of it, thus forming a work based on the Program, and copy and distribute such modifications or work under the terms of Section 1 above, provided that you also meet all of these conditions: a) You must cause the modified files to carry prominent notices stating that you changed the files and the date of any change. b) You must cause any work that you distribute or publish, that in whole or in part contains or is derived from the Program or any part thereof, to be licensed as a whole at no charge to all third parties under the terms of this License. c) If the modified program normally reads commands interactively when run, you must cause it, when started running for such interactive use in the most ordinary way, to print or display an announcement including an appropriate copyright notice and a notice that there is no warranty (or else, saying that you provide a warranty) and that users may redistribute the program under these conditions, and telling the user how to view a copy of this License. (Exception: if the Program itself is interactive but does not normally print such an announcement, your work based on the Program is not required to print an announcement.) These requirements apply to the modified work as a whole. If identifiable sections of that work are not derived from the Program, and can be reasonably considered independent and separate works in themselves, then this License, and its terms, do not apply to those sections when you distribute them as separate works. But when you distribute the same sections as part of a whole which is a work based on the Program, the distribution of the whole must be on the terms of this License, whose permissions for other licensees extend to the entire whole, and thus to each and every part regardless of who wrote it. Thus, it is not the intent of this section to claim rights or contest your rights to work written entirely by you; rather, the intent is to exercise the right to control the distribution of derivative or collective works based on the Program. In addition, mere aggregation of another work not based on the Program with the Program (or with a work based on the Program) on a volume of a storage or distribution medium does not bring the other work under the scope of this License. 3. You may copy and distribute the Program (or a work based on it, under Section 2) in object code or executable form under the terms of Sections 1 and 2 above provided that you also do one of the following: a) Accompany it with the complete corresponding machine-readable source code, which must be distributed under the terms of Sections 1 and 2 above on a medium customarily used for software interchange; or, b) Accompany it with a written offer, valid for at least three years, to give any third party, for a charge no more than your cost of physically performing source distribution, a complete machine-readable copy of the corresponding source code, to be distributed under the terms of Sections 1 and 2 above on a medium customarily used for software interchange; or, c) Accompany it with the information you received as to the offer to distribute corresponding source code. (This alternative is allowed only for noncommercial distribution and only if you received the program in object code or executable form with such an offer, in accord with Subsection b above.) The source code for a work means the preferred form of the work for making modifications to it. 
For an executable work, complete source code means all the source code for all modules it contains, plus any associated interface definition files, plus the scripts used to control compilation and installation of the executable. However, as a special exception, the source code distributed need not include anything that is normally distributed (in either source or binary form) with the major components (compiler, kernel, and so on) of the operating system on which the executable runs, unless that component itself accompanies the executable. If distribution of executable or object code is made by offering access to copy from a designated place, then offering equivalent access to copy the source code from the same place counts as distribution of the source code, even though third parties are not compelled to copy the source along with the object code. 4. You may not copy, modify, sublicense, or distribute the Program except as expressly provided under this License. Any attempt otherwise to copy, modify, sublicense or distribute the Program is void, and will automatically terminate your rights under this License. However, parties who have received copies, or rights, from you under this License will not have their licenses terminated so long as such parties remain in full compliance. 5. You are not required to accept this License, since you have not signed it. However, nothing else grants you permission to modify or distribute the Program or its derivative works. These actions are prohibited by law if you do not accept this License. Therefore, by modifying or distributing the Program (or any work based on the Program), you indicate your acceptance of this License to do so, and all its terms and conditions for copying, distributing or modifying the Program or works based on it. 6. Each time you redistribute the Program (or any work based on the Program), the recipient automatically receives a license from the original licensor to copy, distribute or modify the Program subject to these terms and conditions. You may not impose any further restrictions on the recipients' exercise of the rights granted herein. You are not responsible for enforcing compliance by third parties to this License. 7. If, as a consequence of a court judgment or allegation of patent infringement or for any other reason (not limited to patent issues), conditions are imposed on you (whether by court order, agreement or otherwise) that contradict the conditions of this License, they do not excuse you from the conditions of this License. If you cannot distribute so as to satisfy simultaneously your obligations under this License and any other pertinent obligations, then as a consequence you may not distribute the Program at all. For example, if a patent license would not permit royalty-free redistribution of the Program by all those who receive copies directly or indirectly through you, then the only way you could satisfy both it and this License would be to refrain entirely from distribution of the Program. If any portion of this section is held invalid or unenforceable under any particular circumstance, the balance of the section is intended to apply and the section as a whole is intended to apply in other circumstances. It is not the purpose of this section to induce you to infringe any patents or other property right claims or to contest validity of any such claims; this section has the sole purpose of protecting the integrity of the free software distribution system, which is implemented by public license practices. 
Many people have made generous contributions to the wide range of software distributed through that system in reliance on consistent application of that system; it is up to the author/donor to decide if he or she is willing to distribute software through any other system and a licensee cannot impose that choice. This section is intended to make thoroughly clear what is believed to be a consequence of the rest of this License. 8. If the distribution and/or use of the Program is restricted in certain countries either by patents or by copyrighted interfaces, the original copyright holder who places the Program under this License may add an explicit geographical distribution limitation excluding those countries, so that distribution is permitted only in or among countries not thus excluded. In such case, this License incorporates the limitation as if written in the body of this License. 9. The Free Software Foundation may publish revised and/or new versions of the General Public License from time to time. Such new versions will be similar in spirit to the present version, but may differ in detail to address new problems or concerns. Each version is given a distinguishing version number. If the Program specifies a version number of this License which applies to it and "any later version", you have the option of following the terms and conditions either of that version or of any later version published by the Free Software Foundation. If the Program does not specify a version number of this License, you may choose any version ever published by the Free Software Foundation. 10. If you wish to incorporate parts of the Program into other free programs whose distribution conditions are different, write to the author to ask for permission. For software which is copyrighted by the Free Software Foundation, write to the Free Software Foundation; we sometimes make exceptions for this. Our decision will be guided by the two goals of preserving the free status of all derivatives of our free software and of promoting the sharing and reuse of software generally. NO WARRANTY 11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION. 12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. 
vdo-8.3.1.1/Makefile000066400000000000000000000024141476467262700141400ustar00rootroot00000000000000# # Copyright Red Hat # # This program is free software; you can redistribute it and/or # modify it under the terms of the GNU General Public License # as published by the Free Software Foundation; either version 2 # of the License, or (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA # 02110-1301, USA. # INSTALL = install INSTALLOWNER ?= -o root -g root name ?= vdo defaultdocdir ?= /usr/share/doc defaultlicensedir ?= /usr/share/licenses DOCDIR=$(DESTDIR)/$(defaultdocdir)/$(name) LICENSEDIR=$(DESTDIR)/$(defaultlicensedir)/$(name) SUBDIRS = examples utils .PHONY: all clean install all clean: for d in $(SUBDIRS); do \ $(MAKE) -C $$d $@ || exit 1; \ done install: $(INSTALL) $(INSTALLOWNER) -d $(DOCDIR) $(INSTALL) $(INSTALLOWNER) -D -m 644 COPYING -t $(LICENSEDIR) for d in $(SUBDIRS); do \ $(MAKE) -C $$d $@ || exit 1; \ done vdo-8.3.1.1/README.md000066400000000000000000000116101476467262700137550ustar00rootroot00000000000000# vdo A set of userspace tools for managing pools of deduplicated and/or compressed block storage. ## Background VDO is a device-mapper target that provides inline block-level deduplication, compression, and thin provisioning capabilities for primary storage. VDO is managed through LVM and can be integrated into any existing storage stack. Deduplication is a technique for reducing the consumption of storage resources by eliminating multiple copies of duplicate blocks. Compression takes the individual unique blocks and shrinks them with coding algorithms; these reduced blocks are then efficiently packed together into physical blocks. Thin provisioning manages the mapping from logical block addresses presented by VDO to where the data has actually been stored, and also eliminates any blocks of all zeroes. With deduplication, instead of writing the same data more than once each duplicate block is detected and recorded as a reference to the original block. VDO maintains a mapping from logical block addresses (presented to the storage layer above VDO) to physical block addresses on the storage layer under VDO. After deduplication, multiple logical block addresses may be mapped to the same physical block address; these are called shared blocks and are reference-counted by the software. With VDO's compression, blocks are compressed with the fast LZ4 algorithm, and collected together where possible so that multiple compressed blocks fit within a single 4 KB block on the underlying storage. Each logical block address is mapped to a physical block address and an index within it for the desired compressed data. All compressed blocks are individually reference-counted for correctness. Block sharing and block compression are invisible to applications using the storage, which read and write blocks as they would if VDO were not present. When a shared block is overwritten, a new physical block is allocated for storing the new block data to ensure that other logical block addresses that are mapped to the shared physical block are not modified. 
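The mapping and reference-counting behavior described above can be illustrated with a minimal, self-contained sketch. This is not VDO's actual code or on-disk format; the names (`lba_to_pba`, `write_block`, and so on) are invented for illustration, and duplicate detection is assumed to have already happened (VDO uses the UDS index for that). The sketch only shows the bookkeeping: a duplicate write shares an existing physical block by raising its reference count, and overwriting a shared block allocates a fresh physical block so other logical addresses mapped to the shared block are unaffected.

    #include <stdint.h>
    #include <stdio.h>

    /*
     * Toy model: each logical block address (LBA) maps to a physical block
     * address (PBA), and each PBA carries a reference count. A deduplicated
     * write bumps the refcount of an existing PBA; an overwrite of a shared
     * block allocates a new PBA instead of modifying the shared one.
     */
    #define BLOCK_COUNT 8
    #define UNMAPPED UINT32_MAX

    static uint32_t lba_to_pba[BLOCK_COUNT]; /* logical -> physical map */
    static uint32_t refcount[BLOCK_COUNT];   /* per-physical reference counts */
    static uint32_t next_free_pba;

    static uint32_t write_block(uint32_t lba, uint32_t duplicate_of_pba)
    {
        uint32_t old = lba_to_pba[lba];
        uint32_t pba;

        if (duplicate_of_pba != UNMAPPED) {
            /* Deduplicated write: share the existing physical block. */
            pba = duplicate_of_pba;
        } else {
            /* New or overwritten data: allocate a fresh physical block. */
            pba = next_free_pba++;
        }

        refcount[pba]++;
        if (old != UNMAPPED)
            refcount[old]--; /* the old block may still be shared by others */
        lba_to_pba[lba] = pba;
        return pba;
    }

    int main(void)
    {
        for (uint32_t i = 0; i < BLOCK_COUNT; i++)
            lba_to_pba[i] = UNMAPPED;

        uint32_t first = write_block(0, UNMAPPED); /* unique data at LBA 0 */
        write_block(1, first);                     /* LBA 1 is a duplicate  */
        write_block(1, UNMAPPED);                  /* overwrite the shared block */

        /* LBA 0 still references the original block, so its count is now 1. */
        printf("PBA %u refcount after overwrite: %u\n", first, refcount[first]);
        return 0;
    }
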
This repository contains a set of userspace tools for managing VDO volumes. These include "vdoformat" for creating new volumes, "vdostats" for extracting statistics from those volumes, and a variety of support and debugging tools which should not be necessary during ordinary operation. ## History VDO was originally developed by Permabit Technology Corp. as a proprietary set of kernel modules and userspace tools. This software and technology has been acquired by Red Hat and relicensed under the GPL (v2 or later). The kernel module has been merged into the upstream Linux kernel as dm-vdo. ## Documentation - [RHEL9 VDO Documentation](https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/9/html/deduplicating_and_compressing_logical_volumes_on_rhel/index) - [RHEL8 VDO Documentation](https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/deduplicating_and_compressing_storage/index) - [RHEL7 VDO Integration Guide](https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/storage_administration_guide/vdo-integration) - [RHEL7 VDO Evaluation Guide](https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/storage_administration_guide/vdo-evaluation) ## Releases The master branch of this repository is intended to be compatible with the most recent version of the Linux kernel. These packages are available in active Fedora releases with matching kernel versions. Version | Oldest Supported Linux Kernel Version ------- | -------------------------------------- 8.3.x.x | 6.9.0 Each older branch of this repository is intended to work with a specific release of Enterprise Linux (Red Hat Enterprise Linux, CentOS, etc.). Version | Intended Enterprise Linux Release ------- | --------------------------------- 6.1.x.x | EL7 (3.10.0-*.el7) 6.2.x.x | EL8 (4.18.0-*.el8) 8.2.x.x | EL9 (5.14.0-*.el9) * Pre-built versions with the required modifications for older Fedora releases can be found [here](https://copr.fedorainfracloud.org/coprs/rhawalsh/dm-vdo) and can be used by running `dnf copr enable rhawalsh/dm-vdo`. ## Building In order to build the user-level programs, invoke the following command from the top directory of this tree: make After building the user-level programs, they may be installed in the standard locations by invoking the following command from the top directory of this tree, as the root user: make install ## Communication Channels and Contributions Community feedback, participation and patches are welcome to the [vdo-devel](https://github.com/dm-vdo/vdo-devel) repository, which is the parent of this one. This repository does not accept pull requests. ## Licensing [GPL v2.0 or later](https://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html). All contributions retain ownership by their original author, but must also be licensed under the GPL 2.0 or later to be merged. vdo-8.3.1.1/examples/000077500000000000000000000000001476467262700143155ustar00rootroot00000000000000vdo-8.3.1.1/examples/Makefile000066400000000000000000000015511476467262700157570ustar00rootroot00000000000000# # Copyright Red Hat # # This program is free software; you can redistribute it and/or # modify it under the terms of the GNU General Public License # as published by the Free Software Foundation; either version 2 # of the License, or (at your option) any later version. 
# # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA # 02110-1301, USA. # SUBDIRS = monitor .PHONY: all clean install all clean install: for d in $(SUBDIRS); do \ $(MAKE) -C $$d $@ || exit 1; \ done vdo-8.3.1.1/examples/monitor/000077500000000000000000000000001476467262700160045ustar00rootroot00000000000000vdo-8.3.1.1/examples/monitor/Makefile000066400000000000000000000023131476467262700174430ustar00rootroot00000000000000# # Copyright Red Hat # # This program is free software; you can redistribute it and/or # modify it under the terms of the GNU General Public License # as published by the Free Software Foundation; either version 2 # of the License, or (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA # 02110-1301, USA. # INSTALLFILES=monitor_check_vdostats_logicalSpace.pl \ monitor_check_vdostats_physicalSpace.pl \ monitor_check_vdostats_savingPercent.pl INSTALL = install INSTALLOWNER ?= -o root -g root defaultdocdir ?= /usr/share/doc name ?= vdo INSTALLDIR=$(DESTDIR)/$(defaultdocdir)/$(name)/examples/monitor .PHONY: all clean install all:; clean:; install: $(INSTALL) $(INSTALLOWNER) -d $(INSTALLDIR) for i in $(INSTALLFILES); do \ $(INSTALL) $(INSTALLOWNER) -m 755 $$i $(INSTALLDIR); \ done vdo-8.3.1.1/examples/monitor/monitor_check_vdostats_logicalSpace.pl000077500000000000000000000107161476467262700255720ustar00rootroot00000000000000#!/usr/bin/perl ## # Copyright Red Hat # # This program is free software; you can redistribute it and/or # modify it under the terms of the GNU General Public License # as published by the Free Software Foundation; either version 2 # of the License, or (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA # 02110-1301, USA. # # monitor_check_vdostats_logicalSpace.pl [--warning |-w ] # [--critical |-c ] # # # This script parses the output of "vdostats --verbose" for a given VDO # volume, processes the "used percent" value, and returns a status code, # and a single-line output with status information. # # Options: # # -c : critical threshold equal to or greater than # percent. # # -w : warning threshold equal to or less than # percent. # # The "vdostats" program must be in the path used by "sudo". # ## use strict; use warnings FATAL => qw(all); use Getopt::Long; # Constants for the service status return values. 
use constant { MONITOR_SERVICE_OK => 0, MONITOR_SERVICE_WARNING => 1, MONITOR_SERVICE_CRITICAL => 2, MONITOR_SERVICE_UNKNOWN => 3, }; my $inputWarnThreshold = -1; my $inputCritThreshold = -1; GetOptions("critical=i" => \$inputCritThreshold, "warning=i" => \$inputWarnThreshold); # Default warning and critical thresholds for "logical used percent". my $warnThreshold = 80; my $critThreshold = 95; if ($inputWarnThreshold >= 0 && $inputWarnThreshold <= 100) { $warnThreshold = $inputWarnThreshold; } if ($inputCritThreshold >= 0 && $inputCritThreshold <= 100) { $critThreshold = $inputCritThreshold; } # A hash to hold the statistics names and values gathered from input. my %stats = (); # Vital statistics for general VDO health. This array contains only the # names of the desired statistics to store in the %stats hash. my @statNames = ( 'operating mode', 'data blocks used', 'overhead blocks used', 'logical blocks used', 'physical blocks', 'logical blocks', 'used percent', 'saving percent', '1k-blocks available', ); ############################################################################# # Get the statistics output for the given VDO device name, and filter the # desired stats values. ## sub getStats { if (!$ARGV[0]) { return; } my $deviceName = $ARGV[0]; my @verboseStatsOutput = `sudo vdostats $deviceName --verbose`; foreach my $statLabel (@statNames) { foreach my $inpline (@verboseStatsOutput) { if ($inpline =~ $statLabel) { $inpline =~ /.*: (.*)$/; my $statValue = $1; $stats{$statLabel} = $statValue; } } } } ############################################################################# # main ## if (scalar(@ARGV) != 1) { print("Usage: monitor_check_vdostats_logicalSpace.pl\n"); print(" [--warning |-w VALUE]\n"); print(" [--critical|-c VALUE]\n"); print(" \n"); exit(MONITOR_SERVICE_UNKNOWN); } getStats(); # If the stats table is empty, nothing was found; return unknown status. # Otherwise, print the stats. if (!%stats) { printf("Unable to load vdostats verbose output.\n"); exit(MONITOR_SERVICE_UNKNOWN); } # Calculate logical percent used, and print the value. my $logicalUsedPercent = (100 * $stats{"logical blocks used"} / $stats{"logical blocks"}); printf("logical used: %.2f%%\n", $logicalUsedPercent); # Process the critical and warning thresholds. # If critThreshold is less than warnThreshold, the only used percentage # return codes will be "OK" or "CRITICAL". if ($logicalUsedPercent >= $warnThreshold && $logicalUsedPercent < $critThreshold) { exit(MONITOR_SERVICE_WARNING); } if ($logicalUsedPercent >= $critThreshold) { exit(MONITOR_SERVICE_CRITICAL); } if ($logicalUsedPercent >= 0 && $logicalUsedPercent < $warnThreshold) { exit(MONITOR_SERVICE_OK); } # Default exit condition. exit(MONITOR_SERVICE_UNKNOWN); vdo-8.3.1.1/examples/monitor/monitor_check_vdostats_physicalSpace.pl000077500000000000000000000116171476467262700257750ustar00rootroot00000000000000#!/usr/bin/perl ## # Copyright Red Hat # # This program is free software; you can redistribute it and/or # modify it under the terms of the GNU General Public License # as published by the Free Software Foundation; either version 2 # of the License, or (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. 
# # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA # 02110-1301, USA. # # monitor_check_vdostats_physicalSpace.pl [--warning |-w ] # [--critical |-c ] # # # This script parses the output of "vdostats --verbose" for a given VDO # volume, processes the "used percent" value, and returns a status code, # and a single-line output with status information. # # Options: # # -c : critical threshold equal to or greater than # percent. # # -w : warning threshold equal to or less than # percent. # # The "vdostats" program must be in the path used by "sudo". # ## use strict; use warnings FATAL => qw(all); use Getopt::Long; # Constants for the service status return values. use constant { MONITOR_SERVICE_OK => 0, MONITOR_SERVICE_WARNING => 1, MONITOR_SERVICE_CRITICAL => 2, MONITOR_SERVICE_UNKNOWN => 3, }; my $inputWarnThreshold = -1; my $inputCritThreshold = -1; GetOptions("critical=i" => \$inputCritThreshold, "warning=i" => \$inputWarnThreshold); # Default warning and critical thresholds for "used percent". my $warnThreshold = 75; my $critThreshold = 90; if ($inputWarnThreshold >= 0 && $inputWarnThreshold <= 100) { $warnThreshold = $inputWarnThreshold; } if ($inputCritThreshold >= 0 && $inputCritThreshold <= 100) { $critThreshold = $inputCritThreshold; } # A hash to hold the statistics names and values gathered from input. my %stats = (); # Vital statistics for general VDO health. This array contains only the # names of the desired statistics to store in the %stats hash. my @statNames = ( 'operating mode', 'data blocks used', 'overhead blocks used', 'physical blocks', 'logical blocks', 'used percent', 'saving percent', '1k-blocks available', ); ############################################################################# # Get the statistics output for the given VDO device name, and filter the # desired stats values. ## sub getStats { if (!$ARGV[0]) { return; } my $deviceName = $ARGV[0]; my @verboseStatsOutput = `sudo vdostats $deviceName --verbose`; foreach my $statLabel (@statNames) { foreach my $inpline (@verboseStatsOutput) { if ($inpline =~ $statLabel) { $inpline =~ /.*: (.*)$/; my $statValue = $1; $stats{$statLabel} = $statValue; } } } } ############################################################################# # Print the vital statistics to stdout. ## sub printVitalStats { printf("operating mode: %s," . " physical used: %s%%," . " savings: %s%%.\n", $stats{"operating mode"}, $stats{"used percent"}, $stats{"saving percent"}); } ############################################################################# # main ## if (scalar(@ARGV) != 1) { print("Usage: monitor_check_vdostats_physicalSpace.pl\n"); print(" [--warning |-w VALUE]\n"); print(" [--critical|-c VALUE]\n"); print(" \n"); exit(MONITOR_SERVICE_UNKNOWN); } getStats(); # If the stats table is empty, nothing was found; return unknown status. # Otherwise, print the stats. if (!%stats) { printf("Unable to load vdostats verbose output.\n"); exit(MONITOR_SERVICE_UNKNOWN); } else { printVitalStats(\%stats); } # If the VDO is in read-only mode or recovering, exit now with a critical # status. if ($stats{"operating mode"} =~ "read-only") { exit(MONITOR_SERVICE_CRITICAL); } if ($stats{"operating mode"} =~ "recovering") { exit(MONITOR_SERVICE_WARNING); } if ($stats{"used percent"} =~ "N/A") { exit(MONITOR_SERVICE_UNKNOWN) } # Process the critical and warning thresholds. 
# If critThreshold is less than warnThreshold, the only used percentage # return codes will be "OK" or "CRITICAL". if ($stats{"used percent"} >= $warnThreshold && $stats{"used percent"} < $critThreshold) { exit(MONITOR_SERVICE_WARNING); } if ($stats{"used percent"} >= $critThreshold) { exit(MONITOR_SERVICE_CRITICAL); } if ($stats{"used percent"} >= 0 && $stats{"used percent"} < $warnThreshold) { exit(MONITOR_SERVICE_OK); } # Default exit condition. exit(MONITOR_SERVICE_UNKNOWN); vdo-8.3.1.1/examples/monitor/monitor_check_vdostats_savingPercent.pl000077500000000000000000000107241476467262700260130ustar00rootroot00000000000000#!/usr/bin/perl ## # Copyright Red Hat # # This program is free software; you can redistribute it and/or # modify it under the terms of the GNU General Public License # as published by the Free Software Foundation; either version 2 # of the License, or (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA # 02110-1301, USA. # # monitor_check_vdostats_savingPercent.pl [--warning |-w ] # [--critical |-c ] # # # This script parses the output of "vdostats --verbose" for a given VDO # volume, processes the "used percent" value, and returns a status code, # and a single-line output with status information. # # Options: # # -c : critical threshold equal to or greater than # percent. # # -w : warning threshold equal to or less than # percent. # # The "vdostats" program must be in the path used by "sudo". # ## use strict; use warnings FATAL => qw(all); use Getopt::Long; # Constants for the service status return values. use constant { MONITOR_SERVICE_OK => 0, MONITOR_SERVICE_WARNING => 1, MONITOR_SERVICE_CRITICAL => 2, MONITOR_SERVICE_UNKNOWN => 3, }; my $inputWarnThreshold = -1; my $inputCritThreshold = -1; GetOptions("critical=i" => \$inputCritThreshold, "warning=i" => \$inputWarnThreshold); # Default warning and critical thresholds for "logical used percent". my $warnThreshold = 50; my $critThreshold = 5; if ($inputWarnThreshold >= 0 && $inputWarnThreshold <= 100) { $warnThreshold = $inputWarnThreshold; } if ($inputCritThreshold >= 0 && $inputCritThreshold <= 100) { $critThreshold = $inputCritThreshold; } # A hash to hold the statistics names and values gathered from input. my %stats = (); # Vital statistics for general VDO health. This array contains only the # names of the desired statistics to store in the %stats hash. my @statNames = ( 'operating mode', 'data blocks used', 'overhead blocks used', 'logical blocks used', 'physical blocks', 'logical blocks', 'used percent', 'saving percent', '1k-blocks available', ); ############################################################################# # Get the statistics output for the given VDO device name, and filter the # desired stats values. 
## sub getStats { if (!$ARGV[0]) { return; } my $deviceName = $ARGV[0]; my @verboseStatsOutput = `sudo vdostats $deviceName --verbose`; foreach my $statLabel (@statNames) { foreach my $inpline (@verboseStatsOutput) { if ($inpline =~ $statLabel) { $inpline =~ /.*: (.*)$/; my $statValue = $1; $stats{$statLabel} = $statValue; } } } } ############################################################################# # main ## if (scalar(@ARGV) != 1) { print("Usage: monitor_check_vdostats_savingPercent.pl\n"); print(" [--warning |-w VALUE]\n"); print(" [--critical|-c VALUE]\n"); print(" \n"); exit(MONITOR_SERVICE_UNKNOWN); } getStats(); # If the stats table is empty, nothing was found; return unknown status. # Otherwise, print the stats. if (!%stats) { printf("Unable to load vdostats verbose output.\n"); exit(MONITOR_SERVICE_UNKNOWN); } printf("saving percent: %s%%\n", $stats{"saving percent"}); # Process the critical and warning thresholds. # If critThreshold is less than warnThreshold, the only used percentage # return codes will be "OK" or "CRITICAL". # An empty VDO volume has a saving percent of "N/A" when 0 logical blocks # are used. This is interpreted as an "OK" status. if ($stats{"saving percent"} =~ "N/A") { exit(MONITOR_SERVICE_OK); } if ($stats{"saving percent"} <= $warnThreshold && $stats{"saving percent"} > $critThreshold) { exit(MONITOR_SERVICE_WARNING); } if ($stats{"saving percent"} <= $critThreshold) { exit(MONITOR_SERVICE_CRITICAL); } if ($stats{"saving percent"} > $warnThreshold) { exit(MONITOR_SERVICE_OK); } # Default exit condition. exit(MONITOR_SERVICE_UNKNOWN); vdo-8.3.1.1/utils/000077500000000000000000000000001476467262700136375ustar00rootroot00000000000000vdo-8.3.1.1/utils/Makefile000066400000000000000000000015511476467262700153010ustar00rootroot00000000000000# # Copyright Red Hat # # This program is free software; you can redistribute it and/or # modify it under the terms of the GNU General Public License # as published by the Free Software Foundation; either version 2 # of the License, or (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA # 02110-1301, USA. # SUBDIRS = uds vdo .PHONY: all clean install all clean install: for d in $(SUBDIRS); do \ $(MAKE) -C $$d $@ || exit 1; \ done vdo-8.3.1.1/utils/uds/000077500000000000000000000000001476467262700144325ustar00rootroot00000000000000vdo-8.3.1.1/utils/uds/Makefile000066400000000000000000000064711476467262700161020ustar00rootroot00000000000000# # Copyright Red Hat # # This program is free software; you can redistribute it and/or # modify it under the terms of the GNU General Public License # as published by the Free Software Foundation; either version 2 # of the License, or (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. 
# # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA # 02110-1301, USA. # BUILD_VERSION = 8.3.1.1 DEPDIR = .deps ifdef LLVM export CC := clang export LD := ld.ldd endif ifeq ($(origin CC), default) CC := gcc endif ifeq ($(findstring clang, $(CC)),clang) # Ignore additional warnings for clang WARNS = -Wno-compare-distinct-pointer-types \ -Wno-gnu-statement-expression \ -Wno-gnu-zero-variadic-macro-arguments \ -Wno-implicit-const-int-float-conversion \ -Wno-language-extension-token else WARNS = -Wcast-align \ -Wcast-qual \ -Wformat=2 \ -Wlogical-op endif WARNS += -Wall \ -Werror \ -Wextra \ -Winit-self \ -Wmissing-include-dirs \ -Wpointer-arith \ -Wredundant-decls \ -Wunused \ -Wwrite-strings C_WARNS = -Wbad-function-cast \ -Wfloat-equal \ -Wmissing-declarations \ -Wmissing-format-attribute \ -Wmissing-prototypes \ -Wnested-externs \ -Wold-style-definition \ -Wswitch-default OPT_FLAGS = -O3 -fno-omit-frame-pointer DEBUG_FLAGS = RPM_OPT_FLAGS ?= -fpic GLOBAL_FLAGS = $(RPM_OPT_FLAGS) -D_GNU_SOURCE -g $(OPT_FLAGS) \ $(WARNS) $(shell getconf LFS_CFLAGS) $(DEBUG_FLAGS) \ -DCURRENT_VERSION='"$(BUILD_VERSION)"' \ CFLAGS = $(GLOBAL_FLAGS) -I. -std=gnu11 -pedantic $(C_WARNS) $(MY_CFLAGS) LDFLAGS = $(RPM_LD_FLAGS) $(MY_LDFLAGS) MY_FLAGS = MY_CFLAGS = $(MY_FLAGS) MY_LDFLAGS = vpath %.c . UDS_OBJECTS = chapter-index.o \ config.o \ delta-index.o \ dm-bufio.o \ errors.o \ event-count.o \ fileUtils.o \ funnel-queue.o \ geometry.o \ index.o \ index-layout.o \ index-page-map.o \ index-session.o \ io-factory.o \ logger.o \ memoryAlloc.o \ minisyslog.o \ murmurhash3.o \ open-chapter.o \ permassert.o \ radix-sort.o \ random.o \ requestQueue.o \ sparse-cache.o \ string-utils.o \ syscalls.o \ threadCondVar.o \ threadMutex.o \ threadSemaphore.o \ thread-utils.o \ time-utils.o \ volume.o \ volume-index.o .PHONY: all all: libuds.a .PHONY: clean clean: rm -rf *.o *.a $(DEPDIR) .PHONY: install install:; libuds.a: $(UDS_OBJECTS) rm -f $@ ar cr $@ $^ %.s: %.c $(CC) $(CFLAGS) -S $^ ######################################################################## # Dependency processing %.o: %.c @mkdir -p $(DEPDIR)/$(@D) $(@D) $(COMPILE.c) -MD -MF $(DEPDIR)/$*.d.new -MP -MT $@ $< -o $@ if cmp -s $(DEPDIR)/$*.d $(DEPDIR)/$*.d.new; then \ rm -f $(DEPDIR)/$*.d.new ; \ else \ mv -f $(DEPDIR)/$*.d.new $(DEPDIR)/$*.d ; \ fi $(DEPDIR)/%.d: %.c @mkdir -p $(@D) $(CC) $(CFLAGS) -MM -MF $@ -MP -MT $*.o $< ifneq ($(MAKECMDGOALS),clean) -include $(UDS_OBJECTS:%.o=$(DEPDIR)/%.d) endif vdo-8.3.1.1/utils/uds/chapter-index.c000066400000000000000000000221021476467262700173260ustar00rootroot00000000000000// SPDX-License-Identifier: GPL-2.0-only /* * Copyright 2023 Red Hat */ #include "chapter-index.h" #include "errors.h" #include "logger.h" #include "memory-alloc.h" #include "permassert.h" #include "hash-utils.h" #include "indexer.h" int uds_make_open_chapter_index(struct open_chapter_index **chapter_index, const struct index_geometry *geometry, u64 volume_nonce) { int result; size_t memory_size; struct open_chapter_index *index; result = vdo_allocate(1, struct open_chapter_index, "open chapter index", &index); if (result != VDO_SUCCESS) return result; /* * The delta index will rebalance delta lists when memory gets tight, * so give the chapter index one extra page. 
*/ memory_size = ((geometry->index_pages_per_chapter + 1) * geometry->bytes_per_page); index->geometry = geometry; index->volume_nonce = volume_nonce; result = uds_initialize_delta_index(&index->delta_index, 1, geometry->delta_lists_per_chapter, geometry->chapter_mean_delta, geometry->chapter_payload_bits, memory_size, 'm'); if (result != UDS_SUCCESS) { vdo_free(index); return result; } index->memory_size = index->delta_index.memory_size + sizeof(struct open_chapter_index); *chapter_index = index; return UDS_SUCCESS; } void uds_free_open_chapter_index(struct open_chapter_index *chapter_index) { if (chapter_index == NULL) return; uds_uninitialize_delta_index(&chapter_index->delta_index); vdo_free(chapter_index); } /* Re-initialize an open chapter index for a new chapter. */ void uds_empty_open_chapter_index(struct open_chapter_index *chapter_index, u64 virtual_chapter_number) { uds_reset_delta_index(&chapter_index->delta_index); chapter_index->virtual_chapter_number = virtual_chapter_number; } static inline bool was_entry_found(const struct delta_index_entry *entry, u32 address) { return (!entry->at_end) && (entry->key == address); } /* Associate a record name with the record page containing its metadata. */ int uds_put_open_chapter_index_record(struct open_chapter_index *chapter_index, const struct uds_record_name *name, u32 page_number) { int result; struct delta_index_entry entry; u32 address; u32 list_number; const u8 *found_name; bool found; const struct index_geometry *geometry = chapter_index->geometry; u64 chapter_number = chapter_index->virtual_chapter_number; u32 record_pages = geometry->record_pages_per_chapter; result = VDO_ASSERT(page_number < record_pages, "Page number within chapter (%u) exceeds the maximum value %u", page_number, record_pages); if (result != VDO_SUCCESS) return UDS_INVALID_ARGUMENT; address = uds_hash_to_chapter_delta_address(name, geometry); list_number = uds_hash_to_chapter_delta_list(name, geometry); result = uds_get_delta_index_entry(&chapter_index->delta_index, list_number, address, name->name, &entry); if (result != UDS_SUCCESS) return result; found = was_entry_found(&entry, address); result = VDO_ASSERT(!(found && entry.is_collision), "Chunk appears more than once in chapter %llu", (unsigned long long) chapter_number); if (result != VDO_SUCCESS) return UDS_BAD_STATE; found_name = (found ? name->name : NULL); return uds_put_delta_index_entry(&entry, address, page_number, found_name); } /* * Pack a section of an open chapter index into a chapter index page. A range of delta lists * (starting with a specified list index) is copied from the open chapter index into a memory page. * The number of lists copied onto the page is returned to the caller on success. 
* * @chapter_index: The open chapter index * @memory: The memory page to use * @first_list: The first delta list number to be copied * @last_page: If true, this is the last page of the chapter index and all the remaining lists must * be packed onto this page * @lists_packed: The number of delta lists that were packed onto this page */ int uds_pack_open_chapter_index_page(struct open_chapter_index *chapter_index, u8 *memory, u32 first_list, bool last_page, u32 *lists_packed) { int result; struct delta_index *delta_index = &chapter_index->delta_index; struct delta_index_stats stats; u64 nonce = chapter_index->volume_nonce; u64 chapter_number = chapter_index->virtual_chapter_number; const struct index_geometry *geometry = chapter_index->geometry; u32 list_count = geometry->delta_lists_per_chapter; unsigned int removals = 0; struct delta_index_entry entry; u32 next_list; s32 list_number; for (;;) { result = uds_pack_delta_index_page(delta_index, nonce, memory, geometry->bytes_per_page, chapter_number, first_list, lists_packed); if (result != UDS_SUCCESS) return result; if ((first_list + *lists_packed) == list_count) { /* All lists are packed. */ break; } else if (*lists_packed == 0) { /* * The next delta list does not fit on a page. This delta list will be * removed. */ } else if (last_page) { /* * This is the last page and there are lists left unpacked, but all of the * remaining lists must fit on the page. Find a list that contains entries * and remove the entire list. Try the first list that does not fit. If it * is empty, we will select the last list that already fits and has any * entries. */ } else { /* This page is done. */ break; } if (removals == 0) { uds_get_delta_index_stats(delta_index, &stats); vdo_log_warning("The chapter index for chapter %llu contains %llu entries with %llu collisions", (unsigned long long) chapter_number, (unsigned long long) stats.record_count, (unsigned long long) stats.collision_count); } list_number = *lists_packed; do { if (list_number < 0) return UDS_OVERFLOW; next_list = first_list + list_number--; result = uds_start_delta_index_search(delta_index, next_list, 0, &entry); if (result != UDS_SUCCESS) return result; result = uds_next_delta_index_entry(&entry); if (result != UDS_SUCCESS) return result; } while (entry.at_end); do { result = uds_remove_delta_index_entry(&entry); if (result != UDS_SUCCESS) return result; removals++; } while (!entry.at_end); } if (removals > 0) { vdo_log_warning("To avoid chapter index page overflow in chapter %llu, %u entries were removed from the chapter index", (unsigned long long) chapter_number, removals); } return UDS_SUCCESS; } /* Make a new chapter index page, initializing it with the data from a given index_page buffer. */ int uds_initialize_chapter_index_page(struct delta_index_page *index_page, const struct index_geometry *geometry, u8 *page_buffer, u64 volume_nonce) { return uds_initialize_delta_index_page(index_page, volume_nonce, geometry->chapter_mean_delta, geometry->chapter_payload_bits, page_buffer, geometry->bytes_per_page); } /* Validate a chapter index page read during rebuild. */ int uds_validate_chapter_index_page(const struct delta_index_page *index_page, const struct index_geometry *geometry) { int result; const struct delta_index *delta_index = &index_page->delta_index; u32 first = index_page->lowest_list_number; u32 last = index_page->highest_list_number; u32 list_number; /* We walk every delta list from start to finish. 
*/ for (list_number = first; list_number <= last; list_number++) { struct delta_index_entry entry; result = uds_start_delta_index_search(delta_index, list_number - first, 0, &entry); if (result != UDS_SUCCESS) return result; for (;;) { result = uds_next_delta_index_entry(&entry); if (result != UDS_SUCCESS) { /* * A random bit stream is highly likely to arrive here when we go * past the end of the delta list. */ return result; } if (entry.at_end) break; /* Also make sure that the record page field contains a plausible value. */ if (uds_get_delta_entry_value(&entry) >= geometry->record_pages_per_chapter) { /* * Do not log this as an error. It happens in normal operation when * we are doing a rebuild but haven't written the entire volume * once. */ return UDS_CORRUPT_DATA; } } } return UDS_SUCCESS; } /* * Search a chapter index page for a record name, returning the record page number that may contain * the name. */ int uds_search_chapter_index_page(struct delta_index_page *index_page, const struct index_geometry *geometry, const struct uds_record_name *name, u16 *record_page_ptr) { int result; struct delta_index *delta_index = &index_page->delta_index; u32 address = uds_hash_to_chapter_delta_address(name, geometry); u32 delta_list_number = uds_hash_to_chapter_delta_list(name, geometry); u32 sub_list_number = delta_list_number - index_page->lowest_list_number; struct delta_index_entry entry; result = uds_get_delta_index_entry(delta_index, sub_list_number, address, name->name, &entry); if (result != UDS_SUCCESS) return result; if (was_entry_found(&entry, address)) *record_page_ptr = uds_get_delta_entry_value(&entry); else *record_page_ptr = NO_CHAPTER_INDEX_ENTRY; return UDS_SUCCESS; } vdo-8.3.1.1/utils/uds/chapter-index.h000066400000000000000000000041511476467262700173370ustar00rootroot00000000000000/* SPDX-License-Identifier: GPL-2.0-only */ /* * Copyright 2023 Red Hat */ #ifndef UDS_CHAPTER_INDEX_H #define UDS_CHAPTER_INDEX_H #include #include "delta-index.h" #include "geometry.h" /* * A chapter index for an open chapter is a mutable structure that tracks all the records that have * been added to the chapter. A chapter index for a closed chapter is similar except that it is * immutable because the contents of a closed chapter can never change, and the immutable structure * is more efficient. Both types of chapter index are implemented with a delta index. */ /* The value returned when no entry is found in the chapter index. 
*/ #define NO_CHAPTER_INDEX_ENTRY U16_MAX struct open_chapter_index { const struct index_geometry *geometry; struct delta_index delta_index; u64 virtual_chapter_number; u64 volume_nonce; size_t memory_size; }; int __must_check uds_make_open_chapter_index(struct open_chapter_index **chapter_index, const struct index_geometry *geometry, u64 volume_nonce); void uds_free_open_chapter_index(struct open_chapter_index *chapter_index); void uds_empty_open_chapter_index(struct open_chapter_index *chapter_index, u64 virtual_chapter_number); int __must_check uds_put_open_chapter_index_record(struct open_chapter_index *chapter_index, const struct uds_record_name *name, u32 page_number); int __must_check uds_pack_open_chapter_index_page(struct open_chapter_index *chapter_index, u8 *memory, u32 first_list, bool last_page, u32 *lists_packed); int __must_check uds_initialize_chapter_index_page(struct delta_index_page *index_page, const struct index_geometry *geometry, u8 *page_buffer, u64 volume_nonce); int __must_check uds_validate_chapter_index_page(const struct delta_index_page *index_page, const struct index_geometry *geometry); int __must_check uds_search_chapter_index_page(struct delta_index_page *index_page, const struct index_geometry *geometry, const struct uds_record_name *name, u16 *record_page_ptr); #endif /* UDS_CHAPTER_INDEX_H */ vdo-8.3.1.1/utils/uds/config.c000066400000000000000000000314731476467262700160530ustar00rootroot00000000000000// SPDX-License-Identifier: GPL-2.0-only /* * Copyright 2023 Red Hat */ #include "config.h" #include "logger.h" #include "memory-alloc.h" #include "numeric.h" #include "string-utils.h" #include "thread-utils.h" static const u8 INDEX_CONFIG_MAGIC[] = "ALBIC"; static const u8 INDEX_CONFIG_VERSION_6_02[] = "06.02"; static const u8 INDEX_CONFIG_VERSION_8_02[] = "08.02"; #define DEFAULT_VOLUME_READ_THREADS 2 #define MAX_VOLUME_READ_THREADS 16 #define INDEX_CONFIG_MAGIC_LENGTH (sizeof(INDEX_CONFIG_MAGIC) - 1) #define INDEX_CONFIG_VERSION_LENGTH ((int)(sizeof(INDEX_CONFIG_VERSION_6_02) - 1)) static bool is_version(const u8 *version, u8 *buffer) { return memcmp(version, buffer, INDEX_CONFIG_VERSION_LENGTH) == 0; } static bool are_matching_configurations(struct uds_configuration *saved_config, struct index_geometry *saved_geometry, struct uds_configuration *user) { struct index_geometry *geometry = user->geometry; bool result = true; if (saved_geometry->record_pages_per_chapter != geometry->record_pages_per_chapter) { vdo_log_error("Record pages per chapter (%u) does not match (%u)", saved_geometry->record_pages_per_chapter, geometry->record_pages_per_chapter); result = false; } if (saved_geometry->chapters_per_volume != geometry->chapters_per_volume) { vdo_log_error("Chapter count (%u) does not match (%u)", saved_geometry->chapters_per_volume, geometry->chapters_per_volume); result = false; } if (saved_geometry->sparse_chapters_per_volume != geometry->sparse_chapters_per_volume) { vdo_log_error("Sparse chapter count (%u) does not match (%u)", saved_geometry->sparse_chapters_per_volume, geometry->sparse_chapters_per_volume); result = false; } if (saved_config->cache_chapters != user->cache_chapters) { vdo_log_error("Cache size (%u) does not match (%u)", saved_config->cache_chapters, user->cache_chapters); result = false; } if (saved_config->volume_index_mean_delta != user->volume_index_mean_delta) { vdo_log_error("Volume index mean delta (%u) does not match (%u)", saved_config->volume_index_mean_delta, user->volume_index_mean_delta); result = false; } if 
(saved_geometry->bytes_per_page != geometry->bytes_per_page) { vdo_log_error("Bytes per page value (%zu) does not match (%zu)", saved_geometry->bytes_per_page, geometry->bytes_per_page); result = false; } if (saved_config->sparse_sample_rate != user->sparse_sample_rate) { vdo_log_error("Sparse sample rate (%u) does not match (%u)", saved_config->sparse_sample_rate, user->sparse_sample_rate); result = false; } if (saved_config->nonce != user->nonce) { vdo_log_error("Nonce (%llu) does not match (%llu)", (unsigned long long) saved_config->nonce, (unsigned long long) user->nonce); result = false; } return result; } /* Read the configuration and validate it against the provided one. */ int uds_validate_config_contents(struct buffered_reader *reader, struct uds_configuration *user_config) { int result; struct uds_configuration config; struct index_geometry geometry; u8 version_buffer[INDEX_CONFIG_VERSION_LENGTH]; u32 bytes_per_page; u8 buffer[sizeof(struct uds_configuration_6_02)]; size_t offset = 0; result = uds_verify_buffered_data(reader, INDEX_CONFIG_MAGIC, INDEX_CONFIG_MAGIC_LENGTH); if (result != UDS_SUCCESS) return result; result = uds_read_from_buffered_reader(reader, version_buffer, INDEX_CONFIG_VERSION_LENGTH); if (result != UDS_SUCCESS) return vdo_log_error_strerror(result, "cannot read index config version"); if (!is_version(INDEX_CONFIG_VERSION_6_02, version_buffer) && !is_version(INDEX_CONFIG_VERSION_8_02, version_buffer)) { return vdo_log_error_strerror(UDS_CORRUPT_DATA, "unsupported configuration version: '%.*s'", INDEX_CONFIG_VERSION_LENGTH, version_buffer); } result = uds_read_from_buffered_reader(reader, buffer, sizeof(buffer)); if (result != UDS_SUCCESS) return vdo_log_error_strerror(result, "cannot read config data"); decode_u32_le(buffer, &offset, &geometry.record_pages_per_chapter); decode_u32_le(buffer, &offset, &geometry.chapters_per_volume); decode_u32_le(buffer, &offset, &geometry.sparse_chapters_per_volume); decode_u32_le(buffer, &offset, &config.cache_chapters); offset += sizeof(u32); decode_u32_le(buffer, &offset, &config.volume_index_mean_delta); decode_u32_le(buffer, &offset, &bytes_per_page); geometry.bytes_per_page = bytes_per_page; decode_u32_le(buffer, &offset, &config.sparse_sample_rate); decode_u64_le(buffer, &offset, &config.nonce); result = VDO_ASSERT(offset == sizeof(struct uds_configuration_6_02), "%zu bytes read but not decoded", sizeof(struct uds_configuration_6_02) - offset); if (result != VDO_SUCCESS) return UDS_CORRUPT_DATA; if (is_version(INDEX_CONFIG_VERSION_6_02, version_buffer)) { user_config->geometry->remapped_virtual = 0; user_config->geometry->remapped_physical = 0; } else { u8 remapping[sizeof(u64) + sizeof(u64)]; result = uds_read_from_buffered_reader(reader, remapping, sizeof(remapping)); if (result != UDS_SUCCESS) return vdo_log_error_strerror(result, "cannot read converted config"); offset = 0; decode_u64_le(remapping, &offset, &user_config->geometry->remapped_virtual); decode_u64_le(remapping, &offset, &user_config->geometry->remapped_physical); } if (!are_matching_configurations(&config, &geometry, user_config)) { vdo_log_warning("Supplied configuration does not match save"); return UDS_NO_INDEX; } return UDS_SUCCESS; } /* * Write the configuration to stable storage. If the superblock version is < 4, write the 6.02 * version; otherwise write the 8.02 version, indicating the configuration is for an index that has * been reduced by one chapter. 
*/ int uds_write_config_contents(struct buffered_writer *writer, struct uds_configuration *config, u32 version) { int result; struct index_geometry *geometry = config->geometry; u8 buffer[sizeof(struct uds_configuration_8_02)]; size_t offset = 0; result = uds_write_to_buffered_writer(writer, INDEX_CONFIG_MAGIC, INDEX_CONFIG_MAGIC_LENGTH); if (result != UDS_SUCCESS) return result; /* * If version is < 4, the index has not been reduced by a chapter so it must be written out * as version 6.02 so that it is still compatible with older versions of UDS. */ if (version >= 4) { result = uds_write_to_buffered_writer(writer, INDEX_CONFIG_VERSION_8_02, INDEX_CONFIG_VERSION_LENGTH); if (result != UDS_SUCCESS) return result; } else { result = uds_write_to_buffered_writer(writer, INDEX_CONFIG_VERSION_6_02, INDEX_CONFIG_VERSION_LENGTH); if (result != UDS_SUCCESS) return result; } encode_u32_le(buffer, &offset, geometry->record_pages_per_chapter); encode_u32_le(buffer, &offset, geometry->chapters_per_volume); encode_u32_le(buffer, &offset, geometry->sparse_chapters_per_volume); encode_u32_le(buffer, &offset, config->cache_chapters); encode_u32_le(buffer, &offset, 0); encode_u32_le(buffer, &offset, config->volume_index_mean_delta); encode_u32_le(buffer, &offset, geometry->bytes_per_page); encode_u32_le(buffer, &offset, config->sparse_sample_rate); encode_u64_le(buffer, &offset, config->nonce); result = VDO_ASSERT(offset == sizeof(struct uds_configuration_6_02), "%zu bytes encoded, of %zu expected", offset, sizeof(struct uds_configuration_6_02)); if (result != VDO_SUCCESS) return result; if (version >= 4) { encode_u64_le(buffer, &offset, geometry->remapped_virtual); encode_u64_le(buffer, &offset, geometry->remapped_physical); } return uds_write_to_buffered_writer(writer, buffer, offset); } /* Compute configuration parameters that depend on memory size. 
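 *
 * As a worked illustration of the sparse branch below: if the dense geometry for a
 * given memory size would have B chapters, requesting a sparse index keeps the same
 * chapter size but uses 10 * B chapters in total, of which (19 * B) / 2 = 9.5 * B are
 * sparse. That leaves B / 2 dense chapters, so 95% of the chapters are sparse and the
 * deduplication window covers roughly ten times as many records. The "reduced" memory
 * sizes additionally drop one chapter, matching an index that has been converted from
 * an older format.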
*/ static int compute_memory_sizes(uds_memory_config_size_t mem_gb, bool sparse, u32 *chapters_per_volume, u32 *record_pages_per_chapter, u32 *sparse_chapters_per_volume) { u32 reduced_chapters = 0; u32 base_chapters; if (mem_gb == UDS_MEMORY_CONFIG_256MB) { base_chapters = DEFAULT_CHAPTERS_PER_VOLUME; *record_pages_per_chapter = SMALL_RECORD_PAGES_PER_CHAPTER; } else if (mem_gb == UDS_MEMORY_CONFIG_512MB) { base_chapters = DEFAULT_CHAPTERS_PER_VOLUME; *record_pages_per_chapter = 2 * SMALL_RECORD_PAGES_PER_CHAPTER; } else if (mem_gb == UDS_MEMORY_CONFIG_768MB) { base_chapters = DEFAULT_CHAPTERS_PER_VOLUME; *record_pages_per_chapter = 3 * SMALL_RECORD_PAGES_PER_CHAPTER; } else if ((mem_gb >= 1) && (mem_gb <= UDS_MEMORY_CONFIG_MAX)) { base_chapters = mem_gb * DEFAULT_CHAPTERS_PER_VOLUME; *record_pages_per_chapter = DEFAULT_RECORD_PAGES_PER_CHAPTER; } else if (mem_gb == UDS_MEMORY_CONFIG_REDUCED_256MB) { reduced_chapters = 1; base_chapters = DEFAULT_CHAPTERS_PER_VOLUME; *record_pages_per_chapter = SMALL_RECORD_PAGES_PER_CHAPTER; } else if (mem_gb == UDS_MEMORY_CONFIG_REDUCED_512MB) { reduced_chapters = 1; base_chapters = DEFAULT_CHAPTERS_PER_VOLUME; *record_pages_per_chapter = 2 * SMALL_RECORD_PAGES_PER_CHAPTER; } else if (mem_gb == UDS_MEMORY_CONFIG_REDUCED_768MB) { reduced_chapters = 1; base_chapters = DEFAULT_CHAPTERS_PER_VOLUME; *record_pages_per_chapter = 3 * SMALL_RECORD_PAGES_PER_CHAPTER; } else if ((mem_gb >= 1 + UDS_MEMORY_CONFIG_REDUCED) && (mem_gb <= UDS_MEMORY_CONFIG_REDUCED_MAX)) { reduced_chapters = 1; base_chapters = ((mem_gb - UDS_MEMORY_CONFIG_REDUCED) * DEFAULT_CHAPTERS_PER_VOLUME); *record_pages_per_chapter = DEFAULT_RECORD_PAGES_PER_CHAPTER; } else { vdo_log_error("received invalid memory size"); return -EINVAL; } if (sparse) { /* Make 95% of chapters sparse, allowing 10x more records. */ *sparse_chapters_per_volume = (19 * base_chapters) / 2; base_chapters *= 10; } else { *sparse_chapters_per_volume = 0; } *chapters_per_volume = base_chapters - reduced_chapters; return UDS_SUCCESS; } static unsigned int __must_check normalize_zone_count(unsigned int requested) { unsigned int zone_count = requested; if (zone_count == 0) zone_count = num_online_cpus() / 2; if (zone_count < 1) zone_count = 1; if (zone_count > MAX_ZONES) zone_count = MAX_ZONES; vdo_log_info("Using %u indexing zone%s for concurrency.", zone_count, zone_count == 1 ? 
"" : "s"); return zone_count; } static unsigned int __must_check normalize_read_threads(unsigned int requested) { unsigned int read_threads = requested; if (read_threads < 1) read_threads = DEFAULT_VOLUME_READ_THREADS; if (read_threads > MAX_VOLUME_READ_THREADS) read_threads = MAX_VOLUME_READ_THREADS; return read_threads; } int uds_make_configuration(const struct uds_parameters *params, struct uds_configuration **config_ptr) { struct uds_configuration *config; u32 chapters_per_volume = 0; u32 record_pages_per_chapter = 0; u32 sparse_chapters_per_volume = 0; int result; result = compute_memory_sizes(params->memory_size, params->sparse, &chapters_per_volume, &record_pages_per_chapter, &sparse_chapters_per_volume); if (result != UDS_SUCCESS) return result; result = vdo_allocate(1, struct uds_configuration, __func__, &config); if (result != VDO_SUCCESS) return result; result = uds_make_index_geometry(DEFAULT_BYTES_PER_PAGE, record_pages_per_chapter, chapters_per_volume, sparse_chapters_per_volume, 0, 0, &config->geometry); if (result != UDS_SUCCESS) { uds_free_configuration(config); return result; } config->zone_count = normalize_zone_count(params->zone_count); config->read_threads = normalize_read_threads(params->read_threads); config->cache_chapters = DEFAULT_CACHE_CHAPTERS; config->volume_index_mean_delta = DEFAULT_VOLUME_INDEX_MEAN_DELTA; config->sparse_sample_rate = (params->sparse ? DEFAULT_SPARSE_SAMPLE_RATE : 0); config->nonce = params->nonce; config->bdev = params->bdev; config->offset = params->offset; config->size = params->size; *config_ptr = config; return UDS_SUCCESS; } void uds_free_configuration(struct uds_configuration *config) { if (config != NULL) { uds_free_index_geometry(config->geometry); vdo_free(config); } } void uds_log_configuration(struct uds_configuration *config) { struct index_geometry *geometry = config->geometry; vdo_log_debug("Configuration:"); vdo_log_debug(" Record pages per chapter: %10u", geometry->record_pages_per_chapter); vdo_log_debug(" Chapters per volume: %10u", geometry->chapters_per_volume); vdo_log_debug(" Sparse chapters per volume: %10u", geometry->sparse_chapters_per_volume); vdo_log_debug(" Cache size (chapters): %10u", config->cache_chapters); vdo_log_debug(" Volume index mean delta: %10u", config->volume_index_mean_delta); vdo_log_debug(" Bytes per page: %10zu", geometry->bytes_per_page); vdo_log_debug(" Sparse sample rate: %10u", config->sparse_sample_rate); vdo_log_debug(" Nonce: %llu", (unsigned long long) config->nonce); } vdo-8.3.1.1/utils/uds/config.h000066400000000000000000000067541476467262700160640ustar00rootroot00000000000000/* SPDX-License-Identifier: GPL-2.0-only */ /* * Copyright 2023 Red Hat */ #ifndef UDS_CONFIG_H #define UDS_CONFIG_H #include "geometry.h" #include "indexer.h" #include "io-factory.h" /* * The uds_configuration records a variety of parameters used to configure a new UDS index. Some * parameters are provided by the client, while others are fixed or derived from user-supplied * values. It is created when an index is created, and it is recorded in the index metadata. */ enum { DEFAULT_VOLUME_INDEX_MEAN_DELTA = 4096, DEFAULT_CACHE_CHAPTERS = 7, DEFAULT_SPARSE_SAMPLE_RATE = 32, MAX_ZONES = 16, }; /* A set of configuration parameters for the indexer. 
*/ struct uds_configuration { /* Storage device for the index */ struct block_device *bdev; /* The maximum allowable size of the index */ size_t size; /* The offset where the index should start */ off_t offset; /* Parameters for the volume */ /* The volume layout */ struct index_geometry *geometry; /* Index owner's nonce */ u64 nonce; /* The number of threads used to process index requests */ unsigned int zone_count; /* The number of threads used to read volume pages */ unsigned int read_threads; /* Size of the page cache and sparse chapter index cache in chapters */ u32 cache_chapters; /* Parameters for the volume index */ /* The mean delta for the volume index */ u32 volume_index_mean_delta; /* Sampling rate for sparse indexing */ u32 sparse_sample_rate; }; /* On-disk structure of data for a version 8.02 index. */ struct uds_configuration_8_02 { /* Smaller (16), Small (64) or large (256) indices */ u32 record_pages_per_chapter; /* Total number of chapters per volume */ u32 chapters_per_volume; /* Number of sparse chapters per volume */ u32 sparse_chapters_per_volume; /* Size of the page cache, in chapters */ u32 cache_chapters; /* Unused field */ u32 unused; /* The volume index mean delta to use */ u32 volume_index_mean_delta; /* Size of a page, used for both record pages and index pages */ u32 bytes_per_page; /* Sampling rate for sparse indexing */ u32 sparse_sample_rate; /* Index owner's nonce */ u64 nonce; /* Virtual chapter remapped from physical chapter 0 */ u64 remapped_virtual; /* New physical chapter which remapped chapter was moved to */ u64 remapped_physical; } __packed; /* On-disk structure of data for a version 6.02 index. */ struct uds_configuration_6_02 { /* Smaller (16), Small (64) or large (256) indices */ u32 record_pages_per_chapter; /* Total number of chapters per volume */ u32 chapters_per_volume; /* Number of sparse chapters per volume */ u32 sparse_chapters_per_volume; /* Size of the page cache, in chapters */ u32 cache_chapters; /* Unused field */ u32 unused; /* The volume index mean delta to use */ u32 volume_index_mean_delta; /* Size of a page, used for both record pages and index pages */ u32 bytes_per_page; /* Sampling rate for sparse indexing */ u32 sparse_sample_rate; /* Index owner's nonce */ u64 nonce; } __packed; int __must_check uds_make_configuration(const struct uds_parameters *params, struct uds_configuration **config_ptr); void uds_free_configuration(struct uds_configuration *config); int __must_check uds_validate_config_contents(struct buffered_reader *reader, struct uds_configuration *config); int __must_check uds_write_config_contents(struct buffered_writer *writer, struct uds_configuration *config, u32 version); void uds_log_configuration(struct uds_configuration *config); #endif /* UDS_CONFIG_H */ vdo-8.3.1.1/utils/uds/cpu.h000066400000000000000000000036141476467262700153760ustar00rootroot00000000000000/* SPDX-License-Identifier: GPL-2.0-only */ /* * Copyright 2023 Red Hat */ #ifndef UDS_CPU_H #define UDS_CPU_H #include /** * uds_prefetch_address() - Minimize cache-miss latency by attempting to move data into a CPU cache * before it is accessed. * * @address: the address to fetch (may be invalid) * @for_write: must be constant at compile time--false if for reading, true if for writing */ static inline void uds_prefetch_address(const void *address, bool for_write) { /* * for_write won't be a constant if we are compiled with optimization turned off, in which * case prefetching really doesn't matter. 
clang can't figure out that if for_write is a * constant, it can be passed as the second, mandatorily constant argument to prefetch(), * at least currently on llvm 12. */ if (__builtin_constant_p(for_write)) { if (for_write) __builtin_prefetch(address, true); else __builtin_prefetch(address, false); } } /** * uds_prefetch_range() - Minimize cache-miss latency by attempting to move a range of addresses * into a CPU cache before they are accessed. * * @start: the starting address to fetch (may be invalid) * @size: the number of bytes in the address range * @for_write: must be constant at compile time--false if for reading, true if for writing */ static inline void uds_prefetch_range(const void *start, unsigned int size, bool for_write) { /* * Count the number of cache lines to fetch, allowing for the address range to span an * extra cache line boundary due to address alignment. */ const char *address = (const char *) start; unsigned int offset = ((uintptr_t) address % L1_CACHE_BYTES); unsigned int cache_lines = (1 + ((size + offset) / L1_CACHE_BYTES)); while (cache_lines-- > 0) { uds_prefetch_address(address, for_write); address += L1_CACHE_BYTES; } } #endif /* UDS_CPU_H */ vdo-8.3.1.1/utils/uds/delta-index.c000066400000000000000000001751671476467262700170150ustar00rootroot00000000000000// SPDX-License-Identifier: GPL-2.0-only /* * Copyright 2023 Red Hat */ #include "delta-index.h" #include #include #include #include #include #include "cpu.h" #include "errors.h" #include "logger.h" #include "memory-alloc.h" #include "numeric.h" #include "permassert.h" #include "string-utils.h" #include "time-utils.h" #include "config.h" #include "indexer.h" /* * The entries in a delta index could be stored in a single delta list, but to reduce search times * and update costs it uses multiple delta lists. These lists are stored in a single chunk of * memory managed by the delta_zone structure. The delta_zone can move the data around within its * memory, so the location of each delta list is recorded as a bit offset into the memory. Because * the volume index can contain over a million delta lists, we want to be efficient with the size * of the delta list header information. This information is encoded into 16 bytes per list. The * volume index delta list memory can easily exceed 4 gigabits, so a 64 bit value is needed to * address the memory. The volume index delta lists average around 6 kilobits, so 16 bits are * sufficient to store the size of a delta list. * * Each delta list is stored as a bit stream. Within the delta list encoding, bits and bytes are * numbered in little endian order. Within a byte, bit 0 is the least significant bit (0x1), and * bit 7 is the most significant bit (0x80). Within a bit stream, bit 7 is the most significant bit * of byte 0, and bit 8 is the least significant bit of byte 1. Within a byte array, a byte's * number corresponds to its index in the array. * * A standard delta list entry is stored as a fixed length payload (the value) followed by a * variable length key (the delta). A collision entry is used when two block names have the same * delta list address. A collision entry always follows a standard entry for the hash with which it * collides, and is encoded with DELTA == 0 with an additional 256 bits field at the end, * containing the full block name. An entry with a delta of 0 at the beginning of a delta list * indicates a normal entry. * * The delta in each entry is encoded with a variable-length Huffman code to minimize the memory * used by small deltas. 
The Huffman code is specified by three parameters, which can be computed * from the desired mean delta when the index is full. (See compute_coding_constants() for * details.) * * The bit field utilities used to read and write delta entries assume that it is possible to read * some bytes beyond the end of the bit field, so a delta_zone memory allocation is guarded by two * invalid delta lists to prevent reading outside the delta_zone memory. The valid delta lists are * numbered 1 to N, and the guard lists are numbered 0 and N+1. The function to decode the bit * stream include a step that skips over bits set to 0 until the first 1 bit is found. A corrupted * delta list could cause this step to run off the end of the delta_zone memory, so as extra * protection against this happening, the tail guard list is set to all ones. * * The delta_index supports two different forms. The mutable form is created by * uds_initialize_delta_index(), and is used for the volume index and for open chapter indexes. The * immutable form is created by uds_initialize_delta_index_page(), and is used for closed (and * cached) chapter index pages. The immutable form does not allocate delta list headers or * temporary offsets, and thus is somewhat more memory efficient. */ /* * This is the largest field size supported by get_field() and set_field(). Any field that is * larger is not guaranteed to fit in a single byte-aligned u32. */ #define MAX_FIELD_BITS ((sizeof(u32) - 1) * BITS_PER_BYTE + 1) /* * This is the largest field size supported by get_big_field() and set_big_field(). Any field that * is larger is not guaranteed to fit in a single byte-aligned u64. */ #define MAX_BIG_FIELD_BITS ((sizeof(u64) - 1) * BITS_PER_BYTE + 1) /* * This is the number of guard bytes needed at the end of the memory byte array when using the bit * utilities. These utilities call get_big_field() and set_big_field(), which can access up to 7 * bytes beyond the end of the desired field. The definition is written to make it clear how this * value is derived. */ #define POST_FIELD_GUARD_BYTES (sizeof(u64) - 1) /* The number of guard bits that are needed in the tail guard list */ #define GUARD_BITS (POST_FIELD_GUARD_BYTES * BITS_PER_BYTE) /* * The maximum size of a single delta list in bytes. We count guard bytes in this value because a * buffer of this size can be used with move_bits(). */ #define DELTA_LIST_MAX_BYTE_COUNT \ ((U16_MAX + BITS_PER_BYTE) / BITS_PER_BYTE + POST_FIELD_GUARD_BYTES) /* The number of extra bytes and bits needed to store a collision entry */ #define COLLISION_BYTES UDS_RECORD_NAME_SIZE #define COLLISION_BITS (COLLISION_BYTES * BITS_PER_BYTE) /* * Immutable delta lists are packed into pages containing a header that encodes the delta list * information into 19 bits per list (64KB bit offset). */ #define IMMUTABLE_HEADER_SIZE 19 /* * Constants and structures for the saved delta index. "DI" is for delta_index, and -##### is a * number to increment when the format of the data changes. */ #define MAGIC_SIZE 8 static const char DELTA_INDEX_MAGIC[] = "DI-00002"; struct delta_index_header { char magic[MAGIC_SIZE]; u32 zone_number; u32 zone_count; u32 first_list; u32 list_count; u64 record_count; u64 collision_count; }; /* * Header data used for immutable delta index pages. This data is followed by the delta list offset * table. 
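 *
 * A sketch of a complete immutable page, as assembled by uds_pack_delta_index_page()
 * and checked by verify_delta_index_page():
 *
 *   struct delta_page_header        20 packed bytes (nonce, chapter, first list, count)
 *   (list_count + 1) offsets        IMMUTABLE_HEADER_SIZE (19) bits each, giving the
 *                                   bit position where each list starts; the extra
 *                                   entry marks the end of the last list
 *   delta list bit streams          packed end to end at those offsets
 *   POST_FIELD_GUARD_BYTES bytes    set to all ones at the very end of the page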
*/ struct delta_page_header { /* Externally-defined nonce */ u64 nonce; /* The virtual chapter number */ u64 virtual_chapter_number; /* Index of the first delta list on the page */ u16 first_list; /* Number of delta lists on the page */ u16 list_count; } __packed; static inline u64 get_delta_list_byte_start(const struct delta_list *delta_list) { return delta_list->start / BITS_PER_BYTE; } static inline u16 get_delta_list_byte_size(const struct delta_list *delta_list) { unsigned int bit_offset = delta_list->start % BITS_PER_BYTE; return BITS_TO_BYTES(bit_offset + delta_list->size); } static void rebalance_delta_zone(const struct delta_zone *delta_zone, u32 first, u32 last) { struct delta_list *delta_list; u64 new_start; if (first == last) { /* Only one list is moving, and we know there is space. */ delta_list = &delta_zone->delta_lists[first]; new_start = delta_zone->new_offsets[first]; if (delta_list->start != new_start) { u64 source; u64 destination; source = get_delta_list_byte_start(delta_list); delta_list->start = new_start; destination = get_delta_list_byte_start(delta_list); memmove(delta_zone->memory + destination, delta_zone->memory + source, get_delta_list_byte_size(delta_list)); } } else { /* * There is more than one list. Divide the problem in half, and use recursive calls * to process each half. Note that after this computation, first <= middle, and * middle < last. */ u32 middle = (first + last) / 2; delta_list = &delta_zone->delta_lists[middle]; new_start = delta_zone->new_offsets[middle]; /* * The direction that our middle list is moving determines which half of the * problem must be processed first. */ if (new_start > delta_list->start) { rebalance_delta_zone(delta_zone, middle + 1, last); rebalance_delta_zone(delta_zone, first, middle); } else { rebalance_delta_zone(delta_zone, first, middle); rebalance_delta_zone(delta_zone, middle + 1, last); } } } static inline size_t get_zone_memory_size(unsigned int zone_count, size_t memory_size) { /* Round up so that each zone is a multiple of 64K in size. */ size_t ALLOC_BOUNDARY = 64 * 1024; return (memory_size / zone_count + ALLOC_BOUNDARY - 1) & -ALLOC_BOUNDARY; } void uds_reset_delta_index(const struct delta_index *delta_index) { unsigned int z; /* * Initialize all delta lists to be empty. We keep 2 extra delta list descriptors, one * before the first real entry and one after so that we don't need to bounds check the * array access when calculating preceding and following gap sizes. */ for (z = 0; z < delta_index->zone_count; z++) { u64 list_bits; u64 spacing; u64 offset; unsigned int i; struct delta_zone *zone = &delta_index->delta_zones[z]; struct delta_list *delta_lists = zone->delta_lists; /* Zeroing the delta list headers initializes the head guard list correctly. */ memset(delta_lists, 0, (zone->list_count + 2) * sizeof(struct delta_list)); /* Set all the bits in the end guard list. */ list_bits = (u64) zone->size * BITS_PER_BYTE - GUARD_BITS; delta_lists[zone->list_count + 1].start = list_bits; delta_lists[zone->list_count + 1].size = GUARD_BITS; memset(zone->memory + (list_bits / BITS_PER_BYTE), ~0, POST_FIELD_GUARD_BYTES); /* Evenly space out the real delta lists by setting regular offsets. */ spacing = list_bits / zone->list_count; offset = spacing / 2; for (i = 1; i <= zone->list_count; i++) { delta_lists[i].start = offset; offset += spacing; } /* Update the statistics. 
*/ zone->discard_count += zone->record_count; zone->record_count = 0; zone->collision_count = 0; } } /* Compute the Huffman coding parameters for the given mean delta. The Huffman code is specified by * three parameters: * * MINBITS The number of bits in the smallest code * BASE The number of values coded using a code of length MINBITS * INCR The number of values coded by using one additional bit * * These parameters are related by this equation: * * BASE + INCR == 1 << MINBITS * * The math for the Huffman code of an exponential distribution says that * * INCR = log(2) * MEAN_DELTA * * Then use the smallest MINBITS value so that * * (1 << MINBITS) > INCR * * And then * * BASE = (1 << MINBITS) - INCR * * Now the index can generate a code such that * - The first BASE values code using MINBITS bits. * - The next INCR values code using MINBITS+1 bits. * - The next INCR values code using MINBITS+2 bits. * - (and so on). */ static void compute_coding_constants(u32 mean_delta, u16 *min_bits, u32 *min_keys, u32 *incr_keys) { /* * We want to compute the rounded value of log(2) * mean_delta. Since we cannot always use * floating point, use a really good integer approximation. */ *incr_keys = (836158UL * mean_delta + 603160UL) / 1206321UL; *min_bits = bits_per(*incr_keys + 1); *min_keys = (1 << *min_bits) - *incr_keys; } void uds_uninitialize_delta_index(struct delta_index *delta_index) { unsigned int z; if (delta_index->delta_zones == NULL) return; for (z = 0; z < delta_index->zone_count; z++) { vdo_free(vdo_forget(delta_index->delta_zones[z].new_offsets)); vdo_free(vdo_forget(delta_index->delta_zones[z].delta_lists)); vdo_free(vdo_forget(delta_index->delta_zones[z].memory)); } vdo_free(delta_index->delta_zones); memset(delta_index, 0, sizeof(struct delta_index)); } static int initialize_delta_zone(struct delta_zone *delta_zone, size_t size, u32 first_list, u32 list_count, u32 mean_delta, u32 payload_bits, u8 tag) { int result; result = vdo_allocate(size, u8, "delta list", &delta_zone->memory); if (result != VDO_SUCCESS) return result; result = vdo_allocate(list_count + 2, u64, "delta list temp", &delta_zone->new_offsets); if (result != VDO_SUCCESS) return result; /* Allocate the delta lists. 
*/ result = vdo_allocate(list_count + 2, struct delta_list, "delta lists", &delta_zone->delta_lists); if (result != VDO_SUCCESS) return result; compute_coding_constants(mean_delta, &delta_zone->min_bits, &delta_zone->min_keys, &delta_zone->incr_keys); delta_zone->value_bits = payload_bits; delta_zone->buffered_writer = NULL; delta_zone->size = size; delta_zone->rebalance_time = 0; delta_zone->rebalance_count = 0; delta_zone->record_count = 0; delta_zone->collision_count = 0; delta_zone->discard_count = 0; delta_zone->overflow_count = 0; delta_zone->first_list = first_list; delta_zone->list_count = list_count; delta_zone->tag = tag; return UDS_SUCCESS; } int uds_initialize_delta_index(struct delta_index *delta_index, unsigned int zone_count, u32 list_count, u32 mean_delta, u32 payload_bits, size_t memory_size, u8 tag) { int result; unsigned int z; size_t zone_memory; result = vdo_allocate(zone_count, struct delta_zone, "Delta Index Zones", &delta_index->delta_zones); if (result != VDO_SUCCESS) return result; delta_index->zone_count = zone_count; delta_index->list_count = list_count; delta_index->lists_per_zone = DIV_ROUND_UP(list_count, zone_count); delta_index->memory_size = 0; delta_index->mutable = true; delta_index->tag = tag; for (z = 0; z < zone_count; z++) { u32 lists_in_zone = delta_index->lists_per_zone; u32 first_list_in_zone = z * lists_in_zone; if (z == zone_count - 1) { /* * The last zone gets fewer lists if zone_count doesn't evenly divide * list_count. We'll have an underflow if the assertion below doesn't hold. */ if (delta_index->list_count <= first_list_in_zone) { uds_uninitialize_delta_index(delta_index); return vdo_log_error_strerror(UDS_INVALID_ARGUMENT, "%u delta lists not enough for %u zones", list_count, zone_count); } lists_in_zone = delta_index->list_count - first_list_in_zone; } zone_memory = get_zone_memory_size(zone_count, memory_size); result = initialize_delta_zone(&delta_index->delta_zones[z], zone_memory, first_list_in_zone, lists_in_zone, mean_delta, payload_bits, tag); if (result != UDS_SUCCESS) { uds_uninitialize_delta_index(delta_index); return result; } delta_index->memory_size += (sizeof(struct delta_zone) + zone_memory + (lists_in_zone + 2) * (sizeof(struct delta_list) + sizeof(u64))); } uds_reset_delta_index(delta_index); return UDS_SUCCESS; } /* Read a bit field from an arbitrary bit boundary. */ static inline u32 get_field(const u8 *memory, u64 offset, u8 size) { const void *addr = memory + offset / BITS_PER_BYTE; return (get_unaligned_le32(addr) >> (offset % BITS_PER_BYTE)) & ((1 << size) - 1); } /* Write a bit field to an arbitrary bit boundary. */ static inline void set_field(u32 value, u8 *memory, u64 offset, u8 size) { void *addr = memory + offset / BITS_PER_BYTE; int shift = offset % BITS_PER_BYTE; u32 data = get_unaligned_le32(addr); data &= ~(((1 << size) - 1) << shift); data |= value << shift; put_unaligned_le32(data, addr); } /* Get the bit offset to the immutable delta list header. */ static inline u32 get_immutable_header_offset(u32 list_number) { return sizeof(struct delta_page_header) * BITS_PER_BYTE + list_number * IMMUTABLE_HEADER_SIZE; } /* Get the bit offset to the start of the immutable delta list bit stream. */ static inline u32 get_immutable_start(const u8 *memory, u32 list_number) { return get_field(memory, get_immutable_header_offset(list_number), IMMUTABLE_HEADER_SIZE); } /* Set the bit offset to the start of the immutable delta list bit stream. 
*/ static inline void set_immutable_start(u8 *memory, u32 list_number, u32 start) { set_field(start, memory, get_immutable_header_offset(list_number), IMMUTABLE_HEADER_SIZE); } static bool verify_delta_index_page(u64 nonce, u16 list_count, u64 expected_nonce, u8 *memory, size_t memory_size) { unsigned int i; /* * Verify the nonce. A mismatch can happen here during rebuild if we haven't written the * entire volume at least once. */ if (nonce != expected_nonce) return false; /* Verify that the number of delta lists can fit in the page. */ if (list_count > ((memory_size - sizeof(struct delta_page_header)) * BITS_PER_BYTE / IMMUTABLE_HEADER_SIZE)) return false; /* * Verify that the first delta list is immediately after the last delta * list header. */ if (get_immutable_start(memory, 0) != get_immutable_header_offset(list_count + 1)) return false; /* Verify that the lists are in the correct order. */ for (i = 0; i < list_count; i++) { if (get_immutable_start(memory, i) > get_immutable_start(memory, i + 1)) return false; } /* * Verify that the last list ends on the page, and that there is room * for the post-field guard bits. */ if (get_immutable_start(memory, list_count) > (memory_size - POST_FIELD_GUARD_BYTES) * BITS_PER_BYTE) return false; /* Verify that the guard bytes are correctly set to all ones. */ for (i = 0; i < POST_FIELD_GUARD_BYTES; i++) { if (memory[memory_size - POST_FIELD_GUARD_BYTES + i] != (u8) ~0) return false; } /* All verifications passed. */ return true; } /* Initialize a delta index page to refer to a supplied page. */ int uds_initialize_delta_index_page(struct delta_index_page *delta_index_page, u64 expected_nonce, u32 mean_delta, u32 payload_bits, u8 *memory, size_t memory_size) { u64 nonce; u64 vcn; u64 first_list; u64 list_count; struct delta_page_header *header = (struct delta_page_header *) memory; struct delta_zone *delta_zone = &delta_index_page->delta_zone; const u8 *nonce_addr = (const u8 *) &header->nonce; const u8 *vcn_addr = (const u8 *) &header->virtual_chapter_number; const u8 *first_list_addr = (const u8 *) &header->first_list; const u8 *list_count_addr = (const u8 *) &header->list_count; /* First assume that the header is little endian. */ nonce = get_unaligned_le64(nonce_addr); vcn = get_unaligned_le64(vcn_addr); first_list = get_unaligned_le16(first_list_addr); list_count = get_unaligned_le16(list_count_addr); if (!verify_delta_index_page(nonce, list_count, expected_nonce, memory, memory_size)) { /* If that fails, try big endian. */ nonce = get_unaligned_be64(nonce_addr); vcn = get_unaligned_be64(vcn_addr); first_list = get_unaligned_be16(first_list_addr); list_count = get_unaligned_be16(list_count_addr); if (!verify_delta_index_page(nonce, list_count, expected_nonce, memory, memory_size)) { /* * Both attempts failed. Do not log this as an error, because it can happen * during a rebuild if we haven't written the entire volume at least once. 
*/ return UDS_CORRUPT_DATA; } } delta_index_page->delta_index.delta_zones = delta_zone; delta_index_page->delta_index.zone_count = 1; delta_index_page->delta_index.list_count = list_count; delta_index_page->delta_index.lists_per_zone = list_count; delta_index_page->delta_index.mutable = false; delta_index_page->delta_index.tag = 'p'; delta_index_page->virtual_chapter_number = vcn; delta_index_page->lowest_list_number = first_list; delta_index_page->highest_list_number = first_list + list_count - 1; compute_coding_constants(mean_delta, &delta_zone->min_bits, &delta_zone->min_keys, &delta_zone->incr_keys); delta_zone->value_bits = payload_bits; delta_zone->memory = memory; delta_zone->delta_lists = NULL; delta_zone->new_offsets = NULL; delta_zone->buffered_writer = NULL; delta_zone->size = memory_size; delta_zone->rebalance_time = 0; delta_zone->rebalance_count = 0; delta_zone->record_count = 0; delta_zone->collision_count = 0; delta_zone->discard_count = 0; delta_zone->overflow_count = 0; delta_zone->first_list = 0; delta_zone->list_count = list_count; delta_zone->tag = 'p'; return UDS_SUCCESS; } /* Read a large bit field from an arbitrary bit boundary. */ static inline u64 get_big_field(const u8 *memory, u64 offset, u8 size) { const void *addr = memory + offset / BITS_PER_BYTE; return (get_unaligned_le64(addr) >> (offset % BITS_PER_BYTE)) & ((1UL << size) - 1); } /* Write a large bit field to an arbitrary bit boundary. */ static inline void set_big_field(u64 value, u8 *memory, u64 offset, u8 size) { void *addr = memory + offset / BITS_PER_BYTE; u8 shift = offset % BITS_PER_BYTE; u64 data = get_unaligned_le64(addr); data &= ~(((1UL << size) - 1) << shift); data |= value << shift; put_unaligned_le64(data, addr); } /* Set a sequence of bits to all zeros. */ static inline void set_zero(u8 *memory, u64 offset, u32 size) { if (size > 0) { u8 *addr = memory + offset / BITS_PER_BYTE; u8 shift = offset % BITS_PER_BYTE; u32 count = size + shift > BITS_PER_BYTE ? (u32) BITS_PER_BYTE - shift : size; *addr++ &= ~(((1 << count) - 1) << shift); for (size -= count; size > BITS_PER_BYTE; size -= BITS_PER_BYTE) *addr++ = 0; if (size > 0) *addr &= 0xFF << size; } } /* * Move several bits from a higher to a lower address, moving the lower addressed bits first. The * size and memory offsets are measured in bits. */ static void move_bits_down(const u8 *from, u64 from_offset, u8 *to, u64 to_offset, u32 size) { const u8 *source; u8 *destination; u8 offset; u8 count; u64 field; /* Start by moving one field that ends on a to int boundary. */ count = (MAX_BIG_FIELD_BITS - ((to_offset + MAX_BIG_FIELD_BITS) % BITS_PER_TYPE(u32))); field = get_big_field(from, from_offset, count); set_big_field(field, to, to_offset, count); from_offset += count; to_offset += count; size -= count; /* Now do the main loop to copy 32 bit chunks that are int-aligned at the destination. */ offset = from_offset % BITS_PER_TYPE(u32); source = from + (from_offset - offset) / BITS_PER_BYTE; destination = to + to_offset / BITS_PER_BYTE; while (size > MAX_BIG_FIELD_BITS) { put_unaligned_le32(get_unaligned_le64(source) >> offset, destination); source += sizeof(u32); destination += sizeof(u32); from_offset += BITS_PER_TYPE(u32); to_offset += BITS_PER_TYPE(u32); size -= BITS_PER_TYPE(u32); } /* Finish up by moving any remaining bits. */ if (size > 0) { field = get_big_field(from, from_offset, size); set_big_field(field, to, to_offset, size); } } /* * Move several bits from a lower to a higher address, moving the higher addressed bits first. 
The * size and memory offsets are measured in bits. */ static void move_bits_up(const u8 *from, u64 from_offset, u8 *to, u64 to_offset, u32 size) { const u8 *source; u8 *destination; u8 offset; u8 count; u64 field; /* Start by moving one field that begins on a destination int boundary. */ count = (to_offset + size) % BITS_PER_TYPE(u32); if (count > 0) { size -= count; field = get_big_field(from, from_offset + size, count); set_big_field(field, to, to_offset + size, count); } /* Now do the main loop to copy 32 bit chunks that are int-aligned at the destination. */ offset = (from_offset + size) % BITS_PER_TYPE(u32); source = from + (from_offset + size - offset) / BITS_PER_BYTE; destination = to + (to_offset + size) / BITS_PER_BYTE; while (size > MAX_BIG_FIELD_BITS) { source -= sizeof(u32); destination -= sizeof(u32); size -= BITS_PER_TYPE(u32); put_unaligned_le32(get_unaligned_le64(source) >> offset, destination); } /* Finish up by moving any remaining bits. */ if (size > 0) { field = get_big_field(from, from_offset, size); set_big_field(field, to, to_offset, size); } } /* * Move bits from one field to another. When the fields overlap, behave as if we first move all the * bits from the source to a temporary value, and then move all the bits from the temporary value * to the destination. The size and memory offsets are measured in bits. */ static void move_bits(const u8 *from, u64 from_offset, u8 *to, u64 to_offset, u32 size) { u64 field; /* A small move doesn't require special handling. */ if (size <= MAX_BIG_FIELD_BITS) { if (size > 0) { field = get_big_field(from, from_offset, size); set_big_field(field, to, to_offset, size); } return; } if (from_offset > to_offset) move_bits_down(from, from_offset, to, to_offset, size); else move_bits_up(from, from_offset, to, to_offset, size); } /* * Pack delta lists from a mutable delta index into an immutable delta index page. A range of delta * lists (starting with a specified list index) is copied from the mutable delta index into a * memory page used in the immutable index. The number of lists copied onto the page is returned in * list_count. */ int uds_pack_delta_index_page(const struct delta_index *delta_index, u64 header_nonce, u8 *memory, size_t memory_size, u64 virtual_chapter_number, u32 first_list, u32 *list_count) { const struct delta_zone *delta_zone; struct delta_list *delta_lists; u32 max_lists; u32 n_lists = 0; u32 offset; u32 i; int free_bits; int bits; struct delta_page_header *header; delta_zone = &delta_index->delta_zones[0]; delta_lists = &delta_zone->delta_lists[first_list + 1]; max_lists = delta_index->list_count - first_list; /* * Compute how many lists will fit on the page. Subtract the size of the fixed header, one * delta list offset, and the guard bytes from the page size to determine how much space is * available for delta lists. */ free_bits = memory_size * BITS_PER_BYTE; free_bits -= get_immutable_header_offset(1); free_bits -= GUARD_BITS; if (free_bits < IMMUTABLE_HEADER_SIZE) { /* This page is too small to store any delta lists. */ return vdo_log_error_strerror(UDS_OVERFLOW, "Chapter Index Page of %zu bytes is too small", memory_size); } while (n_lists < max_lists) { /* Each list requires a delta list offset and the list data. 
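	 * For example, with a 4096-byte page (a size chosen purely for illustration),
	 * free_bits starts at 32768 - 179 - 56 = 32533, where 179 bits are the 20-byte
	 * header plus one 19-bit offset and 56 bits are the guard bytes; each list then
	 * costs 19 offset bits plus its own size in bits.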
*/ bits = IMMUTABLE_HEADER_SIZE + delta_lists[n_lists].size; if (bits > free_bits) break; n_lists++; free_bits -= bits; } *list_count = n_lists; header = (struct delta_page_header *) memory; put_unaligned_le64(header_nonce, (u8 *) &header->nonce); put_unaligned_le64(virtual_chapter_number, (u8 *) &header->virtual_chapter_number); put_unaligned_le16(first_list, (u8 *) &header->first_list); put_unaligned_le16(n_lists, (u8 *) &header->list_count); /* Construct the delta list offset table. */ offset = get_immutable_header_offset(n_lists + 1); set_immutable_start(memory, 0, offset); for (i = 0; i < n_lists; i++) { offset += delta_lists[i].size; set_immutable_start(memory, i + 1, offset); } /* Copy the delta list data onto the memory page. */ for (i = 0; i < n_lists; i++) { move_bits(delta_zone->memory, delta_lists[i].start, memory, get_immutable_start(memory, i), delta_lists[i].size); } /* Set all the bits in the guard bytes. */ memset(memory + memory_size - POST_FIELD_GUARD_BYTES, ~0, POST_FIELD_GUARD_BYTES); return UDS_SUCCESS; } /* Compute the new offsets of the delta lists. */ static void compute_new_list_offsets(struct delta_zone *delta_zone, u32 growing_index, size_t growing_size, size_t used_space) { size_t spacing; u32 i; struct delta_list *delta_lists = delta_zone->delta_lists; u32 tail_guard_index = delta_zone->list_count + 1; spacing = (delta_zone->size - used_space) / delta_zone->list_count; delta_zone->new_offsets[0] = 0; for (i = 0; i <= delta_zone->list_count; i++) { delta_zone->new_offsets[i + 1] = (delta_zone->new_offsets[i] + get_delta_list_byte_size(&delta_lists[i]) + spacing); delta_zone->new_offsets[i] *= BITS_PER_BYTE; delta_zone->new_offsets[i] += delta_lists[i].start % BITS_PER_BYTE; if (i == 0) delta_zone->new_offsets[i + 1] -= spacing / 2; if (i + 1 == growing_index) delta_zone->new_offsets[i + 1] += growing_size; } delta_zone->new_offsets[tail_guard_index] = (delta_zone->size * BITS_PER_BYTE - delta_lists[tail_guard_index].size); } static void rebalance_lists(struct delta_zone *delta_zone) { struct delta_list *delta_lists; u32 i; size_t used_space = 0; /* Extend and balance memory to receive the delta lists */ delta_lists = delta_zone->delta_lists; for (i = 0; i <= delta_zone->list_count + 1; i++) used_space += get_delta_list_byte_size(&delta_lists[i]); compute_new_list_offsets(delta_zone, 0, 0, used_space); for (i = 1; i <= delta_zone->list_count + 1; i++) delta_lists[i].start = delta_zone->new_offsets[i]; } /* Start restoring a delta index from multiple input streams. */ int uds_start_restoring_delta_index(struct delta_index *delta_index, struct buffered_reader **buffered_readers, unsigned int reader_count) { int result; unsigned int zone_count = reader_count; u64 record_count = 0; u64 collision_count = 0; u32 first_list[MAX_ZONES]; u32 list_count[MAX_ZONES]; unsigned int z; u32 list_next = 0; const struct delta_zone *delta_zone; /* Read and validate each header. 
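	 * As a sketch (matching the decoding just below), each zone's save stream
	 * begins with a 40-byte header: the 8-byte magic "DI-00002", four little-endian
	 * u32 values (zone number, zone count, first list, list count), and two
	 * little-endian u64 values (record count, collision count).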
*/ for (z = 0; z < zone_count; z++) { struct delta_index_header header; u8 buffer[sizeof(struct delta_index_header)]; size_t offset = 0; result = uds_read_from_buffered_reader(buffered_readers[z], buffer, sizeof(buffer)); if (result != UDS_SUCCESS) { return vdo_log_warning_strerror(result, "failed to read delta index header"); } memcpy(&header.magic, buffer, MAGIC_SIZE); offset += MAGIC_SIZE; decode_u32_le(buffer, &offset, &header.zone_number); decode_u32_le(buffer, &offset, &header.zone_count); decode_u32_le(buffer, &offset, &header.first_list); decode_u32_le(buffer, &offset, &header.list_count); decode_u64_le(buffer, &offset, &header.record_count); decode_u64_le(buffer, &offset, &header.collision_count); result = VDO_ASSERT(offset == sizeof(struct delta_index_header), "%zu bytes decoded of %zu expected", offset, sizeof(struct delta_index_header)); if (result != VDO_SUCCESS) { return vdo_log_warning_strerror(result, "failed to read delta index header"); } if (memcmp(header.magic, DELTA_INDEX_MAGIC, MAGIC_SIZE) != 0) { return vdo_log_warning_strerror(UDS_CORRUPT_DATA, "delta index file has bad magic number"); } if (zone_count != header.zone_count) { return vdo_log_warning_strerror(UDS_CORRUPT_DATA, "delta index files contain mismatched zone counts (%u,%u)", zone_count, header.zone_count); } if (header.zone_number != z) { return vdo_log_warning_strerror(UDS_CORRUPT_DATA, "delta index zone %u found in slot %u", header.zone_number, z); } first_list[z] = header.first_list; list_count[z] = header.list_count; record_count += header.record_count; collision_count += header.collision_count; if (first_list[z] != list_next) { return vdo_log_warning_strerror(UDS_CORRUPT_DATA, "delta index file for zone %u starts with list %u instead of list %u", z, first_list[z], list_next); } list_next += list_count[z]; } if (list_next != delta_index->list_count) { return vdo_log_warning_strerror(UDS_CORRUPT_DATA, "delta index files contain %u delta lists instead of %u delta lists", list_next, delta_index->list_count); } if (collision_count > record_count) { return vdo_log_warning_strerror(UDS_CORRUPT_DATA, "delta index files contain %llu collisions and %llu records", (unsigned long long) collision_count, (unsigned long long) record_count); } uds_reset_delta_index(delta_index); delta_index->delta_zones[0].record_count = record_count; delta_index->delta_zones[0].collision_count = collision_count; /* Read the delta lists and distribute them to the proper zones. */ for (z = 0; z < zone_count; z++) { u32 i; delta_index->load_lists[z] = 0; for (i = 0; i < list_count[z]; i++) { u16 delta_list_size; u32 list_number; unsigned int zone_number; u8 size_data[sizeof(u16)]; result = uds_read_from_buffered_reader(buffered_readers[z], size_data, sizeof(size_data)); if (result != UDS_SUCCESS) { return vdo_log_warning_strerror(result, "failed to read delta index size"); } delta_list_size = get_unaligned_le16(size_data); if (delta_list_size > 0) delta_index->load_lists[z] += 1; list_number = first_list[z] + i; zone_number = list_number / delta_index->lists_per_zone; delta_zone = &delta_index->delta_zones[zone_number]; list_number -= delta_zone->first_list; delta_zone->delta_lists[list_number + 1].size = delta_list_size; } } /* Prepare each zone to start receiving the delta list data. 
*/ for (z = 0; z < delta_index->zone_count; z++) rebalance_lists(&delta_index->delta_zones[z]); return UDS_SUCCESS; } static int restore_delta_list_to_zone(struct delta_zone *delta_zone, const struct delta_list_save_info *save_info, const u8 *data) { struct delta_list *delta_list; u16 bit_count; u16 byte_count; u32 list_number = save_info->index - delta_zone->first_list; if (list_number >= delta_zone->list_count) { return vdo_log_warning_strerror(UDS_CORRUPT_DATA, "invalid delta list number %u not in range [%u,%u)", save_info->index, delta_zone->first_list, delta_zone->first_list + delta_zone->list_count); } delta_list = &delta_zone->delta_lists[list_number + 1]; if (delta_list->size == 0) { return vdo_log_warning_strerror(UDS_CORRUPT_DATA, "unexpected delta list number %u", save_info->index); } bit_count = delta_list->size + save_info->bit_offset; byte_count = BITS_TO_BYTES(bit_count); if (save_info->byte_count != byte_count) { return vdo_log_warning_strerror(UDS_CORRUPT_DATA, "unexpected delta list size %u != %u", save_info->byte_count, byte_count); } move_bits(data, save_info->bit_offset, delta_zone->memory, delta_list->start, delta_list->size); return UDS_SUCCESS; } static int restore_delta_list_data(struct delta_index *delta_index, unsigned int load_zone, struct buffered_reader *buffered_reader, u8 *data) { int result; struct delta_list_save_info save_info; u8 buffer[sizeof(struct delta_list_save_info)]; unsigned int new_zone; result = uds_read_from_buffered_reader(buffered_reader, buffer, sizeof(buffer)); if (result != UDS_SUCCESS) { return vdo_log_warning_strerror(result, "failed to read delta list data"); } save_info = (struct delta_list_save_info) { .tag = buffer[0], .bit_offset = buffer[1], .byte_count = get_unaligned_le16(&buffer[2]), .index = get_unaligned_le32(&buffer[4]), }; if ((save_info.bit_offset >= BITS_PER_BYTE) || (save_info.byte_count > DELTA_LIST_MAX_BYTE_COUNT)) { return vdo_log_warning_strerror(UDS_CORRUPT_DATA, "corrupt delta list data"); } /* Make sure the data is intended for this delta index. */ if (save_info.tag != delta_index->tag) return UDS_CORRUPT_DATA; if (save_info.index >= delta_index->list_count) { return vdo_log_warning_strerror(UDS_CORRUPT_DATA, "invalid delta list number %u of %u", save_info.index, delta_index->list_count); } result = uds_read_from_buffered_reader(buffered_reader, data, save_info.byte_count); if (result != UDS_SUCCESS) { return vdo_log_warning_strerror(result, "failed to read delta list data"); } delta_index->load_lists[load_zone] -= 1; new_zone = save_info.index / delta_index->lists_per_zone; return restore_delta_list_to_zone(&delta_index->delta_zones[new_zone], &save_info, data); } /* Restore delta lists from saved data. 
*/ int uds_finish_restoring_delta_index(struct delta_index *delta_index, struct buffered_reader **buffered_readers, unsigned int reader_count) { int result; int saved_result = UDS_SUCCESS; unsigned int z; u8 *data; result = vdo_allocate(DELTA_LIST_MAX_BYTE_COUNT, u8, __func__, &data); if (result != VDO_SUCCESS) return result; for (z = 0; z < reader_count; z++) { while (delta_index->load_lists[z] > 0) { result = restore_delta_list_data(delta_index, z, buffered_readers[z], data); if (result != UDS_SUCCESS) { saved_result = result; break; } } } vdo_free(data); return saved_result; } int uds_check_guard_delta_lists(struct buffered_reader **buffered_readers, unsigned int reader_count) { int result; unsigned int z; u8 buffer[sizeof(struct delta_list_save_info)]; for (z = 0; z < reader_count; z++) { result = uds_read_from_buffered_reader(buffered_readers[z], buffer, sizeof(buffer)); if (result != UDS_SUCCESS) return result; if (buffer[0] != 'z') return UDS_CORRUPT_DATA; } return UDS_SUCCESS; } static int flush_delta_list(struct delta_zone *zone, u32 flush_index) { struct delta_list *delta_list; u8 buffer[sizeof(struct delta_list_save_info)]; int result; delta_list = &zone->delta_lists[flush_index + 1]; buffer[0] = zone->tag; buffer[1] = delta_list->start % BITS_PER_BYTE; put_unaligned_le16(get_delta_list_byte_size(delta_list), &buffer[2]); put_unaligned_le32(zone->first_list + flush_index, &buffer[4]); result = uds_write_to_buffered_writer(zone->buffered_writer, buffer, sizeof(buffer)); if (result != UDS_SUCCESS) { vdo_log_warning_strerror(result, "failed to write delta list memory"); return result; } result = uds_write_to_buffered_writer(zone->buffered_writer, zone->memory + get_delta_list_byte_start(delta_list), get_delta_list_byte_size(delta_list)); if (result != UDS_SUCCESS) vdo_log_warning_strerror(result, "failed to write delta list memory"); return result; } /* Start saving a delta index zone to a buffered output stream. 
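 *
 * Sketch of the overall save sequence for one zone, based on the functions below:
 * uds_start_saving_delta_index() writes the 40-byte zone header followed by one
 * little-endian u16 size per delta list; uds_finish_saving_delta_index() then flushes
 * each non-empty list as a save-info record plus its data; and a caller typically
 * finishes the stream with uds_write_guard_delta_list().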
*/ int uds_start_saving_delta_index(const struct delta_index *delta_index, unsigned int zone_number, struct buffered_writer *buffered_writer) { int result; u32 i; struct delta_zone *delta_zone; u8 buffer[sizeof(struct delta_index_header)]; size_t offset = 0; delta_zone = &delta_index->delta_zones[zone_number]; memcpy(buffer, DELTA_INDEX_MAGIC, MAGIC_SIZE); offset += MAGIC_SIZE; encode_u32_le(buffer, &offset, zone_number); encode_u32_le(buffer, &offset, delta_index->zone_count); encode_u32_le(buffer, &offset, delta_zone->first_list); encode_u32_le(buffer, &offset, delta_zone->list_count); encode_u64_le(buffer, &offset, delta_zone->record_count); encode_u64_le(buffer, &offset, delta_zone->collision_count); result = VDO_ASSERT(offset == sizeof(struct delta_index_header), "%zu bytes encoded of %zu expected", offset, sizeof(struct delta_index_header)); if (result != VDO_SUCCESS) return result; result = uds_write_to_buffered_writer(buffered_writer, buffer, offset); if (result != UDS_SUCCESS) return vdo_log_warning_strerror(result, "failed to write delta index header"); for (i = 0; i < delta_zone->list_count; i++) { u8 data[sizeof(u16)]; struct delta_list *delta_list; delta_list = &delta_zone->delta_lists[i + 1]; put_unaligned_le16(delta_list->size, data); result = uds_write_to_buffered_writer(buffered_writer, data, sizeof(data)); if (result != UDS_SUCCESS) return vdo_log_warning_strerror(result, "failed to write delta list size"); } delta_zone->buffered_writer = buffered_writer; return UDS_SUCCESS; } int uds_finish_saving_delta_index(const struct delta_index *delta_index, unsigned int zone_number) { int result; int first_error = UDS_SUCCESS; u32 i; struct delta_zone *delta_zone; struct delta_list *delta_list; delta_zone = &delta_index->delta_zones[zone_number]; for (i = 0; i < delta_zone->list_count; i++) { delta_list = &delta_zone->delta_lists[i + 1]; if (delta_list->size > 0) { result = flush_delta_list(delta_zone, i); if ((result != UDS_SUCCESS) && (first_error == UDS_SUCCESS)) first_error = result; } } delta_zone->buffered_writer = NULL; return first_error; } int uds_write_guard_delta_list(struct buffered_writer *buffered_writer) { int result; u8 buffer[sizeof(struct delta_list_save_info)]; memset(buffer, 0, sizeof(struct delta_list_save_info)); buffer[0] = 'z'; result = uds_write_to_buffered_writer(buffered_writer, buffer, sizeof(buffer)); if (result != UDS_SUCCESS) vdo_log_warning_strerror(result, "failed to write guard delta list"); return UDS_SUCCESS; } size_t uds_compute_delta_index_save_bytes(u32 list_count, size_t memory_size) { /* One zone will use at least as much memory as other zone counts. */ return (sizeof(struct delta_index_header) + list_count * (sizeof(struct delta_list_save_info) + 1) + get_zone_memory_size(1, memory_size)); } static int assert_not_at_end(const struct delta_index_entry *delta_entry) { int result = VDO_ASSERT(!delta_entry->at_end, "operation is invalid because the list entry is at the end of the delta list"); if (result != VDO_SUCCESS) result = UDS_BAD_STATE; return result; } /* * Prepare to search for an entry in the specified delta list. * * This is always the first function to be called when dealing with delta index entries. It is * always followed by calls to uds_next_delta_index_entry() to iterate through a delta list. The * fields of the delta_index_entry argument will be set up for iteration, but will not contain an * entry from the list. 
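 *
 * As an illustrative sketch (mirroring uds_get_delta_index_entry() later in this
 * file), a typical search loop looks like:
 *
 *   struct delta_index_entry entry;
 *   int result = uds_start_delta_index_search(delta_index, list_number, key, &entry);
 *
 *   while (result == UDS_SUCCESS) {
 *           result = uds_next_delta_index_entry(&entry);
 *           if ((result != UDS_SUCCESS) || entry.at_end || (entry.key >= key))
 *                   break;
 *   }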
*/ int uds_start_delta_index_search(const struct delta_index *delta_index, u32 list_number, u32 key, struct delta_index_entry *delta_entry) { int result; unsigned int zone_number; struct delta_zone *delta_zone; struct delta_list *delta_list; result = VDO_ASSERT((list_number < delta_index->list_count), "Delta list number (%u) is out of range (%u)", list_number, delta_index->list_count); if (result != VDO_SUCCESS) return UDS_CORRUPT_DATA; zone_number = list_number / delta_index->lists_per_zone; delta_zone = &delta_index->delta_zones[zone_number]; list_number -= delta_zone->first_list; result = VDO_ASSERT((list_number < delta_zone->list_count), "Delta list number (%u) is out of range (%u) for zone (%u)", list_number, delta_zone->list_count, zone_number); if (result != VDO_SUCCESS) return UDS_CORRUPT_DATA; if (delta_index->mutable) { delta_list = &delta_zone->delta_lists[list_number + 1]; } else { u32 end_offset; /* * Translate the immutable delta list header into a temporary * full delta list header. */ delta_list = &delta_entry->temp_delta_list; delta_list->start = get_immutable_start(delta_zone->memory, list_number); end_offset = get_immutable_start(delta_zone->memory, list_number + 1); delta_list->size = end_offset - delta_list->start; delta_list->save_key = 0; delta_list->save_offset = 0; } if (key > delta_list->save_key) { delta_entry->key = delta_list->save_key; delta_entry->offset = delta_list->save_offset; } else { delta_entry->key = 0; delta_entry->offset = 0; if (key == 0) { /* * This usually means we're about to walk the entire delta list, so get all * of it into the CPU cache. */ uds_prefetch_range(&delta_zone->memory[delta_list->start / BITS_PER_BYTE], delta_list->size / BITS_PER_BYTE, false); } } delta_entry->at_end = false; delta_entry->delta_zone = delta_zone; delta_entry->delta_list = delta_list; delta_entry->entry_bits = 0; delta_entry->is_collision = false; delta_entry->list_number = list_number; delta_entry->list_overflow = false; delta_entry->value_bits = delta_zone->value_bits; return UDS_SUCCESS; } static inline u64 get_delta_entry_offset(const struct delta_index_entry *delta_entry) { return delta_entry->delta_list->start + delta_entry->offset; } /* * Decode a delta index entry delta value. The delta_index_entry basically describes the previous * list entry, and has had its offset field changed to point to the subsequent entry. We decode the * bit stream and update the delta_list_entry to describe the entry. */ static inline void decode_delta(struct delta_index_entry *delta_entry) { int key_bits; u32 delta; const struct delta_zone *delta_zone = delta_entry->delta_zone; const u8 *memory = delta_zone->memory; u64 delta_offset = get_delta_entry_offset(delta_entry) + delta_entry->value_bits; const u8 *addr = memory + delta_offset / BITS_PER_BYTE; int offset = delta_offset % BITS_PER_BYTE; u32 data = get_unaligned_le32(addr) >> offset; addr += sizeof(u32); key_bits = delta_zone->min_bits; delta = data & ((1 << key_bits) - 1); if (delta >= delta_zone->min_keys) { data >>= key_bits; if (data == 0) { key_bits = sizeof(u32) * BITS_PER_BYTE - offset; while ((data = get_unaligned_le32(addr)) == 0) { addr += sizeof(u32); key_bits += sizeof(u32) * BITS_PER_BYTE; } } key_bits += ffs(data); delta += ((key_bits - delta_zone->min_bits - 1) * delta_zone->incr_keys); } delta_entry->delta = delta; delta_entry->key += delta; /* Check for a collision, a delta of zero after the start. 
*/ if (unlikely((delta == 0) && (delta_entry->offset > 0))) { delta_entry->is_collision = true; delta_entry->entry_bits = delta_entry->value_bits + key_bits + COLLISION_BITS; } else { delta_entry->is_collision = false; delta_entry->entry_bits = delta_entry->value_bits + key_bits; } } noinline int uds_next_delta_index_entry(struct delta_index_entry *delta_entry) { int result; const struct delta_list *delta_list; u32 next_offset; u16 size; result = assert_not_at_end(delta_entry); if (result != UDS_SUCCESS) return result; delta_list = delta_entry->delta_list; delta_entry->offset += delta_entry->entry_bits; size = delta_list->size; if (unlikely(delta_entry->offset >= size)) { delta_entry->at_end = true; delta_entry->delta = 0; delta_entry->is_collision = false; result = VDO_ASSERT((delta_entry->offset == size), "next offset past end of delta list"); if (result != VDO_SUCCESS) result = UDS_CORRUPT_DATA; return result; } decode_delta(delta_entry); next_offset = delta_entry->offset + delta_entry->entry_bits; if (next_offset > size) { /* * This is not an assertion because uds_validate_chapter_index_page() wants to * handle this error. */ vdo_log_warning("Decoded past the end of the delta list"); return UDS_CORRUPT_DATA; } return UDS_SUCCESS; } int uds_remember_delta_index_offset(const struct delta_index_entry *delta_entry) { int result; struct delta_list *delta_list = delta_entry->delta_list; result = VDO_ASSERT(!delta_entry->is_collision, "entry is not a collision"); if (result != VDO_SUCCESS) return result; delta_list->save_key = delta_entry->key - delta_entry->delta; delta_list->save_offset = delta_entry->offset; return UDS_SUCCESS; } static void set_delta(struct delta_index_entry *delta_entry, u32 delta) { const struct delta_zone *delta_zone = delta_entry->delta_zone; u32 key_bits = (delta_zone->min_bits + ((delta_zone->incr_keys - delta_zone->min_keys + delta) / delta_zone->incr_keys)); delta_entry->delta = delta; delta_entry->entry_bits = delta_entry->value_bits + key_bits; } static void get_collision_name(const struct delta_index_entry *entry, u8 *name) { u64 offset = get_delta_entry_offset(entry) + entry->entry_bits - COLLISION_BITS; const u8 *addr = entry->delta_zone->memory + offset / BITS_PER_BYTE; int size = COLLISION_BYTES; int shift = offset % BITS_PER_BYTE; while (--size >= 0) *name++ = get_unaligned_le16(addr++) >> shift; } static void set_collision_name(const struct delta_index_entry *entry, const u8 *name) { u64 offset = get_delta_entry_offset(entry) + entry->entry_bits - COLLISION_BITS; u8 *addr = entry->delta_zone->memory + offset / BITS_PER_BYTE; int size = COLLISION_BYTES; int shift = offset % BITS_PER_BYTE; u16 mask = ~((u16) 0xFF << shift); u16 data; while (--size >= 0) { data = (get_unaligned_le16(addr) & mask) | (*name++ << shift); put_unaligned_le16(data, addr++); } } int uds_get_delta_index_entry(const struct delta_index *delta_index, u32 list_number, u32 key, const u8 *name, struct delta_index_entry *delta_entry) { int result; result = uds_start_delta_index_search(delta_index, list_number, key, delta_entry); if (result != UDS_SUCCESS) return result; do { result = uds_next_delta_index_entry(delta_entry); if (result != UDS_SUCCESS) return result; } while (!delta_entry->at_end && (key > delta_entry->key)); result = uds_remember_delta_index_offset(delta_entry); if (result != UDS_SUCCESS) return result; if (!delta_entry->at_end && (key == delta_entry->key)) { struct delta_index_entry collision_entry = *delta_entry; for (;;) { u8 full_name[COLLISION_BYTES]; result = 
uds_next_delta_index_entry(&collision_entry); if (result != UDS_SUCCESS) return result; if (collision_entry.at_end || !collision_entry.is_collision) break; get_collision_name(&collision_entry, full_name); if (memcmp(full_name, name, COLLISION_BYTES) == 0) { *delta_entry = collision_entry; break; } } } return UDS_SUCCESS; } int uds_get_delta_entry_collision(const struct delta_index_entry *delta_entry, u8 *name) { int result; result = assert_not_at_end(delta_entry); if (result != UDS_SUCCESS) return result; result = VDO_ASSERT(delta_entry->is_collision, "Cannot get full block name from a non-collision delta index entry"); if (result != VDO_SUCCESS) return UDS_BAD_STATE; get_collision_name(delta_entry, name); return UDS_SUCCESS; } u32 uds_get_delta_entry_value(const struct delta_index_entry *delta_entry) { return get_field(delta_entry->delta_zone->memory, get_delta_entry_offset(delta_entry), delta_entry->value_bits); } static int assert_mutable_entry(const struct delta_index_entry *delta_entry) { int result = VDO_ASSERT((delta_entry->delta_list != &delta_entry->temp_delta_list), "delta index is mutable"); if (result != VDO_SUCCESS) result = UDS_BAD_STATE; return result; } int uds_set_delta_entry_value(const struct delta_index_entry *delta_entry, u32 value) { int result; u32 value_mask = (1 << delta_entry->value_bits) - 1; result = assert_mutable_entry(delta_entry); if (result != UDS_SUCCESS) return result; result = assert_not_at_end(delta_entry); if (result != UDS_SUCCESS) return result; result = VDO_ASSERT((value & value_mask) == value, "Value (%u) being set in a delta index is too large (must fit in %u bits)", value, delta_entry->value_bits); if (result != VDO_SUCCESS) return UDS_INVALID_ARGUMENT; set_field(value, delta_entry->delta_zone->memory, get_delta_entry_offset(delta_entry), delta_entry->value_bits); return UDS_SUCCESS; } /* * Extend the memory used by the delta lists by adding growing_size bytes before the list indicated * by growing_index, then rebalancing the lists in the new chunk. */ static int extend_delta_zone(struct delta_zone *delta_zone, u32 growing_index, size_t growing_size) { ktime_t start_time; ktime_t end_time; struct delta_list *delta_lists; u32 i; size_t used_space; /* Calculate the amount of space that is or will be in use. */ start_time = current_time_ns(CLOCK_MONOTONIC); delta_lists = delta_zone->delta_lists; used_space = growing_size; for (i = 0; i <= delta_zone->list_count + 1; i++) used_space += get_delta_list_byte_size(&delta_lists[i]); if (delta_zone->size < used_space) return UDS_OVERFLOW; /* Compute the new offsets of the delta lists. */ compute_new_list_offsets(delta_zone, growing_index, growing_size, used_space); /* * When we rebalance the delta list, we will include the end guard list in the rebalancing. * It contains the end guard data, which must be copied. */ rebalance_delta_zone(delta_zone, 1, delta_zone->list_count + 1); end_time = current_time_ns(CLOCK_MONOTONIC); delta_zone->rebalance_count++; delta_zone->rebalance_time += ktime_sub(end_time, start_time); return UDS_SUCCESS; } static int insert_bits(struct delta_index_entry *delta_entry, u16 size) { u64 free_before; u64 free_after; u64 source; u64 destination; u32 count; bool before_flag; u8 *memory; struct delta_zone *delta_zone = delta_entry->delta_zone; struct delta_list *delta_list = delta_entry->delta_list; /* Compute bits in use before and after the inserted bits. 
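 *
 * (before_size is the bit offset of the insertion point within this delta
 * list, and after_size is the rest of the list that would have to shift to
 * make room; the smaller of the two determines the cheaper direction to
 * move data, subject to the free-space checks below.)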
*/ u32 total_size = delta_list->size; u32 before_size = delta_entry->offset; u32 after_size = total_size - delta_entry->offset; if (total_size + size > U16_MAX) { delta_entry->list_overflow = true; delta_zone->overflow_count++; return UDS_OVERFLOW; } /* Compute bits available before and after the delta list. */ free_before = (delta_list[0].start - (delta_list[-1].start + delta_list[-1].size)); free_after = (delta_list[1].start - (delta_list[0].start + delta_list[0].size)); if ((size <= free_before) && (size <= free_after)) { /* * We have enough space to use either before or after the list. Select the smaller * amount of data. If it is exactly the same, try to take from the larger amount of * free space. */ if (before_size < after_size) before_flag = true; else if (after_size < before_size) before_flag = false; else before_flag = free_before > free_after; } else if (size <= free_before) { /* There is space before but not after. */ before_flag = true; } else if (size <= free_after) { /* There is space after but not before. */ before_flag = false; } else { /* * Neither of the surrounding spaces is large enough for this request. Extend * and/or rebalance the delta list memory choosing to move the least amount of * data. */ int result; u32 growing_index = delta_entry->list_number + 1; before_flag = before_size < after_size; if (!before_flag) growing_index++; result = extend_delta_zone(delta_zone, growing_index, BITS_TO_BYTES(size)); if (result != UDS_SUCCESS) return result; } delta_list->size += size; if (before_flag) { source = delta_list->start; destination = source - size; delta_list->start -= size; count = before_size; } else { source = delta_list->start + delta_entry->offset; destination = source + size; count = after_size; } memory = delta_zone->memory; move_bits(memory, source, memory, destination, count); return UDS_SUCCESS; } static void encode_delta(const struct delta_index_entry *delta_entry) { u32 temp; u32 t1; u32 t2; u64 offset; const struct delta_zone *delta_zone = delta_entry->delta_zone; u8 *memory = delta_zone->memory; offset = get_delta_entry_offset(delta_entry) + delta_entry->value_bits; if (delta_entry->delta < delta_zone->min_keys) { set_field(delta_entry->delta, memory, offset, delta_zone->min_bits); return; } temp = delta_entry->delta - delta_zone->min_keys; t1 = (temp % delta_zone->incr_keys) + delta_zone->min_keys; t2 = temp / delta_zone->incr_keys; set_field(t1, memory, offset, delta_zone->min_bits); set_zero(memory, offset + delta_zone->min_bits, t2); set_field(1, memory, offset + delta_zone->min_bits + t2, 1); } static void encode_entry(const struct delta_index_entry *delta_entry, u32 value, const u8 *name) { u8 *memory = delta_entry->delta_zone->memory; u64 offset = get_delta_entry_offset(delta_entry); set_field(value, memory, offset, delta_entry->value_bits); encode_delta(delta_entry); if (name != NULL) set_collision_name(delta_entry, name); } /* * Create a new entry in the delta index. If the entry is a collision, the full 256 bit name must * be provided. */ int uds_put_delta_index_entry(struct delta_index_entry *delta_entry, u32 key, u32 value, const u8 *name) { int result; struct delta_zone *delta_zone; result = assert_mutable_entry(delta_entry); if (result != UDS_SUCCESS) return result; if (delta_entry->is_collision) { /* * The caller wants us to insert a collision entry onto a collision entry. This * happens when we find a collision and attempt to add the name again to the index. 
* This is normally a fatal error unless we are replaying a closed chapter while we * are rebuilding a volume index. */ return UDS_DUPLICATE_NAME; } if (delta_entry->offset < delta_entry->delta_list->save_offset) { /* * The saved entry offset is after the new entry and will no longer be valid, so * replace it with the insertion point. */ result = uds_remember_delta_index_offset(delta_entry); if (result != UDS_SUCCESS) return result; } if (name != NULL) { /* Insert a collision entry which is placed after this entry. */ result = assert_not_at_end(delta_entry); if (result != UDS_SUCCESS) return result; result = VDO_ASSERT((key == delta_entry->key), "incorrect key for collision entry"); if (result != VDO_SUCCESS) return result; delta_entry->offset += delta_entry->entry_bits; set_delta(delta_entry, 0); delta_entry->is_collision = true; delta_entry->entry_bits += COLLISION_BITS; result = insert_bits(delta_entry, delta_entry->entry_bits); } else if (delta_entry->at_end) { /* Insert a new entry at the end of the delta list. */ result = VDO_ASSERT((key >= delta_entry->key), "key past end of list"); if (result != VDO_SUCCESS) return result; set_delta(delta_entry, key - delta_entry->key); delta_entry->key = key; delta_entry->at_end = false; result = insert_bits(delta_entry, delta_entry->entry_bits); } else { u16 old_entry_size; u16 additional_size; struct delta_index_entry next_entry; u32 next_value; /* * Insert a new entry which requires the delta in the following entry to be * updated. */ result = VDO_ASSERT((key < delta_entry->key), "key precedes following entry"); if (result != VDO_SUCCESS) return result; result = VDO_ASSERT((key >= delta_entry->key - delta_entry->delta), "key effects following entry's delta"); if (result != VDO_SUCCESS) return result; old_entry_size = delta_entry->entry_bits; next_entry = *delta_entry; next_value = uds_get_delta_entry_value(&next_entry); set_delta(delta_entry, key - (delta_entry->key - delta_entry->delta)); delta_entry->key = key; set_delta(&next_entry, next_entry.key - key); next_entry.offset += delta_entry->entry_bits; /* The two new entries are always bigger than the single entry being replaced. */ additional_size = (delta_entry->entry_bits + next_entry.entry_bits - old_entry_size); result = insert_bits(delta_entry, additional_size); if (result != UDS_SUCCESS) return result; encode_entry(&next_entry, next_value, NULL); } if (result != UDS_SUCCESS) return result; encode_entry(delta_entry, value, name); delta_zone = delta_entry->delta_zone; delta_zone->record_count++; delta_zone->collision_count += delta_entry->is_collision ? 1 : 0; return UDS_SUCCESS; } static void delete_bits(const struct delta_index_entry *delta_entry, int size) { u64 source; u64 destination; u32 count; bool before_flag; struct delta_list *delta_list = delta_entry->delta_list; u8 *memory = delta_entry->delta_zone->memory; /* Compute bits retained before and after the deleted bits. */ u32 total_size = delta_list->size; u32 before_size = delta_entry->offset; u32 after_size = total_size - delta_entry->offset - size; /* * Determine whether to add to the available space either before or after the delta list. * We prefer to move the least amount of data. If it is exactly the same, try to add to the * smaller amount of free space. 
*/ if (before_size < after_size) { before_flag = true; } else if (after_size < before_size) { before_flag = false; } else { u64 free_before = (delta_list[0].start - (delta_list[-1].start + delta_list[-1].size)); u64 free_after = (delta_list[1].start - (delta_list[0].start + delta_list[0].size)); before_flag = (free_before < free_after); } delta_list->size -= size; if (before_flag) { source = delta_list->start; destination = source + size; delta_list->start += size; count = before_size; } else { destination = delta_list->start + delta_entry->offset; source = destination + size; count = after_size; } move_bits(memory, source, memory, destination, count); } int uds_remove_delta_index_entry(struct delta_index_entry *delta_entry) { int result; struct delta_index_entry next_entry; struct delta_zone *delta_zone; struct delta_list *delta_list; result = assert_mutable_entry(delta_entry); if (result != UDS_SUCCESS) return result; next_entry = *delta_entry; result = uds_next_delta_index_entry(&next_entry); if (result != UDS_SUCCESS) return result; delta_zone = delta_entry->delta_zone; if (delta_entry->is_collision) { /* This is a collision entry, so just remove it. */ delete_bits(delta_entry, delta_entry->entry_bits); next_entry.offset = delta_entry->offset; delta_zone->collision_count -= 1; } else if (next_entry.at_end) { /* This entry is at the end of the list, so just remove it. */ delete_bits(delta_entry, delta_entry->entry_bits); next_entry.key -= delta_entry->delta; next_entry.offset = delta_entry->offset; } else { /* The delta in the next entry needs to be updated. */ u32 next_value = uds_get_delta_entry_value(&next_entry); u16 old_size = delta_entry->entry_bits + next_entry.entry_bits; if (next_entry.is_collision) { next_entry.is_collision = false; delta_zone->collision_count -= 1; } set_delta(&next_entry, delta_entry->delta + next_entry.delta); next_entry.offset = delta_entry->offset; /* The one new entry is always smaller than the two entries being replaced. */ delete_bits(delta_entry, old_size - next_entry.entry_bits); encode_entry(&next_entry, next_value, NULL); } delta_zone->record_count--; delta_zone->discard_count++; *delta_entry = next_entry; delta_list = delta_entry->delta_list; if (delta_entry->offset < delta_list->save_offset) { /* The saved entry offset is no longer valid. */ delta_list->save_key = 0; delta_list->save_offset = 0; } return UDS_SUCCESS; } void uds_get_delta_index_stats(const struct delta_index *delta_index, struct delta_index_stats *stats) { unsigned int z; const struct delta_zone *delta_zone; memset(stats, 0, sizeof(struct delta_index_stats)); for (z = 0; z < delta_index->zone_count; z++) { delta_zone = &delta_index->delta_zones[z]; stats->rebalance_time += delta_zone->rebalance_time; stats->rebalance_count += delta_zone->rebalance_count; stats->record_count += delta_zone->record_count; stats->collision_count += delta_zone->collision_count; stats->discard_count += delta_zone->discard_count; stats->overflow_count += delta_zone->overflow_count; stats->list_count += delta_zone->list_count; } } size_t uds_compute_delta_index_size(u32 entry_count, u32 mean_delta, u32 payload_bits) { u16 min_bits; u32 incr_keys; u32 min_keys; compute_coding_constants(mean_delta, &min_bits, &min_keys, &incr_keys); /* On average, each delta is encoded into about min_bits + 1.5 bits. 
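 *
 * (For illustration, with hypothetical values payload_bits = 8 and
 * min_bits = 10, each entry averages about 8 + 10 + 1.5 = 19.5 bits, so
 * 1000 entries need roughly 1000 * (8 + 10 + 1) + 1000 / 2 = 19500 bits,
 * which is what the expression below computes.)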
*/ return entry_count * (payload_bits + min_bits + 1) + entry_count / 2; } u32 uds_get_delta_index_page_count(u32 entry_count, u32 list_count, u32 mean_delta, u32 payload_bits, size_t bytes_per_page) { unsigned int bits_per_delta_list; unsigned int bits_per_page; size_t bits_per_index; /* Compute the expected number of bits needed for all the entries. */ bits_per_index = uds_compute_delta_index_size(entry_count, mean_delta, payload_bits); bits_per_delta_list = bits_per_index / list_count; /* Add in the immutable delta list headers. */ bits_per_index += list_count * IMMUTABLE_HEADER_SIZE; /* Compute the number of usable bits on an immutable index page. */ bits_per_page = ((bytes_per_page - sizeof(struct delta_page_header)) * BITS_PER_BYTE); /* * Reduce the bits per page by one immutable delta list header and one delta list to * account for internal fragmentation. */ bits_per_page -= IMMUTABLE_HEADER_SIZE + bits_per_delta_list; /* Now compute the number of pages needed. */ return DIV_ROUND_UP(bits_per_index, bits_per_page); } void uds_log_delta_index_entry(struct delta_index_entry *delta_entry) { vdo_log_ratelimit(vdo_log_info, "List 0x%X Key 0x%X Offset 0x%X%s%s List_size 0x%X%s", delta_entry->list_number, delta_entry->key, delta_entry->offset, delta_entry->at_end ? " end" : "", delta_entry->is_collision ? " collision" : "", delta_entry->delta_list->size, delta_entry->list_overflow ? " overflow" : ""); delta_entry->list_overflow = false; } vdo-8.3.1.1/utils/uds/delta-index.h000066400000000000000000000227211476467262700170050ustar00rootroot00000000000000/* SPDX-License-Identifier: GPL-2.0-only */ /* * Copyright 2023 Red Hat */ #ifndef UDS_DELTA_INDEX_H #define UDS_DELTA_INDEX_H #include #include "numeric.h" #include "time-utils.h" #include "config.h" #include "io-factory.h" /* * A delta index is a key-value store, where each entry maps an address (the key) to a payload (the * value). The entries are sorted by address, and only the delta between successive addresses is * stored in the entry. The addresses are assumed to be uniformly distributed, and the deltas are * therefore exponentially distributed. * * A delta_index can either be mutable or immutable depending on its expected use. The immutable * form of a delta index is used for the indexes of closed chapters committed to the volume. The * mutable form of a delta index is used by the volume index, and also by the chapter index in an * open chapter. Like the index as a whole, each mutable delta index is divided into a number of * independent zones. 
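 *
 * A minimal lookup sketch using the interface declared below (the index,
 * list_number, key, and name variables are assumed to come from the caller):
 *
 *	struct delta_index_entry entry;
 *	int result = uds_get_delta_index_entry(index, list_number, key, name, &entry);
 *
 *	if (result != UDS_SUCCESS)
 *		return result;
 *	if (!entry.at_end && (entry.key == key))
 *		value = uds_get_delta_entry_value(&entry);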
*/ struct delta_list { /* The offset of the delta list start, in bits */ u64 start; /* The number of bits in the delta list */ u16 size; /* Where the last search "found" the key, in bits */ u16 save_offset; /* The key for the record just before save_offset */ u32 save_key; }; struct delta_zone { /* The delta list memory */ u8 *memory; /* The delta list headers */ struct delta_list *delta_lists; /* Temporary starts of delta lists */ u64 *new_offsets; /* Buffered writer for saving an index */ struct buffered_writer *buffered_writer; /* The size of delta list memory */ size_t size; /* Nanoseconds spent rebalancing */ ktime_t rebalance_time; /* Number of memory rebalances */ u32 rebalance_count; /* The number of bits in a stored value */ u8 value_bits; /* The number of bits in the minimal key code */ u16 min_bits; /* The number of keys used in a minimal code */ u32 min_keys; /* The number of keys used for another code bit */ u32 incr_keys; /* The number of records in the index */ u64 record_count; /* The number of collision records */ u64 collision_count; /* The number of records removed */ u64 discard_count; /* The number of UDS_OVERFLOW errors detected */ u64 overflow_count; /* The index of the first delta list */ u32 first_list; /* The number of delta lists */ u32 list_count; /* Tag belonging to this delta index */ u8 tag; } __aligned(L1_CACHE_BYTES); struct delta_list_save_info { /* Tag identifying which delta index this list is in */ u8 tag; /* Bit offset of the start of the list data */ u8 bit_offset; /* Number of bytes of list data */ u16 byte_count; /* The delta list number within the delta index */ u32 index; } __packed; struct delta_index { /* The zones */ struct delta_zone *delta_zones; /* The number of zones */ unsigned int zone_count; /* The number of delta lists */ u32 list_count; /* Maximum lists per zone */ u32 lists_per_zone; /* Total memory allocated to this index */ size_t memory_size; /* The number of non-empty lists at load time per zone */ u32 load_lists[MAX_ZONES]; /* True if this index is mutable */ bool mutable; /* Tag belonging to this delta index */ u8 tag; }; /* * A delta_index_page describes a single page of a chapter index. The delta_index field allows the * page to be treated as an immutable delta_index. We use the delta_zone field to treat the chapter * index page as a single zone index, and without the need to do an additional memory allocation. */ struct delta_index_page { struct delta_index delta_index; /* These values are loaded from the delta_page_header */ u32 lowest_list_number; u32 highest_list_number; u64 virtual_chapter_number; /* This structure describes the single zone of a delta index page. */ struct delta_zone delta_zone; }; /* * Notes on the delta_index_entries: * * The fields documented as "public" can be read by any code that uses a delta_index. The fields * documented as "private" carry information between delta_index method calls and should not be * used outside the delta_index module. * * (1) The delta_index_entry is used like an iterator when searching a delta list. * * (2) It is also the result of a successful search and can be used to refer to the element found * by the search. * * (3) It is also the result of an unsuccessful search and can be used to refer to the insertion * point for a new record. * * (4) If at_end is true, the delta_list entry can only be used as the insertion point for a new * record at the end of the list. 
* * (5) If at_end is false and is_collision is true, the delta_list entry fields refer to a * collision entry in the list, and the delta_list entry can be used as a reference to this * entry. * * (6) If at_end is false and is_collision is false, the delta_list entry fields refer to a * non-collision entry in the list. Such delta_list entries can be used as a reference to a * found entry, or an insertion point for a non-collision entry before this entry, or an * insertion point for a collision entry that collides with this entry. */ struct delta_index_entry { /* Public fields */ /* The key for this entry */ u32 key; /* We are after the last list entry */ bool at_end; /* This record is a collision */ bool is_collision; /* Private fields */ /* This delta list overflowed */ bool list_overflow; /* The number of bits used for the value */ u8 value_bits; /* The number of bits used for the entire entry */ u16 entry_bits; /* The delta index zone */ struct delta_zone *delta_zone; /* The delta list containing the entry */ struct delta_list *delta_list; /* The delta list number */ u32 list_number; /* Bit offset of this entry within the list */ u16 offset; /* The delta between this and previous entry */ u32 delta; /* Temporary delta list for immutable indices */ struct delta_list temp_delta_list; }; struct delta_index_stats { /* Number of bytes allocated */ size_t memory_allocated; /* Nanoseconds spent rebalancing */ ktime_t rebalance_time; /* Number of memory rebalances */ u32 rebalance_count; /* The number of records in the index */ u64 record_count; /* The number of collision records */ u64 collision_count; /* The number of records removed */ u64 discard_count; /* The number of UDS_OVERFLOW errors detected */ u64 overflow_count; /* The number of delta lists */ u32 list_count; }; int __must_check uds_initialize_delta_index(struct delta_index *delta_index, unsigned int zone_count, u32 list_count, u32 mean_delta, u32 payload_bits, size_t memory_size, u8 tag); int __must_check uds_initialize_delta_index_page(struct delta_index_page *delta_index_page, u64 expected_nonce, u32 mean_delta, u32 payload_bits, u8 *memory, size_t memory_size); void uds_uninitialize_delta_index(struct delta_index *delta_index); void uds_reset_delta_index(const struct delta_index *delta_index); int __must_check uds_pack_delta_index_page(const struct delta_index *delta_index, u64 header_nonce, u8 *memory, size_t memory_size, u64 virtual_chapter_number, u32 first_list, u32 *list_count); int __must_check uds_start_restoring_delta_index(struct delta_index *delta_index, struct buffered_reader **buffered_readers, unsigned int reader_count); int __must_check uds_finish_restoring_delta_index(struct delta_index *delta_index, struct buffered_reader **buffered_readers, unsigned int reader_count); int __must_check uds_check_guard_delta_lists(struct buffered_reader **buffered_readers, unsigned int reader_count); int __must_check uds_start_saving_delta_index(const struct delta_index *delta_index, unsigned int zone_number, struct buffered_writer *buffered_writer); int __must_check uds_finish_saving_delta_index(const struct delta_index *delta_index, unsigned int zone_number); int __must_check uds_write_guard_delta_list(struct buffered_writer *buffered_writer); size_t __must_check uds_compute_delta_index_save_bytes(u32 list_count, size_t memory_size); int __must_check uds_start_delta_index_search(const struct delta_index *delta_index, u32 list_number, u32 key, struct delta_index_entry *iterator); int __must_check uds_next_delta_index_entry(struct 
delta_index_entry *delta_entry); int __must_check uds_remember_delta_index_offset(const struct delta_index_entry *delta_entry); int __must_check uds_get_delta_index_entry(const struct delta_index *delta_index, u32 list_number, u32 key, const u8 *name, struct delta_index_entry *delta_entry); int __must_check uds_get_delta_entry_collision(const struct delta_index_entry *delta_entry, u8 *name); u32 __must_check uds_get_delta_entry_value(const struct delta_index_entry *delta_entry); int __must_check uds_set_delta_entry_value(const struct delta_index_entry *delta_entry, u32 value); int __must_check uds_put_delta_index_entry(struct delta_index_entry *delta_entry, u32 key, u32 value, const u8 *name); int __must_check uds_remove_delta_index_entry(struct delta_index_entry *delta_entry); void uds_get_delta_index_stats(const struct delta_index *delta_index, struct delta_index_stats *stats); size_t __must_check uds_compute_delta_index_size(u32 entry_count, u32 mean_delta, u32 payload_bits); u32 uds_get_delta_index_page_count(u32 entry_count, u32 list_count, u32 mean_delta, u32 payload_bits, size_t bytes_per_page); void uds_log_delta_index_entry(struct delta_index_entry *delta_entry); #endif /* UDS_DELTA_INDEX_H */ vdo-8.3.1.1/utils/uds/dm-bufio.c000066400000000000000000000130301476467262700162750ustar00rootroot00000000000000// SPDX-License-Identifier: GPL-2.0-only /* * Copyright 2023 Red Hat */ #include #include #include #include "fileUtils.h" #include "logger.h" #include "memory-alloc.h" #include "thread-utils.h" /* * This fake client does not actually do any type of sophisticated buffering. * Instead, it hands out buffers from a list of saved dm_buffer objects, * creating new ones when necessary. When a buffer is marked dirty, the client * writes its data immediately so that it can return the buffer to circulation * and not have to track unsaved buffers. 
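 *
 * A minimal read-modify-write sketch against this fake client (error handling
 * elided; client and block are assumed to come from the caller):
 *
 *	struct dm_buffer *buffer;
 *	u8 *data = dm_bufio_read(client, block, &buffer);
 *
 *	if (!IS_ERR(data)) {
 *		data[0] ^= 0xff;                       (modify the block)
 *		dm_bufio_mark_buffer_dirty(buffer);    (written out immediately)
 *		dm_bufio_release(buffer);
 *	}
 *	dm_bufio_write_dirty_buffers(client);          (sync the backing file)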
*/ struct dm_buffer; struct dm_bufio_client { int status; struct block_device *bdev; off_t start_offset; size_t bytes_per_page; struct mutex buffer_mutex; struct dm_buffer *buffer_list; }; struct dm_buffer { struct dm_bufio_client *client; struct dm_buffer *next; sector_t offset; u8 *data; }; struct dm_bufio_client * dm_bufio_client_create(struct block_device *bdev, unsigned block_size, unsigned reserved_buffers __always_unused, unsigned aux_size __always_unused, void (*alloc_callback)(struct dm_buffer *) __always_unused, void (*write_callback)(struct dm_buffer *) __always_unused, unsigned int flags __always_unused) { int result; struct dm_bufio_client *client; result = vdo_allocate(1, struct dm_bufio_client, __func__, &client); if (result != VDO_SUCCESS) return ERR_PTR(-ENOMEM); result = uds_init_mutex(&client->buffer_mutex); if (result != UDS_SUCCESS) { dm_bufio_client_destroy(client); return ERR_PTR(-result); } client->bytes_per_page = block_size; client->bdev = bdev; return client; } void dm_bufio_client_destroy(struct dm_bufio_client *client) { struct dm_buffer *buffer; while (client->buffer_list != NULL) { buffer = client->buffer_list; client->buffer_list = buffer->next; vdo_free(buffer->data); vdo_free(buffer); } uds_destroy_mutex(&client->buffer_mutex); vdo_free(client); } void dm_bufio_set_sector_offset(struct dm_bufio_client *client, sector_t start) { client->start_offset = start * SECTOR_SIZE; } void *dm_bufio_new(struct dm_bufio_client *client, sector_t block, struct dm_buffer **buffer_ptr) { int result; struct dm_buffer *buffer = NULL; off_t block_offset = block * client->bytes_per_page; uds_lock_mutex(&client->buffer_mutex); if (client->buffer_list != NULL) { buffer = client->buffer_list; client->buffer_list = buffer->next; } uds_unlock_mutex(&client->buffer_mutex); if (buffer == NULL) { result = vdo_allocate(1, struct dm_buffer, __func__, &buffer); if (result != VDO_SUCCESS) return ERR_PTR(-ENOMEM); result = vdo_allocate(client->bytes_per_page, u8, __func__, &buffer->data); if (result != VDO_SUCCESS) { vdo_free(buffer); return ERR_PTR(-ENOMEM); } buffer->client = client; } buffer->offset = client->start_offset + block_offset; *buffer_ptr = buffer; return buffer->data; } /* This gets a new buffer to read data into. */ void *dm_bufio_read(struct dm_bufio_client *client, sector_t block, struct dm_buffer **buffer_ptr) { int result; size_t read_length = 0; struct dm_buffer *buffer; u8 *data; data = dm_bufio_new(client, block, &buffer); if (IS_ERR(data)) { vdo_log_error_strerror(-PTR_ERR(data), "error reading physical page %lu", block); return data; } result = read_data_at_offset(client->bdev->fd, buffer->offset, buffer->data, client->bytes_per_page, &read_length); if (result != UDS_SUCCESS) { dm_bufio_release(buffer); vdo_log_warning_strerror(result, "error reading physical page %lu", block); return ERR_PTR(-EIO); } if (read_length < client->bytes_per_page) memset(&buffer->data[read_length], 0, client->bytes_per_page - read_length); *buffer_ptr = buffer; return buffer->data; } void dm_bufio_prefetch(struct dm_bufio_client *client __always_unused, sector_t block __always_unused, unsigned block_count __always_unused) { /* Prefetching is meaningless when dealing with files. 
*/ } void dm_bufio_release(struct dm_buffer *buffer) { struct dm_bufio_client *client = buffer->client; uds_lock_mutex(&client->buffer_mutex); buffer->next = client->buffer_list; client->buffer_list = buffer; uds_unlock_mutex(&client->buffer_mutex); } /* * This moves the buffer from its current location to a new one without * changing the buffer contents. dm_bufio_mark_buffer_dirty() is required to * write the buffer contents to the new location. */ void dm_bufio_release_move(struct dm_buffer *buffer, sector_t new_block) { struct dm_bufio_client *client = buffer->client; off_t block_offset = new_block * client->bytes_per_page; buffer->offset = client->start_offset + block_offset; } /* Write the buffer immediately rather than have to track dirty buffers. */ void dm_bufio_mark_buffer_dirty(struct dm_buffer *buffer) { int result; struct dm_bufio_client *client = buffer->client; result = write_buffer_at_offset(client->bdev->fd, buffer->offset, buffer->data, client->bytes_per_page); if (client->status == UDS_SUCCESS) client->status = result; } /* Since we already wrote all the dirty buffers, just sync the file. */ int dm_bufio_write_dirty_buffers(struct dm_bufio_client *client) { if (client->status != UDS_SUCCESS) return -client->status; return -logging_fsync(client->bdev->fd, "cannot sync file contents"); } void *dm_bufio_get_block_data(struct dm_buffer *buffer) { return buffer->data; } vdo-8.3.1.1/utils/uds/errors.c000066400000000000000000000160151476467262700161150ustar00rootroot00000000000000// SPDX-License-Identifier: GPL-2.0-only /* * Copyright 2023 Red Hat */ #include "errors.h" #include #include "logger.h" #include "permassert.h" #include "string-utils.h" static const struct error_info successful = { "UDS_SUCCESS", "Success" }; static const struct error_info error_list[] = { { "UDS_OVERFLOW", "Index overflow" }, { "UDS_INVALID_ARGUMENT", "Invalid argument passed to internal routine" }, { "UDS_BAD_STATE", "UDS data structures are in an invalid state" }, { "UDS_DUPLICATE_NAME", "Attempt to enter the same name into a delta index twice" }, { "UDS_ASSERTION_FAILED", "Assertion failed" }, { "UDS_QUEUED", "Request queued" }, { "UDS_ALREADY_REGISTERED", "Error range already registered" }, { "UDS_OUT_OF_RANGE", "Cannot access data outside specified limits" }, { "UDS_DISABLED", "UDS library context is disabled" }, { "UDS_UNSUPPORTED_VERSION", "Unsupported version" }, { "UDS_CORRUPT_DATA", "Some index structure is corrupt" }, { "UDS_NO_INDEX", "No index found" }, { "UDS_INDEX_NOT_SAVED_CLEANLY", "Index not saved cleanly" }, { "UDS_NO_DIRECTORY", "Expected directory is missing" }, { "UDS_EMODULE_LOAD", "Could not load modules" }, { "UDS_UNKNOWN_ERROR", "Unknown error" }, }; struct error_block { const char *name; int base; int last; int max; const struct error_info *infos; }; #define MAX_ERROR_BLOCKS 6 static struct { int allocated; int count; struct error_block blocks[MAX_ERROR_BLOCKS]; } registered_errors = { .allocated = MAX_ERROR_BLOCKS, .count = 1, .blocks = { { .name = "UDS Error", .base = UDS_ERROR_CODE_BASE, .last = UDS_ERROR_CODE_LAST, .max = UDS_ERROR_CODE_BLOCK_END, .infos = error_list, } }, }; /* Get the error info for an error number. Also returns the name of the error block, if known. 
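 *
 * For example, given the "UDS Error" block registered above, the lookup below
 * lets uds_string_error(UDS_OVERFLOW, buf, sizeof(buf)) produce
 * "UDS Error: Index overflow", while a code outside every registered block
 * falls back to the system strerror_r() text.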
*/ static const char *get_error_info(int errnum, const struct error_info **info_ptr) { struct error_block *block; if (errnum == UDS_SUCCESS) { *info_ptr = &successful; return NULL; } for (block = registered_errors.blocks; block < registered_errors.blocks + registered_errors.count; block++) { if ((errnum >= block->base) && (errnum < block->last)) { *info_ptr = block->infos + (errnum - block->base); return block->name; } else if ((errnum >= block->last) && (errnum < block->max)) { *info_ptr = NULL; return block->name; } } return NULL; } /* Return a string describing a system error message. */ static inline const char *system_string_error(int errnum, char *buf, size_t buflen) { return strerror_r(errnum, buf, buflen); } /* Convert an error code to a descriptive string. */ const char *uds_string_error(int errnum, char *buf, size_t buflen) { char *buffer = buf; char *buf_end = buf + buflen; const struct error_info *info = NULL; const char *block_name; if (buf == NULL) return NULL; if (errnum < 0) errnum = -errnum; block_name = get_error_info(errnum, &info); if (block_name != NULL) { if (info != NULL) { buffer = vdo_append_to_buffer(buffer, buf_end, "%s: %s", block_name, info->message); } else { buffer = vdo_append_to_buffer(buffer, buf_end, "Unknown %s %d", block_name, errnum); } } else if (info != NULL) { buffer = vdo_append_to_buffer(buffer, buf_end, "%s", info->message); } else { const char *tmp = system_string_error(errnum, buffer, buf_end - buffer); if (tmp != buffer) buffer = vdo_append_to_buffer(buffer, buf_end, "%s", tmp); else buffer += strlen(tmp); } return buf; } /* Convert an error code to its name. */ const char *uds_string_error_name(int errnum, char *buf, size_t buflen) { char *buffer = buf; char *buf_end = buf + buflen; const struct error_info *info = NULL; const char *block_name; if (errnum < 0) errnum = -errnum; block_name = get_error_info(errnum, &info); if (block_name != NULL) { if (info != NULL) { buffer = vdo_append_to_buffer(buffer, buf_end, "%s", info->name); } else { buffer = vdo_append_to_buffer(buffer, buf_end, "%s %d", block_name, errnum); } } else if (info != NULL) { buffer = vdo_append_to_buffer(buffer, buf_end, "%s", info->name); } else { const char *tmp; tmp = system_string_error(errnum, buffer, buf_end - buffer); if (tmp != buffer) buffer = vdo_append_to_buffer(buffer, buf_end, "%s", tmp); else buffer += strlen(tmp); } return buf; } /* * Translate an error code into a value acceptable to the kernel. The input error code may be a * system-generated value (such as -EIO), or an internal UDS status code. The result will be a * negative errno value. */ int uds_status_to_errno(int error) { char error_name[VDO_MAX_ERROR_NAME_SIZE]; char error_message[VDO_MAX_ERROR_MESSAGE_SIZE]; /* 0 is success, and negative values are already system error codes. */ if (likely(error <= 0)) return error; if (error < 1024) { /* This is probably an errno from userspace. */ return -error; } /* Internal UDS errors */ switch (error) { case UDS_NO_INDEX: case UDS_CORRUPT_DATA: /* The index doesn't exist or can't be recovered. */ return -ENOENT; case UDS_INDEX_NOT_SAVED_CLEANLY: case UDS_UNSUPPORTED_VERSION: /* * The index exists, but can't be loaded. Tell the client it exists so they don't * destroy it inadvertently. */ return -EEXIST; case UDS_DISABLED: /* The session is unusable; only returned by requests. */ return -EIO; default: /* Translate an unexpected error into something generic. 
*/ vdo_log_info("%s: mapping status code %d (%s: %s) to -EIO", __func__, error, uds_string_error_name(error, error_name, sizeof(error_name)), uds_string_error(error, error_message, sizeof(error_message))); return -EIO; } } /* * Register a block of error codes. * * @block_name: the name of the block of error codes * @first_error: the first error code in the block * @next_free_error: one past the highest possible error in the block * @infos: a pointer to the error info array for the block * @info_size: the size of the error info array */ int uds_register_error_block(const char *block_name, int first_error, int next_free_error, const struct error_info *infos, size_t info_size) { int result; struct error_block *block; struct error_block new_block = { .name = block_name, .base = first_error, .last = first_error + (info_size / sizeof(struct error_info)), .max = next_free_error, .infos = infos, }; result = VDO_ASSERT(first_error < next_free_error, "well-defined error block range"); if (result != VDO_SUCCESS) return result; if (registered_errors.count == registered_errors.allocated) { /* This should never happen. */ return UDS_OVERFLOW; } for (block = registered_errors.blocks; block < registered_errors.blocks + registered_errors.count; block++) { if (strcmp(block_name, block->name) == 0) return UDS_DUPLICATE_NAME; /* Ensure error ranges do not overlap. */ if ((first_error < block->max) && (next_free_error > block->base)) return UDS_ALREADY_REGISTERED; } registered_errors.blocks[registered_errors.count++] = new_block; return UDS_SUCCESS; } vdo-8.3.1.1/utils/uds/errors.h000066400000000000000000000043531476467262700161240ustar00rootroot00000000000000/* SPDX-License-Identifier: GPL-2.0-only */ /* * Copyright 2023 Red Hat */ #ifndef UDS_ERRORS_H #define UDS_ERRORS_H #include #include #include /* Custom error codes and error-related utilities */ #define VDO_SUCCESS 0 /* Valid status codes for internal UDS functions. 
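 *
 * These codes start at UDS_ERROR_CODE_BASE (1024) and the block reserves
 * values up to UDS_ERROR_CODE_BLOCK_END, so they cannot be mistaken for
 * system errno values; uds_status_to_errno() (declared below, defined in
 * errors.c) converts them to negative errno values for callers that expect
 * errnos.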
*/ enum uds_status_codes { /* Successful return */ UDS_SUCCESS = VDO_SUCCESS, /* Used as a base value for reporting internal errors */ UDS_ERROR_CODE_BASE = 1024, /* Index overflow */ UDS_OVERFLOW = UDS_ERROR_CODE_BASE, /* Invalid argument passed to internal routine */ UDS_INVALID_ARGUMENT, /* UDS data structures are in an invalid state */ UDS_BAD_STATE, /* Attempt to enter the same name into an internal structure twice */ UDS_DUPLICATE_NAME, /* An assertion failed */ UDS_ASSERTION_FAILED, /* A request has been queued for later processing (not an error) */ UDS_QUEUED, /* This error range has already been registered */ UDS_ALREADY_REGISTERED, /* Attempt to read or write data outside the valid range */ UDS_OUT_OF_RANGE, /* The index session is disabled */ UDS_DISABLED, /* The index configuration or volume format is no longer supported */ UDS_UNSUPPORTED_VERSION, /* Some index structure is corrupt */ UDS_CORRUPT_DATA, /* No index state found */ UDS_NO_INDEX, /* Attempt to access incomplete index save data */ UDS_INDEX_NOT_SAVED_CLEANLY, /* No directory was found where one was expected */ UDS_NO_DIRECTORY, /* Could not load modules */ UDS_EMODULE_LOAD, /* Unknown error */ UDS_UNKNOWN_ERROR, /* One more than the last UDS_INTERNAL error code */ UDS_ERROR_CODE_LAST, /* One more than the last error this block will ever use */ UDS_ERROR_CODE_BLOCK_END = UDS_ERROR_CODE_BASE + 440, }; enum { VDO_MAX_ERROR_NAME_SIZE = 80, VDO_MAX_ERROR_MESSAGE_SIZE = 128, }; struct error_info { const char *name; const char *message; }; const char * __must_check uds_string_error(int errnum, char *buf, size_t buflen); const char *uds_string_error_name(int errnum, char *buf, size_t buflen); int uds_status_to_errno(int error); int uds_register_error_block(const char *block_name, int first_error, int last_reserved_error, const struct error_info *infos, size_t info_size); #endif /* UDS_ERRORS_H */ vdo-8.3.1.1/utils/uds/event-count.c000066400000000000000000000251131476467262700170470ustar00rootroot00000000000000// SPDX-License-Identifier: GPL-2.0-only /* * Copyright 2023 Red Hat */ /* * This event count implementation uses a posix semaphore for portability, although a futex would * be slightly superior to use and easy to substitute. It is designed to make signalling as cheap * as possible, since that is the code path likely triggered on most updates to a lock-free data * structure. Waiters are likely going to sleep, so optimizing for that case isn't necessary. * * The critical field is the state, which is really two fields that can be atomically updated in * unison: an event counter and a waiter count. Every call to event_count_prepare() issues a wait * token by atomically incrementing the waiter count. The key invariant is a strict accounting of * the number of tokens issued. Every token returned by event_count_prepare() is a contract that * the caller will call uds_acquire_semaphore() and a signaller will call uds_release_semaphore(), * each exactly once. Atomic updates to the state field ensure that each token is counted once and * that tokens are not lost. Cancelling a token attempts to take a fast-path by simply decrementing * the waiters field, but if the token has already been claimed by a signaller, the canceller must * still wait on the semaphore to consume the transferred token. * * The state field is 64 bits, partitioned into a 16-bit waiter field and a 48-bit counter. We are * unlikely to have 2^16 threads, much less 2^16 threads waiting on any single event transition. 
* 2^48 microseconds is several years, so a token holder would have to wait that long for the * counter to wrap around, and then call event_count_wait() at the exact right time to see the * re-used counter, in order to lose a wakeup due to counter wrap-around. Using a 32-bit state * field would greatly increase that chance, but if forced to do so, the implementation could * likely tolerate it since callers are supposed to hold tokens for minuscule periods of time. * Fortunately, x64 has 64-bit compare-and-swap, and the performance of interlocked 64-bit * operations appears to be about the same as for 32-bit ones, so being paranoid and using 64 bits * costs us nothing. * * Here are some sequences of calls and state transitions: * * action postcondition * counter waiters semaphore * initialized 0 0 0 * prepare 0 1 0 * wait (blocks) 0 1 0 * signal 1 0 1 * wait (unblocks) 1 0 0 * * signal (fast-path) 1 0 0 * signal (fast-path) 1 0 0 * * prepare A 1 1 0 * prepare B 1 2 0 * signal 2 0 2 * wait B (fast-path) 2 0 1 * wait A (fast-path) 2 0 0 * * prepare 2 1 0 * cancel (fast-path) 2 0 0 * * prepare 2 1 0 * signal 3 0 1 * cancel (must wait) 3 0 0 * * The event count structure is aligned, sized, and allocated to cache line boundaries to avoid any * false sharing between the event count and other shared state. The state field and semaphore * should fit on a single cache line. The instrumentation counters increase the size of the * structure so it rounds up to use two (64-byte x86) cache lines. */ #include "event-count.h" #include #include #include #include "logger.h" #include "memory-alloc.h" #include "thread-utils.h" /* value used to increment the waiters field */ #define ONE_WAITER 1 /* value used to increment the event counter */ #define ONE_EVENT (1 << 16) /* bit mask to access the waiters field */ #define WAITERS_MASK (ONE_EVENT - 1) /* bit mask to access the event counter */ #define EVENTS_MASK ~WAITERS_MASK struct event_count { /* * Atomically mutable state: * low 16 bits: the number of wait tokens not posted to the semaphore * high 48 bits: current event counter */ atomic64_t state; /* Semaphore used to block threads when waiting is required. */ struct semaphore semaphore; /* Declare alignment so we don't share a cache line. */ } __aligned(L1_CACHE_BYTES); static inline bool same_event(event_token_t token1, event_token_t token2) { return (token1 & EVENTS_MASK) == (token2 & EVENTS_MASK); } /* Wake all threads that are waiting for the next event. */ void event_count_broadcast(struct event_count *count) { u64 waiters; u64 state; u64 old_state; /* Even if there are no waiters (yet), we will need a memory barrier. */ smp_mb(); state = old_state = atomic64_read(&count->state); do { event_token_t new_state; /* * Check if there are any tokens that have not yet been transferred to the * semaphore. This is the fast no-waiters path. */ waiters = (state & WAITERS_MASK); if (waiters == 0) /* * Fast path first time through -- no need to signal or post if there are * no observers. */ return; /* * Attempt to atomically claim all the wait tokens and bump the event count using * an atomic compare-and-swap. This operation contains a memory barrier. */ new_state = ((state & ~WAITERS_MASK) + ONE_EVENT); old_state = state; state = atomic64_cmpxchg(&count->state, old_state, new_state); /* * The cmpxchg fails when we lose a race with a new waiter or another signaller, so * try again. */ } while (unlikely(state != old_state)); /* * Wake the waiters by posting to the semaphore. 
This effectively transfers the wait tokens * to the semaphore. There's sadly no bulk post for posix semaphores, so we've got to loop * to do them all. */ while (waiters-- > 0) uds_release_semaphore(&count->semaphore); } /* * Attempt to cancel a prepared wait token by decrementing the number of waiters in the current * state. This can only be done safely if the event count hasn't been incremented. Returns true if * the wait was successfully cancelled. */ static inline bool fast_cancel(struct event_count *count, event_token_t token) { event_token_t current_token = atomic64_read(&count->state); event_token_t new_token; while (same_event(current_token, token)) { /* * Try to decrement the waiter count via compare-and-swap as if we had never * prepared to wait. */ new_token = atomic64_cmpxchg(&count->state, current_token, current_token - 1); if (new_token == current_token) return true; current_token = new_token; } return false; } /* * Consume a token from the semaphore, waiting (with an optional timeout) if one is not currently * available. Returns true if a token was consumed. */ static bool consume_wait_token(struct event_count *count, const ktime_t *timeout) { /* Try to grab a token without waiting. */ if (uds_attempt_semaphore(&count->semaphore, 0)) return true; if (timeout == NULL) uds_acquire_semaphore(&count->semaphore); else if (!uds_attempt_semaphore(&count->semaphore, *timeout)) return false; return true; } int make_event_count(struct event_count **count_ptr) { /* * The event count will be allocated on a cache line boundary so there will not be false * sharing of the line with any other data structure. */ int result; struct event_count *count = NULL; result = vdo_allocate(1, struct event_count, "event count", &count); if (result != VDO_SUCCESS) return result; atomic64_set(&count->state, 0); result = uds_initialize_semaphore(&count->semaphore, 0); if (result != UDS_SUCCESS) { vdo_free(count); return result; } *count_ptr = count; return UDS_SUCCESS; } /* Free a struct event_count. It must no longer be in use. */ void free_event_count(struct event_count *count) { if (count == NULL) return; uds_destroy_semaphore(&count->semaphore); vdo_free(count); } /* * Prepare to wait for the event count to change by capturing a token of its current state. The * caller MUST eventually either call event_count_wait() or event_count_cancel() exactly once for * each token obtained. */ event_token_t event_count_prepare(struct event_count *count) { return atomic64_add_return(ONE_WAITER, &count->state); } /* * Cancel a wait token that has been prepared but not waited upon. This must be called after * event_count_prepare() when event_count_wait() is not going to be invoked on the token. */ void event_count_cancel(struct event_count *count, event_token_t token) { /* Decrement the waiter count if the event hasn't been signalled. */ if (fast_cancel(count, token)) return; /* * A signaller has already transferred (or promised to transfer) our token to the * semaphore, so we must consume it from the semaphore by waiting. */ event_count_wait(count, token, NULL); } /* * Check if the current event count state corresponds to the provided token, and if it is, wait for * a signal that the state has changed. If a timeout is provided, the wait will terminate after the * timeout has elapsed. Timing out automatically cancels the wait token, so callers must not * attempt to cancel the token in this case. The timeout is measured in nanoseconds. This function * returns true if the state changed, or false if it timed out. 
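 *
 * A bounded-wait sketch (still_no_work() is a hypothetical caller-side check,
 * and the 10ms timeout is expressed in nanoseconds as this function expects):
 *
 *	event_token_t token = event_count_prepare(count);
 *
 *	if (still_no_work()) {
 *		ktime_t timeout = 10 * 1000 * 1000;
 *
 *		if (!event_count_wait(count, token, &timeout))
 *			return;   (timed out; the token was cancelled for us)
 *	} else {
 *		event_count_cancel(count, token);
 *	}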
*/ bool event_count_wait(struct event_count *count, event_token_t token, const ktime_t *timeout) { for (;;) { /* Wait for a signaller to transfer our wait token to the semaphore. */ if (!consume_wait_token(count, timeout)) { /* * The wait timed out, so we must cancel the token instead. Try to * decrement the waiter count if the event hasn't been signalled. */ if (fast_cancel(count, token)) return false; /* * We timed out, but a signaller came in before we could cancel the wait. * We have no choice but to wait for the semaphore to be posted. Since the * signaller has promised to do it, the wait should be short. The timeout * and the signal happened at about the same time, so either outcome could * be returned. It's simpler to ignore the timeout. */ timeout = NULL; continue; } /* A wait token has now been consumed from the semaphore. */ /* Stop waiting if the count has changed since the token was acquired. */ if (!same_event(token, atomic64_read(&count->state))) return true; /* * We consumed someone else's wait token. Put it back in the semaphore, which will * wake another waiter, hopefully one who can stop waiting. */ uds_release_semaphore(&count->semaphore); /* Attempt to give an earlier waiter a shot at the semaphore. */ cond_resched(); } } vdo-8.3.1.1/utils/uds/event-count.h000066400000000000000000000043231476467262700170540ustar00rootroot00000000000000/* SPDX-License-Identifier: GPL-2.0-only */ /* * Copyright 2023 Red Hat */ #ifndef EVENT_COUNT_H #define EVENT_COUNT_H #include "time-utils.h" /* * An event count is a lock-free equivalent of a condition variable. * * Using an event count, a lock-free producer/consumer can wait for a state change (adding an item * to an empty queue, for example) without spinning or falling back on the use of mutex-based * locks. Signalling is cheap when there are no waiters (a memory fence), and preparing to wait is * also inexpensive (an atomic add instruction). * * A lock-free producer should call event_count_broadcast() after any mutation to the lock-free * data structure that a consumer might be waiting on. The consumers should poll for work like * this: * * for (;;) { * // Fast path--no additional cost to consumer. * if (lockfree_dequeue(&item)) * return item; * // Two-step wait: get current token and poll state, either cancelling * // the wait or waiting for the token to be signalled. * event_token_t token = event_count_prepare(event_count); * if (lockfree_dequeue(&item)) { * event_count_cancel(event_count, token); * return item; * } * event_count_wait(event_count, token, NULL); * // State has changed, but must check condition again, so loop. * } * * Once event_count_prepare() is called, the caller should neither sleep nor perform long-running * or blocking actions before passing the token to event_count_cancel() or event_count_wait(). The * implementation is optimized for a short polling window, and will not perform well if there are * outstanding tokens that have been signalled but not waited upon. 
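 *
 * The producer side of the pattern above is simply the mutation followed by a
 * broadcast (lockfree_enqueue() is a stand-in for the caller's own structure):
 *
 *	lockfree_enqueue(&queue, item);
 *	event_count_broadcast(event_count);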
*/ struct event_count; typedef unsigned int event_token_t; int __must_check make_event_count(struct event_count **count_ptr); void free_event_count(struct event_count *count); void event_count_broadcast(struct event_count *count); event_token_t __must_check event_count_prepare(struct event_count *count); void event_count_cancel(struct event_count *count, event_token_t token); bool event_count_wait(struct event_count *count, event_token_t token, const ktime_t *timeout); #endif /* EVENT_COUNT_H */ vdo-8.3.1.1/utils/uds/fileUtils.c000066400000000000000000000201441476467262700165370ustar00rootroot00000000000000// SPDX-License-Identifier: GPL-2.0-only /* * Copyright 2023 Red Hat */ #include "fileUtils.h" #include #include #include #include #include #include "errors.h" #include "logger.h" #include "memory-alloc.h" #include "numeric.h" #include "permassert.h" #include "string-utils.h" #include "syscalls.h" /**********************************************************************/ int file_exists(const char *path, bool *exists) { struct stat stat_buf; int result = logging_stat_missing_ok(path, &stat_buf, __func__); if (result == UDS_SUCCESS) { *exists = true; } else if (result == ENOENT) { *exists = false; result = UDS_SUCCESS; } return result; } /**********************************************************************/ int open_file(const char *path, enum file_access access, int *fd) { int ret_fd; int flags; mode_t mode; switch (access) { case FU_READ_ONLY: flags = O_RDONLY; mode = 0; break; case FU_READ_WRITE: flags = O_RDWR; mode = 0; break; case FU_CREATE_READ_WRITE: flags = O_CREAT | O_RDWR | O_TRUNC; mode = 0666; break; case FU_CREATE_WRITE_ONLY: flags = O_CREAT | O_WRONLY | O_TRUNC; mode = 0666; break; case FU_READ_ONLY_DIRECT: flags = O_RDONLY | O_DIRECT; mode = 0; break; case FU_READ_WRITE_DIRECT: flags = O_RDWR | O_DIRECT; mode = 0; break; case FU_CREATE_READ_WRITE_DIRECT: flags = O_CREAT | O_RDWR | O_TRUNC | O_DIRECT; mode = 0666; break; case FU_CREATE_WRITE_ONLY_DIRECT: flags = O_CREAT | O_WRONLY | O_TRUNC | O_DIRECT; mode = 0666; break; default: return vdo_log_warning_strerror(UDS_INVALID_ARGUMENT, "invalid access mode opening file %s", path); } do { ret_fd = open(path, flags, mode); } while ((ret_fd == -1) && (errno == EINTR)); if (ret_fd < 0) return vdo_log_error_strerror(errno, "open_file(): failed opening %s with file access: %d", path, access); *fd = ret_fd; return UDS_SUCCESS; } /**********************************************************************/ int close_file(int fd, const char *error_message) { return logging_close(fd, error_message); } /**********************************************************************/ void try_close_file(int fd) { int old_errno = errno; int result = close_file(fd, __func__); errno = old_errno; if (result != UDS_SUCCESS) vdo_log_debug_strerror(result, "error closing file"); } /**********************************************************************/ int sync_and_close_file(int fd, const char *error_message) { int result = logging_fsync(fd, error_message); if (result != UDS_SUCCESS) { try_close_file(fd); return result; } return close_file(fd, error_message); } /**********************************************************************/ void try_sync_and_close_file(int fd) { int result = sync_and_close_file(fd, __func__); if (result != UDS_SUCCESS) vdo_log_debug_strerror(result, "error syncing and closing file"); } /**********************************************************************/ int read_buffer(int fd, void *buffer, unsigned int length) { u8 *ptr = 
buffer; size_t bytes_to_read = length; while (bytes_to_read > 0) { ssize_t bytes_read; int result = logging_read(fd, ptr, bytes_to_read, __func__, &bytes_read); if (result != UDS_SUCCESS) return result; if (bytes_read == 0) return vdo_log_warning_strerror(UDS_CORRUPT_DATA, "unexpected end of file while reading"); ptr += bytes_read; bytes_to_read -= bytes_read; } return UDS_SUCCESS; } /**********************************************************************/ int read_data_at_offset(int fd, off_t offset, void *buffer, size_t size, size_t *length) { u8 *ptr = buffer; size_t bytes_to_read = size; off_t current_offset = offset; while (bytes_to_read > 0) { ssize_t bytes_read; int result = logging_pread(fd, ptr, bytes_to_read, current_offset, __func__, &bytes_read); if (result != UDS_SUCCESS) return result; if (bytes_read == 0) break; ptr += bytes_read; bytes_to_read -= bytes_read; current_offset += bytes_read; } *length = ptr - (u8 *) buffer; return UDS_SUCCESS; } /**********************************************************************/ int write_buffer(int fd, const void *buffer, unsigned int length) { size_t bytes_to_write = length; const u8 *ptr = buffer; while (bytes_to_write > 0) { ssize_t written; int result = logging_write(fd, ptr, bytes_to_write, __func__, &written); if (result != UDS_SUCCESS) return result; if (written == 0) // this should not happen, but if it does, errno won't // be defined, so we need to return our own error return vdo_log_error_strerror(UDS_UNKNOWN_ERROR, "wrote 0 bytes"); bytes_to_write -= written; ptr += written; } return UDS_SUCCESS; } /**********************************************************************/ int write_buffer_at_offset(int fd, off_t offset, const void *buffer, size_t length) { size_t bytes_to_write = length; const u8 *ptr = buffer; off_t current_offset = offset; while (bytes_to_write > 0) { ssize_t written; int result = logging_pwrite(fd, ptr, bytes_to_write, current_offset, __func__, &written); if (result != UDS_SUCCESS) return result; if (written == 0) // this should not happen, but if it does, errno won't // be defined, so we need to return our own error return vdo_log_error_strerror(UDS_UNKNOWN_ERROR, "impossible write error"); bytes_to_write -= written; ptr += written; current_offset += written; } return UDS_SUCCESS; } /**********************************************************************/ int get_open_file_size(int fd, off_t *size_ptr) { struct stat statbuf; if (logging_fstat(fd, &statbuf, "get_open_file_size()") == -1) return errno; *size_ptr = statbuf.st_size; return UDS_SUCCESS; } /**********************************************************************/ int remove_file(const char *file_name) { int result = unlink(file_name); if (result == 0 || errno == ENOENT) return UDS_SUCCESS; return vdo_log_warning_strerror(errno, "Failed to remove %s", file_name); } /**********************************************************************/ bool file_name_match(const char *pattern, const char *string, int flags) { int result = fnmatch(pattern, string, flags); if ((result != 0) && (result != FNM_NOMATCH)) vdo_log_error("file_name_match(): fnmatch(): returned an error: %d, looking for \"%s\" with flags: %d", result, string, flags); return result == 0; } /**********************************************************************/ int make_abs_path(const char *path, char **abs_path) { char *tmp; int result = UDS_SUCCESS; if (path[0] == '/') { result = vdo_duplicate_string(path, __func__, &tmp); } else { char *cwd = get_current_dir_name(); if (cwd == NULL) return 
errno; result = vdo_alloc_sprintf(__func__, &tmp, "%s/%s", cwd, path); vdo_free(cwd); } if (result == VDO_SUCCESS) *abs_path = tmp; return result; } /**********************************************************************/ int logging_stat(const char *path, struct stat *buf, const char *context) { if (stat(path, buf) == 0) return UDS_SUCCESS; return vdo_log_error_strerror(errno, "%s failed in %s for path %s", __func__, context, path); } /**********************************************************************/ int logging_stat_missing_ok(const char *path, struct stat *buf, const char *context) { if (stat(path, buf) == 0) return UDS_SUCCESS; if (errno == ENOENT) return errno; return vdo_log_error_strerror(errno, "%s failed in %s for path %s", __func__, context, path); } /**********************************************************************/ int logging_fstat(int fd, struct stat *buf, const char *context) { return check_system_call(fstat(fd, buf), __func__, context); } /**********************************************************************/ int logging_fsync(int fd, const char *context) { return check_system_call(fsync(fd), __func__, context); } vdo-8.3.1.1/utils/uds/fileUtils.h000066400000000000000000000163741476467262700165560ustar00rootroot00000000000000/* * Copyright Red Hat * * This program is free software; you can redistribute it and/or * modify it under the terms of the GNU General Public License * as published by the Free Software Foundation; either version 2 * of the License, or (at your option) any later version. * * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with this program; if not, write to the Free Software * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA * 02110-1301, USA. */ #ifndef FILE_UTILS_H #define FILE_UTILS_H #include #include enum file_access { FU_READ_ONLY = 0, // open file with read-only access FU_READ_WRITE = 1, // open file with read-write access FU_CREATE_READ_WRITE = 2, // same, but create and truncate // with 0666 // mode bits if the file doesn't exist FU_CREATE_WRITE_ONLY = 3, // like above, but open for writing // only // Direct I/O: FU_READ_ONLY_DIRECT = 4, // open file with read-only access FU_READ_WRITE_DIRECT = 5, // open file with read-write access FU_CREATE_READ_WRITE_DIRECT = 6, // same, but create and truncate with // 0666 mode bits if the file doesn't // exist FU_CREATE_WRITE_ONLY_DIRECT = 7, // like above, but open for writing // only }; /** * Check whether a file exists. * * @param path The path to the file * @param exists A pointer to a bool which will be set to true if the * file exists and false if it does not. * * @return UDS_SUCCESS or an error code **/ int __must_check file_exists(const char *path, bool *exists); /** * Open a file. * * @param path The path to the file * @param access Access mode selected * @param fd A pointer to the return file descriptor on success * * @return UDS_SUCCESS or an error code **/ int __must_check open_file(const char *path, enum file_access access, int *fd); /** * Close a file. * * @param fd The file descriptor to close * @param error_message The error message to log if the close fails (if * NULL, no error will be logged). 
* * @return UDS_SUCCESS or an error code **/ int close_file(int fd, const char *error_message); /** * Attempt to close a file, ignoring errors. * * @param fd The file descriptor to close **/ void try_close_file(int fd); /** * Close a file after syncing it. * * @param fd The file descriptor to close * @param error_message The error message to log if the close fails (if * NULL, no error will be logged). * * @return UDS_SUCCESS or an error code **/ int __must_check sync_and_close_file(int fd, const char *error_message); /** * Attempt to sync and then close a file, ignoring errors. * * @param fd The file descriptor to close **/ void try_sync_and_close_file(int fd); /** * Read into a buffer from a file. * * @param fd The file descriptor from which to read * @param buffer The buffer into which to read * @param length The number of bytes to read * * @return UDS_SUCCESS or an error code **/ int __must_check read_buffer(int fd, void *buffer, unsigned int length); /** * Read into a buffer from a file at a given offset into the file. * * @param [in] fd The file descriptor from which to read * @param [in] offset The file offset at which to start reading * @param [in] buffer The buffer into which to read * @param [in] size The size of the buffer * @param [out] length The amount actually read. * * @return UDS_SUCCESS or an error code **/ int __must_check read_data_at_offset(int fd, off_t offset, void *buffer, size_t size, size_t *length); /** * Write a buffer to a file. * * @param fd The file descriptor to which to write * @param buffer The buffer to write * @param length The number of bytes to write * * @return UDS_SUCCESS or an error code **/ int __must_check write_buffer(int fd, const void *buffer, unsigned int length); /** * Write a buffer to a file starting at a given offset in the file. * * @param fd The file descriptor to which to write * @param offset The offset into the file at which to write * @param buffer The buffer to write * @param length The number of bytes to write * * @return UDS_SUCCESS or an error code **/ int __must_check write_buffer_at_offset(int fd, off_t offset, const void *buffer, size_t length); /** * Determine the size of an open file. * * @param fd the file descriptor * @param size_ptr a pointer in which to store the result * * @return UDS_SUCCESS or an error code **/ int __must_check get_open_file_size(int fd, off_t *size_ptr); /** * Remove a file, logging an error if any. * * @param file_name The file name to remove * * @return UDS_SUCCESS or error code. **/ int remove_file(const char *file_name); /** * Match file or path name. * * @param pattern A shell wildcard pattern. * @param string String to match against pattern. * @param flags Modify matching behavior as per fnmatch(3). * * @return True if there was a match, false otherwise. * * @note Logs errors encountered. **/ bool __must_check file_name_match(const char *pattern, const char *string, int flags); /** * Convert a path to an absolute path by adding the current working directory * to the beginning if necessary. On success, abs_path should be * freed by the caller. * * @param [in] path A path to be converted * @param [out] abs_path An absolute path * * @return UDS_SUCCESS or an error code **/ int make_abs_path(const char *path, char **abs_path); /** * Wrap the stat(2) system call. 
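 *
 * Example (an illustrative sketch only: read_first_block() and the 4 KB size
 * are not part of this code; it just shows how the helpers above compose,
 * with every call reporting UDS_SUCCESS or an error code):
 *
 *	static int read_first_block(const char *path, void *block)
 *	{
 *		int fd;
 *		int result = open_file(path, FU_READ_ONLY, &fd);
 *
 *		if (result != UDS_SUCCESS)
 *			return result;
 *
 *		result = read_buffer(fd, block, 4096);
 *		if (result != UDS_SUCCESS) {
 *			try_close_file(fd);
 *			return result;
 *		}
 *
 *		return close_file(fd, __func__);
 *	}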
* * @param path The path to stat * @param buf A buffer to hold the stat results * @param context The calling context (for logging) * * @return UDS_SUCCESS or an error code **/ int __must_check logging_stat(const char *path, struct stat *buf, const char *context); /** * Wrap the stat(2) system call. Use this version if it should not be an * error for the file being statted to not exist. * * @param path The path to stat * @param buf A buffer to hold the stat results * @param context The calling context (for logging) * * @return UDS_SUCCESS or an error code **/ int __must_check logging_stat_missing_ok(const char *path, struct stat *buf, const char *context); /** * Wrap the fstat(2) system call. * * @param fd The descriptor to stat * @param buf A buffer to hold the stat results * @param context The calling context (for logging) * * @return UDS_SUCCESS or an error code **/ int __must_check logging_fstat(int fd, struct stat *buf, const char *context); /** * Wrap the fsync(2) system call. * * @param fd The descriptor to sync * @param context The calling context (for logging) * * @return UDS_SUCCESS or an error code **/ int __must_check logging_fsync(int fd, const char *context); #endif /* FILE_UTILS_H */ vdo-8.3.1.1/utils/uds/funnel-queue.c000066400000000000000000000117721476467262700172170ustar00rootroot00000000000000// SPDX-License-Identifier: GPL-2.0-only /* * Copyright 2023 Red Hat */ #include "funnel-queue.h" #include "cpu.h" #include "memory-alloc.h" #include "permassert.h" int vdo_make_funnel_queue(struct funnel_queue **queue_ptr) { int result; struct funnel_queue *queue; result = vdo_allocate(1, struct funnel_queue, "funnel queue", &queue); if (result != VDO_SUCCESS) return result; /* * Initialize the stub entry and put it in the queue, establishing the invariant that * queue->newest and queue->oldest are never null. */ queue->stub.next = NULL; queue->newest = &queue->stub; queue->oldest = &queue->stub; *queue_ptr = queue; return VDO_SUCCESS; } void vdo_free_funnel_queue(struct funnel_queue *queue) { vdo_free(queue); } static struct funnel_queue_entry *get_oldest(struct funnel_queue *queue) { /* * Barrier requirements: We need a read barrier between reading a "next" field pointer * value and reading anything it points to. There's an accompanying barrier in * vdo_funnel_queue_put() between its caller setting up the entry and making it visible. */ struct funnel_queue_entry *oldest = queue->oldest; struct funnel_queue_entry *next = READ_ONCE(oldest->next); if (oldest == &queue->stub) { /* * When the oldest entry is the stub and it has no successor, the queue is * logically empty. */ if (next == NULL) return NULL; /* * The stub entry has a successor, so the stub can be dequeued and ignored without * breaking the queue invariants. */ oldest = next; queue->oldest = oldest; next = READ_ONCE(oldest->next); } /* * We have a non-stub candidate to dequeue. If it lacks a successor, we'll need to put the * stub entry back on the queue first. */ if (next == NULL) { struct funnel_queue_entry *newest = READ_ONCE(queue->newest); if (oldest != newest) { /* * Another thread has already swung queue->newest atomically, but not yet * assigned previous->next. The queue is really still empty. */ return NULL; } /* * Put the stub entry back on the queue, ensuring a successor will eventually be * seen. */ vdo_funnel_queue_put(queue, &queue->stub); /* Check again for a successor. 
*/ next = READ_ONCE(oldest->next); if (next == NULL) { /* * We lost a race with a producer who swapped queue->newest before we did, * but who hasn't yet updated previous->next. Try again later. */ return NULL; } } return oldest; } /* * Poll a queue, removing the oldest entry if the queue is not empty. This function must only be * called from a single consumer thread. */ struct funnel_queue_entry *vdo_funnel_queue_poll(struct funnel_queue *queue) { struct funnel_queue_entry *oldest = get_oldest(queue); if (oldest == NULL) return oldest; /* * Dequeue the oldest entry and return it. Only one consumer thread may call this function, * so no locking, atomic operations, or fences are needed; queue->oldest is owned by the * consumer and oldest->next is never used by a producer thread after it is swung from NULL * to non-NULL. */ queue->oldest = READ_ONCE(oldest->next); /* * Make sure the caller sees the proper stored data for this entry. Since we've already * fetched the entry pointer we stored in "queue->oldest", this also ensures that on entry * to the next call we'll properly see the dependent data. */ smp_rmb(); /* * If "oldest" is a very light-weight work item, we'll be looking for the next one very * soon, so prefetch it now. */ uds_prefetch_address(queue->oldest, true); WRITE_ONCE(oldest->next, NULL); return oldest; } /* * Check whether the funnel queue is empty or not. If the queue is in a transition state with one * or more entries being added such that the list view is incomplete, this function will report the * queue as empty. */ bool vdo_is_funnel_queue_empty(struct funnel_queue *queue) { return get_oldest(queue) == NULL; } /* * Check whether the funnel queue is idle or not. If the queue has entries available to be * retrieved, it is not idle. If the queue is in a transition state with one or more entries being * added such that the list view is incomplete, it may not be possible to retrieve an entry with * the vdo_funnel_queue_poll() function, but the queue will not be considered idle. */ bool vdo_is_funnel_queue_idle(struct funnel_queue *queue) { /* * Oldest is not the stub, so there's another entry, though if next is NULL we can't * retrieve it yet. */ if (queue->oldest != &queue->stub) return false; /* * Oldest is the stub, but newest has been updated by _put(); either there's another, * retrievable entry in the list, or the list is officially empty but in the intermediate * state of having an entry added. * * Whether anything is retrievable depends on whether stub.next has been updated and become * visible to us, but for idleness we don't care. And due to memory ordering in _put(), the * update to newest would be visible to us at the same time or sooner. */ if (READ_ONCE(queue->newest) != &queue->stub) return false; return true; } vdo-8.3.1.1/utils/uds/funnel-queue.h000066400000000000000000000114451476467262700172210ustar00rootroot00000000000000/* SPDX-License-Identifier: GPL-2.0-only */ /* * Copyright 2023 Red Hat */ #ifndef VDO_FUNNEL_QUEUE_H #define VDO_FUNNEL_QUEUE_H #include #include /* * A funnel queue is a simple (almost) lock-free queue that accepts entries from multiple threads * (multi-producer) and delivers them to a single thread (single-consumer). "Funnel" is an attempt * to evoke the image of requests from more than one producer being "funneled down" to a single * consumer. * * This is an unsynchronized but thread-safe data structure when used as intended. There is no * mechanism to ensure that only one thread is consuming from the queue. 
If more than one thread * attempts to consume from the queue, the resulting behavior is undefined. Clients must not * directly access or manipulate the internals of the queue, which are only exposed for the purpose * of allowing the very simple enqueue operation to be inlined. * * The implementation requires that a funnel_queue_entry structure (a link pointer) is embedded in * the queue entries, and pointers to those structures are used exclusively by the queue. No macros * are defined to template the queue, so the offset of the funnel_queue_entry in the records placed * in the queue must all be the same so the client can derive their structure pointer from the * entry pointer returned by vdo_funnel_queue_poll(). * * Callers are wholly responsible for allocating and freeing the entries. Entries may be freed as * soon as they are returned since this queue is not susceptible to the "ABA problem" present in * many lock-free data structures. The queue is dynamically allocated to ensure cache-line * alignment, but no other dynamic allocation is used. * * The algorithm is not actually 100% lock-free. There is a single point in vdo_funnel_queue_put() * at which a preempted producer will prevent the consumers from seeing items added to the queue by * later producers, and only if the queue is short enough or the consumer fast enough for it to * reach what was the end of the queue at the time of the preemption. * * The consumer function, vdo_funnel_queue_poll(), will return NULL when the queue is empty. To * wait for data to consume, spin (if safe) or combine the queue with a struct event_count to * signal the presence of new entries. */ /* This queue link structure must be embedded in client entries. */ struct funnel_queue_entry { /* The next (newer) entry in the queue. */ struct funnel_queue_entry *next; }; /* * The dynamically allocated queue structure, which is allocated on a cache line boundary so the * producer and consumer fields in the structure will land on separate cache lines. This should be * consider opaque but it is exposed here so vdo_funnel_queue_put() can be inlined. */ struct __aligned(L1_CACHE_BYTES) funnel_queue { /* * The producers' end of the queue, an atomically exchanged pointer that will never be * NULL. */ struct funnel_queue_entry *newest; /* The consumer's end of the queue, which is owned by the consumer and never NULL. */ struct funnel_queue_entry *oldest __aligned(L1_CACHE_BYTES); /* A dummy entry used to provide the non-NULL invariants above. */ struct funnel_queue_entry stub; }; int __must_check vdo_make_funnel_queue(struct funnel_queue **queue_ptr); void vdo_free_funnel_queue(struct funnel_queue *queue); /* * Put an entry on the end of the queue. * * The entry pointer must be to the struct funnel_queue_entry embedded in the caller's data * structure. The caller must be able to derive the address of the start of their data structure * from the pointer that passed in here, so every entry in the queue must have the struct * funnel_queue_entry at the same offset within the client's structure. */ static inline void vdo_funnel_queue_put(struct funnel_queue *queue, struct funnel_queue_entry *entry) { struct funnel_queue_entry *previous; /* * Barrier requirements: All stores relating to the entry ("next" pointer, containing data * structure fields) must happen before the previous->next store making it visible to the * consumer. 
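 *
 * Example (an illustrative sketch only: work_item, submit_work(), and
 * get_work() are not part of this code, and passing a NULL timeout to
 * event_count_wait() is assumed here to mean "wait without a deadline"):
 *
 *	struct work_item {
 *		struct funnel_queue_entry entry;  // same offset in every record
 *		int payload;
 *	};
 *
 *	// Producer: any thread may call this.
 *	void submit_work(struct funnel_queue *q, struct event_count *posted,
 *			 struct work_item *item)
 *	{
 *		vdo_funnel_queue_put(q, &item->entry);
 *		event_count_broadcast(posted);
 *	}
 *
 *	// Consumer: only the single consumer thread may call this.
 *	struct work_item *get_work(struct funnel_queue *q, struct event_count *posted)
 *	{
 *		for (;;) {
 *			struct funnel_queue_entry *e = vdo_funnel_queue_poll(q);
 *			event_token_t token;
 *
 *			if (e != NULL)
 *				return container_of(e, struct work_item, entry);
 *
 *			// Prepare to wait, then poll once more to close the race
 *			// with a producer publishing in between.
 *			token = event_count_prepare(posted);
 *			e = vdo_funnel_queue_poll(q);
 *			if (e != NULL) {
 *				event_count_cancel(posted, token);
 *				return container_of(e, struct work_item, entry);
 *			}
 *			event_count_wait(posted, token, NULL);
 *		}
 *	}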
Also, the entry's "next" field initialization to NULL must happen before any * other producer threads can see the entry (the xchg) and try to update the "next" field. * * xchg implements a full barrier. */ WRITE_ONCE(entry->next, NULL); previous = xchg(&queue->newest, entry); /* * Preemptions between these two statements hide the rest of the queue from the consumer, * preventing consumption until the following assignment runs. */ WRITE_ONCE(previous->next, entry); } struct funnel_queue_entry *__must_check vdo_funnel_queue_poll(struct funnel_queue *queue); bool __must_check vdo_is_funnel_queue_empty(struct funnel_queue *queue); bool __must_check vdo_is_funnel_queue_idle(struct funnel_queue *queue); #endif /* VDO_FUNNEL_QUEUE_H */ vdo-8.3.1.1/utils/uds/funnel-requestqueue.h000066400000000000000000000016241476467262700206300ustar00rootroot00000000000000/* SPDX-License-Identifier: GPL-2.0-only */ /* * Copyright 2023 Red Hat */ #ifndef UDS_REQUEST_QUEUE_H #define UDS_REQUEST_QUEUE_H #include "indexer.h" /* * A simple request queue which will handle new requests in the order in which they are received, * and will attempt to handle requeued requests before new ones. However, the nature of the * implementation means that it cannot guarantee this ordering; the prioritization is merely a * hint. */ struct uds_request_queue; typedef void (*uds_request_queue_processor_fn)(struct uds_request *); int __must_check uds_make_request_queue(const char *queue_name, uds_request_queue_processor_fn processor, struct uds_request_queue **queue_ptr); void uds_request_queue_enqueue(struct uds_request_queue *queue, struct uds_request *request); void uds_request_queue_finish(struct uds_request_queue *queue); #endif /* UDS_REQUEST_QUEUE_H */ vdo-8.3.1.1/utils/uds/geometry.c000066400000000000000000000176771476467262700164530ustar00rootroot00000000000000// SPDX-License-Identifier: GPL-2.0-only /* * Copyright 2023 Red Hat */ #include "geometry.h" #include #include #include "errors.h" #include "logger.h" #include "memory-alloc.h" #include "permassert.h" #include "delta-index.h" #include "indexer.h" /* * An index volume is divided into a fixed number of fixed-size chapters, each consisting of a * fixed number of fixed-size pages. The volume layout is defined by two constants and four * parameters. The constants are that index records are 32 bytes long (16-byte block name plus * 16-byte metadata) and that open chapter index hash slots are one byte long. The four parameters * are the number of bytes in a page, the number of record pages in a chapter, the number of * chapters in a volume, and the number of chapters that are sparse. From these parameters, we can * derive the rest of the layout and other index properties. * * The index volume is sized by its maximum memory footprint. For a dense index, the persistent * storage is about 10 times the size of the memory footprint. For a sparse index, the persistent * storage is about 100 times the size of the memory footprint. * * For a small index with a memory footprint less than 1GB, there are three possible memory * configurations: 0.25GB, 0.5GB and 0.75GB. The default geometry for each is 1024 index records * per 32 KB page, 1024 chapters per volume, and either 64, 128, or 192 record pages per chapter * (resulting in 6, 13, or 20 index pages per chapter) depending on the memory configuration. For * the VDO default of a 0.25 GB index, this yields a deduplication window of 256 GB using about 2.5 * GB for the persistent storage and 256 MB of RAM. 
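 *
 * (An illustrative check of the 0.25 GB configuration above, assuming the
 * usual 4 KB VDO data block: a 32 KB page holds 32,768 / 32 = 1024 records;
 * 1024 records * 64 record pages = 65,536 records per chapter; 65,536 * 1024
 * chapters = 67,108,864 records per volume; and 67,108,864 * 4 KB = 256 GB of
 * deduplication window, matching the figure quoted above.)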
* * For a larger index with a memory footprint that is a multiple of 1 GB, the geometry is 1024 * index records per 32 KB page, 256 record pages per chapter, 26 index pages per chapter, and 1024 * chapters for every GB of memory footprint. For a 1 GB volume, this yields a deduplication window * of 1 TB using about 9GB of persistent storage and 1 GB of RAM. * * The above numbers hold for volumes which have no sparse chapters. A sparse volume has 10 times * as many chapters as the corresponding non-sparse volume, which provides 10 times the * deduplication window while using 10 times as much persistent storage as the equivalent * non-sparse volume with the same memory footprint. * * If the volume has been converted from a non-lvm format to an lvm volume, the number of chapters * per volume will have been reduced by one by eliminating physical chapter 0, and the virtual * chapter that formerly mapped to physical chapter 0 may be remapped to another physical chapter. * This remapping is expressed by storing which virtual chapter was remapped, and which physical * chapter it was moved to. */ int uds_make_index_geometry(size_t bytes_per_page, u32 record_pages_per_chapter, u32 chapters_per_volume, u32 sparse_chapters_per_volume, u64 remapped_virtual, u64 remapped_physical, struct index_geometry **geometry_ptr) { int result; struct index_geometry *geometry; result = vdo_allocate(1, struct index_geometry, "geometry", &geometry); if (result != VDO_SUCCESS) return result; geometry->bytes_per_page = bytes_per_page; geometry->record_pages_per_chapter = record_pages_per_chapter; geometry->chapters_per_volume = chapters_per_volume; geometry->sparse_chapters_per_volume = sparse_chapters_per_volume; geometry->dense_chapters_per_volume = chapters_per_volume - sparse_chapters_per_volume; geometry->remapped_virtual = remapped_virtual; geometry->remapped_physical = remapped_physical; geometry->records_per_page = bytes_per_page / BYTES_PER_RECORD; geometry->records_per_chapter = geometry->records_per_page * record_pages_per_chapter; geometry->records_per_volume = (u64) geometry->records_per_chapter * chapters_per_volume; geometry->chapter_mean_delta = 1 << DEFAULT_CHAPTER_MEAN_DELTA_BITS; geometry->chapter_payload_bits = bits_per(record_pages_per_chapter - 1); /* * We want 1 delta list for every 64 records in the chapter. * The "| 077" ensures that the chapter_delta_list_bits computation * does not underflow. */ geometry->chapter_delta_list_bits = bits_per((geometry->records_per_chapter - 1) | 077) - 6; geometry->delta_lists_per_chapter = 1 << geometry->chapter_delta_list_bits; /* We need enough address bits to achieve the desired mean delta. 
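 *
 * (Worked numbers, for illustration: a large chapter holds 1024 * 256 =
 * 262,144 records, so bits_per((262,144 - 1) | 077) - 6 = 18 - 6 = 12 delta
 * list bits, i.e. 4096 lists, matching DEFAULT_CHAPTER_DELTA_LIST_BITS; the
 * address then needs 16 - 12 + 18 = 22 bits, which keeps the mean delta at
 * 2^22 / 64 = 2^16. A small chapter holds 1024 * 64 = 65,536 records, giving
 * 16 - 6 = 10 delta list bits, matching SMALL_CHAPTER_DELTA_LIST_BITS. This
 * assumes bits_per(n) is the number of bits needed to represent n.)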
*/ geometry->chapter_address_bits = (DEFAULT_CHAPTER_MEAN_DELTA_BITS - geometry->chapter_delta_list_bits + bits_per(geometry->records_per_chapter - 1)); geometry->index_pages_per_chapter = uds_get_delta_index_page_count(geometry->records_per_chapter, geometry->delta_lists_per_chapter, geometry->chapter_mean_delta, geometry->chapter_payload_bits, bytes_per_page); geometry->pages_per_chapter = geometry->index_pages_per_chapter + record_pages_per_chapter; geometry->pages_per_volume = geometry->pages_per_chapter * chapters_per_volume; geometry->bytes_per_volume = bytes_per_page * (geometry->pages_per_volume + HEADER_PAGES_PER_VOLUME); *geometry_ptr = geometry; return UDS_SUCCESS; } int uds_copy_index_geometry(struct index_geometry *source, struct index_geometry **geometry_ptr) { return uds_make_index_geometry(source->bytes_per_page, source->record_pages_per_chapter, source->chapters_per_volume, source->sparse_chapters_per_volume, source->remapped_virtual, source->remapped_physical, geometry_ptr); } void uds_free_index_geometry(struct index_geometry *geometry) { vdo_free(geometry); } u32 __must_check uds_map_to_physical_chapter(const struct index_geometry *geometry, u64 virtual_chapter) { u64 delta; if (!uds_is_reduced_index_geometry(geometry)) return virtual_chapter % geometry->chapters_per_volume; if (likely(virtual_chapter > geometry->remapped_virtual)) { delta = virtual_chapter - geometry->remapped_virtual; if (likely(delta > geometry->remapped_physical)) return delta % geometry->chapters_per_volume; else return delta - 1; } if (virtual_chapter == geometry->remapped_virtual) return geometry->remapped_physical; delta = geometry->remapped_virtual - virtual_chapter; if (delta < geometry->chapters_per_volume) return geometry->chapters_per_volume - delta; /* This chapter is so old the answer doesn't matter. */ return 0; } /* Check whether any sparse chapters are in use. */ bool uds_has_sparse_chapters(const struct index_geometry *geometry, u64 oldest_virtual_chapter, u64 newest_virtual_chapter) { return uds_is_sparse_index_geometry(geometry) && ((newest_virtual_chapter - oldest_virtual_chapter + 1) > geometry->dense_chapters_per_volume); } bool uds_is_chapter_sparse(const struct index_geometry *geometry, u64 oldest_virtual_chapter, u64 newest_virtual_chapter, u64 virtual_chapter_number) { return uds_has_sparse_chapters(geometry, oldest_virtual_chapter, newest_virtual_chapter) && ((virtual_chapter_number + geometry->dense_chapters_per_volume) <= newest_virtual_chapter); } /* Calculate how many chapters to expire after opening the newest chapter. */ u32 uds_chapters_to_expire(const struct index_geometry *geometry, u64 newest_chapter) { /* If the index isn't full yet, don't expire anything. */ if (newest_chapter < geometry->chapters_per_volume) return 0; /* If a chapter is out of order... */ if (geometry->remapped_physical > 0) { u64 oldest_chapter = newest_chapter - geometry->chapters_per_volume; /* * ... expire an extra chapter when expiring the moved chapter to free physical * space for the new chapter ... */ if (oldest_chapter == geometry->remapped_virtual) return 2; /* * ... but don't expire anything when the new chapter will use the physical chapter * freed by expiring the moved chapter. */ if (oldest_chapter == (geometry->remapped_virtual + geometry->remapped_physical)) return 0; } /* Normally, just expire one. 
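 * (Illustrative numbers: with the default 1024-chapter dense geometry and no
 * remapping, a newest_chapter of 500 is still below chapters_per_volume, so
 * nothing expires; a newest_chapter of 2000 expires exactly one chapter, the
 * oldest, freeing the physical slot the new chapter will reuse.)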
*/ return 1; } vdo-8.3.1.1/utils/uds/geometry.h000066400000000000000000000111201476467262700164310ustar00rootroot00000000000000/* SPDX-License-Identifier: GPL-2.0-only */ /* * Copyright 2023 Red Hat */ #ifndef UDS_INDEX_GEOMETRY_H #define UDS_INDEX_GEOMETRY_H #include "indexer.h" /* * The index_geometry records parameters that define the layout of a UDS index volume, and the size and * shape of various index structures. It is created when the index is created, and is referenced by * many index sub-components. */ struct index_geometry { /* Size of a chapter page, in bytes */ size_t bytes_per_page; /* Number of record pages in a chapter */ u32 record_pages_per_chapter; /* Total number of chapters in a volume */ u32 chapters_per_volume; /* Number of sparsely-indexed chapters in a volume */ u32 sparse_chapters_per_volume; /* Number of bits used to determine delta list numbers */ u8 chapter_delta_list_bits; /* Virtual chapter remapped from physical chapter 0 */ u64 remapped_virtual; /* New physical chapter where the remapped chapter can be found */ u64 remapped_physical; /* * The following properties are derived from the ones above, but they are computed and * recorded as fields for convenience. */ /* Total number of pages in a volume, excluding the header */ u32 pages_per_volume; /* Total number of bytes in a volume, including the header */ size_t bytes_per_volume; /* Number of pages in a chapter */ u32 pages_per_chapter; /* Number of index pages in a chapter index */ u32 index_pages_per_chapter; /* Number of records that fit on a page */ u32 records_per_page; /* Number of records that fit in a chapter */ u32 records_per_chapter; /* Number of records that fit in a volume */ u64 records_per_volume; /* Number of delta lists per chapter index */ u32 delta_lists_per_chapter; /* Mean delta for chapter indexes */ u32 chapter_mean_delta; /* Number of bits needed for record page numbers */ u8 chapter_payload_bits; /* Number of bits used to compute addresses for chapter delta lists */ u8 chapter_address_bits; /* Number of densely-indexed chapters in a volume */ u32 dense_chapters_per_volume; }; enum { /* The number of bytes in a record (name + metadata) */ BYTES_PER_RECORD = (UDS_RECORD_NAME_SIZE + UDS_RECORD_DATA_SIZE), /* The default length of a page in a chapter, in bytes */ DEFAULT_BYTES_PER_PAGE = 1024 * BYTES_PER_RECORD, /* The default maximum number of records per page */ DEFAULT_RECORDS_PER_PAGE = DEFAULT_BYTES_PER_PAGE / BYTES_PER_RECORD, /* The default number of record pages in a chapter */ DEFAULT_RECORD_PAGES_PER_CHAPTER = 256, /* The default number of record pages in a chapter for a small index */ SMALL_RECORD_PAGES_PER_CHAPTER = 64, /* The default number of chapters in a volume */ DEFAULT_CHAPTERS_PER_VOLUME = 1024, /* The default number of sparsely-indexed chapters in a volume */ DEFAULT_SPARSE_CHAPTERS_PER_VOLUME = 0, /* The log2 of the default mean delta */ DEFAULT_CHAPTER_MEAN_DELTA_BITS = 16, /* The log2 of the number of delta lists in a large chapter */ DEFAULT_CHAPTER_DELTA_LIST_BITS = 12, /* The log2 of the number of delta lists in a small chapter */ SMALL_CHAPTER_DELTA_LIST_BITS = 10, /* The number of header pages per volume */ HEADER_PAGES_PER_VOLUME = 1, }; int __must_check uds_make_index_geometry(size_t bytes_per_page, u32 record_pages_per_chapter, u32 chapters_per_volume, u32 sparse_chapters_per_volume, u64 remapped_virtual, u64 remapped_physical, struct index_geometry **geometry_ptr); int __must_check uds_copy_index_geometry(struct index_geometry *source, struct 
index_geometry **geometry_ptr); void uds_free_index_geometry(struct index_geometry *geometry); u32 __must_check uds_map_to_physical_chapter(const struct index_geometry *geometry, u64 virtual_chapter); /* * Check whether this geometry is reduced by a chapter. This will only be true if the volume was * converted from a non-lvm volume to an lvm volume. */ static inline bool __must_check uds_is_reduced_index_geometry(const struct index_geometry *geometry) { return !!(geometry->chapters_per_volume & 1); } static inline bool __must_check uds_is_sparse_index_geometry(const struct index_geometry *geometry) { return geometry->sparse_chapters_per_volume > 0; } bool __must_check uds_has_sparse_chapters(const struct index_geometry *geometry, u64 oldest_virtual_chapter, u64 newest_virtual_chapter); bool __must_check uds_is_chapter_sparse(const struct index_geometry *geometry, u64 oldest_virtual_chapter, u64 newest_virtual_chapter, u64 virtual_chapter_number); u32 __must_check uds_chapters_to_expire(const struct index_geometry *geometry, u64 newest_chapter); #endif /* UDS_INDEX_GEOMETRY_H */ vdo-8.3.1.1/utils/uds/hash-utils.h000066400000000000000000000036501476467262700166700ustar00rootroot00000000000000/* SPDX-License-Identifier: GPL-2.0-only */ /* * Copyright 2023 Red Hat */ #ifndef UDS_HASH_UTILS_H #define UDS_HASH_UTILS_H #include "numeric.h" #include "geometry.h" #include "indexer.h" /* Utilities for extracting portions of a request name for various uses. */ /* How various portions of a record name are apportioned. */ enum { VOLUME_INDEX_BYTES_OFFSET = 0, VOLUME_INDEX_BYTES_COUNT = 8, CHAPTER_INDEX_BYTES_OFFSET = 8, CHAPTER_INDEX_BYTES_COUNT = 6, SAMPLE_BYTES_OFFSET = 14, SAMPLE_BYTES_COUNT = 2, }; static inline u64 uds_extract_chapter_index_bytes(const struct uds_record_name *name) { const u8 *chapter_bits = &name->name[CHAPTER_INDEX_BYTES_OFFSET]; u64 bytes = (u64) get_unaligned_be16(chapter_bits) << 32; bytes |= get_unaligned_be32(chapter_bits + 2); return bytes; } static inline u64 uds_extract_volume_index_bytes(const struct uds_record_name *name) { return get_unaligned_be64(&name->name[VOLUME_INDEX_BYTES_OFFSET]); } static inline u32 uds_extract_sampling_bytes(const struct uds_record_name *name) { return get_unaligned_be16(&name->name[SAMPLE_BYTES_OFFSET]); } /* Compute the chapter delta list for a given name. */ static inline u32 uds_hash_to_chapter_delta_list(const struct uds_record_name *name, const struct index_geometry *geometry) { return ((uds_extract_chapter_index_bytes(name) >> geometry->chapter_address_bits) & ((1 << geometry->chapter_delta_list_bits) - 1)); } /* Compute the chapter delta address for a given name. */ static inline u32 uds_hash_to_chapter_delta_address(const struct uds_record_name *name, const struct index_geometry *geometry) { return uds_extract_chapter_index_bytes(name) & ((1 << geometry->chapter_address_bits) - 1); } static inline unsigned int uds_name_to_hash_slot(const struct uds_record_name *name, unsigned int slot_count) { return (unsigned int) (uds_extract_chapter_index_bytes(name) % slot_count); } #endif /* UDS_HASH_UTILS_H */ vdo-8.3.1.1/utils/uds/hlist.h000066400000000000000000000054771476467262700157430ustar00rootroot00000000000000/* * Copyright Red Hat * * This program is free software; you can redistribute it and/or * modify it under the terms of the GNU General Public License * as published by the Free Software Foundation; either version 2 * of the License, or (at your option) any later version. 
* * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with this program; if not, write to the Free Software * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA * 02110-1301, USA. */ #ifndef HLIST_H #define HLIST_H #include /* * An "hlist" is a doubly linked list with the listhead being a single pointer * to the head of the list. * * The Linux kernel provides an hlist implementation in . This * file defines the hlist interfaces used by UDS for the user mode build. * * The equivalent used in the user implementation is a LIST. */ struct hlist_head { struct hlist_node *first; }; struct hlist_node { struct hlist_node *next, **pprev; }; #define INIT_HLIST_HEAD(ptr) ((ptr)->first = NULL) #define hlist_entry(ptr, type, member) container_of(ptr,type,member) #define hlist_entry_safe(ptr, type, member) \ __extension__({ typeof(ptr) ____ptr = (ptr); \ ____ptr ? hlist_entry(____ptr, type, member) : NULL; \ }) /** * Iterate over list of given type * @param pos the type * to use as a loop cursor. * @param head the head for your list. * @param member the name of the hlist_node within the struct. */ #define hlist_for_each_entry(pos, head, member) \ for (pos = hlist_entry_safe((head)->first, typeof(*(pos)), member); \ pos; \ pos = hlist_entry_safe((pos)->member.next, typeof(*(pos)), member)) /** * Add a new entry at the beginning of the hlist * @param n new entry to be added * @param h hlist head to add it after */ static inline void hlist_add_head(struct hlist_node *n, struct hlist_head *h) { struct hlist_node *first = h->first; WRITE_ONCE(n->next, first); if (first) WRITE_ONCE(first->pprev, &n->next); WRITE_ONCE(h->first, n); WRITE_ONCE(n->pprev, &h->first); } /** * Delete the specified hlist_node from its list * @param n Node to delete. */ static inline void hlist_del(struct hlist_node *n) { struct hlist_node *next = n->next; struct hlist_node **pprev = n->pprev; WRITE_ONCE(*pprev, next); if (next) WRITE_ONCE(next->pprev, pprev); } /** * Is the specified hlist_head structure an empty hlist? * @param h Structure to check. */ static inline int hlist_empty(const struct hlist_head *h) { return !READ_ONCE(h->first); } #endif /* HLIST_H */ vdo-8.3.1.1/utils/uds/index-layout.c000066400000000000000000001421031476467262700172210ustar00rootroot00000000000000// SPDX-License-Identifier: GPL-2.0-only /* * Copyright 2023 Red Hat */ #include "index-layout.h" #include #include "logger.h" #include "memory-alloc.h" #include "murmurhash3.h" #include "numeric.h" #include "time-utils.h" #include "config.h" #include "open-chapter.h" #include "volume-index.h" /* * The UDS layout on storage media is divided into a number of fixed-size regions, the sizes of * which are computed when the index is created. Every header and region begins on 4K block * boundary. Save regions are further sub-divided into regions of their own. * * Each region has a kind and an instance number. Some kinds only have one instance and therefore * use RL_SOLE_INSTANCE (-1) as the instance number. The RL_KIND_INDEX used to use instances to * represent sub-indices; now, however there is only ever one sub-index and therefore one instance. * The RL_KIND_VOLUME_INDEX uses instances to record which zone is being saved. * * Every region header has a type and version. 
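 *
 * (In the diagrams below, each region is labeled with its kind and instance:
 * 101 is RL_KIND_INDEX, 201 is RL_KIND_VOLUME, 202 is RL_KIND_SAVE, 301 is
 * RL_KIND_INDEX_PAGE_MAP, 302 is RL_KIND_VOLUME_INDEX, 303 is
 * RL_KIND_OPEN_CHAPTER, and -1 marks a region with only a sole instance.)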
* * +-+-+---------+--------+--------+-+ * | | | I N D E X 0 101, 0 | | * |H|C+---------+--------+--------+S| * |D|f| Volume | Save | Save |e| * |R|g| Region | Region | Region |a| * | | | 201, -1 | 202, 0 | 202, 1 |l| * +-+-+--------+---------+--------+-+ * * The header contains the encoded region layout table as well as some index configuration data. * The sub-index region and its subdivisions are maintained in the same table. * * There are two save regions to preserve the old state in case saving the new state is incomplete. * They are used in alternation. Each save region is further divided into sub-regions. * * +-+-----+------+------+-----+-----+ * |H| IPM | MI | MI | | OC | * |D| | zone | zone | ... | | * |R| 301 | 302 | 302 | | 303 | * | | -1 | 0 | 1 | | -1 | * +-+-----+------+------+-----+-----+ * * The header contains the encoded region layout table as well as index state data for that save. * Each save also has a unique nonce. */ #define NONCE_INFO_SIZE 32 #define MAX_SAVES 2 enum region_kind { RL_KIND_EMPTY = 0, RL_KIND_HEADER = 1, RL_KIND_CONFIG = 100, RL_KIND_INDEX = 101, RL_KIND_SEAL = 102, RL_KIND_VOLUME = 201, RL_KIND_SAVE = 202, RL_KIND_INDEX_PAGE_MAP = 301, RL_KIND_VOLUME_INDEX = 302, RL_KIND_OPEN_CHAPTER = 303, }; /* Some region types are historical and are no longer used. */ enum region_type { RH_TYPE_FREE = 0, /* unused */ RH_TYPE_SUPER = 1, RH_TYPE_SAVE = 2, RH_TYPE_CHECKPOINT = 3, /* unused */ RH_TYPE_UNSAVED = 4, }; #define RL_SOLE_INSTANCE 65535 /* * Super block version 2 is the first released version. * * Super block version 3 is the normal version used from RHEL 8.2 onwards. * * Super block versions 4 through 6 were incremental development versions and * are not supported. * * Super block version 7 is used for volumes which have been reduced in size by one chapter in * order to make room to prepend LVM metadata to a volume originally created without lvm. This * allows the index to retain most its deduplication records. 
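 *
 * (Cross-reference, for convenience: because the standard chapter counts are
 * even and this conversion removes exactly one chapter, geometry.h's
 * uds_is_reduced_index_geometry() detects a converted volume simply by
 * testing whether chapters_per_volume is odd.)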
*/ #define SUPER_VERSION_MINIMUM 3 #define SUPER_VERSION_CURRENT 3 #define SUPER_VERSION_MAXIMUM 7 static const u8 LAYOUT_MAGIC[] = "*ALBIREO*SINGLE*FILE*LAYOUT*001*"; static const u64 REGION_MAGIC = 0x416c6252676e3031; /* 'AlbRgn01' */ #define MAGIC_SIZE (sizeof(LAYOUT_MAGIC) - 1) struct region_header { u64 magic; u64 region_blocks; u16 type; /* Currently always version 1 */ u16 version; u16 region_count; u16 payload; }; struct layout_region { u64 start_block; u64 block_count; u32 __unused; u16 kind; u16 instance; }; struct region_table { size_t encoded_size; struct region_header header; struct layout_region regions[]; }; struct index_save_data { u64 timestamp; u64 nonce; /* Currently always version 1 */ u32 version; u32 unused__; }; struct index_state_version { s32 signature; s32 version_id; }; static const struct index_state_version INDEX_STATE_VERSION_301 = { .signature = -1, .version_id = 301, }; struct index_state_data301 { struct index_state_version version; u64 newest_chapter; u64 oldest_chapter; u64 last_save; u32 unused; u32 padding; }; struct index_save_layout { unsigned int zone_count; struct layout_region index_save; struct layout_region header; struct layout_region index_page_map; struct layout_region free_space; struct layout_region volume_index_zones[MAX_ZONES]; struct layout_region open_chapter; struct index_save_data save_data; struct index_state_data301 state_data; }; struct sub_index_layout { u64 nonce; struct layout_region sub_index; struct layout_region volume; struct index_save_layout *saves; }; struct super_block_data { u8 magic_label[MAGIC_SIZE]; u8 nonce_info[NONCE_INFO_SIZE]; u64 nonce; u32 version; u32 block_size; u16 index_count; u16 max_saves; /* Padding reflects a blank field on permanent storage */ u8 padding[4]; u64 open_chapter_blocks; u64 page_map_blocks; u64 volume_offset; u64 start_offset; }; struct index_layout { struct io_factory *factory; size_t factory_size; off_t offset; struct super_block_data super; struct layout_region header; struct layout_region config; struct sub_index_layout index; struct layout_region seal; u64 total_blocks; }; struct save_layout_sizes { unsigned int save_count; size_t block_size; u64 volume_blocks; u64 volume_index_blocks; u64 page_map_blocks; u64 open_chapter_blocks; u64 save_blocks; u64 sub_index_blocks; u64 total_blocks; size_t total_size; }; static inline bool is_converted_super_block(struct super_block_data *super) { return super->version == 7; } static int __must_check compute_sizes(const struct uds_configuration *config, struct save_layout_sizes *sls) { int result; struct index_geometry *geometry = config->geometry; memset(sls, 0, sizeof(*sls)); sls->save_count = MAX_SAVES; sls->block_size = UDS_BLOCK_SIZE; sls->volume_blocks = geometry->bytes_per_volume / sls->block_size; result = uds_compute_volume_index_save_blocks(config, sls->block_size, &sls->volume_index_blocks); if (result != UDS_SUCCESS) return vdo_log_error_strerror(result, "cannot compute index save size"); sls->page_map_blocks = DIV_ROUND_UP(uds_compute_index_page_map_save_size(geometry), sls->block_size); sls->open_chapter_blocks = DIV_ROUND_UP(uds_compute_saved_open_chapter_size(geometry), sls->block_size); sls->save_blocks = 1 + (sls->volume_index_blocks + sls->page_map_blocks + sls->open_chapter_blocks); sls->sub_index_blocks = sls->volume_blocks + (sls->save_count * sls->save_blocks); sls->total_blocks = 3 + sls->sub_index_blocks; sls->total_size = sls->total_blocks * sls->block_size; return UDS_SUCCESS; } int uds_compute_index_size(const struct 
uds_parameters *parameters, u64 *index_size) { int result; struct uds_configuration *index_config; struct save_layout_sizes sizes; if (index_size == NULL) { vdo_log_error("Missing output size pointer"); return -EINVAL; } result = uds_make_configuration(parameters, &index_config); if (result != UDS_SUCCESS) { vdo_log_error_strerror(result, "cannot compute index size"); return uds_status_to_errno(result); } result = compute_sizes(index_config, &sizes); uds_free_configuration(index_config); if (result != UDS_SUCCESS) return uds_status_to_errno(result); *index_size = sizes.total_size; return UDS_SUCCESS; } /* Create unique data using the current time and a pseudorandom number. */ static void create_unique_nonce_data(u8 *buffer) { ktime_t now = current_time_ns(CLOCK_REALTIME); u32 rand; size_t offset = 0; get_random_bytes(&rand, sizeof(u32)); memcpy(buffer + offset, &now, sizeof(now)); offset += sizeof(now); memcpy(buffer + offset, &rand, sizeof(rand)); offset += sizeof(rand); while (offset < NONCE_INFO_SIZE) { size_t len = min(NONCE_INFO_SIZE - offset, offset); memcpy(buffer + offset, buffer, len); offset += len; } } static u64 hash_stuff(u64 start, const void *data, size_t len) { u32 seed = start ^ (start >> 27); u8 hash_buffer[16]; murmurhash3_128(data, len, seed, hash_buffer); return get_unaligned_le64(hash_buffer + 4); } /* Generate a primary nonce from the provided data. */ static u64 generate_primary_nonce(const void *data, size_t len) { return hash_stuff(0xa1b1e0fc, data, len); } /* * Deterministically generate a secondary nonce from an existing nonce and some arbitrary data by * hashing the original nonce and the data to produce a new nonce. */ static u64 generate_secondary_nonce(u64 nonce, const void *data, size_t len) { return hash_stuff(nonce + 1, data, len); } static int __must_check open_layout_reader(struct index_layout *layout, struct layout_region *lr, off_t offset, struct buffered_reader **reader_ptr) { return uds_make_buffered_reader(layout->factory, lr->start_block + offset, lr->block_count, reader_ptr); } static int open_region_reader(struct index_layout *layout, struct layout_region *region, struct buffered_reader **reader_ptr) { return open_layout_reader(layout, region, -layout->super.start_offset, reader_ptr); } static int __must_check open_layout_writer(struct index_layout *layout, struct layout_region *lr, off_t offset, struct buffered_writer **writer_ptr) { return uds_make_buffered_writer(layout->factory, lr->start_block + offset, lr->block_count, writer_ptr); } static int open_region_writer(struct index_layout *layout, struct layout_region *region, struct buffered_writer **writer_ptr) { return open_layout_writer(layout, region, -layout->super.start_offset, writer_ptr); } static void generate_super_block_data(struct save_layout_sizes *sls, struct super_block_data *super) { memset(super, 0, sizeof(*super)); memcpy(super->magic_label, LAYOUT_MAGIC, MAGIC_SIZE); create_unique_nonce_data(super->nonce_info); super->nonce = generate_primary_nonce(super->nonce_info, sizeof(super->nonce_info)); super->version = SUPER_VERSION_CURRENT; super->block_size = sls->block_size; super->index_count = 1; super->max_saves = sls->save_count; super->open_chapter_blocks = sls->open_chapter_blocks; super->page_map_blocks = sls->page_map_blocks; super->volume_offset = 0; super->start_offset = 0; } static void define_sub_index_nonce(struct index_layout *layout) { struct sub_index_nonce_data { u64 offset; u16 index_id; }; struct sub_index_layout *sil = &layout->index; u64 primary_nonce = 
layout->super.nonce; u8 buffer[sizeof(struct sub_index_nonce_data)] = { 0 }; size_t offset = 0; encode_u64_le(buffer, &offset, sil->sub_index.start_block); encode_u16_le(buffer, &offset, 0); sil->nonce = generate_secondary_nonce(primary_nonce, buffer, sizeof(buffer)); if (sil->nonce == 0) { sil->nonce = generate_secondary_nonce(~primary_nonce + 1, buffer, sizeof(buffer)); } } static void setup_sub_index(struct index_layout *layout, u64 start_block, struct save_layout_sizes *sls) { struct sub_index_layout *sil = &layout->index; u64 next_block = start_block; unsigned int i; sil->sub_index = (struct layout_region) { .start_block = start_block, .block_count = sls->sub_index_blocks, .kind = RL_KIND_INDEX, .instance = 0, }; sil->volume = (struct layout_region) { .start_block = next_block, .block_count = sls->volume_blocks, .kind = RL_KIND_VOLUME, .instance = RL_SOLE_INSTANCE, }; next_block += sls->volume_blocks; for (i = 0; i < sls->save_count; i++) { sil->saves[i].index_save = (struct layout_region) { .start_block = next_block, .block_count = sls->save_blocks, .kind = RL_KIND_SAVE, .instance = i, }; next_block += sls->save_blocks; } define_sub_index_nonce(layout); } static void initialize_layout(struct index_layout *layout, struct save_layout_sizes *sls) { u64 next_block = layout->offset / sls->block_size; layout->total_blocks = sls->total_blocks; generate_super_block_data(sls, &layout->super); layout->header = (struct layout_region) { .start_block = next_block++, .block_count = 1, .kind = RL_KIND_HEADER, .instance = RL_SOLE_INSTANCE, }; layout->config = (struct layout_region) { .start_block = next_block++, .block_count = 1, .kind = RL_KIND_CONFIG, .instance = RL_SOLE_INSTANCE, }; setup_sub_index(layout, next_block, sls); next_block += sls->sub_index_blocks; layout->seal = (struct layout_region) { .start_block = next_block, .block_count = 1, .kind = RL_KIND_SEAL, .instance = RL_SOLE_INSTANCE, }; } static int __must_check make_index_save_region_table(struct index_save_layout *isl, struct region_table **table_ptr) { int result; unsigned int z; struct region_table *table; struct layout_region *lr; u16 region_count; size_t payload; size_t type; if (isl->zone_count > 0) { /* * Normal save regions: header, page map, volume index zones, * open chapter, and possibly free space. */ region_count = 3 + isl->zone_count; if (isl->free_space.block_count > 0) region_count++; payload = sizeof(isl->save_data) + sizeof(isl->state_data); type = RH_TYPE_SAVE; } else { /* Empty save regions: header, page map, free space. 
*/ region_count = 3; payload = sizeof(isl->save_data); type = RH_TYPE_UNSAVED; } result = vdo_allocate_extended(struct region_table, region_count, struct layout_region, "layout region table for ISL", &table); if (result != VDO_SUCCESS) return result; lr = &table->regions[0]; *lr++ = isl->header; *lr++ = isl->index_page_map; for (z = 0; z < isl->zone_count; z++) *lr++ = isl->volume_index_zones[z]; if (isl->zone_count > 0) *lr++ = isl->open_chapter; if (isl->free_space.block_count > 0) *lr++ = isl->free_space; table->header = (struct region_header) { .magic = REGION_MAGIC, .region_blocks = isl->index_save.block_count, .type = type, .version = 1, .region_count = region_count, .payload = payload, }; table->encoded_size = (sizeof(struct region_header) + payload + region_count * sizeof(struct layout_region)); *table_ptr = table; return UDS_SUCCESS; } static void encode_region_table(u8 *buffer, size_t *offset, struct region_table *table) { unsigned int i; encode_u64_le(buffer, offset, REGION_MAGIC); encode_u64_le(buffer, offset, table->header.region_blocks); encode_u16_le(buffer, offset, table->header.type); encode_u16_le(buffer, offset, table->header.version); encode_u16_le(buffer, offset, table->header.region_count); encode_u16_le(buffer, offset, table->header.payload); for (i = 0; i < table->header.region_count; i++) { encode_u64_le(buffer, offset, table->regions[i].start_block); encode_u64_le(buffer, offset, table->regions[i].block_count); encode_u32_le(buffer, offset, 0); encode_u16_le(buffer, offset, table->regions[i].kind); encode_u16_le(buffer, offset, table->regions[i].instance); } } static int __must_check write_index_save_header(struct index_save_layout *isl, struct region_table *table, struct buffered_writer *writer) { int result; u8 *buffer; size_t offset = 0; result = vdo_allocate(table->encoded_size, u8, "index save data", &buffer); if (result != VDO_SUCCESS) return result; encode_region_table(buffer, &offset, table); encode_u64_le(buffer, &offset, isl->save_data.timestamp); encode_u64_le(buffer, &offset, isl->save_data.nonce); encode_u32_le(buffer, &offset, isl->save_data.version); encode_u32_le(buffer, &offset, 0); if (isl->zone_count > 0) { encode_u32_le(buffer, &offset, INDEX_STATE_VERSION_301.signature); encode_u32_le(buffer, &offset, INDEX_STATE_VERSION_301.version_id); encode_u64_le(buffer, &offset, isl->state_data.newest_chapter); encode_u64_le(buffer, &offset, isl->state_data.oldest_chapter); encode_u64_le(buffer, &offset, isl->state_data.last_save); encode_u64_le(buffer, &offset, 0); } result = uds_write_to_buffered_writer(writer, buffer, offset); vdo_free(buffer); if (result != UDS_SUCCESS) return result; return uds_flush_buffered_writer(writer); } static int write_index_save_layout(struct index_layout *layout, struct index_save_layout *isl) { int result; struct region_table *table; struct buffered_writer *writer; result = make_index_save_region_table(isl, &table); if (result != UDS_SUCCESS) return result; result = open_region_writer(layout, &isl->header, &writer); if (result != UDS_SUCCESS) { vdo_free(table); return result; } result = write_index_save_header(isl, table, writer); vdo_free(table); uds_free_buffered_writer(writer); return result; } static void reset_index_save_layout(struct index_save_layout *isl, u64 page_map_blocks) { u64 free_blocks; u64 next_block = isl->index_save.start_block; isl->zone_count = 0; memset(&isl->save_data, 0, sizeof(isl->save_data)); isl->header = (struct layout_region) { .start_block = next_block++, .block_count = 1, .kind = 
RL_KIND_HEADER, .instance = RL_SOLE_INSTANCE, }; isl->index_page_map = (struct layout_region) { .start_block = next_block, .block_count = page_map_blocks, .kind = RL_KIND_INDEX_PAGE_MAP, .instance = RL_SOLE_INSTANCE, }; next_block += page_map_blocks; free_blocks = isl->index_save.block_count - page_map_blocks - 1; isl->free_space = (struct layout_region) { .start_block = next_block, .block_count = free_blocks, .kind = RL_KIND_EMPTY, .instance = RL_SOLE_INSTANCE, }; } static int __must_check invalidate_old_save(struct index_layout *layout, struct index_save_layout *isl) { reset_index_save_layout(isl, layout->super.page_map_blocks); return write_index_save_layout(layout, isl); } static int discard_index_state_data(struct index_layout *layout) { int result; int saved_result = UDS_SUCCESS; unsigned int i; for (i = 0; i < layout->super.max_saves; i++) { result = invalidate_old_save(layout, &layout->index.saves[i]); if (result != UDS_SUCCESS) saved_result = result; } if (saved_result != UDS_SUCCESS) { return vdo_log_error_strerror(result, "%s: cannot destroy all index saves", __func__); } return UDS_SUCCESS; } static int __must_check make_layout_region_table(struct index_layout *layout, struct region_table **table_ptr) { int result; unsigned int i; /* Regions: header, config, index, volume, saves, seal */ u16 region_count = 5 + layout->super.max_saves; u16 payload; struct region_table *table; struct layout_region *lr; result = vdo_allocate_extended(struct region_table, region_count, struct layout_region, "layout region table", &table); if (result != VDO_SUCCESS) return result; lr = &table->regions[0]; *lr++ = layout->header; *lr++ = layout->config; *lr++ = layout->index.sub_index; *lr++ = layout->index.volume; for (i = 0; i < layout->super.max_saves; i++) *lr++ = layout->index.saves[i].index_save; *lr++ = layout->seal; if (is_converted_super_block(&layout->super)) { payload = sizeof(struct super_block_data); } else { payload = (sizeof(struct super_block_data) - sizeof(layout->super.volume_offset) - sizeof(layout->super.start_offset)); } table->header = (struct region_header) { .magic = REGION_MAGIC, .region_blocks = layout->total_blocks, .type = RH_TYPE_SUPER, .version = 1, .region_count = region_count, .payload = payload, }; table->encoded_size = (sizeof(struct region_header) + payload + region_count * sizeof(struct layout_region)); *table_ptr = table; return UDS_SUCCESS; } static int __must_check write_layout_header(struct index_layout *layout, struct region_table *table, struct buffered_writer *writer) { int result; u8 *buffer; size_t offset = 0; result = vdo_allocate(table->encoded_size, u8, "layout data", &buffer); if (result != VDO_SUCCESS) return result; encode_region_table(buffer, &offset, table); memcpy(buffer + offset, &layout->super.magic_label, MAGIC_SIZE); offset += MAGIC_SIZE; memcpy(buffer + offset, &layout->super.nonce_info, NONCE_INFO_SIZE); offset += NONCE_INFO_SIZE; encode_u64_le(buffer, &offset, layout->super.nonce); encode_u32_le(buffer, &offset, layout->super.version); encode_u32_le(buffer, &offset, layout->super.block_size); encode_u16_le(buffer, &offset, layout->super.index_count); encode_u16_le(buffer, &offset, layout->super.max_saves); encode_u32_le(buffer, &offset, 0); encode_u64_le(buffer, &offset, layout->super.open_chapter_blocks); encode_u64_le(buffer, &offset, layout->super.page_map_blocks); if (is_converted_super_block(&layout->super)) { encode_u64_le(buffer, &offset, layout->super.volume_offset); encode_u64_le(buffer, &offset, layout->super.start_offset); } 
result = uds_write_to_buffered_writer(writer, buffer, offset); vdo_free(buffer); if (result != UDS_SUCCESS) return result; return uds_flush_buffered_writer(writer); } static int __must_check write_uds_index_config(struct index_layout *layout, struct uds_configuration *config, off_t offset) { int result; struct buffered_writer *writer = NULL; result = open_layout_writer(layout, &layout->config, offset, &writer); if (result != UDS_SUCCESS) return vdo_log_error_strerror(result, "failed to open config region"); result = uds_write_config_contents(writer, config, layout->super.version); if (result != UDS_SUCCESS) { uds_free_buffered_writer(writer); return vdo_log_error_strerror(result, "failed to write config region"); } result = uds_flush_buffered_writer(writer); if (result != UDS_SUCCESS) { uds_free_buffered_writer(writer); return vdo_log_error_strerror(result, "cannot flush config writer"); } uds_free_buffered_writer(writer); return UDS_SUCCESS; } static int __must_check save_layout(struct index_layout *layout, off_t offset) { int result; struct buffered_writer *writer = NULL; struct region_table *table; result = make_layout_region_table(layout, &table); if (result != UDS_SUCCESS) return result; result = open_layout_writer(layout, &layout->header, offset, &writer); if (result != UDS_SUCCESS) { vdo_free(table); return result; } result = write_layout_header(layout, table, writer); vdo_free(table); uds_free_buffered_writer(writer); return result; } static int create_index_layout(struct index_layout *layout, struct uds_configuration *config) { int result; struct save_layout_sizes sizes; result = compute_sizes(config, &sizes); if (result != UDS_SUCCESS) return result; result = vdo_allocate(sizes.save_count, struct index_save_layout, __func__, &layout->index.saves); if (result != VDO_SUCCESS) return result; initialize_layout(layout, &sizes); result = discard_index_state_data(layout); if (result != UDS_SUCCESS) return result; result = write_uds_index_config(layout, config, 0); if (result != UDS_SUCCESS) return result; return save_layout(layout, 0); } static u64 generate_index_save_nonce(u64 volume_nonce, struct index_save_layout *isl) { struct save_nonce_data { struct index_save_data data; u64 offset; } nonce_data; u8 buffer[sizeof(nonce_data)]; size_t offset = 0; encode_u64_le(buffer, &offset, isl->save_data.timestamp); encode_u64_le(buffer, &offset, 0); encode_u32_le(buffer, &offset, isl->save_data.version); encode_u32_le(buffer, &offset, 0U); encode_u64_le(buffer, &offset, isl->index_save.start_block); VDO_ASSERT_LOG_ONLY(offset == sizeof(nonce_data), "%zu bytes encoded of %zu expected", offset, sizeof(nonce_data)); return generate_secondary_nonce(volume_nonce, buffer, sizeof(buffer)); } static u64 validate_index_save_layout(struct index_save_layout *isl, u64 volume_nonce) { if ((isl->zone_count == 0) || (isl->save_data.timestamp == 0)) return 0; if (isl->save_data.nonce != generate_index_save_nonce(volume_nonce, isl)) return 0; return isl->save_data.timestamp; } static int find_latest_uds_index_save_slot(struct index_layout *layout, struct index_save_layout **isl_ptr) { struct index_save_layout *latest = NULL; struct index_save_layout *isl; unsigned int i; u64 save_time = 0; u64 latest_time = 0; for (i = 0; i < layout->super.max_saves; i++) { isl = &layout->index.saves[i]; save_time = validate_index_save_layout(isl, layout->index.nonce); if (save_time > latest_time) { latest = isl; latest_time = save_time; } } if (latest == NULL) { vdo_log_error("No valid index save found"); return 
UDS_INDEX_NOT_SAVED_CLEANLY; } *isl_ptr = latest; return UDS_SUCCESS; } int uds_discard_open_chapter(struct index_layout *layout) { int result; struct index_save_layout *isl; struct buffered_writer *writer; result = find_latest_uds_index_save_slot(layout, &isl); if (result != UDS_SUCCESS) return result; result = open_region_writer(layout, &isl->open_chapter, &writer); if (result != UDS_SUCCESS) return result; result = uds_write_to_buffered_writer(writer, NULL, UDS_BLOCK_SIZE); if (result != UDS_SUCCESS) { uds_free_buffered_writer(writer); return result; } result = uds_flush_buffered_writer(writer); uds_free_buffered_writer(writer); return result; } int uds_load_index_state(struct index_layout *layout, struct uds_index *index) { int result; unsigned int zone; struct index_save_layout *isl; struct buffered_reader *readers[MAX_ZONES]; result = find_latest_uds_index_save_slot(layout, &isl); if (result != UDS_SUCCESS) return result; index->newest_virtual_chapter = isl->state_data.newest_chapter; index->oldest_virtual_chapter = isl->state_data.oldest_chapter; index->last_save = isl->state_data.last_save; result = open_region_reader(layout, &isl->open_chapter, &readers[0]); if (result != UDS_SUCCESS) return result; result = uds_load_open_chapter(index, readers[0]); uds_free_buffered_reader(readers[0]); if (result != UDS_SUCCESS) return result; for (zone = 0; zone < isl->zone_count; zone++) { result = open_region_reader(layout, &isl->volume_index_zones[zone], &readers[zone]); if (result != UDS_SUCCESS) { for (; zone > 0; zone--) uds_free_buffered_reader(readers[zone - 1]); return result; } } result = uds_load_volume_index(index->volume_index, readers, isl->zone_count); for (zone = 0; zone < isl->zone_count; zone++) uds_free_buffered_reader(readers[zone]); if (result != UDS_SUCCESS) return result; result = open_region_reader(layout, &isl->index_page_map, &readers[0]); if (result != UDS_SUCCESS) return result; result = uds_read_index_page_map(index->volume->index_page_map, readers[0]); uds_free_buffered_reader(readers[0]); return result; } static struct index_save_layout *select_oldest_index_save_layout(struct index_layout *layout) { struct index_save_layout *oldest = NULL; struct index_save_layout *isl; unsigned int i; u64 save_time = 0; u64 oldest_time = 0; for (i = 0; i < layout->super.max_saves; i++) { isl = &layout->index.saves[i]; save_time = validate_index_save_layout(isl, layout->index.nonce); if (oldest == NULL || save_time < oldest_time) { oldest = isl; oldest_time = save_time; } } return oldest; } static void instantiate_index_save_layout(struct index_save_layout *isl, struct super_block_data *super, u64 volume_nonce, unsigned int zone_count) { unsigned int z; u64 next_block; u64 free_blocks; u64 volume_index_blocks; isl->zone_count = zone_count; memset(&isl->save_data, 0, sizeof(isl->save_data)); isl->save_data.timestamp = ktime_to_ms(current_time_ns(CLOCK_REALTIME)); isl->save_data.version = 1; isl->save_data.nonce = generate_index_save_nonce(volume_nonce, isl); next_block = isl->index_save.start_block; isl->header = (struct layout_region) { .start_block = next_block++, .block_count = 1, .kind = RL_KIND_HEADER, .instance = RL_SOLE_INSTANCE, }; isl->index_page_map = (struct layout_region) { .start_block = next_block, .block_count = super->page_map_blocks, .kind = RL_KIND_INDEX_PAGE_MAP, .instance = RL_SOLE_INSTANCE, }; next_block += super->page_map_blocks; free_blocks = (isl->index_save.block_count - 1 - super->page_map_blocks - super->open_chapter_blocks); volume_index_blocks = 
free_blocks / isl->zone_count; for (z = 0; z < isl->zone_count; z++) { isl->volume_index_zones[z] = (struct layout_region) { .start_block = next_block, .block_count = volume_index_blocks, .kind = RL_KIND_VOLUME_INDEX, .instance = z, }; next_block += volume_index_blocks; free_blocks -= volume_index_blocks; } isl->open_chapter = (struct layout_region) { .start_block = next_block, .block_count = super->open_chapter_blocks, .kind = RL_KIND_OPEN_CHAPTER, .instance = RL_SOLE_INSTANCE, }; next_block += super->open_chapter_blocks; isl->free_space = (struct layout_region) { .start_block = next_block, .block_count = free_blocks, .kind = RL_KIND_EMPTY, .instance = RL_SOLE_INSTANCE, }; } static int setup_uds_index_save_slot(struct index_layout *layout, unsigned int zone_count, struct index_save_layout **isl_ptr) { int result; struct index_save_layout *isl; isl = select_oldest_index_save_layout(layout); result = invalidate_old_save(layout, isl); if (result != UDS_SUCCESS) return result; instantiate_index_save_layout(isl, &layout->super, layout->index.nonce, zone_count); *isl_ptr = isl; return UDS_SUCCESS; } static void cancel_uds_index_save(struct index_save_layout *isl) { memset(&isl->save_data, 0, sizeof(isl->save_data)); memset(&isl->state_data, 0, sizeof(isl->state_data)); isl->zone_count = 0; } int uds_save_index_state(struct index_layout *layout, struct uds_index *index) { int result; unsigned int zone; struct index_save_layout *isl; struct buffered_writer *writers[MAX_ZONES]; result = setup_uds_index_save_slot(layout, index->zone_count, &isl); if (result != UDS_SUCCESS) return result; isl->state_data = (struct index_state_data301) { .newest_chapter = index->newest_virtual_chapter, .oldest_chapter = index->oldest_virtual_chapter, .last_save = index->last_save, }; result = open_region_writer(layout, &isl->open_chapter, &writers[0]); if (result != UDS_SUCCESS) { cancel_uds_index_save(isl); return result; } result = uds_save_open_chapter(index, writers[0]); uds_free_buffered_writer(writers[0]); if (result != UDS_SUCCESS) { cancel_uds_index_save(isl); return result; } for (zone = 0; zone < index->zone_count; zone++) { result = open_region_writer(layout, &isl->volume_index_zones[zone], &writers[zone]); if (result != UDS_SUCCESS) { for (; zone > 0; zone--) uds_free_buffered_writer(writers[zone - 1]); cancel_uds_index_save(isl); return result; } } result = uds_save_volume_index(index->volume_index, writers, index->zone_count); for (zone = 0; zone < index->zone_count; zone++) uds_free_buffered_writer(writers[zone]); if (result != UDS_SUCCESS) { cancel_uds_index_save(isl); return result; } result = open_region_writer(layout, &isl->index_page_map, &writers[0]); if (result != UDS_SUCCESS) { cancel_uds_index_save(isl); return result; } result = uds_write_index_page_map(index->volume->index_page_map, writers[0]); uds_free_buffered_writer(writers[0]); if (result != UDS_SUCCESS) { cancel_uds_index_save(isl); return result; } return write_index_save_layout(layout, isl); } static int __must_check load_region_table(struct buffered_reader *reader, struct region_table **table_ptr) { int result; unsigned int i; struct region_header header; struct region_table *table; u8 buffer[sizeof(struct region_header)]; size_t offset = 0; result = uds_read_from_buffered_reader(reader, buffer, sizeof(buffer)); if (result != UDS_SUCCESS) return vdo_log_error_strerror(result, "cannot read region table header"); decode_u64_le(buffer, &offset, &header.magic); decode_u64_le(buffer, &offset, &header.region_blocks); decode_u16_le(buffer, 
&offset, &header.type); decode_u16_le(buffer, &offset, &header.version); decode_u16_le(buffer, &offset, &header.region_count); decode_u16_le(buffer, &offset, &header.payload); if (header.magic != REGION_MAGIC) return UDS_NO_INDEX; if (header.version != 1) { return vdo_log_error_strerror(UDS_UNSUPPORTED_VERSION, "unknown region table version %hu", header.version); } result = vdo_allocate_extended(struct region_table, header.region_count, struct layout_region, "single file layout region table", &table); if (result != VDO_SUCCESS) return result; table->header = header; for (i = 0; i < header.region_count; i++) { u8 region_buffer[sizeof(struct layout_region)]; offset = 0; result = uds_read_from_buffered_reader(reader, region_buffer, sizeof(region_buffer)); if (result != UDS_SUCCESS) { vdo_free(table); return vdo_log_error_strerror(UDS_CORRUPT_DATA, "cannot read region table layouts"); } decode_u64_le(region_buffer, &offset, &table->regions[i].start_block); decode_u64_le(region_buffer, &offset, &table->regions[i].block_count); offset += sizeof(u32); decode_u16_le(region_buffer, &offset, &table->regions[i].kind); decode_u16_le(region_buffer, &offset, &table->regions[i].instance); } *table_ptr = table; return UDS_SUCCESS; } static int __must_check read_super_block_data(struct buffered_reader *reader, struct index_layout *layout, size_t saved_size) { int result; struct super_block_data *super = &layout->super; u8 *buffer; size_t offset = 0; result = vdo_allocate(saved_size, u8, "super block data", &buffer); if (result != VDO_SUCCESS) return result; result = uds_read_from_buffered_reader(reader, buffer, saved_size); if (result != UDS_SUCCESS) { vdo_free(buffer); return vdo_log_error_strerror(result, "cannot read region table header"); } memcpy(&super->magic_label, buffer, MAGIC_SIZE); offset += MAGIC_SIZE; memcpy(&super->nonce_info, buffer + offset, NONCE_INFO_SIZE); offset += NONCE_INFO_SIZE; decode_u64_le(buffer, &offset, &super->nonce); decode_u32_le(buffer, &offset, &super->version); decode_u32_le(buffer, &offset, &super->block_size); decode_u16_le(buffer, &offset, &super->index_count); decode_u16_le(buffer, &offset, &super->max_saves); offset += sizeof(u32); decode_u64_le(buffer, &offset, &super->open_chapter_blocks); decode_u64_le(buffer, &offset, &super->page_map_blocks); if (is_converted_super_block(super)) { decode_u64_le(buffer, &offset, &super->volume_offset); decode_u64_le(buffer, &offset, &super->start_offset); } else { super->volume_offset = 0; super->start_offset = 0; } vdo_free(buffer); if (memcmp(super->magic_label, LAYOUT_MAGIC, MAGIC_SIZE) != 0) return vdo_log_error_strerror(UDS_CORRUPT_DATA, "unknown superblock magic label"); if ((super->version < SUPER_VERSION_MINIMUM) || (super->version == 4) || (super->version == 5) || (super->version == 6) || (super->version > SUPER_VERSION_MAXIMUM)) { return vdo_log_error_strerror(UDS_UNSUPPORTED_VERSION, "unknown superblock version number %u", super->version); } if (super->volume_offset < super->start_offset) { return vdo_log_error_strerror(UDS_CORRUPT_DATA, "inconsistent offsets (start %llu, volume %llu)", (unsigned long long) super->start_offset, (unsigned long long) super->volume_offset); } /* Sub-indexes are no longer used but the layout retains this field. 
*/ if (super->index_count != 1) { return vdo_log_error_strerror(UDS_CORRUPT_DATA, "invalid subindex count %u", super->index_count); } if (generate_primary_nonce(super->nonce_info, sizeof(super->nonce_info)) != super->nonce) { return vdo_log_error_strerror(UDS_CORRUPT_DATA, "inconsistent superblock nonce"); } return UDS_SUCCESS; } static int __must_check verify_region(struct layout_region *lr, u64 start_block, enum region_kind kind, unsigned int instance) { if (lr->start_block != start_block) return vdo_log_error_strerror(UDS_CORRUPT_DATA, "incorrect layout region offset"); if (lr->kind != kind) return vdo_log_error_strerror(UDS_CORRUPT_DATA, "incorrect layout region kind"); if (lr->instance != instance) { return vdo_log_error_strerror(UDS_CORRUPT_DATA, "incorrect layout region instance"); } return UDS_SUCCESS; } static int __must_check verify_sub_index(struct index_layout *layout, u64 start_block, struct region_table *table) { int result; unsigned int i; struct sub_index_layout *sil = &layout->index; u64 next_block = start_block; sil->sub_index = table->regions[2]; result = verify_region(&sil->sub_index, next_block, RL_KIND_INDEX, 0); if (result != UDS_SUCCESS) return result; define_sub_index_nonce(layout); sil->volume = table->regions[3]; result = verify_region(&sil->volume, next_block, RL_KIND_VOLUME, RL_SOLE_INSTANCE); if (result != UDS_SUCCESS) return result; next_block += sil->volume.block_count + layout->super.volume_offset; for (i = 0; i < layout->super.max_saves; i++) { sil->saves[i].index_save = table->regions[i + 4]; result = verify_region(&sil->saves[i].index_save, next_block, RL_KIND_SAVE, i); if (result != UDS_SUCCESS) return result; next_block += sil->saves[i].index_save.block_count; } next_block -= layout->super.volume_offset; if (next_block != start_block + sil->sub_index.block_count) { return vdo_log_error_strerror(UDS_CORRUPT_DATA, "sub index region does not span all saves"); } return UDS_SUCCESS; } static int __must_check reconstitute_layout(struct index_layout *layout, struct region_table *table, u64 first_block) { int result; u64 next_block = first_block; result = vdo_allocate(layout->super.max_saves, struct index_save_layout, __func__, &layout->index.saves); if (result != VDO_SUCCESS) return result; layout->total_blocks = table->header.region_blocks; layout->header = table->regions[0]; result = verify_region(&layout->header, next_block++, RL_KIND_HEADER, RL_SOLE_INSTANCE); if (result != UDS_SUCCESS) return result; layout->config = table->regions[1]; result = verify_region(&layout->config, next_block++, RL_KIND_CONFIG, RL_SOLE_INSTANCE); if (result != UDS_SUCCESS) return result; result = verify_sub_index(layout, next_block, table); if (result != UDS_SUCCESS) return result; next_block += layout->index.sub_index.block_count; layout->seal = table->regions[table->header.region_count - 1]; result = verify_region(&layout->seal, next_block + layout->super.volume_offset, RL_KIND_SEAL, RL_SOLE_INSTANCE); if (result != UDS_SUCCESS) return result; if (++next_block != (first_block + layout->total_blocks)) { return vdo_log_error_strerror(UDS_CORRUPT_DATA, "layout table does not span total blocks"); } return UDS_SUCCESS; } static int __must_check load_super_block(struct index_layout *layout, size_t block_size, u64 first_block, struct buffered_reader *reader) { int result; struct region_table *table = NULL; struct super_block_data *super = &layout->super; result = load_region_table(reader, &table); if (result != UDS_SUCCESS) return result; if (table->header.type != RH_TYPE_SUPER) { 
vdo_free(table); return vdo_log_error_strerror(UDS_CORRUPT_DATA, "not a superblock region table"); } result = read_super_block_data(reader, layout, table->header.payload); if (result != UDS_SUCCESS) { vdo_free(table); return vdo_log_error_strerror(result, "unknown superblock format"); } if (super->block_size != block_size) { vdo_free(table); return vdo_log_error_strerror(UDS_CORRUPT_DATA, "superblock saved block_size %u differs from supplied block_size %zu", super->block_size, block_size); } first_block -= (super->volume_offset - super->start_offset); result = reconstitute_layout(layout, table, first_block); vdo_free(table); return result; } static int __must_check read_index_save_data(struct buffered_reader *reader, struct index_save_layout *isl, size_t saved_size) { int result; struct index_state_version file_version; u8 buffer[sizeof(struct index_save_data) + sizeof(struct index_state_data301)]; size_t offset = 0; if (saved_size != sizeof(buffer)) { return vdo_log_error_strerror(UDS_CORRUPT_DATA, "unexpected index save data size %zu", saved_size); } result = uds_read_from_buffered_reader(reader, buffer, sizeof(buffer)); if (result != UDS_SUCCESS) return vdo_log_error_strerror(result, "cannot read index save data"); decode_u64_le(buffer, &offset, &isl->save_data.timestamp); decode_u64_le(buffer, &offset, &isl->save_data.nonce); decode_u32_le(buffer, &offset, &isl->save_data.version); offset += sizeof(u32); if (isl->save_data.version > 1) { return vdo_log_error_strerror(UDS_UNSUPPORTED_VERSION, "unknown index save version number %u", isl->save_data.version); } decode_s32_le(buffer, &offset, &file_version.signature); decode_s32_le(buffer, &offset, &file_version.version_id); if ((file_version.signature != INDEX_STATE_VERSION_301.signature) || (file_version.version_id != INDEX_STATE_VERSION_301.version_id)) { return vdo_log_error_strerror(UDS_UNSUPPORTED_VERSION, "index state version %d,%d is unsupported", file_version.signature, file_version.version_id); } decode_u64_le(buffer, &offset, &isl->state_data.newest_chapter); decode_u64_le(buffer, &offset, &isl->state_data.oldest_chapter); decode_u64_le(buffer, &offset, &isl->state_data.last_save); /* Skip past some historical fields that are now unused */ offset += sizeof(u32) + sizeof(u32); return UDS_SUCCESS; } static int __must_check reconstruct_index_save(struct index_save_layout *isl, struct region_table *table) { int result; unsigned int z; struct layout_region *last_region; u64 next_block = isl->index_save.start_block; u64 last_block = next_block + isl->index_save.block_count; isl->zone_count = table->header.region_count - 3; last_region = &table->regions[table->header.region_count - 1]; if (last_region->kind == RL_KIND_EMPTY) { isl->free_space = *last_region; isl->zone_count--; } else { isl->free_space = (struct layout_region) { .start_block = last_block, .block_count = 0, .kind = RL_KIND_EMPTY, .instance = RL_SOLE_INSTANCE, }; } isl->header = table->regions[0]; result = verify_region(&isl->header, next_block++, RL_KIND_HEADER, RL_SOLE_INSTANCE); if (result != UDS_SUCCESS) return result; isl->index_page_map = table->regions[1]; result = verify_region(&isl->index_page_map, next_block, RL_KIND_INDEX_PAGE_MAP, RL_SOLE_INSTANCE); if (result != UDS_SUCCESS) return result; next_block += isl->index_page_map.block_count; for (z = 0; z < isl->zone_count; z++) { isl->volume_index_zones[z] = table->regions[z + 2]; result = verify_region(&isl->volume_index_zones[z], next_block, RL_KIND_VOLUME_INDEX, z); if (result != UDS_SUCCESS) return result; 
next_block += isl->volume_index_zones[z].block_count; } isl->open_chapter = table->regions[isl->zone_count + 2]; result = verify_region(&isl->open_chapter, next_block, RL_KIND_OPEN_CHAPTER, RL_SOLE_INSTANCE); if (result != UDS_SUCCESS) return result; next_block += isl->open_chapter.block_count; result = verify_region(&isl->free_space, next_block, RL_KIND_EMPTY, RL_SOLE_INSTANCE); if (result != UDS_SUCCESS) return result; next_block += isl->free_space.block_count; if (next_block != last_block) { return vdo_log_error_strerror(UDS_CORRUPT_DATA, "index save layout table incomplete"); } return UDS_SUCCESS; } static int __must_check load_index_save(struct index_save_layout *isl, struct buffered_reader *reader, unsigned int instance) { int result; struct region_table *table = NULL; result = load_region_table(reader, &table); if (result != UDS_SUCCESS) { return vdo_log_error_strerror(result, "cannot read index save %u header", instance); } if (table->header.region_blocks != isl->index_save.block_count) { u64 region_blocks = table->header.region_blocks; vdo_free(table); return vdo_log_error_strerror(UDS_CORRUPT_DATA, "unexpected index save %u region block count %llu", instance, (unsigned long long) region_blocks); } if (table->header.type == RH_TYPE_UNSAVED) { vdo_free(table); reset_index_save_layout(isl, 0); return UDS_SUCCESS; } if (table->header.type != RH_TYPE_SAVE) { vdo_log_error_strerror(UDS_CORRUPT_DATA, "unexpected index save %u header type %u", instance, table->header.type); vdo_free(table); return UDS_CORRUPT_DATA; } result = read_index_save_data(reader, isl, table->header.payload); if (result != UDS_SUCCESS) { vdo_free(table); return vdo_log_error_strerror(result, "unknown index save %u data format", instance); } result = reconstruct_index_save(isl, table); vdo_free(table); if (result != UDS_SUCCESS) { return vdo_log_error_strerror(result, "cannot reconstruct index save %u", instance); } return UDS_SUCCESS; } static int __must_check load_sub_index_regions(struct index_layout *layout) { int result; unsigned int j; struct index_save_layout *isl; struct buffered_reader *reader; for (j = 0; j < layout->super.max_saves; j++) { isl = &layout->index.saves[j]; result = open_region_reader(layout, &isl->index_save, &reader); if (result != UDS_SUCCESS) { vdo_log_error_strerror(result, "cannot get reader for index 0 save %u", j); return result; } result = load_index_save(isl, reader, j); uds_free_buffered_reader(reader); if (result != UDS_SUCCESS) { /* Another save slot might be valid. 
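			 * Reset this slot and keep scanning the remaining save slots rather
			 * than failing the whole load.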
*/ reset_index_save_layout(isl, 0); continue; } } return UDS_SUCCESS; } static int __must_check verify_uds_index_config(struct index_layout *layout, struct uds_configuration *config) { int result; struct buffered_reader *reader = NULL; u64 offset; offset = layout->super.volume_offset - layout->super.start_offset; result = open_layout_reader(layout, &layout->config, offset, &reader); if (result != UDS_SUCCESS) return vdo_log_error_strerror(result, "failed to open config reader"); result = uds_validate_config_contents(reader, config); if (result != UDS_SUCCESS) { uds_free_buffered_reader(reader); return vdo_log_error_strerror(result, "failed to read config region"); } uds_free_buffered_reader(reader); return UDS_SUCCESS; } static int load_index_layout(struct index_layout *layout, struct uds_configuration *config) { int result; struct buffered_reader *reader; result = uds_make_buffered_reader(layout->factory, layout->offset / UDS_BLOCK_SIZE, 1, &reader); if (result != UDS_SUCCESS) return vdo_log_error_strerror(result, "unable to read superblock"); result = load_super_block(layout, UDS_BLOCK_SIZE, layout->offset / UDS_BLOCK_SIZE, reader); uds_free_buffered_reader(reader); if (result != UDS_SUCCESS) return result; result = verify_uds_index_config(layout, config); if (result != UDS_SUCCESS) return result; return load_sub_index_regions(layout); } static int create_layout_factory(struct index_layout *layout, const struct uds_configuration *config) { int result; size_t writable_size; struct io_factory *factory = NULL; result = uds_make_io_factory(config->bdev, &factory); if (result != UDS_SUCCESS) return result; writable_size = uds_get_writable_size(factory) & -UDS_BLOCK_SIZE; if (writable_size < config->size + config->offset) { uds_put_io_factory(factory); vdo_log_error("index storage (%zu) is smaller than the requested size %zu", writable_size, config->size + config->offset); return -ENOSPC; } layout->factory = factory; layout->factory_size = (config->size > 0) ? config->size : writable_size; layout->offset = config->offset; return UDS_SUCCESS; } int uds_make_index_layout(struct uds_configuration *config, bool new_layout, struct index_layout **layout_ptr) { int result; struct index_layout *layout = NULL; struct save_layout_sizes sizes; result = compute_sizes(config, &sizes); if (result != UDS_SUCCESS) return result; result = vdo_allocate(1, struct index_layout, __func__, &layout); if (result != VDO_SUCCESS) return result; result = create_layout_factory(layout, config); if (result != UDS_SUCCESS) { uds_free_index_layout(layout); return result; } if (layout->factory_size < sizes.total_size) { vdo_log_error("index storage (%zu) is smaller than the required size %llu", layout->factory_size, (unsigned long long) sizes.total_size); uds_free_index_layout(layout); return -ENOSPC; } if (new_layout) result = create_index_layout(layout, config); else result = load_index_layout(layout, config); if (result != UDS_SUCCESS) { uds_free_index_layout(layout); return result; } *layout_ptr = layout; return UDS_SUCCESS; } void uds_free_index_layout(struct index_layout *layout) { if (layout == NULL) return; vdo_free(layout->index.saves); if (layout->factory != NULL) uds_put_io_factory(layout->factory); vdo_free(layout); } int uds_replace_index_layout_storage(struct index_layout *layout, struct block_device *bdev) { return uds_replace_storage(layout->factory, bdev); } /* Obtain a dm_bufio_client for the volume region. 
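 * The offset below adds (volume_offset - start_offset) so that an index which has been
 * converted or moved reads the volume from its correct physical location.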
*/ int uds_open_volume_bufio(struct index_layout *layout, size_t block_size, unsigned int reserved_buffers, struct dm_bufio_client **client_ptr) { off_t offset = (layout->index.volume.start_block + layout->super.volume_offset - layout->super.start_offset); return uds_make_bufio(layout->factory, offset, block_size, reserved_buffers, client_ptr); } u64 uds_get_volume_nonce(struct index_layout *layout) { return layout->index.nonce; } vdo-8.3.1.1/utils/uds/index-layout.h000066400000000000000000000025311476467262700172260ustar00rootroot00000000000000/* SPDX-License-Identifier: GPL-2.0-only */ /* * Copyright 2023 Red Hat */ #ifndef UDS_INDEX_LAYOUT_H #define UDS_INDEX_LAYOUT_H #include "config.h" #include "indexer.h" #include "io-factory.h" /* * The index layout describes the format of the index on the underlying storage, and is responsible * for creating those structures when the index is first created. It also validates the index data * when loading a saved index, and updates it when saving the index. */ struct index_layout; int __must_check uds_make_index_layout(struct uds_configuration *config, bool new_layout, struct index_layout **layout_ptr); void uds_free_index_layout(struct index_layout *layout); int __must_check uds_replace_index_layout_storage(struct index_layout *layout, struct block_device *bdev); int __must_check uds_load_index_state(struct index_layout *layout, struct uds_index *index); int __must_check uds_save_index_state(struct index_layout *layout, struct uds_index *index); int __must_check uds_discard_open_chapter(struct index_layout *layout); u64 __must_check uds_get_volume_nonce(struct index_layout *layout); int __must_check uds_open_volume_bufio(struct index_layout *layout, size_t block_size, unsigned int reserved_buffers, struct dm_bufio_client **client_ptr); #endif /* UDS_INDEX_LAYOUT_H */ vdo-8.3.1.1/utils/uds/index-page-map.c000066400000000000000000000114271476467262700173770ustar00rootroot00000000000000// SPDX-License-Identifier: GPL-2.0-only /* * Copyright 2023 Red Hat */ #include "index-page-map.h" #include "errors.h" #include "logger.h" #include "memory-alloc.h" #include "numeric.h" #include "permassert.h" #include "string-utils.h" #include "thread-utils.h" #include "hash-utils.h" #include "indexer.h" /* * The index page map is conceptually a two-dimensional array indexed by chapter number and index * page number within the chapter. Each entry contains the number of the last delta list on that * index page. In order to save memory, the information for the last page in each chapter is not * recorded, as it is known from the geometry. 
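 *
 * For example, with four index pages per chapter only three entries are stored for each
 * chapter. If a chapter's entries are {10, 25, 40}, then uds_find_index_page_number()
 * returns page 0 for delta list 7, page 2 for delta list 30, and page 3 (the unrecorded
 * final page) for delta list 50, since the search stops at the first entry that is greater
 * than or equal to the requested delta list number.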
*/ static const u8 PAGE_MAP_MAGIC[] = "ALBIPM02"; #define PAGE_MAP_MAGIC_LENGTH (sizeof(PAGE_MAP_MAGIC) - 1) static inline u32 get_entry_count(const struct index_geometry *geometry) { return geometry->chapters_per_volume * (geometry->index_pages_per_chapter - 1); } int uds_make_index_page_map(const struct index_geometry *geometry, struct index_page_map **map_ptr) { int result; struct index_page_map *map; result = vdo_allocate(1, struct index_page_map, "page map", &map); if (result != VDO_SUCCESS) return result; map->geometry = geometry; map->entries_per_chapter = geometry->index_pages_per_chapter - 1; result = vdo_allocate(get_entry_count(geometry), u16, "Index Page Map Entries", &map->entries); if (result != VDO_SUCCESS) { uds_free_index_page_map(map); return result; } *map_ptr = map; return UDS_SUCCESS; } void uds_free_index_page_map(struct index_page_map *map) { if (map != NULL) { vdo_free(map->entries); vdo_free(map); } } void uds_update_index_page_map(struct index_page_map *map, u64 virtual_chapter_number, u32 chapter_number, u32 index_page_number, u32 delta_list_number) { size_t slot; map->last_update = virtual_chapter_number; if (index_page_number == map->entries_per_chapter) return; slot = (chapter_number * map->entries_per_chapter) + index_page_number; map->entries[slot] = delta_list_number; } u32 uds_find_index_page_number(const struct index_page_map *map, const struct uds_record_name *name, u32 chapter_number) { u32 delta_list_number = uds_hash_to_chapter_delta_list(name, map->geometry); u32 slot = chapter_number * map->entries_per_chapter; u32 page; for (page = 0; page < map->entries_per_chapter; page++) { if (delta_list_number <= map->entries[slot + page]) break; } return page; } void uds_get_list_number_bounds(const struct index_page_map *map, u32 chapter_number, u32 index_page_number, u32 *lowest_list, u32 *highest_list) { u32 slot = chapter_number * map->entries_per_chapter; *lowest_list = ((index_page_number == 0) ? 0 : map->entries[slot + index_page_number - 1] + 1); *highest_list = ((index_page_number < map->entries_per_chapter) ? 
map->entries[slot + index_page_number] : map->geometry->delta_lists_per_chapter - 1); } u64 uds_compute_index_page_map_save_size(const struct index_geometry *geometry) { return PAGE_MAP_MAGIC_LENGTH + sizeof(u64) + sizeof(u16) * get_entry_count(geometry); } int uds_write_index_page_map(struct index_page_map *map, struct buffered_writer *writer) { int result; u8 *buffer; size_t offset = 0; u64 saved_size = uds_compute_index_page_map_save_size(map->geometry); u32 i; result = vdo_allocate(saved_size, u8, "page map data", &buffer); if (result != VDO_SUCCESS) return result; memcpy(buffer, PAGE_MAP_MAGIC, PAGE_MAP_MAGIC_LENGTH); offset += PAGE_MAP_MAGIC_LENGTH; encode_u64_le(buffer, &offset, map->last_update); for (i = 0; i < get_entry_count(map->geometry); i++) encode_u16_le(buffer, &offset, map->entries[i]); result = uds_write_to_buffered_writer(writer, buffer, offset); vdo_free(buffer); if (result != UDS_SUCCESS) return result; return uds_flush_buffered_writer(writer); } int uds_read_index_page_map(struct index_page_map *map, struct buffered_reader *reader) { int result; u8 magic[PAGE_MAP_MAGIC_LENGTH]; u8 *buffer; size_t offset = 0; u64 saved_size = uds_compute_index_page_map_save_size(map->geometry); u32 i; result = vdo_allocate(saved_size, u8, "page map data", &buffer); if (result != VDO_SUCCESS) return result; result = uds_read_from_buffered_reader(reader, buffer, saved_size); if (result != UDS_SUCCESS) { vdo_free(buffer); return result; } memcpy(&magic, buffer, PAGE_MAP_MAGIC_LENGTH); offset += PAGE_MAP_MAGIC_LENGTH; if (memcmp(magic, PAGE_MAP_MAGIC, PAGE_MAP_MAGIC_LENGTH) != 0) { vdo_free(buffer); return UDS_CORRUPT_DATA; } decode_u64_le(buffer, &offset, &map->last_update); for (i = 0; i < get_entry_count(map->geometry); i++) decode_u16_le(buffer, &offset, &map->entries[i]); vdo_free(buffer); vdo_log_debug("read index page map, last update %llu", (unsigned long long) map->last_update); return UDS_SUCCESS; } vdo-8.3.1.1/utils/uds/index-page-map.h000066400000000000000000000030131476467262700173740ustar00rootroot00000000000000/* SPDX-License-Identifier: GPL-2.0-only */ /* * Copyright 2023 Red Hat */ #ifndef UDS_INDEX_PAGE_MAP_H #define UDS_INDEX_PAGE_MAP_H #include "geometry.h" #include "io-factory.h" /* * The index maintains a page map which records how the chapter delta lists are distributed among * the index pages for each chapter, allowing the volume to be efficient about reading only pages * that it knows it will need. 
*/ struct index_page_map { const struct index_geometry *geometry; u64 last_update; u32 entries_per_chapter; u16 *entries; }; int __must_check uds_make_index_page_map(const struct index_geometry *geometry, struct index_page_map **map_ptr); void uds_free_index_page_map(struct index_page_map *map); int __must_check uds_read_index_page_map(struct index_page_map *map, struct buffered_reader *reader); int __must_check uds_write_index_page_map(struct index_page_map *map, struct buffered_writer *writer); void uds_update_index_page_map(struct index_page_map *map, u64 virtual_chapter_number, u32 chapter_number, u32 index_page_number, u32 delta_list_number); u32 __must_check uds_find_index_page_number(const struct index_page_map *map, const struct uds_record_name *name, u32 chapter_number); void uds_get_list_number_bounds(const struct index_page_map *map, u32 chapter_number, u32 index_page_number, u32 *lowest_list, u32 *highest_list); u64 uds_compute_index_page_map_save_size(const struct index_geometry *geometry); #endif /* UDS_INDEX_PAGE_MAP_H */ vdo-8.3.1.1/utils/uds/index-session.c000066400000000000000000000535361476467262700174020ustar00rootroot00000000000000// SPDX-License-Identifier: GPL-2.0-only /* * Copyright 2023 Red Hat */ #include "index-session.h" #include #include "logger.h" #include "memory-alloc.h" #include "time-utils.h" #include "funnel-requestqueue.h" #include "index.h" #include "index-layout.h" /* * The index session contains a lock (the request_mutex) which ensures that only one thread can * change the state of its index at a time. The state field indicates the current state of the * index through a set of descriptive flags. The request_mutex must be notified whenever a * non-transient state flag is cleared. The request_mutex is also used to count the number of * requests currently in progress so that they can be drained when suspending or closing the index. * * If the index session is suspended shortly after opening an index, it may have to suspend during * a rebuild. Depending on the size of the index, a rebuild may take a significant amount of time, * so UDS allows the rebuild to be paused in order to suspend the session in a timely manner. When * the index session is resumed, the rebuild can continue from where it left off. If the index * session is shut down with a suspended rebuild, the rebuild progress is abandoned and the rebuild * will start from the beginning the next time the index is loaded. The mutex and status fields in * the index_load_context are used to record the state of any interrupted rebuild. */ enum index_session_flag_bit { IS_FLAG_BIT_START = 8, /* The session has started loading an index but not completed it. */ IS_FLAG_BIT_LOADING = IS_FLAG_BIT_START, /* The session has loaded an index, which can handle requests. */ IS_FLAG_BIT_LOADED, /* The session's index has been permanently disabled. */ IS_FLAG_BIT_DISABLED, /* The session's index is suspended. */ IS_FLAG_BIT_SUSPENDED, /* The session is handling some index state change. */ IS_FLAG_BIT_WAITING, /* The session's index is closing and draining requests. */ IS_FLAG_BIT_CLOSING, /* The session is being destroyed and is draining requests. 
*/ IS_FLAG_BIT_DESTROYING, }; enum index_session_flag { IS_FLAG_LOADED = (1 << IS_FLAG_BIT_LOADED), IS_FLAG_LOADING = (1 << IS_FLAG_BIT_LOADING), IS_FLAG_DISABLED = (1 << IS_FLAG_BIT_DISABLED), IS_FLAG_SUSPENDED = (1 << IS_FLAG_BIT_SUSPENDED), IS_FLAG_WAITING = (1 << IS_FLAG_BIT_WAITING), IS_FLAG_CLOSING = (1 << IS_FLAG_BIT_CLOSING), IS_FLAG_DESTROYING = (1 << IS_FLAG_BIT_DESTROYING), }; /* Release a reference to an index session. */ static void release_index_session(struct uds_index_session *index_session) { mutex_lock(&index_session->request_mutex); if (--index_session->request_count == 0) uds_broadcast_cond(&index_session->request_cond); mutex_unlock(&index_session->request_mutex); } /* * Acquire a reference to the index session for an asynchronous index request. The reference must * eventually be released with a corresponding call to release_index_session(). */ static int get_index_session(struct uds_index_session *index_session) { unsigned int state; int result = UDS_SUCCESS; mutex_lock(&index_session->request_mutex); index_session->request_count++; state = index_session->state; mutex_unlock(&index_session->request_mutex); if (state == IS_FLAG_LOADED) { return UDS_SUCCESS; } else if (state & IS_FLAG_DISABLED) { result = UDS_DISABLED; } else if ((state & IS_FLAG_LOADING) || (state & IS_FLAG_SUSPENDED) || (state & IS_FLAG_WAITING)) { result = -EBUSY; } else { result = UDS_NO_INDEX; } release_index_session(index_session); return result; } int uds_launch_request(struct uds_request *request) { size_t internal_size; int result; if (request->callback == NULL) { vdo_log_error("missing required callback"); return -EINVAL; } switch (request->type) { case UDS_DELETE: case UDS_POST: case UDS_QUERY: case UDS_QUERY_NO_UPDATE: case UDS_UPDATE: break; default: vdo_log_error("received invalid callback type"); return -EINVAL; } /* Reset all internal fields before processing. 
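	 * The fields from zone_number onward are the library-internal portion of struct
	 * uds_request; the size arithmetic below clears exactly that tail of the structure
	 * before the request is enqueued.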
*/ internal_size = sizeof(struct uds_request) - offsetof(struct uds_request, zone_number); // FIXME should be using struct_group for this instead memset((char *) request + sizeof(*request) - internal_size, 0, internal_size); result = get_index_session(request->session); if (result != UDS_SUCCESS) return result; request->found = false; request->unbatched = false; request->index = request->session->index; uds_enqueue_request(request, STAGE_TRIAGE); return UDS_SUCCESS; } static void enter_callback_stage(struct uds_request *request) { if (request->status != UDS_SUCCESS) { /* All request errors are considered unrecoverable */ mutex_lock(&request->session->request_mutex); request->session->state |= IS_FLAG_DISABLED; mutex_unlock(&request->session->request_mutex); } uds_request_queue_enqueue(request->session->callback_queue, request); } static inline void count_once(u64 *count_ptr) { WRITE_ONCE(*count_ptr, READ_ONCE(*count_ptr) + 1); } static void update_session_stats(struct uds_request *request) { struct session_stats *session_stats = &request->session->stats; count_once(&session_stats->requests); switch (request->type) { case UDS_POST: if (request->found) count_once(&session_stats->posts_found); else count_once(&session_stats->posts_not_found); if (request->location == UDS_LOCATION_IN_OPEN_CHAPTER) count_once(&session_stats->posts_found_open_chapter); else if (request->location == UDS_LOCATION_IN_DENSE) count_once(&session_stats->posts_found_dense); else if (request->location == UDS_LOCATION_IN_SPARSE) count_once(&session_stats->posts_found_sparse); break; case UDS_UPDATE: if (request->found) count_once(&session_stats->updates_found); else count_once(&session_stats->updates_not_found); break; case UDS_DELETE: if (request->found) count_once(&session_stats->deletions_found); else count_once(&session_stats->deletions_not_found); break; case UDS_QUERY: case UDS_QUERY_NO_UPDATE: if (request->found) count_once(&session_stats->queries_found); else count_once(&session_stats->queries_not_found); break; default: request->status = VDO_ASSERT(false, "unknown request type: %d", request->type); } } static void handle_callbacks(struct uds_request *request) { struct uds_index_session *index_session = request->session; if (request->status == UDS_SUCCESS) update_session_stats(request); request->status = uds_status_to_errno(request->status); request->callback(request); release_index_session(index_session); } static int __must_check make_empty_index_session(struct uds_index_session **index_session_ptr) { int result; struct uds_index_session *session; result = vdo_allocate(1, struct uds_index_session, __func__, &session); if (result != VDO_SUCCESS) return result; mutex_init(&session->request_mutex); uds_init_cond(&session->request_cond); mutex_init(&session->load_context.mutex); uds_init_cond(&session->load_context.cond); result = uds_make_request_queue("callbackW", &handle_callbacks, &session->callback_queue); if (result != UDS_SUCCESS) { uds_destroy_cond(&session->load_context.cond); mutex_destroy(&session->load_context.mutex); uds_destroy_cond(&session->request_cond); mutex_destroy(&session->request_mutex); vdo_free(session); return result; } *index_session_ptr = session; return UDS_SUCCESS; } int uds_create_index_session(struct uds_index_session **session) { if (session == NULL) { vdo_log_error("missing session pointer"); return -EINVAL; } return uds_status_to_errno(make_empty_index_session(session)); } static int __must_check start_loading_index_session(struct uds_index_session *index_session) { int result; 
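	/*
	 * Mark the session as loading, but only if it is not suspended and has no index
	 * loaded or loading already; all state changes are made under the request_mutex.
	 */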
mutex_lock(&index_session->request_mutex); if (index_session->state & IS_FLAG_SUSPENDED) { vdo_log_info("Index session is suspended"); result = -EBUSY; } else if (index_session->state != 0) { vdo_log_info("Index is already loaded"); result = -EBUSY; } else { index_session->state |= IS_FLAG_LOADING; result = UDS_SUCCESS; } mutex_unlock(&index_session->request_mutex); return result; } static void finish_loading_index_session(struct uds_index_session *index_session, int result) { mutex_lock(&index_session->request_mutex); index_session->state &= ~IS_FLAG_LOADING; if (result == UDS_SUCCESS) index_session->state |= IS_FLAG_LOADED; uds_broadcast_cond(&index_session->request_cond); mutex_unlock(&index_session->request_mutex); } static int initialize_index_session(struct uds_index_session *index_session, enum uds_open_index_type open_type) { int result; struct uds_configuration *config; result = uds_make_configuration(&index_session->parameters, &config); if (result != UDS_SUCCESS) { vdo_log_error_strerror(result, "Failed to allocate config"); return result; } memset(&index_session->stats, 0, sizeof(index_session->stats)); result = uds_make_index(config, open_type, &index_session->load_context, enter_callback_stage, &index_session->index); if (result != UDS_SUCCESS) vdo_log_error_strerror(result, "Failed to make index"); else uds_log_configuration(config); uds_free_configuration(config); return result; } static const char *get_open_type_string(enum uds_open_index_type open_type) { switch (open_type) { case UDS_CREATE: return "creating index"; case UDS_LOAD: return "loading or rebuilding index"; case UDS_NO_REBUILD: return "loading index"; default: return "unknown open method"; } } /* * Open an index under the given session. This operation will fail if the * index session is suspended, or if there is already an open index. */ int uds_open_index(enum uds_open_index_type open_type, const struct uds_parameters *parameters, struct uds_index_session *session) { int result; char name[BDEVNAME_SIZE]; if (parameters == NULL) { vdo_log_error("missing required parameters"); return -EINVAL; } if (parameters->bdev == NULL) { vdo_log_error("missing required block device"); return -EINVAL; } if (session == NULL) { vdo_log_error("missing required session pointer"); return -EINVAL; } result = start_loading_index_session(session); if (result != UDS_SUCCESS) return uds_status_to_errno(result); session->parameters = *parameters; format_dev_t(name, parameters->bdev->bd_dev); vdo_log_info("%s: %s", get_open_type_string(open_type), name); result = initialize_index_session(session, open_type); if (result != UDS_SUCCESS) vdo_log_error_strerror(result, "Failed %s", get_open_type_string(open_type)); finish_loading_index_session(session, result); return uds_status_to_errno(result); } static void wait_for_no_requests_in_progress(struct uds_index_session *index_session) { mutex_lock(&index_session->request_mutex); while (index_session->request_count > 0) { uds_wait_cond(&index_session->request_cond, &index_session->request_mutex); } mutex_unlock(&index_session->request_mutex); } static int __must_check save_index(struct uds_index_session *index_session) { wait_for_no_requests_in_progress(index_session); return uds_save_index(index_session->index); } static void suspend_rebuild(struct uds_index_session *session) { mutex_lock(&session->load_context.mutex); switch (session->load_context.status) { case INDEX_OPENING: session->load_context.status = INDEX_SUSPENDING; /* Wait until the index indicates that it is not replaying. 
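		 * The index load thread moves the status to INDEX_SUSPENDED (or to
		 * INDEX_READY if the load finishes first) and signals load_context.cond.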
*/ while ((session->load_context.status != INDEX_SUSPENDED) && (session->load_context.status != INDEX_READY)) { uds_wait_cond(&session->load_context.cond, &session->load_context.mutex); } break; case INDEX_READY: /* Index load does not need to be suspended. */ break; case INDEX_SUSPENDED: case INDEX_SUSPENDING: case INDEX_FREEING: default: /* These cases should not happen. */ VDO_ASSERT_LOG_ONLY(false, "Bad load context state %u", session->load_context.status); break; } mutex_unlock(&session->load_context.mutex); } /* * Suspend index operation, draining all current index requests and preventing new index requests * from starting. Optionally saves all index data before returning. */ int uds_suspend_index_session(struct uds_index_session *session, bool save) { int result = UDS_SUCCESS; bool no_work = false; bool rebuilding = false; /* Wait for any current index state change to complete. */ mutex_lock(&session->request_mutex); while (session->state & IS_FLAG_CLOSING) uds_wait_cond(&session->request_cond, &session->request_mutex); if ((session->state & IS_FLAG_WAITING) || (session->state & IS_FLAG_DESTROYING)) { no_work = true; vdo_log_info("Index session is already changing state"); result = -EBUSY; } else if (session->state & IS_FLAG_SUSPENDED) { no_work = true; } else if (session->state & IS_FLAG_LOADING) { session->state |= IS_FLAG_WAITING; rebuilding = true; } else if (session->state & IS_FLAG_LOADED) { session->state |= IS_FLAG_WAITING; } else { no_work = true; session->state |= IS_FLAG_SUSPENDED; uds_broadcast_cond(&session->request_cond); } mutex_unlock(&session->request_mutex); if (no_work) return uds_status_to_errno(result); if (rebuilding) suspend_rebuild(session); else if (save) result = save_index(session); else result = uds_flush_index_session(session); mutex_lock(&session->request_mutex); session->state &= ~IS_FLAG_WAITING; session->state |= IS_FLAG_SUSPENDED; uds_broadcast_cond(&session->request_cond); mutex_unlock(&session->request_mutex); return uds_status_to_errno(result); } static int replace_device(struct uds_index_session *session, struct block_device *bdev) { int result; result = uds_replace_index_storage(session->index, bdev); if (result != UDS_SUCCESS) return result; session->parameters.bdev = bdev; return UDS_SUCCESS; } /* * Resume index operation after being suspended. If the index is suspended and the supplied block * device differs from the current backing store, the index will start using the new backing store. */ int uds_resume_index_session(struct uds_index_session *session, struct block_device *bdev) { int result = UDS_SUCCESS; bool no_work = false; bool resume_replay = false; mutex_lock(&session->request_mutex); if (session->state & IS_FLAG_WAITING) { vdo_log_info("Index session is already changing state"); no_work = true; result = -EBUSY; } else if (!(session->state & IS_FLAG_SUSPENDED)) { /* If not suspended, just succeed. 
*/ no_work = true; result = UDS_SUCCESS; } else { session->state |= IS_FLAG_WAITING; if (session->state & IS_FLAG_LOADING) resume_replay = true; } mutex_unlock(&session->request_mutex); if (no_work) return result; if ((session->index != NULL) && (bdev != session->parameters.bdev)) { result = replace_device(session, bdev); if (result != UDS_SUCCESS) { mutex_lock(&session->request_mutex); session->state &= ~IS_FLAG_WAITING; uds_broadcast_cond(&session->request_cond); mutex_unlock(&session->request_mutex); return uds_status_to_errno(result); } } if (resume_replay) { mutex_lock(&session->load_context.mutex); switch (session->load_context.status) { case INDEX_SUSPENDED: session->load_context.status = INDEX_OPENING; /* Notify the index to start replaying again. */ uds_broadcast_cond(&session->load_context.cond); break; case INDEX_READY: /* There is no index rebuild to resume. */ break; case INDEX_OPENING: case INDEX_SUSPENDING: case INDEX_FREEING: default: /* These cases should not happen; do nothing. */ VDO_ASSERT_LOG_ONLY(false, "Bad load context state %u", session->load_context.status); break; } mutex_unlock(&session->load_context.mutex); } mutex_lock(&session->request_mutex); session->state &= ~IS_FLAG_WAITING; session->state &= ~IS_FLAG_SUSPENDED; uds_broadcast_cond(&session->request_cond); mutex_unlock(&session->request_mutex); return UDS_SUCCESS; } static int save_and_free_index(struct uds_index_session *index_session) { int result = UDS_SUCCESS; bool suspended; struct uds_index *index = index_session->index; if (index == NULL) return UDS_SUCCESS; mutex_lock(&index_session->request_mutex); suspended = (index_session->state & IS_FLAG_SUSPENDED); mutex_unlock(&index_session->request_mutex); if (!suspended) { result = uds_save_index(index); if (result != UDS_SUCCESS) vdo_log_warning_strerror(result, "ignoring error from save_index"); } uds_free_index(index); index_session->index = NULL; /* * Reset all index state that happens to be in the index * session, so it doesn't affect any future index. */ mutex_lock(&index_session->load_context.mutex); index_session->load_context.status = INDEX_OPENING; mutex_unlock(&index_session->load_context.mutex); mutex_lock(&index_session->request_mutex); /* Only the suspend bit will remain relevant. */ index_session->state &= IS_FLAG_SUSPENDED; mutex_unlock(&index_session->request_mutex); return result; } /* Save and close the current index. */ int uds_close_index(struct uds_index_session *index_session) { int result = UDS_SUCCESS; /* Wait for any current index state change to complete. */ mutex_lock(&index_session->request_mutex); while ((index_session->state & IS_FLAG_WAITING) || (index_session->state & IS_FLAG_CLOSING)) { uds_wait_cond(&index_session->request_cond, &index_session->request_mutex); } if (index_session->state & IS_FLAG_SUSPENDED) { vdo_log_info("Index session is suspended"); result = -EBUSY; } else if ((index_session->state & IS_FLAG_DESTROYING) || !(index_session->state & IS_FLAG_LOADED)) { /* The index doesn't exist, hasn't finished loading, or is being destroyed. 
*/ result = UDS_NO_INDEX; } else { index_session->state |= IS_FLAG_CLOSING; } mutex_unlock(&index_session->request_mutex); if (result != UDS_SUCCESS) return uds_status_to_errno(result); vdo_log_debug("Closing index"); wait_for_no_requests_in_progress(index_session); result = save_and_free_index(index_session); vdo_log_debug("Closed index"); mutex_lock(&index_session->request_mutex); index_session->state &= ~IS_FLAG_CLOSING; uds_broadcast_cond(&index_session->request_cond); mutex_unlock(&index_session->request_mutex); return uds_status_to_errno(result); } /* This will save and close an open index before destroying the session. */ int uds_destroy_index_session(struct uds_index_session *index_session) { int result; bool load_pending = false; vdo_log_debug("Destroying index session"); /* Wait for any current index state change to complete. */ mutex_lock(&index_session->request_mutex); while ((index_session->state & IS_FLAG_WAITING) || (index_session->state & IS_FLAG_CLOSING)) { uds_wait_cond(&index_session->request_cond, &index_session->request_mutex); } if (index_session->state & IS_FLAG_DESTROYING) { mutex_unlock(&index_session->request_mutex); vdo_log_info("Index session is already closing"); return -EBUSY; } index_session->state |= IS_FLAG_DESTROYING; load_pending = ((index_session->state & IS_FLAG_LOADING) && (index_session->state & IS_FLAG_SUSPENDED)); mutex_unlock(&index_session->request_mutex); if (load_pending) { /* Tell the index to terminate the rebuild. */ mutex_lock(&index_session->load_context.mutex); if (index_session->load_context.status == INDEX_SUSPENDED) { index_session->load_context.status = INDEX_FREEING; uds_broadcast_cond(&index_session->load_context.cond); } mutex_unlock(&index_session->load_context.mutex); /* Wait until the load exits before proceeding. */ mutex_lock(&index_session->request_mutex); while (index_session->state & IS_FLAG_LOADING) { uds_wait_cond(&index_session->request_cond, &index_session->request_mutex); } mutex_unlock(&index_session->request_mutex); } wait_for_no_requests_in_progress(index_session); result = save_and_free_index(index_session); uds_request_queue_finish(index_session->callback_queue); index_session->callback_queue = NULL; uds_destroy_cond(&index_session->load_context.cond); mutex_destroy(&index_session->load_context.mutex); uds_destroy_cond(&index_session->request_cond); mutex_destroy(&index_session->request_mutex); vdo_log_debug("Destroyed index session"); vdo_free(index_session); return uds_status_to_errno(result); } /* Wait until all callbacks for index operations are complete. */ int uds_flush_index_session(struct uds_index_session *index_session) { wait_for_no_requests_in_progress(index_session); uds_wait_for_idle_index(index_session->index); return UDS_SUCCESS; } /* Statistics collection is intended to be thread-safe. 
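 * Each counter is only ever updated by the callback thread through count_once(), which
 * uses a READ_ONCE/WRITE_ONCE pair, so the values read below may be slightly stale but are
 * never torn.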
*/ static void collect_stats(const struct uds_index_session *index_session, struct uds_index_stats *stats) { const struct session_stats *session_stats = &index_session->stats; stats->current_time = ktime_to_seconds(current_time_ns(CLOCK_REALTIME)); stats->posts_found = READ_ONCE(session_stats->posts_found); stats->in_memory_posts_found = READ_ONCE(session_stats->posts_found_open_chapter); stats->dense_posts_found = READ_ONCE(session_stats->posts_found_dense); stats->sparse_posts_found = READ_ONCE(session_stats->posts_found_sparse); stats->posts_not_found = READ_ONCE(session_stats->posts_not_found); stats->updates_found = READ_ONCE(session_stats->updates_found); stats->updates_not_found = READ_ONCE(session_stats->updates_not_found); stats->deletions_found = READ_ONCE(session_stats->deletions_found); stats->deletions_not_found = READ_ONCE(session_stats->deletions_not_found); stats->queries_found = READ_ONCE(session_stats->queries_found); stats->queries_not_found = READ_ONCE(session_stats->queries_not_found); stats->requests = READ_ONCE(session_stats->requests); } int uds_get_index_session_stats(struct uds_index_session *index_session, struct uds_index_stats *stats) { if (stats == NULL) { vdo_log_error("received a NULL index stats pointer"); return -EINVAL; } collect_stats(index_session, stats); if (index_session->index != NULL) { uds_get_index_stats(index_session->index, stats); } else { stats->entries_indexed = 0; stats->memory_used = 0; stats->collisions = 0; stats->entries_discarded = 0; } return UDS_SUCCESS; } vdo-8.3.1.1/utils/uds/index-session.h000066400000000000000000000047061476467262700174020ustar00rootroot00000000000000/* SPDX-License-Identifier: GPL-2.0-only */ /* * Copyright 2023 Red Hat */ #ifndef UDS_INDEX_SESSION_H #define UDS_INDEX_SESSION_H #include #include #include "thread-utils.h" #include "config.h" #include "indexer.h" /* * The index session mediates all interactions with a UDS index. Once the index session is created, * it can be used to open, close, suspend, or recreate an index. It implements the majority of the * functions in the top-level UDS API. * * If any deduplication request fails due to an internal error, the index is marked disabled. It * will not accept any further requests and can only be closed. Closing the index will clear the * disabled flag, and the index can then be reopened and recovered using the same index session. */ struct __aligned(L1_CACHE_BYTES) session_stats { /* Post requests that found an entry */ u64 posts_found; /* Post requests found in the open chapter */ u64 posts_found_open_chapter; /* Post requests found in the dense index */ u64 posts_found_dense; /* Post requests found in the sparse index */ u64 posts_found_sparse; /* Post requests that did not find an entry */ u64 posts_not_found; /* Update requests that found an entry */ u64 updates_found; /* Update requests that did not find an entry */ u64 updates_not_found; /* Delete requests that found an entry */ u64 deletions_found; /* Delete requests that did not find an entry */ u64 deletions_not_found; /* Query requests that found an entry */ u64 queries_found; /* Query requests that did not find an entry */ u64 queries_not_found; /* Total number of requests */ u64 requests; }; enum index_suspend_status { /* An index load has started but the index is not ready for use. */ INDEX_OPENING = 0, /* The index is able to handle requests. */ INDEX_READY, /* The index is attempting to suspend a rebuild. */ INDEX_SUSPENDING, /* An index rebuild has been suspended. 
*/ INDEX_SUSPENDED, /* An index rebuild is being stopped in order to shut down. */ INDEX_FREEING, }; struct index_load_context { struct mutex mutex; struct cond_var cond; enum index_suspend_status status; }; struct uds_index_session { unsigned int state; struct uds_index *index; struct uds_request_queue *callback_queue; struct uds_parameters parameters; struct index_load_context load_context; struct mutex request_mutex; struct cond_var request_cond; int request_count; struct session_stats stats; }; #endif /* UDS_INDEX_SESSION_H */ vdo-8.3.1.1/utils/uds/index.c000066400000000000000000001221751476467262700157150ustar00rootroot00000000000000// SPDX-License-Identifier: GPL-2.0-only /* * Copyright 2023 Red Hat */ #include "index.h" #include "logger.h" #include "memory-alloc.h" #include "funnel-requestqueue.h" #include "hash-utils.h" #include "sparse-cache.h" static const u64 NO_LAST_SAVE = U64_MAX; /* * When searching for deduplication records, the index first searches the volume index, and then * searches the chapter index for the relevant chapter. If the chapter has been fully committed to * storage, the chapter pages are loaded into the page cache. If the chapter has not yet been * committed (either the open chapter or a recently closed one), the index searches the in-memory * representation of the chapter. Finally, if the volume index does not find a record and the index * is sparse, the index will search the sparse cache. * * The index send two kinds of messages to coordinate between zones: chapter close messages for the * chapter writer, and sparse cache barrier messages for the sparse cache. * * The chapter writer is responsible for committing chapters of records to storage. Since zones can * get different numbers of records, some zones may fall behind others. Each time a zone fills up * its available space in a chapter, it informs the chapter writer that the chapter is complete, * and also informs all other zones that it has closed the chapter. Each other zone will then close * the chapter immediately, regardless of how full it is, in order to minimize skew between zones. * Once every zone has closed the chapter, the chapter writer will commit that chapter to storage. * * The last zone to close the chapter also removes the oldest chapter from the volume index. * Although that chapter is invalid for zones that have moved on, the existence of the open chapter * means that those zones will never ask the volume index about it. No zone is allowed to get more * than one chapter ahead of any other. If a zone is so far ahead that it tries to close another * chapter before the previous one has been closed by all zones, it is forced to wait. * * The sparse cache relies on having the same set of chapter indexes available to all zones. When a * request wants to add a chapter to the sparse cache, it sends a barrier message to each zone * during the triage stage that acts as a rendezvous. Once every zone has reached the barrier and * paused its operations, the cache membership is changed and each zone is then informed that it * can proceed. More details can be found in the sparse cache documentation. * * If a sparse cache has only one zone, it will not create a triage queue, but it still needs the * barrier message to change the sparse cache membership, so the index simulates the message by * invoking the handler directly. 
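 *
 * For example, with two zones: when zone 0 fills chapter N it waits for the writer to
 * finish chapter N-1, submits its copy of chapter N to the chapter writer, and sends a
 * chapter close message to zone 1. Zone 1 then closes its copy of chapter N immediately,
 * however full it is, and submits it as well. Once both copies have been submitted the
 * chapter writer commits chapter N to storage, and the last zone to close the chapter
 * expires the oldest chapter(s) from the volume.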
*/ struct chapter_writer { /* The index to which we belong */ struct uds_index *index; /* The thread to do the writing */ struct thread *thread; /* The lock protecting the following fields */ struct mutex mutex; /* The condition signalled on state changes */ struct cond_var cond; /* Set to true to stop the thread */ bool stop; /* The result from the most recent write */ int result; /* The number of bytes allocated by the chapter writer */ size_t memory_size; /* The number of zones which have submitted a chapter for writing */ unsigned int zones_to_write; /* Open chapter index used by uds_close_open_chapter() */ struct open_chapter_index *open_chapter_index; /* Collated records used by uds_close_open_chapter() */ struct uds_volume_record *collated_records; /* The chapters to write (one per zone) */ struct open_chapter_zone *chapters[]; }; static bool is_zone_chapter_sparse(const struct index_zone *zone, u64 virtual_chapter) { return uds_is_chapter_sparse(zone->index->volume->geometry, zone->oldest_virtual_chapter, zone->newest_virtual_chapter, virtual_chapter); } static int launch_zone_message(struct uds_zone_message message, unsigned int zone, struct uds_index *index) { int result; struct uds_request *request; result = vdo_allocate(1, struct uds_request, __func__, &request); if (result != VDO_SUCCESS) return result; request->index = index; request->unbatched = true; request->zone_number = zone; request->zone_message = message; uds_enqueue_request(request, STAGE_MESSAGE); return UDS_SUCCESS; } static void enqueue_barrier_messages(struct uds_index *index, u64 virtual_chapter) { struct uds_zone_message message = { .type = UDS_MESSAGE_SPARSE_CACHE_BARRIER, .virtual_chapter = virtual_chapter, }; unsigned int zone; for (zone = 0; zone < index->zone_count; zone++) { int result = launch_zone_message(message, zone, index); VDO_ASSERT_LOG_ONLY((result == UDS_SUCCESS), "barrier message allocation"); } } /* * Determine whether this request should trigger a sparse cache barrier message to change the * membership of the sparse cache. If a change in membership is desired, the function returns the * chapter number to add. */ static u64 triage_index_request(struct uds_index *index, struct uds_request *request) { u64 virtual_chapter; struct index_zone *zone; virtual_chapter = uds_lookup_volume_index_name(index->volume_index, &request->record_name); if (virtual_chapter == NO_CHAPTER) return NO_CHAPTER; zone = index->zones[request->zone_number]; if (!is_zone_chapter_sparse(zone, virtual_chapter)) return NO_CHAPTER; /* * FIXME: Optimize for a common case by remembering the chapter from the most recent * barrier message and skipping this chapter if is it the same. */ return virtual_chapter; } /* * Simulate a message to change the sparse cache membership for a single-zone sparse index. This * allows us to forgo the complicated locking required by a multi-zone sparse index. Any other kind * of index does nothing here. */ static int simulate_index_zone_barrier_message(struct index_zone *zone, struct uds_request *request) { u64 sparse_virtual_chapter; if ((zone->index->zone_count > 1) || !uds_is_sparse_index_geometry(zone->index->volume->geometry)) return UDS_SUCCESS; sparse_virtual_chapter = triage_index_request(zone->index, request); if (sparse_virtual_chapter == NO_CHAPTER) return UDS_SUCCESS; return uds_update_sparse_cache(zone, sparse_virtual_chapter); } /* This is the request processing function for the triage queue. 
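 * It is registered as the worker for the "triageW" queue in initialize_index_queues(); it
 * enqueues any needed sparse cache barrier messages and then forwards the request to the
 * index stage.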
*/ static void triage_request(struct uds_request *request) { struct uds_index *index = request->index; u64 sparse_virtual_chapter = triage_index_request(index, request); if (sparse_virtual_chapter != NO_CHAPTER) enqueue_barrier_messages(index, sparse_virtual_chapter); uds_enqueue_request(request, STAGE_INDEX); } static int finish_previous_chapter(struct uds_index *index, u64 current_chapter_number) { int result; struct chapter_writer *writer = index->chapter_writer; mutex_lock(&writer->mutex); while (index->newest_virtual_chapter < current_chapter_number) uds_wait_cond(&writer->cond, &writer->mutex); result = writer->result; mutex_unlock(&writer->mutex); if (result != UDS_SUCCESS) return vdo_log_error_strerror(result, "Writing of previous open chapter failed"); return UDS_SUCCESS; } static int swap_open_chapter(struct index_zone *zone) { int result; result = finish_previous_chapter(zone->index, zone->newest_virtual_chapter); if (result != UDS_SUCCESS) return result; swap(zone->open_chapter, zone->writing_chapter); return UDS_SUCCESS; } /* * Inform the chapter writer that this zone is done with this chapter. The chapter won't start * writing until all zones have closed it. */ static unsigned int start_closing_chapter(struct uds_index *index, unsigned int zone_number, struct open_chapter_zone *chapter) { unsigned int finished_zones; struct chapter_writer *writer = index->chapter_writer; mutex_lock(&writer->mutex); finished_zones = ++writer->zones_to_write; writer->chapters[zone_number] = chapter; uds_broadcast_cond(&writer->cond); mutex_unlock(&writer->mutex); return finished_zones; } static int announce_chapter_closed(struct index_zone *zone, u64 closed_chapter) { int result; unsigned int i; struct uds_zone_message zone_message = { .type = UDS_MESSAGE_ANNOUNCE_CHAPTER_CLOSED, .virtual_chapter = closed_chapter, }; for (i = 0; i < zone->index->zone_count; i++) { if (zone->id == i) continue; result = launch_zone_message(zone_message, i, zone->index); if (result != UDS_SUCCESS) return result; } return UDS_SUCCESS; } static int open_next_chapter(struct index_zone *zone) { int result; u64 closed_chapter; u64 expiring; unsigned int finished_zones; u32 expire_chapters; vdo_log_debug("closing chapter %llu of zone %u after %u entries (%u short)", (unsigned long long) zone->newest_virtual_chapter, zone->id, zone->open_chapter->size, zone->open_chapter->capacity - zone->open_chapter->size); result = swap_open_chapter(zone); if (result != UDS_SUCCESS) return result; closed_chapter = zone->newest_virtual_chapter++; uds_set_volume_index_zone_open_chapter(zone->index->volume_index, zone->id, zone->newest_virtual_chapter); uds_reset_open_chapter(zone->open_chapter); finished_zones = start_closing_chapter(zone->index, zone->id, zone->writing_chapter); if ((finished_zones == 1) && (zone->index->zone_count > 1)) { result = announce_chapter_closed(zone, closed_chapter); if (result != UDS_SUCCESS) return result; } expiring = zone->oldest_virtual_chapter; expire_chapters = uds_chapters_to_expire(zone->index->volume->geometry, zone->newest_virtual_chapter); zone->oldest_virtual_chapter += expire_chapters; if (finished_zones < zone->index->zone_count) return UDS_SUCCESS; while (expire_chapters-- > 0) uds_forget_chapter(zone->index->volume, expiring++); return UDS_SUCCESS; } static int handle_chapter_closed(struct index_zone *zone, u64 virtual_chapter) { if (zone->newest_virtual_chapter == virtual_chapter) return open_next_chapter(zone); return UDS_SUCCESS; } static int dispatch_index_zone_control_request(struct 
uds_request *request) { struct uds_zone_message *message = &request->zone_message; struct index_zone *zone = request->index->zones[request->zone_number]; switch (message->type) { case UDS_MESSAGE_SPARSE_CACHE_BARRIER: return uds_update_sparse_cache(zone, message->virtual_chapter); case UDS_MESSAGE_ANNOUNCE_CHAPTER_CLOSED: return handle_chapter_closed(zone, message->virtual_chapter); default: vdo_log_error("invalid message type: %d", message->type); return UDS_INVALID_ARGUMENT; } } static void set_request_location(struct uds_request *request, enum uds_index_region new_location) { request->location = new_location; request->found = ((new_location == UDS_LOCATION_IN_OPEN_CHAPTER) || (new_location == UDS_LOCATION_IN_DENSE) || (new_location == UDS_LOCATION_IN_SPARSE)); } static void set_chapter_location(struct uds_request *request, const struct index_zone *zone, u64 virtual_chapter) { request->found = true; if (virtual_chapter == zone->newest_virtual_chapter) request->location = UDS_LOCATION_IN_OPEN_CHAPTER; else if (is_zone_chapter_sparse(zone, virtual_chapter)) request->location = UDS_LOCATION_IN_SPARSE; else request->location = UDS_LOCATION_IN_DENSE; } static int search_sparse_cache_in_zone(struct index_zone *zone, struct uds_request *request, u64 virtual_chapter, bool *found) { int result; struct volume *volume; u16 record_page_number; u32 chapter; result = uds_search_sparse_cache(zone, &request->record_name, &virtual_chapter, &record_page_number); if ((result != UDS_SUCCESS) || (virtual_chapter == NO_CHAPTER)) return result; request->virtual_chapter = virtual_chapter; volume = zone->index->volume; chapter = uds_map_to_physical_chapter(volume->geometry, virtual_chapter); return uds_search_cached_record_page(volume, request, chapter, record_page_number, found); } static int get_record_from_zone(struct index_zone *zone, struct uds_request *request, bool *found) { struct volume *volume; if (request->location == UDS_LOCATION_RECORD_PAGE_LOOKUP) { *found = true; return UDS_SUCCESS; } else if (request->location == UDS_LOCATION_UNAVAILABLE) { *found = false; return UDS_SUCCESS; } if (request->virtual_chapter == zone->newest_virtual_chapter) { uds_search_open_chapter(zone->open_chapter, &request->record_name, &request->old_metadata, found); return UDS_SUCCESS; } if ((zone->newest_virtual_chapter > 0) && (request->virtual_chapter == (zone->newest_virtual_chapter - 1)) && (zone->writing_chapter->size > 0)) { uds_search_open_chapter(zone->writing_chapter, &request->record_name, &request->old_metadata, found); return UDS_SUCCESS; } volume = zone->index->volume; if (is_zone_chapter_sparse(zone, request->virtual_chapter) && uds_sparse_cache_contains(volume->sparse_cache, request->virtual_chapter, request->zone_number)) return search_sparse_cache_in_zone(zone, request, request->virtual_chapter, found); return uds_search_volume_page_cache(volume, request, found); } static int put_record_in_zone(struct index_zone *zone, struct uds_request *request, const struct uds_record_data *metadata) { unsigned int remaining; remaining = uds_put_open_chapter(zone->open_chapter, &request->record_name, metadata); if (remaining == 0) return open_next_chapter(zone); return UDS_SUCCESS; } static int search_index_zone(struct index_zone *zone, struct uds_request *request) { int result; struct volume_index_record record; bool overflow_record, found = false; struct uds_record_data *metadata; u64 chapter; result = uds_get_volume_index_record(zone->index->volume_index, &request->record_name, &record); if (result != UDS_SUCCESS) 
return result; if (record.is_found) { if (request->requeued && request->virtual_chapter != record.virtual_chapter) set_request_location(request, UDS_LOCATION_UNKNOWN); request->virtual_chapter = record.virtual_chapter; result = get_record_from_zone(zone, request, &found); if (result != UDS_SUCCESS) return result; } if (found) set_chapter_location(request, zone, record.virtual_chapter); /* * If a record has overflowed a chapter index in more than one chapter (or overflowed in * one chapter and collided with an existing record), it will exist as a collision record * in the volume index, but we won't find it in the volume. This case needs special * handling. */ overflow_record = (record.is_found && record.is_collision && !found); chapter = zone->newest_virtual_chapter; if (found || overflow_record) { if ((request->type == UDS_QUERY_NO_UPDATE) || ((request->type == UDS_QUERY) && overflow_record)) { /* There is nothing left to do. */ return UDS_SUCCESS; } if (record.virtual_chapter != chapter) { /* * Update the volume index to reference the new chapter for the block. If * the record had been deleted or dropped from the chapter index, it will * be back. */ result = uds_set_volume_index_record_chapter(&record, chapter); } else if (request->type != UDS_UPDATE) { /* The record is already in the open chapter. */ return UDS_SUCCESS; } } else { /* * The record wasn't in the volume index, so check whether the * name is in a cached sparse chapter. If we found the name on * a previous search, use that result instead. */ if (request->location == UDS_LOCATION_RECORD_PAGE_LOOKUP) { found = true; } else if (request->location == UDS_LOCATION_UNAVAILABLE) { found = false; } else if (uds_is_sparse_index_geometry(zone->index->volume->geometry) && !uds_is_volume_index_sample(zone->index->volume_index, &request->record_name)) { result = search_sparse_cache_in_zone(zone, request, NO_CHAPTER, &found); if (result != UDS_SUCCESS) return result; } if (found) set_request_location(request, UDS_LOCATION_IN_SPARSE); if ((request->type == UDS_QUERY_NO_UPDATE) || ((request->type == UDS_QUERY) && !found)) { /* There is nothing left to do. */ return UDS_SUCCESS; } /* * Add a new entry to the volume index referencing the open chapter. This needs to * be done both for new records, and for records from cached sparse chapters. */ result = uds_put_volume_index_record(&record, chapter); } if (result == UDS_OVERFLOW) { /* * The volume index encountered a delta list overflow. The condition was already * logged. We will go on without adding the record to the open chapter. */ return UDS_SUCCESS; } if (result != UDS_SUCCESS) return result; if (!found || (request->type == UDS_UPDATE)) { /* This is a new record or we're updating an existing record. */ metadata = &request->new_metadata; } else { /* Move the existing record to the open chapter. */ metadata = &request->old_metadata; } return put_record_in_zone(zone, request, metadata); } static int remove_from_index_zone(struct index_zone *zone, struct uds_request *request) { int result; struct volume_index_record record; result = uds_get_volume_index_record(zone->index->volume_index, &request->record_name, &record); if (result != UDS_SUCCESS) return result; if (!record.is_found) return UDS_SUCCESS; /* If the request was requeued, check whether the saved state is still valid. */ if (record.is_collision) { set_chapter_location(request, zone, record.virtual_chapter); } else { /* Non-collision records are hints, so resolve the name in the chapter. 
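 * (Roughly speaking, a non-collision entry does not pin down the record name exactly, so
 * the chapter must be searched to confirm that the entry really belongs to this name.)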
*/ bool found; if (request->requeued && request->virtual_chapter != record.virtual_chapter) set_request_location(request, UDS_LOCATION_UNKNOWN); request->virtual_chapter = record.virtual_chapter; result = get_record_from_zone(zone, request, &found); if (result != UDS_SUCCESS) return result; if (!found) { /* There is no record to remove. */ return UDS_SUCCESS; } } set_chapter_location(request, zone, record.virtual_chapter); /* * Delete the volume index entry for the named record only. Note that a later search might * later return stale advice if there is a colliding name in the same chapter, but it's a * very rare case (1 in 2^21). */ result = uds_remove_volume_index_record(&record); if (result != UDS_SUCCESS) return result; /* * If the record is in the open chapter, we must remove it or mark it deleted to avoid * trouble if the record is added again later. */ if (request->location == UDS_LOCATION_IN_OPEN_CHAPTER) uds_remove_from_open_chapter(zone->open_chapter, &request->record_name); return UDS_SUCCESS; } static int dispatch_index_request(struct uds_index *index, struct uds_request *request) { int result; struct index_zone *zone = index->zones[request->zone_number]; if (!request->requeued) { result = simulate_index_zone_barrier_message(zone, request); if (result != UDS_SUCCESS) return result; } switch (request->type) { case UDS_POST: case UDS_UPDATE: case UDS_QUERY: case UDS_QUERY_NO_UPDATE: result = search_index_zone(zone, request); break; case UDS_DELETE: result = remove_from_index_zone(zone, request); break; default: result = vdo_log_warning_strerror(UDS_INVALID_ARGUMENT, "invalid request type: %d", request->type); break; } return result; } /* This is the request processing function invoked by each zone's thread. */ static void execute_zone_request(struct uds_request *request) { int result; struct uds_index *index = request->index; if (request->zone_message.type != UDS_MESSAGE_NONE) { result = dispatch_index_zone_control_request(request); if (result != UDS_SUCCESS) { vdo_log_error_strerror(result, "error executing message: %d", request->zone_message.type); } /* Once the message is processed it can be freed. */ vdo_free(vdo_forget(request)); return; } index->need_to_save = true; if (request->requeued && (request->status != UDS_SUCCESS)) { set_request_location(request, UDS_LOCATION_UNAVAILABLE); index->callback(request); return; } result = dispatch_index_request(index, request); if (result == UDS_QUEUED) { /* The request has been requeued so don't let it complete. */ return; } if (!request->found) set_request_location(request, UDS_LOCATION_UNAVAILABLE); request->status = result; index->callback(request); } static int initialize_index_queues(struct uds_index *index, const struct index_geometry *geometry) { int result; unsigned int i; for (i = 0; i < index->zone_count; i++) { result = uds_make_request_queue("indexW", &execute_zone_request, &index->zone_queues[i]); if (result != UDS_SUCCESS) return result; } /* The triage queue is only needed for sparse multi-zone indexes. */ if ((index->zone_count > 1) && uds_is_sparse_index_geometry(geometry)) { result = uds_make_request_queue("triageW", &triage_request, &index->triage_queue); if (result != UDS_SUCCESS) return result; } return UDS_SUCCESS; } /* This is the driver function for the chapter writer thread. 
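 * The thread is created by make_chapter_writer() via vdo_create_thread() and loops until
 * stop_chapter_writer() sets writer->stop and broadcasts the condition.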
*/ static void close_chapters(void *arg) { int result; struct chapter_writer *writer = arg; struct uds_index *index = writer->index; vdo_log_debug("chapter writer starting"); mutex_lock(&writer->mutex); for (;;) { while (writer->zones_to_write < index->zone_count) { if (writer->stop && (writer->zones_to_write == 0)) { /* * We've been told to stop, and all of the zones are in the same * open chapter, so we can exit now. */ mutex_unlock(&writer->mutex); vdo_log_debug("chapter writer stopping"); return; } uds_wait_cond(&writer->cond, &writer->mutex); } /* * Release the lock while closing a chapter. We probably don't need to do this, but * it seems safer in principle. It's OK to access the chapter and chapter_number * fields without the lock since those aren't allowed to change until we're done. */ mutex_unlock(&writer->mutex); if (index->has_saved_open_chapter) { /* * Remove the saved open chapter the first time we close an open chapter * after loading from a clean shutdown, or after doing a clean save. The * lack of the saved open chapter will indicate that a recovery is * necessary. */ index->has_saved_open_chapter = false; result = uds_discard_open_chapter(index->layout); if (result == UDS_SUCCESS) vdo_log_debug("Discarding saved open chapter"); } result = uds_close_open_chapter(writer->chapters, index->zone_count, index->volume, writer->open_chapter_index, writer->collated_records, index->newest_virtual_chapter); mutex_lock(&writer->mutex); index->newest_virtual_chapter++; index->oldest_virtual_chapter += uds_chapters_to_expire(index->volume->geometry, index->newest_virtual_chapter); writer->result = result; writer->zones_to_write = 0; uds_broadcast_cond(&writer->cond); } } static void stop_chapter_writer(struct chapter_writer *writer) { struct thread *writer_thread = NULL; mutex_lock(&writer->mutex); if (writer->thread != NULL) { writer_thread = writer->thread; writer->thread = NULL; writer->stop = true; uds_broadcast_cond(&writer->cond); } mutex_unlock(&writer->mutex); if (writer_thread != NULL) vdo_join_threads(writer_thread); } static void free_chapter_writer(struct chapter_writer *writer) { if (writer == NULL) return; stop_chapter_writer(writer); mutex_destroy(&writer->mutex); uds_destroy_cond(&writer->cond); uds_free_open_chapter_index(writer->open_chapter_index); vdo_free(writer->collated_records); vdo_free(writer); } static int make_chapter_writer(struct uds_index *index, struct chapter_writer **writer_ptr) { int result; struct chapter_writer *writer; size_t collated_records_size = (sizeof(struct uds_volume_record) * index->volume->geometry->records_per_chapter); result = vdo_allocate_extended(struct chapter_writer, index->zone_count, struct open_chapter_zone *, "Chapter Writer", &writer); if (result != VDO_SUCCESS) return result; writer->index = index; mutex_init(&writer->mutex); uds_init_cond(&writer->cond); result = vdo_allocate_cache_aligned(collated_records_size, "collated records", &writer->collated_records); if (result != VDO_SUCCESS) { free_chapter_writer(writer); return result; } result = uds_make_open_chapter_index(&writer->open_chapter_index, index->volume->geometry, index->volume->nonce); if (result != UDS_SUCCESS) { free_chapter_writer(writer); return result; } writer->memory_size = (sizeof(struct chapter_writer) + index->zone_count * sizeof(struct open_chapter_zone *) + collated_records_size + writer->open_chapter_index->memory_size); result = vdo_create_thread(close_chapters, writer, "writer", &writer->thread); if (result != VDO_SUCCESS) { 
free_chapter_writer(writer); return result; } *writer_ptr = writer; return UDS_SUCCESS; } static int load_index(struct uds_index *index) { int result; u64 last_save_chapter; result = uds_load_index_state(index->layout, index); if (result != UDS_SUCCESS) return UDS_INDEX_NOT_SAVED_CLEANLY; last_save_chapter = ((index->last_save != NO_LAST_SAVE) ? index->last_save : 0); vdo_log_info("loaded index from chapter %llu through chapter %llu", (unsigned long long) index->oldest_virtual_chapter, (unsigned long long) last_save_chapter); return UDS_SUCCESS; } static int rebuild_index_page_map(struct uds_index *index, u64 vcn) { int result; struct delta_index_page *chapter_index_page; struct index_geometry *geometry = index->volume->geometry; u32 chapter = uds_map_to_physical_chapter(geometry, vcn); u32 expected_list_number = 0; u32 index_page_number; u32 lowest_delta_list; u32 highest_delta_list; for (index_page_number = 0; index_page_number < geometry->index_pages_per_chapter; index_page_number++) { result = uds_get_volume_index_page(index->volume, chapter, index_page_number, &chapter_index_page); if (result != UDS_SUCCESS) { return vdo_log_error_strerror(result, "failed to read index page %u in chapter %u", index_page_number, chapter); } lowest_delta_list = chapter_index_page->lowest_list_number; highest_delta_list = chapter_index_page->highest_list_number; if (lowest_delta_list != expected_list_number) { return vdo_log_error_strerror(UDS_CORRUPT_DATA, "chapter %u index page %u is corrupt", chapter, index_page_number); } uds_update_index_page_map(index->volume->index_page_map, vcn, chapter, index_page_number, highest_delta_list); expected_list_number = highest_delta_list + 1; } return UDS_SUCCESS; } static int replay_record(struct uds_index *index, const struct uds_record_name *name, u64 virtual_chapter, bool will_be_sparse_chapter) { int result; struct volume_index_record record; bool update_record; if (will_be_sparse_chapter && !uds_is_volume_index_sample(index->volume_index, name)) { /* * This entry will be in a sparse chapter after the rebuild completes, and it is * not a sample, so just skip over it. */ return UDS_SUCCESS; } result = uds_get_volume_index_record(index->volume_index, name, &record); if (result != UDS_SUCCESS) return result; if (record.is_found) { if (record.is_collision) { if (record.virtual_chapter == virtual_chapter) { /* The record is already correct. */ return UDS_SUCCESS; } update_record = true; } else if (record.virtual_chapter == virtual_chapter) { /* * There is a volume index entry pointing to the current chapter, but we * don't know if it is for the same name as the one we are currently * working on or not. For now, we're just going to assume that it isn't. * This will create one extra collision record if there was a deleted * record in the current chapter. */ update_record = false; } else { /* * If we're rebuilding, we don't normally want to go to disk to see if the * record exists, since we will likely have just read the record from disk * (i.e. we know it's there). The exception to this is when we find an * entry in the volume index that has a different chapter. In this case, we * need to search that chapter to determine if the volume index entry was * for the same record or a different one. 
*/ result = uds_search_volume_page_cache_for_rebuild(index->volume, name, record.virtual_chapter, &update_record); if (result != UDS_SUCCESS) return result; } } else { update_record = false; } if (update_record) { /* * Update the volume index to reference the new chapter for the block. If the * record had been deleted or dropped from the chapter index, it will be back. */ result = uds_set_volume_index_record_chapter(&record, virtual_chapter); } else { /* * Add a new entry to the volume index referencing the open chapter. This should be * done regardless of whether we are a brand new record or a sparse record, i.e. * one that doesn't exist in the index but does on disk, since for a sparse record, * we would want to un-sparsify if it did exist. */ result = uds_put_volume_index_record(&record, virtual_chapter); } if ((result == UDS_DUPLICATE_NAME) || (result == UDS_OVERFLOW)) { /* The rebuilt index will lose these records. */ return UDS_SUCCESS; } return result; } static bool check_for_suspend(struct uds_index *index) { bool closing; if (index->load_context == NULL) return false; mutex_lock(&index->load_context->mutex); if (index->load_context->status != INDEX_SUSPENDING) { mutex_unlock(&index->load_context->mutex); return false; } /* Notify that we are suspended and wait for the resume. */ index->load_context->status = INDEX_SUSPENDED; uds_broadcast_cond(&index->load_context->cond); while ((index->load_context->status != INDEX_OPENING) && (index->load_context->status != INDEX_FREEING)) uds_wait_cond(&index->load_context->cond, &index->load_context->mutex); closing = (index->load_context->status == INDEX_FREEING); mutex_unlock(&index->load_context->mutex); return closing; } static int replay_chapter(struct uds_index *index, u64 virtual, bool sparse) { int result; u32 i; u32 j; const struct index_geometry *geometry; u32 physical_chapter; if (check_for_suspend(index)) { vdo_log_info("Replay interrupted by index shutdown at chapter %llu", (unsigned long long) virtual); return -EBUSY; } geometry = index->volume->geometry; physical_chapter = uds_map_to_physical_chapter(geometry, virtual); uds_prefetch_volume_chapter(index->volume, physical_chapter); uds_set_volume_index_open_chapter(index->volume_index, virtual); result = rebuild_index_page_map(index, virtual); if (result != UDS_SUCCESS) { return vdo_log_error_strerror(result, "could not rebuild index page map for chapter %u", physical_chapter); } for (i = 0; i < geometry->record_pages_per_chapter; i++) { u8 *record_page; u32 record_page_number; record_page_number = geometry->index_pages_per_chapter + i; result = uds_get_volume_record_page(index->volume, physical_chapter, record_page_number, &record_page); if (result != UDS_SUCCESS) { return vdo_log_error_strerror(result, "could not get page %d", record_page_number); } for (j = 0; j < geometry->records_per_page; j++) { const u8 *name_bytes; struct uds_record_name name; name_bytes = record_page + (j * BYTES_PER_RECORD); memcpy(&name.name, name_bytes, UDS_RECORD_NAME_SIZE); result = replay_record(index, &name, virtual, sparse); if (result != UDS_SUCCESS) return result; } } return UDS_SUCCESS; } static int replay_volume(struct uds_index *index) { int result; u64 old_map_update; u64 new_map_update; u64 virtual; u64 from_virtual = index->oldest_virtual_chapter; u64 upto_virtual = index->newest_virtual_chapter; bool will_be_sparse; vdo_log_info("Replaying volume from chapter %llu through chapter %llu", (unsigned long long) from_virtual, (unsigned long long) upto_virtual); /* * The index failed to load, 
so the volume index is empty. Add records to the volume index * in order, skipping non-hooks in chapters which will be sparse to save time. * * Go through each record page of each chapter and add the records back to the volume * index. This should not cause anything to be written to either the open chapter or the * on-disk volume. Also skip the on-disk chapter corresponding to upto_virtual, as this * would have already been purged from the volume index when the chapter was opened. * * Also, go through each index page for each chapter and rebuild the index page map. */ old_map_update = index->volume->index_page_map->last_update; for (virtual = from_virtual; virtual < upto_virtual; virtual++) { will_be_sparse = uds_is_chapter_sparse(index->volume->geometry, from_virtual, upto_virtual, virtual); result = replay_chapter(index, virtual, will_be_sparse); if (result != UDS_SUCCESS) return result; } /* Also reap the chapter being replaced by the open chapter. */ uds_set_volume_index_open_chapter(index->volume_index, upto_virtual); new_map_update = index->volume->index_page_map->last_update; if (new_map_update != old_map_update) { vdo_log_info("replay changed index page map update from %llu to %llu", (unsigned long long) old_map_update, (unsigned long long) new_map_update); } return UDS_SUCCESS; } static int rebuild_index(struct uds_index *index) { int result; u64 lowest; u64 highest; bool is_empty = false; u32 chapters_per_volume = index->volume->geometry->chapters_per_volume; index->volume->lookup_mode = LOOKUP_FOR_REBUILD; result = uds_find_volume_chapter_boundaries(index->volume, &lowest, &highest, &is_empty); if (result != UDS_SUCCESS) { return vdo_log_fatal_strerror(result, "cannot rebuild index: unknown volume chapter boundaries"); } if (is_empty) { index->newest_virtual_chapter = 0; index->oldest_virtual_chapter = 0; index->volume->lookup_mode = LOOKUP_NORMAL; return UDS_SUCCESS; } index->newest_virtual_chapter = highest + 1; index->oldest_virtual_chapter = lowest; if (index->newest_virtual_chapter == (index->oldest_virtual_chapter + chapters_per_volume)) { /* Skip the chapter shadowed by the open chapter. 
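 * As a worked example, with (say) 1024 chapters per volume, finding lowest == 0 and
 * highest == 1023 makes newest 1024 == oldest + chapters_per_volume, so the oldest
 * chapter is about to be overwritten by the open chapter and is not replayed.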
*/ index->oldest_virtual_chapter++; } result = replay_volume(index); if (result != UDS_SUCCESS) return result; index->volume->lookup_mode = LOOKUP_NORMAL; return UDS_SUCCESS; } static void free_index_zone(struct index_zone *zone) { if (zone == NULL) return; uds_free_open_chapter(zone->open_chapter); uds_free_open_chapter(zone->writing_chapter); vdo_free(zone); } static int make_index_zone(struct uds_index *index, unsigned int zone_number) { int result; struct index_zone *zone; result = vdo_allocate(1, struct index_zone, "index zone", &zone); if (result != VDO_SUCCESS) return result; result = uds_make_open_chapter(index->volume->geometry, index->zone_count, &zone->open_chapter); if (result != UDS_SUCCESS) { free_index_zone(zone); return result; } result = uds_make_open_chapter(index->volume->geometry, index->zone_count, &zone->writing_chapter); if (result != UDS_SUCCESS) { free_index_zone(zone); return result; } zone->index = index; zone->id = zone_number; index->zones[zone_number] = zone; return UDS_SUCCESS; } int uds_make_index(struct uds_configuration *config, enum uds_open_index_type open_type, struct index_load_context *load_context, index_callback_fn callback, struct uds_index **new_index) { int result; bool loaded = false; bool new = (open_type == UDS_CREATE); struct uds_index *index = NULL; struct index_zone *zone; u64 nonce; unsigned int z; result = vdo_allocate_extended(struct uds_index, config->zone_count, struct uds_request_queue *, "index", &index); if (result != VDO_SUCCESS) return result; index->zone_count = config->zone_count; result = uds_make_index_layout(config, new, &index->layout); if (result != UDS_SUCCESS) { uds_free_index(index); return result; } result = vdo_allocate(index->zone_count, struct index_zone *, "zones", &index->zones); if (result != VDO_SUCCESS) { uds_free_index(index); return result; } result = uds_make_volume(config, index->layout, &index->volume); if (result != UDS_SUCCESS) { uds_free_index(index); return result; } index->volume->lookup_mode = LOOKUP_NORMAL; for (z = 0; z < index->zone_count; z++) { result = make_index_zone(index, z); if (result != UDS_SUCCESS) { uds_free_index(index); return vdo_log_error_strerror(result, "Could not create index zone"); } } nonce = uds_get_volume_nonce(index->layout); result = uds_make_volume_index(config, nonce, &index->volume_index); if (result != UDS_SUCCESS) { uds_free_index(index); return vdo_log_error_strerror(result, "could not make volume index"); } index->load_context = load_context; index->callback = callback; result = initialize_index_queues(index, config->geometry); if (result != UDS_SUCCESS) { uds_free_index(index); return result; } result = make_chapter_writer(index, &index->chapter_writer); if (result != UDS_SUCCESS) { uds_free_index(index); return result; } if (!new) { result = load_index(index); switch (result) { case UDS_SUCCESS: loaded = true; break; case -ENOMEM: /* We should not try a rebuild for this error. 
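 * (Running out of memory is not a sign of a damaged index, and a rebuild would likely hit
 * the same limit.)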
*/ vdo_log_error_strerror(result, "index could not be loaded"); break; default: vdo_log_error_strerror(result, "index could not be loaded"); if (open_type == UDS_LOAD) { result = rebuild_index(index); if (result != UDS_SUCCESS) { vdo_log_error_strerror(result, "index could not be rebuilt"); } } break; } } if (result != UDS_SUCCESS) { uds_free_index(index); return vdo_log_error_strerror(result, "fatal error in %s()", __func__); } for (z = 0; z < index->zone_count; z++) { zone = index->zones[z]; zone->oldest_virtual_chapter = index->oldest_virtual_chapter; zone->newest_virtual_chapter = index->newest_virtual_chapter; } if (index->load_context != NULL) { mutex_lock(&index->load_context->mutex); index->load_context->status = INDEX_READY; /* * If we get here, suspend is meaningless, but notify any thread trying to suspend * us so it doesn't hang. */ uds_broadcast_cond(&index->load_context->cond); mutex_unlock(&index->load_context->mutex); } index->has_saved_open_chapter = loaded; index->need_to_save = !loaded; *new_index = index; return UDS_SUCCESS; } void uds_free_index(struct uds_index *index) { unsigned int i; if (index == NULL) return; uds_request_queue_finish(index->triage_queue); for (i = 0; i < index->zone_count; i++) uds_request_queue_finish(index->zone_queues[i]); free_chapter_writer(index->chapter_writer); uds_free_volume_index(index->volume_index); if (index->zones != NULL) { for (i = 0; i < index->zone_count; i++) free_index_zone(index->zones[i]); vdo_free(index->zones); } uds_free_volume(index->volume); uds_free_index_layout(vdo_forget(index->layout)); vdo_free(index); } /* Wait for the chapter writer to complete any outstanding writes. */ void uds_wait_for_idle_index(struct uds_index *index) { struct chapter_writer *writer = index->chapter_writer; mutex_lock(&writer->mutex); while (writer->zones_to_write > 0) uds_wait_cond(&writer->cond, &writer->mutex); mutex_unlock(&writer->mutex); } /* This function assumes that all requests have been drained. */ int uds_save_index(struct uds_index *index) { int result; if (!index->need_to_save) return UDS_SUCCESS; uds_wait_for_idle_index(index); index->prev_save = index->last_save; index->last_save = ((index->newest_virtual_chapter == 0) ? NO_LAST_SAVE : index->newest_virtual_chapter - 1); vdo_log_info("beginning save (vcn %llu)", (unsigned long long) index->last_save); result = uds_save_index_state(index->layout, index); if (result != UDS_SUCCESS) { vdo_log_info("save index failed"); index->last_save = index->prev_save; } else { index->has_saved_open_chapter = true; index->need_to_save = false; vdo_log_info("finished save (vcn %llu)", (unsigned long long) index->last_save); } return result; } int uds_replace_index_storage(struct uds_index *index, struct block_device *bdev) { return uds_replace_volume_storage(index->volume, index->layout, bdev); } /* Accessing statistics should be safe from any thread. 
*/ void uds_get_index_stats(struct uds_index *index, struct uds_index_stats *counters) { struct volume_index_stats stats; uds_get_volume_index_stats(index->volume_index, &stats); counters->entries_indexed = stats.record_count; counters->collisions = stats.collision_count; counters->entries_discarded = stats.discard_count; counters->memory_used = (index->volume_index->memory_size + index->volume->cache_size + index->chapter_writer->memory_size); } void uds_enqueue_request(struct uds_request *request, enum request_stage stage) { struct uds_index *index = request->index; struct uds_request_queue *queue; switch (stage) { case STAGE_TRIAGE: if (index->triage_queue != NULL) { queue = index->triage_queue; break; } fallthrough; case STAGE_INDEX: request->zone_number = uds_get_volume_index_zone(index->volume_index, &request->record_name); fallthrough; case STAGE_MESSAGE: queue = index->zone_queues[request->zone_number]; break; default: VDO_ASSERT_LOG_ONLY(false, "invalid index stage: %d", stage); return; } uds_request_queue_enqueue(queue, request); } vdo-8.3.1.1/utils/uds/index.h000066400000000000000000000044751476467262700157240ustar00rootroot00000000000000/* SPDX-License-Identifier: GPL-2.0-only */ /* * Copyright 2023 Red Hat */ #ifndef UDS_INDEX_H #define UDS_INDEX_H #include "index-layout.h" #include "index-session.h" #include "open-chapter.h" #include "volume.h" #include "volume-index.h" /* * The index is a high-level structure which represents the totality of the UDS index. It manages * the queues for incoming requests and dispatches them to the appropriate sub-components like the * volume or the volume index. It also manages administrative tasks such as saving and loading the * index. * * The index is divided into a number of independent zones and assigns each request to a zone based * on its name. Most sub-components are similarly divided into zones as well so that requests in * each zone usually operate without interference or coordination between zones. 
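 *
 * As a sketch of the routing (mirroring uds_enqueue_request() in index.c), a request is
 * assigned to a zone from its record name before being queued:
 *
 *	request->zone_number = uds_get_volume_index_zone(index->volume_index,
 *							 &request->record_name);
 *	uds_request_queue_enqueue(index->zone_queues[request->zone_number], request);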
*/ typedef void (*index_callback_fn)(struct uds_request *request); struct index_zone { struct uds_index *index; struct open_chapter_zone *open_chapter; struct open_chapter_zone *writing_chapter; u64 oldest_virtual_chapter; u64 newest_virtual_chapter; unsigned int id; }; struct uds_index { bool has_saved_open_chapter; bool need_to_save; struct index_load_context *load_context; struct index_layout *layout; struct volume_index *volume_index; struct volume *volume; unsigned int zone_count; struct index_zone **zones; u64 oldest_virtual_chapter; u64 newest_virtual_chapter; u64 last_save; u64 prev_save; struct chapter_writer *chapter_writer; index_callback_fn callback; struct uds_request_queue *triage_queue; struct uds_request_queue *zone_queues[]; }; enum request_stage { STAGE_TRIAGE, STAGE_INDEX, STAGE_MESSAGE, }; int __must_check uds_make_index(struct uds_configuration *config, enum uds_open_index_type open_type, struct index_load_context *load_context, index_callback_fn callback, struct uds_index **new_index); int __must_check uds_save_index(struct uds_index *index); void uds_free_index(struct uds_index *index); int __must_check uds_replace_index_storage(struct uds_index *index, struct block_device *bdev); void uds_get_index_stats(struct uds_index *index, struct uds_index_stats *counters); void uds_enqueue_request(struct uds_request *request, enum request_stage stage); void uds_wait_for_idle_index(struct uds_index *index); #endif /* UDS_INDEX_H */ vdo-8.3.1.1/utils/uds/indexer.h000066400000000000000000000304461476467262700162500ustar00rootroot00000000000000/* SPDX-License-Identifier: GPL-2.0-only */ /* * Copyright 2023 Red Hat */ #ifndef INDEXER_H #define INDEXER_H #include #include #include #include #include #include #include #include #include "funnel-queue.h" /* * UDS public API * * The Universal Deduplication System (UDS) is an efficient name-value store. When used for * deduplicating storage, the names are generally hashes of data blocks and the associated data is * where that block is located on the underlying storage medium. The stored names are expected to * be randomly distributed among the space of possible names. If this assumption is violated, the * UDS index will store fewer names than normal but will otherwise continue to work. The data * associated with each name can be any 16-byte value. * * A client must first create an index session to interact with an index. Once created, the session * can be shared among multiple threads or users. When a session is destroyed, it will also close * and save any associated index. * * To make a request, a client must allocate a uds_request structure and set the required fields * before launching it. UDS will invoke the provided callback to complete the request. After the * callback has been called, the uds_request structure can be freed or reused for a new request. * There are five types of requests: * * A UDS_UPDATE request will associate the provided name with the provided data. Any previous data * associated with that name will be discarded. * * A UDS_QUERY request will return the data associated with the provided name, if any. The entry * for the name will also be marked as most recent, as if the data had been updated. * * A UDS_POST request is a combination of UDS_QUERY and UDS_UPDATE. If there is already data * associated with the provided name, that data is returned. If there is no existing association, * the name is associated with the newly provided data. 
This request is equivalent to a UDS_QUERY * request followed by a UDS_UPDATE request if no data is found, but it is much more efficient. * * A UDS_QUERY_NO_UPDATE request will return the data associated with the provided name, but will * not change the recency of the entry for the name. This request is primarily useful for testing, * to determine whether an entry exists without changing the internal state of the index. * * A UDS_DELETE request removes any data associated with the provided name. This operation is * generally not necessary, because the index will automatically discard its oldest entries once it * becomes full. */ /* General UDS constants and structures */ enum uds_request_type { /* Create or update the mapping for a name, and make the name most recent. */ UDS_UPDATE, /* Return any mapped data for a name, and make the name most recent. */ UDS_QUERY, /* * Return any mapped data for a name, or map the provided data to the name if there is no * current data, and make the name most recent. */ UDS_POST, /* Return any mapped data for a name without updating its recency. */ UDS_QUERY_NO_UPDATE, /* Remove any mapping for a name. */ UDS_DELETE, }; enum uds_open_index_type { /* Create a new index. */ UDS_CREATE, /* Load an existing index and try to recover if necessary. */ UDS_LOAD, /* Load an existing index, but only if it was saved cleanly. */ UDS_NO_REBUILD, }; enum { /* The record name size in bytes */ UDS_RECORD_NAME_SIZE = 16, /* The maximum record data size in bytes */ UDS_RECORD_DATA_SIZE = 16, }; /* * A type representing a UDS memory configuration which is either a positive integer number of * gigabytes or one of the six special constants for configurations smaller than one gigabyte. */ typedef int uds_memory_config_size_t; enum { /* The maximum configurable amount of memory */ UDS_MEMORY_CONFIG_MAX = 1024, /* Flag indicating that the index has one less chapter than usual */ UDS_MEMORY_CONFIG_REDUCED = 0x1000, UDS_MEMORY_CONFIG_REDUCED_MAX = 1024 + UDS_MEMORY_CONFIG_REDUCED, /* Special values indicating sizes less than 1 GB */ UDS_MEMORY_CONFIG_256MB = -256, UDS_MEMORY_CONFIG_512MB = -512, UDS_MEMORY_CONFIG_768MB = -768, UDS_MEMORY_CONFIG_REDUCED_256MB = -1280, UDS_MEMORY_CONFIG_REDUCED_512MB = -1536, UDS_MEMORY_CONFIG_REDUCED_768MB = -1792, }; struct uds_record_name { unsigned char name[UDS_RECORD_NAME_SIZE]; }; struct uds_record_data { unsigned char data[UDS_RECORD_DATA_SIZE]; }; struct uds_volume_record { struct uds_record_name name; struct uds_record_data data; }; struct uds_parameters { /* The block_device used for storage */ struct block_device *bdev; /* The maximum allowable size of the index on storage */ size_t size; /* The offset where the index should start */ off_t offset; /* The maximum memory allocation, in GB */ uds_memory_config_size_t memory_size; /* Whether the index should include sparse chapters */ bool sparse; /* A 64-bit nonce to validate the index */ u64 nonce; /* The number of threads used to process index requests */ unsigned int zone_count; /* The number of threads used to read volume pages */ unsigned int read_threads; }; /* * These statistics capture characteristics of the current index, including resource usage and * requests processed since the index was opened. 
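 * They are reported through uds_get_index_session_stats(). Barring races while the
 * counters are read, posts_found should equal the sum of the in-memory, dense, and
 * sparse breakdowns below.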
*/ struct uds_index_stats { /* The total number of records stored in the index */ u64 entries_indexed; /* An estimate of the index's memory usage, in bytes */ u64 memory_used; /* The number of collisions recorded in the volume index */ u64 collisions; /* The number of entries discarded from the index since startup */ u64 entries_discarded; /* The time at which these statistics were fetched */ s64 current_time; /* The number of post calls that found an existing entry */ u64 posts_found; /* The number of post calls that added an entry */ u64 posts_not_found; /* * The number of post calls that found an existing entry that is current enough to only * exist in memory and not have been committed to disk yet */ u64 in_memory_posts_found; /* * The number of post calls that found an existing entry in the dense portion of the index */ u64 dense_posts_found; /* * The number of post calls that found an existing entry in the sparse portion of the index */ u64 sparse_posts_found; /* The number of update calls that updated an existing entry */ u64 updates_found; /* The number of update calls that added a new entry */ u64 updates_not_found; /* The number of delete requests that deleted an existing entry */ u64 deletions_found; /* The number of delete requests that did nothing */ u64 deletions_not_found; /* The number of query calls that found existing entry */ u64 queries_found; /* The number of query calls that did not find an entry */ u64 queries_not_found; /* The total number of requests processed */ u64 requests; }; enum uds_index_region { /* No location information has been determined */ UDS_LOCATION_UNKNOWN = 0, /* The index page entry has been found */ UDS_LOCATION_INDEX_PAGE_LOOKUP, /* The record page entry has been found */ UDS_LOCATION_RECORD_PAGE_LOOKUP, /* The record is not in the index */ UDS_LOCATION_UNAVAILABLE, /* The record was found in the open chapter */ UDS_LOCATION_IN_OPEN_CHAPTER, /* The record was found in the dense part of the index */ UDS_LOCATION_IN_DENSE, /* The record was found in the sparse part of the index */ UDS_LOCATION_IN_SPARSE, } __packed; /* Zone message requests are used to communicate between index zones. */ enum uds_zone_message_type { /* A standard request with no message */ UDS_MESSAGE_NONE = 0, /* Add a chapter to the sparse chapter index cache */ UDS_MESSAGE_SPARSE_CACHE_BARRIER, /* Close a chapter to keep the zone from falling behind */ UDS_MESSAGE_ANNOUNCE_CHAPTER_CLOSED, } __packed; struct uds_zone_message { /* The type of message, determining how it will be processed */ enum uds_zone_message_type type; /* The virtual chapter number to which the message applies */ u64 virtual_chapter; }; struct uds_index_session; struct uds_index; struct uds_request; /* Once this callback has been invoked, the uds_request structure can be reused or freed. */ typedef void (*uds_request_callback_fn)(struct uds_request *request); struct uds_request { /* These input fields must be set before launching a request. */ /* The name of the record to look up or create */ struct uds_record_name record_name; /* New data to associate with the record name, if applicable */ struct uds_record_data new_metadata; /* A callback to invoke when the request is complete */ uds_request_callback_fn callback; /* The index session that will manage this request */ struct uds_index_session *session; /* The type of operation to perform, as describe above */ enum uds_request_type type; /* These output fields are set when a request is complete. 
*/ /* The existing data associated with the request name, if any */ struct uds_record_data old_metadata; /* Either UDS_SUCCESS or an error code for the request */ int status; /* True if the record name had an existing entry in the index */ bool found; /* * The remaining fields are used internally and should not be altered by clients. The index * relies on zone_number being the first field in this section. */ /* The number of the zone which will process this request*/ unsigned int zone_number; /* A link for adding a request to a lock-free queue */ struct funnel_queue_entry queue_link; /* A link for adding a request to a standard linked list */ struct uds_request *next_request; /* A pointer to the index processing this request */ struct uds_index *index; /* Control message for coordinating between zones */ struct uds_zone_message zone_message; /* If true, process request immediately by waking the worker thread */ bool unbatched; /* If true, continue this request before processing newer requests */ bool requeued; /* The virtual chapter containing the record name, if known */ u64 virtual_chapter; /* The region of the index containing the record name */ enum uds_index_region location; }; /* Compute the number of bytes needed to store an index. */ int __must_check uds_compute_index_size(const struct uds_parameters *parameters, u64 *index_size); /* A session is required for most index operations. */ int __must_check uds_create_index_session(struct uds_index_session **session); /* Destroying an index session also closes and saves the associated index. */ int uds_destroy_index_session(struct uds_index_session *session); /* * Create or open an index with an existing session. This operation fails if the index session is * suspended, or if there is already an open index. */ int __must_check uds_open_index(enum uds_open_index_type open_type, const struct uds_parameters *parameters, struct uds_index_session *session); /* * Wait until all callbacks for index operations are complete, and prevent new index operations * from starting. New index operations will fail with EBUSY until the session is resumed. Also * optionally saves the index. */ int __must_check uds_suspend_index_session(struct uds_index_session *session, bool save); /* * Allow new index operations for an index, whether it was suspended or not. If the index is * suspended and the supplied block device differs from the current backing store, the index will * start using the new backing store instead. */ int __must_check uds_resume_index_session(struct uds_index_session *session, struct block_device *bdev); /* Wait until all outstanding index operations are complete. */ int __must_check uds_flush_index_session(struct uds_index_session *session); /* Close an index. This operation fails if the index session is suspended. */ int __must_check uds_close_index(struct uds_index_session *session); /* Get index statistics since the last time the index was opened. */ int __must_check uds_get_index_session_stats(struct uds_index_session *session, struct uds_index_stats *stats); /* This function will fail if any required field of the request is not set. 
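 *
 * A minimal sketch of a post request; name, data, session, and my_callback here are
 * caller-supplied and purely illustrative:
 *
 *	struct uds_request request = {
 *		.record_name = name,
 *		.new_metadata = data,
 *		.type = UDS_POST,
 *		.session = session,
 *		.callback = my_callback,
 *	};
 *
 *	result = uds_launch_request(&request);
 *
 * The request must remain valid until the callback is invoked; the callback can then
 * examine status, found, and old_metadata.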
*/ int __must_check uds_launch_request(struct uds_request *request); struct cond_var { pthread_cond_t condition; }; void uds_init_cond(struct cond_var *cond); void uds_signal_cond(struct cond_var *cond); void uds_broadcast_cond(struct cond_var *cond); void uds_wait_cond(struct cond_var *cond, struct mutex *mutex); void uds_destroy_cond(struct cond_var *cond); #endif /* INDEXER_H */ vdo-8.3.1.1/utils/uds/io-factory.c000066400000000000000000000240141476467262700166530ustar00rootroot00000000000000// SPDX-License-Identifier: GPL-2.0-only /* * Copyright 2023 Red Hat */ #include "io-factory.h" #include #include #include #include #include #include "logger.h" #include "memory-alloc.h" #include "numeric.h" /* * The I/O factory object manages access to index storage, which is a contiguous range of blocks on * a block device. * * The factory holds the open device and is responsible for closing it. The factory has methods to * make helper structures that can be used to access sections of the index. */ struct io_factory { struct block_device *bdev; atomic_t ref_count; }; /* The buffered reader allows efficient I/O by reading page-sized segments into a buffer. */ struct buffered_reader { struct io_factory *factory; struct dm_bufio_client *client; struct dm_buffer *buffer; sector_t limit; sector_t block_number; u8 *start; u8 *end; }; #define MAX_READ_AHEAD_BLOCKS 4 /* * The buffered writer allows efficient I/O by buffering writes and committing page-sized segments * to storage. */ struct buffered_writer { struct io_factory *factory; struct dm_bufio_client *client; struct dm_buffer *buffer; sector_t limit; sector_t block_number; u8 *start; u8 *end; int error; }; static void uds_get_io_factory(struct io_factory *factory) { atomic_inc(&factory->ref_count); } int uds_make_io_factory(struct block_device *bdev, struct io_factory **factory_ptr) { int result; struct io_factory *factory; result = vdo_allocate(1, struct io_factory, __func__, &factory); if (result != VDO_SUCCESS) return result; factory->bdev = bdev; atomic_set_release(&factory->ref_count, 1); *factory_ptr = factory; return UDS_SUCCESS; } int uds_replace_storage(struct io_factory *factory, struct block_device *bdev) { factory->bdev = bdev; return UDS_SUCCESS; } /* Free an I/O factory once all references have been released. */ void uds_put_io_factory(struct io_factory *factory) { if (atomic_add_return(-1, &factory->ref_count) <= 0) vdo_free(factory); } size_t uds_get_writable_size(struct io_factory *factory) { return bdev_nr_bytes(factory->bdev); } /* Create a struct dm_bufio_client for an index region starting at offset. 
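 * The block_offset is expressed in UDS_BLOCK_SIZE blocks and is converted to a sector
 * offset for dm-bufio using SECTORS_PER_BLOCK.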
*/ int uds_make_bufio(struct io_factory *factory, off_t block_offset, size_t block_size, unsigned int reserved_buffers, struct dm_bufio_client **client_ptr) { struct dm_bufio_client *client; #ifdef DM_BUFIO_CLIENT_NO_SLEEP client = dm_bufio_client_create(factory->bdev, block_size, reserved_buffers, 0, NULL, NULL, 0); #else client = dm_bufio_client_create(factory->bdev, block_size, reserved_buffers, 0, NULL, NULL); #endif if (IS_ERR(client)) return -PTR_ERR(client); dm_bufio_set_sector_offset(client, block_offset * SECTORS_PER_BLOCK); *client_ptr = client; return UDS_SUCCESS; } static void read_ahead(struct buffered_reader *reader, sector_t block_number) { if (block_number < reader->limit) { sector_t read_ahead = min((sector_t) MAX_READ_AHEAD_BLOCKS, reader->limit - block_number); dm_bufio_prefetch(reader->client, block_number, read_ahead); } } void uds_free_buffered_reader(struct buffered_reader *reader) { if (reader == NULL) return; if (reader->buffer != NULL) dm_bufio_release(reader->buffer); dm_bufio_client_destroy(reader->client); uds_put_io_factory(reader->factory); vdo_free(reader); } /* Create a buffered reader for an index region starting at offset. */ int uds_make_buffered_reader(struct io_factory *factory, off_t offset, u64 block_count, struct buffered_reader **reader_ptr) { int result; struct dm_bufio_client *client = NULL; struct buffered_reader *reader = NULL; result = uds_make_bufio(factory, offset, UDS_BLOCK_SIZE, 1, &client); if (result != UDS_SUCCESS) return result; result = vdo_allocate(1, struct buffered_reader, "buffered reader", &reader); if (result != VDO_SUCCESS) { dm_bufio_client_destroy(client); return result; } *reader = (struct buffered_reader) { .factory = factory, .client = client, .buffer = NULL, .limit = block_count, .block_number = 0, .start = NULL, .end = NULL, }; read_ahead(reader, 0); uds_get_io_factory(factory); *reader_ptr = reader; return UDS_SUCCESS; } static int position_reader(struct buffered_reader *reader, sector_t block_number, off_t offset) { struct dm_buffer *buffer = NULL; void *data; if ((reader->end == NULL) || (block_number != reader->block_number)) { if (block_number >= reader->limit) return UDS_OUT_OF_RANGE; if (reader->buffer != NULL) dm_bufio_release(vdo_forget(reader->buffer)); data = dm_bufio_read(reader->client, block_number, &buffer); if (IS_ERR(data)) return -PTR_ERR(data); reader->buffer = buffer; reader->start = data; if (block_number == reader->block_number + 1) read_ahead(reader, block_number + 1); } reader->block_number = block_number; reader->end = reader->start + offset; return UDS_SUCCESS; } static size_t bytes_remaining_in_read_buffer(struct buffered_reader *reader) { return (reader->end == NULL) ? 0 : reader->start + UDS_BLOCK_SIZE - reader->end; } static int reset_reader(struct buffered_reader *reader) { sector_t block_number; if (bytes_remaining_in_read_buffer(reader) > 0) return UDS_SUCCESS; block_number = reader->block_number; if (reader->end != NULL) block_number++; return position_reader(reader, block_number, 0); } int uds_read_from_buffered_reader(struct buffered_reader *reader, u8 *data, size_t length) { int result = UDS_SUCCESS; size_t chunk_size; while (length > 0) { result = reset_reader(reader); if (result != UDS_SUCCESS) return result; chunk_size = min(length, bytes_remaining_in_read_buffer(reader)); memcpy(data, reader->end, chunk_size); length -= chunk_size; data += chunk_size; reader->end += chunk_size; } return UDS_SUCCESS; } /* * Verify that the next data on the reader matches the required value. 
If the value matches, the * matching contents are consumed. If the value does not match, the reader state is unchanged. */ int uds_verify_buffered_data(struct buffered_reader *reader, const u8 *value, size_t length) { int result = UDS_SUCCESS; size_t chunk_size; sector_t start_block_number = reader->block_number; int start_offset = reader->end - reader->start; while (length > 0) { result = reset_reader(reader); if (result != UDS_SUCCESS) { result = UDS_CORRUPT_DATA; break; } chunk_size = min(length, bytes_remaining_in_read_buffer(reader)); if (memcmp(value, reader->end, chunk_size) != 0) { result = UDS_CORRUPT_DATA; break; } length -= chunk_size; value += chunk_size; reader->end += chunk_size; } if (result != UDS_SUCCESS) position_reader(reader, start_block_number, start_offset); return result; } /* Create a buffered writer for an index region starting at offset. */ int uds_make_buffered_writer(struct io_factory *factory, off_t offset, u64 block_count, struct buffered_writer **writer_ptr) { int result; struct dm_bufio_client *client = NULL; struct buffered_writer *writer; result = uds_make_bufio(factory, offset, UDS_BLOCK_SIZE, 1, &client); if (result != UDS_SUCCESS) return result; result = vdo_allocate(1, struct buffered_writer, "buffered writer", &writer); if (result != VDO_SUCCESS) { dm_bufio_client_destroy(client); return result; } *writer = (struct buffered_writer) { .factory = factory, .client = client, .buffer = NULL, .limit = block_count, .start = NULL, .end = NULL, .block_number = 0, .error = UDS_SUCCESS, }; uds_get_io_factory(factory); *writer_ptr = writer; return UDS_SUCCESS; } static size_t get_remaining_write_space(struct buffered_writer *writer) { return writer->start + UDS_BLOCK_SIZE - writer->end; } static int __must_check prepare_next_buffer(struct buffered_writer *writer) { struct dm_buffer *buffer = NULL; void *data; if (writer->block_number >= writer->limit) { writer->error = UDS_OUT_OF_RANGE; return UDS_OUT_OF_RANGE; } data = dm_bufio_new(writer->client, writer->block_number, &buffer); if (IS_ERR(data)) { writer->error = -PTR_ERR(data); return writer->error; } writer->buffer = buffer; writer->start = data; writer->end = data; return UDS_SUCCESS; } static int flush_previous_buffer(struct buffered_writer *writer) { size_t available; if (writer->buffer == NULL) return writer->error; if (writer->error == UDS_SUCCESS) { available = get_remaining_write_space(writer); if (available > 0) memset(writer->end, 0, available); dm_bufio_mark_buffer_dirty(writer->buffer); } dm_bufio_release(writer->buffer); writer->buffer = NULL; writer->start = NULL; writer->end = NULL; writer->block_number++; return writer->error; } void uds_free_buffered_writer(struct buffered_writer *writer) { int result; if (writer == NULL) return; flush_previous_buffer(writer); result = -dm_bufio_write_dirty_buffers(writer->client); if (result != UDS_SUCCESS) vdo_log_warning_strerror(result, "%s: failed to sync storage", __func__); dm_bufio_client_destroy(writer->client); uds_put_io_factory(writer->factory); vdo_free(writer); } /* * Append data to the buffer, writing as needed. If no data is provided, zeros are written instead. * If a write error occurs, it is recorded and returned on every subsequent write attempt. 
*/ int uds_write_to_buffered_writer(struct buffered_writer *writer, const u8 *data, size_t length) { int result = writer->error; size_t chunk_size; while ((length > 0) && (result == UDS_SUCCESS)) { if (writer->buffer == NULL) { result = prepare_next_buffer(writer); continue; } chunk_size = min(length, get_remaining_write_space(writer)); if (data == NULL) { memset(writer->end, 0, chunk_size); } else { memcpy(writer->end, data, chunk_size); data += chunk_size; } length -= chunk_size; writer->end += chunk_size; if (get_remaining_write_space(writer) == 0) result = uds_flush_buffered_writer(writer); } return result; } int uds_flush_buffered_writer(struct buffered_writer *writer) { if (writer->error != UDS_SUCCESS) return writer->error; return flush_previous_buffer(writer); } vdo-8.3.1.1/utils/uds/io-factory.h000066400000000000000000000037301476467262700166620ustar00rootroot00000000000000/* SPDX-License-Identifier: GPL-2.0-only */ /* * Copyright 2023 Red Hat */ #ifndef UDS_IO_FACTORY_H #define UDS_IO_FACTORY_H #include /* * The I/O factory manages all low-level I/O operations to the underlying storage device. Its main * clients are the index layout and the volume. The buffered reader and buffered writer interfaces * are helpers for accessing data in a contiguous range of storage blocks. */ struct buffered_reader; struct buffered_writer; struct io_factory; enum { UDS_BLOCK_SIZE = 4096, SECTORS_PER_BLOCK = UDS_BLOCK_SIZE >> SECTOR_SHIFT, }; int __must_check uds_make_io_factory(struct block_device *bdev, struct io_factory **factory_ptr); int __must_check uds_replace_storage(struct io_factory *factory, struct block_device *bdev); void uds_put_io_factory(struct io_factory *factory); size_t __must_check uds_get_writable_size(struct io_factory *factory); int __must_check uds_make_bufio(struct io_factory *factory, off_t block_offset, size_t block_size, unsigned int reserved_buffers, struct dm_bufio_client **client_ptr); int __must_check uds_make_buffered_reader(struct io_factory *factory, off_t offset, u64 block_count, struct buffered_reader **reader_ptr); void uds_free_buffered_reader(struct buffered_reader *reader); int __must_check uds_read_from_buffered_reader(struct buffered_reader *reader, u8 *data, size_t length); int __must_check uds_verify_buffered_data(struct buffered_reader *reader, const u8 *value, size_t length); int __must_check uds_make_buffered_writer(struct io_factory *factory, off_t offset, u64 block_count, struct buffered_writer **writer_ptr); void uds_free_buffered_writer(struct buffered_writer *buffer); int __must_check uds_write_to_buffered_writer(struct buffered_writer *writer, const u8 *data, size_t length); int __must_check uds_flush_buffered_writer(struct buffered_writer *writer); #endif /* UDS_IO_FACTORY_H */ vdo-8.3.1.1/utils/uds/linux/000077500000000000000000000000001476467262700155715ustar00rootroot00000000000000vdo-8.3.1.1/utils/uds/linux/atomic.h000066400000000000000000000373721476467262700172320ustar00rootroot00000000000000/* * Copyright Red Hat * * This program is free software; you can redistribute it and/or * modify it under the terms of the GNU General Public License * as published by the Free Software Foundation; either version 2 * of the License, or (at your option) any later version. * * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. 
* * You should have received a copy of the GNU General Public License * along with this program; if not, write to the Free Software * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA * 02110-1301, USA. */ #ifndef LINUX_ATOMIC_H #define LINUX_ATOMIC_H #include // The atomic interfaces are chosen to exactly match those interfaces defined // by the Linux kernel. The rest of this file is the matching user-mode // implementation. typedef struct { int32_t value; } atomic_t; typedef struct { int64_t value; } atomic64_t; #define ATOMIC_INIT(i) { (i) } /***************************************************************************** * Beginning of the barrier methods. *****************************************************************************/ /** * Stop GCC from moving memory operations across a point in the instruction * stream. This is how the kernel uses this method. **/ static inline void barrier(void) { /* * asm volatile cannot be removed, and the memory clobber tells the * compiler not to move memory accesses past the asm. We don't * actually need any instructions issued on x86_64, as synchronizing * instructions are ordered with respect to both loads and stores, * with some irrelevant-to-us exceptions. */ __asm__ __volatile__("" : : : "memory"); } /** * Provide a memory barrier. * * Generate a full memory fence for the compiler and CPU. Load and store * operations issued before the fence will not be re-ordered with operations * issued after the fence. * * We also use this method in association with the __sync builtins. In earlier * versions of GCC (at least through 4.6), the __sync operations didn't * actually act as the memory barriers the compiler documentation says they * should. Even as of GCC 8, it looks like the Linux kernel developers * disagree with the compiler developers as to what constitutes a barrier at * least on s390x, where the kernel uses explicit barriers after certain * atomic operations and GCC does not. * * Rather than investigate the current status of barriers in GCC (which is an * architecture-specific issue), and since in user mode the performance of * these operations is not critical, we can afford to be cautious and insert * extra barriers, until such time as we have more time to investigate and * gain confidence in the current state of GCC barriers. **/ static inline void smp_mb(void) { #if defined __x86_64__ /* * X86 full fence. Supposedly __sync_synchronize() will do this, but * either the GCC documentation is a lie or GCC is broken. * * FIXME: http://blogs.sun.com/dave/entry/atomic_fetch_and_add_vs says * atomicAdd of zero may be a better way to spell this on current CPUs. */ __asm__ __volatile__("mfence" : : : "memory"); #elif defined __aarch64__ __asm__ __volatile__("dmb ish" : : : "memory"); #elif defined __s390__ __asm__ __volatile__("bcr 14,0" : : : "memory"); #elif defined __PPC__ __asm__ __volatile__("sync" : : : "memory"); #elif defined __riscv __asm__ __volatile__("fence rw,rw" : : : "memory"); #elif defined __loongarch64 __asm__ __volatile__("dbar 0" : : : "memory"); #else #error "no fence defined" #endif } /** * Provide a read memory barrier. * * Memory load operations that precede this fence will be prevented from * changing order with any that follow this fence, by either the compiler or * the CPU. This can be used to ensure that the load operations accessing the * fields of a structure are not re-ordered so they actually take effect before * a pointer to the structure is resolved. 
**/ static inline void smp_rmb(void) { #if defined __x86_64__ // The implementation on x86 is more aggressive than necessary. __asm__ __volatile__("lfence" : : : "memory"); #elif defined __aarch64__ __asm__ __volatile__("dmb ishld" : : : "memory"); #elif defined __s390__ __asm__ __volatile__("bcr 14,0" : : : "memory"); #elif defined __PPC__ __asm__ __volatile__("lwsync" : : : "memory"); #elif defined __riscv __asm__ __volatile__("fence r,r" : : : "memory"); #elif defined __loongarch64 __asm__ __volatile__("dbar 0" : : : "memory"); #else #error "no fence defined" #endif } /** * Provide a write memory barrier. * * Memory store operations that precede this fence will be prevented from * changing order with any that follow this fence, by either the compiler or * the CPU. This can be used to ensure that the store operations initializing * the fields of a structure are not re-ordered so they actually take effect * after a pointer to the structure is published. **/ static inline void smp_wmb(void) { #if defined __x86_64__ // The implementation on x86 is more aggressive than necessary. __asm__ __volatile__("sfence" : : : "memory"); #elif defined __aarch64__ __asm__ __volatile__("dmb ishst" : : : "memory"); #elif defined __s390__ __asm__ __volatile__("bcr 14,0" : : : "memory"); #elif defined __PPC__ __asm__ __volatile__("lwsync" : : : "memory"); #elif defined __riscv __asm__ __volatile__("fence w,w" : : : "memory"); #elif defined __loongarch64 __asm__ __volatile__("dbar 0" : : : "memory"); #else #error "no fence defined" #endif } /** * Provide a memory barrier before an atomic read-modify-write operation * that does not imply one. **/ static inline void smp_mb__before_atomic(void) { #if defined(__x86_64__) || defined(__s390__) // Atomic operations are already serializing on x86 and s390 barrier(); #else smp_mb(); #endif } /** * Provide a memory barrier after an atomic read-modify-write operation * that does not imply one. **/ static inline void smp_mb__after_atomic(void) { #if defined(__x86_64__) || defined(__s390__) // Atomic operations are already serializing on x86 and s390 barrier(); #else smp_mb(); #endif } /***************************************************************************** * Beginning of the methods for defeating compiler optimization. *****************************************************************************/ #define READ_ONCE(x) (x) #define WRITE_ONCE(x, val) ((x) = (val)) /***************************************************************************** * Beginning of the 32 bit atomic support. *****************************************************************************/ /* * As noted above, there are a lot of explicit barriers here, in places * where we need barriers. Ideally, GCC should just Get It Right on all the * platforms. But there have been bugs in the past, and it looks like there * might be one still (in GCC 8) at least on s390 (no bug report filed yet), * and researching it may take more time than we have available before we have * to ship. It also requires manual inspection for each platform, as there's * no good general way to test whether the compiler gets the barriers correct. */ /** * Add a signed int to a 32-bit atomic variable. The addition is atomic, but * there are no memory barriers implied by this method. 
* * @param delta the value to be added to (or subtracted from) the variable * @param atom a pointer to the atomic variable **/ static inline void atomic_add(int delta, atomic_t *atom) { /* * According to the kernel documentation, the addition is atomic, but there * are no memory barriers implied by this method. * * The x86 implementation does do memory barriers. */ __sync_add_and_fetch(&atom->value, delta); } /** * Add a signed int to a 32-bit atomic variable. The addition is properly * atomic, and there are memory barriers. * * @param atom a pointer to the atomic variable * @param delta the value to be added (or subtracted) from the variable * * @return the new value of the atom after the add operation **/ static inline int atomic_add_return(int delta, atomic_t *atom) { smp_mb(); int result = __sync_add_and_fetch(&atom->value, delta); smp_mb(); return result; } /** * Compare and exchange a 32-bit atomic variable. The operation is properly * atomic and performs a memory barrier. * * @param atom a pointer to the atomic variable * @param old the value that must be present to perform the swap * @param new the value to be swapped for the required value * * @return the old value **/ static inline int atomic_cmpxchg(atomic_t *atom, int old, int new) { smp_mb(); int result = __sync_val_compare_and_swap(&atom->value, old, new); smp_mb(); return result; } /** * Increment a 32-bit atomic variable, without any memory barriers. * * @param atom a pointer to the atomic variable **/ static inline void atomic_inc(atomic_t *atom) { /* * According to the kernel documentation, the addition is atomic, but there * are no memory barriers implied by this method. * * The x86 implementation does do memory barriers. */ __sync_add_and_fetch(&atom->value, 1); } /** * Increment a 32-bit atomic variable. The addition is properly atomic, and * there are memory barriers. * * @param atom a pointer to the atomic variable * * @return the new value of the atom after the increment **/ static inline long atomic_inc_return(atomic_t *atom) { return atomic_add_return(1, atom); } /** * Decrement a 32-bit atomic variable, without any memory barriers. * * @param atom a pointer to the atomic variable **/ static inline void atomic_dec(atomic_t *atom) { /* * According to the kernel documentation, the subtraction is atomic, but * there are no memory barriers implied by this method. * * The x86 implementation does do memory barriers. */ __sync_sub_and_fetch(&atom->value, 1); } /** * Read a 32-bit atomic variable, without any memory barriers. * * @param atom a pointer to the atomic variable **/ static inline int atomic_read(const atomic_t *atom) { return READ_ONCE(atom->value); } /** * Read a 32-bit atomic variable, with an acquire memory barrier. * * @param atom a pointer to the atomic variable **/ static inline int atomic_read_acquire(const atomic_t *atom) { int value = READ_ONCE(atom->value); smp_mb(); return value; } /** * Set a 32-bit atomic variable, without any memory barriers. * * @param atom a pointer to the atomic variable * @param value the value to set it to **/ static inline void atomic_set(atomic_t *atom, int value) { atom->value = value; } /** * Set a 32-bit atomic variable, with a release memory barrier. * * @param atom a pointer to the atomic variable * @param value the value to set it to **/ static inline void atomic_set_release(atomic_t *atom, int value) { smp_mb(); atomic_set(atom, value); } /***************************************************************************** * Beginning of the 64 bit atomic support. 
*****************************************************************************/ /** * Add a signed long to a 64-bit atomic variable. The addition is atomic, but * there are no memory barriers implied by this method. * * @param delta the value to be added to (or subtracted from) the variable * @param atom a pointer to the atomic variable **/ static inline void atomic64_add(long delta, atomic64_t *atom) { /* * According to the kernel documentation, the addition is atomic, but there * are no memory barriers implied by this method. * * The x86 implementation does do memory barriers. */ __sync_add_and_fetch(&atom->value, delta); } /** * Add a signed long to a 64-bit atomic variable. The addition is properly * atomic, and there are memory barriers. * * @param atom a pointer to the atomic variable * @param delta the value to be added (or subtracted) from the variable * * @return the new value of the atom after the add operation **/ static inline long atomic64_add_return(long delta, atomic64_t *atom) { smp_mb(); long result = __sync_add_and_fetch(&atom->value, delta); smp_mb(); return result; } /** * Compare and exchange a 64-bit atomic variable. The operation is properly * atomic and performs a memory barrier. * * @param atom a pointer to the atomic variable * @param old the value that must be present to perform the swap * @param new the value to be swapped for the required value * * @return the old value **/ static inline long atomic64_cmpxchg(atomic64_t *atom, long old, long new) { smp_mb(); long result = __sync_val_compare_and_swap(&atom->value, old, new); smp_mb(); return result; } /** * Increment a 64-bit atomic variable, without any memory barriers. * * @param atom a pointer to the atomic variable **/ static inline void atomic64_inc(atomic64_t *atom) { /* * According to the kernel documentation, the addition is atomic, but there * are no memory barriers implied by this method. * * The x86 implementation does do memory barriers. */ __sync_add_and_fetch(&atom->value, 1); } /** * Increment a 64-bit atomic variable. The addition is properly atomic, and * there are memory barriers. * * @param atom a pointer to the atomic variable * * @return the new value of the atom after the increment **/ static inline long atomic64_inc_return(atomic64_t *atom) { return atomic64_add_return(1, atom); } /** * Read a 64-bit atomic variable, without any memory barriers. * * @param atom a pointer to the atomic variable **/ static inline long atomic64_read(const atomic64_t *atom) { return READ_ONCE(atom->value); } /** * Read a 64-bit atomic variable, with an acquire memory barrier. * * @param atom a pointer to the atomic variable **/ static inline long atomic64_read_acquire(const atomic64_t *atom) { long value = READ_ONCE(atom->value); smp_mb(); return value; } /** * Set a 64-bit atomic variable, without any memory barriers. * * @param atom a pointer to the atomic variable * @param value the value to set it to **/ static inline void atomic64_set(atomic64_t *atom, long value) { atom->value = value; } /** * Set a 64-bit atomic variable, with a release memory barrier. * * @param atom a pointer to the atomic variable * @param value the value to set it to **/ static inline void atomic64_set_release(atomic64_t *atom, long value) { smp_mb(); atomic64_set(atom, value); } /***************************************************************************** * Generic exchange support. *****************************************************************************/ /* * Exchange a location's value atomically, with a full memory barrier. 
* * The location is NOT an "atomic*_t" type, but any primitive type for which * an exchange can be done atomically. (This varies by processor, but * generally a word-sized or pointer-sized value is supported.) As this uses a * type-generic compiler interface, it must be implemented as a macro. * * @param PTR a pointer to the location to be updated * @param NEWVAL the new value to be stored * * @return the old value */ #define xchg(PTR,NEWVAL) \ __extension__ ({ \ __typeof__(*(PTR)) __xchg_result; \ __typeof__(*(PTR)) __xchg_new_value = (NEWVAL); \ smp_mb(); /* paranoia, for old gcc bugs */ \ __xchg_result = __atomic_exchange_n((PTR), __xchg_new_value, \ __ATOMIC_SEQ_CST); \ smp_mb(); /* more paranoia */ \ __xchg_result; \ }) #endif /* LINUX_ATOMIC_H */ vdo-8.3.1.1/utils/uds/linux/bitops.h000066400000000000000000000044441476467262700172500ustar00rootroot00000000000000/* SPDX-License-Identifier: GPL-2.0-only */ /* * These are the small parts of linux/bits.h that we actually require for * unit testing, reimplemented without all of the architecture specific * macros. * * Copyright 2023 Red Hat * */ #ifndef _TOOLS_LINUX_BITOPS_H_ #define _TOOLS_LINUX_BITOPS_H_ #include #include #include // From vdso/const.h #define UL(x) (_UL(x)) #define ULL(x) (_ULL(x)) #define BITS_PER_LONG 64 #define BIT_MASK(nr) (UL(1) << ((nr) % BITS_PER_LONG)) #define BIT_WORD(nr) ((nr) / BITS_PER_LONG) #define BITMAP_FIRST_WORD_MASK(start) (~0UL << ((start) & (BITS_PER_LONG - 1))) #define BITS_PER_TYPE(type) (sizeof(type) * BITS_PER_BYTE) #define BITS_TO_LONGS(nr) __KERNEL_DIV_ROUND_UP(nr, BITS_PER_TYPE(long)) #define BITS_TO_U64(nr) __KERNEL_DIV_ROUND_UP(nr, BITS_PER_TYPE(u64)) #define BITS_TO_U32(nr) __KERNEL_DIV_ROUND_UP(nr, BITS_PER_TYPE(u32)) #define BITS_TO_BYTES(nr) __KERNEL_DIV_ROUND_UP(nr, BITS_PER_TYPE(char)) /** * __set_bit - Set a bit in memory * @nr: the bit to set * @addr: the address to start counting from * * Unlike set_bit(), this function is non-atomic and may be reordered. * If it's called on the same region of memory simultaneously, the effect * may be that only one operation succeeds. **/ static inline void __set_bit(int nr, volatile unsigned long *addr) { unsigned long mask = BIT_MASK(nr); addr[BIT_WORD(nr)] |= mask; } /**********************************************************************/ static inline void __clear_bit(int nr, volatile unsigned long *addr) { unsigned long mask = BIT_MASK(nr); addr[BIT_WORD(nr)] &= ~mask; } /** * test_bit - Determine whether a bit is set * @nr: bit number to test * @addr: Address to start counting from **/ static inline int test_bit(int nr, const volatile unsigned long *addr) { return 1UL & (addr[BIT_WORD(nr)] >> (nr & (BITS_PER_LONG-1))); } /**********************************************************************/ unsigned long __must_check find_next_zero_bit(const unsigned long *addr, unsigned long size, unsigned long offset); /**********************************************************************/ unsigned long __must_check find_first_zero_bit(const unsigned long *addr, unsigned long size); #endif /* _TOOLS_LINUX_BITOPS_H_ */ vdo-8.3.1.1/utils/uds/linux/bits.h000066400000000000000000000004321476467262700167020ustar00rootroot00000000000000/* SPDX-License-Identifier: GPL-2.0-only */ /* * Copyright 2023 Red Hat */ #ifndef UDS_LINUX_BITS_H #define UDS_LINUX_BITS_H /* * User-space header providing necessary define provided by the same kernel-space header. 
*/ #define BITS_PER_BYTE 8 #endif /* UDS_LINUX_BITS_H */ vdo-8.3.1.1/utils/uds/linux/blkdev.h000066400000000000000000000045451476467262700172210ustar00rootroot00000000000000/* * Unit test requirements from linux/blkdev.h and other kernel headers. */ #ifndef LINUX_BLKDEV_H #define LINUX_BLKDEV_H #include #include #include #include #define SECTOR_SHIFT 9 #define SECTOR_SIZE 512 #define BDEVNAME_SIZE 32 /* Largest string for a blockdev identifier */ /* Defined in linux/kdev_t.h */ #define MINORBITS 20 #define MINORMASK ((1U << MINORBITS) - 1) #define MAJOR(dev) ((unsigned int) ((dev) >> MINORBITS)) #define MINOR(dev) ((unsigned int) ((dev) & MINORMASK)) #define format_dev_t(buffer, dev) \ sprintf(buffer, "%u:%u", MAJOR(dev), MINOR(dev)) /* Defined in linux/blk_types.h */ typedef u32 __bitwise blk_opf_t; typedef unsigned int blk_qc_t; typedef u8 __bitwise blk_status_t; #define BLK_STS_OK 0 #define BLK_STS_NOSPC ((blk_status_t)3) #define BLK_STS_RESOURCE ((blk_status_t)9) #define BLK_STS_IOERR ((blk_status_t)10) /* hack for vdo, don't use elsewhere */ #define BLK_STS_VDO_INJECTED ((blk_status_t)31) struct bio; struct block_device { int fd; dev_t bd_dev; /* This is only here for bdev_nr_bytes(). */ loff_t size; }; /* Defined in linux/blk-core.c */ static const struct { int error; const char *name; } blk_errors[] = { [BLK_STS_OK] = { 0, "" }, [BLK_STS_NOSPC] = { -ENOSPC, "critical space allocation" }, [BLK_STS_RESOURCE] = { -ENOMEM, "kernel resource" }, /* error specifically for VDO unit tests */ [BLK_STS_VDO_INJECTED] = { 31, "vdo injected error" }, /* everything else not covered above: */ [BLK_STS_IOERR] = { -EIO, "I/O" }, }; /**********************************************************************/ static inline int blk_status_to_errno(blk_status_t status) { int idx = (int) status; return blk_errors[idx].error; } /**********************************************************************/ static inline blk_status_t errno_to_blk_status(int error) { unsigned int i; for (i = 0; i < ARRAY_SIZE(blk_errors); i++) { if (blk_errors[i].error == error) return (blk_status_t)i; } return BLK_STS_IOERR; } /**********************************************************************/ void submit_bio_noacct(struct bio *bio); /**********************************************************************/ static inline loff_t bdev_nr_bytes(struct block_device *bdev) { return bdev->size; } #endif // LINUX_BLKDEV_H vdo-8.3.1.1/utils/uds/linux/build_bug.h000066400000000000000000000012121476467262700176720ustar00rootroot00000000000000/* SPDX-License-Identifier: GPL-2.0-only */ /* * Copyright 2023 Red Hat */ #ifndef _LINUX_BUILD_BUG_H #define _LINUX_BUILD_BUG_H #define BUILD_BUG_ON(condition) \ BUILD_BUG_ON_MSG(condition, "BUILD_BUG_ON failed: " #condition) #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg) #define _compiletime_assert(condition, msg, prefix, suffix) \ __compiletime_assert(condition, msg, prefix, suffix) #define compiletime_assert(condition, msg) \ _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__) #define __compiletime_assert(condition, msg, prefix, suffix) do { } while (0) #endif /* _LINUX_BUILD_BUG_H */ vdo-8.3.1.1/utils/uds/linux/cache.h000066400000000000000000000011011476467262700167760ustar00rootroot00000000000000/* SPDX-License-Identifier: GPL-2.0 */ /* * User space definitions for things in linux/cache.h * * Copyright 2023 Red Hat */ #ifndef __LINUX_CACHE_H #define __LINUX_CACHE_H #include #if defined(__PPC__) /* N.B.: Some PPC processors have smaller cache lines. 
*/ #define L1_CACHE_BYTES 128 #elif defined(__s390x__) #define L1_CACHE_BYTES 256 #elif defined(__x86_64__) || defined(__aarch64__) || defined(__riscv) || defined (__loongarch64) #define L1_CACHE_BYTES 64 #else #error "unknown cache line size" #endif #endif /* __LINUX_CACHE_H */ vdo-8.3.1.1/utils/uds/linux/compiler.h000066400000000000000000000017301476467262700175550ustar00rootroot00000000000000/* SPDX-License-Identifier: GPL-2.0-only */ /* * Copyright 2023 Red Hat */ #ifndef LINUX_COMPILER_H #define LINUX_COMPILER_H #include /* * CPU Branch-prediction hints, courtesy of GCC. Defining these as inline functions instead of * macros spoils their magic, sadly. */ #define likely(expr) __builtin_expect(!!(expr), 1) #define unlikely(expr) __builtin_expect(!!(expr), 0) /* * Count the elements in a static array while attempting to catch some type errors. (See * http://stackoverflow.com/a/1598827 for an explanation.) */ #define ARRAY_SIZE(x) ((sizeof(x) / sizeof(0[x])) / ((size_t)(!(sizeof(x) % sizeof(0[x]))))) /* Defined in linux/container_of.h */ #define container_of(ptr, type, member) \ __extension__({ \ __typeof__(((type *) 0)->member) * __mptr = (ptr); \ (type *) ((char *) __mptr - offsetof(type, member)); \ }) #endif /* LINUX_COMPILER_H */ vdo-8.3.1.1/utils/uds/linux/compiler_attributes.h000066400000000000000000000015471476467262700220310ustar00rootroot00000000000000/* SPDX-License-Identifier: GPL-2.0 */ /* * User space attribute defines to match the ones in the kernel's * linux/compiler_attributes.h * * Copyright 2023 Red Hat */ #ifndef LINUX_COMPILER_ATTRIBUTES_H #define LINUX_COMPILER_ATTRIBUTES_H #define __always_unused __attribute__((unused)) #define __maybe_unused __attribute__((unused)) #define __must_check __attribute__((warn_unused_result)) #define noinline __attribute__((__noinline__)) #define __packed __attribute__((packed)) #define __printf(a, b) __attribute__((__format__(printf, a, b))) #define __aligned(x) __attribute__((__aligned__(x))) #define __must_hold(x) #define __releases(x) #if __has_attribute(__fallthrough__) #define fallthrough __attribute__((__fallthrough__)) #else #define fallthrough do {} while (0) /* fallthrough */ #endif #endif /* LINUX_COMPILER_ATTRIBUTES_H */ vdo-8.3.1.1/utils/uds/linux/const.h000066400000000000000000000021051476467262700170660ustar00rootroot00000000000000/* SPDX-License-Identifier: GPL-2.0-only */ /* * Definitions from linux/const.h that we actually require for * unit testing. * * Copyright 2023 Red Hat * */ #ifndef _UAPI_LINUX_CONST_H #define _UAPI_LINUX_CONST_H /* Some constant macros are used in both assembler and * C code. Therefore we cannot annotate them always with * 'UL' and other type specifiers unilaterally. We * use the following macros to deal with this. * * Similarly, _AT() will cast an expression with a type in C, but * leave it unchanged in asm. */ /* Macros for dealing with constants. 
*/ #ifdef __ASSEMBLY__ #define _AC(X,Y) X #define _AT(T,X) X #else #define __AC(X,Y) (X##Y) #define _AC(X,Y) __AC(X,Y) #define _AT(T,X) ((T)(X)) #endif #define _UL(x) (_AC(x, UL)) #define _ULL(x) (_AC(x, ULL)) #define _BITUL(x) (_UL(1) << (x)) #define _BITULL(x) (_ULL(1) << (x)) #define __ALIGN_KERNEL(x, a) __ALIGN_KERNEL_MASK(x, (typeof(x))(a) - 1) #define __ALIGN_KERNEL_MASK(x, mask) (((x) + (mask)) & ~(mask)) #define __KERNEL_DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d)) #endif /* _UAPI_LINUX_CONST_H */ vdo-8.3.1.1/utils/uds/linux/dm-bufio.h000066400000000000000000000041531476467262700174470ustar00rootroot00000000000000/* * Copyright Red Hat * * This program is free software; you can redistribute it and/or * modify it under the terms of the GNU General Public License * as published by the Free Software Foundation; either version 2 * of the License, or (at your option) any later version. * * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with this program; if not, write to the Free Software * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA * 02110-1301, USA. */ #ifndef LINUX_DM_BUFIO_H #define LINUX_DM_BUFIO_H #include /* These are just the parts of dm-bufio interface that UDS uses. */ struct dm_bufio_client; struct dm_buffer; /* * Flags for dm_bufio_client_create */ #define DM_BUFIO_CLIENT_NO_SLEEP 0x1 struct dm_bufio_client * dm_bufio_client_create(struct block_device *bdev, unsigned block_size, unsigned reserved_buffers, unsigned aux_size, void (*alloc_callback)(struct dm_buffer *), void (*write_callback)(struct dm_buffer *), unsigned int flags); void dm_bufio_client_destroy(struct dm_bufio_client *client); void dm_bufio_set_sector_offset(struct dm_bufio_client *client, sector_t start); void *dm_bufio_new(struct dm_bufio_client *client, sector_t block, struct dm_buffer **buffer_ptr); void *dm_bufio_read(struct dm_bufio_client *client, sector_t block, struct dm_buffer **buffer_ptr); void dm_bufio_prefetch(struct dm_bufio_client *client, sector_t block, unsigned block_count); void dm_bufio_release(struct dm_buffer *buffer); void dm_bufio_release_move(struct dm_buffer *buffer, sector_t new_block); void dm_bufio_mark_buffer_dirty(struct dm_buffer *buffer); int dm_bufio_write_dirty_buffers(struct dm_bufio_client *client); void *dm_bufio_get_block_data(struct dm_buffer *buffer); #endif /* LINUX_DM_BUFIO_H */ vdo-8.3.1.1/utils/uds/linux/err.h000066400000000000000000000011051476467262700165270ustar00rootroot00000000000000/* SPDX-License-Identifier: GPL-2.0-only */ /* * Copyright 2023 Red Hat */ #ifndef LINUX_ERR_H #define LINUX_ERR_H #include #include #define MAX_ERRNO 4095 #define IS_ERR_VALUE(x) unlikely((unsigned long)(void *)(x) >= (unsigned long)-MAX_ERRNO) static inline void * __must_check ERR_PTR(long error) { return (void *) error; } static inline long __must_check PTR_ERR(const void *ptr) { return (long) ptr; } static inline bool __must_check IS_ERR(const void *ptr) { return IS_ERR_VALUE((unsigned long)ptr); } #endif /* LINUX_ERR_H */ vdo-8.3.1.1/utils/uds/linux/limits.h000066400000000000000000000011251476467262700172420ustar00rootroot00000000000000/* SPDX-License-Identifier: GPL-2.0-only */ /* * Copyright 2023 Red Hat */ #ifndef UDS_LINUX_LIMITS_H #define UDS_LINUX_LIMITS_H #include #include #define U8_MAX 
((u8)~0ul) #define S8_MAX ((s8)(U8_MAX >> 1)) #define U16_MAX ((u16)~0ul) #define S16_MAX ((s16)(U16_MAX >> 1)) #define U32_MAX ((u32)~0ul) #define S32_MAX ((s32)(U32_MAX >> 1)) #define U64_MAX ((u64)~0ul) #define S64_MAX ((s64)(U64_MAX >> 1)) /* * NAME_MAX and PATH_MAX were copied from /usr/include/limits/linux.h. */ #define NAME_MAX 255 #define PATH_MAX 4096 #endif /* UDS_LINUX_LIMITS_H */ vdo-8.3.1.1/utils/uds/linux/log2.h000066400000000000000000000025321476467262700166070ustar00rootroot00000000000000/* SPDX-License-Identifier: GPL-2.0-only */ /* * Copyright 2023 Red Hat */ #ifndef _LINUX_LOG2_H #define _LINUX_LOG2_H #include "permassert.h" /* Compute the number of bits to represent n */ static inline unsigned int bits_per(unsigned int n) { unsigned int bits = 1; while (n > 1) { n >>= 1; bits++; } return bits; } /** * is_power_of_2() - Return true if and only if a number is a power of two. */ static inline bool is_power_of_2(uint64_t n) { return (n > 0) && ((n & (n - 1)) == 0); } /** * ilog2() - Efficiently calculate the base-2 logarithm of a number truncated * to an integer value. * @n: The input value. * * This also happens to be the bit index of the highest-order non-zero bit in * the binary representation of the number, which can easily be used to * calculate the bit shift corresponding to a bit mask or an array capacity, * or to calculate the binary floor or ceiling (next lowest or highest power * of two). * * Return: The integer log2 of the value, or -1 if the value is zero. */ static inline int ilog2(uint64_t n) { VDO_ASSERT_LOG_ONLY(n != 0, "ilog2() may not be passed 0"); /* * Many CPUs, including x86, directly support this calculation, so use * the GCC function for counting the number of leading high-order zero * bits. */ return 63 - __builtin_clzll(n); } #endif /* _LINUX_LOG2_H */ vdo-8.3.1.1/utils/uds/linux/mutex.h000066400000000000000000000011061476467262700171020ustar00rootroot00000000000000/* SPDX-License-Identifier: GPL-2.0-only */ /* * Wrap our own mutexes to mimic the kernel. * * Copyright 2023 Red Hat * */ #ifndef LINUX_MUTEX_H #define LINUX_MUTEX_H #include "thread-utils.h" #define DEFINE_MUTEX(mutexname) \ struct mutex mutexname = UDS_MUTEX_INITIALIZER #define mutex_destroy(mutex) uds_destroy_mutex(mutex) #define mutex_init(mutex) \ VDO_ASSERT_LOG_ONLY(uds_init_mutex(mutex) == UDS_SUCCESS, \ "mutex init succeeds") #define mutex_lock(mutex) uds_lock_mutex(mutex) #define mutex_unlock(mutex) uds_unlock_mutex(mutex) #endif // LINUX_MUTEX_H vdo-8.3.1.1/utils/uds/linux/random.h000066400000000000000000000003471476467262700172260ustar00rootroot00000000000000/* SPDX-License-Identifier: GPL-2.0-only */ /* * Copyright 2023 Red Hat */ #ifndef LINUX_RANDOM_H #define LINUX_RANDOM_H #include void get_random_bytes(void *buffer, size_t byte_count); #endif /* LINUX_RANDOM_H */ vdo-8.3.1.1/utils/uds/linux/types.h000066400000000000000000000020171476467262700171060ustar00rootroot00000000000000/* SPDX-License-Identifier: GPL-2.0-only */ /* * Copyright 2023 Red Hat */ #ifndef UDS_LINUX_TYPES_H #define UDS_LINUX_TYPES_H /* * General system type definitions. 
*/ #include #include #include #include #include typedef int8_t s8; typedef uint8_t u8; typedef int16_t s16; typedef uint16_t u16; typedef int32_t s32; typedef uint32_t u32; typedef int64_t s64; typedef uint64_t u64; typedef s8 __s8; typedef u8 __u8; typedef s16 __s16; typedef u16 __u16; typedef s32 __s32; typedef u32 __u32; typedef s64 __s64; typedef u64 __u64; #define __bitwise typedef __u16 __bitwise __le16; typedef __u16 __bitwise __be16; typedef __u32 __bitwise __le32; typedef __u32 __bitwise __be32; typedef __u64 __bitwise __le64; typedef __u64 __bitwise __be64; #define __aligned_u64 __u64 __attribute__((aligned(8))) typedef unsigned int fmode_t; #define FMODE_READ (fmode_t) 0x1 #define FMODE_WRITE (fmode_t) 0x2 typedef int pid_t; typedef u64 sector_t; #endif /* UDS_LINUX_TYPES_H */ vdo-8.3.1.1/utils/uds/linux/unaligned.h000066400000000000000000000074341476467262700177200ustar00rootroot00000000000000/* * Copyright Red Hat * * This program is free software; you can redistribute it and/or * modify it under the terms of the GNU General Public License * as published by the Free Software Foundation; either version 2 * of the License, or (at your option) any later version. * * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with this program; if not, write to the Free Software * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA * 02110-1301, USA. */ #ifndef LINUX_UNALIGNED_H #define LINUX_UNALIGNED_H #include #include /* Type safe comparison macros, similar to the ones in linux/minmax.h. */ /* * If pointers to types are comparable (without dereferencing them and * potentially causing side effects) then types are the same. */ #define __typecheck(x, y) \ (!!(sizeof((typeof(x) *)1 == (typeof(y) *)1))) /* * Hack for VDO to replace use of the kernel's __is_constexpr() in __cmp_ macros. * VDO cannot use __is_constexpr() due to it relying on a GCC extension to allow sizeof(void). */ #define __constcheck(x, y) \ (__builtin_constant_p(x) && __builtin_constant_p(y)) /* It takes two levels of macro expansion to compose the unique temp names. */ #define ___PASTE(a,b) a##b #define __PASTE(a,b) ___PASTE(a,b) #define __UNIQUE_ID(prefix) __PASTE(__PASTE(__UNIQUE_ID_, prefix), __COUNTER__) /* Defined in linux/minmax.h */ #define __cmp_op_min < #define __cmp_op_max > #define __cmp(op, x, y) ((x) __cmp_op_##op (y) ? 
(x) : (y)) #define __cmp_once(op, x, y, unique_x, unique_y) \ __extension__({ \ typeof(x) unique_x = (x); \ typeof(y) unique_y = (y); \ __cmp(op, unique_x, unique_y); \ }) #define __careful_cmp(op, x, y) \ __builtin_choose_expr( \ (__typecheck(x, y) && __constcheck(x, y)), \ __cmp(op, x, y), \ __cmp_once(op, x, y, __UNIQUE_ID(x_), __UNIQUE_ID(y_))) #define min(x, y) __careful_cmp(min, x, y) #define max(x, y) __careful_cmp(max, x, y) /* Defined in linux/minmax.h */ #define swap(a, b) \ do { typeof(a) __tmp = (a); (a) = (b); (b) = __tmp; } while (0) /* Defined in linux/math.h */ #define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d)) /* Defined in asm/unaligned.h */ static inline uint16_t get_unaligned_le16(const void *p) { return __le16_to_cpup((const __le16 *)p); } static inline uint32_t get_unaligned_le32(const void *p) { return __le32_to_cpup((const __le32 *)p); } static inline uint64_t get_unaligned_le64(const void *p) { return __le64_to_cpup((const __le64 *)p); } static inline uint16_t get_unaligned_be16(const void *p) { return __be16_to_cpup((const __be16 *)p); } static inline uint32_t get_unaligned_be32(const void *p) { return __be32_to_cpup((const __be32 *)p); } static inline uint64_t get_unaligned_be64(const void *p) { return __be64_to_cpup((const __be64 *)p); } static inline void put_unaligned_le16(uint16_t val, void *p) { *((__le16 *)p) = __cpu_to_le16(val); } static inline void put_unaligned_le32(uint32_t val, void *p) { *((__le32 *)p) = __cpu_to_le32(val); } static inline void put_unaligned_le64(uint64_t val, void *p) { *((__le64 *)p) = __cpu_to_le64(val); } static inline void put_unaligned_be16(uint16_t val, void *p) { *((__be16 *)p) = __cpu_to_be16(val); } static inline void put_unaligned_be32(uint32_t val, void *p) { *((__be32 *)p) = __cpu_to_be32(val); } static inline void put_unaligned_be64(uint64_t val, void *p) { *((__be64 *)p) = __cpu_to_be64(val); } #endif /* LINUX_UNALIGNED_H */ vdo-8.3.1.1/utils/uds/logger.c000066400000000000000000000217741476467262700160700ustar00rootroot00000000000000// SPDX-License-Identifier: GPL-2.0-only /* * Copyright 2023 Red Hat */ #include "logger.h" #include #include #include #include #include "fileUtils.h" #include "memory-alloc.h" #include "string-utils.h" #include "thread-utils.h" typedef struct { const char *name; const int priority; } PriorityName; static const PriorityName PRIORITIES[] = { { "ALERT", VDO_LOG_ALERT }, { "CRITICAL", VDO_LOG_CRIT }, { "CRIT", VDO_LOG_CRIT }, { "DEBUG", VDO_LOG_DEBUG }, { "EMERGENCY", VDO_LOG_EMERG }, { "EMERG", VDO_LOG_EMERG }, { "ERROR", VDO_LOG_ERR }, { "ERR", VDO_LOG_ERR }, { "INFO", VDO_LOG_INFO }, { "NOTICE", VDO_LOG_NOTICE }, { "PANIC", VDO_LOG_EMERG }, { "WARN", VDO_LOG_WARNING }, { "WARNING", VDO_LOG_WARNING }, { NULL, -1 }, }; static const char *const PRIORITY_STRINGS[] = { "EMERGENCY", "ALERT", "CRITICAL", "ERROR", "WARN", "NOTICE", "INFO", "DEBUG", }; static int log_level = VDO_LOG_INFO; const char TIMESTAMPS_ENVIRONMENT_VARIABLE[] = "UDS_LOG_TIMESTAMPS"; const char IDS_ENVIRONMENT_VARIABLE[] = "UDS_LOG_IDS"; static const char IDENTITY[] = "UDS"; static atomic_t logger_once = ATOMIC_INIT(0); static unsigned int opened = 0; static FILE *fp = NULL; static bool timestamps = true; static bool ids = true; /**********************************************************************/ int vdo_get_log_level(void) { return log_level; } /**********************************************************************/ static void vdo_set_log_level(int new_log_level) { log_level = new_log_level; } 
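/*
 * Editorial illustration (not part of the original file): a minimal sketch of
 * how the log-level helpers above are meant to be combined. The function name
 * example_raise_log_level_to_debug() is hypothetical and exists only to show
 * intended usage; it assumes nothing beyond vdo_get_log_level(),
 * vdo_set_log_level(), and the VDO_LOG_* priorities from logger.h.
 */
static void __always_unused example_raise_log_level_to_debug(void)
{
	/* Only raise verbosity; never drop below the currently configured level. */
	if (vdo_get_log_level() < VDO_LOG_DEBUG)
		vdo_set_log_level(VDO_LOG_DEBUG);
}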
/**********************************************************************/ int vdo_log_string_to_priority(const char *string) { int i; for (i = 0; PRIORITIES[i].name != NULL; i++) if (strcasecmp(string, PRIORITIES[i].name) == 0) return PRIORITIES[i].priority; return VDO_LOG_INFO; } /**********************************************************************/ const char *vdo_log_priority_to_string(int priority) { if ((priority < 0) || (priority >= (int) ARRAY_SIZE(PRIORITY_STRINGS))) return "unknown"; return PRIORITY_STRINGS[priority]; } /**********************************************************************/ static void init_logger(void) { const char *vdo_log_level = getenv("UDS_LOG_LEVEL"); if (vdo_log_level != NULL) vdo_set_log_level(vdo_log_string_to_priority(vdo_log_level)); else vdo_set_log_level(VDO_LOG_INFO); char *timestamps_string = getenv(TIMESTAMPS_ENVIRONMENT_VARIABLE); if (timestamps_string != NULL && strcmp(timestamps_string, "0") == 0) timestamps = false; char *ids_string = getenv(IDS_ENVIRONMENT_VARIABLE); if (ids_string != NULL && strcmp(ids_string, "0") == 0) ids = false; int error = 0; char *log_file = getenv("UDS_LOGFILE"); bool is_abs_path = false; if (log_file != NULL) { is_abs_path = (make_abs_path(log_file, &log_file) == UDS_SUCCESS); errno = 0; fp = fopen(log_file, "a"); if (fp != NULL) { if (is_abs_path) vdo_free(log_file); opened = 1; return; } error = errno; } char *identity; if (vdo_alloc_sprintf(NULL, &identity, "%s/%s", IDENTITY, program_invocation_short_name) == VDO_SUCCESS) { mini_openlog(identity, LOG_PID | LOG_NDELAY | LOG_CONS, LOG_USER); vdo_free(identity); } else { mini_openlog(IDENTITY, LOG_PID | LOG_NDELAY | LOG_CONS, LOG_USER); vdo_log_error("Could not include program name in log"); } if (error != 0) vdo_log_error_strerror(error, "Couldn't open log file %s", log_file); if (is_abs_path) vdo_free(log_file); opened = 1; } /** * Initialize the user space logger using optional environment * variables to set the default log level and log file. Can be called * more than once, but only the first call affects logging by user * space programs. For testing purposes, when the logging environment * needs to be changed, see reinit_vdo_logger. The kernel module uses * kernel logging facilities and therefore doesn't need an open_vdo_logger * method. **/ void open_vdo_logger(void) { vdo_perform_once(&logger_once, init_logger); } /**********************************************************************/ static void format_current_time(char *buffer, size_t buffer_size) { *buffer = 0; ktime_t now = current_time_ns(CLOCK_REALTIME); struct tm tmp; const time_t seconds = now / NSEC_PER_SEC; if (localtime_r(&seconds, &tmp) == NULL) return; if (strftime(buffer, buffer_size, "%Y-%m-%d %H:%M:%S", &tmp) == 0) { *buffer = 0; return; } size_t current_length = strlen(buffer); if (current_length > (buffer_size - 5)) // Not enough room to add milliseconds but we do have a time // string. return; snprintf(buffer + current_length, buffer_size - current_length, ".%03d", (int) ((now % NSEC_PER_SEC) / NSEC_PER_MSEC)); } /** * Log a message embedded within another message. 
* * @param priority the priority at which to log the message * @param module the name of the module doing the logging * @param prefix optional string prefix to message, may be NULL * @param fmt1 format of message first part (required) * @param args1 arguments for message first part (required) * @param fmt2 format of message second part **/ void vdo_log_embedded_message(int priority, const char *module __always_unused, const char *prefix, const char *fmt1, va_list args1, const char *fmt2, ...) { va_list args2; open_vdo_logger(); if (priority > vdo_get_log_level()) return; va_start(args2, fmt2); // Preserve errno since the caller cares more about their own error // state than about errors in the logging code. int error = errno; if (fp == NULL) { mini_syslog_pack(priority, prefix, fmt1, args1, fmt2, args2); } else { char tname[16]; uds_get_thread_name(tname); flockfile(fp); if (timestamps) { char time_buffer[32]; format_current_time(time_buffer, sizeof(time_buffer)); fprintf(fp, "%s ", time_buffer); } fputs(program_invocation_short_name, fp); if (ids) fprintf(fp, "[%u]", getpid()); fprintf(fp, ": %-6s (%s", vdo_log_priority_to_string(priority), tname); if (ids) fprintf(fp, "/%d", uds_get_thread_id()); fputs(") ", fp); if (prefix != NULL) fputs(prefix, fp); if (fmt1 != NULL) vfprintf(fp, fmt1, args1); if (fmt2 != NULL) vfprintf(fp, fmt2, args2); fputs("\n", fp); fflush(fp); funlockfile(fp); } // Reset errno errno = error; va_end(args2); } /**********************************************************************/ int vdo_vlog_strerror(int priority, int errnum, const char *module, const char *format, va_list args) { char errbuf[VDO_MAX_ERROR_MESSAGE_SIZE]; const char *message = uds_string_error(errnum, errbuf, sizeof(errbuf)); vdo_log_embedded_message(priority, module, NULL, format, args, ": %s (%d)", message, errnum); return errnum; } /**********************************************************************/ int __vdo_log_strerror(int priority, int errnum, const char *module, const char *format, ...) { va_list args; va_start(args, format); vdo_vlog_strerror(priority, errnum, module, format, args); va_end(args); return errnum; } #if defined(TEST_INTERNAL) || defined(INTERNAL) #pragma GCC diagnostic push #pragma GCC diagnostic ignored "-Wmissing-format-attribute" #endif /**********************************************************************/ void vdo_log_message(int priority, const char *format, ...) { va_list args; va_start(args, format); vdo_log_embedded_message(priority, NULL, NULL, format, args, "%s", ""); va_end(args); } #if defined(TEST_INTERNAL) || defined(INTERNAL) #pragma GCC diagnostic pop #endif /** * Log the contents of /proc/self/maps so that we can decode the addresses * in a stack trace. 
* * @param priority The priority at which to log **/ static void log_proc_maps(int priority) { FILE *maps_file = fopen("/proc/self/maps", "r"); if (maps_file == NULL) return; vdo_log_message(priority, "maps file"); char buffer[1024]; char *map_line; while ((map_line = fgets(buffer, 1024, maps_file)) != NULL) { char *newline = strchr(map_line, '\n'); if (newline != NULL) *newline = '\0'; vdo_log_message(priority, " %s", map_line); } vdo_log_message(priority, "end of maps file"); fclose(maps_file); } enum { NUM_STACK_FRAMES = 32 }; /**********************************************************************/ void vdo_log_backtrace(int priority) { vdo_log_message(priority, "[Call Trace:]"); void *trace[NUM_STACK_FRAMES]; int trace_size = backtrace(trace, NUM_STACK_FRAMES); char **messages = backtrace_symbols(trace, trace_size); if (messages == NULL) { vdo_log_message(priority, "backtrace failed"); } else { for (int i = 0; i < trace_size; i++) vdo_log_message(priority, " %s", messages[i]); // "messages" is malloc'ed indirectly by backtrace_symbols free(messages); log_proc_maps(priority); } } /**********************************************************************/ void vdo_pause_for_logger(void) { // User-space logger can't be overrun, so this is a no-op. } vdo-8.3.1.1/utils/uds/logger.h000066400000000000000000000047211476467262700160660ustar00rootroot00000000000000/* SPDX-License-Identifier: GPL-2.0-only */ /* * Copyright 2023 Red Hat */ #ifndef VDO_LOGGER_H #define VDO_LOGGER_H #include #include "minisyslog.h" /* Custom logging utilities for UDS */ #define VDO_LOG_EMERG LOG_EMERG #define VDO_LOG_ALERT LOG_ALERT #define VDO_LOG_CRIT LOG_CRIT #define VDO_LOG_ERR LOG_ERR #define VDO_LOG_WARNING LOG_WARNING #define VDO_LOG_NOTICE LOG_NOTICE #define VDO_LOG_INFO LOG_INFO #define VDO_LOG_DEBUG LOG_DEBUG #define VDO_LOGGING_MODULE_NAME "vdo" /* Apply a rate limiter to a log method call. */ #define vdo_log_ratelimit(log_fn, ...) log_fn(__VA_ARGS__) int vdo_get_log_level(void); int vdo_log_string_to_priority(const char *string); const char *vdo_log_priority_to_string(int priority); void vdo_log_embedded_message(int priority, const char *module, const char *prefix, const char *fmt1, va_list args1, const char *fmt2, ...) __printf(4, 0) __printf(6, 7); void vdo_log_backtrace(int priority); /* All log functions will preserve the caller's value of errno. */ #define vdo_log_strerror(priority, errnum, ...) \ __vdo_log_strerror(priority, errnum, VDO_LOGGING_MODULE_NAME, __VA_ARGS__) int __vdo_log_strerror(int priority, int errnum, const char *module, const char *format, ...) __printf(4, 5); int vdo_vlog_strerror(int priority, int errnum, const char *module, const char *format, va_list args) __printf(4, 0); /* Log an error prefixed with the string associated with the errnum. */ #define vdo_log_error_strerror(errnum, ...) \ vdo_log_strerror(VDO_LOG_ERR, errnum, __VA_ARGS__) #define vdo_log_debug_strerror(errnum, ...) \ vdo_log_strerror(VDO_LOG_DEBUG, errnum, __VA_ARGS__) #define vdo_log_info_strerror(errnum, ...) \ vdo_log_strerror(VDO_LOG_INFO, errnum, __VA_ARGS__) #define vdo_log_warning_strerror(errnum, ...) \ vdo_log_strerror(VDO_LOG_WARNING, errnum, __VA_ARGS__) #define vdo_log_fatal_strerror(errnum, ...) \ vdo_log_strerror(VDO_LOG_CRIT, errnum, __VA_ARGS__) void vdo_log_message(int priority, const char *format, ...) __printf(2, 3); #define vdo_log_debug(...) vdo_log_message(VDO_LOG_DEBUG, __VA_ARGS__) #define vdo_log_info(...) vdo_log_message(VDO_LOG_INFO, __VA_ARGS__) #define vdo_log_warning(...) 
vdo_log_message(VDO_LOG_WARNING, __VA_ARGS__) #define vdo_log_error(...) vdo_log_message(VDO_LOG_ERR, __VA_ARGS__) #define vdo_log_fatal(...) vdo_log_message(VDO_LOG_CRIT, __VA_ARGS__) void vdo_pause_for_logger(void); void open_vdo_logger(void); #endif /* VDO_LOGGER_H */ vdo-8.3.1.1/utils/uds/memory-alloc.h000066400000000000000000000121061476467262700172030ustar00rootroot00000000000000/* SPDX-License-Identifier: GPL-2.0-only */ /* * Copyright 2023 Red Hat */ #ifndef VDO_MEMORY_ALLOC_H #define VDO_MEMORY_ALLOC_H #include #include #include #include "permassert.h" /* Custom memory allocation function that tracks memory usage */ int __must_check vdo_allocate_memory(size_t size, size_t align, const char *what, void *ptr); /* * Allocate storage based on element counts, sizes, and alignment. * * This is a generalized form of our allocation use case: It allocates an array of objects, * optionally preceded by one object of another type (i.e., a struct with trailing variable-length * array), with the alignment indicated. * * Why is this inline? The sizes and alignment will always be constant, when invoked through the * macros below, and often the count will be a compile-time constant 1 or the number of extra bytes * will be a compile-time constant 0. So at least some of the arithmetic can usually be optimized * away, and the run-time selection between allocation functions always can. In many cases, it'll * boil down to just a function call with a constant size. * * @count: The number of objects to allocate * @size: The size of an object * @extra: The number of additional bytes to allocate * @align: The required alignment * @what: What is being allocated (for error logging) * @ptr: A pointer to hold the allocated memory * * Return: VDO_SUCCESS or an error code */ static inline int __vdo_do_allocation(size_t count, size_t size, size_t extra, size_t align, const char *what, void *ptr) { size_t total_size = count * size + extra; /* Overflow check: */ if ((size > 0) && (count > ((SIZE_MAX - extra) / size))) { /* * This is kind of a hack: We rely on the fact that SIZE_MAX would cover the entire * address space (minus one byte) and thus the system can never allocate that much * and the call will always fail. So we can report an overflow as "out of memory" * by asking for "merely" SIZE_MAX bytes. */ total_size = SIZE_MAX; } return vdo_allocate_memory(total_size, align, what, ptr); } /* * Allocate one or more elements of the indicated type, logging an error if the allocation fails. * The memory will be zeroed. * * @COUNT: The number of objects to allocate * @TYPE: The type of objects to allocate. This type determines the alignment of the allocation. * @WHAT: What is being allocated (for error logging) * @PTR: A pointer to hold the allocated memory * * Return: VDO_SUCCESS or an error code */ #define vdo_allocate(COUNT, TYPE, WHAT, PTR) \ __vdo_do_allocation(COUNT, sizeof(TYPE), 0, __alignof__(TYPE), WHAT, PTR) /* * Allocate one object of an indicated type, followed by one or more elements of a second type, * logging an error if the allocation fails. The memory will be zeroed. * * @TYPE1: The type of the primary object to allocate. This type determines the alignment of the * allocated memory. 
* @COUNT: The number of objects to allocate * @TYPE2: The type of array objects to allocate * @WHAT: What is being allocated (for error logging) * @PTR: A pointer to hold the allocated memory * * Return: VDO_SUCCESS or an error code */ #define vdo_allocate_extended(TYPE1, COUNT, TYPE2, WHAT, PTR) \ __extension__({ \ int _result; \ TYPE1 **_ptr = (PTR); \ BUILD_BUG_ON(__alignof__(TYPE1) < __alignof__(TYPE2)); \ _result = __vdo_do_allocation(COUNT, \ sizeof(TYPE2), \ sizeof(TYPE1), \ __alignof__(TYPE1), \ WHAT, \ _ptr); \ _result; \ }) /* * Allocate memory starting on a cache line boundary, logging an error if the allocation fails. The * memory will be zeroed. * * @size: The number of bytes to allocate * @what: What is being allocated (for error logging) * @ptr: A pointer to hold the allocated memory * * Return: VDO_SUCCESS or an error code */ static inline int __must_check vdo_allocate_cache_aligned(size_t size, const char *what, void *ptr) { return vdo_allocate_memory(size, L1_CACHE_BYTES, what, ptr); } /* * Allocate one element of the indicated type immediately, failing if the required memory is not * immediately available. * * @size: The number of bytes to allocate * @what: What is being allocated (for error logging) * * Return: pointer to the memory, or NULL if the memory is not available. */ void *__must_check vdo_allocate_memory_nowait(size_t size, const char *what); int __must_check vdo_reallocate_memory(void *ptr, size_t old_size, size_t size, const char *what, void *new_ptr); int __must_check vdo_duplicate_string(const char *string, const char *what, char **new_string); /* Free memory allocated with vdo_allocate(). */ void vdo_free(void *ptr); static inline void *__vdo_forget(void **ptr_ptr) { void *ptr = *ptr_ptr; *ptr_ptr = NULL; return ptr; } /* * Null out a pointer and return a copy to it. This macro should be used when passing a pointer to * a function for which it is not safe to access the pointer once the function returns. */ #define vdo_forget(ptr) __vdo_forget((void **) &(ptr)) #endif /* VDO_MEMORY_ALLOC_H */ vdo-8.3.1.1/utils/uds/memoryAlloc.c000066400000000000000000000065661476467262700170760ustar00rootroot00000000000000// SPDX-License-Identifier: GPL-2.0-only /* * Copyright 2023 Red Hat */ #include #include #include #include "logger.h" #include "memory-alloc.h" enum { DEFAULT_MALLOC_ALIGNMENT = 2 * sizeof(size_t) }; // glibc malloc /** * Allocate storage based on memory size and alignment, logging an error if * the allocation fails. The memory will be zeroed. * * @param size The size of an object * @param align The required alignment * @param what What is being allocated (for error logging) * @param ptr A pointer to hold the allocated memory * * @return VDO_SUCCESS or an error code **/ int vdo_allocate_memory(size_t size, size_t align, const char *what, void *ptr) { int result; void *p; if (ptr == NULL) return UDS_INVALID_ARGUMENT; if (size == 0) { // We can skip the malloc call altogether. 
*((void **) ptr) = NULL; return VDO_SUCCESS; } if (align > DEFAULT_MALLOC_ALIGNMENT) { result = posix_memalign(&p, align, size); if (result != 0) { if (what != NULL) vdo_log_error_strerror(result, "failed to posix_memalign %s (%zu bytes)", what, size); return -result; } } else { p = malloc(size); if (p == NULL) { result = errno; if (what != NULL) vdo_log_error_strerror(result, "failed to allocate %s (%zu bytes)", what, size); return -result; } } memset(p, 0, size); *((void **) ptr) = p; return VDO_SUCCESS; } /* * Allocate storage based on memory size, failing immediately if the required * memory is not available. The memory will be zeroed. * * @param size The size of an object. * @param what What is being allocated (for error logging) * * @return pointer to the allocated memory, or NULL if the required space is * not available. */ void *vdo_allocate_memory_nowait(size_t size, const char *what) { void *p = NULL; vdo_allocate(size, char *, what, &p); return p; } /**********************************************************************/ void vdo_free(void *ptr) { free(ptr); } /** * Reallocate dynamically allocated memory. There are no alignment guarantees * for the reallocated memory. If the new memory is larger than the old memory, * the new space will be zeroed. * * @param ptr The memory to reallocate. * @param old_size The old size of the memory * @param size The new size to allocate * @param what What is being allocated (for error logging) * @param new_ptr A pointer to hold the reallocated pointer * * @return VDO_SUCCESS or an error code **/ int vdo_reallocate_memory(void *ptr, size_t old_size, size_t size, const char *what, void *new_ptr) { char *new = realloc(ptr, size); if ((new == NULL) && (size != 0)) return vdo_log_error_strerror(-errno, "failed to reallocate %s (%zu bytes)", what, size); if (size > old_size) memset(new + old_size, 0, size - old_size); *((void **) new_ptr) = new; return VDO_SUCCESS; } /**********************************************************************/ int vdo_duplicate_string(const char *string, const char *what, char **new_string) { int result; u8 *dup = NULL; result = vdo_allocate(strlen(string) + 1, u8, what, &dup); if (result != VDO_SUCCESS) return result; memcpy(dup, string, strlen(string) + 1); *new_string = (char *) dup; return VDO_SUCCESS; } vdo-8.3.1.1/utils/uds/minisyslog.c000066400000000000000000000126161476467262700170010ustar00rootroot00000000000000// SPDX-License-Identifier: GPL-2.0-only /* * Copyright 2023 Red Hat */ #include #include #include #include #include #include #include "logger.h" #include "memory-alloc.h" #include "minisyslog.h" #include "string-utils.h" #include "thread-utils.h" #include "time-utils.h" static struct mutex mutex = UDS_MUTEX_INITIALIZER; static int log_socket = -1; static char *log_ident; static int log_option; static int default_facility = LOG_USER; /**********************************************************************/ static void close_locked(void) { if (log_socket != -1) { close(log_socket); log_socket = -1; } } /**********************************************************************/ static void open_socket_locked(void) { if (log_socket != -1) return; struct sockaddr_un sun; memset(&sun, 0, sizeof(sun)); sun.sun_family = AF_UNIX; strncpy(sun.sun_path, _PATH_LOG, sizeof(sun.sun_path)); /* * We can't log from here, we'll deadlock, so we can't use * loggingSocket(), loggingConnect(), or tryCloseFile(). 
*/ log_socket = socket(PF_UNIX, SOCK_DGRAM, 0); if (log_socket < 0) return; if (connect(log_socket, (const struct sockaddr *) &sun, sizeof(sun)) != 0) close_locked(); } /**********************************************************************/ void mini_openlog(const char *ident, int option, int facility) { uds_lock_mutex(&mutex); close_locked(); vdo_free(log_ident); if (vdo_duplicate_string(ident, NULL, &log_ident) != VDO_SUCCESS) // on failure, NULL is okay log_ident = NULL; log_option = option; default_facility = facility; if (log_option & LOG_NDELAY) open_socket_locked(); uds_unlock_mutex(&mutex); } /**********************************************************************/ void mini_syslog(int priority, const char *format, ...) { va_list args; va_start(args, format); mini_vsyslog(priority, format, args); va_end(args); } /**********************************************************************/ static bool write_msg(int fd, const char *msg) { size_t bytes_to_write = strlen(msg); ssize_t bytes_written = write(fd, msg, bytes_to_write); if (bytes_written == (ssize_t) bytes_to_write) { bytes_to_write += 1; bytes_written += write(fd, "\n", 1); } return bytes_written != (ssize_t) bytes_to_write; } /**********************************************************************/ __printf(3, 0) #ifdef __clang__ // Clang insists on annotating both printf style format strings, but // gcc doesn't understand the second. __printf(5, 0) #endif //__clang__ static void log_it(int priority, const char *prefix, const char *format1, va_list args1, const char *format2, va_list args2) { const char *priority_str = vdo_log_priority_to_string(priority); char buffer[1024]; char *buf_end = buffer + sizeof(buffer); char *bufp = buffer; time_t t = ktime_to_seconds(current_time_ns(CLOCK_REALTIME)); struct tm tm; char timestamp[64]; timestamp[0] = '\0'; if (localtime_r(&t, &tm) != NULL) if (strftime(timestamp, sizeof(timestamp), "%b %e %H:%M:%S", &tm) == 0) timestamp[0] = '\0'; if (LOG_FAC(priority) == 0) priority |= default_facility; bufp = vdo_append_to_buffer(bufp, buf_end, "<%d>%s", priority, timestamp); const char *stderr_msg = bufp; bufp = vdo_append_to_buffer(bufp, buf_end, " %s", log_ident == NULL ? 
"" : log_ident); if (log_option & LOG_PID) { char tname[16]; uds_get_thread_name(tname); bufp = vdo_append_to_buffer(bufp, buf_end, "[%u]: %-6s (%s/%d) ", getpid(), priority_str, tname, uds_get_thread_id()); } else { bufp = vdo_append_to_buffer(bufp, buf_end, ": "); } if ((bufp + sizeof("...")) >= buf_end) return; if (prefix != NULL) bufp = vdo_append_to_buffer(bufp, buf_end, "%s", prefix); if (format1 != NULL) { int ret = vsnprintf(bufp, buf_end - bufp, format1, args1); if (ret < (buf_end - bufp)) bufp += ret; else bufp = buf_end; } if (format2 != NULL) { int ret = vsnprintf(bufp, buf_end - bufp, format2, args2); if (ret < (buf_end - bufp)) bufp += ret; else bufp = buf_end; } if (bufp == buf_end) strcpy(buf_end - sizeof("..."), "..."); bool failure = false; if (log_option & LOG_PERROR) failure |= write_msg(STDERR_FILENO, stderr_msg); open_socket_locked(); failure |= (log_socket == -1); if (log_socket != -1) { size_t bytes_to_write = bufp - buffer; ssize_t bytes_written = send(log_socket, buffer, bytes_to_write, MSG_NOSIGNAL); failure |= (bytes_written != (ssize_t) bytes_to_write); } if (failure && (log_option & LOG_CONS)) { int console = open(_PATH_CONSOLE, O_WRONLY); if (console != -1) { write_msg(console, stderr_msg); close(console); } } } void mini_syslog_pack(int priority, const char *prefix, const char *fmt1, va_list args1, const char *fmt2, va_list args2) { uds_lock_mutex(&mutex); log_it(priority, prefix, fmt1, args1, fmt2, args2); uds_unlock_mutex(&mutex); } void mini_vsyslog(int priority, const char *format, va_list ap) { va_list dummy; memset(&dummy, 0, sizeof(dummy)); uds_lock_mutex(&mutex); log_it(priority, NULL, format, ap, NULL, dummy); uds_unlock_mutex(&mutex); } void mini_closelog(void) { uds_lock_mutex(&mutex); close_locked(); vdo_free(log_ident); log_ident = NULL; log_option = 0; default_facility = LOG_USER; uds_unlock_mutex(&mutex); } vdo-8.3.1.1/utils/uds/minisyslog.h000066400000000000000000000053131476467262700170020ustar00rootroot00000000000000/* * Copyright Red Hat * * This program is free software; you can redistribute it and/or * modify it under the terms of the GNU General Public License * as published by the Free Software Foundation; either version 2 * of the License, or (at your option) any later version. * * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with this program; if not, write to the Free Software * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA * 02110-1301, USA. */ #ifndef MINISYSLOG_H #define MINISYSLOG_H #include #include #include /** * @file * * Replacements for the syslog library functions so that the library * calls do not conflict with the application calling syslog. **/ /** * Open the logger. The function mimics the openlog() c-library function. * * @param ident The identity string to prepended to all log messages * @param option The logger options (see the openlog(3) man page). * @param facility The type of program logging the message. **/ void mini_openlog(const char *ident, int option, int facility); /** * Log a message. This function mimics the syslog() c-library function. * * @param priority The priority level of the message * @param format A printf style message format **/ void mini_syslog(int priority, const char *format, ...) 
__printf(2, 3); /** * Log a message. This function mimics the vsyslog() c-library function. * * @param priority The priority level of the message * @param format A printf style message format * @param ap An argument list obtained from stdarg() **/ void mini_vsyslog(int priority, const char *format, va_list ap) __printf(2, 0); /** * Log a message pack consisting of multiple variable sections. * * @param priority the priority at which to log the message * @param prefix optional string prefix to message, may be NULL * @param fmt1 format of message first part, may be NULL * @param args1 arguments for message first part * @param fmt2 format of message second part, may be NULL * @param args2 arguments for message second part **/ void mini_syslog_pack(int priority, const char *prefix, const char *fmt1, va_list args1, const char *fmt2, va_list args2) __printf(3, 0) __printf(5, 0); /** * Close a logger. This function mimics the closelog() c-library function. **/ void mini_closelog(void); #endif /* MINI_SYSLOG_H */ vdo-8.3.1.1/utils/uds/murmurhash3.c000066400000000000000000000051401476467262700170540ustar00rootroot00000000000000// SPDX-License-Identifier: LGPL-2.1+ /* * MurmurHash3 was written by Austin Appleby, and is placed in the public * domain. The author hereby disclaims copyright to this source code. * * Adapted by John Wiele (jwiele@redhat.com). */ #include "murmurhash3.h" #include static inline u64 rotl64(u64 x, s8 r) { return (x << r) | (x >> (64 - r)); } #define ROTL64(x, y) rotl64(x, y) /* Finalization mix - force all bits of a hash block to avalanche */ static __always_inline u64 fmix64(u64 k) { k ^= k >> 33; k *= 0xff51afd7ed558ccdLLU; k ^= k >> 33; k *= 0xc4ceb9fe1a85ec53LLU; k ^= k >> 33; return k; } void murmurhash3_128(const void *key, const int len, const u32 seed, void *out) { const u8 *data = key; const int nblocks = len / 16; u64 h1 = seed; u64 h2 = seed; const u64 c1 = 0x87c37b91114253d5LLU; const u64 c2 = 0x4cf5ad432745937fLLU; u64 *hash_out = out; /* body */ int i; for (i = 0; i < nblocks; i++) { u64 k1 = get_unaligned_le64(&data[i * 16]); u64 k2 = get_unaligned_le64(&data[i * 16 + 8]); k1 *= c1; k1 = ROTL64(k1, 31); k1 *= c2; h1 ^= k1; h1 = ROTL64(h1, 27); h1 += h2; h1 = h1 * 5 + 0x52dce729; k2 *= c2; k2 = ROTL64(k2, 33); k2 *= c1; h2 ^= k2; h2 = ROTL64(h2, 31); h2 += h1; h2 = h2 * 5 + 0x38495ab5; } /* tail */ { const u8 *tail = (const u8 *)(data + nblocks * 16); u64 k1 = 0; u64 k2 = 0; switch (len & 15) { case 15: k2 ^= ((u64)tail[14]) << 48; fallthrough; case 14: k2 ^= ((u64)tail[13]) << 40; fallthrough; case 13: k2 ^= ((u64)tail[12]) << 32; fallthrough; case 12: k2 ^= ((u64)tail[11]) << 24; fallthrough; case 11: k2 ^= ((u64)tail[10]) << 16; fallthrough; case 10: k2 ^= ((u64)tail[9]) << 8; fallthrough; case 9: k2 ^= ((u64)tail[8]) << 0; k2 *= c2; k2 = ROTL64(k2, 33); k2 *= c1; h2 ^= k2; fallthrough; case 8: k1 ^= ((u64)tail[7]) << 56; fallthrough; case 7: k1 ^= ((u64)tail[6]) << 48; fallthrough; case 6: k1 ^= ((u64)tail[5]) << 40; fallthrough; case 5: k1 ^= ((u64)tail[4]) << 32; fallthrough; case 4: k1 ^= ((u64)tail[3]) << 24; fallthrough; case 3: k1 ^= ((u64)tail[2]) << 16; fallthrough; case 2: k1 ^= ((u64)tail[1]) << 8; fallthrough; case 1: k1 ^= ((u64)tail[0]) << 0; k1 *= c1; k1 = ROTL64(k1, 31); k1 *= c2; h1 ^= k1; break; default: break; } } /* finalization */ h1 ^= len; h2 ^= len; h1 += h2; h2 += h1; h1 = fmix64(h1); h2 = fmix64(h2); h1 += h2; h2 += h1; put_unaligned_le64(h1, &hash_out[0]); put_unaligned_le64(h2, &hash_out[1]); } 
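/*
 * Usage sketch: a minimal, hypothetical example of calling murmurhash3_128()
 * as defined above. The function writes a 128-bit result as two little-endian
 * 64-bit words, so the caller must supply at least 16 bytes of output space.
 * The example_hash_name() wrapper, the sample input string, and the seed value
 * are illustrative assumptions, not part of the library interface.
 */
static inline void example_hash_name(void)
{
	const char name[] = "sample record name";
	u8 hash[16];

	/* Hash the name bytes (excluding the trailing NUL) with an arbitrary seed. */
	murmurhash3_128(name, sizeof(name) - 1, 0x5eed, hash);

	/* hash[0..15] now holds the 128-bit MurmurHash3 of the input. */
}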
vdo-8.3.1.1/utils/uds/murmurhash3.h000066400000000000000000000006131476467262700170610ustar00rootroot00000000000000/* SPDX-License-Identifier: LGPL-2.1+ */ /* * MurmurHash3 was written by Austin Appleby, and is placed in the public * domain. The author hereby disclaims copyright to this source code. */ #ifndef _MURMURHASH3_H_ #define _MURMURHASH3_H_ #include #include void murmurhash3_128(const void *key, int len, u32 seed, void *out); #endif /* _MURMURHASH3_H_ */ vdo-8.3.1.1/utils/uds/numeric.h000066400000000000000000000036161476467262700162530ustar00rootroot00000000000000/* SPDX-License-Identifier: GPL-2.0-only */ /* * Copyright 2023 Red Hat */ #ifndef UDS_NUMERIC_H #define UDS_NUMERIC_H #include #include /* * These utilities encode or decode a number from an offset in a larger data buffer and then * advance the offset pointer to the next field in the buffer. */ static inline void decode_s64_le(const u8 *buffer, size_t *offset, s64 *decoded) { *decoded = get_unaligned_le64(buffer + *offset); *offset += sizeof(s64); } static inline void encode_s64_le(u8 *data, size_t *offset, s64 to_encode) { put_unaligned_le64(to_encode, data + *offset); *offset += sizeof(s64); } static inline void decode_u64_le(const u8 *buffer, size_t *offset, u64 *decoded) { *decoded = get_unaligned_le64(buffer + *offset); *offset += sizeof(u64); } static inline void encode_u64_le(u8 *data, size_t *offset, u64 to_encode) { put_unaligned_le64(to_encode, data + *offset); *offset += sizeof(u64); } static inline void decode_s32_le(const u8 *buffer, size_t *offset, s32 *decoded) { *decoded = get_unaligned_le32(buffer + *offset); *offset += sizeof(s32); } static inline void encode_s32_le(u8 *data, size_t *offset, s32 to_encode) { put_unaligned_le32(to_encode, data + *offset); *offset += sizeof(s32); } static inline void decode_u32_le(const u8 *buffer, size_t *offset, u32 *decoded) { *decoded = get_unaligned_le32(buffer + *offset); *offset += sizeof(u32); } static inline void encode_u32_le(u8 *data, size_t *offset, u32 to_encode) { put_unaligned_le32(to_encode, data + *offset); *offset += sizeof(u32); } static inline void decode_u16_le(const u8 *buffer, size_t *offset, u16 *decoded) { *decoded = get_unaligned_le16(buffer + *offset); *offset += sizeof(u16); } static inline void encode_u16_le(u8 *data, size_t *offset, u16 to_encode) { put_unaligned_le16(to_encode, data + *offset); *offset += sizeof(u16); } #endif /* UDS_NUMERIC_H */ vdo-8.3.1.1/utils/uds/open-chapter.c000066400000000000000000000322531476467262700171700ustar00rootroot00000000000000// SPDX-License-Identifier: GPL-2.0-only /* * Copyright 2023 Red Hat */ #include "open-chapter.h" #include #include "logger.h" #include "memory-alloc.h" #include "numeric.h" #include "permassert.h" #include "config.h" #include "hash-utils.h" /* * Each index zone has a dedicated open chapter zone structure which gets an equal share of the * open chapter space. Records are assigned to zones based on their record name. Within each zone, * records are stored in an array in the order they arrive. Additionally, a reference to each * record is stored in a hash table to help determine if a new record duplicates an existing one. * If new metadata for an existing name arrives, the record is altered in place. The array of * records is 1-based so that record number 0 can be used to indicate an unused hash slot. * * Deleted records are marked with a flag rather than actually removed to simplify hash table * management. 
The array of deleted flags overlays the array of hash slots, but the flags are * indexed by record number instead of by record name. The number of hash slots will always be a * power of two that is greater than the number of records to be indexed, guaranteeing that hash * insertion cannot fail, and that there are sufficient flags for all records. * * Once any open chapter zone fills its available space, the chapter is closed. The records from * each zone are interleaved to attempt to preserve temporal locality and assigned to record pages. * Empty or deleted records are replaced by copies of a valid record so that the record pages only * contain valid records. The chapter then constructs a delta index which maps each record name to * the record page on which that record can be found, which is split into index pages. These * structures are then passed to the volume to be recorded on storage. * * When the index is saved, the open chapter records are saved in a single array, once again * interleaved to attempt to preserve temporal locality. When the index is reloaded, there may be a * different number of zones than previously, so the records must be parcelled out to their new * zones. In addition, depending on the distribution of record names, a new zone may have more * records than it has space. In this case, the latest records for that zone will be discarded. */ static const u8 OPEN_CHAPTER_MAGIC[] = "ALBOC"; static const u8 OPEN_CHAPTER_VERSION[] = "02.00"; #define OPEN_CHAPTER_MAGIC_LENGTH (sizeof(OPEN_CHAPTER_MAGIC) - 1) #define OPEN_CHAPTER_VERSION_LENGTH (sizeof(OPEN_CHAPTER_VERSION) - 1) #define LOAD_RATIO 2 static inline size_t records_size(const struct open_chapter_zone *open_chapter) { return sizeof(struct uds_volume_record) * (1 + open_chapter->capacity); } static inline size_t slots_size(size_t slot_count) { return sizeof(struct open_chapter_zone_slot) * slot_count; } int uds_make_open_chapter(const struct index_geometry *geometry, unsigned int zone_count, struct open_chapter_zone **open_chapter_ptr) { int result; struct open_chapter_zone *open_chapter; size_t capacity = geometry->records_per_chapter / zone_count; size_t slot_count = (1 << bits_per(capacity * LOAD_RATIO)); result = vdo_allocate_extended(struct open_chapter_zone, slot_count, struct open_chapter_zone_slot, "open chapter", &open_chapter); if (result != VDO_SUCCESS) return result; open_chapter->slot_count = slot_count; open_chapter->capacity = capacity; result = vdo_allocate_cache_aligned(records_size(open_chapter), "record pages", &open_chapter->records); if (result != VDO_SUCCESS) { uds_free_open_chapter(open_chapter); return result; } *open_chapter_ptr = open_chapter; return UDS_SUCCESS; } void uds_reset_open_chapter(struct open_chapter_zone *open_chapter) { open_chapter->size = 0; open_chapter->deletions = 0; memset(open_chapter->records, 0, records_size(open_chapter)); memset(open_chapter->slots, 0, slots_size(open_chapter->slot_count)); } static unsigned int probe_chapter_slots(struct open_chapter_zone *open_chapter, const struct uds_record_name *name) { struct uds_volume_record *record; unsigned int slot_count = open_chapter->slot_count; unsigned int slot = uds_name_to_hash_slot(name, slot_count); unsigned int record_number; unsigned int attempts = 1; while (true) { record_number = open_chapter->slots[slot].record_number; /* * If the hash slot is empty, we've reached the end of a chain without finding the * record and should terminate the search. 
*/ if (record_number == 0) return slot; /* * If the name of the record referenced by the slot matches and has not been * deleted, then we've found the requested name. */ record = &open_chapter->records[record_number]; if ((memcmp(&record->name, name, UDS_RECORD_NAME_SIZE) == 0) && !open_chapter->slots[record_number].deleted) return slot; /* * Quadratic probing: advance the probe by 1, 2, 3, etc. and try again. This * performs better than linear probing and works best for 2^N slots. */ slot = (slot + attempts++) % slot_count; } } void uds_search_open_chapter(struct open_chapter_zone *open_chapter, const struct uds_record_name *name, struct uds_record_data *metadata, bool *found) { unsigned int slot; unsigned int record_number; slot = probe_chapter_slots(open_chapter, name); record_number = open_chapter->slots[slot].record_number; if (record_number == 0) { *found = false; } else { *found = true; *metadata = open_chapter->records[record_number].data; } } /* Add a record to the open chapter zone and return the remaining space. */ int uds_put_open_chapter(struct open_chapter_zone *open_chapter, const struct uds_record_name *name, const struct uds_record_data *metadata) { unsigned int slot; unsigned int record_number; struct uds_volume_record *record; if (open_chapter->size >= open_chapter->capacity) return 0; slot = probe_chapter_slots(open_chapter, name); record_number = open_chapter->slots[slot].record_number; if (record_number == 0) { record_number = ++open_chapter->size; open_chapter->slots[slot].record_number = record_number; } record = &open_chapter->records[record_number]; record->name = *name; record->data = *metadata; return open_chapter->capacity - open_chapter->size; } void uds_remove_from_open_chapter(struct open_chapter_zone *open_chapter, const struct uds_record_name *name) { unsigned int slot; unsigned int record_number; slot = probe_chapter_slots(open_chapter, name); record_number = open_chapter->slots[slot].record_number; if (record_number > 0) { open_chapter->slots[record_number].deleted = true; open_chapter->deletions += 1; } } void uds_free_open_chapter(struct open_chapter_zone *open_chapter) { if (open_chapter != NULL) { vdo_free(open_chapter->records); vdo_free(open_chapter); } } /* Map each record name to its record page number in the delta chapter index. */ static int fill_delta_chapter_index(struct open_chapter_zone **chapter_zones, unsigned int zone_count, struct open_chapter_index *index, struct uds_volume_record *collated_records) { int result; unsigned int records_per_chapter; unsigned int records_per_page; unsigned int record_index; unsigned int records = 0; u32 page_number; unsigned int z; int overflow_count = 0; struct uds_volume_record *fill_record = NULL; /* * The record pages should not have any empty space, so find a record with which to fill * the chapter zone if it was closed early, and also to replace any deleted records. The * last record in any filled zone is guaranteed to not have been deleted, so use one of * those. */ for (z = 0; z < zone_count; z++) { struct open_chapter_zone *zone = chapter_zones[z]; if (zone->size == zone->capacity) { fill_record = &zone->records[zone->size]; break; } } records_per_chapter = index->geometry->records_per_chapter; records_per_page = index->geometry->records_per_page; for (records = 0; records < records_per_chapter; records++) { struct uds_volume_record *record = &collated_records[records]; struct open_chapter_zone *open_chapter; /* The record arrays in the zones are 1-based. 
*/ record_index = 1 + (records / zone_count); page_number = records / records_per_page; open_chapter = chapter_zones[records % zone_count]; /* Use the fill record in place of an unused record. */ if (record_index > open_chapter->size || open_chapter->slots[record_index].deleted) { *record = *fill_record; continue; } *record = open_chapter->records[record_index]; result = uds_put_open_chapter_index_record(index, &record->name, page_number); switch (result) { case UDS_SUCCESS: break; case UDS_OVERFLOW: overflow_count++; break; default: vdo_log_error_strerror(result, "failed to build open chapter index"); return result; } } if (overflow_count > 0) vdo_log_warning("Failed to add %d entries to chapter index", overflow_count); return UDS_SUCCESS; } int uds_close_open_chapter(struct open_chapter_zone **chapter_zones, unsigned int zone_count, struct volume *volume, struct open_chapter_index *chapter_index, struct uds_volume_record *collated_records, u64 virtual_chapter_number) { int result; uds_empty_open_chapter_index(chapter_index, virtual_chapter_number); result = fill_delta_chapter_index(chapter_zones, zone_count, chapter_index, collated_records); if (result != UDS_SUCCESS) return result; return uds_write_chapter(volume, chapter_index, collated_records); } int uds_save_open_chapter(struct uds_index *index, struct buffered_writer *writer) { int result; struct open_chapter_zone *open_chapter; struct uds_volume_record *record; u8 record_count_data[sizeof(u32)]; u32 record_count = 0; unsigned int record_index; unsigned int z; result = uds_write_to_buffered_writer(writer, OPEN_CHAPTER_MAGIC, OPEN_CHAPTER_MAGIC_LENGTH); if (result != UDS_SUCCESS) return result; result = uds_write_to_buffered_writer(writer, OPEN_CHAPTER_VERSION, OPEN_CHAPTER_VERSION_LENGTH); if (result != UDS_SUCCESS) return result; for (z = 0; z < index->zone_count; z++) { open_chapter = index->zones[z]->open_chapter; record_count += open_chapter->size - open_chapter->deletions; } put_unaligned_le32(record_count, record_count_data); result = uds_write_to_buffered_writer(writer, record_count_data, sizeof(record_count_data)); if (result != UDS_SUCCESS) return result; record_index = 1; while (record_count > 0) { for (z = 0; z < index->zone_count; z++) { open_chapter = index->zones[z]->open_chapter; if (record_index > open_chapter->size) continue; if (open_chapter->slots[record_index].deleted) continue; record = &open_chapter->records[record_index]; result = uds_write_to_buffered_writer(writer, (u8 *) record, sizeof(*record)); if (result != UDS_SUCCESS) return result; record_count--; } record_index++; } return uds_flush_buffered_writer(writer); } u64 uds_compute_saved_open_chapter_size(struct index_geometry *geometry) { unsigned int records_per_chapter = geometry->records_per_chapter; return OPEN_CHAPTER_MAGIC_LENGTH + OPEN_CHAPTER_VERSION_LENGTH + sizeof(u32) + records_per_chapter * sizeof(struct uds_volume_record); } static int load_version20(struct uds_index *index, struct buffered_reader *reader) { int result; u32 record_count; u8 record_count_data[sizeof(u32)]; struct uds_volume_record record; /* * Track which zones cannot accept any more records. If the open chapter had a different * number of zones previously, some new zones may have more records than they have space * for. These overflow records will be discarded. 
*/ bool full_flags[MAX_ZONES] = { false, }; result = uds_read_from_buffered_reader(reader, (u8 *) &record_count_data, sizeof(record_count_data)); if (result != UDS_SUCCESS) return result; record_count = get_unaligned_le32(record_count_data); while (record_count-- > 0) { unsigned int zone = 0; result = uds_read_from_buffered_reader(reader, (u8 *) &record, sizeof(record)); if (result != UDS_SUCCESS) return result; if (index->zone_count > 1) zone = uds_get_volume_index_zone(index->volume_index, &record.name); if (!full_flags[zone]) { struct open_chapter_zone *open_chapter; unsigned int remaining; open_chapter = index->zones[zone]->open_chapter; remaining = uds_put_open_chapter(open_chapter, &record.name, &record.data); /* Do not allow any zone to fill completely. */ full_flags[zone] = (remaining <= 1); } } return UDS_SUCCESS; } int uds_load_open_chapter(struct uds_index *index, struct buffered_reader *reader) { u8 version[OPEN_CHAPTER_VERSION_LENGTH]; int result; result = uds_verify_buffered_data(reader, OPEN_CHAPTER_MAGIC, OPEN_CHAPTER_MAGIC_LENGTH); if (result != UDS_SUCCESS) return result; result = uds_read_from_buffered_reader(reader, version, sizeof(version)); if (result != UDS_SUCCESS) return result; if (memcmp(OPEN_CHAPTER_VERSION, version, sizeof(version)) != 0) { return vdo_log_error_strerror(UDS_CORRUPT_DATA, "Invalid open chapter version: %.*s", (int) sizeof(version), version); } return load_version20(index, reader); } vdo-8.3.1.1/utils/uds/open-chapter.h000066400000000000000000000050611476467262700171720ustar00rootroot00000000000000/* SPDX-License-Identifier: GPL-2.0-only */ /* * Copyright 2023 Red Hat */ #ifndef UDS_OPEN_CHAPTER_H #define UDS_OPEN_CHAPTER_H #include "chapter-index.h" #include "geometry.h" #include "index.h" #include "volume.h" /* * The open chapter tracks the newest records in memory. Like the index as a whole, each open * chapter is divided into a number of independent zones which are interleaved when the chapter is * committed to the volume. 
*/ enum { OPEN_CHAPTER_RECORD_NUMBER_BITS = 23, }; struct open_chapter_zone_slot { /* If non-zero, the record number addressed by this hash slot */ unsigned int record_number : OPEN_CHAPTER_RECORD_NUMBER_BITS; /* If true, the record at the index of this hash slot was deleted */ bool deleted : 1; } __packed; struct open_chapter_zone { /* The maximum number of records that can be stored */ unsigned int capacity; /* The number of records stored */ unsigned int size; /* The number of deleted records */ unsigned int deletions; /* Array of chunk records, 1-based */ struct uds_volume_record *records; /* The number of slots in the hash table */ unsigned int slot_count; /* The hash table slots, referencing virtual record numbers */ struct open_chapter_zone_slot slots[]; }; int __must_check uds_make_open_chapter(const struct index_geometry *geometry, unsigned int zone_count, struct open_chapter_zone **open_chapter_ptr); void uds_reset_open_chapter(struct open_chapter_zone *open_chapter); void uds_search_open_chapter(struct open_chapter_zone *open_chapter, const struct uds_record_name *name, struct uds_record_data *metadata, bool *found); int __must_check uds_put_open_chapter(struct open_chapter_zone *open_chapter, const struct uds_record_name *name, const struct uds_record_data *metadata); void uds_remove_from_open_chapter(struct open_chapter_zone *open_chapter, const struct uds_record_name *name); void uds_free_open_chapter(struct open_chapter_zone *open_chapter); int __must_check uds_close_open_chapter(struct open_chapter_zone **chapter_zones, unsigned int zone_count, struct volume *volume, struct open_chapter_index *chapter_index, struct uds_volume_record *collated_records, u64 virtual_chapter_number); int __must_check uds_save_open_chapter(struct uds_index *index, struct buffered_writer *writer); int __must_check uds_load_open_chapter(struct uds_index *index, struct buffered_reader *reader); u64 uds_compute_saved_open_chapter_size(struct index_geometry *geometry); #endif /* UDS_OPEN_CHAPTER_H */ vdo-8.3.1.1/utils/uds/permassert.c000066400000000000000000000036421476467262700167700ustar00rootroot00000000000000// SPDX-License-Identifier: GPL-2.0-only /* * Copyright 2023 Red Hat */ #include "permassert.h" #include "errors.h" #include "logger.h" #ifdef NDEBUG #define DEBUGGING_OFF #undef NDEBUG #endif /* NDEBUG */ #include #include #include #include #include "string-utils.h" #include "thread-utils.h" #ifdef DEBUGGING_OFF static bool exit_on_assertion_failure; #else /* not DEBUGGING_OFF */ static bool exit_on_assertion_failure = true; #endif /* DEBUGGING_OFF */ static const char *EXIT_ON_ASSERTION_FAILURE_VARIABLE = "UDS_EXIT_ON_ASSERTION_FAILURE"; static atomic_t init_once = ATOMIC_INIT(0); static struct mutex mutex = UDS_MUTEX_INITIALIZER; static void initialize(void) { uds_initialize_mutex(&mutex, !UDS_DO_ASSERTIONS); char *exit_on_assertion_failure_string = getenv(EXIT_ON_ASSERTION_FAILURE_VARIABLE); if (exit_on_assertion_failure_string != NULL) { exit_on_assertion_failure = (strcasecmp(exit_on_assertion_failure_string, "true") == 0); } } bool set_exit_on_assertion_failure(bool should_exit) { bool previous_setting; vdo_perform_once(&init_once, initialize); uds_lock_mutex(&mutex); previous_setting = exit_on_assertion_failure; exit_on_assertion_failure = should_exit; uds_unlock_mutex(&mutex); return previous_setting; } int vdo_assertion_failed(const char *expression_string, const char *file_name, int line_number, const char *format, ...) 
{ va_list args; va_start(args, format); vdo_log_embedded_message(VDO_LOG_ERR, VDO_LOGGING_MODULE_NAME, "assertion \"", format, args, "\" (%s) failed at %s:%d", expression_string, file_name, line_number); vdo_log_backtrace(VDO_LOG_ERR); vdo_perform_once(&init_once, initialize); uds_lock_mutex(&mutex); if (exit_on_assertion_failure) { __assert_fail(expression_string, file_name, line_number, __ASSERT_FUNCTION); } uds_unlock_mutex(&mutex); va_end(args); return UDS_ASSERTION_FAILED; } vdo-8.3.1.1/utils/uds/permassert.h000066400000000000000000000035731476467262700170000ustar00rootroot00000000000000/* SPDX-License-Identifier: GPL-2.0-only */ /* * Copyright 2023 Red Hat */ #ifndef PERMASSERT_H #define PERMASSERT_H #include #include #include "errors.h" /* Utilities for asserting that certain conditions are met */ #define STRINGIFY(X) #X /* * A hack to apply the "warn if unused" attribute to an integral expression. * * Since GCC doesn't propagate the warn_unused_result attribute to conditional expressions * incorporating calls to functions with that attribute, this function can be used to wrap such an * expression. With optimization enabled, this function contributes no additional instructions, but * the warn_unused_result attribute still applies to the code calling it. */ static inline int __must_check vdo_must_use(int value) { return value; } /* Assert that an expression is true and return an error if it is not. */ #define VDO_ASSERT(expr, ...) vdo_must_use(__VDO_ASSERT(expr, __VA_ARGS__)) /* Log a message if the expression is not true. */ #define VDO_ASSERT_LOG_ONLY(expr, ...) __VDO_ASSERT(expr, __VA_ARGS__) #define __VDO_ASSERT(expr, ...) \ (likely(expr) ? VDO_SUCCESS \ : vdo_assertion_failed(STRINGIFY(expr), __FILE__, __LINE__, __VA_ARGS__)) /* Log an assertion failure message. */ int vdo_assertion_failed(const char *expression_string, const char *file_name, int line_number, const char *format, ...) __printf(4, 5); #define STATIC_ASSERT(expr) \ do { \ switch (0) { \ case 0: \ ; \ fallthrough; \ case expr: \ ; \ fallthrough; \ default: \ break; \ } \ } while (0) #define STATIC_ASSERT_SIZEOF(type, expected_size) STATIC_ASSERT(sizeof(type) == (expected_size)) /* Set whether or not to exit on an assertion failure, for tests. */ bool set_exit_on_assertion_failure(bool should_exit); #endif /* PERMASSERT_H */ vdo-8.3.1.1/utils/uds/radix-sort.c000066400000000000000000000223231476467262700166740ustar00rootroot00000000000000// SPDX-License-Identifier: GPL-2.0-only /* * Copyright 2023 Red Hat */ #include "radix-sort.h" #include #include #include "memory-alloc.h" #include "string-utils.h" /* * This implementation allocates one large object to do the sorting, which can be reused as many * times as desired. The amount of memory required is logarithmically proportional to the number of * keys to be sorted. */ /* Piles smaller than this are handled with a simple insertion sort. */ #define INSERTION_SORT_THRESHOLD 12 /* Sort keys are pointers to immutable fixed-length arrays of bytes. */ typedef const u8 *sort_key_t; /* * The keys are separated into piles based on the byte in each keys at the current offset, so the * number of keys with each byte must be counted. 
*/ struct histogram { /* The number of non-empty bins */ u16 used; /* The index (key byte) of the first non-empty bin */ u16 first; /* The index (key byte) of the last non-empty bin */ u16 last; /* The number of occurrences of each specific byte */ u32 size[256]; }; /* * Sub-tasks are manually managed on a stack, both for performance and to put a logarithmic bound * on the stack space needed. */ struct task { /* Pointer to the first key to sort. */ sort_key_t *first_key; /* Pointer to the last key to sort. */ sort_key_t *last_key; /* The offset into the key at which to continue sorting. */ u16 offset; /* The number of bytes remaining in the sort keys. */ u16 length; }; struct radix_sorter { unsigned int count; struct histogram bins; sort_key_t *pile[256]; struct task *end_of_stack; struct task insertion_list[256]; struct task stack[]; }; /* Compare a segment of two fixed-length keys starting at an offset. */ static inline int compare(sort_key_t key1, sort_key_t key2, u16 offset, u16 length) { return memcmp(&key1[offset], &key2[offset], length); } /* Insert the next unsorted key into an array of sorted keys. */ static inline void insert_key(const struct task task, sort_key_t *next) { /* Pull the unsorted key out, freeing up the array slot. */ sort_key_t unsorted = *next; /* Compare the key to the preceding sorted entries, shifting down ones that are larger. */ while ((--next >= task.first_key) && (compare(unsorted, next[0], task.offset, task.length) < 0)) next[1] = next[0]; /* Insert the key into the last slot that was cleared, sorting it. */ next[1] = unsorted; } /* * Sort a range of key segments using an insertion sort. This simple sort is faster than the * 256-way radix sort when the number of keys to sort is small. */ static inline void insertion_sort(const struct task task) { sort_key_t *next; for (next = task.first_key + 1; next <= task.last_key; next++) insert_key(task, next); } /* Push a sorting task onto a task stack. */ static inline void push_task(struct task **stack_pointer, sort_key_t *first_key, u32 count, u16 offset, u16 length) { struct task *task = (*stack_pointer)++; task->first_key = first_key; task->last_key = &first_key[count - 1]; task->offset = offset; task->length = length; } static inline void swap_keys(sort_key_t *a, sort_key_t *b) { sort_key_t c = *a; *a = *b; *b = c; } /* * Count the number of times each byte value appears in the arrays of keys to sort at the current * offset, keeping track of the number of non-empty bins, and the index of the first and last * non-empty bin. */ static inline void measure_bins(const struct task task, struct histogram *bins) { sort_key_t *key_ptr; /* * Subtle invariant: bins->used and bins->size[] are zero because the sorting code clears * it all out as it goes. Even though this structure is re-used, we don't need to pay to * zero it before starting a new tally. */ bins->first = U8_MAX; bins->last = 0; for (key_ptr = task.first_key; key_ptr <= task.last_key; key_ptr++) { /* Increment the count for the byte in the key at the current offset. */ u8 bin = (*key_ptr)[task.offset]; u32 size = ++bins->size[bin]; /* Track non-empty bins. */ if (size == 1) { bins->used += 1; if (bin < bins->first) bins->first = bin; if (bin > bins->last) bins->last = bin; } } } /* * Convert the bin sizes to pointers to where each pile goes. * * pile[0] = first_key + bin->size[0], * pile[1] = pile[0] + bin->size[1], etc. * * After the keys are moved to the appropriate pile, we'll need to sort each of the piles by the * next radix position. 
A new task is put on the stack for each pile containing lots of keys, or a * new task is put on the list for each pile containing few keys. * * @stack: pointer the top of the stack * @end_of_stack: the end of the stack * @list: pointer the head of the list * @pile: array for pointers to the end of each pile * @bins: the histogram of the sizes of each pile * @first_key: the first key of the stack * @offset: the next radix position to sort by * @length: the number of bytes remaining in the sort keys * * Return: UDS_SUCCESS or an error code */ static inline int push_bins(struct task **stack, struct task *end_of_stack, struct task **list, sort_key_t *pile[], struct histogram *bins, sort_key_t *first_key, u16 offset, u16 length) { sort_key_t *pile_start = first_key; int bin; for (bin = bins->first; ; bin++) { u32 size = bins->size[bin]; /* Skip empty piles. */ if (size == 0) continue; /* There's no need to sort empty keys. */ if (length > 0) { if (size > INSERTION_SORT_THRESHOLD) { if (*stack >= end_of_stack) return UDS_BAD_STATE; push_task(stack, pile_start, size, offset, length); } else if (size > 1) { push_task(list, pile_start, size, offset, length); } } pile_start += size; pile[bin] = pile_start; if (--bins->used == 0) break; } return UDS_SUCCESS; } int uds_make_radix_sorter(unsigned int count, struct radix_sorter **sorter) { int result; unsigned int stack_size = count / INSERTION_SORT_THRESHOLD; struct radix_sorter *radix_sorter; result = vdo_allocate_extended(struct radix_sorter, stack_size, struct task, __func__, &radix_sorter); if (result != VDO_SUCCESS) return result; radix_sorter->count = count; radix_sorter->end_of_stack = radix_sorter->stack + stack_size; *sorter = radix_sorter; return UDS_SUCCESS; } void uds_free_radix_sorter(struct radix_sorter *sorter) { vdo_free(sorter); } /* * Sort pointers to fixed-length keys (arrays of bytes) using a radix sort. The sort implementation * is unstable, so the relative ordering of equal keys is not preserved. */ int uds_radix_sort(struct radix_sorter *sorter, const unsigned char *keys[], unsigned int count, unsigned short length) { struct task start; struct histogram *bins = &sorter->bins; sort_key_t **pile = sorter->pile; struct task *task_stack = sorter->stack; /* All zero-length keys are identical and therefore already sorted. */ if ((count == 0) || (length == 0)) return UDS_SUCCESS; /* The initial task is to sort the entire length of all the keys. */ start = (struct task) { .first_key = keys, .last_key = &keys[count - 1], .offset = 0, .length = length, }; if (count <= INSERTION_SORT_THRESHOLD) { insertion_sort(start); return UDS_SUCCESS; } if (count > sorter->count) return UDS_INVALID_ARGUMENT; /* * Repeatedly consume a sorting task from the stack and process it, pushing new sub-tasks * onto the stack for each radix-sorted pile. When all tasks and sub-tasks have been * processed, the stack will be empty and all the keys in the starting task will be fully * sorted. */ for (*task_stack = start; task_stack >= sorter->stack; task_stack--) { const struct task task = *task_stack; struct task *insertion_task_list; int result; sort_key_t *fence; sort_key_t *end; measure_bins(task, bins); /* * Now that we know how large each bin is, generate pointers for each of the piles * and push a new task to sort each pile by the next radix byte. 
*/ insertion_task_list = sorter->insertion_list; result = push_bins(&task_stack, sorter->end_of_stack, &insertion_task_list, pile, bins, task.first_key, task.offset + 1, task.length - 1); if (result != UDS_SUCCESS) { memset(bins, 0, sizeof(*bins)); return result; } /* Now bins->used is zero again. */ /* * Don't bother processing the last pile: when piles 0..N-1 are all in place, then * pile N must also be in place. */ end = task.last_key - bins->size[bins->last]; bins->size[bins->last] = 0; for (fence = task.first_key; fence <= end; ) { u8 bin; sort_key_t key = *fence; /* * The radix byte of the key tells us which pile it belongs in. Swap it for * an unprocessed item just below that pile, and repeat. */ while (--pile[bin = key[task.offset]] > fence) swap_keys(pile[bin], &key); /* * The pile reached the fence. Put the key at the bottom of that pile, * completing it, and advance the fence to the next pile. */ *fence = key; fence += bins->size[bin]; bins->size[bin] = 0; } /* Now bins->size[] is all zero again. */ /* * When the number of keys in a task gets small enough, it is faster to use an * insertion sort than to keep subdividing into tiny piles. */ while (--insertion_task_list >= sorter->insertion_list) insertion_sort(*insertion_task_list); } return UDS_SUCCESS; } vdo-8.3.1.1/utils/uds/radix-sort.h000066400000000000000000000015051476467262700167000ustar00rootroot00000000000000/* SPDX-License-Identifier: GPL-2.0-only */ /* * Copyright 2023 Red Hat */ #ifndef UDS_RADIX_SORT_H #define UDS_RADIX_SORT_H #include /* * Radix sort is implemented using an American Flag sort, an unstable, in-place 8-bit radix * exchange sort. This is adapted from the algorithm in the paper by Peter M. McIlroy, Keith * Bostic, and M. Douglas McIlroy, "Engineering Radix Sort". 
* * http://www.usenix.org/publications/compsystems/1993/win_mcilroy.pdf */ struct radix_sorter; int __must_check uds_make_radix_sorter(unsigned int count, struct radix_sorter **sorter); void uds_free_radix_sorter(struct radix_sorter *sorter); int __must_check uds_radix_sort(struct radix_sorter *sorter, const unsigned char *keys[], unsigned int count, unsigned short length); #endif /* UDS_RADIX_SORT_H */ vdo-8.3.1.1/utils/uds/random.c000066400000000000000000000010771476467262700160630ustar00rootroot00000000000000// SPDX-License-Identifier: GPL-2.0-only /* * Copyright 2023 Red Hat */ #include #include #include "random.h" void get_random_bytes(void *buffer, size_t byte_count) { uint64_t rand_num = 0; uint64_t rand_mask = 0; const uint64_t multiplier = (uint64_t) RAND_MAX + 1; u8 *data = buffer; size_t i; for (i = 0; i < byte_count; i++) { if (rand_mask < 0xff) { rand_num = rand_num * multiplier + random(); rand_mask = rand_mask * multiplier + RAND_MAX; } data[i] = rand_num & 0xff; rand_num >>= 8; rand_mask >>= 8; } } vdo-8.3.1.1/utils/uds/random.h000066400000000000000000000002341476467262700160620ustar00rootroot00000000000000/* SPDX-License-Identifier: GPL-2.0-only */ /* * Copyright 2023 Red Hat */ #ifndef RANDOM_H #define RANDOM_H #include #endif /* RANDOM_H */ vdo-8.3.1.1/utils/uds/requestQueue.c000066400000000000000000000226641476467262700173050ustar00rootroot00000000000000// SPDX-License-Identifier: GPL-2.0-only /* * Copyright 2023 Red Hat */ #include "funnel-requestqueue.h" #include #include #include "event-count.h" #include "funnel-queue.h" #include "logger.h" #include "memory-alloc.h" #include "thread-utils.h" #include "time-utils.h" /* * This queue will attempt to handle requests in reasonably sized batches * instead of reacting immediately to each new request. The wait time between * batches is dynamically adjusted up or down to try to balance responsiveness * against wasted thread run time. * * If the wait time becomes long enough, the queue will become dormant and must * be explicitly awoken when a new request is enqueued. The enqueue operation * updates "newest" in the funnel queue via xchg (which is a memory barrier), * and later checks "dormant" to decide whether to do a wakeup of the worker * thread. * * When deciding to go to sleep, the worker thread sets "dormant" and then * examines "newest" to decide if the funnel queue is idle. In dormant mode, * the last examination of "newest" before going to sleep is done inside the * wait_event_interruptible macro(), after a point where one or more memory * barriers have been issued. (Preparing to sleep uses spin locks.) Even if the * funnel queue's "next" field update isn't visible yet to make the entry * accessible, its existence will kick the worker thread out of dormant mode * and back into timer-based mode. * * Unbatched requests are used to communicate between different zone threads * and will also cause the queue to awaken immediately. 
*/ enum { NANOSECOND = 1, MICROSECOND = 1000 * NANOSECOND, MILLISECOND = 1000 * MICROSECOND, DEFAULT_WAIT_TIME = 10 * MICROSECOND, MINIMUM_WAIT_TIME = DEFAULT_WAIT_TIME / 2, MAXIMUM_WAIT_TIME = MILLISECOND, MINIMUM_BATCH = 32, MAXIMUM_BATCH = 64, }; struct uds_request_queue { /* The name of queue */ const char *name; /* Function to process a request */ uds_request_queue_processor_fn processor; /* Queue of new incoming requests */ struct funnel_queue *main_queue; /* Queue of old requests to retry */ struct funnel_queue *retry_queue; /* Signal to wake the worker thread */ struct event_count *work_event; /* The thread id of the worker thread */ struct thread *thread; /* True if the worker was started */ bool started; /* When true, requests can be enqueued */ bool running; /* A flag set when the worker is waiting without a timeout */ atomic_t dormant; /* * The following fields are mutable state private to the worker thread. * The first field is aligned to avoid cache line sharing with * preceding fields. */ /* Requests processed since last wait */ uint64_t current_batch __aligned(L1_CACHE_BYTES); /* The amount of time to wait to accumulate a batch of requests */ uint64_t wait_nanoseconds; /* The relative time at which to wake when waiting with a timeout */ ktime_t wake_rel_time; }; /**********************************************************************/ static void adjust_wait_time(struct uds_request_queue *queue) { uint64_t delta = queue->wait_nanoseconds / 4; if (queue->current_batch < MINIMUM_BATCH) queue->wait_nanoseconds += delta; else if (queue->current_batch > MAXIMUM_BATCH) queue->wait_nanoseconds -= delta; } /** * Decide if the queue should wait with a timeout or enter the dormant mode * of waiting without a timeout. If timing out, returns an relative wake * time to pass to the wait call, otherwise returns NULL. (wake_rel_time is a * queue field to make it easy for this function to return NULL). * * @param queue the request queue * * @return a pointer the relative wake time, or NULL if there is no timeout **/ static ktime_t *get_wake_time(struct uds_request_queue *queue) { if (queue->wait_nanoseconds >= MAXIMUM_WAIT_TIME) { if (atomic_read(&queue->dormant)) { /* The thread is going dormant. */ queue->wait_nanoseconds = DEFAULT_WAIT_TIME; return NULL; } queue->wait_nanoseconds = MAXIMUM_WAIT_TIME; atomic_set_release(&queue->dormant, true); } else if (queue->wait_nanoseconds < MINIMUM_WAIT_TIME) { queue->wait_nanoseconds = MINIMUM_WAIT_TIME; } queue->wake_rel_time = queue->wait_nanoseconds; return &queue->wake_rel_time; } /** * Poll the underlying lock-free queues for a request to process. Requests in * the retry queue have higher priority, so that queue is polled first. * * @param queue the request queue being serviced * * @return a dequeued request, or NULL if no request was available **/ static struct uds_request *poll_queues(struct uds_request_queue *queue) { struct funnel_queue_entry *entry; entry = vdo_funnel_queue_poll(queue->retry_queue); if (entry != NULL) return container_of(entry, struct uds_request, queue_link); entry = vdo_funnel_queue_poll(queue->main_queue); if (entry != NULL) return container_of(entry, struct uds_request, queue_link); return NULL; } /* * Remove the next request to be processed from the queue, waiting for a * request if necessary. 
*/ static struct uds_request *dequeue_request(struct uds_request_queue *queue) { for (;;) { struct uds_request *request; event_token_t wait_token; ktime_t *wake_time; bool shutting_down; queue->current_batch++; request = poll_queues(queue); if (request != NULL) return request; /* Prepare to wait for more work to arrive. */ wait_token = event_count_prepare(queue->work_event); shutting_down = !READ_ONCE(queue->running); if (shutting_down) /* * Ensure that we see any remaining requests that were * enqueued before shutting down. The corresponding * write barrier is in uds_request_queue_finish(). */ smp_rmb(); /* * Poll again in case a request was enqueued just before we got * the event key. */ request = poll_queues(queue); if ((request != NULL) || shutting_down) { event_count_cancel(queue->work_event, wait_token); return request; } /* Wait for more work to arrive. */ adjust_wait_time(queue); wake_time = get_wake_time(queue); event_count_wait(queue->work_event, wait_token, wake_time); if (wake_time == NULL) { /* * The queue has been roused from dormancy. Clear the * flag so enqueuers can stop broadcasting. No fence is * needed for this transition. */ atomic_set(&queue->dormant, false); queue->wait_nanoseconds = DEFAULT_WAIT_TIME; } queue->current_batch = 0; } } /**********************************************************************/ static void request_queue_worker(void *arg) { struct uds_request_queue *queue = (struct uds_request_queue *) arg; struct uds_request *request; vdo_log_debug("%s queue starting", queue->name); while ((request = dequeue_request(queue)) != NULL) queue->processor(request); vdo_log_debug("%s queue done", queue->name); } /**********************************************************************/ int uds_make_request_queue(const char *queue_name, uds_request_queue_processor_fn processor, struct uds_request_queue **queue_ptr) { int result; struct uds_request_queue *queue; result = vdo_allocate(1, struct uds_request_queue, __func__, &queue); if (result != VDO_SUCCESS) return result; queue->name = queue_name; queue->processor = processor; queue->running = true; queue->current_batch = 0; queue->wait_nanoseconds = DEFAULT_WAIT_TIME; result = vdo_make_funnel_queue(&queue->main_queue); if (result != UDS_SUCCESS) { uds_request_queue_finish(queue); return result; } result = vdo_make_funnel_queue(&queue->retry_queue); if (result != UDS_SUCCESS) { uds_request_queue_finish(queue); return result; } result = make_event_count(&queue->work_event); if (result != UDS_SUCCESS) { uds_request_queue_finish(queue); return result; } result = vdo_create_thread(request_queue_worker, queue, queue_name, &queue->thread); if (result != VDO_SUCCESS) { uds_request_queue_finish(queue); return result; } queue->started = true; smp_mb(); *queue_ptr = queue; return UDS_SUCCESS; } /**********************************************************************/ static inline void wake_up_worker(struct uds_request_queue *queue) { event_count_broadcast(queue->work_event); } /**********************************************************************/ void uds_request_queue_enqueue(struct uds_request_queue *queue, struct uds_request *request) { struct funnel_queue *sub_queue; bool unbatched = request->unbatched; sub_queue = request->requeued ? queue->retry_queue : queue->main_queue; vdo_funnel_queue_put(sub_queue, &request->queue_link); /* * We must wake the worker thread when it is dormant. A read fence * isn't needed here since we know the queue operation acts as one. 
*/ if (atomic_read(&queue->dormant) || unbatched) wake_up_worker(queue); } /**********************************************************************/ void uds_request_queue_finish(struct uds_request_queue *queue) { if (queue == NULL) return; /* * This memory barrier ensures that any requests we queued will be * seen. The point is that when dequeue_request() sees the following * update to the running flag, it will also be able to see any change * we made to a next field in the funnel queue entry. The corresponding * read barrier is in dequeue_request(). */ smp_wmb(); WRITE_ONCE(queue->running, false); if (queue->started) { wake_up_worker(queue); vdo_join_threads(queue->thread); } free_event_count(queue->work_event); vdo_free_funnel_queue(queue->main_queue); vdo_free_funnel_queue(queue->retry_queue); vdo_free(queue); } vdo-8.3.1.1/utils/uds/sparse-cache.c000066400000000000000000000464661476467262700171540ustar00rootroot00000000000000// SPDX-License-Identifier: GPL-2.0-only /* * Copyright 2023 Red Hat */ #include "sparse-cache.h" #include #include #include "logger.h" #include "memory-alloc.h" #include "permassert.h" #include "thread-utils.h" #include "chapter-index.h" #include "config.h" #include "index.h" /* * Since the cache is small, it is implemented as a simple array of cache entries. Searching for a * specific virtual chapter is implemented as a linear search. The cache replacement policy is * least-recently-used (LRU). Again, the small size of the cache allows the LRU order to be * maintained by shifting entries in an array list. * * Changing the contents of the cache requires the coordinated participation of all zone threads * via the careful use of barrier messages sent to all the index zones by the triage queue worker * thread. The critical invariant for coordination is that the cache membership must not change * between updates, so that all calls to uds_sparse_cache_contains() from the zone threads must all * receive the same results for every virtual chapter number. To ensure that critical invariant, * state changes such as "that virtual chapter is no longer in the volume" and "skip searching that * chapter because it has had too many cache misses" are represented separately from the cache * membership information (the virtual chapter number). * * As a result of this invariant, we have the guarantee that every zone thread will call * uds_update_sparse_cache() once and exactly once to request a chapter that is not in the cache, * and the serialization of the barrier requests from the triage queue ensures they will all * request the same chapter number. This means the only synchronization we need can be provided by * a pair of thread barriers used only in the uds_update_sparse_cache() call, providing a critical * section where a single zone thread can drive the cache update while all the other zone threads * are known to be blocked, waiting in the second barrier. Outside that critical section, all the * zone threads implicitly hold a shared lock. Inside it, the thread for zone zero holds an * exclusive lock. No other threads may access or modify the cache entries. * * Chapter statistics must only be modified by a single thread, which is also the zone zero thread. * All fields that might be frequently updated by that thread are kept in separate cache-aligned * structures so they will not cause cache contention via "false sharing" with the fields that are * frequently accessed by all of the zone threads. 
 *
 * The LRU order is managed independently by each zone thread, and each zone uses its own list for
 * searching and cache membership queries. The zone zero list is used to decide which chapter to
 * evict when the cache is updated, and its search list is copied to the other threads at that
 * time.
 *
 * The virtual chapter number field of the cache entry is the single field indicating whether a
 * chapter is a member of the cache or not. The value NO_CHAPTER is used to represent a null or
 * undefined chapter number. When present in the virtual chapter number field of a
 * cached_chapter_index, it indicates that the cache entry is dead, and all the other fields of
 * that entry (other than immutable pointers to cache memory) are undefined and irrelevant. Any
 * cache entry that is not marked as dead is fully defined and a member of the cache, and
 * uds_sparse_cache_contains() will always return true for any virtual chapter number that appears
 * in any of the cache entries.
 *
 * A chapter index that is a member of the cache may be excluded from searches between calls to
 * uds_update_sparse_cache() in two different ways. First, when a chapter falls off the end of the
 * volume, its virtual chapter number will be less than the oldest virtual chapter number. Since
 * that chapter is no longer part of the volume, there's no point in continuing to search that
 * chapter index. Once invalidated, that virtual chapter will still be considered a member of the
 * cache, but it will no longer be searched for matching names.
 *
 * The second mechanism is a heuristic based on keeping track of the number of consecutive search
 * misses in a given chapter index. Once that count exceeds a threshold, the skip_search flag will
 * be set to true, causing the chapter to be skipped when searching the entire cache, but still
 * allowing it to be found when searching for a hook in that specific chapter. Finding a hook will
 * clear the skip_search flag, once again allowing the non-hook searches to use that cache entry.
 * Again, regardless of the state of the skip_search flag, the virtual chapter must still be
 * considered a member of the cache for uds_sparse_cache_contains().
 */

#define SKIP_SEARCH_THRESHOLD 20000
#define ZONE_ZERO 0

/*
 * These counters are essentially fields of the struct cached_chapter_index, but are segregated
 * into this structure because they are frequently modified. They are grouped and aligned to keep
 * them on different cache lines from the chapter fields that are accessed far more often than they
 * are updated.
 */
struct __aligned(L1_CACHE_BYTES) cached_index_counters {
	u64 consecutive_misses;
};

struct __aligned(L1_CACHE_BYTES) cached_chapter_index {
	/*
	 * The virtual chapter number of the cached chapter index. NO_CHAPTER means this cache
	 * entry is unused. This field must only be modified in the critical section in
	 * uds_update_sparse_cache().
	 */
	u64 virtual_chapter;

	u32 index_pages_count;

	/*
	 * These pointers are immutable during the life of the cache. The contents of the arrays
	 * change when the cache entry is replaced.
	 */
	struct delta_index_page *index_pages;
	struct dm_buffer **page_buffers;

	/*
	 * If set, skip the chapter when searching the entire cache. This flag is just a
	 * performance optimization. This flag is mutable between cache updates, but it rarely
	 * changes and is frequently accessed, so it groups with the immutable fields.
*/ bool skip_search; /* * The cache-aligned counters change often and are placed at the end of the structure to * prevent false sharing with the more stable fields above. */ struct cached_index_counters counters; }; /* * A search_list represents an ordering of the sparse chapter index cache entry array, from most * recently accessed to least recently accessed, which is the order in which the indexes should be * searched and the reverse order in which they should be evicted from the cache. * * Cache entries that are dead or empty are kept at the end of the list, avoiding the need to even * iterate over them to search, and ensuring that dead entries are replaced before any live entries * are evicted. * * The search list is instantiated for each zone thread, avoiding any need for synchronization. The * structure is allocated on a cache boundary to avoid false sharing of memory cache lines between * zone threads. */ struct search_list { u8 capacity; u8 first_dead_entry; struct cached_chapter_index *entries[]; }; struct sparse_cache { const struct index_geometry *geometry; unsigned int capacity; unsigned int zone_count; unsigned int skip_threshold; struct search_list *search_lists[MAX_ZONES]; struct cached_chapter_index **scratch_entries; struct threads_barrier begin_update_barrier; struct threads_barrier end_update_barrier; struct cached_chapter_index chapters[]; }; static int __must_check initialize_cached_chapter_index(struct cached_chapter_index *chapter, const struct index_geometry *geometry) { int result; chapter->virtual_chapter = NO_CHAPTER; chapter->index_pages_count = geometry->index_pages_per_chapter; result = vdo_allocate(chapter->index_pages_count, struct delta_index_page, __func__, &chapter->index_pages); if (result != VDO_SUCCESS) return result; return vdo_allocate(chapter->index_pages_count, struct dm_buffer *, "sparse index volume pages", &chapter->page_buffers); } static int __must_check make_search_list(struct sparse_cache *cache, struct search_list **list_ptr) { struct search_list *list; unsigned int bytes; u8 i; int result; bytes = (sizeof(struct search_list) + (cache->capacity * sizeof(struct cached_chapter_index *))); result = vdo_allocate_cache_aligned(bytes, "search list", &list); if (result != VDO_SUCCESS) return result; list->capacity = cache->capacity; list->first_dead_entry = 0; for (i = 0; i < list->capacity; i++) list->entries[i] = &cache->chapters[i]; *list_ptr = list; return UDS_SUCCESS; } int uds_make_sparse_cache(const struct index_geometry *geometry, unsigned int capacity, unsigned int zone_count, struct sparse_cache **cache_ptr) { int result; unsigned int i; struct sparse_cache *cache; unsigned int bytes; bytes = (sizeof(struct sparse_cache) + (capacity * sizeof(struct cached_chapter_index))); result = vdo_allocate_cache_aligned(bytes, "sparse cache", &cache); if (result != VDO_SUCCESS) return result; cache->geometry = geometry; cache->capacity = capacity; cache->zone_count = zone_count; /* * Scale down the skip threshold since the cache only counts cache misses in zone zero, but * requests are being handled in all zones. 
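 *
 * As a worked example (hypothetical zone count): with SKIP_SEARCH_THRESHOLD at 20000 and
 * four zones, skip_threshold becomes 20000 / 4 = 5000. Zone zero scores only about one
 * request in four, so roughly 20000 misses across all zones are still required before a
 * chapter is marked to be skipped.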
*/ cache->skip_threshold = (SKIP_SEARCH_THRESHOLD / zone_count); initialize_threads_barrier(&cache->begin_update_barrier, zone_count); initialize_threads_barrier(&cache->end_update_barrier, zone_count); for (i = 0; i < capacity; i++) { result = initialize_cached_chapter_index(&cache->chapters[i], geometry); if (result != UDS_SUCCESS) goto out; } for (i = 0; i < zone_count; i++) { result = make_search_list(cache, &cache->search_lists[i]); if (result != UDS_SUCCESS) goto out; } /* purge_search_list() needs some temporary lists for sorting. */ result = vdo_allocate(capacity * 2, struct cached_chapter_index *, "scratch entries", &cache->scratch_entries); if (result != VDO_SUCCESS) goto out; *cache_ptr = cache; return UDS_SUCCESS; out: uds_free_sparse_cache(cache); return result; } static inline void set_skip_search(struct cached_chapter_index *chapter, bool skip_search) { /* Check before setting to reduce cache line contention. */ if (READ_ONCE(chapter->skip_search) != skip_search) WRITE_ONCE(chapter->skip_search, skip_search); } static void score_search_hit(struct cached_chapter_index *chapter) { chapter->counters.consecutive_misses = 0; set_skip_search(chapter, false); } static void score_search_miss(struct sparse_cache *cache, struct cached_chapter_index *chapter) { chapter->counters.consecutive_misses++; if (chapter->counters.consecutive_misses > cache->skip_threshold) set_skip_search(chapter, true); } static void release_cached_chapter_index(struct cached_chapter_index *chapter) { unsigned int i; chapter->virtual_chapter = NO_CHAPTER; if (chapter->page_buffers == NULL) return; for (i = 0; i < chapter->index_pages_count; i++) { if (chapter->page_buffers[i] != NULL) dm_bufio_release(vdo_forget(chapter->page_buffers[i])); } } void uds_free_sparse_cache(struct sparse_cache *cache) { unsigned int i; if (cache == NULL) return; vdo_free(cache->scratch_entries); for (i = 0; i < cache->zone_count; i++) vdo_free(cache->search_lists[i]); for (i = 0; i < cache->capacity; i++) { release_cached_chapter_index(&cache->chapters[i]); vdo_free(cache->chapters[i].index_pages); vdo_free(cache->chapters[i].page_buffers); } vdo_free(cache); } /* * Take the indicated element of the search list and move it to the start, pushing the pointers * previously before it back down the list. */ static inline void set_newest_entry(struct search_list *search_list, u8 index) { struct cached_chapter_index *newest; if (index > 0) { newest = search_list->entries[index]; memmove(&search_list->entries[1], &search_list->entries[0], index * sizeof(struct cached_chapter_index *)); search_list->entries[0] = newest; } /* * This function may have moved a dead chapter to the front of the list for reuse, in which * case the set of dead chapters becomes smaller. */ if (search_list->first_dead_entry <= index) search_list->first_dead_entry++; } bool uds_sparse_cache_contains(struct sparse_cache *cache, u64 virtual_chapter, unsigned int zone_number) { struct search_list *search_list; struct cached_chapter_index *chapter; u8 i; /* * The correctness of the barriers depends on the invariant that between calls to * uds_update_sparse_cache(), the answers this function returns must never vary: the result * for a given chapter must be identical across zones. That invariant must be maintained * even if the chapter falls off the end of the volume, or if searching it is disabled * because of too many search misses. 
*/ search_list = cache->search_lists[zone_number]; for (i = 0; i < search_list->first_dead_entry; i++) { chapter = search_list->entries[i]; if (virtual_chapter == chapter->virtual_chapter) { if (zone_number == ZONE_ZERO) score_search_hit(chapter); set_newest_entry(search_list, i); return true; } } return false; } /* * Re-sort cache entries into three sets (active, skippable, and dead) while maintaining the LRU * ordering that already existed. This operation must only be called during the critical section in * uds_update_sparse_cache(). */ static void purge_search_list(struct search_list *search_list, struct sparse_cache *cache, u64 oldest_virtual_chapter) { struct cached_chapter_index **entries; struct cached_chapter_index **skipped; struct cached_chapter_index **dead; struct cached_chapter_index *chapter; unsigned int next_alive = 0; unsigned int next_skipped = 0; unsigned int next_dead = 0; unsigned int i; entries = &search_list->entries[0]; skipped = &cache->scratch_entries[0]; dead = &cache->scratch_entries[search_list->capacity]; for (i = 0; i < search_list->first_dead_entry; i++) { chapter = search_list->entries[i]; if ((chapter->virtual_chapter < oldest_virtual_chapter) || (chapter->virtual_chapter == NO_CHAPTER)) dead[next_dead++] = chapter; else if (chapter->skip_search) skipped[next_skipped++] = chapter; else entries[next_alive++] = chapter; } memcpy(&entries[next_alive], skipped, next_skipped * sizeof(struct cached_chapter_index *)); memcpy(&entries[next_alive + next_skipped], dead, next_dead * sizeof(struct cached_chapter_index *)); search_list->first_dead_entry = next_alive + next_skipped; } static int __must_check cache_chapter_index(struct cached_chapter_index *chapter, u64 virtual_chapter, const struct volume *volume) { int result; release_cached_chapter_index(chapter); result = uds_read_chapter_index_from_volume(volume, virtual_chapter, chapter->page_buffers, chapter->index_pages); if (result != UDS_SUCCESS) return result; chapter->counters.consecutive_misses = 0; chapter->virtual_chapter = virtual_chapter; chapter->skip_search = false; return UDS_SUCCESS; } static inline void copy_search_list(const struct search_list *source, struct search_list *target) { *target = *source; memcpy(target->entries, source->entries, source->capacity * sizeof(struct cached_chapter_index *)); } /* * Update the sparse cache to contain a chapter index. This function must be called by all the zone * threads with the same chapter number to correctly enter the thread barriers used to synchronize * the cache updates. */ int uds_update_sparse_cache(struct index_zone *zone, u64 virtual_chapter) { int result = UDS_SUCCESS; const struct uds_index *index = zone->index; struct sparse_cache *cache = index->volume->sparse_cache; if (uds_sparse_cache_contains(cache, virtual_chapter, zone->id)) return UDS_SUCCESS; /* * Wait for every zone thread to reach its corresponding barrier request and invoke this * function before starting to modify the cache. */ enter_threads_barrier(&cache->begin_update_barrier); /* * This is the start of the critical section: the zone zero thread is captain, effectively * holding an exclusive lock on the sparse cache. All the other zone threads must do * nothing between the two barriers. They will wait at the end_update_barrier again for the * captain to finish the update. 
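 *
 * A rough timeline of one update, as a sketch (illustrative only):
 *
 *	all zones:  enter_threads_barrier(&cache->begin_update_barrier);
 *	zone zero:  purge_search_list(...);
 *	            cache_chapter_index(...);   (reuse the oldest or a dead entry for the new chapter)
 *	            copy_search_list(...);      (publish the new order to every zone)
 *	all zones:  enter_threads_barrier(&cache->end_update_barrier);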
*/ if (zone->id == ZONE_ZERO) { unsigned int z; struct search_list *list = cache->search_lists[ZONE_ZERO]; purge_search_list(list, cache, zone->oldest_virtual_chapter); if (virtual_chapter >= index->oldest_virtual_chapter) { set_newest_entry(list, list->capacity - 1); result = cache_chapter_index(list->entries[0], virtual_chapter, index->volume); } for (z = 1; z < cache->zone_count; z++) copy_search_list(list, cache->search_lists[z]); } /* * This is the end of the critical section. All cache invariants must have been restored. */ enter_threads_barrier(&cache->end_update_barrier); return result; } void uds_invalidate_sparse_cache(struct sparse_cache *cache) { unsigned int i; for (i = 0; i < cache->capacity; i++) release_cached_chapter_index(&cache->chapters[i]); } static inline bool should_skip_chapter(struct cached_chapter_index *chapter, u64 oldest_chapter, u64 requested_chapter) { if ((chapter->virtual_chapter == NO_CHAPTER) || (chapter->virtual_chapter < oldest_chapter)) return true; if (requested_chapter != NO_CHAPTER) return requested_chapter != chapter->virtual_chapter; else return READ_ONCE(chapter->skip_search); } static int __must_check search_cached_chapter_index(struct cached_chapter_index *chapter, const struct index_geometry *geometry, const struct index_page_map *index_page_map, const struct uds_record_name *name, u16 *record_page_ptr) { u32 physical_chapter = uds_map_to_physical_chapter(geometry, chapter->virtual_chapter); u32 index_page_number = uds_find_index_page_number(index_page_map, name, physical_chapter); struct delta_index_page *index_page = &chapter->index_pages[index_page_number]; return uds_search_chapter_index_page(index_page, geometry, name, record_page_ptr); } int uds_search_sparse_cache(struct index_zone *zone, const struct uds_record_name *name, u64 *virtual_chapter_ptr, u16 *record_page_ptr) { int result; struct volume *volume = zone->index->volume; struct sparse_cache *cache = volume->sparse_cache; struct cached_chapter_index *chapter; struct search_list *search_list; u8 i; /* Search the entire cache unless a specific chapter was requested. */ bool search_one = (*virtual_chapter_ptr != NO_CHAPTER); *record_page_ptr = NO_CHAPTER_INDEX_ENTRY; search_list = cache->search_lists[zone->id]; for (i = 0; i < search_list->first_dead_entry; i++) { chapter = search_list->entries[i]; if (should_skip_chapter(chapter, zone->oldest_virtual_chapter, *virtual_chapter_ptr)) continue; result = search_cached_chapter_index(chapter, cache->geometry, volume->index_page_map, name, record_page_ptr); if (result != UDS_SUCCESS) return result; if (*record_page_ptr != NO_CHAPTER_INDEX_ENTRY) { /* * In theory, this might be a false match while a true match exists in * another chapter, but that's a very rare case and not worth the extra * search complexity. */ set_newest_entry(search_list, i); if (zone->id == ZONE_ZERO) score_search_hit(chapter); *virtual_chapter_ptr = chapter->virtual_chapter; return UDS_SUCCESS; } if (zone->id == ZONE_ZERO) score_search_miss(cache, chapter); if (search_one) break; } return UDS_SUCCESS; } vdo-8.3.1.1/utils/uds/sparse-cache.h000066400000000000000000000034021476467262700171400ustar00rootroot00000000000000/* SPDX-License-Identifier: GPL-2.0-only */ /* * Copyright 2023 Red Hat */ #ifndef UDS_SPARSE_CACHE_H #define UDS_SPARSE_CACHE_H #include "geometry.h" #include "indexer.h" /* * The sparse cache is a cache of entire chapter indexes from sparse chapters used for searching * for names after all other search paths have failed. 
It contains only complete chapter indexes; * record pages from sparse chapters and single index pages used for resolving hooks are kept in * the regular page cache in the volume. * * The most important property of this cache is the absence of synchronization for read operations. * Safe concurrent access to the cache by the zone threads is controlled by the triage queue and * the barrier requests it issues to the zone queues. The set of cached chapters does not and must * not change between the carefully coordinated calls to uds_update_sparse_cache() from the zone * threads. Outside of updates, every zone will get the same result when calling * uds_sparse_cache_contains() as every other zone. */ struct index_zone; struct sparse_cache; int __must_check uds_make_sparse_cache(const struct index_geometry *geometry, unsigned int capacity, unsigned int zone_count, struct sparse_cache **cache_ptr); void uds_free_sparse_cache(struct sparse_cache *cache); bool uds_sparse_cache_contains(struct sparse_cache *cache, u64 virtual_chapter, unsigned int zone_number); int __must_check uds_update_sparse_cache(struct index_zone *zone, u64 virtual_chapter); void uds_invalidate_sparse_cache(struct sparse_cache *cache); int __must_check uds_search_sparse_cache(struct index_zone *zone, const struct uds_record_name *name, u64 *virtual_chapter_ptr, u16 *record_page_ptr); #endif /* UDS_SPARSE_CACHE_H */ vdo-8.3.1.1/utils/uds/string-utils.c000066400000000000000000000017341476467262700172470ustar00rootroot00000000000000// SPDX-License-Identifier: GPL-2.0-only /* * Copyright 2023 Red Hat */ #include "string-utils.h" #include "errors.h" #include "logger.h" #include "memory-alloc.h" int vdo_alloc_sprintf(const char *what, char **strp, const char *fmt, ...) { va_list args; int result; int count; if (strp == NULL) return UDS_INVALID_ARGUMENT; va_start(args, fmt); count = vsnprintf(NULL, 0, fmt, args) + 1; va_end(args); result = vdo_allocate(count, char, what, strp); if (result == VDO_SUCCESS) { va_start(args, fmt); vsnprintf(*strp, count, fmt, args); va_end(args); } if ((result != VDO_SUCCESS) && (what != NULL)) vdo_log_error("cannot allocate %s", what); return result; } char *vdo_append_to_buffer(char *buffer, char *buf_end, const char *fmt, ...) { va_list args; size_t n; va_start(args, fmt); n = vsnprintf(buffer, buf_end - buffer, fmt, args); if (n >= (size_t) (buf_end - buffer)) buffer = buf_end; else buffer += n; va_end(args); return buffer; } vdo-8.3.1.1/utils/uds/string-utils.h000066400000000000000000000014401476467262700172460ustar00rootroot00000000000000/* SPDX-License-Identifier: GPL-2.0-only */ /* * Copyright 2023 Red Hat */ #ifndef VDO_STRING_UTILS_H #define VDO_STRING_UTILS_H #include #include #include #include #include /* Utilities related to string manipulation */ static inline const char *vdo_bool_to_string(bool value) { return value ? "true" : "false"; } /* * Allocate memory to contain a formatted string. The caller is responsible for * freeing the allocated memory. */ int __must_check vdo_alloc_sprintf(const char *what, char **strp, const char *fmt, ...) __printf(3, 4); /* Append a formatted string to the end of a buffer. */ char *vdo_append_to_buffer(char *buffer, char *buf_end, const char *fmt, ...) 
__printf(3, 4); #endif /* VDO_STRING_UTILS_H */ vdo-8.3.1.1/utils/uds/syscalls.c000066400000000000000000000072241476467262700164400ustar00rootroot00000000000000/* * Copyright Red Hat * * This program is free software; you can redistribute it and/or * modify it under the terms of the GNU General Public License * as published by the Free Software Foundation; either version 2 * of the License, or (at your option) any later version. * * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with this program; if not, write to the Free Software * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA * 02110-1301, USA. */ #include "syscalls.h" #include #include #include "permassert.h" /**********************************************************************/ int logging_read(int fd, void *buf, size_t count, const char *context, ssize_t *bytes_read_ptr) { int result; do { result = check_io_errors(read(fd, buf, count), __func__, context, bytes_read_ptr); } while (result == EINTR); return result; } /**********************************************************************/ static int logging_pread_interruptible(int fd, void *buf, size_t count, off_t offset, const char *context, ssize_t *bytes_read_ptr) { return check_io_errors(pread(fd, buf, count, offset), __func__, context, bytes_read_ptr); } /**********************************************************************/ int logging_pread(int fd, void *buf, size_t count, off_t offset, const char *context, ssize_t *bytes_read_ptr) { int result; do { result = logging_pread_interruptible(fd, buf, count, offset, context, bytes_read_ptr); } while (result == EINTR); return result; } /**********************************************************************/ int logging_write(int fd, const void *buf, size_t count, const char *context, ssize_t *bytes_written_ptr) { int result; do { result = check_io_errors(write(fd, buf, count), __func__, context, bytes_written_ptr); } while (result == EINTR); return result; } /**********************************************************************/ static int logging_pwrite_interruptible(int fd, const void *buf, size_t count, off_t offset, const char *context, ssize_t *bytes_written_ptr) { return check_io_errors(pwrite(fd, buf, count, offset), __func__, context, bytes_written_ptr); } /**********************************************************************/ int logging_pwrite(int fd, const void *buf, size_t count, off_t offset, const char *context, ssize_t *bytes_written_ptr) { int result; do { result = logging_pwrite_interruptible(fd, buf, count, offset, context, bytes_written_ptr); } while (result == EINTR); return result; } /**********************************************************************/ int logging_close(int fd, const char *context) { return check_system_call(close(fd), __func__, context); } /**********************************************************************/ int process_control(int option, unsigned long arg2, unsigned long arg3, unsigned long arg4, unsigned long arg5) { int result = prctl(option, arg2, arg3, arg4, arg5); VDO_ASSERT_LOG_ONLY(result >= 0, "option: %d, arg2: %lu, arg3: %lu, arg4: %lu, arg5: %lu", option, arg2, arg3, arg4, arg5); return errno; } 
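
/*
 * Illustrative usage only (not part of the library): a typical caller reads a fixed-size
 * block and treats an error or a short read as a failure. The names "fd", "block", and
 * "offset" are hypothetical, and the example is kept in a comment so it does not affect
 * the build.
 *
 *	u8 block[4096];
 *	ssize_t bytes_read;
 *	int result = logging_pread(fd, block, sizeof(block), offset,
 *				   "reading index block", &bytes_read);
 *	if ((result != UDS_SUCCESS) || (bytes_read != (ssize_t) sizeof(block)))
 *		return result;
 */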
vdo-8.3.1.1/utils/uds/syscalls.h000066400000000000000000000105661476467262700164500ustar00rootroot00000000000000/* SPDX-License-Identifier: GPL-2.0-only */ /* * Copyright 2023 Red Hat */ #ifndef SYSCALLS_H #define SYSCALLS_H #include #include #include "errors.h" #include "logger.h" /** * Wrap the read(2) system call, looping as long as errno is EINTR. * * @param fd The descriptor from which to read * @param buf The buffer to read into * @param count The maximum number of bytes to read * @param context The calling context (for logging) * @param bytes_read_ptr A pointer to hold the number of bytes read * * @return UDS_SUCCESS or an error code **/ int __must_check logging_read(int fd, void *buf, size_t count, const char *context, ssize_t *bytes_read_ptr); /** * Wrap the pread(2) system call, looping as long as errno is EINTR. * * @param fd The descriptor from which to read * @param buf The buffer to read into * @param count The maximum number of bytes to read * @param offset The offset into the file at which to read * @param context The calling context (for logging) * @param bytes_read_ptr A pointer to hold the number of bytes read * * @return UDS_SUCCESS or an error code **/ int __must_check logging_pread(int fd, void *buf, size_t count, off_t offset, const char *context, ssize_t *bytes_read_ptr); /** * Wrap the write(2) system call, looping as long as errno is EINTR. * * @param fd The descriptor from which to write * @param buf The buffer to write from * @param count The maximum number of bytes to write * @param context The calling context (for logging) * @param bytes_written_ptr A pointer to hold the number of bytes written; * on error, -1 is returned * * @return UDS_SUCCESS or an error code **/ int __must_check logging_write(int fd, const void *buf, size_t count, const char *context, ssize_t *bytes_written_ptr); /** * Wrap the pwrite(2) system call, looping as long as errno is EINTR. * * @param fd The descriptor from which to write * @param buf The buffer to write into * @param count The maximum number of bytes to write * @param offset The offset into the file at which to write * @param context The calling context (for logging) * @param bytes_written_ptr A pointer to hold the number of bytes written; * on error, -1 is returned * * @return UDS_SUCCESS or an error code **/ int __must_check logging_pwrite(int fd, const void *buf, size_t count, off_t offset, const char *context, ssize_t *bytes_written_ptr); /** * Wrap the close(2) system call. * * @param fd The descriptor to close * @param context The calling context (for logging) * * @return UDS_SUCCESS or an error code **/ int __must_check logging_close(int fd, const char *context); /** * Perform operations on a process. * This wraps the prctl(2) function, q.v. * * @param option The operation to perform. * @param arg2 Specific to option * @param arg3 Specific to option * @param arg4 Specific to option * @param arg5 Specific to option * * @return UDS_SUCCESS or an error code **/ int process_control(int option, unsigned long arg2, unsigned long arg3, unsigned long arg4, unsigned long arg5); /**********************************************************************/ static inline int log_system_call_errno(const char *function, const char *context) { return vdo_log_strerror(((errno == EINTR) ? 
VDO_LOG_DEBUG : VDO_LOG_ERR), errno, "%s failed in %s", function, context); } /**********************************************************************/ static inline int check_system_call(int result, const char *function, const char *context) { return (result == 0) ? UDS_SUCCESS : log_system_call_errno(function, context); } /**********************************************************************/ static inline int check_io_errors(ssize_t bytes, const char *function, const char *context, ssize_t *bytes_ptr) { if (bytes_ptr != NULL) *bytes_ptr = bytes; if (bytes < 0) return log_system_call_errno(function, context); return UDS_SUCCESS; } #endif /* SYSCALLS_H */ vdo-8.3.1.1/utils/uds/thread-utils.c000066400000000000000000000104211476467262700172010ustar00rootroot00000000000000// SPDX-License-Identifier: GPL-2.0-only /* * Copyright 2023 Red Hat */ #include "thread-utils.h" #include #include #include #include #include "logger.h" #include "memory-alloc.h" #include "permassert.h" #include "syscalls.h" enum { ONCE_NOT_DONE = 0, ONCE_IN_PROGRESS = 1, ONCE_COMPLETE = 2, }; /**********************************************************************/ unsigned int num_online_cpus(void) { cpu_set_t cpu_set; unsigned int n_cpus = 0; unsigned int i; if (sched_getaffinity(0, sizeof(cpu_set), &cpu_set) != 0) { vdo_log_warning_strerror(errno, "sched_getaffinity() failed, using 1 as number of cores."); return 1; } for (i = 0; i < CPU_SETSIZE; i++) n_cpus += CPU_ISSET(i, &cpu_set); return n_cpus; } /**********************************************************************/ void uds_get_thread_name(char *name) { process_control(PR_GET_NAME, (unsigned long) name, 0, 0, 0); } /**********************************************************************/ pid_t uds_get_thread_id(void) { return syscall(SYS_gettid); } /* Run a function once only, and record that fact in the atomic value. */ void vdo_perform_once(atomic_t *once, void (*function)(void)) { for (;;) { switch (atomic_cmpxchg(once, ONCE_NOT_DONE, ONCE_IN_PROGRESS)) { case ONCE_NOT_DONE: function(); atomic_set_release(once, ONCE_COMPLETE); return; case ONCE_IN_PROGRESS: sched_yield(); break; case ONCE_COMPLETE: return; default: return; } } } struct thread_start_info { void (*thread_function)(void *); void *thread_data; const char *name; }; /**********************************************************************/ static void *thread_starter(void *arg) { struct thread_start_info *info = arg; void (*thread_function)(void *) = info->thread_function; void *thread_data = info->thread_data; /* * The name is just advisory for humans examining it, so we don't * care much if this fails. 
*/ process_control(PR_SET_NAME, (unsigned long) info->name, 0, 0, 0); vdo_free(info); thread_function(thread_data); return NULL; } /**********************************************************************/ int vdo_create_thread(void (*thread_function)(void *), void *thread_data, const char *name, struct thread **new_thread) { int result; struct thread_start_info *info; struct thread *thread; result = vdo_allocate(1, struct thread_start_info, __func__, &info); if (result != VDO_SUCCESS) return result; info->thread_function = thread_function; info->thread_data = thread_data; info->name = name; result = vdo_allocate(1, struct thread, __func__, &thread); if (result != VDO_SUCCESS) { vdo_log_warning("Error allocating memory for %s", name); vdo_free(info); return result; } result = pthread_create(&thread->thread, NULL, thread_starter, info); if (result != 0) { result = -errno; vdo_log_error_strerror(result, "could not create %s thread", name); vdo_free(thread); vdo_free(info); return result; } *new_thread = thread; return VDO_SUCCESS; } /**********************************************************************/ void vdo_join_threads(struct thread *thread) { int result; pthread_t pthread; result = pthread_join(thread->thread, NULL); pthread = thread->thread; vdo_free(thread); VDO_ASSERT_LOG_ONLY((result == 0), "thread: %p", (void *) pthread); } /**********************************************************************/ void initialize_threads_barrier(struct threads_barrier *barrier, unsigned int thread_count) { int result; result = pthread_barrier_init(&barrier->barrier, NULL, thread_count); VDO_ASSERT_LOG_ONLY((result == 0), "pthread_barrier_init error"); } /**********************************************************************/ void destroy_threads_barrier(struct threads_barrier *barrier) { int result; result = pthread_barrier_destroy(&barrier->barrier); VDO_ASSERT_LOG_ONLY((result == 0), "pthread_barrier_destroy error"); } /**********************************************************************/ void enter_threads_barrier(struct threads_barrier *barrier) { int result; result = pthread_barrier_wait(&barrier->barrier); if (result == PTHREAD_BARRIER_SERIAL_THREAD) return; VDO_ASSERT_LOG_ONLY((result == 0), "pthread_barrier_wait error"); } vdo-8.3.1.1/utils/uds/thread-utils.h000066400000000000000000000041101476467262700172040ustar00rootroot00000000000000/* SPDX-License-Identifier: GPL-2.0-only */ /* * Copyright 2023 Red Hat */ #ifndef THREAD_UTILS_H #define THREAD_UTILS_H #include #include #include #include #include #include "errors.h" #include "time-utils.h" /* Thread and synchronization utilities */ struct mutex { pthread_mutex_t mutex; }; struct semaphore { sem_t semaphore; }; struct thread { pthread_t thread; }; struct threads_barrier { pthread_barrier_t barrier; }; #ifndef NDEBUG #define UDS_MUTEX_INITIALIZER { .mutex = PTHREAD_ERRORCHECK_MUTEX_INITIALIZER_NP } #else #define UDS_MUTEX_INITIALIZER { .mutex = PTHREAD_MUTEX_INITIALIZER } #endif extern const bool UDS_DO_ASSERTIONS; unsigned int num_online_cpus(void); pid_t __must_check uds_get_thread_id(void); void vdo_perform_once(atomic_t *once_state, void (*function) (void)); int __must_check vdo_create_thread(void (*thread_function)(void *), void *thread_data, const char *name, struct thread **new_thread); void vdo_join_threads(struct thread *thread); void uds_get_thread_name(char *name); static inline void cond_resched(void) { /* * On Linux sched_yield always succeeds so the result can be * safely ignored. 
*/ (void) sched_yield(); } int uds_initialize_mutex(struct mutex *mutex, bool assert_on_error); int __must_check uds_init_mutex(struct mutex *mutex); int uds_destroy_mutex(struct mutex *mutex); void uds_lock_mutex(struct mutex *mutex); void uds_unlock_mutex(struct mutex *mutex); void initialize_threads_barrier(struct threads_barrier *barrier, unsigned int thread_count); void destroy_threads_barrier(struct threads_barrier *barrier); void enter_threads_barrier(struct threads_barrier *barrier); int __must_check uds_initialize_semaphore(struct semaphore *semaphore, unsigned int value); int uds_destroy_semaphore(struct semaphore *semaphore); void uds_acquire_semaphore(struct semaphore *semaphore); bool __must_check uds_attempt_semaphore(struct semaphore *semaphore, ktime_t timeout); void uds_release_semaphore(struct semaphore *semaphore); #endif /* UDS_THREADS_H */ vdo-8.3.1.1/utils/uds/threadCondVar.c000066400000000000000000000040571476467262700173300ustar00rootroot00000000000000/* * Copyright Red Hat * * This program is free software; you can redistribute it and/or * modify it under the terms of the GNU General Public License * as published by the Free Software Foundation; either version 2 * of the License, or (at your option) any later version. * * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with this program; if not, write to the Free Software * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA * 02110-1301, USA. */ #include "indexer.h" #include "permassert.h" /**********************************************************************/ void uds_init_cond(struct cond_var *cond) { int result; result = pthread_cond_init(&cond->condition, NULL); VDO_ASSERT_LOG_ONLY((result == 0), "pthread_cond_init error"); } /**********************************************************************/ void uds_signal_cond(struct cond_var *cond) { int result; result = pthread_cond_signal(&cond->condition); VDO_ASSERT_LOG_ONLY((result == 0), "pthread_cond_signal error"); } /**********************************************************************/ void uds_broadcast_cond(struct cond_var *cond) { int result; result = pthread_cond_broadcast(&cond->condition); VDO_ASSERT_LOG_ONLY((result == 0), "pthread_cond_broadcast error"); } /**********************************************************************/ void uds_wait_cond(struct cond_var *cond, struct mutex *mutex) { int result; result = pthread_cond_wait(&cond->condition, &mutex->mutex); VDO_ASSERT_LOG_ONLY((result == 0), "pthread_cond_wait error"); } /**********************************************************************/ void uds_destroy_cond(struct cond_var *cond) { int result; result = pthread_cond_destroy(&cond->condition); VDO_ASSERT_LOG_ONLY((result == 0), "pthread_cond_destroy error"); } vdo-8.3.1.1/utils/uds/threadMutex.c000066400000000000000000000063671476467262700171040ustar00rootroot00000000000000// SPDX-License-Identifier: GPL-2.0-only /* * Copyright 2023 Red Hat */ #include #include #include "permassert.h" #include "string-utils.h" #include "thread-utils.h" static enum mutex_kind { fast_adaptive, error_checking } hidden_mutex_kind = error_checking; const bool UDS_DO_ASSERTIONS = true; /**********************************************************************/ static void 
initialize_mutex_kind(void) { static const char UDS_MUTEX_KIND_ENV[] = "UDS_MUTEX_KIND"; const char *mutex_kind_string = getenv(UDS_MUTEX_KIND_ENV); #ifdef NDEBUG /* * Enabling error checking on mutexes enables a great performance loss, * so we only enable it in certain circumstances. */ hidden_mutex_kind = fast_adaptive; #endif if (mutex_kind_string != NULL) { if (strcmp(mutex_kind_string, "error-checking") == 0) hidden_mutex_kind = error_checking; else if (strcmp(mutex_kind_string, "fast-adaptive") == 0) hidden_mutex_kind = fast_adaptive; else VDO_ASSERT_LOG_ONLY(false, "environment variable %s had unexpected value '%s'", UDS_MUTEX_KIND_ENV, mutex_kind_string); } } /**********************************************************************/ static enum mutex_kind get_mutex_kind(void) { static atomic_t once_state = ATOMIC_INIT(0); vdo_perform_once(&once_state, initialize_mutex_kind); return hidden_mutex_kind; } /* * This function should only be called directly in places where making * assertions is not safe. */ int uds_initialize_mutex(struct mutex *mutex, bool assert_on_error) { pthread_mutexattr_t attr; int result; int result2; result = pthread_mutexattr_init(&attr); if (result != 0) { VDO_ASSERT_LOG_ONLY((result == 0), "pthread_mutexattr_init error"); return result; } if (get_mutex_kind() == error_checking) pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK); result = pthread_mutex_init(&mutex->mutex, &attr); if ((result != 0) && assert_on_error) VDO_ASSERT_LOG_ONLY((result == 0), "pthread_mutex_init error"); result2 = pthread_mutexattr_destroy(&attr); if (result2 != 0) { VDO_ASSERT_LOG_ONLY((result2 == 0), "pthread_mutexattr_destroy error"); if (result == UDS_SUCCESS) result = result2; } return result; } /**********************************************************************/ int uds_init_mutex(struct mutex *mutex) { return uds_initialize_mutex(mutex, UDS_DO_ASSERTIONS); } /**********************************************************************/ int uds_destroy_mutex(struct mutex *mutex) { int result; result = pthread_mutex_destroy(&mutex->mutex); VDO_ASSERT_LOG_ONLY((result == 0), "pthread_mutex_destroy error"); return result; } /**********************************************************************/ void uds_lock_mutex(struct mutex *mutex) { int result __attribute__((unused)); result = pthread_mutex_lock(&mutex->mutex); #ifndef NDEBUG VDO_ASSERT_LOG_ONLY((result == 0), "pthread_mutex_lock error %d", result); #endif } /**********************************************************************/ void uds_unlock_mutex(struct mutex *mutex) { int result __attribute__((unused)); result = pthread_mutex_unlock(&mutex->mutex); #ifndef NDEBUG VDO_ASSERT_LOG_ONLY((result == 0), "pthread_mutex_unlock error %d", result); #endif } vdo-8.3.1.1/utils/uds/threadSemaphore.c000066400000000000000000000040271476467262700177140ustar00rootroot00000000000000// SPDX-License-Identifier: GPL-2.0-only /* * Copyright 2023 Red Hat */ #include #include "logger.h" #include "permassert.h" #include "thread-utils.h" #include "time-utils.h" /**********************************************************************/ int uds_initialize_semaphore(struct semaphore *semaphore, unsigned int value) { int result; result = sem_init(&semaphore->semaphore, false, value); VDO_ASSERT_LOG_ONLY((result == 0), "sem_init error"); return result; } /**********************************************************************/ int uds_destroy_semaphore(struct semaphore *semaphore) { int result; result = sem_destroy(&semaphore->semaphore); 
VDO_ASSERT_LOG_ONLY((result == 0), "sem_destroy error"); return result; } /**********************************************************************/ void uds_acquire_semaphore(struct semaphore *semaphore) { int result; do { result = sem_wait(&semaphore->semaphore); } while ((result == -1) && (errno == EINTR)); #ifndef NDEBUG VDO_ASSERT_LOG_ONLY((result == 0), "sem_wait error %d", errno); #endif } /**********************************************************************/ bool uds_attempt_semaphore(struct semaphore *semaphore, ktime_t timeout) { if (timeout > 0) { struct timespec ts = future_time(timeout); do { if (sem_timedwait(&semaphore->semaphore, &ts) == 0) return true; } while (errno == EINTR); #ifndef NDEBUG VDO_ASSERT_LOG_ONLY((errno == ETIMEDOUT), "sem_timedwait error %d", errno); #endif } else { do { if (sem_trywait(&semaphore->semaphore) == 0) return true; } while (errno == EINTR); #ifndef NDEBUG VDO_ASSERT_LOG_ONLY((errno == EAGAIN), "sem_trywait error %d", errno); #endif } return false; } /**********************************************************************/ void uds_release_semaphore(struct semaphore *semaphore) { int result __attribute__((unused)); result = sem_post(&semaphore->semaphore); #ifndef NDEBUG VDO_ASSERT_LOG_ONLY((result == 0), "sem_post error %d", errno); #endif } vdo-8.3.1.1/utils/uds/time-utils.c000066400000000000000000000011171476467262700166720ustar00rootroot00000000000000// SPDX-License-Identifier: GPL-2.0-only /* * Copyright 2023 Red Hat */ #include "time-utils.h" ktime_t current_time_ns(clockid_t clock) { struct timespec ts; if (clock_gettime(clock, &ts) != 0) ts = (struct timespec) { 0, 0 }; return ts.tv_sec * NSEC_PER_SEC + ts.tv_nsec; } struct timespec future_time(ktime_t offset) { ktime_t future = current_time_ns(CLOCK_REALTIME) + offset; return (struct timespec) { .tv_sec = future / NSEC_PER_SEC, .tv_nsec = future % NSEC_PER_SEC, }; } int64_t current_time_us(void) { return current_time_ns(CLOCK_REALTIME) / NSEC_PER_USEC; } vdo-8.3.1.1/utils/uds/time-utils.h000066400000000000000000000022301476467262700166740ustar00rootroot00000000000000/* SPDX-License-Identifier: GPL-2.0-only */ /* * Copyright 2023 Red Hat */ #ifndef UDS_TIME_UTILS_H #define UDS_TIME_UTILS_H #include #include #include #include /* Some constants that are defined in kernel headers. */ #define NSEC_PER_SEC 1000000000L #define NSEC_PER_MSEC 1000000L #define NSEC_PER_USEC 1000L typedef s64 ktime_t; static inline s64 ktime_to_seconds(ktime_t reltime) { return reltime / NSEC_PER_SEC; } ktime_t __must_check current_time_ns(clockid_t clock); ktime_t __must_check current_time_us(void); /* Return a timespec for the current time plus an offset. 
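 * For example (illustrative): future_time(ms_to_ktime(500)) yields an absolute timespec
 * roughly 500 milliseconds from now, in the form expected by sem_timedwait() in
 * uds_attempt_semaphore().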
*/ struct timespec future_time(ktime_t offset); static inline ktime_t ktime_sub(ktime_t a, ktime_t b) { return a - b; } static inline s64 ktime_to_ms(ktime_t abstime) { return abstime / NSEC_PER_MSEC; } static inline ktime_t ms_to_ktime(u64 milliseconds) { return (ktime_t) milliseconds * NSEC_PER_MSEC; } static inline s64 ktime_to_us(ktime_t reltime) { return reltime / NSEC_PER_USEC; } static inline ktime_t us_to_ktime(u64 microseconds) { return (ktime_t) microseconds * NSEC_PER_USEC; } #endif /* UDS_TIME_UTILS_H */ vdo-8.3.1.1/utils/uds/volume-index.c000066400000000000000000001254671476467262700172310ustar00rootroot00000000000000// SPDX-License-Identifier: GPL-2.0-only /* * Copyright 2023 Red Hat */ #include "volume-index.h" #include #include #include #include #include #include "errors.h" #include "logger.h" #include "memory-alloc.h" #include "numeric.h" #include "permassert.h" #include "thread-utils.h" #include "config.h" #include "geometry.h" #include "hash-utils.h" #include "indexer.h" /* * The volume index is a combination of two separate subindexes, one containing sparse hook entries * (retained for all chapters), and one containing the remaining entries (retained only for the * dense chapters). If there are no sparse chapters, only the non-hook sub index is used, and it * will contain all records for all chapters. * * The volume index is also divided into zones, with one thread operating on each zone. Each * incoming request is dispatched to the appropriate thread, and then to the appropriate subindex. * Each delta list is handled by a single zone. To ensure that the distribution of delta lists to * zones doesn't underflow (leaving some zone with no delta lists), the minimum number of delta * lists must be the square of the maximum zone count for both subindexes. * * Each subindex zone is a delta index where the payload is a chapter number. The volume index can * compute the delta list number, address, and zone number from the record name in order to * dispatch record handling to the correct structures. * * Most operations that use all the zones take place either before request processing is allowed, * or after all requests have been flushed in order to shut down. The only multi-threaded operation * supported during normal operation is the uds_lookup_volume_index_name() method, used to determine * whether a new chapter should be loaded into the sparse index cache. This operation only uses the * sparse hook subindex, and the zone mutexes are used to make this operation safe. * * There are three ways of expressing chapter numbers in the volume index: virtual, index, and * rolling. The interface to the volume index uses virtual chapter numbers, which are 64 bits long. * Internally the subindex stores only the minimal number of bits necessary by masking away the * high-order bits. When the index needs to deal with ordering of index chapter numbers, as when * flushing entries from older chapters, it rolls the index chapter number around so that the * smallest one in use is mapped to 0. See convert_index_to_virtual() or flush_invalid_entries() * for an example of this technique. * * For efficiency, when older chapter numbers become invalid, the index does not immediately remove * the invalidated entries. Instead it lazily removes them from a given delta list the next time it * walks that list during normal operation. Because of this, the index size must be increased * somewhat to accommodate all the invalid entries that have not yet been removed. 
For the standard * index sizes, this requires about 4 chapters of old entries per 1024 chapters of valid entries in * the index. */ struct sub_index_parameters { /* The number of bits in address mask */ u8 address_bits; /* The number of bits in chapter number */ u8 chapter_bits; /* The mean delta */ u32 mean_delta; /* The number of delta lists */ u64 list_count; /* The number of chapters used */ u32 chapter_count; /* The number of bits per chapter */ size_t chapter_size_in_bits; /* The number of bytes of delta list memory */ size_t memory_size; /* The number of bytes the index should keep free at all times */ size_t target_free_bytes; }; struct split_config { /* The hook subindex configuration */ struct uds_configuration hook_config; struct index_geometry hook_geometry; /* The non-hook subindex configuration */ struct uds_configuration non_hook_config; struct index_geometry non_hook_geometry; }; struct chapter_range { u32 chapter_start; u32 chapter_count; }; #define MAGIC_SIZE 8 static const char MAGIC_START_5[] = "MI5-0005"; struct sub_index_data { char magic[MAGIC_SIZE]; /* MAGIC_START_5 */ u64 volume_nonce; u64 virtual_chapter_low; u64 virtual_chapter_high; u32 first_list; u32 list_count; }; static const char MAGIC_START_6[] = "MI6-0001"; struct volume_index_data { char magic[MAGIC_SIZE]; /* MAGIC_START_6 */ u32 sparse_sample_rate; }; static inline u32 extract_address(const struct volume_sub_index *sub_index, const struct uds_record_name *name) { return uds_extract_volume_index_bytes(name) & sub_index->address_mask; } static inline u32 extract_dlist_num(const struct volume_sub_index *sub_index, const struct uds_record_name *name) { u64 bits = uds_extract_volume_index_bytes(name); return (bits >> sub_index->address_bits) % sub_index->list_count; } static inline const struct volume_sub_index_zone * get_zone_for_record(const struct volume_index_record *record) { return &record->sub_index->zones[record->zone_number]; } static inline u64 convert_index_to_virtual(const struct volume_index_record *record, u32 index_chapter) { const struct volume_sub_index_zone *volume_index_zone = get_zone_for_record(record); u32 rolling_chapter = ((index_chapter - volume_index_zone->virtual_chapter_low) & record->sub_index->chapter_mask); return volume_index_zone->virtual_chapter_low + rolling_chapter; } static inline u32 convert_virtual_to_index(const struct volume_sub_index *sub_index, u64 virtual_chapter) { return virtual_chapter & sub_index->chapter_mask; } static inline bool is_virtual_chapter_indexed(const struct volume_index_record *record, u64 virtual_chapter) { const struct volume_sub_index_zone *volume_index_zone = get_zone_for_record(record); return ((virtual_chapter >= volume_index_zone->virtual_chapter_low) && (virtual_chapter <= volume_index_zone->virtual_chapter_high)); } static inline bool has_sparse(const struct volume_index *volume_index) { return volume_index->sparse_sample_rate > 0; } bool uds_is_volume_index_sample(const struct volume_index *volume_index, const struct uds_record_name *name) { if (!has_sparse(volume_index)) return false; return (uds_extract_sampling_bytes(name) % volume_index->sparse_sample_rate) == 0; } static inline const struct volume_sub_index * get_volume_sub_index(const struct volume_index *volume_index, const struct uds_record_name *name) { return (uds_is_volume_index_sample(volume_index, name) ? 
&volume_index->vi_hook : &volume_index->vi_non_hook); } static unsigned int get_volume_sub_index_zone(const struct volume_sub_index *sub_index, const struct uds_record_name *name) { return extract_dlist_num(sub_index, name) / sub_index->delta_index.lists_per_zone; } unsigned int uds_get_volume_index_zone(const struct volume_index *volume_index, const struct uds_record_name *name) { return get_volume_sub_index_zone(get_volume_sub_index(volume_index, name), name); } #define DELTA_LIST_SIZE 256 static int compute_volume_sub_index_parameters(const struct uds_configuration *config, struct sub_index_parameters *params) { u64 entries_in_volume_index, address_span; u32 chapters_in_volume_index, invalid_chapters; u32 rounded_chapters; u64 delta_list_records; u32 address_count; u64 index_size_in_bits; size_t expected_index_size; u64 min_delta_lists = MAX_ZONES * MAX_ZONES; struct index_geometry *geometry = config->geometry; u64 records_per_chapter = geometry->records_per_chapter; params->chapter_count = geometry->chapters_per_volume; /* * Make sure that the number of delta list records in the volume index does not change when * the volume is reduced by one chapter. This preserves the mapping from name to volume * index delta list. */ rounded_chapters = params->chapter_count; if (uds_is_reduced_index_geometry(geometry)) rounded_chapters += 1; delta_list_records = records_per_chapter * rounded_chapters; address_count = config->volume_index_mean_delta * DELTA_LIST_SIZE; params->list_count = max(delta_list_records / DELTA_LIST_SIZE, min_delta_lists); params->address_bits = bits_per(address_count - 1); params->chapter_bits = bits_per(rounded_chapters - 1); if ((u32) params->list_count != params->list_count) { return vdo_log_warning_strerror(UDS_INVALID_ARGUMENT, "cannot initialize volume index with %llu delta lists", (unsigned long long) params->list_count); } if (params->address_bits > 31) { return vdo_log_warning_strerror(UDS_INVALID_ARGUMENT, "cannot initialize volume index with %u address bits", params->address_bits); } /* * The probability that a given delta list is not touched during the writing of an entire * chapter is: * * double p_not_touched = pow((double) (params->list_count - 1) / params->list_count, * records_per_chapter); * * For the standard index sizes, about 78% of the delta lists are not touched, and * therefore contain old index entries that have not been eliminated by the lazy LRU * processing. Then the number of old index entries that accumulate over the entire index, * in terms of full chapters worth of entries, is: * * double invalid_chapters = p_not_touched / (1.0 - p_not_touched); * * For the standard index sizes, the index needs about 3.5 chapters of space for the old * entries in a 1024 chapter index, so round this up to use 4 chapters per 1024 chapters in * the index. */ invalid_chapters = max(rounded_chapters / 256, 2U); chapters_in_volume_index = rounded_chapters + invalid_chapters; entries_in_volume_index = records_per_chapter * chapters_in_volume_index; address_span = params->list_count << params->address_bits; params->mean_delta = address_span / entries_in_volume_index; /* * Compute the expected size of a full index, then set the total memory to be 6% larger * than that expected size. This number should be large enough that there are not many * rebalances when the index is full. 
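 *
 * As a worked example (illustrative numbers): if the expected size of a full index is
 * 100 MB, memory_size is set to about 106 MB and target_free_bytes to about 5 MB, so the
 * index tries to keep roughly 5% of the expected size free.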
*/ params->chapter_size_in_bits = uds_compute_delta_index_size(records_per_chapter, params->mean_delta, params->chapter_bits); index_size_in_bits = params->chapter_size_in_bits * chapters_in_volume_index; expected_index_size = index_size_in_bits / BITS_PER_BYTE; params->memory_size = expected_index_size * 106 / 100; params->target_free_bytes = expected_index_size / 20; return UDS_SUCCESS; } static void uninitialize_volume_sub_index(struct volume_sub_index *sub_index) { vdo_free(vdo_forget(sub_index->flush_chapters)); vdo_free(vdo_forget(sub_index->zones)); uds_uninitialize_delta_index(&sub_index->delta_index); } void uds_free_volume_index(struct volume_index *volume_index) { if (volume_index == NULL) return; if (volume_index->zones != NULL) { unsigned int zone; for (zone = 0; zone < volume_index->zone_count; zone++) mutex_destroy(&volume_index->zones[zone].hook_mutex); vdo_free(vdo_forget(volume_index->zones)); } uninitialize_volume_sub_index(&volume_index->vi_non_hook); uninitialize_volume_sub_index(&volume_index->vi_hook); vdo_free(volume_index); } static int compute_volume_sub_index_save_bytes(const struct uds_configuration *config, size_t *bytes) { struct sub_index_parameters params = { .address_bits = 0 }; int result; result = compute_volume_sub_index_parameters(config, ¶ms); if (result != UDS_SUCCESS) return result; *bytes = (sizeof(struct sub_index_data) + params.list_count * sizeof(u64) + uds_compute_delta_index_save_bytes(params.list_count, params.memory_size)); return UDS_SUCCESS; } /* This function is only useful if the configuration includes sparse chapters. */ static void split_configuration(const struct uds_configuration *config, struct split_config *split) { u64 sample_rate, sample_records; u64 dense_chapters, sparse_chapters; /* Start with copies of the base configuration. */ split->hook_config = *config; split->hook_geometry = *config->geometry; split->hook_config.geometry = &split->hook_geometry; split->non_hook_config = *config; split->non_hook_geometry = *config->geometry; split->non_hook_config.geometry = &split->non_hook_geometry; sample_rate = config->sparse_sample_rate; sparse_chapters = config->geometry->sparse_chapters_per_volume; dense_chapters = config->geometry->chapters_per_volume - sparse_chapters; sample_records = config->geometry->records_per_chapter / sample_rate; /* Adjust the number of records indexed for each chapter. */ split->hook_geometry.records_per_chapter = sample_records; split->non_hook_geometry.records_per_chapter -= sample_records; /* Adjust the number of chapters indexed. 
*/ split->hook_geometry.sparse_chapters_per_volume = 0; split->non_hook_geometry.sparse_chapters_per_volume = 0; split->non_hook_geometry.chapters_per_volume = dense_chapters; } static int compute_volume_index_save_bytes(const struct uds_configuration *config, size_t *bytes) { size_t hook_bytes, non_hook_bytes; struct split_config split; int result; if (!uds_is_sparse_index_geometry(config->geometry)) return compute_volume_sub_index_save_bytes(config, bytes); split_configuration(config, &split); result = compute_volume_sub_index_save_bytes(&split.hook_config, &hook_bytes); if (result != UDS_SUCCESS) return result; result = compute_volume_sub_index_save_bytes(&split.non_hook_config, &non_hook_bytes); if (result != UDS_SUCCESS) return result; *bytes = sizeof(struct volume_index_data) + hook_bytes + non_hook_bytes; return UDS_SUCCESS; } int uds_compute_volume_index_save_blocks(const struct uds_configuration *config, size_t block_size, u64 *block_count) { size_t bytes; int result; result = compute_volume_index_save_bytes(config, &bytes); if (result != UDS_SUCCESS) return result; bytes += sizeof(struct delta_list_save_info); *block_count = DIV_ROUND_UP(bytes, block_size) + MAX_ZONES; return UDS_SUCCESS; } /* Flush invalid entries while walking the delta list. */ static inline int flush_invalid_entries(struct volume_index_record *record, struct chapter_range *flush_range, u32 *next_chapter_to_invalidate) { int result; result = uds_next_delta_index_entry(&record->delta_entry); if (result != UDS_SUCCESS) return result; while (!record->delta_entry.at_end) { u32 index_chapter = uds_get_delta_entry_value(&record->delta_entry); u32 relative_chapter = ((index_chapter - flush_range->chapter_start) & record->sub_index->chapter_mask); if (likely(relative_chapter >= flush_range->chapter_count)) { if (relative_chapter < *next_chapter_to_invalidate) *next_chapter_to_invalidate = relative_chapter; break; } result = uds_remove_delta_index_entry(&record->delta_entry); if (result != UDS_SUCCESS) return result; } return UDS_SUCCESS; } /* Find the matching record, or the list offset where the record would go. */ static int get_volume_index_entry(struct volume_index_record *record, u32 list_number, u32 key, struct chapter_range *flush_range) { struct volume_index_record other_record; const struct volume_sub_index *sub_index = record->sub_index; u32 next_chapter_to_invalidate = sub_index->chapter_mask; int result; result = uds_start_delta_index_search(&sub_index->delta_index, list_number, 0, &record->delta_entry); if (result != UDS_SUCCESS) return result; do { result = flush_invalid_entries(record, flush_range, &next_chapter_to_invalidate); if (result != UDS_SUCCESS) return result; } while (!record->delta_entry.at_end && (key > record->delta_entry.key)); result = uds_remember_delta_index_offset(&record->delta_entry); if (result != UDS_SUCCESS) return result; /* Check any collision records for a more precise match. 
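 * Collision entries share the same address key as the first matching entry but also record
 * the full record name, so each one must be compared byte-for-byte against the name being
 * looked up.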
*/ other_record = *record; if (!other_record.delta_entry.at_end && (key == other_record.delta_entry.key)) { for (;;) { u8 collision_name[UDS_RECORD_NAME_SIZE]; result = flush_invalid_entries(&other_record, flush_range, &next_chapter_to_invalidate); if (result != UDS_SUCCESS) return result; if (other_record.delta_entry.at_end || !other_record.delta_entry.is_collision) break; result = uds_get_delta_entry_collision(&other_record.delta_entry, collision_name); if (result != UDS_SUCCESS) return result; if (memcmp(collision_name, record->name, UDS_RECORD_NAME_SIZE) == 0) { *record = other_record; break; } } } while (!other_record.delta_entry.at_end) { result = flush_invalid_entries(&other_record, flush_range, &next_chapter_to_invalidate); if (result != UDS_SUCCESS) return result; } next_chapter_to_invalidate += flush_range->chapter_start; next_chapter_to_invalidate &= sub_index->chapter_mask; flush_range->chapter_start = next_chapter_to_invalidate; flush_range->chapter_count = 0; return UDS_SUCCESS; } static int get_volume_sub_index_record(struct volume_sub_index *sub_index, const struct uds_record_name *name, struct volume_index_record *record) { int result; const struct volume_sub_index_zone *volume_index_zone; u32 address = extract_address(sub_index, name); u32 delta_list_number = extract_dlist_num(sub_index, name); u64 flush_chapter = sub_index->flush_chapters[delta_list_number]; record->sub_index = sub_index; record->mutex = NULL; record->name = name; record->zone_number = delta_list_number / sub_index->delta_index.lists_per_zone; volume_index_zone = get_zone_for_record(record); if (flush_chapter < volume_index_zone->virtual_chapter_low) { struct chapter_range range; u64 flush_count = volume_index_zone->virtual_chapter_low - flush_chapter; range.chapter_start = convert_virtual_to_index(sub_index, flush_chapter); range.chapter_count = (flush_count > sub_index->chapter_mask ? sub_index->chapter_mask + 1 : flush_count); result = get_volume_index_entry(record, delta_list_number, address, &range); flush_chapter = convert_index_to_virtual(record, range.chapter_start); if (flush_chapter > volume_index_zone->virtual_chapter_high) flush_chapter = volume_index_zone->virtual_chapter_high; sub_index->flush_chapters[delta_list_number] = flush_chapter; } else { result = uds_get_delta_index_entry(&sub_index->delta_index, delta_list_number, address, name->name, &record->delta_entry); } if (result != UDS_SUCCESS) return result; record->is_found = (!record->delta_entry.at_end && (record->delta_entry.key == address)); if (record->is_found) { u32 index_chapter = uds_get_delta_entry_value(&record->delta_entry); record->virtual_chapter = convert_index_to_virtual(record, index_chapter); } record->is_collision = record->delta_entry.is_collision; return UDS_SUCCESS; } int uds_get_volume_index_record(struct volume_index *volume_index, const struct uds_record_name *name, struct volume_index_record *record) { int result; if (uds_is_volume_index_sample(volume_index, name)) { /* * Other threads cannot be allowed to call uds_lookup_volume_index_name() while * this thread is finding the volume index record. Due to the lazy LRU flushing of * the volume index, uds_get_volume_index_record() is not a read-only operation. 
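 *
 * It is not read-only because get_volume_sub_index_record() may remove
 * expired entries from the delta list it walks (via flush_invalid_entries())
 * and update the corresponding flush_chapters entry, so an unsynchronized
 * lookup could observe the list mid-update.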
*/ unsigned int zone = get_volume_sub_index_zone(&volume_index->vi_hook, name); struct mutex *mutex = &volume_index->zones[zone].hook_mutex; mutex_lock(mutex); result = get_volume_sub_index_record(&volume_index->vi_hook, name, record); mutex_unlock(mutex); /* Remember the mutex so that other operations on the index record can use it. */ record->mutex = mutex; } else { result = get_volume_sub_index_record(&volume_index->vi_non_hook, name, record); } return result; } int uds_put_volume_index_record(struct volume_index_record *record, u64 virtual_chapter) { int result; u32 address; const struct volume_sub_index *sub_index = record->sub_index; if (!is_virtual_chapter_indexed(record, virtual_chapter)) { u64 low = get_zone_for_record(record)->virtual_chapter_low; u64 high = get_zone_for_record(record)->virtual_chapter_high; return vdo_log_warning_strerror(UDS_INVALID_ARGUMENT, "cannot put record into chapter number %llu that is out of the valid range %llu to %llu", (unsigned long long) virtual_chapter, (unsigned long long) low, (unsigned long long) high); } address = extract_address(sub_index, record->name); if (unlikely(record->mutex != NULL)) mutex_lock(record->mutex); result = uds_put_delta_index_entry(&record->delta_entry, address, convert_virtual_to_index(sub_index, virtual_chapter), record->is_found ? record->name->name : NULL); if (unlikely(record->mutex != NULL)) mutex_unlock(record->mutex); switch (result) { case UDS_SUCCESS: record->virtual_chapter = virtual_chapter; record->is_collision = record->delta_entry.is_collision; record->is_found = true; break; case UDS_OVERFLOW: vdo_log_ratelimit(vdo_log_warning_strerror, UDS_OVERFLOW, "Volume index entry dropped due to overflow condition"); uds_log_delta_index_entry(&record->delta_entry); break; default: break; } return result; } int uds_remove_volume_index_record(struct volume_index_record *record) { int result; if (!record->is_found) return vdo_log_warning_strerror(UDS_BAD_STATE, "illegal operation on new record"); /* Mark the record so that it cannot be used again */ record->is_found = false; if (unlikely(record->mutex != NULL)) mutex_lock(record->mutex); result = uds_remove_delta_index_entry(&record->delta_entry); if (unlikely(record->mutex != NULL)) mutex_unlock(record->mutex); return result; } static void set_volume_sub_index_zone_open_chapter(struct volume_sub_index *sub_index, unsigned int zone_number, u64 virtual_chapter) { u64 used_bits = 0; struct volume_sub_index_zone *zone = &sub_index->zones[zone_number]; struct delta_zone *delta_zone; u32 i; zone->virtual_chapter_low = (virtual_chapter >= sub_index->chapter_count ? virtual_chapter - sub_index->chapter_count + 1 : 0); zone->virtual_chapter_high = virtual_chapter; /* Check to see if the new zone data is too large. */ delta_zone = &sub_index->delta_index.delta_zones[zone_number]; for (i = 1; i <= delta_zone->list_count; i++) used_bits += delta_zone->delta_lists[i].size; if (used_bits > sub_index->max_zone_bits) { /* Expire enough chapters to free the desired space. 
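 *
 * For illustration only: if used_bits exceeds max_zone_bits by two and a half
 * chapters' worth of bits, the computation just below yields
 * expire_count = 1 + 2 = 3 chapters to expire.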
*/ u64 expire_count = 1 + (used_bits - sub_index->max_zone_bits) / sub_index->chapter_zone_bits; if (expire_count == 1) { vdo_log_ratelimit(vdo_log_info, "zone %u: At chapter %llu, expiring chapter %llu early", zone_number, (unsigned long long) virtual_chapter, (unsigned long long) zone->virtual_chapter_low); zone->early_flushes++; zone->virtual_chapter_low++; } else { u64 first_expired = zone->virtual_chapter_low; if (first_expired + expire_count < zone->virtual_chapter_high) { zone->early_flushes += expire_count; zone->virtual_chapter_low += expire_count; } else { zone->early_flushes += zone->virtual_chapter_high - zone->virtual_chapter_low; zone->virtual_chapter_low = zone->virtual_chapter_high; } vdo_log_ratelimit(vdo_log_info, "zone %u: At chapter %llu, expiring chapters %llu to %llu early", zone_number, (unsigned long long) virtual_chapter, (unsigned long long) first_expired, (unsigned long long) zone->virtual_chapter_low - 1); } } } void uds_set_volume_index_zone_open_chapter(struct volume_index *volume_index, unsigned int zone_number, u64 virtual_chapter) { struct mutex *mutex = &volume_index->zones[zone_number].hook_mutex; set_volume_sub_index_zone_open_chapter(&volume_index->vi_non_hook, zone_number, virtual_chapter); /* * Other threads cannot be allowed to call uds_lookup_volume_index_name() while the open * chapter number is changing. */ if (has_sparse(volume_index)) { mutex_lock(mutex); set_volume_sub_index_zone_open_chapter(&volume_index->vi_hook, zone_number, virtual_chapter); mutex_unlock(mutex); } } /* * Set the newest open chapter number for the index, while also advancing the oldest valid chapter * number. */ void uds_set_volume_index_open_chapter(struct volume_index *volume_index, u64 virtual_chapter) { unsigned int zone; for (zone = 0; zone < volume_index->zone_count; zone++) uds_set_volume_index_zone_open_chapter(volume_index, zone, virtual_chapter); } int uds_set_volume_index_record_chapter(struct volume_index_record *record, u64 virtual_chapter) { const struct volume_sub_index *sub_index = record->sub_index; int result; if (!record->is_found) return vdo_log_warning_strerror(UDS_BAD_STATE, "illegal operation on new record"); if (!is_virtual_chapter_indexed(record, virtual_chapter)) { u64 low = get_zone_for_record(record)->virtual_chapter_low; u64 high = get_zone_for_record(record)->virtual_chapter_high; return vdo_log_warning_strerror(UDS_INVALID_ARGUMENT, "cannot set chapter number %llu that is out of the valid range %llu to %llu", (unsigned long long) virtual_chapter, (unsigned long long) low, (unsigned long long) high); } if (unlikely(record->mutex != NULL)) mutex_lock(record->mutex); result = uds_set_delta_entry_value(&record->delta_entry, convert_virtual_to_index(sub_index, virtual_chapter)); if (unlikely(record->mutex != NULL)) mutex_unlock(record->mutex); if (result != UDS_SUCCESS) return result; record->virtual_chapter = virtual_chapter; return UDS_SUCCESS; } static u64 lookup_volume_sub_index_name(const struct volume_sub_index *sub_index, const struct uds_record_name *name) { int result; u32 address = extract_address(sub_index, name); u32 delta_list_number = extract_dlist_num(sub_index, name); unsigned int zone_number = get_volume_sub_index_zone(sub_index, name); const struct volume_sub_index_zone *zone = &sub_index->zones[zone_number]; u64 virtual_chapter; u32 index_chapter; u32 rolling_chapter; struct delta_index_entry delta_entry; result = uds_get_delta_index_entry(&sub_index->delta_index, delta_list_number, address, name->name, &delta_entry); if (result 
!= UDS_SUCCESS) return NO_CHAPTER; if (delta_entry.at_end || (delta_entry.key != address)) return NO_CHAPTER; index_chapter = uds_get_delta_entry_value(&delta_entry); rolling_chapter = (index_chapter - zone->virtual_chapter_low) & sub_index->chapter_mask; virtual_chapter = zone->virtual_chapter_low + rolling_chapter; if (virtual_chapter > zone->virtual_chapter_high) return NO_CHAPTER; return virtual_chapter; } /* Do a read-only lookup of the record name for sparse cache management. */ u64 uds_lookup_volume_index_name(const struct volume_index *volume_index, const struct uds_record_name *name) { unsigned int zone_number = uds_get_volume_index_zone(volume_index, name); struct mutex *mutex = &volume_index->zones[zone_number].hook_mutex; u64 virtual_chapter; if (!uds_is_volume_index_sample(volume_index, name)) return NO_CHAPTER; mutex_lock(mutex); virtual_chapter = lookup_volume_sub_index_name(&volume_index->vi_hook, name); mutex_unlock(mutex); return virtual_chapter; } static void abort_restoring_volume_sub_index(struct volume_sub_index *sub_index) { uds_reset_delta_index(&sub_index->delta_index); } static void abort_restoring_volume_index(struct volume_index *volume_index) { abort_restoring_volume_sub_index(&volume_index->vi_non_hook); if (has_sparse(volume_index)) abort_restoring_volume_sub_index(&volume_index->vi_hook); } static int start_restoring_volume_sub_index(struct volume_sub_index *sub_index, struct buffered_reader **readers, unsigned int reader_count) { unsigned int z; int result; u64 virtual_chapter_low = 0, virtual_chapter_high = 0; unsigned int i; for (i = 0; i < reader_count; i++) { struct sub_index_data header; u8 buffer[sizeof(struct sub_index_data)]; size_t offset = 0; u32 j; result = uds_read_from_buffered_reader(readers[i], buffer, sizeof(buffer)); if (result != UDS_SUCCESS) { return vdo_log_warning_strerror(result, "failed to read volume index header"); } memcpy(&header.magic, buffer, MAGIC_SIZE); offset += MAGIC_SIZE; decode_u64_le(buffer, &offset, &header.volume_nonce); decode_u64_le(buffer, &offset, &header.virtual_chapter_low); decode_u64_le(buffer, &offset, &header.virtual_chapter_high); decode_u32_le(buffer, &offset, &header.first_list); decode_u32_le(buffer, &offset, &header.list_count); result = VDO_ASSERT(offset == sizeof(buffer), "%zu bytes decoded of %zu expected", offset, sizeof(buffer)); if (result != VDO_SUCCESS) result = UDS_CORRUPT_DATA; if (memcmp(header.magic, MAGIC_START_5, MAGIC_SIZE) != 0) { return vdo_log_warning_strerror(UDS_CORRUPT_DATA, "volume index file had bad magic number"); } if (sub_index->volume_nonce == 0) { sub_index->volume_nonce = header.volume_nonce; } else if (header.volume_nonce != sub_index->volume_nonce) { return vdo_log_warning_strerror(UDS_CORRUPT_DATA, "volume index volume nonce incorrect"); } if (i == 0) { virtual_chapter_low = header.virtual_chapter_low; virtual_chapter_high = header.virtual_chapter_high; } else if (virtual_chapter_high != header.virtual_chapter_high) { u64 low = header.virtual_chapter_low; u64 high = header.virtual_chapter_high; return vdo_log_warning_strerror(UDS_CORRUPT_DATA, "Inconsistent volume index zone files: Chapter range is [%llu,%llu], chapter range %d is [%llu,%llu]", (unsigned long long) virtual_chapter_low, (unsigned long long) virtual_chapter_high, i, (unsigned long long) low, (unsigned long long) high); } else if (virtual_chapter_low < header.virtual_chapter_low) { virtual_chapter_low = header.virtual_chapter_low; } for (j = 0; j < header.list_count; j++) { u8 decoded[sizeof(u64)]; result = 
uds_read_from_buffered_reader(readers[i], decoded, sizeof(u64)); if (result != UDS_SUCCESS) { return vdo_log_warning_strerror(result, "failed to read volume index flush ranges"); } sub_index->flush_chapters[header.first_list + j] = get_unaligned_le64(decoded); } } for (z = 0; z < sub_index->zone_count; z++) { memset(&sub_index->zones[z], 0, sizeof(struct volume_sub_index_zone)); sub_index->zones[z].virtual_chapter_low = virtual_chapter_low; sub_index->zones[z].virtual_chapter_high = virtual_chapter_high; } result = uds_start_restoring_delta_index(&sub_index->delta_index, readers, reader_count); if (result != UDS_SUCCESS) return vdo_log_warning_strerror(result, "restoring delta index failed"); return UDS_SUCCESS; } static int start_restoring_volume_index(struct volume_index *volume_index, struct buffered_reader **buffered_readers, unsigned int reader_count) { unsigned int i; int result; if (!has_sparse(volume_index)) { return start_restoring_volume_sub_index(&volume_index->vi_non_hook, buffered_readers, reader_count); } for (i = 0; i < reader_count; i++) { struct volume_index_data header; u8 buffer[sizeof(struct volume_index_data)]; size_t offset = 0; result = uds_read_from_buffered_reader(buffered_readers[i], buffer, sizeof(buffer)); if (result != UDS_SUCCESS) { return vdo_log_warning_strerror(result, "failed to read volume index header"); } memcpy(&header.magic, buffer, MAGIC_SIZE); offset += MAGIC_SIZE; decode_u32_le(buffer, &offset, &header.sparse_sample_rate); result = VDO_ASSERT(offset == sizeof(buffer), "%zu bytes decoded of %zu expected", offset, sizeof(buffer)); if (result != VDO_SUCCESS) result = UDS_CORRUPT_DATA; if (memcmp(header.magic, MAGIC_START_6, MAGIC_SIZE) != 0) return vdo_log_warning_strerror(UDS_CORRUPT_DATA, "volume index file had bad magic number"); if (i == 0) { volume_index->sparse_sample_rate = header.sparse_sample_rate; } else if (volume_index->sparse_sample_rate != header.sparse_sample_rate) { vdo_log_warning_strerror(UDS_CORRUPT_DATA, "Inconsistent sparse sample rate in delta index zone files: %u vs. %u", volume_index->sparse_sample_rate, header.sparse_sample_rate); return UDS_CORRUPT_DATA; } } result = start_restoring_volume_sub_index(&volume_index->vi_non_hook, buffered_readers, reader_count); if (result != UDS_SUCCESS) return result; return start_restoring_volume_sub_index(&volume_index->vi_hook, buffered_readers, reader_count); } static int finish_restoring_volume_sub_index(struct volume_sub_index *sub_index, struct buffered_reader **buffered_readers, unsigned int reader_count) { return uds_finish_restoring_delta_index(&sub_index->delta_index, buffered_readers, reader_count); } static int finish_restoring_volume_index(struct volume_index *volume_index, struct buffered_reader **buffered_readers, unsigned int reader_count) { int result; result = finish_restoring_volume_sub_index(&volume_index->vi_non_hook, buffered_readers, reader_count); if ((result == UDS_SUCCESS) && has_sparse(volume_index)) { result = finish_restoring_volume_sub_index(&volume_index->vi_hook, buffered_readers, reader_count); } return result; } int uds_load_volume_index(struct volume_index *volume_index, struct buffered_reader **readers, unsigned int reader_count) { int result; /* Start by reading the header section of the stream. 
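 *
 * Loading happens in two phases: start_restoring_volume_index() reads the
 * per-zone headers and the saved flush_chapters arrays, and
 * finish_restoring_volume_index() then reads the delta list data itself. The
 * guard list check afterwards verifies that no unexpected data remains in any
 * of the save streams.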
*/ result = start_restoring_volume_index(volume_index, readers, reader_count); if (result != UDS_SUCCESS) return result; result = finish_restoring_volume_index(volume_index, readers, reader_count); if (result != UDS_SUCCESS) { abort_restoring_volume_index(volume_index); return result; } /* Check the final guard lists to make sure there is no extra data. */ result = uds_check_guard_delta_lists(readers, reader_count); if (result != UDS_SUCCESS) abort_restoring_volume_index(volume_index); return result; } static int start_saving_volume_sub_index(const struct volume_sub_index *sub_index, unsigned int zone_number, struct buffered_writer *buffered_writer) { int result; struct volume_sub_index_zone *volume_index_zone = &sub_index->zones[zone_number]; u32 first_list = sub_index->delta_index.delta_zones[zone_number].first_list; u32 list_count = sub_index->delta_index.delta_zones[zone_number].list_count; u8 buffer[sizeof(struct sub_index_data)]; size_t offset = 0; u32 i; memcpy(buffer, MAGIC_START_5, MAGIC_SIZE); offset += MAGIC_SIZE; encode_u64_le(buffer, &offset, sub_index->volume_nonce); encode_u64_le(buffer, &offset, volume_index_zone->virtual_chapter_low); encode_u64_le(buffer, &offset, volume_index_zone->virtual_chapter_high); encode_u32_le(buffer, &offset, first_list); encode_u32_le(buffer, &offset, list_count); result = VDO_ASSERT(offset == sizeof(struct sub_index_data), "%zu bytes of config written, of %zu expected", offset, sizeof(struct sub_index_data)); if (result != VDO_SUCCESS) return result; result = uds_write_to_buffered_writer(buffered_writer, buffer, offset); if (result != UDS_SUCCESS) return vdo_log_warning_strerror(result, "failed to write volume index header"); for (i = 0; i < list_count; i++) { u8 encoded[sizeof(u64)]; put_unaligned_le64(sub_index->flush_chapters[first_list + i], &encoded); result = uds_write_to_buffered_writer(buffered_writer, encoded, sizeof(u64)); if (result != UDS_SUCCESS) { return vdo_log_warning_strerror(result, "failed to write volume index flush ranges"); } } return uds_start_saving_delta_index(&sub_index->delta_index, zone_number, buffered_writer); } static int start_saving_volume_index(const struct volume_index *volume_index, unsigned int zone_number, struct buffered_writer *writer) { u8 buffer[sizeof(struct volume_index_data)]; size_t offset = 0; int result; if (!has_sparse(volume_index)) { return start_saving_volume_sub_index(&volume_index->vi_non_hook, zone_number, writer); } memcpy(buffer, MAGIC_START_6, MAGIC_SIZE); offset += MAGIC_SIZE; encode_u32_le(buffer, &offset, volume_index->sparse_sample_rate); result = VDO_ASSERT(offset == sizeof(struct volume_index_data), "%zu bytes of header written, of %zu expected", offset, sizeof(struct volume_index_data)); if (result != VDO_SUCCESS) return result; result = uds_write_to_buffered_writer(writer, buffer, offset); if (result != UDS_SUCCESS) { vdo_log_warning_strerror(result, "failed to write volume index header"); return result; } result = start_saving_volume_sub_index(&volume_index->vi_non_hook, zone_number, writer); if (result != UDS_SUCCESS) return result; return start_saving_volume_sub_index(&volume_index->vi_hook, zone_number, writer); } static int finish_saving_volume_sub_index(const struct volume_sub_index *sub_index, unsigned int zone_number) { return uds_finish_saving_delta_index(&sub_index->delta_index, zone_number); } static int finish_saving_volume_index(const struct volume_index *volume_index, unsigned int zone_number) { int result; result = 
finish_saving_volume_sub_index(&volume_index->vi_non_hook, zone_number); if ((result == UDS_SUCCESS) && has_sparse(volume_index)) result = finish_saving_volume_sub_index(&volume_index->vi_hook, zone_number); return result; } int uds_save_volume_index(struct volume_index *volume_index, struct buffered_writer **writers, unsigned int writer_count) { int result = UDS_SUCCESS; unsigned int zone; for (zone = 0; zone < writer_count; zone++) { result = start_saving_volume_index(volume_index, zone, writers[zone]); if (result != UDS_SUCCESS) break; result = finish_saving_volume_index(volume_index, zone); if (result != UDS_SUCCESS) break; result = uds_write_guard_delta_list(writers[zone]); if (result != UDS_SUCCESS) break; result = uds_flush_buffered_writer(writers[zone]); if (result != UDS_SUCCESS) break; } return result; } static void get_volume_sub_index_stats(const struct volume_sub_index *sub_index, struct volume_index_stats *stats) { struct delta_index_stats dis; unsigned int z; uds_get_delta_index_stats(&sub_index->delta_index, &dis); stats->rebalance_time = dis.rebalance_time; stats->rebalance_count = dis.rebalance_count; stats->record_count = dis.record_count; stats->collision_count = dis.collision_count; stats->discard_count = dis.discard_count; stats->overflow_count = dis.overflow_count; stats->delta_lists = dis.list_count; stats->early_flushes = 0; for (z = 0; z < sub_index->zone_count; z++) stats->early_flushes += sub_index->zones[z].early_flushes; } void uds_get_volume_index_stats(const struct volume_index *volume_index, struct volume_index_stats *stats) { struct volume_index_stats sparse_stats; get_volume_sub_index_stats(&volume_index->vi_non_hook, stats); if (!has_sparse(volume_index)) return; get_volume_sub_index_stats(&volume_index->vi_hook, &sparse_stats); stats->rebalance_time += sparse_stats.rebalance_time; stats->rebalance_count += sparse_stats.rebalance_count; stats->record_count += sparse_stats.record_count; stats->collision_count += sparse_stats.collision_count; stats->discard_count += sparse_stats.discard_count; stats->overflow_count += sparse_stats.overflow_count; stats->delta_lists += sparse_stats.delta_lists; stats->early_flushes += sparse_stats.early_flushes; } static int initialize_volume_sub_index(const struct uds_configuration *config, u64 volume_nonce, u8 tag, struct volume_sub_index *sub_index) { struct sub_index_parameters params = { .address_bits = 0 }; unsigned int zone_count = config->zone_count; u64 available_bytes = 0; unsigned int z; int result; result = compute_volume_sub_index_parameters(config, &params); if (result != UDS_SUCCESS) return result; sub_index->address_bits = params.address_bits; sub_index->address_mask = (1u << params.address_bits) - 1; sub_index->chapter_bits = params.chapter_bits; sub_index->chapter_mask = (1u << params.chapter_bits) - 1; sub_index->chapter_count = params.chapter_count; sub_index->list_count = params.list_count; sub_index->zone_count = zone_count; sub_index->chapter_zone_bits = params.chapter_size_in_bits / zone_count; sub_index->volume_nonce = volume_nonce; result = uds_initialize_delta_index(&sub_index->delta_index, zone_count, params.list_count, params.mean_delta, params.chapter_bits, params.memory_size, tag); if (result != UDS_SUCCESS) return result; for (z = 0; z < sub_index->delta_index.zone_count; z++) available_bytes += sub_index->delta_index.delta_zones[z].size; available_bytes -= params.target_free_bytes; sub_index->max_zone_bits = (available_bytes * BITS_PER_BYTE) / zone_count; sub_index->memory_size = 
(sub_index->delta_index.memory_size + sizeof(struct volume_sub_index) + (params.list_count * sizeof(u64)) + (zone_count * sizeof(struct volume_sub_index_zone))); /* The following arrays are initialized to all zeros. */ result = vdo_allocate(params.list_count, u64, "first chapter to flush", &sub_index->flush_chapters); if (result != VDO_SUCCESS) return result; return vdo_allocate(zone_count, struct volume_sub_index_zone, "volume index zones", &sub_index->zones); } int uds_make_volume_index(const struct uds_configuration *config, u64 volume_nonce, struct volume_index **volume_index_ptr) { struct split_config split; unsigned int zone; struct volume_index *volume_index; int result; result = vdo_allocate(1, struct volume_index, "volume index", &volume_index); if (result != VDO_SUCCESS) return result; volume_index->zone_count = config->zone_count; if (!uds_is_sparse_index_geometry(config->geometry)) { result = initialize_volume_sub_index(config, volume_nonce, 'm', &volume_index->vi_non_hook); if (result != UDS_SUCCESS) { uds_free_volume_index(volume_index); return result; } volume_index->memory_size = volume_index->vi_non_hook.memory_size; *volume_index_ptr = volume_index; return UDS_SUCCESS; } volume_index->sparse_sample_rate = config->sparse_sample_rate; result = vdo_allocate(config->zone_count, struct volume_index_zone, "volume index zones", &volume_index->zones); if (result != VDO_SUCCESS) { uds_free_volume_index(volume_index); return result; } for (zone = 0; zone < config->zone_count; zone++) mutex_init(&volume_index->zones[zone].hook_mutex); split_configuration(config, &split); result = initialize_volume_sub_index(&split.non_hook_config, volume_nonce, 'd', &volume_index->vi_non_hook); if (result != UDS_SUCCESS) { uds_free_volume_index(volume_index); return vdo_log_error_strerror(result, "Error creating non hook volume index"); } result = initialize_volume_sub_index(&split.hook_config, volume_nonce, 's', &volume_index->vi_hook); if (result != UDS_SUCCESS) { uds_free_volume_index(volume_index); return vdo_log_error_strerror(result, "Error creating hook volume index"); } volume_index->memory_size = volume_index->vi_non_hook.memory_size + volume_index->vi_hook.memory_size; *volume_index_ptr = volume_index; return UDS_SUCCESS; } vdo-8.3.1.1/utils/uds/volume-index.h000066400000000000000000000146651476467262700172330ustar00rootroot00000000000000/* SPDX-License-Identifier: GPL-2.0-only */ /* * Copyright 2023 Red Hat */ #ifndef UDS_VOLUME_INDEX_H #define UDS_VOLUME_INDEX_H #include #include "thread-utils.h" #include "config.h" #include "delta-index.h" #include "indexer.h" /* * The volume index is the primary top-level index for UDS. It contains records which map a record * name to the chapter where a record with that name is stored. This mapping can definitively say * when no record exists. However, because we only use a subset of the name for this index, it * cannot definitively say that a record for the entry does exist. It can only say that if a record * exists, it will be in a particular chapter. The request can then be dispatched to that chapter * for further processing. * * If the volume_index_record does not actually match the record name, the index can store a more * specific collision record to disambiguate the new entry from the existing one. Index entries are * managed with volume_index_record structures. 
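 *
 * "Subset of the name" here means that separate portions of the record name
 * select a delta list number and an address key within that list (see
 * extract_dlist_num() and extract_address() in volume-index.c), so a matching
 * entry only narrows the search to one chapter rather than proving that the
 * record exists.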
*/ #define NO_CHAPTER U64_MAX struct volume_index_stats { /* Nanoseconds spent rebalancing */ ktime_t rebalance_time; /* Number of memory rebalances */ u32 rebalance_count; /* The number of records in the index */ u64 record_count; /* The number of collision records */ u64 collision_count; /* The number of records removed */ u64 discard_count; /* The number of UDS_OVERFLOWs detected */ u64 overflow_count; /* The number of delta lists */ u32 delta_lists; /* Number of early flushes */ u64 early_flushes; }; struct volume_sub_index_zone { u64 virtual_chapter_low; u64 virtual_chapter_high; u64 early_flushes; } __aligned(L1_CACHE_BYTES); struct volume_sub_index { /* The delta index */ struct delta_index delta_index; /* The first chapter to be flushed in each zone */ u64 *flush_chapters; /* The zones */ struct volume_sub_index_zone *zones; /* The volume nonce */ u64 volume_nonce; /* Expected size of a chapter (per zone) */ u64 chapter_zone_bits; /* Maximum size of the index (per zone) */ u64 max_zone_bits; /* The number of bits in address mask */ u8 address_bits; /* Mask to get address within delta list */ u32 address_mask; /* The number of bits in chapter number */ u8 chapter_bits; /* The largest storable chapter number */ u32 chapter_mask; /* The number of chapters used */ u32 chapter_count; /* The number of delta lists */ u32 list_count; /* The number of zones */ unsigned int zone_count; /* The amount of memory allocated */ u64 memory_size; }; struct volume_index_zone { /* Protects the sampled index in this zone */ struct mutex hook_mutex; } __aligned(L1_CACHE_BYTES); struct volume_index { u32 sparse_sample_rate; unsigned int zone_count; u64 memory_size; struct volume_sub_index vi_non_hook; struct volume_sub_index vi_hook; struct volume_index_zone *zones; }; /* * The volume_index_record structure is used to facilitate processing of a record name. A client * first calls uds_get_volume_index_record() to find the volume index record for a record name. The * fields of the record can then be examined to determine the state of the record. * * If is_found is false, then the index did not find an entry for the record name. Calling * uds_put_volume_index_record() will insert a new entry for that name at the proper place. * * If is_found is true, then we did find an entry for the record name, and the virtual_chapter and * is_collision fields reflect the entry found. Subsequently, a call to * uds_remove_volume_index_record() will remove the entry, a call to * uds_set_volume_index_record_chapter() will update the existing entry, and a call to * uds_put_volume_index_record() will insert a new collision record after the existing entry. 
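 *
 * A minimal sketch of that flow (illustrative only; volume_index, name, and
 * chapter stand for values the caller already has):
 *
 *	struct volume_index_record record;
 *	int result;
 *
 *	result = uds_get_volume_index_record(volume_index, name, &record);
 *	if (result != UDS_SUCCESS)
 *		return result;
 *	if (record.is_found)
 *		result = uds_set_volume_index_record_chapter(&record, chapter);
 *	else
 *		result = uds_put_volume_index_record(&record, chapter);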
*/ struct volume_index_record { /* Public fields */ /* Chapter where the record info is found */ u64 virtual_chapter; /* This record is a collision */ bool is_collision; /* This record is the requested record */ bool is_found; /* Private fields */ /* Zone that contains this name */ unsigned int zone_number; /* The volume index */ struct volume_sub_index *sub_index; /* Mutex for accessing this delta index entry in the hook index */ struct mutex *mutex; /* The record name to which this record refers */ const struct uds_record_name *name; /* The delta index entry for this record */ struct delta_index_entry delta_entry; }; int __must_check uds_make_volume_index(const struct uds_configuration *config, u64 volume_nonce, struct volume_index **volume_index); void uds_free_volume_index(struct volume_index *volume_index); int __must_check uds_compute_volume_index_save_blocks(const struct uds_configuration *config, size_t block_size, u64 *block_count); unsigned int __must_check uds_get_volume_index_zone(const struct volume_index *volume_index, const struct uds_record_name *name); bool __must_check uds_is_volume_index_sample(const struct volume_index *volume_index, const struct uds_record_name *name); /* * This function is only used to manage sparse cache membership. Most requests should use * uds_get_volume_index_record() to look up index records instead. */ u64 __must_check uds_lookup_volume_index_name(const struct volume_index *volume_index, const struct uds_record_name *name); int __must_check uds_get_volume_index_record(struct volume_index *volume_index, const struct uds_record_name *name, struct volume_index_record *record); int __must_check uds_put_volume_index_record(struct volume_index_record *record, u64 virtual_chapter); int __must_check uds_remove_volume_index_record(struct volume_index_record *record); int __must_check uds_set_volume_index_record_chapter(struct volume_index_record *record, u64 virtual_chapter); void uds_set_volume_index_open_chapter(struct volume_index *volume_index, u64 virtual_chapter); void uds_set_volume_index_zone_open_chapter(struct volume_index *volume_index, unsigned int zone_number, u64 virtual_chapter); int __must_check uds_load_volume_index(struct volume_index *volume_index, struct buffered_reader **readers, unsigned int reader_count); int __must_check uds_save_volume_index(struct volume_index *volume_index, struct buffered_writer **writers, unsigned int writer_count); void uds_get_volume_index_stats(const struct volume_index *volume_index, struct volume_index_stats *stats); #endif /* UDS_VOLUME_INDEX_H */ vdo-8.3.1.1/utils/uds/volume.c000066400000000000000000001536441476467262700161220ustar00rootroot00000000000000// SPDX-License-Identifier: GPL-2.0-only /* * Copyright 2023 Red Hat */ #include "volume.h" #include #include #include #include "errors.h" #include "logger.h" #include "memory-alloc.h" #include "permassert.h" #include "string-utils.h" #include "thread-utils.h" #include "chapter-index.h" #include "config.h" #include "geometry.h" #include "hash-utils.h" #include "index.h" #include "sparse-cache.h" /* * The first block of the volume layout is reserved for the volume header, which is no longer used. * The remainder of the volume is divided into chapters consisting of several pages of records, and * several pages of static index to use to find those records. The index pages are recorded first, * followed by the record pages. 
The chapters are written in order as they are filled, so the * volume storage acts as a circular log of the most recent chapters, with each new chapter * overwriting the oldest saved one. * * When a new chapter is filled and closed, the records from that chapter are sorted and * interleaved in approximate temporal order, and assigned to record pages. Then a static delta * index is generated to store which record page contains each record. The in-memory index page map * is also updated to indicate which delta lists fall on each chapter index page. This means that * when a record is read, the volume only has to load a single index page and a single record page, * rather than search the entire chapter. These index and record pages are written to storage, and * the index pages are transferred to the page cache under the theory that the most recently * written chapter is likely to be accessed again soon. * * When reading a record, the volume index will indicate which chapter should contain it. The * volume uses the index page map to determine which chapter index page needs to be loaded, and * then reads the relevant record page number from the chapter index. Both index and record pages * are stored in a page cache when read for the common case that subsequent records need the same * pages. The page cache evicts the least recently accessed entries when caching new pages. In * addition, the volume uses dm-bufio to manage access to the storage, which may allow for * additional caching depending on available system resources. * * Record requests are handled from cached pages when possible. If a page needs to be read, it is * placed on a queue along with the request that wants to read it. Any requests for the same page * that arrive while the read is pending are added to the queue entry. A separate reader thread * handles the queued reads, adding the page to the cache and updating any requests queued with it * so they can continue processing. This allows the index zone threads to continue processing new * requests rather than wait for the storage reads. * * When an index rebuild is necessary, the volume reads each stored chapter to determine which * range of chapters contain valid records, so that those records can be used to reconstruct the * in-memory volume index. */ /* The maximum allowable number of contiguous bad chapters */ #define MAX_BAD_CHAPTERS 100 #define VOLUME_CACHE_MAX_ENTRIES (U16_MAX >> 1) #define VOLUME_CACHE_QUEUED_FLAG (1 << 15) #define VOLUME_CACHE_MAX_QUEUED_READS 4096 static const u64 BAD_CHAPTER = U64_MAX; /* * The invalidate counter is two 32 bits fields stored together atomically. The low order 32 bits * are the physical page number of the cached page being read. The high order 32 bits are a * sequence number. This value is written when the zone that owns it begins or completes a cache * search. Any other thread will only read the counter in wait_for_pending_searches() while waiting * to update the cache contents. 
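 *
 * Because the sequence number is incremented once when a search begins and
 * once when it ends, an odd counter value means a search is in progress (see
 * search_pending()). wait_for_pending_searches() only needs to wait for the
 * sampled value to change, since any change means the search it observed has
 * finished.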
*/ union invalidate_counter { u64 value; struct { u32 page; u32 counter; }; }; static inline u32 map_to_page_number(struct index_geometry *geometry, u32 physical_page) { return (physical_page - HEADER_PAGES_PER_VOLUME) % geometry->pages_per_chapter; } static inline u32 map_to_chapter_number(struct index_geometry *geometry, u32 physical_page) { return (physical_page - HEADER_PAGES_PER_VOLUME) / geometry->pages_per_chapter; } static inline bool is_record_page(struct index_geometry *geometry, u32 physical_page) { return map_to_page_number(geometry, physical_page) >= geometry->index_pages_per_chapter; } static u32 map_to_physical_page(const struct index_geometry *geometry, u32 chapter, u32 page) { /* Page zero is the header page, so the first chapter index page is page one. */ return HEADER_PAGES_PER_VOLUME + (geometry->pages_per_chapter * chapter) + page; } static inline union invalidate_counter get_invalidate_counter(struct page_cache *cache, unsigned int zone_number) { return (union invalidate_counter) { .value = READ_ONCE(cache->search_pending_counters[zone_number].atomic_value), }; } static inline void set_invalidate_counter(struct page_cache *cache, unsigned int zone_number, union invalidate_counter invalidate_counter) { WRITE_ONCE(cache->search_pending_counters[zone_number].atomic_value, invalidate_counter.value); } static inline bool search_pending(union invalidate_counter invalidate_counter) { return (invalidate_counter.counter & 1) != 0; } /* Lock the cache for a zone in order to search for a page. */ static void begin_pending_search(struct page_cache *cache, u32 physical_page, unsigned int zone_number) { union invalidate_counter invalidate_counter = get_invalidate_counter(cache, zone_number); invalidate_counter.page = physical_page; invalidate_counter.counter++; set_invalidate_counter(cache, zone_number, invalidate_counter); VDO_ASSERT_LOG_ONLY(search_pending(invalidate_counter), "Search is pending for zone %u", zone_number); /* * This memory barrier ensures that the write to the invalidate counter is seen by other * threads before this thread accesses the cached page. The corresponding read memory * barrier is in wait_for_pending_searches(). */ smp_mb(); } /* Unlock the cache for a zone by clearing its invalidate counter. */ static void end_pending_search(struct page_cache *cache, unsigned int zone_number) { union invalidate_counter invalidate_counter; /* * This memory barrier ensures that this thread completes reads of the * cached page before other threads see the write to the invalidate * counter. */ smp_mb(); invalidate_counter = get_invalidate_counter(cache, zone_number); VDO_ASSERT_LOG_ONLY(search_pending(invalidate_counter), "Search is pending for zone %u", zone_number); invalidate_counter.counter++; set_invalidate_counter(cache, zone_number, invalidate_counter); } static void wait_for_pending_searches(struct page_cache *cache, u32 physical_page) { union invalidate_counter initial_counters[MAX_ZONES]; unsigned int i; /* * We hold the read_threads_mutex. We are waiting for threads that do not hold the * read_threads_mutex. Those threads have "locked" their targeted page by setting the * search_pending_counter. The corresponding write memory barrier is in * begin_pending_search(). */ smp_mb(); for (i = 0; i < cache->zone_count; i++) initial_counters[i] = get_invalidate_counter(cache, i); for (i = 0; i < cache->zone_count; i++) { if (search_pending(initial_counters[i]) && (initial_counters[i].page == physical_page)) { /* * There is an active search using the physical page. 
We need to wait for * the search to finish. * * FIXME: Investigate using wait_event() to wait for the search to finish. */ while (initial_counters[i].value == get_invalidate_counter(cache, i).value) cond_resched(); } } } static void release_page_buffer(struct cached_page *page) { if (page->buffer != NULL) dm_bufio_release(vdo_forget(page->buffer)); } static void clear_cache_page(struct page_cache *cache, struct cached_page *page) { /* Do not clear read_pending because the read queue relies on it. */ release_page_buffer(page); page->physical_page = cache->indexable_pages; WRITE_ONCE(page->last_used, 0); } static void make_page_most_recent(struct page_cache *cache, struct cached_page *page) { /* * ASSERTION: We are either a zone thread holding a search_pending_counter, or we are any * thread holding the read_threads_mutex. */ if (atomic64_read(&cache->clock) != READ_ONCE(page->last_used)) WRITE_ONCE(page->last_used, atomic64_inc_return(&cache->clock)); } /* Select a page to remove from the cache to make space for a new entry. */ static struct cached_page *select_victim_in_cache(struct page_cache *cache) { struct cached_page *page; int oldest_index = 0; s64 oldest_time = S64_MAX; s64 last_used; u16 i; /* Find the oldest unclaimed page. We hold the read_threads_mutex. */ for (i = 0; i < cache->cache_slots; i++) { /* A page with a pending read must not be replaced. */ if (cache->cache[i].read_pending) continue; last_used = READ_ONCE(cache->cache[i].last_used); if (last_used <= oldest_time) { oldest_time = last_used; oldest_index = i; } } page = &cache->cache[oldest_index]; if (page->physical_page != cache->indexable_pages) { WRITE_ONCE(cache->index[page->physical_page], cache->cache_slots); wait_for_pending_searches(cache, page->physical_page); } page->read_pending = true; clear_cache_page(cache, page); return page; } /* Make a newly filled cache entry available to other threads. */ static int put_page_in_cache(struct page_cache *cache, u32 physical_page, struct cached_page *page) { int result; /* We hold the read_threads_mutex. */ result = VDO_ASSERT((page->read_pending), "page to install has a pending read"); if (result != VDO_SUCCESS) return result; page->physical_page = physical_page; make_page_most_recent(cache, page); page->read_pending = false; /* * We hold the read_threads_mutex, but we must have a write memory barrier before making * the cached_page available to the readers that do not hold the mutex. The corresponding * read memory barrier is in get_page_and_index(). */ smp_wmb(); /* This assignment also clears the queued flag. */ WRITE_ONCE(cache->index[physical_page], page - cache->cache); return UDS_SUCCESS; } static void cancel_page_in_cache(struct page_cache *cache, u32 physical_page, struct cached_page *page) { int result; /* We hold the read_threads_mutex. */ result = VDO_ASSERT((page->read_pending), "page to install has a pending read"); if (result != VDO_SUCCESS) return; clear_cache_page(cache, page); page->read_pending = false; /* Clear the mapping and the queued flag for the new page. 
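 *
 * Storing cache_slots (one past the last valid slot number) is the
 * "not cached" sentinel: get_page_and_index() treats any index value that is
 * not a valid slot number as a cache miss.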
*/ WRITE_ONCE(cache->index[physical_page], cache->cache_slots); } static inline u16 next_queue_position(u16 position) { return (position + 1) % VOLUME_CACHE_MAX_QUEUED_READS; } static inline void advance_queue_position(u16 *position) { *position = next_queue_position(*position); } static inline bool read_queue_is_full(struct page_cache *cache) { return cache->read_queue_first == next_queue_position(cache->read_queue_last); } static bool enqueue_read(struct page_cache *cache, struct uds_request *request, u32 physical_page) { struct queued_read *queue_entry; u16 last = cache->read_queue_last; u16 read_queue_index; /* We hold the read_threads_mutex. */ if ((cache->index[physical_page] & VOLUME_CACHE_QUEUED_FLAG) == 0) { /* This page has no existing entry in the queue. */ if (read_queue_is_full(cache)) return false; /* Fill in the read queue entry. */ cache->read_queue[last].physical_page = physical_page; cache->read_queue[last].invalid = false; cache->read_queue[last].first_request = NULL; cache->read_queue[last].last_request = NULL; /* Point the cache index to the read queue entry. */ read_queue_index = last; WRITE_ONCE(cache->index[physical_page], read_queue_index | VOLUME_CACHE_QUEUED_FLAG); advance_queue_position(&cache->read_queue_last); } else { /* It's already queued, so add this request to the existing entry. */ read_queue_index = cache->index[physical_page] & ~VOLUME_CACHE_QUEUED_FLAG; } request->next_request = NULL; queue_entry = &cache->read_queue[read_queue_index]; if (queue_entry->first_request == NULL) queue_entry->first_request = request; else queue_entry->last_request->next_request = request; queue_entry->last_request = request; return true; } static void enqueue_page_read(struct volume *volume, struct uds_request *request, u32 physical_page) { /* Mark the page as queued, so that chapter invalidation knows to cancel a read. */ while (!enqueue_read(&volume->page_cache, request, physical_page)) { vdo_log_debug("Read queue full, waiting for reads to finish"); uds_wait_cond(&volume->read_threads_read_done_cond, &volume->read_threads_mutex); } uds_signal_cond(&volume->read_threads_cond); } /* * Reserve the next read queue entry for processing, but do not actually remove it from the queue. * Must be followed by release_queued_requests(). */ static struct queued_read *reserve_read_queue_entry(struct page_cache *cache) { /* We hold the read_threads_mutex. */ struct queued_read *entry; u16 index_value; bool queued; /* No items to dequeue */ if (cache->read_queue_next_read == cache->read_queue_last) return NULL; entry = &cache->read_queue[cache->read_queue_next_read]; index_value = cache->index[entry->physical_page]; queued = (index_value & VOLUME_CACHE_QUEUED_FLAG) != 0; /* Check to see if it's still queued before resetting. */ if (entry->invalid && queued) WRITE_ONCE(cache->index[entry->physical_page], cache->cache_slots); /* * If a synchronous read has taken this page, set invalid to true so it doesn't get * overwritten. Requests will just be requeued. 
*/ if (!queued) entry->invalid = true; entry->reserved = true; advance_queue_position(&cache->read_queue_next_read); return entry; } static inline struct queued_read *wait_to_reserve_read_queue_entry(struct volume *volume) { struct queued_read *queue_entry = NULL; while (!volume->read_threads_exiting) { queue_entry = reserve_read_queue_entry(&volume->page_cache); if (queue_entry != NULL) break; uds_wait_cond(&volume->read_threads_cond, &volume->read_threads_mutex); } return queue_entry; } static int init_chapter_index_page(const struct volume *volume, u8 *index_page, u32 chapter, u32 index_page_number, struct delta_index_page *chapter_index_page) { u64 ci_virtual; u32 ci_chapter; u32 lowest_list; u32 highest_list; struct index_geometry *geometry = volume->geometry; int result; result = uds_initialize_chapter_index_page(chapter_index_page, geometry, index_page, volume->nonce); if (volume->lookup_mode == LOOKUP_FOR_REBUILD) return result; if (result != UDS_SUCCESS) { return vdo_log_error_strerror(result, "Reading chapter index page for chapter %u page %u", chapter, index_page_number); } uds_get_list_number_bounds(volume->index_page_map, chapter, index_page_number, &lowest_list, &highest_list); ci_virtual = chapter_index_page->virtual_chapter_number; ci_chapter = uds_map_to_physical_chapter(geometry, ci_virtual); if ((chapter == ci_chapter) && (lowest_list == chapter_index_page->lowest_list_number) && (highest_list == chapter_index_page->highest_list_number)) return UDS_SUCCESS; vdo_log_warning("Index page map updated to %llu", (unsigned long long) volume->index_page_map->last_update); vdo_log_warning("Page map expects that chapter %u page %u has range %u to %u, but chapter index page has chapter %llu with range %u to %u", chapter, index_page_number, lowest_list, highest_list, (unsigned long long) ci_virtual, chapter_index_page->lowest_list_number, chapter_index_page->highest_list_number); return vdo_log_error_strerror(UDS_CORRUPT_DATA, "index page map mismatch with chapter index"); } static int initialize_index_page(const struct volume *volume, u32 physical_page, struct cached_page *page) { u32 chapter = map_to_chapter_number(volume->geometry, physical_page); u32 index_page_number = map_to_page_number(volume->geometry, physical_page); return init_chapter_index_page(volume, dm_bufio_get_block_data(page->buffer), chapter, index_page_number, &page->index_page); } static bool search_record_page(const u8 record_page[], const struct uds_record_name *name, const struct index_geometry *geometry, struct uds_record_data *metadata) { /* * The array of records is sorted by name and stored as a binary tree in heap order, so the * root of the tree is the first array element. */ u32 node = 0; const struct uds_volume_record *records = (const struct uds_volume_record *) record_page; while (node < geometry->records_per_page) { int result; const struct uds_volume_record *record = &records[node]; result = memcmp(name, &record->name, UDS_RECORD_NAME_SIZE); if (result == 0) { if (metadata != NULL) *metadata = record->data; return true; } /* The children of node N are at indexes 2N+1 and 2N+2. */ node = ((2 * node) + ((result < 0) ? 1 : 2)); } return false; } /* * If we've read in a record page, we're going to do an immediate search, to speed up processing by * avoiding get_record_from_zone(), and to ensure that requests make progress even when queued. If * we've read in an index page, we save the record page number so we don't have to resolve the * index page again. 
We use the location, virtual_chapter, and old_metadata fields in the request * to allow the index code to know where to begin processing the request again. */ static int search_page(struct cached_page *page, const struct volume *volume, struct uds_request *request, u32 physical_page) { int result; enum uds_index_region location; u16 record_page_number; if (is_record_page(volume->geometry, physical_page)) { if (search_record_page(dm_bufio_get_block_data(page->buffer), &request->record_name, volume->geometry, &request->old_metadata)) location = UDS_LOCATION_RECORD_PAGE_LOOKUP; else location = UDS_LOCATION_UNAVAILABLE; } else { result = uds_search_chapter_index_page(&page->index_page, volume->geometry, &request->record_name, &record_page_number); if (result != UDS_SUCCESS) return result; if (record_page_number == NO_CHAPTER_INDEX_ENTRY) { location = UDS_LOCATION_UNAVAILABLE; } else { location = UDS_LOCATION_INDEX_PAGE_LOOKUP; *((u16 *) &request->old_metadata) = record_page_number; } } request->location = location; request->found = false; return UDS_SUCCESS; } static int process_entry(struct volume *volume, struct queued_read *entry) { u32 page_number = entry->physical_page; struct uds_request *request; struct cached_page *page = NULL; u8 *page_data; int result; if (entry->invalid) { vdo_log_debug("Requeuing requests for invalid page"); return UDS_SUCCESS; } page = select_victim_in_cache(&volume->page_cache); mutex_unlock(&volume->read_threads_mutex); page_data = dm_bufio_read(volume->client, page_number, &page->buffer); mutex_lock(&volume->read_threads_mutex); if (IS_ERR(page_data)) { result = -PTR_ERR(page_data); vdo_log_warning_strerror(result, "error reading physical page %u from volume", page_number); cancel_page_in_cache(&volume->page_cache, page_number, page); return result; } if (entry->invalid) { vdo_log_warning("Page %u invalidated after read", page_number); cancel_page_in_cache(&volume->page_cache, page_number, page); return UDS_SUCCESS; } if (!is_record_page(volume->geometry, page_number)) { result = initialize_index_page(volume, page_number, page); if (result != UDS_SUCCESS) { vdo_log_warning("Error initializing chapter index page"); cancel_page_in_cache(&volume->page_cache, page_number, page); return result; } } result = put_page_in_cache(&volume->page_cache, page_number, page); if (result != UDS_SUCCESS) { vdo_log_warning("Error putting page %u in cache", page_number); cancel_page_in_cache(&volume->page_cache, page_number, page); return result; } request = entry->first_request; while ((request != NULL) && (result == UDS_SUCCESS)) { result = search_page(page, volume, request, page_number); request = request->next_request; } return result; } static void release_queued_requests(struct volume *volume, struct queued_read *entry, int result) { struct page_cache *cache = &volume->page_cache; u16 next_read = cache->read_queue_next_read; struct uds_request *request; struct uds_request *next; for (request = entry->first_request; request != NULL; request = next) { next = request->next_request; request->status = result; request->requeued = true; uds_enqueue_request(request, STAGE_INDEX); } entry->reserved = false; /* Move the read_queue_first pointer as far as we can. 
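 *
 * Only entries whose reserved flag has been cleared are skipped; an earlier
 * entry still reserved by another reader thread keeps read_queue_first pinned
 * until its release_queued_requests() call completes.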
*/ while ((cache->read_queue_first != next_read) && (!cache->read_queue[cache->read_queue_first].reserved)) advance_queue_position(&cache->read_queue_first); uds_broadcast_cond(&volume->read_threads_read_done_cond); } static void read_thread_function(void *arg) { struct volume *volume = arg; vdo_log_debug("reader starting"); mutex_lock(&volume->read_threads_mutex); while (true) { struct queued_read *queue_entry; int result; queue_entry = wait_to_reserve_read_queue_entry(volume); if (volume->read_threads_exiting) break; result = process_entry(volume, queue_entry); release_queued_requests(volume, queue_entry, result); } mutex_unlock(&volume->read_threads_mutex); vdo_log_debug("reader done"); } static void get_page_and_index(struct page_cache *cache, u32 physical_page, int *queue_index, struct cached_page **page_ptr) { u16 index_value; u16 index; bool queued; /* * ASSERTION: We are either a zone thread holding a search_pending_counter, or we are any * thread holding the read_threads_mutex. * * Holding only a search_pending_counter is the most frequent case. */ /* * It would be unlikely for the compiler to turn the usage of index_value into two reads of * cache->index, but it would be possible and very bad if those reads did not return the * same bits. */ index_value = READ_ONCE(cache->index[physical_page]); queued = (index_value & VOLUME_CACHE_QUEUED_FLAG) != 0; index = index_value & ~VOLUME_CACHE_QUEUED_FLAG; if (!queued && (index < cache->cache_slots)) { *page_ptr = &cache->cache[index]; /* * We have acquired access to the cached page, but unless we hold the * read_threads_mutex, we need a read memory barrier now. The corresponding write * memory barrier is in put_page_in_cache(). */ smp_rmb(); } else { *page_ptr = NULL; } *queue_index = queued ? index : -1; } static void get_page_from_cache(struct page_cache *cache, u32 physical_page, struct cached_page **page) { /* * ASSERTION: We are in a zone thread. * ASSERTION: We holding a search_pending_counter or the read_threads_mutex. */ int queue_index = -1; get_page_and_index(cache, physical_page, &queue_index, page); } static int read_page_locked(struct volume *volume, u32 physical_page, struct cached_page **page_ptr) { int result = UDS_SUCCESS; struct cached_page *page = NULL; u8 *page_data; page = select_victim_in_cache(&volume->page_cache); page_data = dm_bufio_read(volume->client, physical_page, &page->buffer); if (IS_ERR(page_data)) { result = -PTR_ERR(page_data); vdo_log_warning_strerror(result, "error reading physical page %u from volume", physical_page); cancel_page_in_cache(&volume->page_cache, physical_page, page); return result; } if (!is_record_page(volume->geometry, physical_page)) { result = initialize_index_page(volume, physical_page, page); if (result != UDS_SUCCESS) { if (volume->lookup_mode != LOOKUP_FOR_REBUILD) vdo_log_warning("Corrupt index page %u", physical_page); cancel_page_in_cache(&volume->page_cache, physical_page, page); return result; } } result = put_page_in_cache(&volume->page_cache, physical_page, page); if (result != UDS_SUCCESS) { vdo_log_warning("Error putting page %u in cache", physical_page); cancel_page_in_cache(&volume->page_cache, physical_page, page); return result; } *page_ptr = page; return UDS_SUCCESS; } /* Retrieve a page from the cache while holding the read threads mutex. 
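 *
 * This is the path used with the read_threads_mutex already held: the wrapper
 * get_volume_page() takes the mutex around it for callers such as
 * uds_get_volume_record_page() and the rebuild searches, and it reads any
 * missing page synchronously. Request processing instead goes through
 * get_volume_page_protected() below, which holds a search_pending counter and
 * may queue the read for a reader thread, returning UDS_QUEUED.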
*/ static int get_volume_page_locked(struct volume *volume, u32 physical_page, struct cached_page **page_ptr) { int result; struct cached_page *page = NULL; get_page_from_cache(&volume->page_cache, physical_page, &page); if (page == NULL) { result = read_page_locked(volume, physical_page, &page); if (result != UDS_SUCCESS) return result; } else { make_page_most_recent(&volume->page_cache, page); } *page_ptr = page; return UDS_SUCCESS; } /* Retrieve a page from the cache while holding a search_pending lock. */ static int get_volume_page_protected(struct volume *volume, struct uds_request *request, u32 physical_page, struct cached_page **page_ptr) { struct cached_page *page; get_page_from_cache(&volume->page_cache, physical_page, &page); if (page != NULL) { if (request->zone_number == 0) { /* Only one zone is allowed to update the LRU. */ make_page_most_recent(&volume->page_cache, page); } *page_ptr = page; return UDS_SUCCESS; } /* Prepare to enqueue a read for the page. */ end_pending_search(&volume->page_cache, request->zone_number); mutex_lock(&volume->read_threads_mutex); /* * Do the lookup again while holding the read mutex (no longer the fast case so this should * be fine to repeat). We need to do this because a page may have been added to the cache * by a reader thread between the time we searched above and the time we went to actually * try to enqueue it below. This could result in us enqueuing another read for a page which * is already in the cache, which would mean we end up with two entries in the cache for * the same page. */ get_page_from_cache(&volume->page_cache, physical_page, &page); if (page == NULL) { enqueue_page_read(volume, request, physical_page); /* * The performance gain from unlocking first, while "search pending" mode is off, * turns out to be significant in some cases. The page is not available yet so * the order does not matter for correctness as it does below. */ mutex_unlock(&volume->read_threads_mutex); begin_pending_search(&volume->page_cache, physical_page, request->zone_number); return UDS_QUEUED; } /* * Now that the page is loaded, the volume needs to switch to "reader thread unlocked" and * "search pending" state in careful order so no other thread can mess with the data before * the caller gets to look at it. */ begin_pending_search(&volume->page_cache, physical_page, request->zone_number); mutex_unlock(&volume->read_threads_mutex); *page_ptr = page; return UDS_SUCCESS; } static int get_volume_page(struct volume *volume, u32 chapter, u32 page_number, struct cached_page **page_ptr) { int result; u32 physical_page = map_to_physical_page(volume->geometry, chapter, page_number); mutex_lock(&volume->read_threads_mutex); result = get_volume_page_locked(volume, physical_page, page_ptr); mutex_unlock(&volume->read_threads_mutex); return result; } int uds_get_volume_record_page(struct volume *volume, u32 chapter, u32 page_number, u8 **data_ptr) { int result; struct cached_page *page = NULL; result = get_volume_page(volume, chapter, page_number, &page); if (result == UDS_SUCCESS) *data_ptr = dm_bufio_get_block_data(page->buffer); return result; } int uds_get_volume_index_page(struct volume *volume, u32 chapter, u32 page_number, struct delta_index_page **index_page_ptr) { int result; struct cached_page *page = NULL; result = get_volume_page(volume, chapter, page_number, &page); if (result == UDS_SUCCESS) *index_page_ptr = &page->index_page; return result; } /* * Find the record page associated with a name in a given index page. 
This will return UDS_QUEUED * if the page in question must be read from storage. */ static int search_cached_index_page(struct volume *volume, struct uds_request *request, u32 chapter, u32 index_page_number, u16 *record_page_number) { int result; struct cached_page *page = NULL; u32 physical_page = map_to_physical_page(volume->geometry, chapter, index_page_number); /* * Make sure the invalidate counter is updated before we try and read the mapping. This * prevents this thread from reading a page in the cache which has already been marked for * invalidation by the reader thread, before the reader thread has noticed that the * invalidate_counter has been incremented. */ begin_pending_search(&volume->page_cache, physical_page, request->zone_number); result = get_volume_page_protected(volume, request, physical_page, &page); if (result != UDS_SUCCESS) { end_pending_search(&volume->page_cache, request->zone_number); return result; } result = uds_search_chapter_index_page(&page->index_page, volume->geometry, &request->record_name, record_page_number); end_pending_search(&volume->page_cache, request->zone_number); return result; } /* * Find the metadata associated with a name in a given record page. This will return UDS_QUEUED if * the page in question must be read from storage. */ int uds_search_cached_record_page(struct volume *volume, struct uds_request *request, u32 chapter, u16 record_page_number, bool *found) { struct cached_page *record_page; struct index_geometry *geometry = volume->geometry; int result; u32 physical_page, page_number; *found = false; if (record_page_number == NO_CHAPTER_INDEX_ENTRY) return UDS_SUCCESS; result = VDO_ASSERT(record_page_number < geometry->record_pages_per_chapter, "0 <= %d < %u", record_page_number, geometry->record_pages_per_chapter); if (result != VDO_SUCCESS) return result; page_number = geometry->index_pages_per_chapter + record_page_number; physical_page = map_to_physical_page(volume->geometry, chapter, page_number); /* * Make sure the invalidate counter is updated before we try and read the mapping. This * prevents this thread from reading a page in the cache which has already been marked for * invalidation by the reader thread, before the reader thread has noticed that the * invalidate_counter has been incremented. 
*/ begin_pending_search(&volume->page_cache, physical_page, request->zone_number); result = get_volume_page_protected(volume, request, physical_page, &record_page); if (result != UDS_SUCCESS) { end_pending_search(&volume->page_cache, request->zone_number); return result; } if (search_record_page(dm_bufio_get_block_data(record_page->buffer), &request->record_name, geometry, &request->old_metadata)) *found = true; end_pending_search(&volume->page_cache, request->zone_number); return UDS_SUCCESS; } void uds_prefetch_volume_chapter(const struct volume *volume, u32 chapter) { const struct index_geometry *geometry = volume->geometry; u32 physical_page = map_to_physical_page(geometry, chapter, 0); dm_bufio_prefetch(volume->client, physical_page, geometry->pages_per_chapter); } int uds_read_chapter_index_from_volume(const struct volume *volume, u64 virtual_chapter, struct dm_buffer *volume_buffers[], struct delta_index_page index_pages[]) { int result; u32 i; const struct index_geometry *geometry = volume->geometry; u32 physical_chapter = uds_map_to_physical_chapter(geometry, virtual_chapter); u32 physical_page = map_to_physical_page(geometry, physical_chapter, 0); dm_bufio_prefetch(volume->client, physical_page, geometry->index_pages_per_chapter); for (i = 0; i < geometry->index_pages_per_chapter; i++) { u8 *index_page; index_page = dm_bufio_read(volume->client, physical_page + i, &volume_buffers[i]); if (IS_ERR(index_page)) { result = -PTR_ERR(index_page); vdo_log_warning_strerror(result, "error reading physical page %u", physical_page); return result; } result = init_chapter_index_page(volume, index_page, physical_chapter, i, &index_pages[i]); if (result != UDS_SUCCESS) return result; } return UDS_SUCCESS; } int uds_search_volume_page_cache(struct volume *volume, struct uds_request *request, bool *found) { int result; u32 physical_chapter = uds_map_to_physical_chapter(volume->geometry, request->virtual_chapter); u32 index_page_number; u16 record_page_number; index_page_number = uds_find_index_page_number(volume->index_page_map, &request->record_name, physical_chapter); if (request->location == UDS_LOCATION_INDEX_PAGE_LOOKUP) { record_page_number = *((u16 *) &request->old_metadata); } else { result = search_cached_index_page(volume, request, physical_chapter, index_page_number, &record_page_number); if (result != UDS_SUCCESS) return result; } return uds_search_cached_record_page(volume, request, physical_chapter, record_page_number, found); } int uds_search_volume_page_cache_for_rebuild(struct volume *volume, const struct uds_record_name *name, u64 virtual_chapter, bool *found) { int result; struct index_geometry *geometry = volume->geometry; struct cached_page *page; u32 physical_chapter = uds_map_to_physical_chapter(geometry, virtual_chapter); u32 index_page_number; u16 record_page_number; u32 page_number; *found = false; index_page_number = uds_find_index_page_number(volume->index_page_map, name, physical_chapter); result = get_volume_page(volume, physical_chapter, index_page_number, &page); if (result != UDS_SUCCESS) return result; result = uds_search_chapter_index_page(&page->index_page, geometry, name, &record_page_number); if (result != UDS_SUCCESS) return result; if (record_page_number == NO_CHAPTER_INDEX_ENTRY) return UDS_SUCCESS; page_number = geometry->index_pages_per_chapter + record_page_number; result = get_volume_page(volume, physical_chapter, page_number, &page); if (result != UDS_SUCCESS) return result; *found = search_record_page(dm_bufio_get_block_data(page->buffer), name, 
geometry, NULL); return UDS_SUCCESS; } static void invalidate_page(struct page_cache *cache, u32 physical_page) { struct cached_page *page; int queue_index = -1; /* We hold the read_threads_mutex. */ get_page_and_index(cache, physical_page, &queue_index, &page); if (page != NULL) { WRITE_ONCE(cache->index[page->physical_page], cache->cache_slots); wait_for_pending_searches(cache, page->physical_page); clear_cache_page(cache, page); } else if (queue_index > -1) { vdo_log_debug("setting pending read to invalid"); cache->read_queue[queue_index].invalid = true; } } void uds_forget_chapter(struct volume *volume, u64 virtual_chapter) { u32 physical_chapter = uds_map_to_physical_chapter(volume->geometry, virtual_chapter); u32 first_page = map_to_physical_page(volume->geometry, physical_chapter, 0); u32 i; vdo_log_debug("forgetting chapter %llu", (unsigned long long) virtual_chapter); mutex_lock(&volume->read_threads_mutex); for (i = 0; i < volume->geometry->pages_per_chapter; i++) invalidate_page(&volume->page_cache, first_page + i); mutex_unlock(&volume->read_threads_mutex); } /* * Donate an index pages from a newly written chapter to the page cache since it is likely to be * used again soon. The caller must already hold the reader thread mutex. */ static int donate_index_page_locked(struct volume *volume, u32 physical_chapter, u32 index_page_number, struct dm_buffer *page_buffer) { int result; struct cached_page *page = NULL; u32 physical_page = map_to_physical_page(volume->geometry, physical_chapter, index_page_number); page = select_victim_in_cache(&volume->page_cache); page->buffer = page_buffer; result = init_chapter_index_page(volume, dm_bufio_get_block_data(page_buffer), physical_chapter, index_page_number, &page->index_page); if (result != UDS_SUCCESS) { vdo_log_warning("Error initialize chapter index page"); cancel_page_in_cache(&volume->page_cache, physical_page, page); return result; } result = put_page_in_cache(&volume->page_cache, physical_page, page); if (result != UDS_SUCCESS) { vdo_log_warning("Error putting page %u in cache", physical_page); cancel_page_in_cache(&volume->page_cache, physical_page, page); return result; } return UDS_SUCCESS; } static int write_index_pages(struct volume *volume, u32 physical_chapter_number, struct open_chapter_index *chapter_index) { struct index_geometry *geometry = volume->geometry; struct dm_buffer *page_buffer; u32 first_index_page = map_to_physical_page(geometry, physical_chapter_number, 0); u32 delta_list_number = 0; u32 index_page_number; for (index_page_number = 0; index_page_number < geometry->index_pages_per_chapter; index_page_number++) { u8 *page_data; u32 physical_page = first_index_page + index_page_number; u32 lists_packed; bool last_page; int result; page_data = dm_bufio_new(volume->client, physical_page, &page_buffer); if (IS_ERR(page_data)) { return vdo_log_warning_strerror(-PTR_ERR(page_data), "failed to prepare index page"); } last_page = ((index_page_number + 1) == geometry->index_pages_per_chapter); result = uds_pack_open_chapter_index_page(chapter_index, page_data, delta_list_number, last_page, &lists_packed); if (result != UDS_SUCCESS) { dm_bufio_release(page_buffer); return vdo_log_warning_strerror(result, "failed to pack index page"); } dm_bufio_mark_buffer_dirty(page_buffer); if (lists_packed == 0) { vdo_log_debug("no delta lists packed on chapter %u page %u", physical_chapter_number, index_page_number); } else { delta_list_number += lists_packed; } uds_update_index_page_map(volume->index_page_map, 
chapter_index->virtual_chapter_number, physical_chapter_number, index_page_number, delta_list_number - 1); mutex_lock(&volume->read_threads_mutex); result = donate_index_page_locked(volume, physical_chapter_number, index_page_number, page_buffer); mutex_unlock(&volume->read_threads_mutex); if (result != UDS_SUCCESS) { dm_bufio_release(page_buffer); return result; } } return UDS_SUCCESS; } static u32 encode_tree(u8 record_page[], const struct uds_volume_record *sorted_pointers[], u32 next_record, u32 node, u32 node_count) { if (node < node_count) { u32 child = (2 * node) + 1; next_record = encode_tree(record_page, sorted_pointers, next_record, child, node_count); /* * In-order traversal: copy the contents of the next record into the page at the * node offset. */ memcpy(&record_page[node * BYTES_PER_RECORD], sorted_pointers[next_record++], BYTES_PER_RECORD); next_record = encode_tree(record_page, sorted_pointers, next_record, child + 1, node_count); } return next_record; } static int encode_record_page(const struct volume *volume, const struct uds_volume_record records[], u8 record_page[]) { int result; u32 i; u32 records_per_page = volume->geometry->records_per_page; const struct uds_volume_record **record_pointers = volume->record_pointers; for (i = 0; i < records_per_page; i++) record_pointers[i] = &records[i]; /* * Sort the record pointers by using just the names in the records, which is less work than * sorting the entire record values. */ BUILD_BUG_ON(offsetof(struct uds_volume_record, name) != 0); result = uds_radix_sort(volume->radix_sorter, (const u8 **) record_pointers, records_per_page, UDS_RECORD_NAME_SIZE); if (result != UDS_SUCCESS) return result; encode_tree(record_page, record_pointers, 0, 0, records_per_page); return UDS_SUCCESS; } static int write_record_pages(struct volume *volume, u32 physical_chapter_number, const struct uds_volume_record *records) { u32 record_page_number; struct index_geometry *geometry = volume->geometry; struct dm_buffer *page_buffer; const struct uds_volume_record *next_record = records; u32 first_record_page = map_to_physical_page(geometry, physical_chapter_number, geometry->index_pages_per_chapter); for (record_page_number = 0; record_page_number < geometry->record_pages_per_chapter; record_page_number++) { u8 *page_data; u32 physical_page = first_record_page + record_page_number; int result; page_data = dm_bufio_new(volume->client, physical_page, &page_buffer); if (IS_ERR(page_data)) { return vdo_log_warning_strerror(-PTR_ERR(page_data), "failed to prepare record page"); } result = encode_record_page(volume, next_record, page_data); if (result != UDS_SUCCESS) { dm_bufio_release(page_buffer); return vdo_log_warning_strerror(result, "failed to encode record page %u", record_page_number); } next_record += geometry->records_per_page; dm_bufio_mark_buffer_dirty(page_buffer); dm_bufio_release(page_buffer); } return UDS_SUCCESS; } int uds_write_chapter(struct volume *volume, struct open_chapter_index *chapter_index, const struct uds_volume_record *records) { int result; u32 physical_chapter_number = uds_map_to_physical_chapter(volume->geometry, chapter_index->virtual_chapter_number); result = write_index_pages(volume, physical_chapter_number, chapter_index); if (result != UDS_SUCCESS) return result; result = write_record_pages(volume, physical_chapter_number, records); if (result != UDS_SUCCESS) return result; result = -dm_bufio_write_dirty_buffers(volume->client); if (result != UDS_SUCCESS) vdo_log_error_strerror(result, "cannot sync chapter to 
volume"); return result; } static void probe_chapter(struct volume *volume, u32 chapter_number, u64 *virtual_chapter_number) { const struct index_geometry *geometry = volume->geometry; u32 expected_list_number = 0; u32 i; u64 vcn = BAD_CHAPTER; *virtual_chapter_number = BAD_CHAPTER; dm_bufio_prefetch(volume->client, map_to_physical_page(geometry, chapter_number, 0), geometry->index_pages_per_chapter); for (i = 0; i < geometry->index_pages_per_chapter; i++) { struct delta_index_page *page; int result; result = uds_get_volume_index_page(volume, chapter_number, i, &page); if (result != UDS_SUCCESS) return; if (page->virtual_chapter_number == BAD_CHAPTER) { vdo_log_error("corrupt index page in chapter %u", chapter_number); return; } if (vcn == BAD_CHAPTER) { vcn = page->virtual_chapter_number; } else if (page->virtual_chapter_number != vcn) { vdo_log_error("inconsistent chapter %u index page %u: expected vcn %llu, got vcn %llu", chapter_number, i, (unsigned long long) vcn, (unsigned long long) page->virtual_chapter_number); return; } if (expected_list_number != page->lowest_list_number) { vdo_log_error("inconsistent chapter %u index page %u: expected list number %u, got list number %u", chapter_number, i, expected_list_number, page->lowest_list_number); return; } expected_list_number = page->highest_list_number + 1; result = uds_validate_chapter_index_page(page, geometry); if (result != UDS_SUCCESS) return; } if (chapter_number != uds_map_to_physical_chapter(geometry, vcn)) { vdo_log_error("chapter %u vcn %llu is out of phase (%u)", chapter_number, (unsigned long long) vcn, geometry->chapters_per_volume); return; } *virtual_chapter_number = vcn; } /* Find the last valid physical chapter in the volume. */ static void find_real_end_of_volume(struct volume *volume, u32 limit, u32 *limit_ptr) { u32 span = 1; u32 tries = 0; while (limit > 0) { u32 chapter = (span > limit) ? 0 : limit - span; u64 vcn = 0; probe_chapter(volume, chapter, &vcn); if (vcn == BAD_CHAPTER) { limit = chapter; if (++tries > 1) span *= 2; } else { if (span == 1) break; span /= 2; tries = 0; } } *limit_ptr = limit; } static int find_chapter_limits(struct volume *volume, u32 chapter_limit, u64 *lowest_vcn, u64 *highest_vcn) { struct index_geometry *geometry = volume->geometry; u64 zero_vcn; u64 lowest = BAD_CHAPTER; u64 highest = BAD_CHAPTER; u64 moved_chapter = BAD_CHAPTER; u32 left_chapter = 0; u32 right_chapter = 0; u32 bad_chapters = 0; /* * This method assumes there is at most one run of contiguous bad chapters caused by * unflushed writes. Either the bad spot is at the beginning and end, or somewhere in the * middle. Wherever it is, the highest and lowest VCNs are adjacent to it. Otherwise the * volume is cleanly saved and somewhere in the middle of it the highest VCN immediately * precedes the lowest one. */ /* It doesn't matter if this results in a bad spot (BAD_CHAPTER). */ probe_chapter(volume, 0, &zero_vcn); /* * Binary search for end of the discontinuity in the monotonically increasing virtual * chapter numbers; bad spots are treated as a span of BAD_CHAPTER values. In effect we're * searching for the index of the smallest value less than zero_vcn. In the case we go off * the end it means that chapter 0 has the lowest vcn. * * If a virtual chapter is out-of-order, it will be the one moved by conversion. Always * skip over the moved chapter when searching, adding it to the range at the end if * necessary. 
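 *
 * Illustration (hypothetical values): with 8 physical chapters holding virtual chapters
 * [8, 9, 10, 3, 4, 5, 6, 7], zero_vcn is 8. The binary search converges on physical
 * chapter 3, the smallest index whose vcn (3) is less than zero_vcn, so the lowest vcn
 * is 3; the backward scan then stops at physical chapter 2, so the highest vcn is 10.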
*/ if (geometry->remapped_physical > 0) { u64 remapped_vcn; probe_chapter(volume, geometry->remapped_physical, &remapped_vcn); if (remapped_vcn == geometry->remapped_virtual) moved_chapter = geometry->remapped_physical; } left_chapter = 0; right_chapter = chapter_limit; while (left_chapter < right_chapter) { u64 probe_vcn; u32 chapter = (left_chapter + right_chapter) / 2; if (chapter == moved_chapter) chapter--; probe_chapter(volume, chapter, &probe_vcn); if (zero_vcn <= probe_vcn) { left_chapter = chapter + 1; if (left_chapter == moved_chapter) left_chapter++; } else { right_chapter = chapter; } } /* If left_chapter goes off the end, chapter 0 has the lowest virtual chapter number.*/ if (left_chapter >= chapter_limit) left_chapter = 0; /* At this point, left_chapter is the chapter with the lowest virtual chapter number. */ probe_chapter(volume, left_chapter, &lowest); /* The moved chapter might be the lowest in the range. */ if ((moved_chapter != BAD_CHAPTER) && (lowest == geometry->remapped_virtual + 1)) lowest = geometry->remapped_virtual; /* * Circularly scan backwards, moving over any bad chapters until encountering a good one, * which is the chapter with the highest vcn. */ while (highest == BAD_CHAPTER) { right_chapter = (right_chapter + chapter_limit - 1) % chapter_limit; if (right_chapter == moved_chapter) continue; probe_chapter(volume, right_chapter, &highest); if (bad_chapters++ >= MAX_BAD_CHAPTERS) { vdo_log_error("too many bad chapters in volume: %u", bad_chapters); return UDS_CORRUPT_DATA; } } *lowest_vcn = lowest; *highest_vcn = highest; return UDS_SUCCESS; } /* * Find the highest and lowest contiguous chapters present in the volume and determine their * virtual chapter numbers. This is used by rebuild. */ int uds_find_volume_chapter_boundaries(struct volume *volume, u64 *lowest_vcn, u64 *highest_vcn, bool *is_empty) { u32 chapter_limit = volume->geometry->chapters_per_volume; find_real_end_of_volume(volume, chapter_limit, &chapter_limit); if (chapter_limit == 0) { *lowest_vcn = 0; *highest_vcn = 0; *is_empty = true; return UDS_SUCCESS; } *is_empty = false; return find_chapter_limits(volume, chapter_limit, lowest_vcn, highest_vcn); } int __must_check uds_replace_volume_storage(struct volume *volume, struct index_layout *layout, struct block_device *bdev) { int result; u32 i; result = uds_replace_index_layout_storage(layout, bdev); if (result != UDS_SUCCESS) return result; /* Release all outstanding dm_bufio objects */ for (i = 0; i < volume->page_cache.indexable_pages; i++) volume->page_cache.index[i] = volume->page_cache.cache_slots; for (i = 0; i < volume->page_cache.cache_slots; i++) clear_cache_page(&volume->page_cache, &volume->page_cache.cache[i]); if (volume->sparse_cache != NULL) uds_invalidate_sparse_cache(volume->sparse_cache); if (volume->client != NULL) dm_bufio_client_destroy(vdo_forget(volume->client)); return uds_open_volume_bufio(layout, volume->geometry->bytes_per_page, volume->reserved_buffers, &volume->client); } static int __must_check initialize_page_cache(struct page_cache *cache, const struct index_geometry *geometry, u32 chapters_in_cache, unsigned int zone_count) { int result; u32 i; cache->indexable_pages = geometry->pages_per_volume + 1; cache->cache_slots = chapters_in_cache * geometry->record_pages_per_chapter; cache->zone_count = zone_count; atomic64_set(&cache->clock, 1); result = VDO_ASSERT((cache->cache_slots <= VOLUME_CACHE_MAX_ENTRIES), "requested cache size, %u, within limit %u", cache->cache_slots, VOLUME_CACHE_MAX_ENTRIES); if (result 
!= VDO_SUCCESS) return result; result = vdo_allocate(VOLUME_CACHE_MAX_QUEUED_READS, struct queued_read, "volume read queue", &cache->read_queue); if (result != VDO_SUCCESS) return result; result = vdo_allocate(cache->zone_count, struct search_pending_counter, "Volume Cache Zones", &cache->search_pending_counters); if (result != VDO_SUCCESS) return result; result = vdo_allocate(cache->indexable_pages, u16, "page cache index", &cache->index); if (result != VDO_SUCCESS) return result; result = vdo_allocate(cache->cache_slots, struct cached_page, "page cache cache", &cache->cache); if (result != VDO_SUCCESS) return result; /* Initialize index values to invalid values. */ for (i = 0; i < cache->indexable_pages; i++) cache->index[i] = cache->cache_slots; for (i = 0; i < cache->cache_slots; i++) clear_cache_page(cache, &cache->cache[i]); return UDS_SUCCESS; } int uds_make_volume(const struct uds_configuration *config, struct index_layout *layout, struct volume **new_volume) { unsigned int i; struct volume *volume = NULL; struct index_geometry *geometry; unsigned int reserved_buffers; int result; result = vdo_allocate(1, struct volume, "volume", &volume); if (result != VDO_SUCCESS) return result; volume->nonce = uds_get_volume_nonce(layout); result = uds_copy_index_geometry(config->geometry, &volume->geometry); if (result != UDS_SUCCESS) { uds_free_volume(volume); return vdo_log_warning_strerror(result, "failed to allocate geometry: error"); } geometry = volume->geometry; /* * Reserve a buffer for each entry in the page cache, one for the chapter writer, and one * for each entry in the sparse cache. */ reserved_buffers = config->cache_chapters * geometry->record_pages_per_chapter; reserved_buffers += 1; if (uds_is_sparse_index_geometry(geometry)) reserved_buffers += (config->cache_chapters * geometry->index_pages_per_chapter); volume->reserved_buffers = reserved_buffers; result = uds_open_volume_bufio(layout, geometry->bytes_per_page, volume->reserved_buffers, &volume->client); if (result != UDS_SUCCESS) { uds_free_volume(volume); return result; } result = uds_make_radix_sorter(geometry->records_per_page, &volume->radix_sorter); if (result != UDS_SUCCESS) { uds_free_volume(volume); return result; } result = vdo_allocate(geometry->records_per_page, const struct uds_volume_record *, "record pointers", &volume->record_pointers); if (result != VDO_SUCCESS) { uds_free_volume(volume); return result; } if (uds_is_sparse_index_geometry(geometry)) { size_t page_size = sizeof(struct delta_index_page) + geometry->bytes_per_page; result = uds_make_sparse_cache(geometry, config->cache_chapters, config->zone_count, &volume->sparse_cache); if (result != UDS_SUCCESS) { uds_free_volume(volume); return result; } volume->cache_size = page_size * geometry->index_pages_per_chapter * config->cache_chapters; } result = initialize_page_cache(&volume->page_cache, geometry, config->cache_chapters, config->zone_count); if (result != UDS_SUCCESS) { uds_free_volume(volume); return result; } volume->cache_size += volume->page_cache.cache_slots * sizeof(struct delta_index_page); result = uds_make_index_page_map(geometry, &volume->index_page_map); if (result != UDS_SUCCESS) { uds_free_volume(volume); return result; } mutex_init(&volume->read_threads_mutex); uds_init_cond(&volume->read_threads_read_done_cond); uds_init_cond(&volume->read_threads_cond); result = vdo_allocate(config->read_threads, struct thread *, "reader threads", &volume->reader_threads); if (result != VDO_SUCCESS) { uds_free_volume(volume); return result; } for 
(i = 0; i < config->read_threads; i++) { result = vdo_create_thread(read_thread_function, (void *) volume, "reader", &volume->reader_threads[i]); if (result != VDO_SUCCESS) { uds_free_volume(volume); return result; } volume->read_thread_count = i + 1; } *new_volume = volume; return UDS_SUCCESS; } static void uninitialize_page_cache(struct page_cache *cache) { u16 i; if (cache->cache != NULL) { for (i = 0; i < cache->cache_slots; i++) release_page_buffer(&cache->cache[i]); } vdo_free(cache->index); vdo_free(cache->cache); vdo_free(cache->search_pending_counters); vdo_free(cache->read_queue); } void uds_free_volume(struct volume *volume) { if (volume == NULL) return; if (volume->reader_threads != NULL) { unsigned int i; /* This works even if some threads weren't started. */ mutex_lock(&volume->read_threads_mutex); volume->read_threads_exiting = true; uds_broadcast_cond(&volume->read_threads_cond); mutex_unlock(&volume->read_threads_mutex); for (i = 0; i < volume->read_thread_count; i++) vdo_join_threads(volume->reader_threads[i]); vdo_free(volume->reader_threads); volume->reader_threads = NULL; } /* Must destroy the client AFTER freeing the cached pages. */ uninitialize_page_cache(&volume->page_cache); uds_free_sparse_cache(volume->sparse_cache); if (volume->client != NULL) dm_bufio_client_destroy(vdo_forget(volume->client)); uds_destroy_cond(&volume->read_threads_cond); uds_destroy_cond(&volume->read_threads_read_done_cond); mutex_destroy(&volume->read_threads_mutex); uds_free_index_page_map(volume->index_page_map); uds_free_radix_sorter(volume->radix_sorter); vdo_free(volume->geometry); vdo_free(volume->record_pointers); vdo_free(volume); } vdo-8.3.1.1/utils/uds/volume.h000066400000000000000000000125201476467262700161120ustar00rootroot00000000000000/* SPDX-License-Identifier: GPL-2.0-only */ /* * Copyright 2023 Red Hat */ #ifndef UDS_VOLUME_H #define UDS_VOLUME_H #include #include #include #include #include "permassert.h" #include "thread-utils.h" #include "chapter-index.h" #include "config.h" #include "geometry.h" #include "indexer.h" #include "index-layout.h" #include "index-page-map.h" #include "radix-sort.h" #include "sparse-cache.h" /* * The volume manages deduplication records on permanent storage. The term "volume" can also refer * to the region of permanent storage where the records (and the chapters containing them) are * stored. The volume handles all I/O to this region by reading, caching, and writing chapter pages * as necessary. 
*/ enum index_lookup_mode { /* Always do lookups in all chapters normally */ LOOKUP_NORMAL, /* Only do a subset of lookups needed when rebuilding an index */ LOOKUP_FOR_REBUILD, }; struct queued_read { bool invalid; bool reserved; u32 physical_page; struct uds_request *first_request; struct uds_request *last_request; }; struct __aligned(L1_CACHE_BYTES) search_pending_counter { u64 atomic_value; }; struct cached_page { /* Whether this page is currently being read asynchronously */ bool read_pending; /* The physical page stored in this cache entry */ u32 physical_page; /* The value of the volume clock when this page was last used */ s64 last_used; /* The cached page buffer */ struct dm_buffer *buffer; /* The chapter index page, meaningless for record pages */ struct delta_index_page index_page; }; struct page_cache { /* The number of zones */ unsigned int zone_count; /* The number of volume pages that can be cached */ u32 indexable_pages; /* The maximum number of simultaneously cached pages */ u16 cache_slots; /* An index for each physical page noting where it is in the cache */ u16 *index; /* The array of cached pages */ struct cached_page *cache; /* A counter for each zone tracking if a search is occurring there */ struct search_pending_counter *search_pending_counters; /* The read queue entries as a circular array */ struct queued_read *read_queue; /* All entries above this point are constant after initialization. */ /* * These values are all indexes into the array of read queue entries. New entries in the * read queue are enqueued at read_queue_last. To dequeue entries, a reader thread gets the * lock and then claims the entry pointed to by read_queue_next_read and increments that * value. After the read is completed, the reader thread calls release_read_queue_entry(), * which increments read_queue_first until it points to a pending read, or is equal to * read_queue_next_read. This means that if multiple reads are outstanding, * read_queue_first might not advance until the last of the reads finishes. 
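 *
 * For example, if entries 5, 6, and 7 are all reserved by reader threads and the read
 * for entry 7 completes first, read_queue_first stays at 5; it cannot move past entry 7
 * until entries 5 and 6 have been released as well.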
*/ u16 read_queue_first; u16 read_queue_next_read; u16 read_queue_last; atomic64_t clock; }; struct volume { struct index_geometry *geometry; struct dm_bufio_client *client; u64 nonce; size_t cache_size; /* A single page worth of records, for sorting */ const struct uds_volume_record **record_pointers; /* Sorter for sorting records within each page */ struct radix_sorter *radix_sorter; struct sparse_cache *sparse_cache; struct page_cache page_cache; struct index_page_map *index_page_map; struct mutex read_threads_mutex; struct cond_var read_threads_cond; struct cond_var read_threads_read_done_cond; struct thread **reader_threads; unsigned int read_thread_count; bool read_threads_exiting; enum index_lookup_mode lookup_mode; unsigned int reserved_buffers; }; int __must_check uds_make_volume(const struct uds_configuration *config, struct index_layout *layout, struct volume **new_volume); void uds_free_volume(struct volume *volume); int __must_check uds_replace_volume_storage(struct volume *volume, struct index_layout *layout, struct block_device *bdev); int __must_check uds_find_volume_chapter_boundaries(struct volume *volume, u64 *lowest_vcn, u64 *highest_vcn, bool *is_empty); int __must_check uds_search_volume_page_cache(struct volume *volume, struct uds_request *request, bool *found); int __must_check uds_search_volume_page_cache_for_rebuild(struct volume *volume, const struct uds_record_name *name, u64 virtual_chapter, bool *found); int __must_check uds_search_cached_record_page(struct volume *volume, struct uds_request *request, u32 chapter, u16 record_page_number, bool *found); void uds_forget_chapter(struct volume *volume, u64 chapter); int __must_check uds_write_chapter(struct volume *volume, struct open_chapter_index *chapter_index, const struct uds_volume_record records[]); void uds_prefetch_volume_chapter(const struct volume *volume, u32 chapter); int __must_check uds_read_chapter_index_from_volume(const struct volume *volume, u64 virtual_chapter, struct dm_buffer *volume_buffers[], struct delta_index_page index_pages[]); int __must_check uds_get_volume_record_page(struct volume *volume, u32 chapter, u32 page_number, u8 **data_ptr); int __must_check uds_get_volume_index_page(struct volume *volume, u32 chapter, u32 page_number, struct delta_index_page **page_ptr); #endif /* UDS_VOLUME_H */ vdo-8.3.1.1/utils/vdo/000077500000000000000000000000001476467262700144275ustar00rootroot00000000000000vdo-8.3.1.1/utils/vdo/Makefile000066400000000000000000000106171476467262700160740ustar00rootroot00000000000000# # Copyright Red Hat # # This program is free software; you can redistribute it and/or # modify it under the terms of the GNU General Public License # as published by the Free Software Foundation; either version 2 # of the License, or (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA # 02110-1301, USA. 
# VDO_VERSION = 8.3.1.1 UDS_DIR = ../uds ifdef LLVM export CC := clang export LD := ld.ldd endif ifeq ($(origin CC), default) CC := gcc endif ifeq ($(findstring clang, $(CC)),clang) # Ignore additional warnings for clang WARNS = -Wno-compare-distinct-pointer-types \ -Wno-gnu-statement-expression \ -Wno-gnu-zero-variadic-macro-arguments \ -Wno-implicit-const-int-float-conversion \ -Wno-language-extension-token else WARNS = -Wcast-align \ -Wcast-qual \ -Wformat=2 \ -Wlogical-op endif WARNS += \ -Wall \ -Werror \ -Wextra \ -Winit-self \ -Wmissing-include-dirs \ -Wpointer-arith \ -Wredundant-decls \ -Wunused \ -Wwrite-strings \ C_WARNS = \ -Wbad-function-cast \ -Wfloat-equal \ -Wmissing-declarations \ -Wmissing-format-attribute \ -Wmissing-prototypes \ -Wnested-externs \ -Wold-style-definition \ -Wswitch-default \ ifeq ($(AR), ar) ifeq ($(origin AR), default) AR := gcc-ar endif endif OPT_FLAGS = -O3 -fno-omit-frame-pointer DEBUG_FLAGS = RPM_OPT_FLAGS ?= -fpic GLOBAL_FLAGS = $(RPM_OPT_FLAGS) -D_GNU_SOURCE -g $(OPT_FLAGS) $(WARNS) \ $(shell getconf LFS_CFLAGS) $(DEBUG_FLAGS) GLOBAL_CFLAGS = $(GLOBAL_FLAGS) -std=gnu11 -pedantic $(C_WARNS) \ $(EXTRA_CFLAGS) EXTRA_FLAGS = EXTRA_CFLAGS = $(EXTRA_FLAGS) GLOBAL_LDFLAGS = $(RPM_LD_FLAGS) $(EXTRA_LDFLAGS) EXTRA_LDFLAGS = DEPDIR = .deps MV = mv -f INCLUDES = -I. -I$(UDS_DIR) CFLAGS = -fPIC $(GLOBAL_CFLAGS) $(INCLUDES) -Wno-write-strings \ -DCURRENT_VERSION="\"$(VDO_VERSION)\"" LDFLAGS = $(GLOBAL_LDFLAGS) LDPRFLAGS = -ldl -pthread -lz -lrt -lm -luuid DEPLIBS = $(UDS_DIR)/libuds.a LIBFLAGS = -pthread -lrt PROGS = vdoaudit \ vdodebugmetadata \ vdodumpblockmap \ vdodumpmetadata \ vdoforcerebuild \ vdoformat \ vdolistmetadata \ vdoreadonly \ vdostats COMPLETIONS=vdostats NOBUILDPROGS = adaptlvm \ vdorecover PROG_SOURCES := $(PROGS:%=%.c) C_FILES := $(filter-out $(PROG_SOURCES),$(wildcard *.c)) LIB_OBJECTS := $(C_FILES:%.c=%.o) .PHONY: all all: libvdo.a $(PROGS) .PHONY: clean clean: $(MAKE) -C man clean rm -f *.o *.a rm -rf $(DEPDIR) $(PROGS) libvdo.a: $(LIB_OBJECTS) $(RM) $@ $(AR) cr $@ $(LIB_OBJECTS) INSTALL = install INSTALLOWNER ?= -o root -g root bindir ?= /usr/bin INSTALLDIR=$(DESTDIR)$(bindir) bash_completions_dir ?= /usr/share/bash-completion/completions COMPLETIONINSTALLDIR=$(DESTDIR)$(bash_completions_dir)/ .PHONY: install install: $(INSTALL) $(INSTALLOWNER) -d $(INSTALLDIR) for i in $(PROGS) $(NOBUILDPROGS); do \ $(INSTALL) $(INSTALLOWNER) -m 755 $$i $(INSTALLDIR); \ done $(MAKE) -C man install $(INSTALL) $(INSTALLOWNER) -d $(COMPLETIONINSTALLDIR) for c in $(COMPLETIONS); do \ $(INSTALL) $(INSTALLOWNER) -m 644 $$c.bash \ $(COMPLETIONINSTALLDIR)/$$c; \ done ######################################################################## # Dependency processing %.o: %.c $(COMPILE.c) -MMD -MF $(DEPDIR)/$*.d.new -MP -MT $@ -o $@ $< if cmp -s $(DEPDIR)/$*.d $(DEPDIR)/$*.d.new ; \ then \ $(RM) $(DEPDIR)/$*.d.new ; \ else \ $(MV) $(DEPDIR)/$*.d.new $(DEPDIR)/$*.d ; \ fi $(DEPDIR)/%.d: %.c @mkdir -p $(DEPDIR) $(CC) $(CFLAGS) -MM -MF $@ -MP -MT $*.o $< .SECONDEXPANSION: $(PROGS): $$@.o libvdo.a $(DEPLIBS) echo "Building $@ from $^" $(CC) $(LDFLAGS) $^ $(LDPRFLAGS) -o $@ vdoformat: LDPRFLAGS += "-lblkid" ifneq ($(MAKECMDGOALS),clean) DEPSOURCES = $(wildcard *.c) -include $(DEPSOURCES:%.c=$(DEPDIR)/%.d) endif vdo-8.3.1.1/utils/vdo/adaptlvm000077500000000000000000000071631476467262700161740ustar00rootroot00000000000000#!/bin/bash # # Copyright (C) 2021 Red Hat, Inc. All rights reserved. 
# # This copyrighted material is made available to anyone wishing to use, # modify, copy, or redistribute it subject to the terms and conditions # of the GNU General Public License v.2. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software Foundation, # Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA # # Author: Andy Walsh # # Script for marking an LVMVDO backing volume as Read/Write # set -euE -o pipefail TOOL=$(basename $0) LVM_EXTRA_ARGS=${EXTRA_LVM_ARGS:-} BACKINGTABLE= DEV_VG_LV= LV= LVMVDO= OPERATION= VG= VPOOL_NAME= # Expand the arguments provided into the necessary parameters to operate on # setRO or setRW. # Make sure that we actually received the arguments we need and exit if we didn't. applyConversion() { if [ -z "$1" ] || [ -z "$2" ]; then printUsage fi OPERATION=$1 LVMVDO=$2 # Break apart the LVMVDO argument into separate components for use later as # well as the full R/O device path. LV=$(echo ${LVMVDO} | awk -F/ '{print $2}') VG=$(echo ${LVMVDO} | awk -F/ '{print $1}') DEV_VG_LV=/dev/${VG}/${LV} echo "Found LV: ${LV}" echo "Found VG: ${VG}" echo "Found VPOOL_NAME: ${VPOOL_NAME}" case "$OPERATION" in "setro"|"setRO") setRO ;; "setrw"|"setRW") setRW ;; *) echo "Invalid operation requested" printUsage ;; esac } printUsage() { echo "${TOOL}: Mark the backing storage for a LVMVDO volume as read only or read write" echo echo "${TOOL} [ setRO | setRW ] /" echo echo " Options:" echo " setRO Revert a R/W LVMVDO volume to its original R/O configuration" echo " setRW Modify an LVMVDO volume to present the backing store as R/W" echo exit } # Disassemble the temporary Read/Write volume and re-activate the original volume. setRO() { dmsetup remove ${VG}-${LV} lvchange -ay ${LVM_EXTRA_ARGS} ${VG}/${LV} if [ -b ${DEV_VG_LV} ]; then echo "LVMVDO volume re-activated at ${DEV_VG_LV}" else echo "There was a problem re-activating ${DEV_VG_LV}" fi exit 0 } # Disassemble the original Read/Only volume and start a temporary Read/Write # volume in /dev/mapper setRW() { if [ ! -b "${DEV_VG_LV}" ]; then echo "${DEV_VG_LV} is not a block device" printUsage fi VPOOL_NAME=$(lvdisplay ${LVMVDO} | awk '/LV VDO Pool name/ {print $NF}') DM_VDATA="${VG}-${VPOOL_NAME}_vdata" # Look in the list of dm devices and find the appropriate backing device # If we don't find one, then there's something wrong, it's best to just exit. if [ "$(dmsetup ls | grep -q "${DM_VDATA}")" != "" ]; then echo "vdata device not found, is this an LVMVDO volume?" exit fi # Capture the DM table for the backing device so we can reuse it on the # temporary device. BACKINGTABLE="$(dmsetup table ${DM_VDATA})" # Deactivate the existing volume so that it is only being used from one # place. lvchange -an ${LVM_EXTRA_ARGS} ${DEV_VG_LV} # Create the temporary device with the name containing the parameters we need # to undo this operation later. 
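    # For illustration only (hypothetical names and geometry), this typically expands to
    # something like:
    #   dmsetup create vg0-vdo0 --table "0 209715200 linear 253:2 0"
    # i.e. the captured vdata table is replayed under a new, writable device-mapper name.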
dmsetup create ${VG}-${LV} --table "${BACKINGTABLE}" echo "Writable backing device is now available at /dev/mapper/${VG}-${LV}" echo "To undo this operation, run ${TOOL} setro ${VG}/${LV}" exit 0 } ############################################################################### # main() trap "cleanup 2" 2 test "$#" -ne 2 && printUsage echo "Received extra LVM Args '${LVM_EXTRA_ARGS}'" applyConversion $1 $2 exit vdo-8.3.1.1/utils/vdo/blockMapUtils.c000066400000000000000000000203711476467262700173470ustar00rootroot00000000000000/* * Copyright Red Hat * * This program is free software; you can redistribute it and/or * modify it under the terms of the GNU General Public License * as published by the Free Software Foundation; either version 2 * of the License, or (at your option) any later version. * * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with this program; if not, write to the Free Software * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA * 02110-1301, USA. */ #include "blockMapUtils.h" #include #include "errors.h" #include "memory-alloc.h" #include "string-utils.h" #include "syscalls.h" #include "encodings.h" #include "status-codes.h" #include "types.h" #include "physicalLayer.h" #include "userVDO.h" /** * Read a block map page call the examiner on every defined mapping in it. * Also recursively call itself to examine an entire tree. * * @param vdo The VDO * @param pagePBN The PBN of the block map page to read * @param height The height of this page in the tree * @param examiner The MappingExaminer to call for each mapped entry * * @return VDO_SUCCESS or an error **/ static int readAndExaminePage(UserVDO *vdo, physical_block_number_t pagePBN, height_t height, MappingExaminer *examiner) { struct block_map_page *page; int result = vdo->layer->allocateIOBuffer(vdo->layer, VDO_BLOCK_SIZE, "block map page", (char **) &page); if (result != VDO_SUCCESS) { return result; } result = readBlockMapPage(vdo->layer, pagePBN, vdo->states.vdo.nonce, page); if (result != VDO_SUCCESS) { vdo_free(page); return result; } if (!page->header.initialized) { vdo_free(page); return VDO_SUCCESS; } struct block_map_slot blockMapSlot = { .pbn = pagePBN, .slot = 0, }; for (; blockMapSlot.slot < VDO_BLOCK_MAP_ENTRIES_PER_PAGE; blockMapSlot.slot++) { struct data_location mapped = vdo_unpack_block_map_entry(&page->entries[blockMapSlot.slot]); result = examiner(blockMapSlot, height, mapped.pbn, mapped.state); if (result != VDO_SUCCESS) { vdo_free(page); return result; } if (!vdo_is_mapped_location(&mapped)) { continue; } if ((height > 0) && isValidDataBlock(vdo, mapped.pbn)) { result = readAndExaminePage(vdo, mapped.pbn, height - 1, examiner); if (result != VDO_SUCCESS) { vdo_free(page); return result; } } } vdo_free(page); return VDO_SUCCESS; } /**********************************************************************/ int examineBlockMapEntries(UserVDO *vdo, MappingExaminer *examiner) { struct block_map_state_2_0 *map = &vdo->states.block_map; int result = VDO_ASSERT((map->root_origin != 0), "block map root origin must be non-zero"); if (result != VDO_SUCCESS) { return result; } result = VDO_ASSERT((map->root_count != 0), "block map root count must be non-zero"); if (result != VDO_SUCCESS) { return result; } height_t height = 
VDO_BLOCK_MAP_TREE_HEIGHT - 1; for (uint8_t rootIndex = 0; rootIndex < map->root_count; rootIndex++) { result = readAndExaminePage(vdo, rootIndex + map->root_origin, height, examiner); if (result != VDO_SUCCESS) { return result; } } return VDO_SUCCESS; } /** * Find and decode a particular slot from a block map page. * * @param vdo The VDO * @param pbn The PBN of the block map page to read * @param slot The slot to read from the block map page * @param mappedPBNPtr A pointer to the mapped PBN * @param mappedPtr A pointer to the mapped state * * @return VDO_SUCCESS or an error **/ static int readSlotFromPage(UserVDO *vdo, physical_block_number_t pbn, slot_number_t slot, physical_block_number_t *mappedPBNPtr, enum block_mapping_state *mappedStatePtr) { struct block_map_page *page; int result = vdo->layer->allocateIOBuffer(vdo->layer, VDO_BLOCK_SIZE, "page buffer", (char **) &page); if (result != VDO_SUCCESS) { return result; } result = readBlockMapPage(vdo->layer, pbn, vdo->states.vdo.nonce, page); if (result != VDO_SUCCESS) { vdo_free(page); return result; } struct data_location mapped; if (page->header.initialized) { mapped = vdo_unpack_block_map_entry(&page->entries[slot]); } else { mapped = (struct data_location) { .state = VDO_MAPPING_STATE_UNMAPPED, .pbn = VDO_ZERO_BLOCK, }; } *mappedStatePtr = mapped.state; *mappedPBNPtr = mapped.pbn; vdo_free(page); return VDO_SUCCESS; } /**********************************************************************/ int findLBNPage(UserVDO *vdo, logical_block_number_t lbn, physical_block_number_t *pbnPtr) { if (lbn >= vdo->states.vdo.config.logical_blocks) { warnx("VDO has only %llu logical blocks, cannot dump mapping for LBA %llu", (unsigned long long) vdo->states.vdo.config.logical_blocks, (unsigned long long) lbn); return VDO_OUT_OF_RANGE; } struct block_map_state_2_0 *map = &vdo->states.block_map; page_number_t pageNumber = lbn / VDO_BLOCK_MAP_ENTRIES_PER_PAGE; // It's in the tree section of the block map. 
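  // Worked example (assuming the default 60 tree roots and 812 entries per page): for
  // lbn = 1,000,000, slots[0] = 428 and pageNumber = 1231; rootIndex = 1231 % 60 = 31
  // and pageNumber becomes 20, so slots[1] = 20 and slots[2..4] = 0. The descent below
  // then starts at the tree root at root_origin + 31 and walks down using slots[4]
  // through slots[1].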
slot_number_t slots[VDO_BLOCK_MAP_TREE_HEIGHT]; slots[0] = lbn % VDO_BLOCK_MAP_ENTRIES_PER_PAGE; root_count_t rootIndex = pageNumber % map->root_count; pageNumber = pageNumber / map->root_count; for (int i = 1; i < VDO_BLOCK_MAP_TREE_HEIGHT; i++) { slots[i] = pageNumber % VDO_BLOCK_MAP_ENTRIES_PER_PAGE; pageNumber /= VDO_BLOCK_MAP_ENTRIES_PER_PAGE; } physical_block_number_t pbn = map->root_origin + rootIndex; for (int i = VDO_BLOCK_MAP_TREE_HEIGHT - 1; i > 0; i--) { enum block_mapping_state state; int result = readSlotFromPage(vdo, pbn, slots[i], &pbn, &state); if ((result != VDO_SUCCESS) || (pbn == VDO_ZERO_BLOCK) || (state == VDO_MAPPING_STATE_UNMAPPED)) { *pbnPtr = VDO_ZERO_BLOCK; return result; } } *pbnPtr = pbn; return VDO_SUCCESS; } /**********************************************************************/ int findLBNMapping(UserVDO *vdo, logical_block_number_t lbn, physical_block_number_t *pbnPtr, enum block_mapping_state *statePtr) { physical_block_number_t pagePBN; int result = findLBNPage(vdo, lbn, &pagePBN); if (result != VDO_SUCCESS) { return result; } if (pagePBN == VDO_ZERO_BLOCK) { *pbnPtr = VDO_ZERO_BLOCK; *statePtr = VDO_MAPPING_STATE_UNMAPPED; return VDO_SUCCESS; } slot_number_t slot = lbn % VDO_BLOCK_MAP_ENTRIES_PER_PAGE; return readSlotFromPage(vdo, pagePBN, slot, pbnPtr, statePtr); } /**********************************************************************/ int readBlockMapPage(PhysicalLayer *layer, physical_block_number_t pbn, nonce_t nonce, struct block_map_page *page) { int result = layer->reader(layer, pbn, 1, (char *) page); if (result != VDO_SUCCESS) { char errBuf[VDO_MAX_ERROR_MESSAGE_SIZE]; printf("%llu unreadable : %s", (unsigned long long) pbn, uds_string_error(result, errBuf, VDO_MAX_ERROR_MESSAGE_SIZE)); return result; } enum block_map_page_validity validity = vdo_validate_block_map_page(page, nonce, pbn); if (validity == VDO_BLOCK_MAP_PAGE_VALID) { return VDO_SUCCESS; } if (validity == VDO_BLOCK_MAP_PAGE_BAD) { warnx("Expected page %llu but got page %llu", (unsigned long long) pbn, (unsigned long long) vdo_get_block_map_page_pbn(page)); } page->header.initialized = false; return VDO_SUCCESS; } vdo-8.3.1.1/utils/vdo/blockMapUtils.h000066400000000000000000000065731476467262700173640ustar00rootroot00000000000000/* * Copyright Red Hat * * This program is free software; you can redistribute it and/or * modify it under the terms of the GNU General Public License * as published by the Free Software Foundation; either version 2 * of the License, or (at your option) any later version. * * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with this program; if not, write to the Free Software * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA * 02110-1301, USA. */ #ifndef BLOCK_MAP_UTILS_H #define BLOCK_MAP_UTILS_H #include "encodings.h" #include "physicalLayer.h" #include "userVDO.h" /** * A function which examines a block map page entry. Functions of this type are * passed to examineBlockMapPages() which will iterate over the entire block * map and call this function once for each non-empty mapping. 
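 *
 * A minimal sketch of an examiner (hypothetical helper, using only the signature below)
 * that counts mapped data blocks might look like:
 *
 *   static block_count_t mappedDataBlocks;
 *
 *   static int countDataMappings(struct block_map_slot slot,
 *                                height_t height,
 *                                physical_block_number_t pbn,
 *                                enum block_mapping_state state)
 *   {
 *     if ((height == 0) && (state != VDO_MAPPING_STATE_UNMAPPED)) {
 *       mappedDataBlocks++;
 *     }
 *     return VDO_SUCCESS;
 *   }
 *
 * and would be applied with examineBlockMapEntries(vdo, countDataMappings).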
* * @param slot The block_map_slot where this entry was found * @param height The height of the block map entry in the tree * @param pbn The PBN encoded in the entry * @param state The mapping state encoded in the entry * * @return VDO_SUCCESS or an error code **/ typedef int __must_check MappingExaminer(struct block_map_slot slot, height_t height, physical_block_number_t pbn, enum block_mapping_state state); /** * Apply a mapping examiner to each mapped block map entry in a VDO. * * @param vdo The VDO containing the block map to be examined * @param examiner The examiner to apply to each defined mapping * * @return VDO_SUCCESS or an error code **/ int __must_check examineBlockMapEntries(UserVDO *vdo, MappingExaminer *examiner); /** * Find the PBN for the block map page encoding a particular LBN mapping. * This will return the zero block if there is no mapping. * * @param [in] vdo The VDO * @param [in] lbn The logical block number to look up * @param [out] pbnPtr A pointer to the PBN of the requested block map page * * @return VDO_SUCCESS or an error code **/ int __must_check findLBNPage(UserVDO *vdo, logical_block_number_t lbn, physical_block_number_t *pbnPtr); /** * Look up the mapping for a single LBN in the block map. * * @param [in] vdo The VDO * @param [in] lbn The logical block number to look up * @param [out] pbnPtr A pointer to the mapped PBN * @param [out] statePtr A pointer to the mapping state * * @return VDO_SUCCESS or an error code **/ int __must_check findLBNMapping(UserVDO *vdo, logical_block_number_t lbn, physical_block_number_t *pbnPtr, enum block_mapping_state *statePtr); /** * Read a single block map page into the buffer. The page will be marked * initialized iff the page is valid. * * @param [in] layer The layer from which to read the page * @param [in] pbn The absolute physical block number of the page to * read * @param [in] nonce The VDO nonce * @param [out] page The page structure to read into * * @return VDO_SUCCESS or an error code **/ int __must_check readBlockMapPage(PhysicalLayer *layer, physical_block_number_t pbn, nonce_t nonce, struct block_map_page *page); #endif // BLOCK_MAP_UTILS_H vdo-8.3.1.1/utils/vdo/constants.h000066400000000000000000000052511476467262700166170ustar00rootroot00000000000000/* SPDX-License-Identifier: GPL-2.0-only */ /* * Copyright 2023 Red Hat */ #ifndef VDO_CONSTANTS_H #define VDO_CONSTANTS_H #include "types.h" enum { /* * The maximum number of contiguous PBNs which will go to a single bio submission queue, * assuming there is more than one queue. */ VDO_BIO_ROTATION_INTERVAL_LIMIT = 1024, /* The number of entries on a block map page */ VDO_BLOCK_MAP_ENTRIES_PER_PAGE = 812, /* The origin of the flat portion of the block map */ VDO_BLOCK_MAP_FLAT_PAGE_ORIGIN = 1, /* * The height of a block map tree. Assuming a root count of 60 and 812 entries per page, * this is big enough to represent almost 95 PB of logical space. */ VDO_BLOCK_MAP_TREE_HEIGHT = 5, /* The default number of bio submission queues. */ DEFAULT_VDO_BIO_SUBMIT_QUEUE_COUNT = 4, /* The number of contiguous PBNs to be submitted to a single bio queue. 
*/ DEFAULT_VDO_BIO_SUBMIT_QUEUE_ROTATE_INTERVAL = 64, /* The number of trees in the arboreal block map */ DEFAULT_VDO_BLOCK_MAP_TREE_ROOT_COUNT = 60, /* The default size of the recovery journal, in blocks */ DEFAULT_VDO_RECOVERY_JOURNAL_SIZE = 32 * 1024, /* The default size of each slab journal, in blocks */ DEFAULT_VDO_SLAB_JOURNAL_SIZE = 224, /* Unit test minimum */ MINIMUM_VDO_SLAB_JOURNAL_BLOCKS = 2, /* * The initial size of lbn_operations and pbn_operations, which is based upon the expected * maximum number of outstanding VIOs. This value was chosen to make it highly unlikely * that the maps would need to be resized. */ VDO_LOCK_MAP_CAPACITY = 10000, /* The maximum number of logical zones */ MAX_VDO_LOGICAL_ZONES = 60, /* The maximum number of physical zones */ MAX_VDO_PHYSICAL_ZONES = 16, /* The base-2 logarithm of the maximum blocks in one slab */ MAX_VDO_SLAB_BITS = 23, /* The maximum number of slabs the slab depot supports */ MAX_VDO_SLABS = 8192, /* * The maximum number of block map pages to load simultaneously during recovery or rebuild. */ MAXIMUM_SIMULTANEOUS_VDO_BLOCK_MAP_RESTORATION_READS = 1024, /* The maximum number of entries in the slab summary */ MAXIMUM_VDO_SLAB_SUMMARY_ENTRIES = MAX_VDO_SLABS * MAX_VDO_PHYSICAL_ZONES, /* The maximum number of total threads in a VDO thread configuration. */ MAXIMUM_VDO_THREADS = 100, /* The maximum number of VIOs in the system at once */ MAXIMUM_VDO_USER_VIOS = 2048, /* The only physical block size supported by VDO */ VDO_BLOCK_SIZE = 4096, /* The number of sectors per block */ VDO_SECTORS_PER_BLOCK = 8, /* The size of a sector that will not be torn */ VDO_SECTOR_SIZE = 512, /* The physical block number reserved for storing the zero block */ VDO_ZERO_BLOCK = 0, }; #endif /* VDO_CONSTANTS_H */ vdo-8.3.1.1/utils/vdo/encodings.c000066400000000000000000001371411476467262700165530ustar00rootroot00000000000000// SPDX-License-Identifier: GPL-2.0-only /* * Copyright 2023 Red Hat */ #include "encodings.h" #include #include "logger.h" #include "memory-alloc.h" #include "permassert.h" #include "constants.h" #include "status-codes.h" #include "types.h" struct geometry_block { char magic_number[VDO_GEOMETRY_MAGIC_NUMBER_SIZE]; struct packed_header header; u32 checksum; } __packed; static const struct header GEOMETRY_BLOCK_HEADER_5_0 = { .id = VDO_GEOMETRY_BLOCK, .version = { .major_version = 5, .minor_version = 0, }, /* * Note: this size isn't just the payload size following the header, like it is everywhere * else in VDO. */ .size = sizeof(struct geometry_block) + sizeof(struct volume_geometry), }; static const struct header GEOMETRY_BLOCK_HEADER_4_0 = { .id = VDO_GEOMETRY_BLOCK, .version = { .major_version = 4, .minor_version = 0, }, /* * Note: this size isn't just the payload size following the header, like it is everywhere * else in VDO. 
*/ .size = sizeof(struct geometry_block) + sizeof(struct volume_geometry_4_0), }; const u8 VDO_GEOMETRY_MAGIC_NUMBER[VDO_GEOMETRY_MAGIC_NUMBER_SIZE + 1] = "dmvdo001"; #define PAGE_HEADER_4_1_SIZE (8 + 8 + 8 + 1 + 1 + 1 + 1) static const struct version_number BLOCK_MAP_4_1 = { .major_version = 4, .minor_version = 1, }; const struct header VDO_BLOCK_MAP_HEADER_2_0 = { .id = VDO_BLOCK_MAP, .version = { .major_version = 2, .minor_version = 0, }, .size = sizeof(struct block_map_state_2_0), }; const struct header VDO_RECOVERY_JOURNAL_HEADER_7_0 = { .id = VDO_RECOVERY_JOURNAL, .version = { .major_version = 7, .minor_version = 0, }, .size = sizeof(struct recovery_journal_state_7_0), }; const struct header VDO_SLAB_DEPOT_HEADER_2_0 = { .id = VDO_SLAB_DEPOT, .version = { .major_version = 2, .minor_version = 0, }, .size = sizeof(struct slab_depot_state_2_0), }; static const struct header VDO_LAYOUT_HEADER_3_0 = { .id = VDO_LAYOUT, .version = { .major_version = 3, .minor_version = 0, }, .size = sizeof(struct layout_3_0) + (sizeof(struct partition_3_0) * VDO_PARTITION_COUNT), }; static const enum partition_id REQUIRED_PARTITIONS[] = { VDO_BLOCK_MAP_PARTITION, VDO_SLAB_DEPOT_PARTITION, VDO_RECOVERY_JOURNAL_PARTITION, VDO_SLAB_SUMMARY_PARTITION, }; /* * The current version for the data encoded in the super block. This must be changed any time there * is a change to encoding of the component data of any VDO component. */ static const struct version_number VDO_COMPONENT_DATA_41_0 = { .major_version = 41, .minor_version = 0, }; const struct version_number VDO_VOLUME_VERSION_67_0 = { .major_version = 67, .minor_version = 0, }; static const struct header SUPER_BLOCK_HEADER_12_0 = { .id = VDO_SUPER_BLOCK, .version = { .major_version = 12, .minor_version = 0, }, /* This is the minimum size, if the super block contains no components. */ .size = VDO_SUPER_BLOCK_FIXED_SIZE - VDO_ENCODED_HEADER_SIZE, }; /** * validate_version() - Check whether a version matches an expected version. * @expected_version: The expected version. * @actual_version: The version being validated. * @component_name: The name of the component or the calling function (for error logging). * * Logs an error describing a mismatch. * * Return: VDO_SUCCESS if the versions are the same, * VDO_UNSUPPORTED_VERSION if the versions don't match. */ static int __must_check validate_version(struct version_number expected_version, struct version_number actual_version, const char *component_name) { if (!vdo_are_same_version(expected_version, actual_version)) { return vdo_log_error_strerror(VDO_UNSUPPORTED_VERSION, "%s version mismatch, expected %d.%d, got %d.%d", component_name, expected_version.major_version, expected_version.minor_version, actual_version.major_version, actual_version.minor_version); } return VDO_SUCCESS; } /** * vdo_validate_header() - Check whether a header matches expectations. * @expected_header: The expected header. * @actual_header: The header being validated. * @exact_size: If true, the size fields of the two headers must be the same, otherwise it is * required that actual_header.size >= expected_header.size. * @name: The name of the component or the calling function (for error logging). * * Logs an error describing the first mismatch found. * * Return: VDO_SUCCESS if the header meets expectations, * VDO_INCORRECT_COMPONENT if the component ids don't match, * VDO_UNSUPPORTED_VERSION if the versions or sizes don't match. 
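 *
 * For example, decode_block_map_state_2_0() below checks its component header with:
 *
 *   result = vdo_validate_header(&VDO_BLOCK_MAP_HEADER_2_0, &header, true, __func__);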
*/ int vdo_validate_header(const struct header *expected_header, const struct header *actual_header, bool exact_size, const char *name) { int result; if (expected_header->id != actual_header->id) { return vdo_log_error_strerror(VDO_INCORRECT_COMPONENT, "%s ID mismatch, expected %d, got %d", name, expected_header->id, actual_header->id); } result = validate_version(expected_header->version, actual_header->version, name); if (result != VDO_SUCCESS) return result; if ((expected_header->size > actual_header->size) || (exact_size && (expected_header->size < actual_header->size))) { return vdo_log_error_strerror(VDO_UNSUPPORTED_VERSION, "%s size mismatch, expected %zu, got %zu", name, expected_header->size, actual_header->size); } return VDO_SUCCESS; } static void encode_version_number(u8 *buffer, size_t *offset, struct version_number version) { struct packed_version_number packed = vdo_pack_version_number(version); memcpy(buffer + *offset, &packed, sizeof(packed)); *offset += sizeof(packed); } void vdo_encode_header(u8 *buffer, size_t *offset, const struct header *header) { struct packed_header packed = vdo_pack_header(header); memcpy(buffer + *offset, &packed, sizeof(packed)); *offset += sizeof(packed); } static void decode_version_number(u8 *buffer, size_t *offset, struct version_number *version) { struct packed_version_number packed; memcpy(&packed, buffer + *offset, sizeof(packed)); *offset += sizeof(packed); *version = vdo_unpack_version_number(packed); } void vdo_decode_header(u8 *buffer, size_t *offset, struct header *header) { struct packed_header packed; memcpy(&packed, buffer + *offset, sizeof(packed)); *offset += sizeof(packed); *header = vdo_unpack_header(&packed); } /** * decode_volume_geometry() - Decode the on-disk representation of a volume geometry from a buffer. * @buffer: A buffer to decode from. * @offset: The offset in the buffer at which to decode. * @geometry: The structure to receive the decoded fields. * @version: The geometry block version to decode. */ static void decode_volume_geometry(u8 *buffer, size_t *offset, struct volume_geometry *geometry, u32 version) { u32 unused, mem; enum volume_region_id id; nonce_t nonce; block_count_t bio_offset = 0; bool sparse; /* This is for backwards compatibility. */ decode_u32_le(buffer, offset, &unused); geometry->unused = unused; decode_u64_le(buffer, offset, &nonce); geometry->nonce = nonce; memcpy((unsigned char *) &geometry->uuid, buffer + *offset, sizeof(uuid_t)); *offset += sizeof(uuid_t); if (version > 4) decode_u64_le(buffer, offset, &bio_offset); geometry->bio_offset = bio_offset; for (id = 0; id < VDO_VOLUME_REGION_COUNT; id++) { physical_block_number_t start_block; enum volume_region_id saved_id; decode_u32_le(buffer, offset, &saved_id); decode_u64_le(buffer, offset, &start_block); geometry->regions[id] = (struct volume_region) { .id = saved_id, .start_block = start_block, }; } decode_u32_le(buffer, offset, &mem); *offset += sizeof(u32); sparse = buffer[(*offset)++]; geometry->index_config = (struct index_config) { .mem = mem, .sparse = sparse, }; } /** * encode_volume_geometry() - Encode the on-disk representation of a volume geometry into a buffer. * @buffer: A buffer to store the encoding. * @offset: The offset in the buffer at which to encode. * @geometry: The geometry to encode. * @version: The geometry block version to encode. 
* * Return: VDO_SUCCESS or an error */ int encode_volume_geometry(u8 *buffer, size_t *offset, const struct volume_geometry *geometry, u32 version) { enum volume_region_id id; const struct header *header; header = ((version <= 4) ? &GEOMETRY_BLOCK_HEADER_4_0 : &GEOMETRY_BLOCK_HEADER_5_0); vdo_encode_header(buffer, offset, header); /* This is for backwards compatibility */ encode_u32_le(buffer, offset, geometry->unused); encode_u64_le(buffer, offset, geometry->nonce); memcpy(buffer + *offset, (unsigned char *) &geometry->uuid, sizeof(uuid_t)); *offset += sizeof(uuid_t); if (version > 4) encode_u64_le(buffer, offset, geometry->bio_offset); for (id = 0; id < VDO_VOLUME_REGION_COUNT; id++) { encode_u32_le(buffer, offset, geometry->regions[id].id); encode_u64_le(buffer, offset, geometry->regions[id].start_block); } encode_u32_le(buffer, offset, geometry->index_config.mem); encode_u32_le(buffer, offset, 0); if (geometry->index_config.sparse) buffer[(*offset)++] = 1; else buffer[(*offset)++] = 0; return VDO_ASSERT(header->size == (*offset + sizeof(u32)), "should have included up to the geometry checksum"); } /** * vdo_parse_geometry_block() - Decode and validate an encoded geometry block. * @block: The encoded geometry block. * @geometry: The structure to receive the decoded fields. */ int __must_check vdo_parse_geometry_block(u8 *block, struct volume_geometry *geometry) { u32 checksum, saved_checksum; struct header header; size_t offset = 0; int result; if (memcmp(block, VDO_GEOMETRY_MAGIC_NUMBER, VDO_GEOMETRY_MAGIC_NUMBER_SIZE) != 0) return VDO_BAD_MAGIC; offset += VDO_GEOMETRY_MAGIC_NUMBER_SIZE; vdo_decode_header(block, &offset, &header); if (header.version.major_version <= 4) { result = vdo_validate_header(&GEOMETRY_BLOCK_HEADER_4_0, &header, true, __func__); } else { result = vdo_validate_header(&GEOMETRY_BLOCK_HEADER_5_0, &header, true, __func__); } if (result != VDO_SUCCESS) return result; decode_volume_geometry(block, &offset, geometry, header.version.major_version); result = VDO_ASSERT(header.size == offset + sizeof(u32), "should have decoded up to the geometry checksum"); if (result != VDO_SUCCESS) return result; /* Decode and verify the checksum. */ checksum = vdo_crc32(block, offset); decode_u32_le(block, &offset, &saved_checksum); return ((checksum == saved_checksum) ? 
VDO_SUCCESS : VDO_CHECKSUM_MISMATCH); } struct block_map_page *vdo_format_block_map_page(void *buffer, nonce_t nonce, physical_block_number_t pbn, bool initialized) { struct block_map_page *page = buffer; memset(buffer, 0, VDO_BLOCK_SIZE); page->version = vdo_pack_version_number(BLOCK_MAP_4_1); page->header.nonce = __cpu_to_le64(nonce); page->header.pbn = __cpu_to_le64(pbn); page->header.initialized = initialized; return page; } enum block_map_page_validity vdo_validate_block_map_page(struct block_map_page *page, nonce_t nonce, physical_block_number_t pbn) { BUILD_BUG_ON(sizeof(struct block_map_page_header) != PAGE_HEADER_4_1_SIZE); if (!vdo_are_same_version(BLOCK_MAP_4_1, vdo_unpack_version_number(page->version)) || !page->header.initialized || (nonce != __le64_to_cpu(page->header.nonce))) return VDO_BLOCK_MAP_PAGE_INVALID; if (pbn != vdo_get_block_map_page_pbn(page)) return VDO_BLOCK_MAP_PAGE_BAD; return VDO_BLOCK_MAP_PAGE_VALID; } static int decode_block_map_state_2_0(u8 *buffer, size_t *offset, struct block_map_state_2_0 *state) { size_t initial_offset; block_count_t flat_page_count, root_count; physical_block_number_t flat_page_origin, root_origin; struct header header; int result; vdo_decode_header(buffer, offset, &header); result = vdo_validate_header(&VDO_BLOCK_MAP_HEADER_2_0, &header, true, __func__); if (result != VDO_SUCCESS) return result; initial_offset = *offset; decode_u64_le(buffer, offset, &flat_page_origin); result = VDO_ASSERT(flat_page_origin == VDO_BLOCK_MAP_FLAT_PAGE_ORIGIN, "Flat page origin must be %u (recorded as %llu)", VDO_BLOCK_MAP_FLAT_PAGE_ORIGIN, (unsigned long long) state->flat_page_origin); if (result != VDO_SUCCESS) return result; decode_u64_le(buffer, offset, &flat_page_count); result = VDO_ASSERT(flat_page_count == 0, "Flat page count must be 0 (recorded as %llu)", (unsigned long long) state->flat_page_count); if (result != VDO_SUCCESS) return result; decode_u64_le(buffer, offset, &root_origin); decode_u64_le(buffer, offset, &root_count); result = VDO_ASSERT(VDO_BLOCK_MAP_HEADER_2_0.size == *offset - initial_offset, "decoded block map component size must match header size"); if (result != VDO_SUCCESS) return result; *state = (struct block_map_state_2_0) { .flat_page_origin = flat_page_origin, .flat_page_count = flat_page_count, .root_origin = root_origin, .root_count = root_count, }; return VDO_SUCCESS; } static void encode_block_map_state_2_0(u8 *buffer, size_t *offset, struct block_map_state_2_0 state) { size_t initial_offset; vdo_encode_header(buffer, offset, &VDO_BLOCK_MAP_HEADER_2_0); initial_offset = *offset; encode_u64_le(buffer, offset, state.flat_page_origin); encode_u64_le(buffer, offset, state.flat_page_count); encode_u64_le(buffer, offset, state.root_origin); encode_u64_le(buffer, offset, state.root_count); VDO_ASSERT_LOG_ONLY(VDO_BLOCK_MAP_HEADER_2_0.size == *offset - initial_offset, "encoded block map component size must match header size"); } /** * vdo_compute_new_forest_pages() - Compute the number of pages which must be allocated at each * level in order to grow the forest to a new number of entries. * @entries: The new number of entries the block map must address. * * Return: The total number of non-leaf pages required. 
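 *
 * Leaf pages are not included in the returned count; only the interior tree
 * levels recorded in new_sizes are. An illustrative call (names assumed),
 * growing the forest and capturing the new per-level sizes:
 *
 *     struct boundary new_sizes;
 *     block_count_t new_pages =
 *             vdo_compute_new_forest_pages(root_count, &old_sizes,
 *                                          new_logical_blocks, &new_sizes);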
*/ block_count_t vdo_compute_new_forest_pages(root_count_t root_count, struct boundary *old_sizes, block_count_t entries, struct boundary *new_sizes) { page_count_t leaf_pages = max(vdo_compute_block_map_page_count(entries), 1U); page_count_t level_size = DIV_ROUND_UP(leaf_pages, root_count); block_count_t total_pages = 0; height_t height; for (height = 0; height < VDO_BLOCK_MAP_TREE_HEIGHT; height++) { block_count_t new_pages; level_size = DIV_ROUND_UP(level_size, VDO_BLOCK_MAP_ENTRIES_PER_PAGE); new_sizes->levels[height] = level_size; new_pages = level_size; if (old_sizes != NULL) new_pages -= old_sizes->levels[height]; total_pages += (new_pages * root_count); } return total_pages; } /** * encode_recovery_journal_state_7_0() - Encode the state of a recovery journal. * * Return: VDO_SUCCESS or an error code. */ static void encode_recovery_journal_state_7_0(u8 *buffer, size_t *offset, struct recovery_journal_state_7_0 state) { size_t initial_offset; vdo_encode_header(buffer, offset, &VDO_RECOVERY_JOURNAL_HEADER_7_0); initial_offset = *offset; encode_u64_le(buffer, offset, state.journal_start); encode_u64_le(buffer, offset, state.logical_blocks_used); encode_u64_le(buffer, offset, state.block_map_data_blocks); VDO_ASSERT_LOG_ONLY(VDO_RECOVERY_JOURNAL_HEADER_7_0.size == *offset - initial_offset, "encoded recovery journal component size must match header size"); } /** * decode_recovery_journal_state_7_0() - Decode the state of a recovery journal saved in a buffer. * @buffer: The buffer containing the saved state. * @state: A pointer to a recovery journal state to hold the result of a successful decode. * * Return: VDO_SUCCESS or an error code. */ static int __must_check decode_recovery_journal_state_7_0(u8 *buffer, size_t *offset, struct recovery_journal_state_7_0 *state) { struct header header; int result; size_t initial_offset; sequence_number_t journal_start; block_count_t logical_blocks_used, block_map_data_blocks; vdo_decode_header(buffer, offset, &header); result = vdo_validate_header(&VDO_RECOVERY_JOURNAL_HEADER_7_0, &header, true, __func__); if (result != VDO_SUCCESS) return result; initial_offset = *offset; decode_u64_le(buffer, offset, &journal_start); decode_u64_le(buffer, offset, &logical_blocks_used); decode_u64_le(buffer, offset, &block_map_data_blocks); result = VDO_ASSERT(VDO_RECOVERY_JOURNAL_HEADER_7_0.size == *offset - initial_offset, "decoded recovery journal component size must match header size"); if (result != VDO_SUCCESS) return result; *state = (struct recovery_journal_state_7_0) { .journal_start = journal_start, .logical_blocks_used = logical_blocks_used, .block_map_data_blocks = block_map_data_blocks, }; return VDO_SUCCESS; } /** * vdo_get_journal_operation_name() - Get the name of a journal operation. * @operation: The operation to name. * * Return: The name of the operation. */ const char *vdo_get_journal_operation_name(enum journal_operation operation) { switch (operation) { case VDO_JOURNAL_DATA_REMAPPING: return "data remapping"; case VDO_JOURNAL_BLOCK_MAP_REMAPPING: return "block map remapping"; default: return "unknown journal operation"; } } /** * encode_slab_depot_state_2_0() - Encode the state of a slab depot into a buffer. 
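 *
 * The fields are written in a fixed order: the seven u64 slab_config fields
 * (slab_blocks, data_blocks, reference_count_blocks, slab_journal_blocks,
 * and the three slab journal thresholds), then first_block, last_block, and
 * a single zone_count byte, matching decode_slab_depot_state_2_0().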
*/ static void encode_slab_depot_state_2_0(u8 *buffer, size_t *offset, struct slab_depot_state_2_0 state) { size_t initial_offset; vdo_encode_header(buffer, offset, &VDO_SLAB_DEPOT_HEADER_2_0); initial_offset = *offset; encode_u64_le(buffer, offset, state.slab_config.slab_blocks); encode_u64_le(buffer, offset, state.slab_config.data_blocks); encode_u64_le(buffer, offset, state.slab_config.reference_count_blocks); encode_u64_le(buffer, offset, state.slab_config.slab_journal_blocks); encode_u64_le(buffer, offset, state.slab_config.slab_journal_flushing_threshold); encode_u64_le(buffer, offset, state.slab_config.slab_journal_blocking_threshold); encode_u64_le(buffer, offset, state.slab_config.slab_journal_scrubbing_threshold); encode_u64_le(buffer, offset, state.first_block); encode_u64_le(buffer, offset, state.last_block); buffer[(*offset)++] = state.zone_count; VDO_ASSERT_LOG_ONLY(VDO_SLAB_DEPOT_HEADER_2_0.size == *offset - initial_offset, "encoded block map component size must match header size"); } /** * decode_slab_depot_state_2_0() - Decode slab depot component state version 2.0 from a buffer. * * Return: VDO_SUCCESS or an error code. */ static int decode_slab_depot_state_2_0(u8 *buffer, size_t *offset, struct slab_depot_state_2_0 *state) { struct header header; int result; size_t initial_offset; struct slab_config slab_config; block_count_t count; physical_block_number_t first_block, last_block; zone_count_t zone_count; vdo_decode_header(buffer, offset, &header); result = vdo_validate_header(&VDO_SLAB_DEPOT_HEADER_2_0, &header, true, __func__); if (result != VDO_SUCCESS) return result; initial_offset = *offset; decode_u64_le(buffer, offset, &count); slab_config.slab_blocks = count; decode_u64_le(buffer, offset, &count); slab_config.data_blocks = count; decode_u64_le(buffer, offset, &count); slab_config.reference_count_blocks = count; decode_u64_le(buffer, offset, &count); slab_config.slab_journal_blocks = count; decode_u64_le(buffer, offset, &count); slab_config.slab_journal_flushing_threshold = count; decode_u64_le(buffer, offset, &count); slab_config.slab_journal_blocking_threshold = count; decode_u64_le(buffer, offset, &count); slab_config.slab_journal_scrubbing_threshold = count; decode_u64_le(buffer, offset, &first_block); decode_u64_le(buffer, offset, &last_block); zone_count = buffer[(*offset)++]; result = VDO_ASSERT(VDO_SLAB_DEPOT_HEADER_2_0.size == *offset - initial_offset, "decoded slab depot component size must match header size"); if (result != VDO_SUCCESS) return result; *state = (struct slab_depot_state_2_0) { .slab_config = slab_config, .first_block = first_block, .last_block = last_block, .zone_count = zone_count, }; return VDO_SUCCESS; } /** * vdo_configure_slab_depot() - Configure the slab depot. * @partition: The slab depot partition * @slab_config: The configuration of a single slab. * @zone_count: The number of zones the depot will use. * @state: The state structure to be configured. * * Configures the slab_depot for the specified storage capacity, finding the number of data blocks * that will fit and still leave room for the depot metadata, then return the saved state for that * configuration. * * Return: VDO_SUCCESS or an error code. 
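 *
 * Slabs are never allowed to be partial ("runt") slabs, so up to one slab's
 * worth of blocks at the end of the partition may go unused. For example
 * (illustrative numbers), a 10,000,000-block partition with 32,768-block
 * slabs yields 305 slabs covering 9,994,240 blocks, leaving 5,760 blocks
 * unused.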
*/ int vdo_configure_slab_depot(const struct partition *partition, struct slab_config slab_config, zone_count_t zone_count, struct slab_depot_state_2_0 *state) { block_count_t total_slab_blocks, total_data_blocks; size_t slab_count; physical_block_number_t last_block; block_count_t slab_size = slab_config.slab_blocks; vdo_log_debug("slabDepot %s(block_count=%llu, first_block=%llu, slab_size=%llu, zone_count=%u)", __func__, (unsigned long long) partition->count, (unsigned long long) partition->offset, (unsigned long long) slab_size, zone_count); /* We do not allow runt slabs, so we waste up to a slab's worth. */ slab_count = (partition->count / slab_size); if (slab_count == 0) return VDO_NO_SPACE; if (slab_count > MAX_VDO_SLABS) return VDO_TOO_MANY_SLABS; total_slab_blocks = slab_count * slab_config.slab_blocks; total_data_blocks = slab_count * slab_config.data_blocks; last_block = partition->offset + total_slab_blocks; *state = (struct slab_depot_state_2_0) { .slab_config = slab_config, .first_block = partition->offset, .last_block = last_block, .zone_count = zone_count, }; vdo_log_debug("slab_depot last_block=%llu, total_data_blocks=%llu, slab_count=%zu, left_over=%llu", (unsigned long long) last_block, (unsigned long long) total_data_blocks, slab_count, (unsigned long long) (partition->count - (last_block - partition->offset))); return VDO_SUCCESS; } /** * vdo_configure_slab() - Measure and initialize the configuration to use for each slab. * @slab_size: The number of blocks per slab. * @slab_journal_blocks: The number of blocks for the slab journal. * @slab_config: The slab configuration to initialize. * * Return: VDO_SUCCESS or an error code. */ int vdo_configure_slab(block_count_t slab_size, block_count_t slab_journal_blocks, struct slab_config *slab_config) { block_count_t ref_blocks, meta_blocks, data_blocks; block_count_t flushing_threshold, remaining, blocking_threshold; block_count_t minimal_extra_space, scrubbing_threshold; if (slab_journal_blocks >= slab_size) return VDO_BAD_CONFIGURATION; /* * This calculation should technically be a recurrence, but the total number of metadata * blocks is currently less than a single block of ref_counts, so we'd gain at most one * data block in each slab with more iteration. */ ref_blocks = vdo_get_saved_reference_count_size(slab_size - slab_journal_blocks); meta_blocks = (ref_blocks + slab_journal_blocks); /* Make sure test code hasn't configured slabs to be too small. */ if (meta_blocks >= slab_size) return VDO_BAD_CONFIGURATION; /* * If the slab size is very small, assume this must be a unit test and override the number * of data blocks to be a power of two (wasting blocks in the slab). Many tests need their * data_blocks fields to be the exact capacity of the configured volume, and that used to * fall out since they use a power of two for the number of data blocks, the slab size was * a power of two, and every block in a slab was a data block. * * TODO: Try to figure out some way of structuring testParameters and unit tests so this * hack isn't needed without having to edit several unit tests every time the metadata size * changes by one block. */ data_blocks = slab_size - meta_blocks; if ((slab_size < 1024) && !is_power_of_2(data_blocks)) data_blocks = ((block_count_t) 1 << ilog2(data_blocks)); /* * Configure the slab journal thresholds. The flush threshold is 168 of 224 blocks in * production, or 3/4ths, so we use this ratio for all sizes. 
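 * For a 224-block slab journal this works out to ((224 * 3) + 3) / 4 = 168
 * (integer division), matching the production figure noted above.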
*/ flushing_threshold = ((slab_journal_blocks * 3) + 3) / 4; /* * The blocking threshold should be far enough from the flushing threshold to not produce * delays, but far enough from the end of the journal to allow multiple successive recovery * failures. */ remaining = slab_journal_blocks - flushing_threshold; blocking_threshold = flushing_threshold + ((remaining * 5) / 7); /* The scrubbing threshold should be at least 2048 entries before the end of the journal. */ minimal_extra_space = 1 + (MAXIMUM_VDO_USER_VIOS / VDO_SLAB_JOURNAL_FULL_ENTRIES_PER_BLOCK); scrubbing_threshold = blocking_threshold; if (slab_journal_blocks > minimal_extra_space) scrubbing_threshold = slab_journal_blocks - minimal_extra_space; if (blocking_threshold > scrubbing_threshold) blocking_threshold = scrubbing_threshold; *slab_config = (struct slab_config) { .slab_blocks = slab_size, .data_blocks = data_blocks, .reference_count_blocks = ref_blocks, .slab_journal_blocks = slab_journal_blocks, .slab_journal_flushing_threshold = flushing_threshold, .slab_journal_blocking_threshold = blocking_threshold, .slab_journal_scrubbing_threshold = scrubbing_threshold}; return VDO_SUCCESS; } /** * vdo_decode_slab_journal_entry() - Decode a slab journal entry. * @block: The journal block holding the entry. * @entry_count: The number of the entry. * * Return: The decoded entry. */ struct slab_journal_entry vdo_decode_slab_journal_entry(struct packed_slab_journal_block *block, journal_entry_count_t entry_count) { struct slab_journal_entry entry = vdo_unpack_slab_journal_entry(&block->payload.entries[entry_count]); if (block->header.has_block_map_increments && ((block->payload.full_entries.entry_types[entry_count / 8] & ((u8) 1 << (entry_count % 8))) != 0)) entry.operation = VDO_JOURNAL_BLOCK_MAP_REMAPPING; return entry; } /** * allocate_partition() - Allocate a partition and add it to a layout. * @layout: The layout containing the partition. * @id: The id of the partition. * @offset: The offset into the layout at which the partition begins. * @size: The size of the partition in blocks. * * Return: VDO_SUCCESS or an error. */ static int allocate_partition(struct layout *layout, u8 id, physical_block_number_t offset, block_count_t size) { struct partition *partition; int result; result = vdo_allocate(1, struct partition, __func__, &partition); if (result != VDO_SUCCESS) return result; partition->id = id; partition->offset = offset; partition->count = size; partition->next = layout->head; layout->head = partition; return VDO_SUCCESS; } /** * make_partition() - Create a new partition from the beginning or end of the unused space in a * layout. * @layout: The layout. * @id: The id of the partition to make. * @size: The number of blocks to carve out; if 0, all remaining space will be used. * @beginning: True if the partition should start at the beginning of the unused space. * * Return: A success or error code, particularly VDO_NO_SPACE if there are fewer than size blocks * remaining. */ static int __must_check make_partition(struct layout *layout, enum partition_id id, block_count_t size, bool beginning) { int result; physical_block_number_t offset; block_count_t free_blocks = layout->last_free - layout->first_free; if (size == 0) { if (free_blocks == 0) return VDO_NO_SPACE; size = free_blocks; } else if (size > free_blocks) { return VDO_NO_SPACE; } result = vdo_get_partition(layout, id, NULL); if (result != VDO_UNKNOWN_PARTITION) return VDO_PARTITION_EXISTS; offset = beginning ? 
layout->first_free : (layout->last_free - size); result = allocate_partition(layout, id, offset, size); if (result != VDO_SUCCESS) return result; layout->num_partitions++; if (beginning) layout->first_free += size; else layout->last_free = layout->last_free - size; return VDO_SUCCESS; } /** * vdo_initialize_layout() - Lay out the partitions of a vdo. * @size: The entire size of the vdo. * @offset: The start of the layout on the underlying storage in blocks. * @block_map_blocks: The size of the block map partition. * @journal_blocks: The size of the journal partition. * @summary_blocks: The size of the slab summary partition. * @layout: The layout to initialize. * * Return: VDO_SUCCESS or an error. */ int vdo_initialize_layout(block_count_t size, physical_block_number_t offset, block_count_t block_map_blocks, block_count_t journal_blocks, block_count_t summary_blocks, struct layout *layout) { int result; block_count_t necessary_size = (offset + block_map_blocks + journal_blocks + summary_blocks); if (necessary_size > size) return vdo_log_error_strerror(VDO_NO_SPACE, "Not enough space to make a VDO"); *layout = (struct layout) { .start = offset, .size = size, .first_free = offset, .last_free = size, .num_partitions = 0, .head = NULL, }; result = make_partition(layout, VDO_BLOCK_MAP_PARTITION, block_map_blocks, true); if (result != VDO_SUCCESS) { vdo_uninitialize_layout(layout); return result; } result = make_partition(layout, VDO_SLAB_SUMMARY_PARTITION, summary_blocks, false); if (result != VDO_SUCCESS) { vdo_uninitialize_layout(layout); return result; } result = make_partition(layout, VDO_RECOVERY_JOURNAL_PARTITION, journal_blocks, false); if (result != VDO_SUCCESS) { vdo_uninitialize_layout(layout); return result; } result = make_partition(layout, VDO_SLAB_DEPOT_PARTITION, 0, true); if (result != VDO_SUCCESS) vdo_uninitialize_layout(layout); return result; } /** * vdo_uninitialize_layout() - Clean up a layout. * @layout: The layout to clean up. * * All partitions created by this layout become invalid pointers. */ void vdo_uninitialize_layout(struct layout *layout) { while (layout->head != NULL) { struct partition *part = layout->head; layout->head = part->next; vdo_free(part); } memset(layout, 0, sizeof(struct layout)); } /** * vdo_get_partition() - Get a partition by id. * @layout: The layout from which to get a partition. * @id: The id of the partition. * @partition_ptr: A pointer to hold the partition. * * Return: VDO_SUCCESS or an error. */ int vdo_get_partition(struct layout *layout, enum partition_id id, struct partition **partition_ptr) { struct partition *partition; for (partition = layout->head; partition != NULL; partition = partition->next) { if (partition->id == id) { if (partition_ptr != NULL) *partition_ptr = partition; return VDO_SUCCESS; } } return VDO_UNKNOWN_PARTITION; } /** * vdo_get_known_partition() - Get a partition by id from a validated layout. * @layout: The layout from which to get a partition. * @id: The id of the partition. 
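 *
 * Unlike vdo_get_partition(), this does not return an error for a missing
 * partition; it logs an assertion failure instead, so it should only be
 * used on layouts already validated to contain the requested partition.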
* * Return: the partition */ struct partition *vdo_get_known_partition(struct layout *layout, enum partition_id id) { struct partition *partition; int result = vdo_get_partition(layout, id, &partition); VDO_ASSERT_LOG_ONLY(result == VDO_SUCCESS, "layout has expected partition: %u", id); return partition; } static void encode_layout(u8 *buffer, size_t *offset, const struct layout *layout) { const struct partition *partition; size_t initial_offset; struct header header = VDO_LAYOUT_HEADER_3_0; BUILD_BUG_ON(sizeof(enum partition_id) != sizeof(u8)); VDO_ASSERT_LOG_ONLY(layout->num_partitions <= U8_MAX, "layout partition count must fit in a byte"); vdo_encode_header(buffer, offset, &header); initial_offset = *offset; encode_u64_le(buffer, offset, layout->first_free); encode_u64_le(buffer, offset, layout->last_free); buffer[(*offset)++] = layout->num_partitions; VDO_ASSERT_LOG_ONLY(sizeof(struct layout_3_0) == *offset - initial_offset, "encoded size of a layout header must match structure"); for (partition = layout->head; partition != NULL; partition = partition->next) { buffer[(*offset)++] = partition->id; encode_u64_le(buffer, offset, partition->offset); /* This field only exists for backwards compatibility */ encode_u64_le(buffer, offset, 0); encode_u64_le(buffer, offset, partition->count); } VDO_ASSERT_LOG_ONLY(header.size == *offset - initial_offset, "encoded size of a layout must match header size"); } static int decode_layout(u8 *buffer, size_t *offset, physical_block_number_t start, block_count_t size, struct layout *layout) { struct header header; struct layout_3_0 layout_header; struct partition *partition; size_t initial_offset; physical_block_number_t first_free, last_free; u8 partition_count; u8 i; int result; vdo_decode_header(buffer, offset, &header); /* Layout is variable size, so only do a minimum size check here. 
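 * The number of encoded partitions may vary, so the vdo_validate_header()
 * call below passes exact_size == false: the actual size need only be at
 * least the expected size.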
*/ result = vdo_validate_header(&VDO_LAYOUT_HEADER_3_0, &header, false, __func__); if (result != VDO_SUCCESS) return result; initial_offset = *offset; decode_u64_le(buffer, offset, &first_free); decode_u64_le(buffer, offset, &last_free); partition_count = buffer[(*offset)++]; layout_header = (struct layout_3_0) { .first_free = first_free, .last_free = last_free, .partition_count = partition_count, }; result = VDO_ASSERT(sizeof(struct layout_3_0) == *offset - initial_offset, "decoded size of a layout header must match structure"); if (result != VDO_SUCCESS) return result; layout->start = start; layout->size = size; layout->first_free = layout_header.first_free; layout->last_free = layout_header.last_free; layout->num_partitions = layout_header.partition_count; if (layout->num_partitions > VDO_PARTITION_COUNT) { return vdo_log_error_strerror(VDO_UNKNOWN_PARTITION, "layout has extra partitions"); } for (i = 0; i < layout->num_partitions; i++) { u8 id; u64 partition_offset, count; id = buffer[(*offset)++]; decode_u64_le(buffer, offset, &partition_offset); *offset += sizeof(u64); decode_u64_le(buffer, offset, &count); result = allocate_partition(layout, id, partition_offset, count); if (result != VDO_SUCCESS) { vdo_uninitialize_layout(layout); return result; } } /* Validate that the layout has all (and only) the required partitions */ for (i = 0; i < VDO_PARTITION_COUNT; i++) { result = vdo_get_partition(layout, REQUIRED_PARTITIONS[i], &partition); if (result != VDO_SUCCESS) { vdo_uninitialize_layout(layout); return vdo_log_error_strerror(result, "layout is missing required partition %u", REQUIRED_PARTITIONS[i]); } start += partition->count; } if (start != size) { vdo_uninitialize_layout(layout); return vdo_log_error_strerror(UDS_BAD_STATE, "partitions do not cover the layout"); } return VDO_SUCCESS; } /** * pack_vdo_config() - Convert a vdo_config to its packed on-disk representation. * @config: The vdo config to convert. * * Return: The platform-independent representation of the config. */ static struct packed_vdo_config pack_vdo_config(struct vdo_config config) { return (struct packed_vdo_config) { .logical_blocks = __cpu_to_le64(config.logical_blocks), .physical_blocks = __cpu_to_le64(config.physical_blocks), .slab_size = __cpu_to_le64(config.slab_size), .recovery_journal_size = __cpu_to_le64(config.recovery_journal_size), .slab_journal_blocks = __cpu_to_le64(config.slab_journal_blocks), }; } /** * pack_vdo_component() - Convert a vdo_component to its packed on-disk representation. * @component: The VDO component data to convert. * * Return: The platform-independent representation of the component. */ static struct packed_vdo_component_41_0 pack_vdo_component(const struct vdo_component component) { return (struct packed_vdo_component_41_0) { .state = __cpu_to_le32(component.state), .complete_recoveries = __cpu_to_le64(component.complete_recoveries), .read_only_recoveries = __cpu_to_le64(component.read_only_recoveries), .config = pack_vdo_config(component.config), .nonce = __cpu_to_le64(component.nonce), }; } static void encode_vdo_component(u8 *buffer, size_t *offset, struct vdo_component component) { struct packed_vdo_component_41_0 packed; encode_version_number(buffer, offset, VDO_COMPONENT_DATA_41_0); packed = pack_vdo_component(component); memcpy(buffer + *offset, &packed, sizeof(packed)); *offset += sizeof(packed); } /** * unpack_vdo_config() - Convert a packed_vdo_config to its native in-memory representation. * @config: The packed vdo config to convert. 
* * Return: The native in-memory representation of the vdo config. */ static struct vdo_config unpack_vdo_config(struct packed_vdo_config config) { return (struct vdo_config) { .logical_blocks = __le64_to_cpu(config.logical_blocks), .physical_blocks = __le64_to_cpu(config.physical_blocks), .slab_size = __le64_to_cpu(config.slab_size), .recovery_journal_size = __le64_to_cpu(config.recovery_journal_size), .slab_journal_blocks = __le64_to_cpu(config.slab_journal_blocks), }; } /** * unpack_vdo_component_41_0() - Convert a packed_vdo_component_41_0 to its native in-memory * representation. * @component: The packed vdo component data to convert. * * Return: The native in-memory representation of the component. */ static struct vdo_component unpack_vdo_component_41_0(struct packed_vdo_component_41_0 component) { return (struct vdo_component) { .state = __le32_to_cpu(component.state), .complete_recoveries = __le64_to_cpu(component.complete_recoveries), .read_only_recoveries = __le64_to_cpu(component.read_only_recoveries), .config = unpack_vdo_config(component.config), .nonce = __le64_to_cpu(component.nonce), }; } /** * decode_vdo_component() - Decode the component data for the vdo itself out of the super block. * * Return: VDO_SUCCESS or an error. */ static int decode_vdo_component(u8 *buffer, size_t *offset, struct vdo_component *component) { struct version_number version; struct packed_vdo_component_41_0 packed; int result; decode_version_number(buffer, offset, &version); result = validate_version(version, VDO_COMPONENT_DATA_41_0, "VDO component data"); if (result != VDO_SUCCESS) return result; memcpy(&packed, buffer + *offset, sizeof(packed)); *offset += sizeof(packed); *component = unpack_vdo_component_41_0(packed); return VDO_SUCCESS; } /** * vdo_validate_config() - Validate constraints on a VDO config. * @config: The VDO config. * @physical_block_count: The minimum block count of the underlying storage. * @logical_block_count: The expected logical size of the VDO, or 0 if the logical size may be * unspecified. * * Return: A success or error code. 
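 *
 * A typical call, as made from vdo_validate_component_states() when loading
 * a volume:
 *
 *     result = vdo_validate_config(&states->vdo.config, physical_size,
 *                                  logical_size);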
*/ int vdo_validate_config(const struct vdo_config *config, block_count_t physical_block_count, block_count_t logical_block_count) { struct slab_config slab_config; int result; result = VDO_ASSERT(config->slab_size > 0, "slab size unspecified"); if (result != VDO_SUCCESS) return result; result = VDO_ASSERT(is_power_of_2(config->slab_size), "slab size must be a power of two"); if (result != VDO_SUCCESS) return result; result = VDO_ASSERT(config->slab_size <= (1 << MAX_VDO_SLAB_BITS), "slab size must be less than or equal to 2^%d", MAX_VDO_SLAB_BITS); if (result != VDO_SUCCESS) return result; result = VDO_ASSERT(config->slab_journal_blocks >= MINIMUM_VDO_SLAB_JOURNAL_BLOCKS, "slab journal size meets minimum size"); if (result != VDO_SUCCESS) return result; result = VDO_ASSERT(config->slab_journal_blocks <= config->slab_size, "slab journal size is within expected bound"); if (result != VDO_SUCCESS) return result; result = vdo_configure_slab(config->slab_size, config->slab_journal_blocks, &slab_config); if (result != VDO_SUCCESS) return result; result = VDO_ASSERT((slab_config.data_blocks >= 1), "slab must be able to hold at least one block"); if (result != VDO_SUCCESS) return result; result = VDO_ASSERT(config->physical_blocks > 0, "physical blocks unspecified"); if (result != VDO_SUCCESS) return result; result = VDO_ASSERT(config->physical_blocks <= MAXIMUM_VDO_PHYSICAL_BLOCKS, "physical block count %llu exceeds maximum %llu", (unsigned long long) config->physical_blocks, (unsigned long long) MAXIMUM_VDO_PHYSICAL_BLOCKS); if (result != VDO_SUCCESS) return VDO_OUT_OF_RANGE; /* * This can't check equality because FileLayer et al can only known about the storage size, * which may not match the super block size. */ if (physical_block_count < config->physical_blocks) { vdo_log_error("A physical size of %llu blocks was specified, but that is smaller than the %llu blocks configured in the vdo super block", (unsigned long long) physical_block_count, (unsigned long long) config->physical_blocks); return VDO_PARAMETER_MISMATCH; } if (logical_block_count > 0) { result = VDO_ASSERT((config->logical_blocks > 0), "logical blocks unspecified"); if (result != VDO_SUCCESS) return result; if (logical_block_count != config->logical_blocks) { vdo_log_error("A logical size of %llu blocks was specified, but that differs from the %llu blocks configured in the vdo super block", (unsigned long long) logical_block_count, (unsigned long long) config->logical_blocks); return VDO_PARAMETER_MISMATCH; } } result = VDO_ASSERT(config->logical_blocks <= MAXIMUM_VDO_LOGICAL_BLOCKS, "logical blocks too large"); if (result != VDO_SUCCESS) return result; result = VDO_ASSERT(config->recovery_journal_size > 0, "recovery journal size unspecified"); if (result != VDO_SUCCESS) return result; result = VDO_ASSERT(is_power_of_2(config->recovery_journal_size), "recovery journal size must be a power of two"); if (result != VDO_SUCCESS) return result; return result; } /** * vdo_destroy_component_states() - Clean up any allocations in a vdo_component_states. * @states: The component states to destroy. */ void vdo_destroy_component_states(struct vdo_component_states *states) { if (states == NULL) return; vdo_uninitialize_layout(&states->layout); } /** * decode_components() - Decode the components now that we know the component data is a version we * understand. * @buffer: The buffer being decoded. * @offset: The offset to start decoding from. * @geometry: The vdo geometry * @states: An object to hold the successfully decoded state. 
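 *
 * The components are decoded in the same fixed order in which
 * vdo_encode_component_states() writes them: the vdo component itself, the
 * layout, the recovery journal state, the slab depot state, and finally the
 * block map state.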
* * Return: VDO_SUCCESS or an error. */ static int __must_check decode_components(u8 *buffer, size_t *offset, struct volume_geometry *geometry, struct vdo_component_states *states) { int result; decode_vdo_component(buffer, offset, &states->vdo); result = decode_layout(buffer, offset, vdo_get_data_region_start(*geometry) + 1, states->vdo.config.physical_blocks, &states->layout); if (result != VDO_SUCCESS) return result; result = decode_recovery_journal_state_7_0(buffer, offset, &states->recovery_journal); if (result != VDO_SUCCESS) return result; result = decode_slab_depot_state_2_0(buffer, offset, &states->slab_depot); if (result != VDO_SUCCESS) return result; result = decode_block_map_state_2_0(buffer, offset, &states->block_map); if (result != VDO_SUCCESS) return result; VDO_ASSERT_LOG_ONLY(*offset == VDO_COMPONENT_DATA_OFFSET + VDO_COMPONENT_DATA_SIZE, "All decoded component data was used"); return VDO_SUCCESS; } /** * vdo_decode_component_states() - Decode the payload of a super block. * @buffer: The buffer containing the encoded super block contents. * @geometry: The vdo geometry * @states: A pointer to hold the decoded states. * * Return: VDO_SUCCESS or an error. */ int vdo_decode_component_states(u8 *buffer, struct volume_geometry *geometry, struct vdo_component_states *states) { int result; size_t offset = VDO_COMPONENT_DATA_OFFSET; /* This is for backwards compatibility. */ decode_u32_le(buffer, &offset, &states->unused); /* Check the VDO volume version */ decode_version_number(buffer, &offset, &states->volume_version); result = validate_version(VDO_VOLUME_VERSION_67_0, states->volume_version, "volume"); if (result != VDO_SUCCESS) return result; result = decode_components(buffer, &offset, geometry, states); if (result != VDO_SUCCESS) vdo_uninitialize_layout(&states->layout); return result; } /** * vdo_validate_component_states() - Validate the decoded super block configuration. * @states: The state decoded from the super block. * @geometry_nonce: The nonce from the geometry block. * @physical_size: The minimum block count of the underlying storage. * @logical_size: The expected logical size of the VDO, or 0 if the logical size may be * unspecified. * * Return: VDO_SUCCESS or an error if the configuration is invalid. */ int vdo_validate_component_states(struct vdo_component_states *states, nonce_t geometry_nonce, block_count_t physical_size, block_count_t logical_size) { if (geometry_nonce != states->vdo.nonce) { return vdo_log_error_strerror(VDO_BAD_NONCE, "Geometry nonce %llu does not match superblock nonce %llu", (unsigned long long) geometry_nonce, (unsigned long long) states->vdo.nonce); } return vdo_validate_config(&states->vdo.config, physical_size, logical_size); } /** * vdo_encode_component_states() - Encode the state of all vdo components in the super block. */ static void vdo_encode_component_states(u8 *buffer, size_t *offset, const struct vdo_component_states *states) { /* This is for backwards compatibility. 
*/ encode_u32_le(buffer, offset, states->unused); encode_version_number(buffer, offset, states->volume_version); encode_vdo_component(buffer, offset, states->vdo); encode_layout(buffer, offset, &states->layout); encode_recovery_journal_state_7_0(buffer, offset, states->recovery_journal); encode_slab_depot_state_2_0(buffer, offset, states->slab_depot); encode_block_map_state_2_0(buffer, offset, states->block_map); VDO_ASSERT_LOG_ONLY(*offset == VDO_COMPONENT_DATA_OFFSET + VDO_COMPONENT_DATA_SIZE, "All super block component data was encoded"); } /** * vdo_encode_super_block() - Encode a super block into its on-disk representation. */ void vdo_encode_super_block(u8 *buffer, struct vdo_component_states *states) { u32 checksum; struct header header = SUPER_BLOCK_HEADER_12_0; size_t offset = 0; header.size += VDO_COMPONENT_DATA_SIZE; vdo_encode_header(buffer, &offset, &header); vdo_encode_component_states(buffer, &offset, states); checksum = vdo_crc32(buffer, offset); encode_u32_le(buffer, &offset, checksum); /* * Even though the buffer is a full block, to avoid the potential corruption from a torn * write, the entire encoding must fit in the first sector. */ VDO_ASSERT_LOG_ONLY(offset <= VDO_SECTOR_SIZE, "entire superblock must fit in one sector"); } /** * vdo_decode_super_block() - Decode a super block from its on-disk representation. */ int vdo_decode_super_block(u8 *buffer) { struct header header; int result; u32 checksum, saved_checksum; size_t offset = 0; /* Decode and validate the header. */ vdo_decode_header(buffer, &offset, &header); result = vdo_validate_header(&SUPER_BLOCK_HEADER_12_0, &header, false, __func__); if (result != VDO_SUCCESS) return result; if (header.size > VDO_COMPONENT_DATA_SIZE + sizeof(u32)) { /* * We can't check release version or checksum until we know the content size, so we * have to assume a version mismatch on unexpected values. */ return vdo_log_error_strerror(VDO_UNSUPPORTED_VERSION, "super block contents too large: %zu", header.size); } /* Skip past the component data for now, to verify the checksum. */ offset += VDO_COMPONENT_DATA_SIZE; checksum = vdo_crc32(buffer, offset); decode_u32_le(buffer, &offset, &saved_checksum); result = VDO_ASSERT(offset == VDO_SUPER_BLOCK_FIXED_SIZE + VDO_COMPONENT_DATA_SIZE, "must have decoded entire superblock payload"); if (result != VDO_SUCCESS) return result; return ((checksum != saved_checksum) ? VDO_CHECKSUM_MISMATCH : VDO_SUCCESS); } vdo-8.3.1.1/utils/vdo/encodings.h000066400000000000000000001254761476467262700165700ustar00rootroot00000000000000/* SPDX-License-Identifier: GPL-2.0-only */ /* * Copyright 2023 Red Hat */ #ifndef VDO_ENCODINGS_H #define VDO_ENCODINGS_H #include #include #include #include "numeric.h" #include "physicalLayer.h" #include "constants.h" #include "types.h" /* * An in-memory representation of a version number for versioned structures on disk. * * A version number consists of two portions, a major version and a minor version. Any format * change which does not require an explicit upgrade step from the previous version should * increment the minor version. Any format change which either requires an explicit upgrade step, * or is wholly incompatible (i.e. can not be upgraded to), should increment the major version, and * set the minor version to 0. */ struct version_number { u32 major_version; u32 minor_version; }; /* * A packed, machine-independent, on-disk representation of a version_number. Both fields are * stored in little-endian byte order. 
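 *
 * For example, version 4.1 (the block map page version) is stored on disk
 * as the eight bytes 04 00 00 00 01 00 00 00.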
*/ struct packed_version_number { __le32 major_version; __le32 minor_version; } __packed; /* The registry of component ids for use in headers */ #define VDO_SUPER_BLOCK 0 #define VDO_LAYOUT 1 #define VDO_RECOVERY_JOURNAL 2 #define VDO_SLAB_DEPOT 3 #define VDO_BLOCK_MAP 4 #define VDO_GEOMETRY_BLOCK 5 /* The header for versioned data stored on disk. */ struct header { u32 id; /* The component this is a header for */ struct version_number version; /* The version of the data format */ size_t size; /* The size of the data following this header */ }; /* A packed, machine-independent, on-disk representation of a component header. */ struct packed_header { __le32 id; struct packed_version_number version; __le64 size; } __packed; enum { VDO_GEOMETRY_BLOCK_LOCATION = 0, VDO_GEOMETRY_MAGIC_NUMBER_SIZE = 8, VDO_DEFAULT_GEOMETRY_BLOCK_VERSION = 5, }; struct index_config { u32 mem; u32 unused; bool sparse; } __packed; enum volume_region_id { VDO_INDEX_REGION = 0, VDO_DATA_REGION = 1, VDO_VOLUME_REGION_COUNT, }; struct volume_region { /* The ID of the region */ enum volume_region_id id; /* * The absolute starting offset on the device. The region continues until the next region * begins. */ physical_block_number_t start_block; } __packed; struct volume_geometry { /* For backwards compatibility */ u32 unused; /* The nonce of this volume */ nonce_t nonce; /* The uuid of this volume */ uuid_t uuid; /* The block offset to be applied to bios */ block_count_t bio_offset; /* The regions in ID order */ struct volume_region regions[VDO_VOLUME_REGION_COUNT]; /* The index config */ struct index_config index_config; } __packed; /* This volume geometry struct is used for sizing only */ struct volume_geometry_4_0 { /* For backwards compatibility */ u32 unused; /* The nonce of this volume */ nonce_t nonce; /* The uuid of this volume */ uuid_t uuid; /* The regions in ID order */ struct volume_region regions[VDO_VOLUME_REGION_COUNT]; /* The index config */ struct index_config index_config; } __packed; extern const u8 VDO_GEOMETRY_MAGIC_NUMBER[VDO_GEOMETRY_MAGIC_NUMBER_SIZE + 1]; /** * DOC: Block map entries * * The entry for each logical block in the block map is encoded into five bytes, which saves space * in both the on-disk and in-memory layouts. It consists of the 36 low-order bits of a * physical_block_number_t (addressing 256 terabytes with a 4KB block size) and a 4-bit encoding of * a block_mapping_state. * * Of the 8 high bits of the 5-byte structure: * * Bits 7..4: The four highest bits of the 36-bit physical block number * Bits 3..0: The 4-bit block_mapping_state * * The following 4 bytes are the low order bytes of the physical block number, in little-endian * order. * * Conversion functions to and from a data location are provided. 
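 *
 * For example (illustrative values, assuming the uncompressed mapping state
 * value of 1), a mapping to PBN 0x123456789 is stored as
 * pbn_high_nibble = 0x1, mapping_state = 1, and pbn_low_word = 0x23456789
 * in little-endian order, i.e. the five bytes 11 89 67 45 23.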
*/ struct block_map_entry { #if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__ unsigned mapping_state : 4; unsigned pbn_high_nibble : 4; #else unsigned pbn_high_nibble : 4; unsigned mapping_state : 4; #endif __le32 pbn_low_word; } __packed; struct block_map_page_header { __le64 nonce; __le64 pbn; /* May be non-zero on disk */ u8 unused_long_word[8]; /* Whether this page has been written twice to disk */ bool initialized; /* Always zero on disk */ u8 unused_byte1; /* May be non-zero on disk */ u8 unused_byte2; u8 unused_byte3; } __packed; struct block_map_page { struct packed_version_number version; struct block_map_page_header header; struct block_map_entry entries[]; } __packed; enum block_map_page_validity { VDO_BLOCK_MAP_PAGE_VALID, VDO_BLOCK_MAP_PAGE_INVALID, /* Valid page found in the wrong location on disk */ VDO_BLOCK_MAP_PAGE_BAD, }; struct block_map_state_2_0 { physical_block_number_t flat_page_origin; block_count_t flat_page_count; physical_block_number_t root_origin; block_count_t root_count; } __packed; struct boundary { page_number_t levels[VDO_BLOCK_MAP_TREE_HEIGHT]; }; extern const struct header VDO_BLOCK_MAP_HEADER_2_0; /* The state of the recovery journal as encoded in the VDO super block. */ struct recovery_journal_state_7_0 { /* Sequence number to start the journal */ sequence_number_t journal_start; /* Number of logical blocks used by VDO */ block_count_t logical_blocks_used; /* Number of block map pages allocated */ block_count_t block_map_data_blocks; } __packed; extern const struct header VDO_RECOVERY_JOURNAL_HEADER_7_0; typedef u16 journal_entry_count_t; /* * A recovery journal entry stores three physical locations: a data location that is the value of a * single mapping in the block map tree, and the two locations of the block map pages and slots * that are acquiring and releasing a reference to the location. The journal entry also stores an * operation code that says whether the mapping is for a logical block or for the block map tree * itself. */ struct recovery_journal_entry { struct block_map_slot slot; struct data_location mapping; struct data_location unmapping; enum journal_operation operation; }; /* The packed, on-disk representation of a recovery journal entry. */ struct packed_recovery_journal_entry { /* * In little-endian bit order: * Bits 15..12: The four highest bits of the 36-bit physical block number of the block map * tree page * Bits 11..2: The 10-bit block map page slot number * Bit 1..0: The journal_operation of the entry (this actually only requires 1 bit, but * it is convenient to keep the extra bit as part of this field. */ #if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__ unsigned operation : 2; unsigned slot_low : 6; unsigned slot_high : 4; unsigned pbn_high_nibble : 4; #else unsigned slot_low : 6; unsigned operation : 2; unsigned pbn_high_nibble : 4; unsigned slot_high : 4; #endif /* * Bits 47..16: The 32 low-order bits of the block map page PBN, in little-endian byte * order */ __le32 pbn_low_word; /* * Bits 87..48: The five-byte block map entry encoding the location that will be stored in * the block map page slot */ struct block_map_entry mapping; /* * Bits 127..88: The five-byte block map entry encoding the location that was stored in the * block map page slot */ struct block_map_entry unmapping; } __packed; /* The packed, on-disk representation of an old format recovery journal entry. 
*/ struct packed_recovery_journal_entry_1 { /* * In little-endian bit order: * Bits 15..12: The four highest bits of the 36-bit physical block number of the block map * tree page * Bits 11..2: The 10-bit block map page slot number * Bits 1..0: The 2-bit journal_operation of the entry * */ #if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__ unsigned operation : 2; unsigned slot_low : 6; unsigned slot_high : 4; unsigned pbn_high_nibble : 4; #else unsigned slot_low : 6; unsigned operation : 2; unsigned pbn_high_nibble : 4; unsigned slot_high : 4; #endif /* * Bits 47..16: The 32 low-order bits of the block map page PBN, in little-endian byte * order */ __le32 pbn_low_word; /* * Bits 87..48: The five-byte block map entry encoding the location that was or will be * stored in the block map page slot */ struct block_map_entry block_map_entry; } __packed; enum journal_operation_1 { VDO_JOURNAL_DATA_DECREMENT = 0, VDO_JOURNAL_DATA_INCREMENT = 1, VDO_JOURNAL_BLOCK_MAP_DECREMENT = 2, VDO_JOURNAL_BLOCK_MAP_INCREMENT = 3, } __packed; struct recovery_block_header { sequence_number_t block_map_head; /* Block map head sequence number */ sequence_number_t slab_journal_head; /* Slab journal head seq. number */ sequence_number_t sequence_number; /* Sequence number for this block */ nonce_t nonce; /* A given VDO instance's nonce */ block_count_t logical_blocks_used; /* Logical blocks in use */ block_count_t block_map_data_blocks; /* Allocated block map pages */ journal_entry_count_t entry_count; /* Number of entries written */ u8 check_byte; /* The protection check byte */ u8 recovery_count; /* Number of recoveries completed */ enum vdo_metadata_type metadata_type; /* Metadata type */ }; /* * The packed, on-disk representation of a recovery journal block header. All fields are kept in * little-endian byte order. */ struct packed_journal_header { /* Block map head 64-bit sequence number */ __le64 block_map_head; /* Slab journal head 64-bit sequence number */ __le64 slab_journal_head; /* The 64-bit sequence number for this block */ __le64 sequence_number; /* A given VDO instance's 64-bit nonce */ __le64 nonce; /* 8-bit metadata type (should always be one for the recovery journal) */ u8 metadata_type; /* 16-bit count of the entries encoded in the block */ __le16 entry_count; /* 64-bit count of the logical blocks used when this block was opened */ __le64 logical_blocks_used; /* 64-bit count of the block map blocks used when this block was opened */ __le64 block_map_data_blocks; /* The protection check byte */ u8 check_byte; /* The number of recoveries completed */ u8 recovery_count; } __packed; struct packed_journal_sector { /* The protection check byte */ u8 check_byte; /* The number of recoveries completed */ u8 recovery_count; /* The number of entries in this sector */ u8 entry_count; /* Journal entries for this sector */ struct packed_recovery_journal_entry entries[]; } __packed; enum { /* The number of entries in each sector (except the last) when filled */ RECOVERY_JOURNAL_ENTRIES_PER_SECTOR = ((VDO_SECTOR_SIZE - sizeof(struct packed_journal_sector)) / sizeof(struct packed_recovery_journal_entry)), RECOVERY_JOURNAL_ENTRIES_PER_BLOCK = RECOVERY_JOURNAL_ENTRIES_PER_SECTOR * 7, /* The number of entries in a v1 recovery journal block. 
*/ RECOVERY_JOURNAL_1_ENTRIES_PER_BLOCK = 311, /* The number of entries in each v1 sector (except the last) when filled */ RECOVERY_JOURNAL_1_ENTRIES_PER_SECTOR = ((VDO_SECTOR_SIZE - sizeof(struct packed_journal_sector)) / sizeof(struct packed_recovery_journal_entry_1)), /* The number of entries in the last sector when a block is full */ RECOVERY_JOURNAL_1_ENTRIES_IN_LAST_SECTOR = (RECOVERY_JOURNAL_1_ENTRIES_PER_BLOCK % RECOVERY_JOURNAL_1_ENTRIES_PER_SECTOR), }; /* A type representing a reference count of a block. */ typedef u8 vdo_refcount_t; /* The absolute position of an entry in a recovery journal or slab journal. */ struct journal_point { sequence_number_t sequence_number; journal_entry_count_t entry_count; }; /* A packed, platform-independent encoding of a struct journal_point. */ struct packed_journal_point { /* * The packed representation is the little-endian 64-bit representation of the low-order 48 * bits of the sequence number, shifted up 16 bits, or'ed with the 16-bit entry count. * * Very long-term, the top 16 bits of the sequence number may not always be zero, as this * encoding assumes--see BZ 1523240. */ __le64 encoded_point; } __packed; /* Special vdo_refcount_t values. */ #define EMPTY_REFERENCE_COUNT 0 enum { MAXIMUM_REFERENCE_COUNT = 254, PROVISIONAL_REFERENCE_COUNT = 255, }; enum { COUNTS_PER_SECTOR = ((VDO_SECTOR_SIZE - sizeof(struct packed_journal_point)) / sizeof(vdo_refcount_t)), COUNTS_PER_BLOCK = COUNTS_PER_SECTOR * VDO_SECTORS_PER_BLOCK, }; /* The format of each sector of a reference_block on disk. */ struct packed_reference_sector { struct packed_journal_point commit_point; vdo_refcount_t counts[COUNTS_PER_SECTOR]; } __packed; struct packed_reference_block { struct packed_reference_sector sectors[VDO_SECTORS_PER_BLOCK]; }; struct slab_depot_state_2_0 { struct slab_config slab_config; physical_block_number_t first_block; physical_block_number_t last_block; zone_count_t zone_count; } __packed; extern const struct header VDO_SLAB_DEPOT_HEADER_2_0; /* * vdo_slab journal blocks may have one of two formats, depending upon whether or not any of the * entries in the block are block map increments. Since the steady state for a VDO is that all of * the necessary block map pages will be allocated, most slab journal blocks will have only data * entries. Such blocks can hold more entries, hence the two formats. */ /* A single slab journal entry */ struct slab_journal_entry { slab_block_number sbn; enum journal_operation operation; bool increment; }; /* A single slab journal entry in its on-disk form */ typedef struct { u8 offset_low8; u8 offset_mid8; #if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__ unsigned offset_high7 : 7; unsigned increment : 1; #else unsigned increment : 1; unsigned offset_high7 : 7; #endif } __packed packed_slab_journal_entry; /* The unpacked representation of the header of a slab journal block */ struct slab_journal_block_header { /* Sequence number for head of journal */ sequence_number_t head; /* Sequence number for this block */ sequence_number_t sequence_number; /* The nonce for a given VDO instance */ nonce_t nonce; /* Recovery journal point for last entry */ struct journal_point recovery_point; /* Metadata type */ enum vdo_metadata_type metadata_type; /* Whether this block contains block map increments */ bool has_block_map_increments; /* The number of entries in the block */ journal_entry_count_t entry_count; }; /* * The packed, on-disk representation of a slab journal block header. All fields are kept in * little-endian byte order. 
*/ struct packed_slab_journal_block_header { /* 64-bit sequence number for head of journal */ __le64 head; /* 64-bit sequence number for this block */ __le64 sequence_number; /* Recovery journal point for the last entry, packed into 64 bits */ struct packed_journal_point recovery_point; /* The 64-bit nonce for a given VDO instance */ __le64 nonce; /* 8-bit metadata type (should always be two, for the slab journal) */ u8 metadata_type; /* Whether this block contains block map increments */ bool has_block_map_increments; /* 16-bit count of the entries encoded in the block */ __le16 entry_count; } __packed; enum { VDO_SLAB_JOURNAL_PAYLOAD_SIZE = VDO_BLOCK_SIZE - sizeof(struct packed_slab_journal_block_header), VDO_SLAB_JOURNAL_FULL_ENTRIES_PER_BLOCK = (VDO_SLAB_JOURNAL_PAYLOAD_SIZE * 8) / 25, VDO_SLAB_JOURNAL_ENTRY_TYPES_SIZE = ((VDO_SLAB_JOURNAL_FULL_ENTRIES_PER_BLOCK - 1) / 8) + 1, VDO_SLAB_JOURNAL_ENTRIES_PER_BLOCK = (VDO_SLAB_JOURNAL_PAYLOAD_SIZE / sizeof(packed_slab_journal_entry)), }; /* The payload of a slab journal block which has block map increments */ struct full_slab_journal_entries { /* The entries themselves */ packed_slab_journal_entry entries[VDO_SLAB_JOURNAL_FULL_ENTRIES_PER_BLOCK]; /* The bit map indicating which entries are block map increments */ u8 entry_types[VDO_SLAB_JOURNAL_ENTRY_TYPES_SIZE]; } __packed; typedef union { /* Entries which include block map increments */ struct full_slab_journal_entries full_entries; /* Entries which are only data updates */ packed_slab_journal_entry entries[VDO_SLAB_JOURNAL_ENTRIES_PER_BLOCK]; /* Ensure the payload fills to the end of the block */ u8 space[VDO_SLAB_JOURNAL_PAYLOAD_SIZE]; } __packed slab_journal_payload; struct packed_slab_journal_block { struct packed_slab_journal_block_header header; slab_journal_payload payload; } __packed; /* The offset of a slab journal tail block. 
*/ typedef u8 tail_block_offset_t; struct slab_summary_entry { /* Bits 7..0: The offset of the tail block within the slab journal */ tail_block_offset_t tail_block_offset; #if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__ /* Bits 13..8: A hint about the fullness of the slab */ unsigned int fullness_hint : 6; /* Bit 14: Whether the ref_counts must be loaded from the layer */ unsigned int load_ref_counts : 1; /* Bit 15: The believed cleanliness of this slab */ unsigned int is_dirty : 1; #else /* Bit 15: The believed cleanliness of this slab */ unsigned int is_dirty : 1; /* Bit 14: Whether the ref_counts must be loaded from the layer */ unsigned int load_ref_counts : 1; /* Bits 13..8: A hint about the fullness of the slab */ unsigned int fullness_hint : 6; #endif } __packed; enum { VDO_SLAB_SUMMARY_FULLNESS_HINT_BITS = 6, VDO_SLAB_SUMMARY_ENTRIES_PER_BLOCK = VDO_BLOCK_SIZE / sizeof(struct slab_summary_entry), VDO_SLAB_SUMMARY_BLOCKS_PER_ZONE = MAX_VDO_SLABS / VDO_SLAB_SUMMARY_ENTRIES_PER_BLOCK, VDO_SLAB_SUMMARY_BLOCKS = VDO_SLAB_SUMMARY_BLOCKS_PER_ZONE * MAX_VDO_PHYSICAL_ZONES, }; struct layout { physical_block_number_t start; block_count_t size; physical_block_number_t first_free; physical_block_number_t last_free; size_t num_partitions; struct partition *head; }; struct partition { enum partition_id id; /* The id of this partition */ physical_block_number_t offset; /* The offset into the layout of this partition */ block_count_t count; /* The number of blocks in the partition */ struct partition *next; /* A pointer to the next partition in the layout */ }; struct layout_3_0 { physical_block_number_t first_free; physical_block_number_t last_free; u8 partition_count; } __packed; struct partition_3_0 { enum partition_id id; physical_block_number_t offset; physical_block_number_t base; /* unused but retained for backwards compatibility */ block_count_t count; } __packed; /* * The configuration of the VDO service. */ struct vdo_config { block_count_t logical_blocks; /* number of logical blocks */ block_count_t physical_blocks; /* number of physical blocks */ block_count_t slab_size; /* number of blocks in a slab */ block_count_t recovery_journal_size; /* number of recovery journal blocks */ block_count_t slab_journal_blocks; /* number of slab journal blocks */ }; /** The maximum logical space is 4 petabytes, which is 1 terablock. */ static const block_count_t MAXIMUM_VDO_LOGICAL_BLOCKS = 1024ULL * 1024 * 1024 * 1024; /** The maximum physical space is 256 terabytes, which is 64 gigablocks. */ static const block_count_t MAXIMUM_VDO_PHYSICAL_BLOCKS = 1024ULL * 1024 * 1024 * 64; /* This is the structure that captures the vdo fields saved as a super block component. */ struct vdo_component { enum vdo_state state; u64 complete_recoveries; u64 read_only_recoveries; struct vdo_config config; nonce_t nonce; }; /* * A packed, machine-independent, on-disk representation of the vdo_config in the VDO component * data in the super block. */ struct packed_vdo_config { __le64 logical_blocks; __le64 physical_blocks; __le64 slab_size; __le64 recovery_journal_size; __le64 slab_journal_blocks; } __packed; /* * A packed, machine-independent, on-disk representation of version 41.0 of the VDO component data * in the super block. */ struct packed_vdo_component_41_0 { __le32 state; __le64 complete_recoveries; __le64 read_only_recoveries; struct packed_vdo_config config; __le64 nonce; } __packed; /* * The version of the on-disk format of a VDO volume. 
This should be incremented any time the * on-disk representation of any VDO structure changes. Changes which require only online upgrade * steps should increment the minor version. Changes which require an offline upgrade or which can * not be upgraded to at all should increment the major version and set the minor version to 0. */ extern const struct version_number VDO_VOLUME_VERSION_67_0; enum { VDO_ENCODED_HEADER_SIZE = sizeof(struct packed_header), BLOCK_MAP_COMPONENT_ENCODED_SIZE = VDO_ENCODED_HEADER_SIZE + sizeof(struct block_map_state_2_0), RECOVERY_JOURNAL_COMPONENT_ENCODED_SIZE = VDO_ENCODED_HEADER_SIZE + sizeof(struct recovery_journal_state_7_0), SLAB_DEPOT_COMPONENT_ENCODED_SIZE = VDO_ENCODED_HEADER_SIZE + sizeof(struct slab_depot_state_2_0), VDO_PARTITION_COUNT = 4, VDO_LAYOUT_ENCODED_SIZE = (VDO_ENCODED_HEADER_SIZE + sizeof(struct layout_3_0) + (sizeof(struct partition_3_0) * VDO_PARTITION_COUNT)), VDO_SUPER_BLOCK_FIXED_SIZE = VDO_ENCODED_HEADER_SIZE + sizeof(u32), VDO_MAX_COMPONENT_DATA_SIZE = VDO_SECTOR_SIZE - VDO_SUPER_BLOCK_FIXED_SIZE, VDO_COMPONENT_ENCODED_SIZE = (sizeof(struct packed_version_number) + sizeof(struct packed_vdo_component_41_0)), VDO_COMPONENT_DATA_OFFSET = VDO_ENCODED_HEADER_SIZE, VDO_COMPONENT_DATA_SIZE = (sizeof(u32) + sizeof(struct packed_version_number) + VDO_COMPONENT_ENCODED_SIZE + VDO_LAYOUT_ENCODED_SIZE + RECOVERY_JOURNAL_COMPONENT_ENCODED_SIZE + SLAB_DEPOT_COMPONENT_ENCODED_SIZE + BLOCK_MAP_COMPONENT_ENCODED_SIZE), }; /* The entirety of the component data encoded in the VDO super block. */ struct vdo_component_states { /* For backwards compatibility */ u32 unused; /* The VDO volume version */ struct version_number volume_version; /* Components */ struct vdo_component vdo; struct block_map_state_2_0 block_map; struct recovery_journal_state_7_0 recovery_journal; struct slab_depot_state_2_0 slab_depot; /* Our partitioning of the underlying storage */ struct layout layout; }; /** * vdo_are_same_version() - Check whether two version numbers are the same. * @version_a: The first version. * @version_b: The second version. * * Return: true if the two versions are the same. */ static inline bool vdo_are_same_version(struct version_number version_a, struct version_number version_b) { return ((version_a.major_version == version_b.major_version) && (version_a.minor_version == version_b.minor_version)); } /** * vdo_is_upgradable_version() - Check whether an actual version is upgradable to an expected * version. * @expected_version: The expected version. * @actual_version: The version being validated. * * An actual version is upgradable if its major number is expected but its minor number differs, * and the expected version's minor number is greater than the actual version's minor number. * * Return: true if the actual version is upgradable. */ static inline bool vdo_is_upgradable_version(struct version_number expected_version, struct version_number actual_version) { return ((expected_version.major_version == actual_version.major_version) && (expected_version.minor_version > actual_version.minor_version)); } int __must_check vdo_validate_header(const struct header *expected_header, const struct header *actual_header, bool exact_size, const char *component_name); void vdo_encode_header(u8 *buffer, size_t *offset, const struct header *header); void vdo_decode_header(u8 *buffer, size_t *offset, struct header *header); /** * vdo_pack_version_number() - Convert a version_number to its packed on-disk representation. * @version: The version number to convert. 
* * Return: the platform-independent representation of the version */ static inline struct packed_version_number vdo_pack_version_number(struct version_number version) { return (struct packed_version_number) { .major_version = __cpu_to_le32(version.major_version), .minor_version = __cpu_to_le32(version.minor_version), }; } /** * vdo_unpack_version_number() - Convert a packed_version_number to its native in-memory * representation. * @version: The version number to convert. * * Return: The platform-independent representation of the version. */ static inline struct version_number vdo_unpack_version_number(struct packed_version_number version) { return (struct version_number) { .major_version = __le32_to_cpu(version.major_version), .minor_version = __le32_to_cpu(version.minor_version), }; } /** * vdo_pack_header() - Convert a component header to its packed on-disk representation. * @header: The header to convert. * * Return: the platform-independent representation of the header */ static inline struct packed_header vdo_pack_header(const struct header *header) { return (struct packed_header) { .id = __cpu_to_le32(header->id), .version = vdo_pack_version_number(header->version), .size = __cpu_to_le64(header->size), }; } /** * vdo_unpack_header() - Convert a packed_header to its native in-memory representation. * @header: The header to convert. * * Return: The platform-independent representation of the version. */ static inline struct header vdo_unpack_header(const struct packed_header *header) { return (struct header) { .id = __le32_to_cpu(header->id), .version = vdo_unpack_version_number(header->version), .size = __le64_to_cpu(header->size), }; } /** * vdo_get_index_region_start() - Get the start of the index region from a geometry. * @geometry: The geometry. * * Return: The start of the index region. */ static inline physical_block_number_t __must_check vdo_get_index_region_start(struct volume_geometry geometry) { return geometry.regions[VDO_INDEX_REGION].start_block; } /** * vdo_get_data_region_start() - Get the start of the data region from a geometry. * @geometry: The geometry. * * Return: The start of the data region. */ static inline physical_block_number_t __must_check vdo_get_data_region_start(struct volume_geometry geometry) { return geometry.regions[VDO_DATA_REGION].start_block; } /** * vdo_get_index_region_size() - Get the size of the index region from a geometry. * @geometry: The geometry. * * Return: The size of the index region. 
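 * Note: the size is computed as the difference between the data region's start block and the
 * index region's start block, so the index region is assumed to run contiguously up to the
 * start of the data region.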
*/ static inline physical_block_number_t __must_check vdo_get_index_region_size(struct volume_geometry geometry) { return vdo_get_data_region_start(geometry) - vdo_get_index_region_start(geometry); } int __must_check vdo_parse_geometry_block(unsigned char *block, struct volume_geometry *geometry); int __must_check encode_volume_geometry(u8 *buffer, size_t *offset, const struct volume_geometry *geometry, u32 version); static inline bool vdo_is_state_compressed(const enum block_mapping_state mapping_state) { return (mapping_state > VDO_MAPPING_STATE_UNCOMPRESSED); } static inline struct block_map_entry vdo_pack_block_map_entry(physical_block_number_t pbn, enum block_mapping_state mapping_state) { return (struct block_map_entry) { .mapping_state = (mapping_state & 0x0F), .pbn_high_nibble = ((pbn >> 32) & 0x0F), .pbn_low_word = __cpu_to_le32(pbn & UINT_MAX), }; } static inline struct data_location vdo_unpack_block_map_entry(const struct block_map_entry *entry) { physical_block_number_t low32 = __le32_to_cpu(entry->pbn_low_word); physical_block_number_t high4 = entry->pbn_high_nibble; return (struct data_location) { .pbn = ((high4 << 32) | low32), .state = entry->mapping_state, }; } static inline bool vdo_is_mapped_location(const struct data_location *location) { return (location->state != VDO_MAPPING_STATE_UNMAPPED); } static inline bool vdo_is_valid_location(const struct data_location *location) { if (location->pbn == VDO_ZERO_BLOCK) return !vdo_is_state_compressed(location->state); else return vdo_is_mapped_location(location); } static inline physical_block_number_t __must_check vdo_get_block_map_page_pbn(const struct block_map_page *page) { return __le64_to_cpu(page->header.pbn); } struct block_map_page *vdo_format_block_map_page(void *buffer, nonce_t nonce, physical_block_number_t pbn, bool initialized); enum block_map_page_validity __must_check vdo_validate_block_map_page(struct block_map_page *page, nonce_t nonce, physical_block_number_t pbn); static inline page_count_t vdo_compute_block_map_page_count(block_count_t entries) { return DIV_ROUND_UP(entries, VDO_BLOCK_MAP_ENTRIES_PER_PAGE); } block_count_t __must_check vdo_compute_new_forest_pages(root_count_t root_count, struct boundary *old_sizes, block_count_t entries, struct boundary *new_sizes); /** * vdo_pack_recovery_journal_entry() - Return the packed, on-disk representation of a recovery * journal entry. * @entry: The journal entry to pack. * * Return: The packed representation of the journal entry. */ static inline struct packed_recovery_journal_entry vdo_pack_recovery_journal_entry(const struct recovery_journal_entry *entry) { return (struct packed_recovery_journal_entry) { .operation = entry->operation, .slot_low = entry->slot.slot & 0x3F, .slot_high = (entry->slot.slot >> 6) & 0x0F, .pbn_high_nibble = (entry->slot.pbn >> 32) & 0x0F, .pbn_low_word = __cpu_to_le32(entry->slot.pbn & UINT_MAX), .mapping = vdo_pack_block_map_entry(entry->mapping.pbn, entry->mapping.state), .unmapping = vdo_pack_block_map_entry(entry->unmapping.pbn, entry->unmapping.state), }; } /** * vdo_unpack_recovery_journal_entry() - Unpack the on-disk representation of a recovery journal * entry. * @entry: The recovery journal entry to unpack. * * Return: The unpacked entry. 
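 * Layout note: the 10-bit slot number is split across slot_low (6 bits) and slot_high (4 bits),
 * and the 36-bit PBN across pbn_low_word (32 bits, little-endian) and pbn_high_nibble.
 * Illustrative example: slot 709 packs as slot_low = 5 and slot_high = 11, and unpacks back to
 * 5 | (11 << 6) = 709.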
*/ static inline struct recovery_journal_entry vdo_unpack_recovery_journal_entry(const struct packed_recovery_journal_entry *entry) { physical_block_number_t low32 = __le32_to_cpu(entry->pbn_low_word); physical_block_number_t high4 = entry->pbn_high_nibble; return (struct recovery_journal_entry) { .operation = entry->operation, .slot = { .pbn = ((high4 << 32) | low32), .slot = (entry->slot_low | (entry->slot_high << 6)), }, .mapping = vdo_unpack_block_map_entry(&entry->mapping), .unmapping = vdo_unpack_block_map_entry(&entry->unmapping), }; } const char * __must_check vdo_get_journal_operation_name(enum journal_operation operation); /** * vdo_is_valid_recovery_journal_sector() - Determine whether the header of the given sector could * describe a valid sector for the given journal block * header. * @header: The unpacked block header to compare against. * @sector: The packed sector to check. * @sector_number: The number of the sector being checked. * * Return: true if the sector matches the block header. */ static inline bool __must_check vdo_is_valid_recovery_journal_sector(const struct recovery_block_header *header, const struct packed_journal_sector *sector, u8 sector_number) { if ((header->check_byte != sector->check_byte) || (header->recovery_count != sector->recovery_count)) return false; if (header->metadata_type == VDO_METADATA_RECOVERY_JOURNAL_2) return sector->entry_count <= RECOVERY_JOURNAL_ENTRIES_PER_SECTOR; if (sector_number == 7) return sector->entry_count <= RECOVERY_JOURNAL_1_ENTRIES_IN_LAST_SECTOR; return sector->entry_count <= RECOVERY_JOURNAL_1_ENTRIES_PER_SECTOR; } /** * vdo_compute_recovery_journal_block_number() - Compute the physical block number of the recovery * journal block which would have a given sequence * number. * @journal_size: The size of the journal. * @sequence_number: The sequence number. * * Return: The pbn of the journal block which would the specified sequence number. */ static inline physical_block_number_t __must_check vdo_compute_recovery_journal_block_number(block_count_t journal_size, sequence_number_t sequence_number) { /* * Since journal size is a power of two, the block number modulus can just be extracted * from the low-order bits of the sequence. */ return (sequence_number & (journal_size - 1)); } /** * vdo_get_journal_block_sector() - Find the recovery journal sector from the block header and * sector number. * @header: The header of the recovery journal block. * @sector_number: The index of the sector (1-based). * * Return: A packed recovery journal sector. */ static inline struct packed_journal_sector * __must_check vdo_get_journal_block_sector(struct packed_journal_header *header, int sector_number) { char *sector_data = ((char *) header) + (VDO_SECTOR_SIZE * sector_number); return (struct packed_journal_sector *) sector_data; } /** * vdo_pack_recovery_block_header() - Generate the packed representation of a recovery block * header. * @header: The header containing the values to encode. * @packed: The header into which to pack the values. 
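 * All multi-byte fields are converted to little-endian; check_byte, recovery_count, and
 * metadata_type are single bytes and are copied unchanged.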
*/ static inline void vdo_pack_recovery_block_header(const struct recovery_block_header *header, struct packed_journal_header *packed) { *packed = (struct packed_journal_header) { .block_map_head = __cpu_to_le64(header->block_map_head), .slab_journal_head = __cpu_to_le64(header->slab_journal_head), .sequence_number = __cpu_to_le64(header->sequence_number), .nonce = __cpu_to_le64(header->nonce), .logical_blocks_used = __cpu_to_le64(header->logical_blocks_used), .block_map_data_blocks = __cpu_to_le64(header->block_map_data_blocks), .entry_count = __cpu_to_le16(header->entry_count), .check_byte = header->check_byte, .recovery_count = header->recovery_count, .metadata_type = header->metadata_type, }; } /** * vdo_unpack_recovery_block_header() - Decode the packed representation of a recovery block * header. * @packed: The packed header to decode. * * Return: The unpacked header. */ static inline struct recovery_block_header vdo_unpack_recovery_block_header(const struct packed_journal_header *packed) { return (struct recovery_block_header) { .block_map_head = __le64_to_cpu(packed->block_map_head), .slab_journal_head = __le64_to_cpu(packed->slab_journal_head), .sequence_number = __le64_to_cpu(packed->sequence_number), .nonce = __le64_to_cpu(packed->nonce), .logical_blocks_used = __le64_to_cpu(packed->logical_blocks_used), .block_map_data_blocks = __le64_to_cpu(packed->block_map_data_blocks), .entry_count = __le16_to_cpu(packed->entry_count), .check_byte = packed->check_byte, .recovery_count = packed->recovery_count, .metadata_type = packed->metadata_type, }; } /** * vdo_compute_slab_count() - Compute the number of slabs a depot with given parameters would have. * @first_block: PBN of the first data block. * @last_block: PBN of the last data block. * @slab_size_shift: Exponent for the number of blocks per slab. * * Return: The number of slabs. */ static inline slab_count_t vdo_compute_slab_count(physical_block_number_t first_block, physical_block_number_t last_block, unsigned int slab_size_shift) { return (slab_count_t) ((last_block - first_block) >> slab_size_shift); } int __must_check vdo_configure_slab_depot(const struct partition *partition, struct slab_config slab_config, zone_count_t zone_count, struct slab_depot_state_2_0 *state); int __must_check vdo_configure_slab(block_count_t slab_size, block_count_t slab_journal_blocks, struct slab_config *slab_config); /** * vdo_get_saved_reference_count_size() - Get the number of blocks required to save a reference * counts state covering the specified number of data * blocks. * @block_count: The number of physical data blocks that can be referenced. * * Return: The number of blocks required to save reference counts with the given block count. */ static inline block_count_t vdo_get_saved_reference_count_size(block_count_t block_count) { return DIV_ROUND_UP(block_count, COUNTS_PER_BLOCK); } /** * vdo_get_slab_journal_start_block() - Get the physical block number of the start of the slab * journal relative to the start block allocator partition. * @slab_config: The slab configuration of the VDO. * @origin: The first block of the slab. */ static inline physical_block_number_t __must_check vdo_get_slab_journal_start_block(const struct slab_config *slab_config, physical_block_number_t origin) { return origin + slab_config->data_blocks + slab_config->reference_count_blocks; } /** * vdo_advance_journal_point() - Move the given journal point forward by one entry. * @point: The journal point to adjust. 
* @entries_per_block: The number of entries in one full block. */ static inline void vdo_advance_journal_point(struct journal_point *point, journal_entry_count_t entries_per_block) { point->entry_count++; if (point->entry_count == entries_per_block) { point->sequence_number++; point->entry_count = 0; } } /** * vdo_before_journal_point() - Check whether the first point precedes the second point. * @first: The first journal point. * @second: The second journal point. * * Return: true if the first point precedes the second point. */ static inline bool vdo_before_journal_point(const struct journal_point *first, const struct journal_point *second) { return ((first->sequence_number < second->sequence_number) || ((first->sequence_number == second->sequence_number) && (first->entry_count < second->entry_count))); } /** * vdo_pack_journal_point() - Encode the journal location represented by a * journal_point into a packed_journal_point. * @unpacked: The unpacked input point. * @packed: The packed output point. */ static inline void vdo_pack_journal_point(const struct journal_point *unpacked, struct packed_journal_point *packed) { packed->encoded_point = __cpu_to_le64((unpacked->sequence_number << 16) | unpacked->entry_count); } /** * vdo_unpack_journal_point() - Decode the journal location represented by a packed_journal_point * into a journal_point. * @packed: The packed input point. * @unpacked: The unpacked output point. */ static inline void vdo_unpack_journal_point(const struct packed_journal_point *packed, struct journal_point *unpacked) { u64 native = __le64_to_cpu(packed->encoded_point); unpacked->sequence_number = (native >> 16); unpacked->entry_count = (native & 0xffff); } /** * vdo_pack_slab_journal_block_header() - Generate the packed representation of a slab block * header. * @header: The header containing the values to encode. * @packed: The header into which to pack the values. */ static inline void vdo_pack_slab_journal_block_header(const struct slab_journal_block_header *header, struct packed_slab_journal_block_header *packed) { packed->head = __cpu_to_le64(header->head); packed->sequence_number = __cpu_to_le64(header->sequence_number); packed->nonce = __cpu_to_le64(header->nonce); packed->entry_count = __cpu_to_le16(header->entry_count); packed->metadata_type = header->metadata_type; packed->has_block_map_increments = header->has_block_map_increments; vdo_pack_journal_point(&header->recovery_point, &packed->recovery_point); } /** * vdo_unpack_slab_journal_block_header() - Decode the packed representation of a slab block * header. * @packed: The packed header to decode. * @header: The header into which to unpack the values. */ static inline void vdo_unpack_slab_journal_block_header(const struct packed_slab_journal_block_header *packed, struct slab_journal_block_header *header) { *header = (struct slab_journal_block_header) { .head = __le64_to_cpu(packed->head), .sequence_number = __le64_to_cpu(packed->sequence_number), .nonce = __le64_to_cpu(packed->nonce), .entry_count = __le16_to_cpu(packed->entry_count), .metadata_type = packed->metadata_type, .has_block_map_increments = packed->has_block_map_increments, }; vdo_unpack_journal_point(&packed->recovery_point, &header->recovery_point); } /** * vdo_pack_slab_journal_entry() - Generate the packed encoding of a slab journal entry. * @packed: The entry into which to pack the values. * @sbn: The slab block number of the entry to encode. * @is_increment: The increment flag. 
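 * A packed entry occupies 3 bytes: a 23-bit slab block number split across offset_low8,
 * offset_mid8, and offset_high7, plus a single increment bit. Illustrative example: an sbn of
 * 0x123456 packs as offset_low8 = 0x56, offset_mid8 = 0x34, offset_high7 = 0x12. When a block
 * carries block map increments, one extra type bit per entry lives in the entry_types bitmap of
 * struct full_slab_journal_entries, which is why VDO_SLAB_JOURNAL_FULL_ENTRIES_PER_BLOCK budgets
 * 25 bits per entry rather than 24.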
*/ static inline void vdo_pack_slab_journal_entry(packed_slab_journal_entry *packed, slab_block_number sbn, bool is_increment) { packed->offset_low8 = (sbn & 0x0000FF); packed->offset_mid8 = (sbn & 0x00FF00) >> 8; packed->offset_high7 = (sbn & 0x7F0000) >> 16; packed->increment = is_increment ? 1 : 0; } /** * vdo_unpack_slab_journal_entry() - Decode the packed representation of a slab journal entry. * @packed: The packed entry to decode. * * Return: The decoded slab journal entry. */ static inline struct slab_journal_entry __must_check vdo_unpack_slab_journal_entry(const packed_slab_journal_entry *packed) { struct slab_journal_entry entry; entry.sbn = packed->offset_high7; entry.sbn <<= 8; entry.sbn |= packed->offset_mid8; entry.sbn <<= 8; entry.sbn |= packed->offset_low8; entry.operation = VDO_JOURNAL_DATA_REMAPPING; entry.increment = packed->increment; return entry; } struct slab_journal_entry __must_check vdo_decode_slab_journal_entry(struct packed_slab_journal_block *block, journal_entry_count_t entry_count); /** * vdo_get_slab_summary_hint_shift() - Compute the shift for slab summary hints. * @slab_size_shift: Exponent for the number of blocks per slab. * * Return: The hint shift. */ static inline u8 __must_check vdo_get_slab_summary_hint_shift(unsigned int slab_size_shift) { return ((slab_size_shift > VDO_SLAB_SUMMARY_FULLNESS_HINT_BITS) ? (slab_size_shift - VDO_SLAB_SUMMARY_FULLNESS_HINT_BITS) : 0); } int __must_check vdo_initialize_layout(block_count_t size, physical_block_number_t offset, block_count_t block_map_blocks, block_count_t journal_blocks, block_count_t summary_blocks, struct layout *layout); void vdo_uninitialize_layout(struct layout *layout); int __must_check vdo_get_partition(struct layout *layout, enum partition_id id, struct partition **partition_ptr); struct partition * __must_check vdo_get_known_partition(struct layout *layout, enum partition_id id); int vdo_validate_config(const struct vdo_config *config, block_count_t physical_block_count, block_count_t logical_block_count); void vdo_destroy_component_states(struct vdo_component_states *states); int __must_check vdo_decode_component_states(u8 *buffer, struct volume_geometry *geometry, struct vdo_component_states *states); int __must_check vdo_validate_component_states(struct vdo_component_states *states, nonce_t geometry_nonce, block_count_t physical_size, block_count_t logical_size); void vdo_encode_super_block(u8 *buffer, struct vdo_component_states *states); int __must_check vdo_decode_super_block(u8 *buffer); /* We start with 0L and postcondition with ~0L to match our historical usage in userspace. */ static inline u32 vdo_crc32(const void *buf, unsigned long len) { /* * Different from the kernelspace wrapper because the kernel implementation doesn't * precondition or postcondition the data; the userspace implementation does. So, despite * the difference in these two implementations, they actually do the same checksum. */ return crc32(~0L, buf, len); } #endif /* VDO_ENCODINGS_H */ vdo-8.3.1.1/utils/vdo/fileLayer.c000066400000000000000000000306421476467262700165140ustar00rootroot00000000000000/* * Copyright Red Hat * * This program is free software; you can redistribute it and/or * modify it under the terms of the GNU General Public License * as published by the Free Software Foundation; either version 2 * of the License, or (at your option) any later version. 
* * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with this program; if not, write to the Free Software * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA * 02110-1301, USA. */ #include "fileLayer.h" #include #include #include #include #include "fileUtils.h" #include "logger.h" #include "memory-alloc.h" #include "permassert.h" #include "syscalls.h" #include "constants.h" #include "status-codes.h" typedef struct fileLayer { PhysicalLayer common; block_count_t blockCount; block_count_t fileOffset; int fd; size_t alignment; char name[]; } FileLayer; /**********************************************************************/ static inline FileLayer *asFileLayer(PhysicalLayer *layer) { STATIC_ASSERT(offsetof(FileLayer, common) == 0); return (FileLayer *) layer; } /**********************************************************************/ static block_count_t getBlockCount(PhysicalLayer *header) { return asFileLayer(header)->blockCount; } /** * An implementation of buffer_allocator that creates a buffer * properly aligned for direct I/O to the device under the file layer. * * @param [in] layer The file layer in question * @param [in] bytes The size of the buffer, in bytes * @param [in] why The occasion for allocating the buffer * @param [out] buffer_ptr A pointer to hold the buffer * * @return a success or error code **/ static int allocateIOBuffer(PhysicalLayer *header, size_t bytes, const char *why, char **bufferPtr) { if ((bytes % VDO_BLOCK_SIZE) != 0) { return vdo_log_error_strerror(UDS_INVALID_ARGUMENT, "IO buffers must be" " a multiple of the VDO block size"); } return vdo_allocate_memory(bytes, asFileLayer(header)->alignment, why, bufferPtr); } /** * Check if the provided buffer is properly aligned for the device * under the file layer; if so, return it, otherwise allocate a new, properly * aligned buffer and return that. * * @param [in] layer The file layer in question * @param [in] buffer A buffer to use, if aligned * @param [in] bytes The size of the buffer, in bytes * @param [in] why The occasion for allocating the buffer * @param [out] alignedBufferPtr A pointer to hold the buffer * * @return a success or error code **/ static int makeAlignedBuffer(FileLayer *layer, char *buffer, size_t bytes, const char *what, char **alignedBufferPtr) { if ((((uintptr_t) buffer) % layer->alignment) == 0) { *alignedBufferPtr = buffer; return VDO_SUCCESS; } return allocateIOBuffer(&layer->common, bytes, what, alignedBufferPtr); } /** * Perform an I/O using a properly aligned buffer. * * @param [in] layer The layer from which to read or write * @param [in] startBlock The physical block number of the start of the * extent * @param [in] blockCount The number of blocks in the extent * @param [in] read Whether the I/O to perform is a read * @param [in/out] buffer The buffer to read into or write from * * @return VDO_SUCCESS or an error code **/ static int performIO(FileLayer *layer, physical_block_number_t startBlock, size_t bytes, bool read, char *buffer) { // Make sure we cast so we get a proper 64 bit value on the calculation off_t offset = (off_t) startBlock * VDO_BLOCK_SIZE; ssize_t n; for (; bytes > 0; bytes -= n) { n = (read ? 
pread(layer->fd, buffer, bytes, offset) : pwrite(layer->fd, buffer, bytes, offset)); if (n <= 0) { if (n == 0) { errno = VDO_UNEXPECTED_EOF; } return vdo_log_error_strerror(errno, "p%s %s @%zd", (read ? "read" : "write"), layer->name, offset); } offset += n; buffer += n; } return VDO_SUCCESS; } /**********************************************************************/ static int fileReader(PhysicalLayer *header, physical_block_number_t startBlock, size_t blockCount, char *buffer) { FileLayer *layer = asFileLayer(header); startBlock += layer->fileOffset; if (startBlock + blockCount > layer->blockCount) { return VDO_OUT_OF_RANGE; } vdo_log_debug("FL: Reading %zu blocks from block %llu", blockCount, (unsigned long long) startBlock); // Make sure we cast so we get a proper 64 bit value on the calculation char *alignedBuffer; size_t bytes = VDO_BLOCK_SIZE * blockCount; int result = makeAlignedBuffer(layer, buffer, bytes, "aligned read buffer", &alignedBuffer); if (result != VDO_SUCCESS) { return result; } result = performIO(layer, startBlock, bytes, true, alignedBuffer); if (alignedBuffer != buffer) { memcpy(buffer, alignedBuffer, bytes); vdo_free(alignedBuffer); } return result; } /**********************************************************************/ static int fileWriter(PhysicalLayer *header, physical_block_number_t startBlock, size_t blockCount, char *buffer) { FileLayer *layer = asFileLayer(header); startBlock += layer->fileOffset; if (startBlock + blockCount > layer->blockCount) { return VDO_OUT_OF_RANGE; } vdo_log_debug("FL: Writing %zu blocks from block %llu", blockCount, (unsigned long long) startBlock); // Make sure we cast so we get a proper 64 bit value on the calculation size_t bytes = blockCount * VDO_BLOCK_SIZE; char *alignedBuffer; int result = makeAlignedBuffer(layer, buffer, bytes, "aligned write buffer", &alignedBuffer); if (result != VDO_SUCCESS) { return result; } if (alignedBuffer != buffer) { memcpy(alignedBuffer, buffer, bytes); } result = performIO(layer, startBlock, bytes, false, alignedBuffer); if (alignedBuffer != buffer) { vdo_free(alignedBuffer); } return result; } /**********************************************************************/ static int noWriter(PhysicalLayer *header __attribute__((unused)), physical_block_number_t startBlock __attribute__((unused)), size_t blockCount __attribute__((unused)), char *buffer __attribute__((unused))) { return EPERM; } /**********************************************************************/ static int isBlockDevice(const char *path, bool *device) { struct stat statbuf; int result = logging_stat_missing_ok(path, &statbuf, __func__); if (result == UDS_SUCCESS) { *device = (bool) (S_ISBLK(statbuf.st_mode)); } return result; } /** * Free a FileLayer and NULL out the reference to it. * * Implements layer_destructor. * * @param layerPtr A pointer to the layer to free **/ static void freeLayer(PhysicalLayer **layerPtr) { PhysicalLayer *layer = *layerPtr; if (layer == NULL) { return; } FileLayer *fileLayer = asFileLayer(layer); try_sync_and_close_file(fileLayer->fd); vdo_free(fileLayer); *layerPtr = NULL; } /** * Internal constructor to make a file layer. 
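 * The file is opened for direct I/O (read-only or read-write as requested) after verifying that
 * it exists. For block devices the size is obtained with the BLKGETSIZE64 ioctl, otherwise from
 * st_size; a blockCount of zero is filled in from that size, while a non-zero blockCount must
 * match it. The device's st_blksize is recorded as the alignment required for I/O buffers.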
* * @param [in] name the name of the underlying file * @param [in] readOnly whether the layer is not allowed to write * @param [in] blockCount the span of the file, in blocks (may be zero for * read-only layers in which case it is computed) * @param [in] fileOffset the block offset to apply to I/O operations * @param [out] layerPtr the pointer to hold the result * * @return a success or error code **/ static int setupFileLayer(const char *name, bool readOnly, block_count_t blockCount, block_count_t fileOffset, PhysicalLayer **layerPtr) { int result = VDO_ASSERT(layerPtr != NULL, "layerPtr must not be NULL"); if (result != VDO_SUCCESS) { return result; } size_t nameLen = strlen(name) + 1; FileLayer *layer = NULL; result = vdo_allocate_extended(FileLayer, nameLen, char, "file layer", &layer); if (result != VDO_SUCCESS) { return result; } layer->blockCount = blockCount; layer->fileOffset = fileOffset; strcpy(layer->name, name); bool exists = false; result = file_exists(layer->name, &exists); if (result != UDS_SUCCESS) { vdo_free(layer); return result; } if (!exists) { vdo_free(layer); return ENOENT; } enum file_access access = readOnly ? FU_READ_ONLY_DIRECT : FU_READ_WRITE_DIRECT; result = open_file(layer->name, access, &layer->fd); if (result != UDS_SUCCESS) { vdo_free(layer); return result; } bool blockDevice = false; result = isBlockDevice(layer->name, &blockDevice); if (result != UDS_SUCCESS) { try_close_file(layer->fd); vdo_free(layer); return result; } // Determine the block size of the file or device struct stat statbuf; result = logging_fstat(layer->fd, &statbuf, __func__); if (result != UDS_SUCCESS) { try_close_file(layer->fd); vdo_free(layer); return result; } // Make sure the physical blocks == size of the block device block_count_t deviceBlocks; if (blockDevice) { uint64_t bytes; if (ioctl(layer->fd, BLKGETSIZE64, &bytes) < 0) { result = vdo_log_error_strerror(errno, "get size of %s", layer->name); try_close_file(layer->fd); vdo_free(layer); return result; } deviceBlocks = bytes / VDO_BLOCK_SIZE; } else { deviceBlocks = statbuf.st_size / VDO_BLOCK_SIZE; } if (layer->blockCount == 0) { layer->blockCount = deviceBlocks; } else if (layer->blockCount != deviceBlocks) { result = vdo_log_error_strerror(VDO_PARAMETER_MISMATCH, "physical size %ld 4k blocks must match" " physical size %ld 4k blocks of %s", layer->blockCount, deviceBlocks, layer->name); try_close_file(layer->fd); vdo_free(layer); return result; } layer->alignment = statbuf.st_blksize; layer->common.destroy = freeLayer; layer->common.getBlockCount = getBlockCount; layer->common.allocateIOBuffer = allocateIOBuffer; layer->common.reader = fileReader; layer->common.writer = readOnly ? 
noWriter : fileWriter; *layerPtr = &layer->common; return VDO_SUCCESS; } /**********************************************************************/ int makeFileLayer(const char *name, block_count_t blockCount, PhysicalLayer **layerPtr) { return setupFileLayer(name, false, blockCount, 0, layerPtr); } /**********************************************************************/ int makeReadOnlyFileLayer(const char *name, PhysicalLayer **layerPtr) { return setupFileLayer(name, true, 0, 0, layerPtr); } /**********************************************************************/ int makeOffsetFileLayer(const char *name, block_count_t blockCount, block_count_t fileOffset, PhysicalLayer **layerPtr) { return setupFileLayer(name, false, blockCount, fileOffset, layerPtr); } vdo-8.3.1.1/utils/vdo/fileLayer.h000066400000000000000000000042071476467262700165170ustar00rootroot00000000000000/* * Copyright Red Hat * * This program is free software; you can redistribute it and/or * modify it under the terms of the GNU General Public License * as published by the Free Software Foundation; either version 2 * of the License, or (at your option) any later version. * * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with this program; if not, write to the Free Software * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA * 02110-1301, USA. */ #ifndef FILE_LAYER_H #define FILE_LAYER_H #include "physicalLayer.h" /** * Make a file layer implementation of a physical layer. * * @param [in] name the name of the underlying file * @param [in] blockCount the span of the file, in blocks * @param [out] layerPtr the pointer to hold the result * * @return a success or error code **/ int __must_check makeFileLayer(const char *name, block_count_t blockCount, PhysicalLayer **layerPtr); /** * Make a read only file layer implementation of a physical layer. * * @param [in] name the name of the underlying file * @param [out] layerPtr the pointer to hold the result * * @return a success or error code **/ int __must_check makeReadOnlyFileLayer(const char *name, PhysicalLayer **layerPtr); /** * Make an offset file layer implementation of a physical layer. * * @param [in] name the name of the underlying file * @param [in] blockCount the span of the file, in blocks * @param [in] fileOffset the block offset to apply to I/O operations * @param [out] layerPtr the pointer to hold the result * * @return a success or error code **/ int makeOffsetFileLayer(const char *name, block_count_t blockCount, block_count_t fileOffset, PhysicalLayer **layerPtr) __attribute__((warn_unused_result)); #endif // FILE_LAYER_H vdo-8.3.1.1/utils/vdo/man/000077500000000000000000000000001476467262700152025ustar00rootroot00000000000000vdo-8.3.1.1/utils/vdo/man/Makefile000066400000000000000000000024111476467262700166400ustar00rootroot00000000000000# # Copyright Red Hat # # This program is free software; you can redistribute it and/or # modify it under the terms of the GNU General Public License # as published by the Free Software Foundation; either version 2 # of the License, or (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 
See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA # 02110-1301, USA. # INSTALLFILES= \ adaptlvm.8 \ vdoaudit.8 \ vdodebugmetadata.8 \ vdodumpblockmap.8 \ vdodumpmetadata.8 \ vdoforcerebuild.8 \ vdoformat.8 \ vdolistmetadata.8 \ vdoreadonly.8 \ vdorecover.8 \ vdostats.8 INSTALL = install INSTALLOWNER ?= -o root -g root mandir ?= /usr/man INSTALLDIR=$(DESTDIR)/$(mandir) .PHONY: all clean install all:; clean:; install: $(INSTALL) $(INSTALLOWNER) -d $(INSTALLDIR)/man8 for i in $(INSTALLFILES); do \ $(INSTALL) $(INSTALLOWNER) -m 644 $$i $(INSTALLDIR)/man8; \ done vdo-8.3.1.1/utils/vdo/man/adaptlvm.8000066400000000000000000000007711476467262700171100ustar00rootroot00000000000000.TH ADAPTLVM 8 "2022-01-27" "Red Hat" \" -*- nroff -*- .SH NAME adaptlvm \- Adapts an LVMVDO base storage to RW .SH SYNOPSIS .B adaptlvm .I filename .SH DESCRIPTION .B adaptlvm Tears down an LVMVDO stack to expose the VDO backing device as read only. .PP .SH OPTIONS .TP .B setRW Set the backing storage as Read/Write .TP .B setRO Sets the backing storage back to Read-Only .TP .B volume_group/logical_volume The volume group name and logical volume name for the LVMVDO volume . .SH SEE ALSO .BR vdo (8). vdo-8.3.1.1/utils/vdo/man/vdoaudit.8000066400000000000000000000024001476467262700171060ustar00rootroot00000000000000.TH VDOAUDIT 8 "2023-03-28" "Red Hat" \" -*- nroff -*- .SH NAME vdoaudit \- confirm the reference counts of a VDO device .SH SYNOPSIS .B vdoaudit .RI [ options... ] .I filename .SH DESCRIPTION .B vdoaudit adds up the logical block references to all physical blocks of a VDO device found in \fIfilename\fP, then compares that sum to the stored number of logical blocks. It also confirms all of the actual reference counts on all physical blocks against the stored reference counts. Finally, it validates that the slab summary approximation of the free blocks in each slab is correct. .PP .I filename must be the path of the VDODataLV as described in \fBlvmvdo\fP(7). .PP If \-\-verbose is specified, a line item will be reported for each inconsistency; otherwise a summary of the problems will be displayed. .SH OPTIONS .TP .B \-\-help Print this help message and exit. .TP .B \-\-summary Display a summary of any problems found on the volume. .TP .B \-\-verbose Display a line item for each inconsistency found on the volume. .TP .B \-\-version Show the version of vdoaudit. . .SH EXAMPLE .nf # lvchange -ay vdo1/vdo0pool_vdata # vdoaudit --verbose /dev/mapper/vdo1-vdo0pool_vdata # lvchange -an vdo1/vdo0pool_vdata .fi .\" .SH NOTES .SH SEE ALSO .BR lvmvdo (7), .BR lvchange (8) vdo-8.3.1.1/utils/vdo/man/vdodebugmetadata.8000066400000000000000000000022161476467262700205740ustar00rootroot00000000000000.TH VDODEBUGMETADATA 8 "2020-05-08" "Red Hat" \" -*- nroff -*- .SH NAME vdodebugmetadata \- load a metadata dump of a VDO device .SH SYNOPSIS .B vdodebugmetadata .RB [ \-\-pbn=\fIpbn\fP \&.\|.\|.\&] .RB [ \-\-searchLBN=\fIlbn\fP \&.\|.\|.\&] .I filename .SH DESCRIPTION .B vdodebugmetadata loads the metadata regions dumped by \fBvdodumpmetadata\fP. It should be run under GDB, with a breakpoint on the function \%doNothing. 
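.PP
An illustrative session (the dump file name here is only an example; see
\fBvdodumpmetadata\fP(8) for producing one):
.EX
# gdb --args vdodebugmetadata vdo1-meta-dump
(gdb) break doNothing
(gdb) run
.EE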
.PP Variables \%vdo, \%slabSummary, \%slabs, and \%recoveryJournal are available, providing access to the VDO super block state, the slab summary blocks, all slab journal and reference blocks per slab, and all recovery journal blocks. .PP Please note that this tool does not provide access to block map pages. .SH OPTIONS .TP \-\-pbn Print the slab journal entries for the given PBN. This option may be specified up to 255 times. .TP \-\-searchLBN Print the recovery journal entries for the given LBN. This includes PBN, increment/decrement, mapping state, recovery journal position information, and whether the recovery journal block is valid. This option may be specified up to 255 times. .SH SEE ALSO .BR vdo (8), .BR vdodumpmetadata(8). vdo-8.3.1.1/utils/vdo/man/vdodumpblockmap.8000066400000000000000000000014101476467262700204560ustar00rootroot00000000000000.TH VDODUMPBLOCKMAP 8 "2023-03-28" "Red Hat" \" -*- nroff -*- .SH NAME vdodumpblockmap \- dump the LBA->PBA mappings of a VDO device .SH SYNOPSIS .B vdodumpblockmap .RB [ \-\-lba=\fIlba\fP ] .I filename .SH DESCRIPTION .B vdodumpblockmap dumps all (or only the specified) LBA->PBA mappings from a cleanly shut down VDO device. .PP .I filename must be the path of the VDODataLV as described in \fBlvmvdo\fP(7). .SH OPTIONS .TP .B \-\-help Print this help message and exit. .TP .B \-\-lba Dump only the mapping for the specified LBA. .TP .B \-\-version Show the version of vdodumpblockmap. . .SH EXAMPLE .nf # lvchange -ay vdo1/vdo0pool_vdata # vdodumpblockmap /dev/mapper/vdo1-vdo0pool_vdata # lvchange -an vdo1/vdo0pool_vdata .fi .SH SEE ALSO .BR lvmvdo (7), .BR lvchange (8) vdo-8.3.1.1/utils/vdo/man/vdodumpmetadata.8000066400000000000000000000023161476467262700204540ustar00rootroot00000000000000.TH VDODUMPMETADATA 8 "2023-03-38" "Red Hat" \" -*- nroff -*- .SH NAME vdodumpmetadata \- dump the metadata regions from a VDO device .SH SYNOPSIS .B vdodumpmetadata .RB [ \-\-no\-block\-map ] .RB [ \-\-lbn=\fIlbn\fP ] .I vdoBacking outputFile .SH DESCRIPTION .B vdodumpmetadata dumps the metadata regions of a VDO device to another file, to enable save and transfer of metadata from a VDO without transfer of the entire backing store. .PP .I vdoBacking must be the path of the VDODataLV as described in \fBlvmvdo\fP(7). .PP .B vdodumpmetadata will produce a large output file. The expected size is roughly equal to VDO's metadata size. A rough estimate of the storage needed is 1.4 GB per TB of logical space. .SH OPTIONS .TP \-\-no\-block\-map Omit the block map. The output file will be of size no higher than 130MB + (9 MB per slab). .TP \-\-lbn Saves the block map page associated with the specified LBN in the output file. This option may be specified up to 255 times. Implies \-\-no\-block\-map. .SH EXAMPLE .nf # lvchange -ay vdo1/vdo0pool_vdata # vdodumpmetadata /dev/mapper/vdo1-vdo0pool_vdata vdo1-meta-dump # lvchange -an vdo1/vdo0pool_vdata .fi .SH SEE ALSO .BR lvmvdo (7), .BR lvchange (8), .BR vdodebugmetadata (8) vdo-8.3.1.1/utils/vdo/man/vdoforcerebuild.8000066400000000000000000000020131476467262700204450ustar00rootroot00000000000000.TH VDOFORCEREBUILD 8 "2023-04-14" "Red Hat" \" -*- nroff -*- .SH NAME vdoforcerebuild \- prepare a VDO device to exit read-only mode .SH SYNOPSIS .B vdoforcerebuild .I filename .SH DESCRIPTION .B vdoforcerebuild forces an existing VDO device to exit read-only mode and to attempt to regenerate as much metadata as possible. .PP .I filename must be the path of the VDODataLV as described in \fBlvmvdo\fP(7). 
Since \fBlvchange\fP(8) will only mount that as read-only, a writable version of that device must be manually created, as shown in the example below. .PP .SH OPTIONS .TP .B \-\-help Print this help message and exit. .TP .B \-\-version Show the version of vdoforcerebuild. .SH EXAMPLE .nf # lvchange -ay vdo1/vdo0pool_vdata # dmsetup table vdo1-vdo0pool_vdata > vdata.table # lvchange -an vdo1/vdo0pool_vdata # dmsetup create vdo1-vdo0pool_vdata --table "`cat vdata.table`" # vdoforcerebuild /dev/mapper/vdo1-vdo0pool_vdata # dmsetup remove vdo1-vdo0pool_vdata .fi .SH SEE ALSO .BR lvmvdo (7), .BR lvchange (8), .BR dmsetup (8) vdo-8.3.1.1/utils/vdo/man/vdoformat.8000066400000000000000000000037531476467262700173040ustar00rootroot00000000000000.TH VDOFORMAT 8 "2017-09-12" "Red Hat" \" -*- nroff -*- .SH NAME vdoformat \- format a VDO device .SH SYNOPSIS .B vdoformat .RI [ options... ] .I filename .SH DESCRIPTION .B vdoformat formats the file named by .I filename as a VDO device. This is analogous to low-level device formatting. The device will not be formatted if it already contains a VDO, unless the --force flag is used. .PP .B vdoformat can also modify some of the formatting parameters. .SH OPTIONS .TP .B \-\-format Format the block device, even if there is already a VDO formatted thereupon. .TP .B \-\-help Print this help message and exit. .TP .B \-\-logical\-size=\fIsize\fP Set the logical (provisioned) size of the VDO device to \fIsize\fP. A size suffix of K for kilobytes, M for megabytes, G for gigabytes, T for terabytes, or P for petabytes is optional. The default unit is megabytes. .TP .B \-\-slab\-bits=\fIbits\fP Set the free space allocator's slab size to 2^\fIbits\fP 4 KB blocks. \fIbits\fP must be a value between 13 and 23 (inclusive), corresponding to a slab size between 32 MB and 32 GB. The default value is 19 which results in a slab size of 2 GB. This allocator manages the space VDO uses to store user data. The maximum number of slabs in the system is 8192, so this value determines the maximum physical size of a VDO volume. One slab is the minimum amount by which a VDO volume can be grown. Smaller slabs also increase the potential for parallelism if the device has multiple physical threads. Therefore, this value should be set as small as possible, given the eventual maximal size of the volume. .TP .B \-\-uds\-memory\-size=\fIgigabytes\fP Specify the amount of memory, in gigabytes, to devote to the index. Accepted options are .25, .5, .75, and all positive integers. .TP .B \-\-uds\-sparse Specify whether or not to use a sparse index. .TP .B \-\-verbose Describe what is being formatted and with what parameters. .TP .B \-\-version Show the version of vdoformat. . .\" .SH EXAMPLES .\" .SH NOTES .SH SEE ALSO .BR vdo (8). vdo-8.3.1.1/utils/vdo/man/vdolistmetadata.8000066400000000000000000000015611476467262700204630ustar00rootroot00000000000000.TH VDOLISTMETADATA 8 "2023-03-28" "Red Hat" \" -*- nroff -*- .SH NAME vdolistmetadata \- list the metadata regions on a VDO device .SH SYNOPSIS .B vdolistmetadata .I filename .SH DESCRIPTION .B vdolistmetadata lists the metadata regions of a VDO device as ranges of block numbers. Each range is on a separate line of the form: .EX startBlock .. endBlock: label .EE Both endpoints are included in the range, and are the zero-based indexes of 4KB VDO metadata blocks on the backing device. .PP .I filename must be the path of the VDODataLV as described in \fBlvmvdo\fP(7). .SH OPTIONS .TP .B \-\-help Print this help message and exit. 
.TP .B \-\-version Show the version of vdolistmetadata. . .SH EXAMPLE .nf # lvchange -ay vdo1/vdo0pool_vdata # vdolistmetadata /dev/mapper/vdo1-vdo0pool_vdata # lvchange -an vdo1/vdo0pool_vdata .fi .SH SEE ALSO .BR lvmvdo (7), .BR lvchange (8) vdo-8.3.1.1/utils/vdo/man/vdoreadonly.8000066400000000000000000000016621476467262700176260ustar00rootroot00000000000000.TH VDOREADONLY 8 "2023-04-14" "Red Hat" \" -*- nroff -*- .SH NAME vdoreadonly \- puts a VDO device into read-only mode .SH SYNOPSIS .B vdoreadonly .I filename .SH DESCRIPTION .B vdoreadonly forces an existing VDO device into read-only mode. .PP .I filename must be the path of the VDODataLV as described in \fBlvmvdo\fP(7). Since \fBlvchange\fP(8) will only mount that as read-only, a writable version of that device must be manually created, as shown in the example below. .PP .SH OPTIONS .TP .B \-\-help Print this help message and exit. .TP .B \-\-version Show the version of vdoreadonly. . .SH EXAMPLE .nf # lvchange -ay vdo1/vdo0pool_vdata # dmsetup table vdo1-vdo0pool_vdata > vdata.table # lvchange -an vdo1/vdo0pool_vdata # dmsetup create vdo1-vdo0pool_vdata --table "`cat vdata.table`" # vdoreadonly /dev/mapper/vdo1-vdo0pool_vdata # dmsetup remove vdo1-vdo0pool_vdata .fi .SH SEE ALSO .BR lvmvdo (7), .BR lvchange (8), .BR dmsetup (8) vdo-8.3.1.1/utils/vdo/man/vdorecover.8000066400000000000000000000006501476467262700174520ustar00rootroot00000000000000.TH VDORECOVER 8 "2022-07-15" "Red Hat" \" -*- nroff -*- .SH NAME vdorecover \- Recovers available storage by discarding a full VDO volume. .SH SYNOPSIS .B vdorecover .I vdo_device .SH DESCRIPTION .B vdorecover Recovers available physical space on a full VDO volume by mounting it temporarily with snapshots and sending discards to it. .PP .SH OPTIONS .TP .B vdo_device The VDO device to recover . .SH SEE ALSO .BR vdo (8). vdo-8.3.1.1/utils/vdo/man/vdostats.8000066400000000000000000000326331476467262700171510ustar00rootroot00000000000000.TH VDOSTATS 8 "2020-02-18" "Red Hat" \" -*- nroff -*- .SH NAME vdostats \- Display configuration and statistics of VDO volumes .SH SYNOPSIS .B vdostats [\fI\,options ...\/\fR] [\fI\,device [device ...]\/\fR] .SH DESCRIPTION \fBvdostats\fR displays configuration and statistics information for the given VDO devices. If no devices are given, it displays information about all VDO devices. .TP The VDO devices must be running in order for configuration and statistics information to be reported. .SH OPTIONS .TP \fB\-h\fR, \fB\-\-help\fR Show help message and exit. .TP \fB\-a\fR, \fB\-\-all\fR This option is only for backwards compatibility. It is now equivalent to \fB\-\-verbose\fR. .TP \fB\-\-human\-readable\fR Display block values in readable form (Base 2: 1 KB = 2^10 bytes = 1024 bytes). .TP \fB\-\-si\fR Modifies the output of the \fB\-\-human\-readable\fR option to use SI units (Base 10: 1 KB = 10^3 bytes = 1000 bytes). If the \fB\-\-human\-readable\fR option is not supplied, this option has no effect. .TP \fB\-v\fR, \fB\-\-verbose\fR Displays the utilization and block I/O (bios) statistics for the selected VDO devices. 
.TP \fB\-V\fR, \fB\-\-version\fR Prints the vdostats version number and exits .SH OUTPUT The default output format is a table with the following columns, similar to that of the Linux \fBdf\fR utility: .TP .B Device The path to the VDO volume .TP .B 1K\-blocks The total number of 1K blocks allocated for a VDO volume (= physical volume size * block size / 1024) .TP .B Used The total number of 1K blocks used on a VDO volume (= physical blocks used * block size / 1024) .TP .B Available The total number of 1K blocks available on a VDO volume (= physical blocks free * block size / 1024) .TP .B Use% The percentage of physical blocks used on a VDO volume (= used blocks / allocated blocks * 100) .TP .B Space Saving% The percentage of physical blocks saved on a VDO volume (= [logical blocks used - physical blocks used] / logical blocks used) .SH VERBOSE OUTPUT The \fB\-\-verbose\fR option displays VDO device statistics in YAML format for the specified VDO devices. The following fields will continue to be reported in future releases. Management tools should not rely upon the order in which any of the statistics are reported. .TP .B version The version of these statistics. .TP .B release version The release version of the VDO. .TP .B data blocks used The number of physical blocks currently in use by a VDO volume to store data. .TP .B overhead blocks used The number of physical blocks currently in use by a VDO volume to store VDO metadata. .TP .B logical blocks used The number of logical blocks currently mapped. .TP .B physical blocks The total number of physical blocks allocated for a VDO volume. .TP .B logical blocks The maximum number of logical blocks that can be mapped by a VDO volume. .TP .B 1K-blocks The total number of 1K blocks allocated for a VDO volume (= physical volume size * block size / 1024) .TP .B 1K-blocks used The total number of 1K blocks used on a VDO volume (= physical blocks used * block size / 1024) .TP .B 1K-blocks available The total number of 1K blocks available on a VDO volume (= physical blocks free * block size / 1024) .TP .B used percent The percentage of physical blocks used on a VDO volume (= used blocks / allocated blocks * 100) .TP .B saving percent The percentage of physical blocks saved on a VDO volume (= [logical blocks used - physical blocks used] / logical blocks used) .TP .B block map cache size The size of the block map cache, in bytes. .TP .B write policy The write policy (sync, async, or async-unsafe). This is configured via \fBvdo modify \-\-writePolicy=\fIpolicy\fR. .TP .B block size The block size of a VDO volume, in bytes. .TP .B completed recovery count The number of times a VDO volume has recovered from an unclean shutdown. .TP .B read-only recovery count The number of times a VDO volume has been recovered from read-only mode (via \fBvdo start \-\-forceRebuild\fR). .TP .B operating mode Indicates whether a VDO volume is operating normally, is in recovery mode, or is in read-only mode. .TP .B recovery progress (%) Indicates online recovery progress, or \fBN/A\fR if the volume is not in recovery mode. .TP .B compressed fragments written The number of compressed fragments that have been written since the VDO volume was last restarted. .TP .B compressed blocks written The number of physical blocks of compressed data that have been written since the VDO volume was last restarted. .PP The remaining fields are primarily intended for software support and are subject to change in future releases; management tools should not rely upon them. 
.TP .B compressed fragments in packer The number of compressed fragments being processed that have not yet been written. .TP .B slab count The total number of slabs. .TP .B slabs opened The total number of slabs from which blocks have ever been allocated. .TP .B slabs reopened The number of times slabs have been re-opened since the VDO was started. .TP .B journal disk full count The number of times a request could not make a recovery journal entry because the recovery journal was full. .TP .B journal commits requested count The number of times the recovery journal requested slab journal commits. .TP .B journal entries batching The number of journal entry writes started minus the number of journal entries written. .TP .B journal entries started The number of journal entries which have been made in memory. .TP .B journal entries writing The number of journal entries in submitted writes minus the number of journal entries committed to storage. .TP .B journal entries written The total number of journal entries for which a write has been issued. .TP .B journal entries committed The number of journal entries written to storage. .TP .B journal blocks batching The number of journal block writes started minus the number of journal blocks written. .TP .B journal blocks started The number of journal blocks which have been touched in memory. .TP .B journal blocks writing The number of journal blocks written (with metadatata in active memory) minus the number of journal blocks committed. .TP .B journal blocks written The total number of journal blocks for which a write has been issued. .TP .B journal blocks committed The number of journal blocks written to storage. .TP .B slab journal disk full count The number of times an on-disk slab journal was full. .TP .B slab journal flush count The number of times an entry was added to a slab journal that was over the flush threshold. .TP .B slab journal blocked count The number of times an entry was added to a slab journal that was over the blocking threshold. .TP .B slab journal blocks written The number of slab journal block writes issued. .TP .B slab journal tail busy count The number of times write requests blocked waiting for a slab journal write. .TP .B slab summary blocks written The number of slab summary block writes issued. .TP .B reference blocks written The number of reference block writes issued. .TP .B block map dirty pages The number of dirty pages in the block map cache. .TP .B block map clean pages The number of clean pages in the block map cache. .TP .B block map free pages The number of free pages in the block map cache. .TP .B block map failed pages The number of block map cache pages that have write errors. .TP .B block map incoming pages The number of block map cache pages that are being read into the cache. .TP .B block map outgoing pages The number of block map cache pages that are being written. .TP .B block map cache pressure The number of times a free page was not available when needed. .TP .B block map read count The total number of block map page reads. .TP .B block map write count The total number of block map page writes. .TP .B block map failed reads The total number of block map read errors. .TP .B block map failed writes The total number of block map write errors. .TP .B block map reclaimed The total number of block map pages that were reclaimed. .TP .B block map read outgoing The total number of block map reads for pages that were being written. .TP .B block map found in cache The total number of block map cache hits. 
.TP .B block map discard required The total number of block map requests that required a page to be discarded. .TP .B block map wait for page The total number of requests that had to wait for a page. .TP .B block map fetch required The total number of requests that required a page fetch. .TP .B block map pages loaded The total number of page fetches. .TP .B block map pages saved The total number of page saves. .TP .B block map flush count The total number of flushes issued by the block map. .TP .B invalid advice PBN count The number of times the index returned invalid advice .TP .B no space error count The number of write requests which failed due to the VDO volume being out of space. .TP .B read only error count The number of write requests which failed due to the VDO volume being in read-only mode. .TP .B instance The VDO instance. .TP .B 512 byte emulation Indicates whether 512 byte emulation is on or off for the volume. .TP .B current VDO IO requests in progress The number of I/O requests the VDO is current processing. .TP .B maximum VDO IO requests in progress The maximum number of simultaneous I/O requests the VDO has processed. .TP .B current dedupe queries The number of deduplication queries currently in flight. .TP .B maximum dedupe queries The maximum number of in-flight deduplication queries. .TP .B dedupe advice valid The number of times deduplication advice was correct. .TP .B dedupe advice stale The number of times deduplication advice was incorrect. .TP .B dedupe advice timeouts The number of times deduplication queries timed out. .TP .B concurrent data matches The number of writes with the same data as another in-flight write. .TP .B concurrent hash collisions The number of writes whose hash collided with an in-flight write. .TP .B flush out The number of flush requests submitted by VDO to the underlying storage. .TP .B write amplification ratio The average number of block writes to the underlying storage per block written to the VDO device. .PP .B bios in... .br .B bios in partial... .br .B bios out... .br .B bios meta... .br .B bios journal... .br .B bios page cache... .br .B bios out completed... .br .B bios meta completed... .br .B bios journal completed... .br .B bios page cache completed... .br .B bios acknowledged... .br .B bios acknowledged partial... .br .B bios in progress... .br .RS These statistics count the number of bios in each category with a given flag. The categories are: .TP .B bios in The number of block I/O requests received by VDO. .TP .B bios in partial The number of partial block I/O requests received by VDO. Applies only to 512-byte emulation mode. .TP .B bios out The number of non-metadata block I/O requests submitted by VDO to the storage device. .TP .B bios meta The number of metadata block I/O requests submitted by VDO to the storage device. .TP .B bios journal The number of recovery journal block I/O requests submitted by VDO to the storage device. .TP .B bios page cache The number of block map I/O requests submitted by VDO to the storage device. .TP .B bios out completed The number of non-metadata block I/O requests completed by the storage device. .TP .B bios meta completed The number of metadata block I/O requests completed by the storage device. .TP .B bios journal completed The number of recovery journal block I/O requests completed by the storage device. .TP .B bios page cache completed The number of block map I/O requests completed by the storage device. .TP .B bios acknowledged The number of block I/O requests acknowledged by VDO. 
.TP .B bios acknowledged partial The number of partial block I/O requests acknowledged by VDO. Applies only to 512-byte emulation mode. .TP .B bios in progress The number of bios submitted to the VDO which have not yet been acknowledged. .PP There are five types of flags: .TP .B read The number of non-write bios (bios without the REQ_WRITE flag set) .TP .B write The number of write bios (bios with the REQ_WRITE flag set) .TP .B discard The number of bios with a REQ_DISCARD flag set .TP .B flush The number of flush bios (bios with the REQ_FLUSH flag set) .TP .B fua The number of "force unit access" bios (bios with the REQ_FUA flag set) .PP Note that all bios will be counted as either read or write bios, depending on the REQ_WRITE flag setting, regardless of whether any of the other flags are set. .RE . .TP .B KVDO module bytes used The current count of bytes allocated by the kernel VDO module. .TP .B KVDO module peak bytes used The peak count of bytes allocated by the kernel VDO module, since the module was loaded. .SH EXAMPLES The following example shows sample output if no options are provided: .PP .EX Device 1K-blocks Used Available Use% Space Saving% /dev/mapper/my_vdo 1932562432 427698104 1504864328 22% 21% .EE .PP With the \fB\-\-human\-readable\fR option, block counts are converted to conventional units (1 KB = 1024 bytes): .PP .EX Device Size Used Available Use% Space Saving% /dev/mapper/my_vdo 1.8T 407.9G 1.4T 22% 21% .EE .PP With the \fB\-\-si\fR option as well, the block counts are reported using SI units (1 KB = 1000 bytes): .PP .EX Device Size Used Available Use% Space Saving% /dev/mapper/my_vdo 2.0T 438G 1.5T 22% 21% .EE .\" Add example of verbose mode? .\" The VDO integration manual didn't have one. .SH NOTES The output may be incomplete when the command is run by an unprivileged user. .SH SEE ALSO .BR vdo (8). 
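.\" Worked conversion for the examples above (illustrative arithmetic derived
.\" from the sample numbers only): the Use% column is Used/Size, here
.\" 427698104 / 1932562432, or about 22%. With --human-readable, 427698104
.\" 1K-blocks / 2^20 is about 407.9 GiB (shown as 407.9G); with --si,
.\" 427698104 * 1024 bytes / 10^9 is about 438 GB.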
vdo-8.3.1.1/utils/vdo/messageStatsReader.c000066400000000000000000000730611476467262700203700ustar00rootroot00000000000000// SPDX-License-Identifier: GPL-2.0-only /* * Copyright 2023 Red Hat */ #include "string-utils.h" #include "statistics.h" #include "status-codes.h" #include "vdoStats.h" static int skip_string(char **buf, char *skip) { char *tmp = NULL; tmp = strstr(*buf, skip); if (tmp == NULL) { return VDO_UNEXPECTED_EOF; } *buf = tmp + strlen(skip); return VDO_SUCCESS; } static int read_u64(char **buf, u64 *value) { int count = sscanf(*buf, "%lu, ", value); if (count != 1) { return VDO_UNEXPECTED_EOF; } return VDO_SUCCESS; } static int read_u32(char **buf, u32 *value) { int count = sscanf(*buf, "%u, ", value); if (count != 1) { return VDO_UNEXPECTED_EOF; } return VDO_SUCCESS; } static int read_block_count_t(char **buf, block_count_t *value) { int count = sscanf(*buf, "%lu, ", value); if (count != 1) { return VDO_UNEXPECTED_EOF; } return VDO_SUCCESS; } static int read_string(char **buf, char *value) { int count = sscanf(*buf, "%[^,], ", value); if (count != 1) { return VDO_UNEXPECTED_EOF; } return VDO_SUCCESS; } static int read_bool(char **buf, bool *value) { int temp; int count = sscanf(*buf, "%d, ", &temp); *value = (bool)temp; if (count != 1) { return VDO_UNEXPECTED_EOF; } return VDO_SUCCESS; } static int read_u8(char **buf, u8 *value) { int count = sscanf(*buf, "%hhu, ", value); if (count != 1) { return VDO_UNEXPECTED_EOF; } return VDO_SUCCESS; } static int read_block_allocator_statistics(char **buf, struct block_allocator_statistics *stats) { int result = 0; /** The total number of slabs from which blocks may be allocated */ result = skip_string(buf, "slabCount : "); if (result != VDO_SUCCESS) { return result; } result = read_u64(buf, &stats->slab_count); if (result != VDO_SUCCESS) { return result; } /** The total number of slabs from which blocks have ever been allocated */ result = skip_string(buf, "slabsOpened : "); if (result != VDO_SUCCESS) { return result; } result = read_u64(buf, &stats->slabs_opened); if (result != VDO_SUCCESS) { return result; } /** The number of times since loading that a slab has been re-opened */ result = skip_string(buf, "slabsReopened : "); if (result != VDO_SUCCESS) { return result; } result = read_u64(buf, &stats->slabs_reopened); if (result != VDO_SUCCESS) { return result; } return VDO_SUCCESS; } static int read_commit_statistics(char **buf, struct commit_statistics *stats) { int result = 0; /** The total number of items on which processing has started */ result = skip_string(buf, "started : "); if (result != VDO_SUCCESS) { return result; } result = read_u64(buf, &stats->started); if (result != VDO_SUCCESS) { return result; } /** The total number of items for which a write operation has been issued */ result = skip_string(buf, "written : "); if (result != VDO_SUCCESS) { return result; } result = read_u64(buf, &stats->written); if (result != VDO_SUCCESS) { return result; } /** The total number of items for which a write operation has completed */ result = skip_string(buf, "committed : "); if (result != VDO_SUCCESS) { return result; } result = read_u64(buf, &stats->committed); if (result != VDO_SUCCESS) { return result; } return VDO_SUCCESS; } static int read_recovery_journal_statistics(char **buf, struct recovery_journal_statistics *stats) { int result = 0; /** Number of times the on-disk journal was full */ result = skip_string(buf, "diskFull : "); if (result != VDO_SUCCESS) { return result; } result = read_u64(buf, &stats->disk_full); if (result 
!= VDO_SUCCESS) { return result; } /** Number of times the recovery journal requested slab journal commits. */ result = skip_string(buf, "slabJournalCommitsRequested : "); if (result != VDO_SUCCESS) { return result; } result = read_u64(buf, &stats->slab_journal_commits_requested); if (result != VDO_SUCCESS) { return result; } /** Write/Commit totals for individual journal entries */ result = skip_string(buf, "entries : "); if (result != VDO_SUCCESS) { return result; } result = read_commit_statistics(buf, &stats->entries); if (result != VDO_SUCCESS) { return result; } /** Write/Commit totals for journal blocks */ result = skip_string(buf, "blocks : "); if (result != VDO_SUCCESS) { return result; } result = read_commit_statistics(buf, &stats->blocks); if (result != VDO_SUCCESS) { return result; } return VDO_SUCCESS; } static int read_packer_statistics(char **buf, struct packer_statistics *stats) { int result = 0; /** Number of compressed data items written since startup */ result = skip_string(buf, "compressedFragmentsWritten : "); if (result != VDO_SUCCESS) { return result; } result = read_u64(buf, &stats->compressed_fragments_written); if (result != VDO_SUCCESS) { return result; } /** Number of blocks containing compressed items written since startup */ result = skip_string(buf, "compressedBlocksWritten : "); if (result != VDO_SUCCESS) { return result; } result = read_u64(buf, &stats->compressed_blocks_written); if (result != VDO_SUCCESS) { return result; } /** Number of VIOs that are pending in the packer */ result = skip_string(buf, "compressedFragmentsInPacker : "); if (result != VDO_SUCCESS) { return result; } result = read_u64(buf, &stats->compressed_fragments_in_packer); if (result != VDO_SUCCESS) { return result; } return VDO_SUCCESS; } static int read_slab_journal_statistics(char **buf, struct slab_journal_statistics *stats) { int result = 0; /** Number of times the on-disk journal was full */ result = skip_string(buf, "diskFullCount : "); if (result != VDO_SUCCESS) { return result; } result = read_u64(buf, &stats->disk_full_count); if (result != VDO_SUCCESS) { return result; } /** Number of times an entry was added over the flush threshold */ result = skip_string(buf, "flushCount : "); if (result != VDO_SUCCESS) { return result; } result = read_u64(buf, &stats->flush_count); if (result != VDO_SUCCESS) { return result; } /** Number of times an entry was added over the block threshold */ result = skip_string(buf, "blockedCount : "); if (result != VDO_SUCCESS) { return result; } result = read_u64(buf, &stats->blocked_count); if (result != VDO_SUCCESS) { return result; } /** Number of times a tail block was written */ result = skip_string(buf, "blocksWritten : "); if (result != VDO_SUCCESS) { return result; } result = read_u64(buf, &stats->blocks_written); if (result != VDO_SUCCESS) { return result; } /** Number of times we had to wait for the tail to write */ result = skip_string(buf, "tailBusyCount : "); if (result != VDO_SUCCESS) { return result; } result = read_u64(buf, &stats->tail_busy_count); if (result != VDO_SUCCESS) { return result; } return VDO_SUCCESS; } static int read_slab_summary_statistics(char **buf, struct slab_summary_statistics *stats) { int result = 0; /** Number of blocks written */ result = skip_string(buf, "blocksWritten : "); if (result != VDO_SUCCESS) { return result; } result = read_u64(buf, &stats->blocks_written); if (result != VDO_SUCCESS) { return result; } return VDO_SUCCESS; } static int read_ref_counts_statistics(char **buf, struct 
ref_counts_statistics *stats) { int result = 0; /** Number of reference blocks written */ result = skip_string(buf, "blocksWritten : "); if (result != VDO_SUCCESS) { return result; } result = read_u64(buf, &stats->blocks_written); if (result != VDO_SUCCESS) { return result; } return VDO_SUCCESS; } static int read_block_map_statistics(char **buf, struct block_map_statistics *stats) { int result = 0; /** number of dirty (resident) pages */ result = skip_string(buf, "dirtyPages : "); if (result != VDO_SUCCESS) { return result; } result = read_u32(buf, &stats->dirty_pages); if (result != VDO_SUCCESS) { return result; } /** number of clean (resident) pages */ result = skip_string(buf, "cleanPages : "); if (result != VDO_SUCCESS) { return result; } result = read_u32(buf, &stats->clean_pages); if (result != VDO_SUCCESS) { return result; } /** number of free pages */ result = skip_string(buf, "freePages : "); if (result != VDO_SUCCESS) { return result; } result = read_u32(buf, &stats->free_pages); if (result != VDO_SUCCESS) { return result; } /** number of pages in failed state */ result = skip_string(buf, "failedPages : "); if (result != VDO_SUCCESS) { return result; } result = read_u32(buf, &stats->failed_pages); if (result != VDO_SUCCESS) { return result; } /** number of pages incoming */ result = skip_string(buf, "incomingPages : "); if (result != VDO_SUCCESS) { return result; } result = read_u32(buf, &stats->incoming_pages); if (result != VDO_SUCCESS) { return result; } /** number of pages outgoing */ result = skip_string(buf, "outgoingPages : "); if (result != VDO_SUCCESS) { return result; } result = read_u32(buf, &stats->outgoing_pages); if (result != VDO_SUCCESS) { return result; } /** how many times free page not avail */ result = skip_string(buf, "cachePressure : "); if (result != VDO_SUCCESS) { return result; } result = read_u32(buf, &stats->cache_pressure); if (result != VDO_SUCCESS) { return result; } /** number of get_vdo_page() calls for read */ result = skip_string(buf, "readCount : "); if (result != VDO_SUCCESS) { return result; } result = read_u64(buf, &stats->read_count); if (result != VDO_SUCCESS) { return result; } /** number of get_vdo_page() calls for write */ result = skip_string(buf, "writeCount : "); if (result != VDO_SUCCESS) { return result; } result = read_u64(buf, &stats->write_count); if (result != VDO_SUCCESS) { return result; } /** number of times pages failed to read */ result = skip_string(buf, "failedReads : "); if (result != VDO_SUCCESS) { return result; } result = read_u64(buf, &stats->failed_reads); if (result != VDO_SUCCESS) { return result; } /** number of times pages failed to write */ result = skip_string(buf, "failedWrites : "); if (result != VDO_SUCCESS) { return result; } result = read_u64(buf, &stats->failed_writes); if (result != VDO_SUCCESS) { return result; } /** number of gets that are reclaimed */ result = skip_string(buf, "reclaimed : "); if (result != VDO_SUCCESS) { return result; } result = read_u64(buf, &stats->reclaimed); if (result != VDO_SUCCESS) { return result; } /** number of gets for outgoing pages */ result = skip_string(buf, "readOutgoing : "); if (result != VDO_SUCCESS) { return result; } result = read_u64(buf, &stats->read_outgoing); if (result != VDO_SUCCESS) { return result; } /** number of gets that were already there */ result = skip_string(buf, "foundInCache : "); if (result != VDO_SUCCESS) { return result; } result = read_u64(buf, &stats->found_in_cache); if (result != VDO_SUCCESS) { return result; } /** number of gets 
requiring discard */ result = skip_string(buf, "discardRequired : "); if (result != VDO_SUCCESS) { return result; } result = read_u64(buf, &stats->discard_required); if (result != VDO_SUCCESS) { return result; } /** number of gets enqueued for their page */ result = skip_string(buf, "waitForPage : "); if (result != VDO_SUCCESS) { return result; } result = read_u64(buf, &stats->wait_for_page); if (result != VDO_SUCCESS) { return result; } /** number of gets that have to fetch */ result = skip_string(buf, "fetchRequired : "); if (result != VDO_SUCCESS) { return result; } result = read_u64(buf, &stats->fetch_required); if (result != VDO_SUCCESS) { return result; } /** number of page fetches */ result = skip_string(buf, "pagesLoaded : "); if (result != VDO_SUCCESS) { return result; } result = read_u64(buf, &stats->pages_loaded); if (result != VDO_SUCCESS) { return result; } /** number of page saves */ result = skip_string(buf, "pagesSaved : "); if (result != VDO_SUCCESS) { return result; } result = read_u64(buf, &stats->pages_saved); if (result != VDO_SUCCESS) { return result; } /** the number of flushes issued */ result = skip_string(buf, "flushCount : "); if (result != VDO_SUCCESS) { return result; } result = read_u64(buf, &stats->flush_count); if (result != VDO_SUCCESS) { return result; } return VDO_SUCCESS; } static int read_hash_lock_statistics(char **buf, struct hash_lock_statistics *stats) { int result = 0; /** Number of times the UDS advice proved correct */ result = skip_string(buf, "dedupeAdviceValid : "); if (result != VDO_SUCCESS) { return result; } result = read_u64(buf, &stats->dedupe_advice_valid); if (result != VDO_SUCCESS) { return result; } /** Number of times the UDS advice proved incorrect */ result = skip_string(buf, "dedupeAdviceStale : "); if (result != VDO_SUCCESS) { return result; } result = read_u64(buf, &stats->dedupe_advice_stale); if (result != VDO_SUCCESS) { return result; } /** Number of writes with the same data as another in-flight write */ result = skip_string(buf, "concurrentDataMatches : "); if (result != VDO_SUCCESS) { return result; } result = read_u64(buf, &stats->concurrent_data_matches); if (result != VDO_SUCCESS) { return result; } /** Number of writes whose hash collided with an in-flight write */ result = skip_string(buf, "concurrentHashCollisions : "); if (result != VDO_SUCCESS) { return result; } result = read_u64(buf, &stats->concurrent_hash_collisions); if (result != VDO_SUCCESS) { return result; } /** Current number of dedupe queries that are in flight */ result = skip_string(buf, "currDedupeQueries : "); if (result != VDO_SUCCESS) { return result; } result = read_u32(buf, &stats->curr_dedupe_queries); if (result != VDO_SUCCESS) { return result; } return VDO_SUCCESS; } static int read_error_statistics(char **buf, struct error_statistics *stats) { int result = 0; /** number of times VDO got an invalid dedupe advice PBN from UDS */ result = skip_string(buf, "invalidAdvicePBNCount : "); if (result != VDO_SUCCESS) { return result; } result = read_u64(buf, &stats->invalid_advice_pbn_count); if (result != VDO_SUCCESS) { return result; } /** number of times a VIO completed with a VDO_NO_SPACE error */ result = skip_string(buf, "noSpaceErrorCount : "); if (result != VDO_SUCCESS) { return result; } result = read_u64(buf, &stats->no_space_error_count); if (result != VDO_SUCCESS) { return result; } /** number of times a VIO completed with a VDO_READ_ONLY error */ result = skip_string(buf, "readOnlyErrorCount : "); if (result != VDO_SUCCESS) { return 
result; } result = read_u64(buf, &stats->read_only_error_count); if (result != VDO_SUCCESS) { return result; } return VDO_SUCCESS; } static int read_bio_stats(char **buf, struct bio_stats *stats) { int result = 0; /** Number of REQ_OP_READ bios */ result = skip_string(buf, "read : "); if (result != VDO_SUCCESS) { return result; } result = read_u64(buf, &stats->read); if (result != VDO_SUCCESS) { return result; } /** Number of REQ_OP_WRITE bios with data */ result = skip_string(buf, "write : "); if (result != VDO_SUCCESS) { return result; } result = read_u64(buf, &stats->write); if (result != VDO_SUCCESS) { return result; } /** Number of bios tagged with REQ_PREFLUSH and containing no data */ result = skip_string(buf, "emptyFlush : "); if (result != VDO_SUCCESS) { return result; } result = read_u64(buf, &stats->empty_flush); if (result != VDO_SUCCESS) { return result; } /** Number of REQ_OP_DISCARD bios */ result = skip_string(buf, "discard : "); if (result != VDO_SUCCESS) { return result; } result = read_u64(buf, &stats->discard); if (result != VDO_SUCCESS) { return result; } /** Number of bios tagged with REQ_PREFLUSH */ result = skip_string(buf, "flush : "); if (result != VDO_SUCCESS) { return result; } result = read_u64(buf, &stats->flush); if (result != VDO_SUCCESS) { return result; } /** Number of bios tagged with REQ_FUA */ result = skip_string(buf, "fua : "); if (result != VDO_SUCCESS) { return result; } result = read_u64(buf, &stats->fua); if (result != VDO_SUCCESS) { return result; } return VDO_SUCCESS; } static int read_memory_usage(char **buf, struct memory_usage *stats) { int result = 0; /** Tracked bytes currently allocated. */ result = skip_string(buf, "bytesUsed : "); if (result != VDO_SUCCESS) { return result; } result = read_u64(buf, &stats->bytes_used); if (result != VDO_SUCCESS) { return result; } /** Maximum tracked bytes allocated. 
*/ result = skip_string(buf, "peakBytesUsed : "); if (result != VDO_SUCCESS) { return result; } result = read_u64(buf, &stats->peak_bytes_used); if (result != VDO_SUCCESS) { return result; } return VDO_SUCCESS; } static int read_index_statistics(char **buf, struct index_statistics *stats) { int result = 0; /** Number of records stored in the index */ result = skip_string(buf, "entriesIndexed : "); if (result != VDO_SUCCESS) { return result; } result = read_u64(buf, &stats->entries_indexed); if (result != VDO_SUCCESS) { return result; } /** Number of post calls that found an existing entry */ result = skip_string(buf, "postsFound : "); if (result != VDO_SUCCESS) { return result; } result = read_u64(buf, &stats->posts_found); if (result != VDO_SUCCESS) { return result; } /** Number of post calls that added a new entry */ result = skip_string(buf, "postsNotFound : "); if (result != VDO_SUCCESS) { return result; } result = read_u64(buf, &stats->posts_not_found); if (result != VDO_SUCCESS) { return result; } /** Number of query calls that found an existing entry */ result = skip_string(buf, "queriesFound : "); if (result != VDO_SUCCESS) { return result; } result = read_u64(buf, &stats->queries_found); if (result != VDO_SUCCESS) { return result; } /** Number of query calls that added a new entry */ result = skip_string(buf, "queriesNotFound : "); if (result != VDO_SUCCESS) { return result; } result = read_u64(buf, &stats->queries_not_found); if (result != VDO_SUCCESS) { return result; } /** Number of update calls that found an existing entry */ result = skip_string(buf, "updatesFound : "); if (result != VDO_SUCCESS) { return result; } result = read_u64(buf, &stats->updates_found); if (result != VDO_SUCCESS) { return result; } /** Number of update calls that added a new entry */ result = skip_string(buf, "updatesNotFound : "); if (result != VDO_SUCCESS) { return result; } result = read_u64(buf, &stats->updates_not_found); if (result != VDO_SUCCESS) { return result; } /** Number of entries discarded */ result = skip_string(buf, "entriesDiscarded : "); if (result != VDO_SUCCESS) { return result; } result = read_u64(buf, &stats->entries_discarded); if (result != VDO_SUCCESS) { return result; } return VDO_SUCCESS; } static int read_vdo_statistics(char **buf, struct vdo_statistics *stats) { int result = 0; result = skip_string(buf, "version : "); if (result != VDO_SUCCESS) { return result; } result = read_u32(buf, &stats->version); if (result != VDO_SUCCESS) { return result; } /** Number of blocks used for data */ result = skip_string(buf, "dataBlocksUsed : "); if (result != VDO_SUCCESS) { return result; } result = read_u64(buf, &stats->data_blocks_used); if (result != VDO_SUCCESS) { return result; } /** Number of blocks used for VDO metadata */ result = skip_string(buf, "overheadBlocksUsed : "); if (result != VDO_SUCCESS) { return result; } result = read_u64(buf, &stats->overhead_blocks_used); if (result != VDO_SUCCESS) { return result; } /** Number of logical blocks that are currently mapped to physical blocks */ result = skip_string(buf, "logicalBlocksUsed : "); if (result != VDO_SUCCESS) { return result; } result = read_u64(buf, &stats->logical_blocks_used); if (result != VDO_SUCCESS) { return result; } /** number of physical blocks */ result = skip_string(buf, "physicalBlocks : "); if (result != VDO_SUCCESS) { return result; } result = read_block_count_t(buf, &stats->physical_blocks); if (result != VDO_SUCCESS) { return result; } /** number of logical blocks */ result = skip_string(buf, 
"logicalBlocks : "); if (result != VDO_SUCCESS) { return result; } result = read_block_count_t(buf, &stats->logical_blocks); if (result != VDO_SUCCESS) { return result; } /** Size of the block map page cache, in bytes */ result = skip_string(buf, "blockMapCacheSize : "); if (result != VDO_SUCCESS) { return result; } result = read_u64(buf, &stats->block_map_cache_size); if (result != VDO_SUCCESS) { return result; } /** The physical block size */ result = skip_string(buf, "blockSize : "); if (result != VDO_SUCCESS) { return result; } result = read_u64(buf, &stats->block_size); if (result != VDO_SUCCESS) { return result; } /** Number of times the VDO has successfully recovered */ result = skip_string(buf, "completeRecoveries : "); if (result != VDO_SUCCESS) { return result; } result = read_u64(buf, &stats->complete_recoveries); if (result != VDO_SUCCESS) { return result; } /** Number of times the VDO has recovered from read-only mode */ result = skip_string(buf, "readOnlyRecoveries : "); if (result != VDO_SUCCESS) { return result; } result = read_u64(buf, &stats->read_only_recoveries); if (result != VDO_SUCCESS) { return result; } /** String describing the operating mode of the VDO */ result = skip_string(buf, "mode : "); if (result != VDO_SUCCESS) { return result; } result = read_string(buf, stats->mode); if (result != VDO_SUCCESS) { return result; } /** Whether the VDO is in recovery mode */ result = skip_string(buf, "inRecoveryMode : "); if (result != VDO_SUCCESS) { return result; } result = read_bool(buf, &stats->in_recovery_mode); if (result != VDO_SUCCESS) { return result; } /** What percentage of recovery mode work has been completed */ result = skip_string(buf, "recoveryPercentage : "); if (result != VDO_SUCCESS) { return result; } result = read_u8(buf, &stats->recovery_percentage); if (result != VDO_SUCCESS) { return result; } /** The statistics for the compressed block packer */ result = skip_string(buf, "packer : "); if (result != VDO_SUCCESS) { return result; } result = read_packer_statistics(buf, &stats->packer); if (result != VDO_SUCCESS) { return result; } /** Counters for events in the block allocator */ result = skip_string(buf, "allocator : "); if (result != VDO_SUCCESS) { return result; } result = read_block_allocator_statistics(buf, &stats->allocator); if (result != VDO_SUCCESS) { return result; } /** Counters for events in the recovery journal */ result = skip_string(buf, "journal : "); if (result != VDO_SUCCESS) { return result; } result = read_recovery_journal_statistics(buf, &stats->journal); if (result != VDO_SUCCESS) { return result; } /** The statistics for the slab journals */ result = skip_string(buf, "slabJournal : "); if (result != VDO_SUCCESS) { return result; } result = read_slab_journal_statistics(buf, &stats->slab_journal); if (result != VDO_SUCCESS) { return result; } /** The statistics for the slab summary */ result = skip_string(buf, "slabSummary : "); if (result != VDO_SUCCESS) { return result; } result = read_slab_summary_statistics(buf, &stats->slab_summary); if (result != VDO_SUCCESS) { return result; } /** The statistics for the reference counts */ result = skip_string(buf, "refCounts : "); if (result != VDO_SUCCESS) { return result; } result = read_ref_counts_statistics(buf, &stats->ref_counts); if (result != VDO_SUCCESS) { return result; } /** The statistics for the block map */ result = skip_string(buf, "blockMap : "); if (result != VDO_SUCCESS) { return result; } result = read_block_map_statistics(buf, &stats->block_map); if (result != 
VDO_SUCCESS) { return result; } /** The dedupe statistics from hash locks */ result = skip_string(buf, "hashLock : "); if (result != VDO_SUCCESS) { return result; } result = read_hash_lock_statistics(buf, &stats->hash_lock); if (result != VDO_SUCCESS) { return result; } /** Counts of error conditions */ result = skip_string(buf, "errors : "); if (result != VDO_SUCCESS) { return result; } result = read_error_statistics(buf, &stats->errors); if (result != VDO_SUCCESS) { return result; } /** The VDO instance */ result = skip_string(buf, "instance : "); if (result != VDO_SUCCESS) { return result; } result = read_u32(buf, &stats->instance); if (result != VDO_SUCCESS) { return result; } /** Current number of active VIOs */ result = skip_string(buf, "currentVIOsInProgress : "); if (result != VDO_SUCCESS) { return result; } result = read_u32(buf, &stats->current_vios_in_progress); if (result != VDO_SUCCESS) { return result; } /** Maximum number of active VIOs */ result = skip_string(buf, "maxVIOs : "); if (result != VDO_SUCCESS) { return result; } result = read_u32(buf, &stats->max_vios); if (result != VDO_SUCCESS) { return result; } /** Number of times the UDS index was too slow in responding */ result = skip_string(buf, "dedupeAdviceTimeouts : "); if (result != VDO_SUCCESS) { return result; } result = read_u64(buf, &stats->dedupe_advice_timeouts); if (result != VDO_SUCCESS) { return result; } /** Number of flush requests submitted to the storage device */ result = skip_string(buf, "flushOut : "); if (result != VDO_SUCCESS) { return result; } result = read_u64(buf, &stats->flush_out); if (result != VDO_SUCCESS) { return result; } /** Logical block size */ result = skip_string(buf, "logicalBlockSize : "); if (result != VDO_SUCCESS) { return result; } result = read_u64(buf, &stats->logical_block_size); if (result != VDO_SUCCESS) { return result; } /** Bios submitted into VDO from above */ result = skip_string(buf, "biosIn : "); if (result != VDO_SUCCESS) { return result; } result = read_bio_stats(buf, &stats->bios_in); if (result != VDO_SUCCESS) { return result; } result = skip_string(buf, "biosInPartial : "); if (result != VDO_SUCCESS) { return result; } result = read_bio_stats(buf, &stats->bios_in_partial); if (result != VDO_SUCCESS) { return result; } /** Bios submitted onward for user data */ result = skip_string(buf, "biosOut : "); if (result != VDO_SUCCESS) { return result; } result = read_bio_stats(buf, &stats->bios_out); if (result != VDO_SUCCESS) { return result; } /** Bios submitted onward for metadata */ result = skip_string(buf, "biosMeta : "); if (result != VDO_SUCCESS) { return result; } result = read_bio_stats(buf, &stats->bios_meta); if (result != VDO_SUCCESS) { return result; } result = skip_string(buf, "biosJournal : "); if (result != VDO_SUCCESS) { return result; } result = read_bio_stats(buf, &stats->bios_journal); if (result != VDO_SUCCESS) { return result; } result = skip_string(buf, "biosPageCache : "); if (result != VDO_SUCCESS) { return result; } result = read_bio_stats(buf, &stats->bios_page_cache); if (result != VDO_SUCCESS) { return result; } result = skip_string(buf, "biosOutCompleted : "); if (result != VDO_SUCCESS) { return result; } result = read_bio_stats(buf, &stats->bios_out_completed); if (result != VDO_SUCCESS) { return result; } result = skip_string(buf, "biosMetaCompleted : "); if (result != VDO_SUCCESS) { return result; } result = read_bio_stats(buf, &stats->bios_meta_completed); if (result != VDO_SUCCESS) { return result; } result = skip_string(buf, 
"biosJournalCompleted : "); if (result != VDO_SUCCESS) { return result; } result = read_bio_stats(buf, &stats->bios_journal_completed); if (result != VDO_SUCCESS) { return result; } result = skip_string(buf, "biosPageCacheCompleted : "); if (result != VDO_SUCCESS) { return result; } result = read_bio_stats(buf, &stats->bios_page_cache_completed); if (result != VDO_SUCCESS) { return result; } result = skip_string(buf, "biosAcknowledged : "); if (result != VDO_SUCCESS) { return result; } result = read_bio_stats(buf, &stats->bios_acknowledged); if (result != VDO_SUCCESS) { return result; } result = skip_string(buf, "biosAcknowledgedPartial : "); if (result != VDO_SUCCESS) { return result; } result = read_bio_stats(buf, &stats->bios_acknowledged_partial); if (result != VDO_SUCCESS) { return result; } /** Current number of bios in progress */ result = skip_string(buf, "biosInProgress : "); if (result != VDO_SUCCESS) { return result; } result = read_bio_stats(buf, &stats->bios_in_progress); if (result != VDO_SUCCESS) { return result; } /** Memory usage stats. */ result = skip_string(buf, "memoryUsage : "); if (result != VDO_SUCCESS) { return result; } result = read_memory_usage(buf, &stats->memory_usage); if (result != VDO_SUCCESS) { return result; } /** The statistics for the UDS index */ result = skip_string(buf, "index : "); if (result != VDO_SUCCESS) { return result; } result = read_index_statistics(buf, &stats->index); if (result != VDO_SUCCESS) { return result; } return VDO_SUCCESS; } int read_vdo_stats(char *buf, struct vdo_statistics *stats) { return(read_vdo_statistics(&buf, stats)); } vdo-8.3.1.1/utils/vdo/parseUtils.c000066400000000000000000000115311476467262700167270ustar00rootroot00000000000000/* * Copyright Red Hat * * This program is free software; you can redistribute it and/or * modify it under the terms of the GNU General Public License * as published by the Free Software Foundation; either version 2 * of the License, or (at your option) any later version. * * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with this program; if not, write to the Free Software * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA * 02110-1301, USA. 
*/ #include "parseUtils.h" #include #include #include #include #include "status-codes.h" /**********************************************************************/ int parseUInt(const char *arg, unsigned int lowest, unsigned int highest, unsigned int *numPtr) { char *endPtr; errno = 0; unsigned long n = strtoul(arg, &endPtr, 0); if ((errno == ERANGE) || (errno == EINVAL) || (endPtr == arg) || (*endPtr != '\0') || (n < lowest) || (n > highest)) { return VDO_OUT_OF_RANGE; } if (numPtr != NULL) { *numPtr = n; } return VDO_SUCCESS; } /**********************************************************************/ int parseInt(const char *arg, int *numPtr) { char *endPtr; errno = 0; long n = strtol(arg, &endPtr, 0); if ((errno == ERANGE) || (errno == EINVAL) || (endPtr == arg) || (*endPtr != '\0')) { return VDO_OUT_OF_RANGE; } if (numPtr != NULL) { *numPtr = n; } return VDO_SUCCESS; } /**********************************************************************/ int __must_check parseUInt64(const char *arg, uint64_t *numPtr) { char *endPtr; errno = 0; unsigned long long temp = strtoull(arg, &endPtr, 10); if ((errno == ERANGE) || (errno == EINVAL) || (*endPtr != '\0')) { return VDO_OUT_OF_RANGE; } uint64_t n = temp; if (temp != (unsigned long long) n) { return VDO_OUT_OF_RANGE; } *numPtr = n; return VDO_SUCCESS; } /** * Return the binary exponent corresponding to a unit code. * * @param unitCode The code, which is 'b' or 'B' for bytes, 'm' or 'M' * for megabytes, etc. * * @return The binary exponent corresponding to the code, * or -1 if the code is not valid **/ static int getBinaryExponent(char unitCode) { const char *UNIT_CODES = "BKMGTP"; const char *where = index(UNIT_CODES, toupper(unitCode)); if (where == NULL) { return -1; } // Each successive code is another factor of 2^10 bytes. return (10 * (where - UNIT_CODES)); } /**********************************************************************/ int parseSize(const char *arg, bool lvmMode, uint64_t *sizePtr) { char *endPtr; errno = 0; unsigned long long size = strtoull(arg, &endPtr, 0); if ((errno == ERANGE) || (errno == EINVAL) || (endPtr == arg)) { return VDO_OUT_OF_RANGE; } int exponent; if (*endPtr == '\0') { // No units specified; SI mode defaults to bytes, LVM mode to megabytes. exponent = lvmMode ? 20 : 0; } else { // Parse the unit code. exponent = getBinaryExponent(*endPtr++); if (exponent < 0) { return VDO_OUT_OF_RANGE; } if (*endPtr != '\0') { return VDO_OUT_OF_RANGE; } } // Scale the size by the specified units, checking for overflow. 
uint64_t actualSize = size << exponent; if (size != (actualSize >> exponent)) { return VDO_OUT_OF_RANGE; } *sizePtr = actualSize; return VDO_SUCCESS; } static int parseMem(char *string, uint32_t *sizePtr) { uds_memory_config_size_t mem; if (strcmp(string, "0.25") == 0) { mem = UDS_MEMORY_CONFIG_256MB; } else if ((strcmp(string, "0.5") == 0) || (strcmp(string, "0.50") == 0)) { mem = UDS_MEMORY_CONFIG_512MB; } else if (strcmp(string, "0.75") == 0) { mem = UDS_MEMORY_CONFIG_768MB; } else { int number; if (parseInt(string, &number) != VDO_SUCCESS) { return -EINVAL; } mem = number; } *sizePtr = (uint32_t) mem; return VDO_SUCCESS; } /**********************************************************************/ int parseIndexConfig(UdsConfigStrings *configStrings, struct index_config *configPtr) { struct index_config config; memset(&config, 0, sizeof(config)); config.mem = UDS_MEMORY_CONFIG_256MB; if (configStrings->memorySize != NULL) { uint32_t mem; int result = parseMem(configStrings->memorySize, &mem); if (result != VDO_SUCCESS) { return result; } config.mem = mem; } if (configStrings->sparse != NULL) { config.sparse = (strcmp(configStrings->sparse, "0") != 0); } *configPtr = config; return VDO_SUCCESS; } vdo-8.3.1.1/utils/vdo/parseUtils.h000066400000000000000000000052021476467262700167320ustar00rootroot00000000000000/* * Copyright Red Hat * * This program is free software; you can redistribute it and/or * modify it under the terms of the GNU General Public License * as published by the Free Software Foundation; either version 2 * of the License, or (at your option) any later version. * * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with this program; if not, write to the Free Software * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA * 02110-1301, USA. */ #ifndef PARSE_UTILS_H #define PARSE_UTILS_H #include <stdbool.h> #include <stdint.h> #include "indexer.h" #include "encodings.h" typedef struct { char *sparse; char *memorySize; } UdsConfigStrings; /** * Parse a string argument as an unsigned int. * * @param [in] arg The argument to parse * @param [in] lowest The lowest allowed value * @param [in] highest The highest allowed value * @param [out] numPtr A pointer to return the parsed integer. * * @return VDO_SUCCESS or VDO_OUT_OF_RANGE. **/ int __must_check parseUInt(const char *arg, unsigned int lowest, unsigned int highest, unsigned int *numPtr); /** * Parse a string argument as a signed int. * * @param [in] arg The argument to parse * @param [out] numPtr A pointer to return the parsed integer. * * @return VDO_SUCCESS or VDO_OUT_OF_RANGE. **/ int parseInt(const char *arg, int *numPtr); /** * Parse a string argument as a decimal uint64_t. * * @param [in] arg The argument to parse * @param [out] numPtr A pointer to return the parsed value. * * @return VDO_SUCCESS or VDO_OUT_OF_RANGE. **/ int __must_check parseUInt64(const char *arg, uint64_t *numPtr); /** * Parse a string argument as a size, optionally using LVM's concept * of size suffixes. * * @param [in] arg The argument to parse * @param [in] lvmMode Whether to parse suffixes as LVM or SI. * @param [out] sizePtr A pointer to return the parsed size, in bytes * * @return VDO_SUCCESS or VDO_OUT_OF_RANGE. 
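 *
 * For example, "10G" parses to 10 * 2^30 bytes in either mode, since unit
 * suffixes are always treated as binary; a bare "10" parses to 10 bytes in
 * SI mode and to 10 megabytes (10 * 2^20 bytes) in LVM mode.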
**/ int __must_check parseSize(const char *arg, bool lvmMode, uint64_t *sizePtr); /** * Parse UdsConfigStrings into a index_config. * * @param [in] configStrings The UDS config strings read. * @param [out] configPtr A pointer to return the struct index_config. **/ int __must_check parseIndexConfig(UdsConfigStrings *configStrings, struct index_config *configPtr); #endif // PARSE_UTILS_H vdo-8.3.1.1/utils/vdo/physicalLayer.h000066400000000000000000000063021476467262700174120ustar00rootroot00000000000000/* * Copyright Red Hat * * This program is free software; you can redistribute it and/or * modify it under the terms of the GNU General Public License * as published by the Free Software Foundation; either version 2 * of the License, or (at your option) any later version. * * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with this program; if not, write to the Free Software * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA * 02110-1301, USA. */ #ifndef PHYSICAL_LAYER_H #define PHYSICAL_LAYER_H #include "types.h" typedef struct physicalLayer PhysicalLayer; /** * A function to destroy a physical layer and NULL out the reference to it. * * @param layer_ptr A pointer to the layer to destroy **/ typedef void layer_destructor(PhysicalLayer **layer_ptr); /** * A function to report the block count of a physicalLayer. * * @param layer The layer * * @return The block count of the layer **/ typedef block_count_t block_count_getter(PhysicalLayer *layer); /** * A function which can allocate a buffer suitable for use in an * extent_reader or extent_writer. * * @param [in] layer The physical layer in question * @param [in] bytes The size of the buffer, in bytes. * @param [in] why The occasion for allocating the buffer * @param [out] buffer_ptr A pointer to hold the buffer * * @return a success or error code **/ typedef int buffer_allocator(PhysicalLayer *layer, size_t bytes, const char *why, char **buffer_ptr); /** * A function which can read an extent from a physicalLayer. * * @param [in] layer The physical layer from which to read * @param [in] startBlock The physical block number of the start of the * extent * @param [in] blockCount The number of blocks in the extent * @param [out] buffer A buffer to hold the extent * * @return a success or error code **/ typedef int extent_reader(PhysicalLayer *layer, physical_block_number_t startBlock, size_t blockCount, char *buffer); /** * A function which can write an extent to a physicalLayer. * * @param [in] layer The physical layer to which to write * @param [in] startBlock The physical block number of the start of the * extent * @param [in] blockCount The number of blocks in the extent * @param [in] buffer The buffer which contains the data * * @return a success or error code **/ typedef int extent_writer(PhysicalLayer *layer, physical_block_number_t startBlock, size_t blockCount, char *buffer); /** * An abstraction representing the underlying physical layer. 
**/ struct physicalLayer { /* Management interface */ layer_destructor *destroy; /* Synchronous interface */ block_count_getter *getBlockCount; /* Synchronous IO interface */ buffer_allocator *allocateIOBuffer; extent_reader *reader; extent_writer *writer; }; #endif /* PHYSICAL_LAYER_H */ vdo-8.3.1.1/utils/vdo/slabSummaryReader.c000066400000000000000000000067271476467262700202270ustar00rootroot00000000000000/* * Copyright Red Hat * * This program is free software; you can redistribute it and/or * modify it under the terms of the GNU General Public License * as published by the Free Software Foundation; either version 2 * of the License, or (at your option) any later version. * * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with this program; if not, write to the Free Software * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA * 02110-1301, USA. */ #include "slabSummaryReader.h" #include <err.h> #include "memory-alloc.h" #include "encodings.h" #include "status-codes.h" #include "types.h" #include "physicalLayer.h" #include "userVDO.h" /**********************************************************************/ int readSlabSummary(UserVDO *vdo, struct slab_summary_entry **entriesPtr) { zone_count_t zones = vdo->states.slab_depot.zone_count; if (zones == 0) { return VDO_SUCCESS; } struct slab_summary_entry *entries; block_count_t summary_blocks = VDO_SLAB_SUMMARY_BLOCKS_PER_ZONE; int result = vdo->layer->allocateIOBuffer(vdo->layer, summary_blocks * VDO_BLOCK_SIZE, "slab summary entries", (char **) &entries); if (result != VDO_SUCCESS) { warnx("Could not create in-memory slab summary"); return result; } struct partition *slab_summary_partition; result = vdo_get_partition(&vdo->states.layout, VDO_SLAB_SUMMARY_PARTITION, &slab_summary_partition); if (result != VDO_SUCCESS) { warnx("Could not find slab summary partition"); return result; } physical_block_number_t origin = slab_summary_partition->offset; result = vdo->layer->reader(vdo->layer, origin, summary_blocks, (char *) entries); if (result != VDO_SUCCESS) { warnx("Could not read summary data"); vdo_free(entries); return result; } // If there is more than one zone, read and combine the other zone's data // with the data already read from the first zone. 
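// The merge below takes entry n from zone (n % zones): each additional zone's
// copy of the summary is read into 'buffer', and entries zone, zone + zones,
// zone + 2 * zones, ... are copied over the corresponding first-zone entries.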
if (zones > 1) { struct slab_summary_entry *buffer; result = vdo->layer->allocateIOBuffer(vdo->layer, summary_blocks * VDO_BLOCK_SIZE, "slab summary entries", (char **) &buffer); if (result != VDO_SUCCESS) { warnx("Could not create slab summary buffer"); vdo_free(entries); return result; } for (zone_count_t zone = 1; zone < zones; zone++) { origin += summary_blocks; result = vdo->layer->reader(vdo->layer, origin, summary_blocks, (char *) buffer); if (result != VDO_SUCCESS) { warnx("Could not read summary data"); vdo_free(buffer); vdo_free(entries); return result; } for (slab_count_t entry_number = zone; entry_number < MAX_VDO_SLABS; entry_number += zones) { memcpy(entries + entry_number, buffer + entry_number, sizeof(struct slab_summary_entry)); } } vdo_free(buffer); } *entriesPtr = entries; return VDO_SUCCESS; } vdo-8.3.1.1/utils/vdo/slabSummaryReader.h000066400000000000000000000023631476467262700202260ustar00rootroot00000000000000/* * Copyright Red Hat * * This program is free software; you can redistribute it and/or * modify it under the terms of the GNU General Public License * as published by the Free Software Foundation; either version 2 * of the License, or (at your option) any later version. * * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with this program; if not, write to the Free Software * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA * 02110-1301, USA. */ #ifndef SLAB_SUMMARY_READER_H #define SLAB_SUMMARY_READER_H #include "encodings.h" #include "types.h" #include "userVDO.h" /** * Read the contents of the slab summary into a single set of summary entries. * * @param [in] vdo The vdo from which to read the summary * @param [out] entries_ptr A pointer to hold the loaded entries * * @return VDO_SUCCESS or an error code **/ int __must_check readSlabSummary(UserVDO *vdo, struct slab_summary_entry **entriesPtr); #endif // SLAB_SUMMARY_UTILS_H vdo-8.3.1.1/utils/vdo/statistics.h000066400000000000000000000217251476467262700170010ustar00rootroot00000000000000/* SPDX-License-Identifier: GPL-2.0-only */ /* * Copyright 2023 Red Hat */ /* * * If you add new statistics, be sure to update the following files: * * ./message-stats.c * ./pool-sysfs-stats.c * ../user/messageStatsReader.c * ../user/vdoStatsWriter.c * ../../../perl/Permabit/Statistics/Definitions.pm */ #ifndef STATISTICS_H #define STATISTICS_H #include "types.h" enum { STATISTICS_VERSION = 36, }; struct block_allocator_statistics { /* The total number of slabs from which blocks may be allocated */ u64 slab_count; /* The total number of slabs from which blocks have ever been allocated */ u64 slabs_opened; /* The number of times since loading that a slab has been re-opened */ u64 slabs_reopened; }; /** * Counters for tracking the number of items written (blocks, requests, etc.) * that keep track of totals at steps in the write pipeline. 
Three counters * allow the number of buffered, in-memory items and the number of in-flight, * unacknowledged writes to be derived, while still tracking totals for * reporting purposes */ struct commit_statistics { /* The total number of items on which processing has started */ u64 started; /* The total number of items for which a write operation has been issued */ u64 written; /* The total number of items for which a write operation has completed */ u64 committed; }; /** Counters for events in the recovery journal */ struct recovery_journal_statistics { /* Number of times the on-disk journal was full */ u64 disk_full; /* Number of times the recovery journal requested slab journal commits. */ u64 slab_journal_commits_requested; /* Write/Commit totals for individual journal entries */ struct commit_statistics entries; /* Write/Commit totals for journal blocks */ struct commit_statistics blocks; }; /** The statistics for the compressed block packer. */ struct packer_statistics { /* Number of compressed data items written since startup */ u64 compressed_fragments_written; /* Number of blocks containing compressed items written since startup */ u64 compressed_blocks_written; /* Number of VIOs that are pending in the packer */ u64 compressed_fragments_in_packer; }; /** The statistics for the slab journals. */ struct slab_journal_statistics { /* Number of times the on-disk journal was full */ u64 disk_full_count; /* Number of times an entry was added over the flush threshold */ u64 flush_count; /* Number of times an entry was added over the block threshold */ u64 blocked_count; /* Number of times a tail block was written */ u64 blocks_written; /* Number of times we had to wait for the tail to write */ u64 tail_busy_count; }; /** The statistics for the slab summary. */ struct slab_summary_statistics { /* Number of blocks written */ u64 blocks_written; }; /** The statistics for the reference counts. */ struct ref_counts_statistics { /* Number of reference blocks written */ u64 blocks_written; }; /** The statistics for the block map. 
*/ struct block_map_statistics { /* number of dirty (resident) pages */ u32 dirty_pages; /* number of clean (resident) pages */ u32 clean_pages; /* number of free pages */ u32 free_pages; /* number of pages in failed state */ u32 failed_pages; /* number of pages incoming */ u32 incoming_pages; /* number of pages outgoing */ u32 outgoing_pages; /* how many times free page not avail */ u32 cache_pressure; /* number of get_vdo_page() calls for read */ u64 read_count; /* number of get_vdo_page() calls for write */ u64 write_count; /* number of times pages failed to read */ u64 failed_reads; /* number of times pages failed to write */ u64 failed_writes; /* number of gets that are reclaimed */ u64 reclaimed; /* number of gets for outgoing pages */ u64 read_outgoing; /* number of gets that were already there */ u64 found_in_cache; /* number of gets requiring discard */ u64 discard_required; /* number of gets enqueued for their page */ u64 wait_for_page; /* number of gets that have to fetch */ u64 fetch_required; /* number of page fetches */ u64 pages_loaded; /* number of page saves */ u64 pages_saved; /* the number of flushes issued */ u64 flush_count; }; /** The dedupe statistics from hash locks */ struct hash_lock_statistics { /* Number of times the UDS advice proved correct */ u64 dedupe_advice_valid; /* Number of times the UDS advice proved incorrect */ u64 dedupe_advice_stale; /* Number of writes with the same data as another in-flight write */ u64 concurrent_data_matches; /* Number of writes whose hash collided with an in-flight write */ u64 concurrent_hash_collisions; /* Current number of dedupe queries that are in flight */ u32 curr_dedupe_queries; }; /** Counts of error conditions in VDO. */ struct error_statistics { /* number of times VDO got an invalid dedupe advice PBN from UDS */ u64 invalid_advice_pbn_count; /* number of times a VIO completed with a VDO_NO_SPACE error */ u64 no_space_error_count; /* number of times a VIO completed with a VDO_READ_ONLY error */ u64 read_only_error_count; }; struct bio_stats { /* Number of REQ_OP_READ bios */ u64 read; /* Number of REQ_OP_WRITE bios with data */ u64 write; /* Number of bios tagged with REQ_PREFLUSH and containing no data */ u64 empty_flush; /* Number of REQ_OP_DISCARD bios */ u64 discard; /* Number of bios tagged with REQ_PREFLUSH */ u64 flush; /* Number of bios tagged with REQ_FUA */ u64 fua; }; struct memory_usage { /* Tracked bytes currently allocated. */ u64 bytes_used; /* Maximum tracked bytes allocated. */ u64 peak_bytes_used; }; /** UDS index statistics */ struct index_statistics { /* Number of records stored in the index */ u64 entries_indexed; /* Number of post calls that found an existing entry */ u64 posts_found; /* Number of post calls that added a new entry */ u64 posts_not_found; /* Number of query calls that found an existing entry */ u64 queries_found; /* Number of query calls that added a new entry */ u64 queries_not_found; /* Number of update calls that found an existing entry */ u64 updates_found; /* Number of update calls that added a new entry */ u64 updates_not_found; /* Number of entries discarded */ u64 entries_discarded; }; /** The statistics of the vdo service. 
*/ struct vdo_statistics { u32 version; /* Number of blocks used for data */ u64 data_blocks_used; /* Number of blocks used for VDO metadata */ u64 overhead_blocks_used; /* Number of logical blocks that are currently mapped to physical blocks */ u64 logical_blocks_used; /* number of physical blocks */ block_count_t physical_blocks; /* number of logical blocks */ block_count_t logical_blocks; /* Size of the block map page cache, in bytes */ u64 block_map_cache_size; /* The physical block size */ u64 block_size; /* Number of times the VDO has successfully recovered */ u64 complete_recoveries; /* Number of times the VDO has recovered from read-only mode */ u64 read_only_recoveries; /* String describing the operating mode of the VDO */ char mode[15]; /* Whether the VDO is in recovery mode */ bool in_recovery_mode; /* What percentage of recovery mode work has been completed */ u8 recovery_percentage; /* The statistics for the compressed block packer */ struct packer_statistics packer; /* Counters for events in the block allocator */ struct block_allocator_statistics allocator; /* Counters for events in the recovery journal */ struct recovery_journal_statistics journal; /* The statistics for the slab journals */ struct slab_journal_statistics slab_journal; /* The statistics for the slab summary */ struct slab_summary_statistics slab_summary; /* The statistics for the reference counts */ struct ref_counts_statistics ref_counts; /* The statistics for the block map */ struct block_map_statistics block_map; /* The dedupe statistics from hash locks */ struct hash_lock_statistics hash_lock; /* Counts of error conditions */ struct error_statistics errors; /* The VDO instance */ u32 instance; /* Current number of active VIOs */ u32 current_vios_in_progress; /* Maximum number of active VIOs */ u32 max_vios; /* Number of times the UDS index was too slow in responding */ u64 dedupe_advice_timeouts; /* Number of flush requests submitted to the storage device */ u64 flush_out; /* Logical block size */ u64 logical_block_size; /* Bios submitted into VDO from above */ struct bio_stats bios_in; struct bio_stats bios_in_partial; /* Bios submitted onward for user data */ struct bio_stats bios_out; /* Bios submitted onward for metadata */ struct bio_stats bios_meta; struct bio_stats bios_journal; struct bio_stats bios_page_cache; struct bio_stats bios_out_completed; struct bio_stats bios_meta_completed; struct bio_stats bios_journal_completed; struct bio_stats bios_page_cache_completed; struct bio_stats bios_acknowledged; struct bio_stats bios_acknowledged_partial; /* Current number of bios in progress */ struct bio_stats bios_in_progress; /* Memory usage stats. 
*/ struct memory_usage memory_usage; /* The statistics for the UDS index */ struct index_statistics index; }; #endif /* not STATISTICS_H */ vdo-8.3.1.1/utils/vdo/status-codes.c000066400000000000000000000102261476467262700172120ustar00rootroot00000000000000// SPDX-License-Identifier: GPL-2.0-only /* * Copyright 2023 Red Hat */ #include "status-codes.h" #include #include "errors.h" #include "logger.h" #include "permassert.h" #include "thread-utils.h" const struct error_info vdo_status_list[] = { { "VDO_NOT_IMPLEMENTED", "Not implemented" }, { "VDO_OUT_OF_RANGE", "Out of range" }, { "VDO_REF_COUNT_INVALID", "Reference count would become invalid" }, { "VDO_NO_SPACE", "Out of space" }, { "VDO_BAD_CONFIGURATION", "Bad configuration option" }, { "VDO_COMPONENT_BUSY", "Prior operation still in progress" }, { "VDO_BAD_PAGE", "Corrupt or incorrect page" }, { "VDO_UNSUPPORTED_VERSION", "Unsupported component version" }, { "VDO_INCORRECT_COMPONENT", "Component id mismatch in decoder" }, { "VDO_PARAMETER_MISMATCH", "Parameters have conflicting values" }, { "VDO_UNKNOWN_PARTITION", "No partition exists with a given id" }, { "VDO_PARTITION_EXISTS", "A partition already exists with a given id" }, { "VDO_INCREMENT_TOO_SMALL", "Physical block growth of too few blocks" }, { "VDO_CHECKSUM_MISMATCH", "Incorrect checksum" }, { "VDO_LOCK_ERROR", "A lock is held incorrectly" }, { "VDO_READ_ONLY", "The device is in read-only mode" }, { "VDO_SHUTTING_DOWN", "The device is shutting down" }, { "VDO_CORRUPT_JOURNAL", "Recovery journal corrupted" }, { "VDO_TOO_MANY_SLABS", "Exceeds maximum number of slabs supported" }, { "VDO_INVALID_FRAGMENT", "Compressed block fragment is invalid" }, { "VDO_RETRY_AFTER_REBUILD", "Retry operation after rebuilding finishes" }, { "VDO_BAD_MAPPING", "Invalid page mapping" }, { "VDO_BIO_CREATION_FAILED", "Bio creation failed" }, { "VDO_BAD_MAGIC", "Bad magic number" }, { "VDO_BAD_NONCE", "Bad nonce" }, { "VDO_JOURNAL_OVERFLOW", "Journal sequence number overflow" }, { "VDO_INVALID_ADMIN_STATE", "Invalid operation for current state" }, { "VDO_UNEXPECTED_EOF", "Unexpected EOF on block read" }, { "VDO_NOT_READ_ONLY", "The device is not in read-only mode" }, }; static atomic_t vdo_status_codes_registered = ATOMIC_INIT(0); static int status_code_registration_result; static void do_status_code_registration(void) { int result; BUILD_BUG_ON((VDO_STATUS_CODE_LAST - VDO_STATUS_CODE_BASE) != ARRAY_SIZE(vdo_status_list)); result = uds_register_error_block("VDO Status", VDO_STATUS_CODE_BASE, VDO_STATUS_CODE_BLOCK_END, vdo_status_list, sizeof(vdo_status_list)); /* * The following test handles cases where libvdo is statically linked against both the test * modules and the test driver (because multiple instances of this module call their own * copy of this function once each, resulting in multiple calls to register_error_block * which is shared in libuds). */ if (result == UDS_DUPLICATE_NAME) result = UDS_SUCCESS; status_code_registration_result = (result == UDS_SUCCESS) ? VDO_SUCCESS : result; } /** * vdo_register_status_codes() - Register the VDO status codes if needed. * Return: A success or error code. */ int vdo_register_status_codes(void) { vdo_perform_once(&vdo_status_codes_registered, do_status_code_registration); return status_code_registration_result; } /** * vdo_status_to_errno() - Given an error code, return a value we can return to the OS. * @error: The error code to convert. 
* * The input error code may be a system-generated value (such as -EIO), an errno macro used in our * code (such as EIO), or a UDS or VDO status code; the result must be something the rest of the OS * can consume (negative errno values such as -EIO, in the case of the kernel). * * Return: A system error code value. */ int vdo_status_to_errno(int error) { char error_name[VDO_MAX_ERROR_NAME_SIZE]; char error_message[VDO_MAX_ERROR_MESSAGE_SIZE]; /* 0 is success, negative a system error code */ if (likely(error <= 0)) return error; if (error < 1024) return -error; /* VDO or UDS error */ switch (error) { case VDO_NO_SPACE: return -ENOSPC; case VDO_READ_ONLY: return -EIO; default: vdo_log_info("%s: mapping internal status code %d (%s: %s) to EIO", __func__, error, uds_string_error_name(error, error_name, sizeof(error_name)), uds_string_error(error, error_message, sizeof(error_message))); return -EIO; } } vdo-8.3.1.1/utils/vdo/status-codes.h000066400000000000000000000050271476467262700172220ustar00rootroot00000000000000/* SPDX-License-Identifier: GPL-2.0-only */ /* * Copyright 2023 Red Hat */ #ifndef VDO_STATUS_CODES_H #define VDO_STATUS_CODES_H #include "errors.h" enum { UDS_ERRORS_BLOCK_SIZE = UDS_ERROR_CODE_BLOCK_END - UDS_ERROR_CODE_BASE, VDO_ERRORS_BLOCK_START = UDS_ERROR_CODE_BLOCK_END, VDO_ERRORS_BLOCK_END = VDO_ERRORS_BLOCK_START + UDS_ERRORS_BLOCK_SIZE, }; /* VDO-specific status codes. */ enum vdo_status_codes { /* base of all VDO errors */ VDO_STATUS_CODE_BASE = VDO_ERRORS_BLOCK_START, /* we haven't written this yet */ VDO_NOT_IMPLEMENTED = VDO_STATUS_CODE_BASE, /* input out of range */ VDO_OUT_OF_RANGE, /* an invalid reference count would result */ VDO_REF_COUNT_INVALID, /* a free block could not be allocated */ VDO_NO_SPACE, /* improper or missing configuration option */ VDO_BAD_CONFIGURATION, /* prior operation still in progress */ VDO_COMPONENT_BUSY, /* page contents incorrect or corrupt data */ VDO_BAD_PAGE, /* unsupported version of some component */ VDO_UNSUPPORTED_VERSION, /* component id mismatch in decoder */ VDO_INCORRECT_COMPONENT, /* parameters have conflicting values */ VDO_PARAMETER_MISMATCH, /* no partition exists with a given id */ VDO_UNKNOWN_PARTITION, /* a partition already exists with a given id */ VDO_PARTITION_EXISTS, /* physical block growth of too few blocks */ VDO_INCREMENT_TOO_SMALL, /* incorrect checksum */ VDO_CHECKSUM_MISMATCH, /* a lock is held incorrectly */ VDO_LOCK_ERROR, /* the VDO is in read-only mode */ VDO_READ_ONLY, /* the VDO is shutting down */ VDO_SHUTTING_DOWN, /* the recovery journal has corrupt entries or corrupt metadata */ VDO_CORRUPT_JOURNAL, /* exceeds maximum number of slabs supported */ VDO_TOO_MANY_SLABS, /* a compressed block fragment is invalid */ VDO_INVALID_FRAGMENT, /* action is unsupported while rebuilding */ VDO_RETRY_AFTER_REBUILD, /* a block map entry is invalid */ VDO_BAD_MAPPING, /* bio_add_page failed */ VDO_BIO_CREATION_FAILED, /* bad magic number */ VDO_BAD_MAGIC, /* bad nonce */ VDO_BAD_NONCE, /* sequence number overflow */ VDO_JOURNAL_OVERFLOW, /* the VDO is not in a state to perform an admin operation */ VDO_INVALID_ADMIN_STATE, /* unexpected EOF on block read */ VDO_UNEXPECTED_EOF, /* the VDO is not in read-only mode */ VDO_NOT_READ_ONLY, /* one more than last error code */ VDO_STATUS_CODE_LAST, VDO_STATUS_CODE_BLOCK_END = VDO_ERRORS_BLOCK_END }; extern const struct error_info vdo_status_list[]; int vdo_register_status_codes(void); int vdo_status_to_errno(int error); #endif /* VDO_STATUS_CODES_H */ 
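A minimal usage sketch (not part of the VDO sources; the wrapper function name below is invented for illustration and assumes status-codes.h is included): register the status codes once, then convert any internal VDO/UDS status to a negative errno before handing it back to the OS, exactly as vdo_status_to_errno() above does for the kernel.

/* Illustrative only; handle_vdo_status() is a hypothetical caller-side wrapper. */
static int handle_vdo_status(int internal_status)
{
  /* Safe to call repeatedly; registration only happens once. */
  int result = vdo_register_status_codes();

  if (result != VDO_SUCCESS)
    return vdo_status_to_errno(result);

  /* e.g. VDO_NO_SPACE becomes -ENOSPC, VDO_READ_ONLY becomes -EIO */
  return vdo_status_to_errno(internal_status);
}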
vdo-8.3.1.1/utils/vdo/types.h000066400000000000000000000122711476467262700157470ustar00rootroot00000000000000
/* SPDX-License-Identifier: GPL-2.0-only */
/*
 * Copyright 2023 Red Hat
 */

#ifndef VDO_TYPES_H
#define VDO_TYPES_H

#include
#include
#include "funnel-queue.h"

/* A size type in blocks. */
typedef u64 block_count_t;

/* The size of a block. */
typedef u16 block_size_t;

/* A counter for data_vios */
typedef u16 data_vio_count_t;

/* A height within a tree. */
typedef u8 height_t;

/* The logical block number as used by the consumer. */
typedef u64 logical_block_number_t;

/* The type of the nonce used to identify instances of VDO. */
typedef u64 nonce_t;

/* A size in pages. */
typedef u32 page_count_t;

/* A page number. */
typedef u32 page_number_t;

/*
 * The physical (well, less logical) block number at which the block is found on the underlying
 * device.
 */
typedef u64 physical_block_number_t;

/* A count of tree roots. */
typedef u8 root_count_t;

/* A number of sectors. */
typedef u8 sector_count_t;

/* A sequence number. */
typedef u64 sequence_number_t;

/* The offset of a block within a slab. */
typedef u32 slab_block_number;

/* A size type in slabs. */
typedef u16 slab_count_t;

/* A slot in a bin or block map page. */
typedef u16 slot_number_t;

/* typedef thread_count_t - A thread counter. */
typedef u8 thread_count_t;

/* typedef thread_id_t - A thread ID, vdo threads are numbered sequentially from 0. */
typedef u8 thread_id_t;

/* A zone counter */
typedef u8 zone_count_t;

/* The following enums are persisted on storage, so the values must be preserved. */

/* The current operating mode of the VDO. */
enum vdo_state {
	VDO_DIRTY = 0,
	VDO_NEW = 1,
	VDO_CLEAN = 2,
	VDO_READ_ONLY_MODE = 3,
	VDO_FORCE_REBUILD = 4,
	VDO_RECOVERING = 5,
	VDO_REPLAYING = 6, /* VDO_REPLAYING is never set anymore, but retained for upgrade */
	VDO_REBUILD_FOR_UPGRADE = 7,

	/* Keep VDO_STATE_COUNT at the bottom. */
	VDO_STATE_COUNT
};

/**
 * vdo_state_requires_read_only_rebuild() - Check whether a vdo_state indicates
 * that a read-only rebuild is required.
 * @state: The vdo_state to check.
 *
 * Return: true if the state indicates a rebuild is required
 */
static inline bool __must_check vdo_state_requires_read_only_rebuild(enum vdo_state state)
{
	return ((state == VDO_FORCE_REBUILD) || (state == VDO_REBUILD_FOR_UPGRADE));
}

/**
 * vdo_state_requires_recovery() - Check whether a vdo state indicates that recovery is needed.
 * @state: The state to check.
 *
 * Return: true if the state indicates a recovery is required
 */
static inline bool __must_check vdo_state_requires_recovery(enum vdo_state state)
{
	return ((state == VDO_DIRTY) || (state == VDO_REPLAYING) || (state == VDO_RECOVERING));
}

/*
 * The current operation on a physical block (from the point of view of the recovery journal, slab
 * journals, and reference counts).
 */
enum journal_operation {
	VDO_JOURNAL_DATA_REMAPPING = 0,
	VDO_JOURNAL_BLOCK_MAP_REMAPPING = 1,
} __packed;

/* Partition IDs encoded in the volume layout in the super block. */
enum partition_id {
	VDO_BLOCK_MAP_PARTITION = 0,
	VDO_SLAB_DEPOT_PARTITION = 1,
	VDO_RECOVERY_JOURNAL_PARTITION = 2,
	VDO_SLAB_SUMMARY_PARTITION = 3,
} __packed;

/* Metadata types for the vdo. */
enum vdo_metadata_type {
	VDO_METADATA_RECOVERY_JOURNAL = 1,
	VDO_METADATA_SLAB_JOURNAL = 2,
	VDO_METADATA_RECOVERY_JOURNAL_2 = 3,
} __packed;

/* A position in the block map where a block map entry is stored.
*/ struct block_map_slot { physical_block_number_t pbn; slot_number_t slot; }; /* * Four bits of each five-byte block map entry contain a mapping state value used to distinguish * unmapped or discarded logical blocks (which are treated as mapped to the zero block) from entries * that have been mapped to a physical block, including the zero block. * * FIXME: these should maybe be defines. */ enum block_mapping_state { VDO_MAPPING_STATE_UNMAPPED = 0, /* Must be zero to be the default value */ VDO_MAPPING_STATE_UNCOMPRESSED = 1, /* A normal (uncompressed) block */ VDO_MAPPING_STATE_COMPRESSED_BASE = 2, /* Compressed in slot 0 */ VDO_MAPPING_STATE_COMPRESSED_MAX = 15, /* Compressed in slot 13 */ }; enum { VDO_MAX_COMPRESSION_SLOTS = (VDO_MAPPING_STATE_COMPRESSED_MAX - VDO_MAPPING_STATE_COMPRESSED_BASE + 1), }; struct data_location { physical_block_number_t pbn; enum block_mapping_state state; }; /* The configuration of a single slab derived from the configured block size and slab size. */ struct slab_config { /* total number of blocks in the slab */ block_count_t slab_blocks; /* number of blocks available for data */ block_count_t data_blocks; /* number of blocks for reference counts */ block_count_t reference_count_blocks; /* number of blocks for the slab journal */ block_count_t slab_journal_blocks; /* * Number of blocks after which the slab journal starts pushing out a reference_block for * each new entry it receives. */ block_count_t slab_journal_flushing_threshold; /* * Number of blocks after which the slab journal pushes out all reference_blocks and makes * all vios wait. */ block_count_t slab_journal_blocking_threshold; /* Number of blocks after which the slab must be scrubbed before coming online. */ block_count_t slab_journal_scrubbing_threshold; } __packed; struct vdo_config; #endif /* VDO_TYPES_H */ vdo-8.3.1.1/utils/vdo/userVDO.c000066400000000000000000000173721476467262700161340ustar00rootroot00000000000000/* * Copyright Red Hat * * This program is free software; you can redistribute it and/or * modify it under the terms of the GNU General Public License * as published by the Free Software Foundation; either version 2 * of the License, or (at your option) any later version. * * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with this program; if not, write to the Free Software * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA * 02110-1301, USA. 
*/ #include "userVDO.h" #include #include #include "memory-alloc.h" #include "encodings.h" #include "status-codes.h" #include "types.h" #include "physicalLayer.h" /**********************************************************************/ int makeUserVDO(PhysicalLayer *layer, UserVDO **vdoPtr) { UserVDO *vdo; int result = vdo_allocate(1, UserVDO, __func__, &vdo); if (result != VDO_SUCCESS) { return result; } vdo->layer = layer; *vdoPtr = vdo; return VDO_SUCCESS; } /**********************************************************************/ void freeUserVDO(UserVDO **vdoPtr) { UserVDO *vdo = *vdoPtr; if (vdo == NULL) { return; } vdo_destroy_component_states(&vdo->states); vdo_free(vdo); *vdoPtr = NULL; } /**********************************************************************/ int __must_check loadSuperBlock(UserVDO *vdo) { int result = vdo->layer->reader(vdo->layer, vdo_get_data_region_start(vdo->geometry), 1, vdo->superBlockBuffer); if (result != VDO_SUCCESS) { return result; } return vdo_decode_super_block((u8 *) vdo->superBlockBuffer); } /**********************************************************************/ int loadVDOWithGeometry(PhysicalLayer *layer, struct volume_geometry *geometry, bool validateConfig, UserVDO **vdoPtr) { UserVDO *vdo; int result = makeUserVDO(layer, &vdo); if (result != VDO_SUCCESS) { return result; } vdo->geometry = *geometry; result = loadSuperBlock(vdo); if (result != VDO_SUCCESS) { freeUserVDO(&vdo); return result; } result = vdo_decode_component_states((u8 *) vdo->superBlockBuffer, &vdo->geometry, &vdo->states); if (result != VDO_SUCCESS) { freeUserVDO(&vdo); return result; } if (validateConfig) { result = vdo_validate_component_states(&vdo->states, geometry->nonce, layer->getBlockCount(layer), 0); if (result != VDO_SUCCESS) { freeUserVDO(&vdo); return result; } } setDerivedSlabParameters(vdo); *vdoPtr = vdo; return VDO_SUCCESS; } /**********************************************************************/ int loadVolumeGeometry(PhysicalLayer *layer, struct volume_geometry *geometry) { char *block; int result; result = layer->allocateIOBuffer(layer, VDO_BLOCK_SIZE, "geometry block", &block); if (result != VDO_SUCCESS) return result; result = layer->reader(layer, VDO_GEOMETRY_BLOCK_LOCATION, 1, block); if (result != VDO_SUCCESS) { vdo_free(block); return result; } result = vdo_parse_geometry_block((u8 *) block, geometry); vdo_free(block); return result; } /**********************************************************************/ int loadVDO(PhysicalLayer *layer, bool validateConfig, UserVDO **vdoPtr) { struct volume_geometry geometry; int result = loadVolumeGeometry(layer, &geometry); if (result != VDO_SUCCESS) { return result; } return loadVDOWithGeometry(layer, &geometry, validateConfig, vdoPtr); } /**********************************************************************/ int writeVolumeGeometryWithVersion(PhysicalLayer *layer, struct volume_geometry *geometry, u32 version) { u8 *block; size_t offset = 0; u32 checksum; int result; result = layer->allocateIOBuffer(layer, VDO_BLOCK_SIZE, "geometry", (char **) &block); if (result != VDO_SUCCESS) return result; memcpy(block, VDO_GEOMETRY_MAGIC_NUMBER, VDO_GEOMETRY_MAGIC_NUMBER_SIZE); offset += VDO_GEOMETRY_MAGIC_NUMBER_SIZE; result = encode_volume_geometry(block, &offset, geometry, version); if (result != VDO_SUCCESS) { vdo_free(block); return result; } checksum = vdo_crc32(block, offset); encode_u32_le(block, &offset, checksum); result = layer->writer(layer, VDO_GEOMETRY_BLOCK_LOCATION, 1, (char *) block); vdo_free(block); 
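  /*
   * At this point the I/O buffer written above holds the geometry magic
   * number, the encoded volume geometry, and a little-endian CRC32 checksum
   * over those bytes, so the status of that single-block write to
   * VDO_GEOMETRY_BLOCK_LOCATION is what gets returned below.
   */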
return result; } /**********************************************************************/ int saveSuperBlock(UserVDO *vdo) { vdo_encode_super_block((u8 *) vdo->superBlockBuffer, &vdo->states); return vdo->layer->writer(vdo->layer, vdo_get_data_region_start(vdo->geometry), 1, vdo->superBlockBuffer); } /**********************************************************************/ int saveVDO(UserVDO *vdo, bool saveGeometry) { int result = saveSuperBlock(vdo); if (result != VDO_SUCCESS) { return result; } if (!saveGeometry) { return VDO_SUCCESS; } return writeVolumeGeometry(vdo->layer, &vdo->geometry); } /**********************************************************************/ void setDerivedSlabParameters(UserVDO *vdo) { vdo->slabSizeShift = ilog2(vdo->states.vdo.config.slab_size); vdo->slabCount = vdo_compute_slab_count(vdo->states.slab_depot.first_block, vdo->states.slab_depot.last_block, vdo->slabSizeShift); vdo->slabOffsetMask = (1ULL << vdo->slabSizeShift) - 1; } /**********************************************************************/ int getSlabNumber(const UserVDO *vdo, physical_block_number_t pbn, slab_count_t *slabPtr) { const struct slab_depot_state_2_0 *depot = &vdo->states.slab_depot; if ((pbn < depot->first_block) || (pbn >= depot->last_block)) { return VDO_OUT_OF_RANGE; } *slabPtr = ((pbn - depot->first_block) >> vdo->slabSizeShift); return VDO_SUCCESS; } /**********************************************************************/ int getSlabBlockNumber(const UserVDO *vdo, physical_block_number_t pbn, slab_block_number *sbnPtr) { const struct slab_depot_state_2_0 *depot = &vdo->states.slab_depot; if ((pbn < depot->first_block) || (pbn >= depot->last_block)) { return VDO_OUT_OF_RANGE; } slab_block_number sbn = ((pbn - depot->first_block) & vdo->slabOffsetMask); if (sbn >= depot->slab_config.data_blocks) { return VDO_OUT_OF_RANGE; } *sbnPtr = sbn; return VDO_SUCCESS; } /**********************************************************************/ bool isValidDataBlock(const UserVDO *vdo, physical_block_number_t pbn) { slab_block_number sbn; return (getSlabBlockNumber(vdo, pbn, &sbn) == VDO_SUCCESS); } /**********************************************************************/ const struct partition * getPartition(const UserVDO *vdo, enum partition_id id, const char *errorMessage) { struct partition *partition; struct layout layout = vdo->states.layout; int result = vdo_get_partition(&layout, id, &partition); if (result != VDO_SUCCESS) { errx(1, "%s", errorMessage); } return partition; } vdo-8.3.1.1/utils/vdo/userVDO.h000066400000000000000000000150411476467262700161300ustar00rootroot00000000000000/* * Copyright Red Hat * * This program is free software; you can redistribute it and/or * modify it under the terms of the GNU General Public License * as published by the Free Software Foundation; either version 2 * of the License, or (at your option) any later version. * * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with this program; if not, write to the Free Software * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA * 02110-1301, USA. */ #ifndef USER_VDO_H #define USER_VDO_H #include "encodings.h" #include "types.h" #include "physicalLayer.h" /** * A representation of a VDO for use by user space tools. 
**/ typedef struct user_vdo { /* The physical storage below the VDO */ PhysicalLayer *layer; /* The geometry of the VDO */ struct volume_geometry geometry; /* The buffer for the super block */ char superBlockBuffer[VDO_BLOCK_SIZE]; /* The full state of all components */ struct vdo_component_states states; unsigned int slabSizeShift; slab_count_t slabCount; uint64_t slabOffsetMask; } UserVDO; /** * Construct a user space VDO object. * * @param layer The layer from which to read and write the VDO * @param vdoPtr A pointer to hold the VDO * * @return VDO_SUCCESS or an error **/ int __must_check makeUserVDO(PhysicalLayer *layer, UserVDO **vdoPtr); /** * Free a user space VDO object and NULL out the reference to it. * * @param vdoPtr A pointer to the VDO to free **/ void freeUserVDO(UserVDO **vdoPtr); /** * Load the volume geometry from a layer. * * @param layer The layer from which to read the geometry * @param geometry The structure to receive the decoded fields * * @return VDO_SUCCESS or an error **/ int __must_check loadVolumeGeometry(PhysicalLayer *layer, struct volume_geometry *geometry); /** * Read the super block from the location indicated by the geometry. * * @param vdo The VDO whose super block is to be read * * @return VDO_SUCCESS or an error **/ int __must_check loadSuperBlock(UserVDO *vdo); /** * Load a vdo from a specified super block location. * * @param [in] layer The physical layer the vdo sits on * @param [in] geometry A pointer to the geometry for the volume * @param [in] validateConfig Whether to validate the vdo against the layer * @param [out] vdoPtr A pointer to hold the decoded vdo * * @return VDO_SUCCESS or an error **/ int __must_check loadVDOWithGeometry(PhysicalLayer *layer, struct volume_geometry *geometry, bool validateConfig, UserVDO **vdoPtr); /** * Load a vdo volume. * * @param [in] layer The physical layer the vdo sits on * @param [in] validateConfig Whether to validate the vdo against the layer * @param [out] vdoPtr A pointer to hold the decoded vdo * * @return VDO_SUCCESS or an error **/ int __must_check loadVDO(PhysicalLayer *layer, bool validateConfig, UserVDO **vdoPtr); /** * Write a specific version of geometry block for a VDO. * * @param layer The layer on which to write * @param geometry The volume_geometry to be written * @param version The version of the geometry to write * * @return VDO_SUCCESS or an error. **/ int __must_check writeVolumeGeometryWithVersion(PhysicalLayer *layer, struct volume_geometry *geometry, u32 version); /** * Write a geometry block for a VDO. * * @param layer The layer on which to write * @param geometry The volume_geometry to be written * * @return VDO_SUCCESS or an error. **/ static inline int __must_check writeVolumeGeometry(PhysicalLayer *layer, struct volume_geometry *geometry) { return writeVolumeGeometryWithVersion(layer, geometry, VDO_DEFAULT_GEOMETRY_BLOCK_VERSION); } /** * Encode and write out the super block (assuming the components have already * been encoded). This method is broken out for unit testing. * * @param vdo The vdo whose super block is to be saved * * @return VDO_SUCCESS or an error **/ int __must_check saveSuperBlock(UserVDO *vdo); /** * Encode and save the super block and optionally the geometry block of a VDO. * * @param vdo The VDO to save * @param saveGeometry If true, write the geometry after writing * the super block **/ int __must_check saveVDO(UserVDO *vdo, bool saveGeometry); /** * Set the slab parameters which are derived from the vdo config and the * slab config. 
* * @param vdo The vdo **/ void setDerivedSlabParameters(UserVDO *vdo); /** * Get the slab number for a pbn. * * @param vdo The vdo * @param pbn The pbn in question * @param slab_ptr A pointer to hold the slab number * * @return VDO_SUCCESS or an error **/ int __must_check getSlabNumber(const UserVDO *vdo, physical_block_number_t pbn, slab_count_t *slabPtr); /** * Get the slab block number for a pbn. * * @param vdo The vdo * @param pbn The pbn in question * @param sbn_ptr A pointer to hold the slab block number * * @return VDO_SUCCESS or an error **/ int __must_check getSlabBlockNumber(const UserVDO *vdo, physical_block_number_t pbn, slab_block_number *sbnPtr); /** * Check whether a given PBN is a valid PBN for a data block. This * recapitulates vdo_is_physical_data_block(). * * @param vdo The vdo * @param pbn The PBN to check * * @return true if the PBN can be used for a data block **/ bool __must_check isValidDataBlock(const UserVDO *vdo, physical_block_number_t pbn); /** * Get a partition from the VDO or fail with an error. * * @param vdo The VDO * @param id The ID of the desired partition * @param errorMessage The error message if the partition does not exist **/ const struct partition * __must_check getPartition(const UserVDO *vdo, enum partition_id id, const char *errorMessage); #endif /* USER_VDO_H */ vdo-8.3.1.1/utils/vdo/vdoConfig.c000066400000000000000000000330401476467262700165110ustar00rootroot00000000000000/* * Copyright Red Hat * * This program is free software; you can redistribute it and/or * modify it under the terms of the GNU General Public License * as published by the Free Software Foundation; either version 2 * of the License, or (at your option) any later version. * * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with this program; if not, write to the Free Software * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA * 02110-1301, USA. */ #include #include "vdoConfig.h" #include "logger.h" #include "memory-alloc.h" #include "permassert.h" #include "time-utils.h" #include "constants.h" #include "encodings.h" #include "status-codes.h" #include "physicalLayer.h" #include "userVDO.h" #include "vdoVolumeUtils.h" enum { RECOVERY_JOURNAL_STARTING_SEQUENCE_NUMBER = 1, }; /**********************************************************************/ int initializeLayoutFromConfig(const struct vdo_config *config, physical_block_number_t startingOffset, struct layout *layout) { return vdo_initialize_layout(config->physical_blocks, startingOffset, DEFAULT_VDO_BLOCK_MAP_TREE_ROOT_COUNT, config->recovery_journal_size, VDO_SLAB_SUMMARY_BLOCKS, layout); } struct recovery_journal_state_7_0 __must_check configureRecoveryJournal(void) { return (struct recovery_journal_state_7_0) { .journal_start = RECOVERY_JOURNAL_STARTING_SEQUENCE_NUMBER, .logical_blocks_used = 0, .block_map_data_blocks = 0, }; } /** * Compute the approximate number of pages which the forest will allocate in * order to map the specified number of logical blocks. This method assumes * that the block map is entirely arboreal. 
* * @param logicalBlocks The number of blocks to map * @param rootCount The number of trees in the forest * * @return A (slight) over-estimate of the total number of possible forest * pages including the leaves **/ static block_count_t __must_check computeForestSize(block_count_t logicalBlocks, root_count_t rootCount) { struct boundary newSizes; block_count_t approximateNonLeaves = vdo_compute_new_forest_pages(rootCount, NULL, logicalBlocks, &newSizes); // Exclude the tree roots since those aren't allocated from slabs, // and also exclude the super-roots, which only exist in memory. approximateNonLeaves -= rootCount * (newSizes.levels[VDO_BLOCK_MAP_TREE_HEIGHT - 2] + newSizes.levels[VDO_BLOCK_MAP_TREE_HEIGHT - 1]); block_count_t approximateLeaves = vdo_compute_block_map_page_count(logicalBlocks - approximateNonLeaves); // This can be a slight over-estimate since the tree will never have to // address these blocks, so it might be a tiny bit smaller. return (approximateNonLeaves + approximateLeaves); } /** * Configure a new VDO. * * @param vdo The VDO to configure * * @return VDO_SUCCESS or an error **/ static int __must_check configureVDO(UserVDO *vdo) { struct vdo_config *config = &vdo->states.vdo.config; // The layout starts 1 block past the beginning of the data region, as the // data region contains the super block but the layout does not. physical_block_number_t startingOffset = vdo_get_data_region_start(vdo->geometry) + 1; int result = initializeLayoutFromConfig(config, startingOffset, &vdo->states.layout); if (result != VDO_SUCCESS) { return result; } vdo->states.recovery_journal = configureRecoveryJournal(); struct slab_config slabConfig; result = vdo_configure_slab(config->slab_size, config->slab_journal_blocks, &slabConfig); if (result != VDO_SUCCESS) { return result; } const struct partition *partition = getPartition(vdo, VDO_SLAB_DEPOT_PARTITION, "no allocator partition"); result = vdo_configure_slab_depot(partition, slabConfig, 0, &vdo->states.slab_depot); if (result != VDO_SUCCESS) { return result; } setDerivedSlabParameters(vdo); if (config->logical_blocks == 0) { block_count_t dataBlocks = slabConfig.data_blocks * vdo->slabCount; config->logical_blocks = dataBlocks - computeForestSize(dataBlocks, DEFAULT_VDO_BLOCK_MAP_TREE_ROOT_COUNT); } partition = getPartition(vdo, VDO_BLOCK_MAP_PARTITION, "no block map partition"); vdo->states.block_map = (struct block_map_state_2_0) { .flat_page_origin = VDO_BLOCK_MAP_FLAT_PAGE_ORIGIN, .flat_page_count = 0, .root_origin = partition->offset, .root_count = DEFAULT_VDO_BLOCK_MAP_TREE_ROOT_COUNT, }; vdo->states.vdo.state = VDO_NEW; return VDO_SUCCESS; } /**********************************************************************/ int formatVDO(const struct vdo_config *config, const struct index_config *indexConfig, PhysicalLayer *layer) { // Generate a uuid. uuid_t uuid; uuid_generate(uuid); return formatVDOWithNonce(config, indexConfig, layer, current_time_us(), &uuid); } /**********************************************************************/ int calculateMinimumVDOFromConfig(const struct vdo_config *config, const struct index_config *indexConfig, block_count_t *minVDOBlocks) { // The minimum VDO size is the minimal size of the fixed layout + // one slab size for the allocator. The minimum fixed layout size // calculated below comes from vdoLayout.c in makeVDOFixedLayout(). 
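  // In summary, the result computed below is:
  //   index blocks (if an index is configured) + geometry block + super block
  //   + block map root blocks + recovery journal blocks
  //   + VDO_SLAB_SUMMARY_BLOCKS + one slab.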
block_count_t indexSize = 0; if (indexConfig != NULL) { int result = computeIndexBlocks(indexConfig, &indexSize); if (result != VDO_SUCCESS) { return result; } } block_count_t blockMapBlocks = DEFAULT_VDO_BLOCK_MAP_TREE_ROOT_COUNT; block_count_t journalBlocks = config->recovery_journal_size; block_count_t slabBlocks = config->slab_size; // The +2 takes into account the super block and geometry block. block_count_t fixedLayoutSize = indexSize + 2 + blockMapBlocks + journalBlocks + VDO_SLAB_SUMMARY_BLOCKS; *minVDOBlocks = fixedLayoutSize + slabBlocks; return VDO_SUCCESS; } /** * Clear a partition by writing zeros to every block in that partition. * * @param vdo The VDO with the partition to be cleared * @param id The ID of the partition to clear * * @return VDO_SUCCESS or an error code **/ static int __must_check clearPartition(UserVDO *vdo, enum partition_id id) { struct partition *partition; int result = vdo_get_partition(&vdo->states.layout, id, &partition); if (result != VDO_SUCCESS) { return result; } block_count_t bufferBlocks = 1; for (block_count_t n = partition->count; (bufferBlocks < 4096) && ((n & 0x1) == 0); n >>= 1) { bufferBlocks <<= 1; } char *zeroBuffer; result = vdo->layer->allocateIOBuffer(vdo->layer, bufferBlocks * VDO_BLOCK_SIZE, "zero buffer", &zeroBuffer); if (result != VDO_SUCCESS) { return result; } for (physical_block_number_t pbn = partition->offset; (pbn < partition->offset + partition->count) && (result == VDO_SUCCESS); pbn += bufferBlocks) { result = vdo->layer->writer(vdo->layer, pbn, bufferBlocks, zeroBuffer); } vdo_free(zeroBuffer); return result; } /**********************************************************************/ int computeIndexBlocks(const struct index_config *index_config, block_count_t *index_blocks_ptr) { int result; u64 index_bytes; block_count_t index_blocks; struct uds_parameters uds_parameters = { .memory_size = index_config->mem, .sparse = index_config->sparse, }; result = uds_compute_index_size(&uds_parameters, &index_bytes); if (result != UDS_SUCCESS) return vdo_log_error_strerror(result, "error computing index size"); index_blocks = index_bytes / VDO_BLOCK_SIZE; if ((((u64) index_blocks) * VDO_BLOCK_SIZE) != index_bytes) return vdo_log_error_strerror(VDO_PARAMETER_MISMATCH, "index size must be a multiple of block size %d", VDO_BLOCK_SIZE); *index_blocks_ptr = index_blocks; return VDO_SUCCESS; } /**********************************************************************/ int initializeVolumeGeometry(nonce_t nonce, uuid_t *uuid, const struct index_config *index_config, struct volume_geometry *geometry) { int result; block_count_t index_size = 0; if (index_config != NULL) { result = computeIndexBlocks(index_config, &index_size); if (result != VDO_SUCCESS) return result; } *geometry = (struct volume_geometry) { /* This is for backwards compatibility. */ .unused = 0, .nonce = nonce, .bio_offset = 0, .regions = { [VDO_INDEX_REGION] = { .id = VDO_INDEX_REGION, .start_block = 1, }, [VDO_DATA_REGION] = { .id = VDO_DATA_REGION, .start_block = 1 + index_size, } } }; uuid_copy(geometry->uuid, *uuid); if (index_size > 0) memcpy(&geometry->index_config, index_config, sizeof(struct index_config)); return VDO_SUCCESS; } /** * Configure a VDO and its geometry and write it out. 
* * @param vdo The VDO to create * @param config The configuration parameters for the VDO * @param indexConfig The configuration parameters for the index * @param nonce The nonce for the VDO * @param uuid The uuid for the VDO **/ static int configureAndWriteVDO(UserVDO *vdo, const struct vdo_config *config, const struct index_config *indexConfig, nonce_t nonce, uuid_t *uuid) { int result = initializeVolumeGeometry(nonce, uuid, indexConfig, &vdo->geometry); if (result != VDO_SUCCESS) { return result; } char *block; result = vdo->layer->allocateIOBuffer(vdo->layer, VDO_BLOCK_SIZE, "geometry block", &block); if (result != VDO_SUCCESS) { return result; } result = vdo->layer->writer(vdo->layer, VDO_GEOMETRY_BLOCK_LOCATION, 1, block); vdo_free(block); if (result != VDO_SUCCESS) { return result; } vdo->states.vdo.config = *config; vdo->states.vdo.nonce = nonce; vdo->states.volume_version = VDO_VOLUME_VERSION_67_0; result = configureVDO(vdo); if (result != VDO_SUCCESS) { return result; } result = clearPartition(vdo, VDO_BLOCK_MAP_PARTITION); if (result != VDO_SUCCESS) { return vdo_log_error_strerror(result, "cannot clear block map partition"); } result = clearPartition(vdo, VDO_RECOVERY_JOURNAL_PARTITION); if (result != VDO_SUCCESS) { return vdo_log_error_strerror(result, "cannot clear recovery journal partition"); } return saveVDO(vdo, true); } /**********************************************************************/ int formatVDOWithNonce(const struct vdo_config *config, const struct index_config *indexConfig, PhysicalLayer *layer, nonce_t nonce, uuid_t *uuid) { int result = vdo_register_status_codes(); if (result != VDO_SUCCESS) { return result; } result = vdo_validate_config(config, layer->getBlockCount(layer), 0); if (result != VDO_SUCCESS) { return result; } UserVDO *vdo; result = makeUserVDO(layer, &vdo); if (result != VDO_SUCCESS) { return result; } result = configureAndWriteVDO(vdo, config, indexConfig, nonce, uuid); freeUserVDO(&vdo); return result; } /** * Change the state of an inactive VDO image. * * @param layer A physical layer * @param requireReadOnly Whether the existing VDO must be in read-only mode * @param newState The new state to store in the VDO **/ static int __must_check updateVDOSuperBlockState(PhysicalLayer *layer, bool requireReadOnly, enum vdo_state newState) { UserVDO *vdo; int result = loadVDO(layer, false, &vdo); if (result != VDO_SUCCESS) { return result; } if (requireReadOnly && (vdo->states.vdo.state != VDO_READ_ONLY_MODE)) { freeUserVDO(&vdo); return VDO_NOT_READ_ONLY; } vdo->states.vdo.state = newState; result = saveVDO(vdo, false); freeUserVDO(&vdo); return result; } /**********************************************************************/ int forceVDORebuild(PhysicalLayer *layer) { int result = updateVDOSuperBlockState(layer, true, VDO_FORCE_REBUILD); if (result == VDO_NOT_READ_ONLY) { return vdo_log_error_strerror(VDO_NOT_READ_ONLY, "Can't force rebuild on a normal VDO"); } return result; } /**********************************************************************/ int setVDOReadOnlyMode(PhysicalLayer *layer) { return updateVDOSuperBlockState(layer, false, VDO_READ_ONLY_MODE); } vdo-8.3.1.1/utils/vdo/vdoConfig.h000066400000000000000000000120101476467262700165100ustar00rootroot00000000000000/* * Copyright Red Hat * * This program is free software; you can redistribute it and/or * modify it under the terms of the GNU General Public License * as published by the Free Software Foundation; either version 2 * of the License, or (at your option) any later version. 
* * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with this program; if not, write to the Free Software * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA * 02110-1301, USA. */ #ifndef VDO_CONFIG_H #define VDO_CONFIG_H #include "errors.h" #include "indexer.h" #include "encodings.h" #include "types.h" // The vdo_config structure is fully declared in types.h /** * Initialize the recovery journal state for a new VDO. * * @return An initialized recovery journal state **/ struct recovery_journal_state_7_0 __must_check configureRecoveryJournal(void); /** * Format a physical layer to function as a new VDO. This function must be * called on a physical layer before a VDO can be loaded for the first time on a * given layer. Once a layer has been formatted, it can be loaded and shut down * repeatedly. If a new VDO is desired, this function should be called again. * * @param config The configuration parameters for the VDO * @param indexConfig The configuration parameters for the index * @param layer The physical layer the VDO will sit on * * @return VDO_SUCCESS or an error **/ int __must_check formatVDO(const struct vdo_config *config, const struct index_config *indexConfig, PhysicalLayer *layer); /** * Calculate minimal VDO based on config parameters. * * @param config The configuration parameters for the VDO * @param indexConfig The configuration parameters for the index * @param minVDOBlocks A pointer to hold the minimum blocks needed * * @return VDO_SUCCESS or error. **/ int calculateMinimumVDOFromConfig(const struct vdo_config *config, const struct index_config *indexConfig, block_count_t *minVDOBlocks) __attribute__((warn_unused_result)); /** * Initialize a layout according to a vdo_config. Exposed for testing only. * * @param [in] config The vdo_config to generate a vdo_layout from * @param [in] startingOffset The start of the layouts * @param [out] layout The layout to initialize * * @return VDO_SUCCESS or an error **/ int __must_check initializeLayoutFromConfig(const struct vdo_config *config, physical_block_number_t startingOffset, struct layout *layout); /** * Compute the index size in blocks from the index_config. * * @param index_config The index config * @param index_blocks_ptr A pointer to return the index size in blocks * * @return VDO_SUCCESS or an error. **/ int __must_check computeIndexBlocks(const struct index_config *index_config, block_count_t *index_blocks_ptr); /** * Initialize a volume_geometry for a VDO. * * @param nonce The nonce for the VDO * @param uuid The uuid for the VDO * @param index_config The index config of the VDO * @param geometry The geometry being initialized * * @return VDO_SUCCESS or an error. **/ int __must_check initializeVolumeGeometry(nonce_t nonce, uuid_t *uuid, const struct index_config *index_config, struct volume_geometry *geometry); /** * This is a version of formatVDO() which allows the caller to supply the * desired VDO nonce and uuid. This function exists to facilitate unit tests * which attempt to ensure that version numbers are properly updated when * formats change. 
* * @param config The configuration parameters for the VDO * @param indexConfig The configuration parameters for the index * @param indexBlocks Size of the index in blocks * @param layer The physical layer the VDO will sit on * @param nonce The nonce for the VDO * @param uuid The uuid for the VDO * * @return VDO_SUCCESS or an error **/ int __must_check formatVDOWithNonce(const struct vdo_config *config, const struct index_config *indexConfig, PhysicalLayer *layer, nonce_t nonce, uuid_t *uuid); /** * Force the VDO to exit read-only mode and rebuild when it next loads * by setting the super block state. * * @param layer The physical layer on which the VDO resides **/ int __must_check forceVDORebuild(PhysicalLayer *layer); /** * Force the VDO to enter read-only mode when off-line. This is only * used by a test utility. * * @param layer The physical layer on which the VDO resides **/ int __must_check setVDOReadOnlyMode(PhysicalLayer *layer); #endif /* VDO_CONFIG_H */ vdo-8.3.1.1/utils/vdo/vdoStats.h000066400000000000000000000023431476467262700164110ustar00rootroot00000000000000/* * Copyright Red Hat * * This program is free software; you can redistribute it and/or * modify it under the terms of the GNU General Public License * as published by the Free Software Foundation; either version 2 * of the License, or (at your option) any later version. * * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with this program; if not, write to the Free Software * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA * 02110-1301, USA. 
*/ #ifndef VDO_STATS_H #define VDO_STATS_H #include "types.h" /** * Read vdo statistics from a buffer * * @param buf pointer to the buffer * @param stats pointer to the statistics * * @return VDO_SUCCESS or an error */ int read_vdo_stats(char *buf, struct vdo_statistics *stats); /** * Write vdo statistics to stdout * * @param stats pointer to the statistics * * @return VDO_SUCCESS or an error */ int vdo_write_stats(struct vdo_statistics *stats); #endif /* VDO_STATS_H */ vdo-8.3.1.1/utils/vdo/vdoStatsWriter.c000066400000000000000000001073241476467262700176060ustar00rootroot00000000000000// SPDX-License-Identifier: GPL-2.0-only /* * Copyright 2023 Red Hat * * If you add new statistics, be sure to update the following files: * * ../base/statistics.h * ../base/message-stats.c * ../base/pool-sysfs-stats.c * ./messageStatsReader.c * ../../../perl/Permabit/Statistics/Definitions.pm */ #include #include #include #include "numeric.h" #include "string-utils.h" #include "math.h" #include "statistics.h" #include "status-codes.h" #include "types.h" #include "vdoStats.h" #define MAX_STATS 239 #define MAX_STAT_LENGTH 80 int fieldCount = 0; int maxLabelLength = 0; char labels[MAX_STATS][MAX_STAT_LENGTH]; char values[MAX_STATS][MAX_STAT_LENGTH]; static int write_u8(char *label, u8 value) { int count = sprintf(labels[fieldCount], "%s", label); if (count < 0) { return VDO_UNEXPECTED_EOF; } maxLabelLength = max(maxLabelLength, (int) strlen(label)); count = sprintf(values[fieldCount++], "%hhu", value); if (count < 0) { return VDO_UNEXPECTED_EOF; } return VDO_SUCCESS; } static int write_u64(char *label, u64 value) { int count = sprintf(labels[fieldCount], "%s", label); if (count < 0) { return VDO_UNEXPECTED_EOF; } maxLabelLength = max(maxLabelLength, (int) strlen(label)); count = sprintf(values[fieldCount++], "%lu", value); if (count < 0) { return VDO_UNEXPECTED_EOF; } return VDO_SUCCESS; } static int write_string(char *label, char *value) { int count = sprintf(labels[fieldCount], "%s", label); if (count < 0) { return VDO_UNEXPECTED_EOF; } maxLabelLength = max(maxLabelLength, (int) strlen(label)); count = sprintf(values[fieldCount++], "%s", value); if (count < 0) { return VDO_UNEXPECTED_EOF; } return VDO_SUCCESS; } static int write_block_count_t(char *label, block_count_t value) { int count = sprintf(labels[fieldCount], "%s", label); if (count < 0) { return VDO_UNEXPECTED_EOF; } maxLabelLength = max(maxLabelLength, (int) strlen(label)); count = sprintf(values[fieldCount++], "%lu", value); if (count < 0) { return VDO_UNEXPECTED_EOF; } return VDO_SUCCESS; } static int write_u32(char *label, u32 value) { int count = sprintf(labels[fieldCount], "%s", label); if (count < 0) { return VDO_UNEXPECTED_EOF; } maxLabelLength = max(maxLabelLength, (int) strlen(label)); count = sprintf(values[fieldCount++], "%u", value); if (count < 0) { return VDO_UNEXPECTED_EOF; } return VDO_SUCCESS; } static int write_double(char *label, double value) { int count = sprintf(labels[fieldCount], "%s", label); if (count < 0) { return VDO_UNEXPECTED_EOF; } maxLabelLength = max(maxLabelLength, (int) strlen(label)); count = sprintf(values[fieldCount++], "%.2f", value); if (count < 0) { return VDO_UNEXPECTED_EOF; } return VDO_SUCCESS; } static int write_block_allocator_statistics(char *prefix, struct block_allocator_statistics *stats) { int result = 0; char *joined = NULL; /** The total number of slabs from which blocks may be allocated */ if (asprintf(&joined, "%s slab count", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = 
write_u64(joined, stats->slab_count); free(joined); if (result != VDO_SUCCESS) { return result; } /** The total number of slabs from which blocks have ever been allocated */ if (asprintf(&joined, "%s slabs opened", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_u64(joined, stats->slabs_opened); free(joined); if (result != VDO_SUCCESS) { return result; } /** The number of times since loading that a slab has been re-opened */ if (asprintf(&joined, "%s slabs reopened", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_u64(joined, stats->slabs_reopened); free(joined); if (result != VDO_SUCCESS) { return result; } return VDO_SUCCESS; } static int write_commit_statistics(char *prefix, struct commit_statistics *stats) { int result = 0; char *joined = NULL; u64 batching = stats->started - stats->written; u64 writing = stats->written - stats->committed; if (asprintf(&joined, "%s batching", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_u64(joined, batching); free(joined); if (result != VDO_SUCCESS) { return result; } /** The total number of items on which processing has started */ if (asprintf(&joined, "%s started", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_u64(joined, stats->started); free(joined); if (result != VDO_SUCCESS) { return result; } if (asprintf(&joined, "%s writing", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_u64(joined, writing); free(joined); if (result != VDO_SUCCESS) { return result; } /** The total number of items for which a write operation has been issued */ if (asprintf(&joined, "%s written", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_u64(joined, stats->written); free(joined); if (result != VDO_SUCCESS) { return result; } /** The total number of items for which a write operation has completed */ if (asprintf(&joined, "%s committed", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_u64(joined, stats->committed); free(joined); if (result != VDO_SUCCESS) { return result; } return VDO_SUCCESS; } static int write_recovery_journal_statistics(char *prefix, struct recovery_journal_statistics *stats) { int result = 0; char *joined = NULL; /** Number of times the on-disk journal was full */ if (asprintf(&joined, "%s disk full count", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_u64(joined, stats->disk_full); free(joined); if (result != VDO_SUCCESS) { return result; } /** Number of times the recovery journal requested slab journal commits. 
*/ if (asprintf(&joined, "%s commits requested count", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_u64(joined, stats->slab_journal_commits_requested); free(joined); if (result != VDO_SUCCESS) { return result; } /** Write/Commit totals for individual journal entries */ if (asprintf(&joined, "%s entries", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_commit_statistics(joined, &stats->entries); free(joined); if (result != VDO_SUCCESS) { return result; } /** Write/Commit totals for journal blocks */ if (asprintf(&joined, "%s blocks", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_commit_statistics(joined, &stats->blocks); free(joined); if (result != VDO_SUCCESS) { return result; } return VDO_SUCCESS; } static int write_packer_statistics(char *prefix, struct packer_statistics *stats) { int result = 0; char *joined = NULL; /** Number of compressed data items written since startup */ if (asprintf(&joined, "%s compressed fragments written", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_u64(joined, stats->compressed_fragments_written); free(joined); if (result != VDO_SUCCESS) { return result; } /** Number of blocks containing compressed items written since startup */ if (asprintf(&joined, "%s compressed blocks written", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_u64(joined, stats->compressed_blocks_written); free(joined); if (result != VDO_SUCCESS) { return result; } /** Number of VIOs that are pending in the packer */ if (asprintf(&joined, "%s compressed fragments in packer", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_u64(joined, stats->compressed_fragments_in_packer); free(joined); if (result != VDO_SUCCESS) { return result; } return VDO_SUCCESS; } static int write_slab_journal_statistics(char *prefix, struct slab_journal_statistics *stats) { int result = 0; char *joined = NULL; /** Number of times the on-disk journal was full */ if (asprintf(&joined, "%s disk full count", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_u64(joined, stats->disk_full_count); free(joined); if (result != VDO_SUCCESS) { return result; } /** Number of times an entry was added over the flush threshold */ if (asprintf(&joined, "%s flush count", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_u64(joined, stats->flush_count); free(joined); if (result != VDO_SUCCESS) { return result; } /** Number of times an entry was added over the block threshold */ if (asprintf(&joined, "%s blocked count", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_u64(joined, stats->blocked_count); free(joined); if (result != VDO_SUCCESS) { return result; } /** Number of times a tail block was written */ if (asprintf(&joined, "%s blocks written", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_u64(joined, stats->blocks_written); free(joined); if (result != VDO_SUCCESS) { return result; } /** Number of times we had to wait for the tail to write */ if (asprintf(&joined, "%s tail busy count", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_u64(joined, stats->tail_busy_count); free(joined); if (result != VDO_SUCCESS) { return result; } return VDO_SUCCESS; } static int write_slab_summary_statistics(char *prefix, struct slab_summary_statistics *stats) { int result = 0; char *joined = NULL; /** Number of blocks written */ if (asprintf(&joined, "%s blocks written", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_u64(joined, stats->blocks_written); free(joined); if (result != 
VDO_SUCCESS) { return result; } return VDO_SUCCESS; } static int write_ref_counts_statistics(char *prefix, struct ref_counts_statistics *stats) { int result = 0; char *joined = NULL; /** Number of reference blocks written */ if (asprintf(&joined, "%s blocks written", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_u64(joined, stats->blocks_written); free(joined); if (result != VDO_SUCCESS) { return result; } return VDO_SUCCESS; } static int write_block_map_statistics(char *prefix, struct block_map_statistics *stats) { int result = 0; char *joined = NULL; /** number of dirty (resident) pages */ if (asprintf(&joined, "%s dirty pages", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_u32(joined, stats->dirty_pages); free(joined); if (result != VDO_SUCCESS) { return result; } /** number of clean (resident) pages */ if (asprintf(&joined, "%s clean pages", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_u32(joined, stats->clean_pages); free(joined); if (result != VDO_SUCCESS) { return result; } /** number of free pages */ if (asprintf(&joined, "%s free pages", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_u32(joined, stats->free_pages); free(joined); if (result != VDO_SUCCESS) { return result; } /** number of pages in failed state */ if (asprintf(&joined, "%s failed pages", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_u32(joined, stats->failed_pages); free(joined); if (result != VDO_SUCCESS) { return result; } /** number of pages incoming */ if (asprintf(&joined, "%s incoming pages", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_u32(joined, stats->incoming_pages); free(joined); if (result != VDO_SUCCESS) { return result; } /** number of pages outgoing */ if (asprintf(&joined, "%s outgoing pages", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_u32(joined, stats->outgoing_pages); free(joined); if (result != VDO_SUCCESS) { return result; } /** how many times free page not avail */ if (asprintf(&joined, "%s cache pressure", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_u32(joined, stats->cache_pressure); free(joined); if (result != VDO_SUCCESS) { return result; } /** number of get_vdo_page() calls for read */ if (asprintf(&joined, "%s read count", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_u64(joined, stats->read_count); free(joined); if (result != VDO_SUCCESS) { return result; } /** number of get_vdo_page() calls for write */ if (asprintf(&joined, "%s write count", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_u64(joined, stats->write_count); free(joined); if (result != VDO_SUCCESS) { return result; } /** number of times pages failed to read */ if (asprintf(&joined, "%s failed reads", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_u64(joined, stats->failed_reads); free(joined); if (result != VDO_SUCCESS) { return result; } /** number of times pages failed to write */ if (asprintf(&joined, "%s failed writes", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_u64(joined, stats->failed_writes); free(joined); if (result != VDO_SUCCESS) { return result; } /** number of gets that are reclaimed */ if (asprintf(&joined, "%s reclaimed", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_u64(joined, stats->reclaimed); free(joined); if (result != VDO_SUCCESS) { return result; } /** number of gets for outgoing pages */ if (asprintf(&joined, "%s read outgoing", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_u64(joined, 
stats->read_outgoing); free(joined); if (result != VDO_SUCCESS) { return result; } /** number of gets that were already there */ if (asprintf(&joined, "%s found in cache", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_u64(joined, stats->found_in_cache); free(joined); if (result != VDO_SUCCESS) { return result; } /** number of gets requiring discard */ if (asprintf(&joined, "%s discard required", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_u64(joined, stats->discard_required); free(joined); if (result != VDO_SUCCESS) { return result; } /** number of gets enqueued for their page */ if (asprintf(&joined, "%s wait for page", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_u64(joined, stats->wait_for_page); free(joined); if (result != VDO_SUCCESS) { return result; } /** number of gets that have to fetch */ if (asprintf(&joined, "%s fetch required", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_u64(joined, stats->fetch_required); free(joined); if (result != VDO_SUCCESS) { return result; } /** number of page fetches */ if (asprintf(&joined, "%s pages loaded", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_u64(joined, stats->pages_loaded); free(joined); if (result != VDO_SUCCESS) { return result; } /** number of page saves */ if (asprintf(&joined, "%s pages saved", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_u64(joined, stats->pages_saved); free(joined); if (result != VDO_SUCCESS) { return result; } /** the number of flushes issued */ if (asprintf(&joined, "%s flush count", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_u64(joined, stats->flush_count); free(joined); if (result != VDO_SUCCESS) { return result; } return VDO_SUCCESS; } static int write_hash_lock_statistics(char *prefix, struct hash_lock_statistics *stats) { int result = 0; char *joined = NULL; /** Number of times the UDS advice proved correct */ if (asprintf(&joined, "%s dedupe advice valid", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_u64(joined, stats->dedupe_advice_valid); free(joined); if (result != VDO_SUCCESS) { return result; } /** Number of times the UDS advice proved incorrect */ if (asprintf(&joined, "%s dedupe advice stale", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_u64(joined, stats->dedupe_advice_stale); free(joined); if (result != VDO_SUCCESS) { return result; } /** Number of writes with the same data as another in-flight write */ if (asprintf(&joined, "%s concurrent data matches", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_u64(joined, stats->concurrent_data_matches); free(joined); if (result != VDO_SUCCESS) { return result; } /** Number of writes whose hash collided with an in-flight write */ if (asprintf(&joined, "%s concurrent hash collisions", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_u64(joined, stats->concurrent_hash_collisions); free(joined); if (result != VDO_SUCCESS) { return result; } /** Current number of dedupe queries that are in flight */ if (asprintf(&joined, "%s current dedupe queries", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_u32(joined, stats->curr_dedupe_queries); free(joined); if (result != VDO_SUCCESS) { return result; } return VDO_SUCCESS; } static int write_error_statistics(char *prefix, struct error_statistics *stats) { int result = 0; char *joined = NULL; /** number of times VDO got an invalid dedupe advice PBN from UDS */ if (asprintf(&joined, "%s invalid advice PBN count", prefix) == -1) { return 
VDO_UNEXPECTED_EOF; } result = write_u64(joined, stats->invalid_advice_pbn_count); free(joined); if (result != VDO_SUCCESS) { return result; } /** number of times a VIO completed with a VDO_NO_SPACE error */ if (asprintf(&joined, "%s no space error count", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_u64(joined, stats->no_space_error_count); free(joined); if (result != VDO_SUCCESS) { return result; } /** number of times a VIO completed with a VDO_READ_ONLY error */ if (asprintf(&joined, "%s read only error count", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_u64(joined, stats->read_only_error_count); free(joined); if (result != VDO_SUCCESS) { return result; } return VDO_SUCCESS; } static int write_bio_stats(char *prefix, struct bio_stats *stats) { int result = 0; char *joined = NULL; /** Number of REQ_OP_READ bios */ if (asprintf(&joined, "%s read", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_u64(joined, stats->read); free(joined); if (result != VDO_SUCCESS) { return result; } /** Number of REQ_OP_WRITE bios with data */ if (asprintf(&joined, "%s write", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_u64(joined, stats->write); free(joined); if (result != VDO_SUCCESS) { return result; } /** Number of bios tagged with REQ_PREFLUSH and containing no data */ if (asprintf(&joined, "%s empty flush", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_u64(joined, stats->empty_flush); free(joined); if (result != VDO_SUCCESS) { return result; } /** Number of REQ_OP_DISCARD bios */ if (asprintf(&joined, "%s discard", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_u64(joined, stats->discard); free(joined); if (result != VDO_SUCCESS) { return result; } /** Number of bios tagged with REQ_PREFLUSH */ if (asprintf(&joined, "%s flush", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_u64(joined, stats->flush); free(joined); if (result != VDO_SUCCESS) { return result; } /** Number of bios tagged with REQ_FUA */ if (asprintf(&joined, "%s fua", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_u64(joined, stats->fua); free(joined); if (result != VDO_SUCCESS) { return result; } return VDO_SUCCESS; } static int write_memory_usage(char *prefix, struct memory_usage *stats) { int result = 0; char *joined = NULL; /** Tracked bytes currently allocated. */ if (asprintf(&joined, "%s bytes used", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_u64(joined, stats->bytes_used); free(joined); if (result != VDO_SUCCESS) { return result; } /** Maximum tracked bytes allocated. 
*/ if (asprintf(&joined, "%s peak bytes used", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_u64(joined, stats->peak_bytes_used); free(joined); if (result != VDO_SUCCESS) { return result; } return VDO_SUCCESS; } static int write_index_statistics(char *prefix, struct index_statistics *stats) { int result = 0; char *joined = NULL; /** Number of records stored in the index */ if (asprintf(&joined, "%s entries indexed", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_u64(joined, stats->entries_indexed); free(joined); if (result != VDO_SUCCESS) { return result; } /** Number of post calls that found an existing entry */ if (asprintf(&joined, "%s posts found", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_u64(joined, stats->posts_found); free(joined); if (result != VDO_SUCCESS) { return result; } /** Number of post calls that added a new entry */ if (asprintf(&joined, "%s posts not found", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_u64(joined, stats->posts_not_found); free(joined); if (result != VDO_SUCCESS) { return result; } /** Number of query calls that found an existing entry */ if (asprintf(&joined, "%s queries found", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_u64(joined, stats->queries_found); free(joined); if (result != VDO_SUCCESS) { return result; } /** Number of query calls that added a new entry */ if (asprintf(&joined, "%s queries not found", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_u64(joined, stats->queries_not_found); free(joined); if (result != VDO_SUCCESS) { return result; } /** Number of update calls that found an existing entry */ if (asprintf(&joined, "%s updates found", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_u64(joined, stats->updates_found); free(joined); if (result != VDO_SUCCESS) { return result; } /** Number of update calls that added a new entry */ if (asprintf(&joined, "%s updates not found", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_u64(joined, stats->updates_not_found); free(joined); if (result != VDO_SUCCESS) { return result; } /** Number of entries discarded */ if (asprintf(&joined, "%s entries discarded", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_u64(joined, stats->entries_discarded); free(joined); if (result != VDO_SUCCESS) { return result; } return VDO_SUCCESS; } static int write_vdo_statistics(char *prefix, struct vdo_statistics *stats) { int result = 0; char *joined = NULL; u64 one_k_blocks = stats->physical_blocks * stats->block_size / 1024; u64 one_k_blocks_used = (stats->data_blocks_used + stats->overhead_blocks_used) * stats->block_size / 1024; u64 one_k_blocks_available = (stats->physical_blocks - stats->data_blocks_used - stats->overhead_blocks_used) * stats->block_size / 1024; u8 used_percent = (int) (100 * ((double) (stats->data_blocks_used + stats->overhead_blocks_used) / stats->physical_blocks) + 0.5); s32 savings = (stats->logical_blocks_used > 0) ? (int) (100 * (s64) (stats->logical_blocks_used - stats->data_blocks_used) / (u64) stats->logical_blocks_used) : 0; u8 saving_percent = savings; char five_twelve_byte_emulation[4] = ""; sprintf(five_twelve_byte_emulation, "%s", (stats->logical_block_size == 512) ? "on" : "off"); double write_amplification_ratio = (stats->bios_in.write > 0) ? 
roundf((double) (stats->bios_meta.write + stats->bios_out.write) / stats->bios_in.write) : 0.00; if (asprintf(&joined, "%s version", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_u32(joined, stats->version); free(joined); if (result != VDO_SUCCESS) { return result; } /** Number of blocks used for data */ if (asprintf(&joined, "%s data blocks used", prefix) == -1) { return VDO_UNEXPECTED_EOF; } if ((!stats->in_recovery_mode) && (strcmp("read-only", stats->mode))) { result = write_u64(joined, stats->data_blocks_used); } else { result = write_string(joined, "N/A"); } free(joined); if (result != VDO_SUCCESS) { return result; } /** Number of blocks used for VDO metadata */ if (asprintf(&joined, "%s overhead blocks used", prefix) == -1) { return VDO_UNEXPECTED_EOF; } if (!stats->in_recovery_mode) { result = write_u64(joined, stats->overhead_blocks_used); } else { result = write_string(joined, "N/A"); } free(joined); if (result != VDO_SUCCESS) { return result; } /** Number of logical blocks that are currently mapped to physical blocks */ if (asprintf(&joined, "%s logical blocks used", prefix) == -1) { return VDO_UNEXPECTED_EOF; } if (!stats->in_recovery_mode) { result = write_u64(joined, stats->logical_blocks_used); } else { result = write_string(joined, "N/A"); } free(joined); if (result != VDO_SUCCESS) { return result; } /** number of physical blocks */ if (asprintf(&joined, "%s physical blocks", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_block_count_t(joined, stats->physical_blocks); free(joined); if (result != VDO_SUCCESS) { return result; } /** number of logical blocks */ if (asprintf(&joined, "%s logical blocks", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_block_count_t(joined, stats->logical_blocks); free(joined); if (result != VDO_SUCCESS) { return result; } if (asprintf(&joined, "%s 1K-blocks", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_u64(joined, one_k_blocks); free(joined); if (result != VDO_SUCCESS) { return result; } if (asprintf(&joined, "%s 1K-blocks used", prefix) == -1) { return VDO_UNEXPECTED_EOF; } if ((!stats->in_recovery_mode) && (strcmp("read-only", stats->mode))) { result = write_u64(joined, one_k_blocks_used); } else { result = write_string(joined, "N/A"); } free(joined); if (result != VDO_SUCCESS) { return result; } if (asprintf(&joined, "%s 1K-blocks available", prefix) == -1) { return VDO_UNEXPECTED_EOF; } if ((!stats->in_recovery_mode) && (strcmp("read-only", stats->mode))) { result = write_u64(joined, one_k_blocks_available); } else { result = write_string(joined, "N/A"); } free(joined); if (result != VDO_SUCCESS) { return result; } if (asprintf(&joined, "%s used percent", prefix) == -1) { return VDO_UNEXPECTED_EOF; } if ((!stats->in_recovery_mode) && (strcmp("read-only", stats->mode))) { result = write_u8(joined, used_percent); } else { result = write_string(joined, "N/A"); } free(joined); if (result != VDO_SUCCESS) { return result; } if (asprintf(&joined, "%s saving percent", prefix) == -1) { return VDO_UNEXPECTED_EOF; } if ((!stats->in_recovery_mode) && (strcmp("read-only", stats->mode)) && (savings >= 0)) { result = write_u8(joined, saving_percent); } else { result = write_string(joined, "N/A"); } free(joined); if (result != VDO_SUCCESS) { return result; } /** Size of the block map page cache, in bytes */ if (asprintf(&joined, "%s block map cache size", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_u64(joined, stats->block_map_cache_size); free(joined); if (result != VDO_SUCCESS) { 
return result; } /** The physical block size */ if (asprintf(&joined, "%s block size", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_u64(joined, stats->block_size); free(joined); if (result != VDO_SUCCESS) { return result; } /** Number of times the VDO has successfully recovered */ if (asprintf(&joined, "%s completed recovery count", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_u64(joined, stats->complete_recoveries); free(joined); if (result != VDO_SUCCESS) { return result; } /** Number of times the VDO has recovered from read-only mode */ if (asprintf(&joined, "%s read-only recovery count", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_u64(joined, stats->read_only_recoveries); free(joined); if (result != VDO_SUCCESS) { return result; } /** String describing the operating mode of the VDO */ if (asprintf(&joined, "%s operating mode", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_string(joined, stats->mode); free(joined); if (result != VDO_SUCCESS) { return result; } /** What percentage of recovery mode work has been completed */ if (asprintf(&joined, "%s recovery progress (%%)", prefix) == -1) { return VDO_UNEXPECTED_EOF; } if (stats->in_recovery_mode) { result = write_u8(joined, stats->recovery_percentage); } else { result = write_string(joined, "N/A"); } free(joined); if (result != VDO_SUCCESS) { return result; } /** The statistics for the compressed block packer */ if (asprintf(&joined, "%s", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_packer_statistics(joined, &stats->packer); free(joined); if (result != VDO_SUCCESS) { return result; } /** Counters for events in the block allocator */ if (asprintf(&joined, "%s", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_block_allocator_statistics(joined, &stats->allocator); free(joined); if (result != VDO_SUCCESS) { return result; } /** Counters for events in the recovery journal */ if (asprintf(&joined, "%s journal", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_recovery_journal_statistics(joined, &stats->journal); free(joined); if (result != VDO_SUCCESS) { return result; } /** The statistics for the slab journals */ if (asprintf(&joined, "%s slab journal", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_slab_journal_statistics(joined, &stats->slab_journal); free(joined); if (result != VDO_SUCCESS) { return result; } /** The statistics for the slab summary */ if (asprintf(&joined, "%s slab summary", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_slab_summary_statistics(joined, &stats->slab_summary); free(joined); if (result != VDO_SUCCESS) { return result; } /** The statistics for the reference counts */ if (asprintf(&joined, "%s reference", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_ref_counts_statistics(joined, &stats->ref_counts); free(joined); if (result != VDO_SUCCESS) { return result; } /** The statistics for the block map */ if (asprintf(&joined, "%s block map", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_block_map_statistics(joined, &stats->block_map); free(joined); if (result != VDO_SUCCESS) { return result; } /** The dedupe statistics from hash locks */ if (asprintf(&joined, "%s", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_hash_lock_statistics(joined, &stats->hash_lock); free(joined); if (result != VDO_SUCCESS) { return result; } /** Counts of error conditions */ if (asprintf(&joined, "%s", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = 
write_error_statistics(joined, &stats->errors); free(joined); if (result != VDO_SUCCESS) { return result; } /** The VDO instance */ if (asprintf(&joined, "%s instance", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_u32(joined, stats->instance); free(joined); if (result != VDO_SUCCESS) { return result; } if (asprintf(&joined, "%s 512 byte emulation", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_string(joined, five_twelve_byte_emulation); free(joined); if (result != VDO_SUCCESS) { return result; } /** Current number of active VIOs */ if (asprintf(&joined, "%s current VDO IO requests in progress", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_u32(joined, stats->current_vios_in_progress); free(joined); if (result != VDO_SUCCESS) { return result; } /** Maximum number of active VIOs */ if (asprintf(&joined, "%s maximum VDO IO requests in progress", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_u32(joined, stats->max_vios); free(joined); if (result != VDO_SUCCESS) { return result; } /** Number of times the UDS index was too slow in responding */ if (asprintf(&joined, "%s dedupe advice timeouts", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_u64(joined, stats->dedupe_advice_timeouts); free(joined); if (result != VDO_SUCCESS) { return result; } /** Number of flush requests submitted to the storage device */ if (asprintf(&joined, "%s flush out", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_u64(joined, stats->flush_out); free(joined); if (result != VDO_SUCCESS) { return result; } if (asprintf(&joined, "%s write amplification ratio", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_double(joined, write_amplification_ratio); free(joined); if (result != VDO_SUCCESS) { return result; } /** Bios submitted into VDO from above */ if (asprintf(&joined, "%s bios in", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_bio_stats(joined, &stats->bios_in); free(joined); if (result != VDO_SUCCESS) { return result; } if (asprintf(&joined, "%s bios in partial", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_bio_stats(joined, &stats->bios_in_partial); free(joined); if (result != VDO_SUCCESS) { return result; } /** Bios submitted onward for user data */ if (asprintf(&joined, "%s bios out", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_bio_stats(joined, &stats->bios_out); free(joined); if (result != VDO_SUCCESS) { return result; } /** Bios submitted onward for metadata */ if (asprintf(&joined, "%s bios meta", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_bio_stats(joined, &stats->bios_meta); free(joined); if (result != VDO_SUCCESS) { return result; } if (asprintf(&joined, "%s bios journal", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_bio_stats(joined, &stats->bios_journal); free(joined); if (result != VDO_SUCCESS) { return result; } if (asprintf(&joined, "%s bios page cache", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_bio_stats(joined, &stats->bios_page_cache); free(joined); if (result != VDO_SUCCESS) { return result; } if (asprintf(&joined, "%s bios out completed", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_bio_stats(joined, &stats->bios_out_completed); free(joined); if (result != VDO_SUCCESS) { return result; } if (asprintf(&joined, "%s bios meta completed", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_bio_stats(joined, &stats->bios_meta_completed); free(joined); if (result != VDO_SUCCESS) { 
return result; } if (asprintf(&joined, "%s bios journal completed", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_bio_stats(joined, &stats->bios_journal_completed); free(joined); if (result != VDO_SUCCESS) { return result; } if (asprintf(&joined, "%s bios page cache completed", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_bio_stats(joined, &stats->bios_page_cache_completed); free(joined); if (result != VDO_SUCCESS) { return result; } if (asprintf(&joined, "%s bios acknowledged", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_bio_stats(joined, &stats->bios_acknowledged); free(joined); if (result != VDO_SUCCESS) { return result; } if (asprintf(&joined, "%s bios acknowledged partial", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_bio_stats(joined, &stats->bios_acknowledged_partial); free(joined); if (result != VDO_SUCCESS) { return result; } /** Current number of bios in progress */ if (asprintf(&joined, "%s bios in progress", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_bio_stats(joined, &stats->bios_in_progress); free(joined); if (result != VDO_SUCCESS) { return result; } /** Memory usage stats. */ if (asprintf(&joined, "%s KVDO module", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_memory_usage(joined, &stats->memory_usage); free(joined); if (result != VDO_SUCCESS) { return result; } /** The statistics for the UDS index */ if (asprintf(&joined, "%s", prefix) == -1) { return VDO_UNEXPECTED_EOF; } result = write_index_statistics(joined, &stats->index); free(joined); if (result != VDO_SUCCESS) { return result; } return VDO_SUCCESS; } int vdo_write_stats(struct vdo_statistics *stats) { fieldCount = 0; maxLabelLength = 0; memset(labels, '\0', MAX_STATS * MAX_STAT_LENGTH); memset(values, '\0', MAX_STATS * MAX_STAT_LENGTH); int result = write_vdo_statistics(" ", stats); if (result != VDO_SUCCESS) { return result; } for (int i = 0; i < fieldCount; i++) { printf("%s%*s : %s\n", labels[i], maxLabelLength - (int) strlen(labels[i]), "", values[i]); } return VDO_SUCCESS; } vdo-8.3.1.1/utils/vdo/vdoVolumeUtils.c000066400000000000000000000057331476467262700176040ustar00rootroot00000000000000/* * Copyright Red Hat * * This program is free software; you can redistribute it and/or * modify it under the terms of the GNU General Public License * as published by the Free Software Foundation; either version 2 * of the License, or (at your option) any later version. * * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with this program; if not, write to the Free Software * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA * 02110-1301, USA. */ #include "vdoVolumeUtils.h" #include #include "errors.h" #include "permassert.h" #include "status-codes.h" #include "fileLayer.h" #include "userVDO.h" static char errBuf[VDO_MAX_ERROR_MESSAGE_SIZE]; /** * Load a VDO from a file. * * @param [in] filename The file name * @param [in] readOnly Whether the layer should be read-only. 
* @param [in] validateConfig Whether the VDO should validate its config * @param [out] vdoPtr A pointer to hold the VDO * * @return VDO_SUCCESS or an error code **/ static int __must_check loadVDOFromFile(const char *filename, bool readOnly, bool validateConfig, UserVDO **vdoPtr) { int result = VDO_ASSERT(validateConfig || readOnly, "Cannot make a writable VDO" " without validating its config"); if (result != VDO_SUCCESS) { return result; } PhysicalLayer *layer; if (readOnly) { result = makeReadOnlyFileLayer(filename, &layer); } else { result = makeFileLayer(filename, 0, &layer); } if (result != VDO_SUCCESS) { warnx("Failed to make FileLayer from '%s' with %s", filename, uds_string_error(result, errBuf, VDO_MAX_ERROR_MESSAGE_SIZE)); return result; } // Create the VDO. UserVDO *vdo; result = loadVDO(layer, validateConfig, &vdo); if (result != VDO_SUCCESS) { layer->destroy(&layer); warnx("loading VDO failed with: %s", uds_string_error(result, errBuf, VDO_MAX_ERROR_MESSAGE_SIZE)); return result; } *vdoPtr = vdo; return VDO_SUCCESS; } /**********************************************************************/ int makeVDOFromFile(const char *filename, bool readOnly, UserVDO **vdoPtr) { return loadVDOFromFile(filename, readOnly, true, vdoPtr); } /**********************************************************************/ int readVDOWithoutValidation(const char *filename, UserVDO **vdoPtr) { return loadVDOFromFile(filename, true, false, vdoPtr); } /**********************************************************************/ void freeVDOFromFile(UserVDO **vdoPtr) { UserVDO *vdo = *vdoPtr; if (vdo == NULL) { return; } PhysicalLayer *layer = vdo->layer; freeUserVDO(&vdo); layer->destroy(&layer); *vdoPtr = NULL; } vdo-8.3.1.1/utils/vdo/vdoVolumeUtils.h000066400000000000000000000031531476467262700176030ustar00rootroot00000000000000/* * Copyright Red Hat * * This program is free software; you can redistribute it and/or * modify it under the terms of the GNU General Public License * as published by the Free Software Foundation; either version 2 * of the License, or (at your option) any later version. * * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with this program; if not, write to the Free Software * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA * 02110-1301, USA. */ #ifndef VDO_VOLUME_UTILS_H #define VDO_VOLUME_UTILS_H #include "types.h" #include "userVDO.h" /** * Load a VDO from a file. * * @param [in] filename The file name * @param [in] readOnly Whether the layer should be read-only. * @param [out] vdoPtr A pointer to hold the VDO * * @return VDO_SUCCESS or an error code **/ int __must_check makeVDOFromFile(const char *filename, bool readOnly, UserVDO **vdoPtr); /** * Load a VDO from a file without validating the config. * * @param [in] filename The file name * @param [out] vdoPtr A pointer to hold the VDO * * @return VDO_SUCCESS or an error code **/ int __must_check readVDOWithoutValidation(const char *filename, UserVDO **vdoPtr); /** * Free the VDO made with makeVDOFromFile(). 
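 * This also destroys the underlying file layer that was created when the
 * VDO was loaded, and sets *vdoPtr to NULL.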
* * @param vdoPtr The pointer to the VDO to free **/ void freeVDOFromFile(UserVDO **vdoPtr); #endif // VDO_VOLUME_UTILS_H vdo-8.3.1.1/utils/vdo/vdoaudit.c000066400000000000000000000613551476467262700164240ustar00rootroot00000000000000/* * Copyright Red Hat * * This program is free software; you can redistribute it and/or * modify it under the terms of the GNU General Public License * as published by the Free Software Foundation; either version 2 * of the License, or (at your option) any later version. * * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with this program; if not, write to the Free Software * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA * 02110-1301, USA. */ #include #include #include #include #include #include #include #include #include "errors.h" #include "fileUtils.h" #include "logger.h" #include "memory-alloc.h" #include "syscalls.h" #include "encodings.h" #include "status-codes.h" #include "types.h" #include "blockMapUtils.h" #include "slabSummaryReader.h" #include "userVDO.h" #include "vdoVolumeUtils.h" // Reference counts are one byte, so the error delta range of possible // (stored - audited) values is [ 0 - 255 .. 255 - 0 ]. enum { MIN_ERROR_DELTA = -255, MAX_ERROR_DELTA = 255, }; /** * A record to hold the audit information for each slab. **/ typedef struct { slab_count_t slabNumber; physical_block_number_t slabOrigin; /** Reference counts audited from the block map for each slab data block */ uint8_t *refCounts; /** Number of reference count inconsistencies found in the slab */ uint32_t badRefCounts; /** * Histogram of the reference count differences in the slab, indexed by * 255 + (storedReferences - auditedReferences). **/ uint32_t deltaCounts[MAX_ERROR_DELTA - MIN_ERROR_DELTA + 1]; /** Offset in the slab of the first block with an error */ slab_block_number firstError; /** Offset in the slab of the last block with an error */ slab_block_number lastError; } SlabAudit; static const char usageString[] = "[--help] [ [--summary] | [--verbose] ] [--version] filename"; static const char helpString[] = "vdoAudit - confirm the reference counts of a VDO device\n" "\n" "SYNOPSIS\n" " vdoAudit [ [--summary] | [--verbose] ] \n" "\n" "DESCRIPTION\n" " vdoAudit adds up the logical block references to all physical\n" " blocks of a VDO device found in , then compares that sum\n" " to the stored number of logical blocks. It also confirms all of\n" " the actual reference counts on all physical blocks against the\n" " stored reference counts. 
Finally, it validates that the slab summary\n" " approximation of the free blocks in each slab is correct.\n" "\n" " If --verbose is specified, a line item will be reported for each\n" " inconsistency; otherwise a summary of the problems will be displayed.\n" "\n"; static struct option options[] = { { "help", no_argument, NULL, 'h' }, { "summary", no_argument, NULL, 's' }, { "verbose", no_argument, NULL, 'v' }, { "version", no_argument, NULL, 'V' }, { NULL, 0, NULL, 0 }, }; static char optionString[] = "hsvV"; // Command-line options static const char *filename; static bool verbose = false; // Values loaded from the volume static UserVDO *vdo = NULL; static struct slab_summary_entry *slabSummaryEntries = NULL; static block_count_t slabDataBlocks = 0; /** Values derived from the volume */ static unsigned int hintShift; /** Total number of mapped entries found in block map leaf pages */ static block_count_t lbnCount = 0; /** Reference counts and audit counters for each slab */ static SlabAudit slabs[MAX_VDO_SLABS] = { { 0, }, }; // Total number of errors of each type found static uint64_t badBlockMappings = 0; static uint64_t badRefCounts = 0; static slab_count_t badSlabs = 0; static slab_count_t badSummaryHints = 0; static const char *VDO_STATE_NAMES[] = { [VDO_CLEAN] = "CLEAN", [VDO_DIRTY] = "DIRTY", [VDO_FORCE_REBUILD] = "FORCE_REBUILD", [VDO_NEW] = "NEW", [VDO_READ_ONLY_MODE] = "READ_ONLY_MODE", [VDO_REBUILD_FOR_UPGRADE] = "REBUILD_FOR_UPGRADE", [VDO_RECOVERING] = "RECOVERING", [VDO_REPLAYING] = "REPLAYING", }; /** * Get the name of a VDO state code for logging purposes. * * @param state The state code * * @return The name of the state code **/ static const char *vdo_get_state_name(enum vdo_state state) { int result; /* Catch if a state has been added without updating the name array. */ STATIC_ASSERT(ARRAY_SIZE(VDO_STATE_NAMES) == VDO_STATE_COUNT); result = VDO_ASSERT(state < ARRAY_SIZE(VDO_STATE_NAMES), "vdo_state value %u must have a registered name", state); if (result != VDO_SUCCESS) { return "INVALID VDO STATE CODE"; } return VDO_STATE_NAMES[state]; } /** * Explain how this command-line function is used. * * @param progname Name of this program * @param usageOptionString Multi-line explanation **/ static void usage(const char *progname, const char *usageOptionsString) { fprintf(stderr, "Usage: %s %s\n", progname, usageOptionsString); exit(1); } /** * Display an error count and a description of the count, appending * 's' as a plural unless the error count is equal to one. **/ static void printErrorCount(uint64_t errorCount, const char *errorName) { printf("%llu%s%s\n", (unsigned long long) errorCount, errorName, ((errorCount == 1) ? "" : "s")); } /** * Display a histogram of the reference count error deltas found in the audit * of a slab. * * @param audit The audit to display **/ static void printSlabErrorHistogram(const SlabAudit *audit) { if (audit->badRefCounts == 0) { return; } // 50 histogram bar dots, so each dot represents 2% of the errors in a slab. static const char *HISTOGRAM_BAR = "**************************************************"; unsigned int scale = strlen(HISTOGRAM_BAR); printf(" error delta histogram\n"); printf(" delta count (%u%% of errors in slab per dot)\n", 100 / scale); for (int delta = MIN_ERROR_DELTA; delta <= MAX_ERROR_DELTA; delta++) { uint32_t count = audit->deltaCounts[delta - MIN_ERROR_DELTA]; if (count == 0) { continue; } // Round up any fraction of a dot to a full dot. 
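      // (Illustrative example: with the 50-dot scale, a delta bucket holding
      // 3 of a slab's 120 bad reference counts gets
      // DIV_ROUND_UP(3 * 50, 120) = 2 dots.)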
int width = DIV_ROUND_UP(count * (uint64_t) scale, audit->badRefCounts); printf(" %5d %8u %.*s\n", delta, count, width, HISTOGRAM_BAR); } printf("\n"); } /** * Display a summary of the problems found in the audit of a slab. * * @param audit The audit to display **/ static void printSlabErrorSummary(const SlabAudit *audit) { if (audit->badRefCounts == 0) { return; } printf("slab %u at PBN %llu had ", audit->slabNumber, (unsigned long long) audit->slabOrigin); if (audit->badRefCounts == 1) { printf("1 reference count error in SBN %u", audit->lastError); } else { printf("%u reference count errors in SBN range [%u .. %u]", audit->badRefCounts, audit->firstError, audit->lastError); } printf("\n"); } /** * Display a summary of all the problems found in the audit. **/ static void printErrorSummary(void) { printf("audit summary for VDO volume '%s':\n", filename); printErrorCount(badBlockMappings, "block mapping error"); printErrorCount(badSummaryHints, "free space hint error"); printErrorCount(badRefCounts, "reference count error"); printErrorCount(badSlabs, "error-containing slab"); for (slab_count_t i = 0; i < vdo->slabCount; i++) { printSlabErrorSummary(&slabs[i]); printSlabErrorHistogram(&slabs[i]); } } /** * Release any and all allocated memory. **/ static void freeAuditAllocations(void) { vdo_free(slabSummaryEntries); for (slab_count_t i = 0; i < vdo->slabCount; i++) { vdo_free(slabs[i].refCounts); } freeVDOFromFile(&vdo); } /** * Get the filename and any option settings from the input arguments and place * them in the corresponding global variables. Print command usage if * arguments are wrong. * * @param argc Number of input arguments * @param argv Array of input arguments * * @return VDO_SUCCESS or some error. **/ static int processAuditArgs(int argc, char *argv[]) { int c; while ((c = getopt_long(argc, argv, optionString, options, NULL)) != -1) { switch (c) { case 'h': printf("%s", helpString); exit(0); break; case 's': verbose = false; break; case 'v': verbose = true; break; case 'V': printf("%s version is: %s\n", argv[0], CURRENT_VERSION); exit(0); break; default: usage(argv[0], usageString); break; } } // Explain usage and exit if (optind != (argc - 1)) { usage(argv[0], usageString); } filename = argv[optind]; return VDO_SUCCESS; } /** * Read from the layer. * * @param startBlock The block number at which to start reading * @param blockCount The number of blocks to read * @param buffer The buffer to read into * * @return VDO_SUCCESS or an error **/ static int readFromLayer(physical_block_number_t startBlock, block_count_t blockCount, char *buffer) { return vdo->layer->reader(vdo->layer, startBlock, blockCount, buffer); } /** * Report a problem with a block map entry. **/ static void reportBlockMapEntry(const char *message, struct block_map_slot slot, height_t height, physical_block_number_t pbn, enum block_mapping_state state) { badBlockMappings++; if (!verbose) { return; } if (vdo_is_state_compressed(state)) { warnx("Mapping at (page %llu, slot %u) (height %u)" " %s (PBN %llu, state %u)\n", (unsigned long long) slot.pbn, slot.slot, height, message, (unsigned long long) pbn, state); } else { warnx("Mapping at (page %llu, slot %u) (height %u)" " %s (PBN %llu)\n", (unsigned long long) slot.pbn, slot.slot, height, message, (unsigned long long) pbn); } } /** * Record the given reference in a block map page. * * Implements MappingExaminer. 
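 *
 * Leaf mappings (height 0) increment the audited reference count for the
 * mapped physical block, while interior block map tree pages are marked
 * with PROVISIONAL_REFERENCE_COUNT so that the slab verification can apply
 * the tree-page acceptance rules.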
**/ static int examineBlockMapEntry(struct block_map_slot slot, height_t height, physical_block_number_t pbn, enum block_mapping_state state) { if (state == VDO_MAPPING_STATE_UNMAPPED) { if (pbn != VDO_ZERO_BLOCK) { reportBlockMapEntry("is unmapped but has a physical block", slot, height, pbn, state); return VDO_BAD_MAPPING; } return VDO_SUCCESS; } if (vdo_is_state_compressed(state) && (pbn == VDO_ZERO_BLOCK)) { reportBlockMapEntry("is compressed but has no physical block", slot, height, pbn, state); return VDO_BAD_MAPPING; } if (height == 0) { lbnCount++; if (pbn == VDO_ZERO_BLOCK) { return VDO_SUCCESS; } } slab_count_t slabNumber = 0; int result = getSlabNumber(vdo, pbn, &slabNumber); if (result != VDO_SUCCESS) { reportBlockMapEntry("refers to out-of-range physical block", slot, height, pbn, state); return result; } slab_block_number offset = 0; result = getSlabBlockNumber(vdo, pbn, &offset); if (result != VDO_SUCCESS) { reportBlockMapEntry("refers to slab metadata block", slot, height, pbn, state); return result; } SlabAudit *audit = &slabs[slabNumber]; if (height > 0) { // If this interior tree block has already been referenced, warn. if ((audit->refCounts[offset]) != 0) { reportBlockMapEntry("refers to previously referenced tree page", slot, height, pbn, state); } // If this interior tree block appears to be compressed, warn. if (vdo_is_state_compressed(state)) { reportBlockMapEntry("refers to compressed fragment", slot, height, pbn, state); } audit->refCounts[offset] = PROVISIONAL_REFERENCE_COUNT; } else { // If incrementing the reference count goes above the maximum, warn. if ((audit->refCounts[offset] == PROVISIONAL_REFERENCE_COUNT) || (++audit->refCounts[offset] > MAXIMUM_REFERENCE_COUNT) ) { reportBlockMapEntry("overflows reference count", slot, height, pbn, state); } } return VDO_SUCCESS; } /** * Report a problem with the reference count of a block in a slab. * * @param audit The audit record for the slab * @param sbn The offset of the block within the slab * @param treePage true if the block appears to be a * block map tree page * @param pristine true if the slab has never been used * @param auditedReferences The number of references to the block found * by examining the block map tree * @param storedReferences The reference count recorded in the slab (or * zero for a pristine slab) **/ static void reportRefCount(SlabAudit *audit, slab_block_number sbn, bool treePage, bool pristine, vdo_refcount_t auditedReferences, vdo_refcount_t storedReferences) { int errorDelta = storedReferences - (int) auditedReferences; badRefCounts++; if (audit->badRefCounts == 0) { badSlabs++; } audit->badRefCounts++; audit->deltaCounts[errorDelta - MIN_ERROR_DELTA]++; audit->firstError = min(audit->firstError, sbn); audit->lastError = max(audit->lastError, sbn); if (!verbose) { return; } warnx("Reference mismatch for%s pbn %llu\n" "Block map had %u but%s slab %u had %u\n", (treePage ? " tree page" : ""), (unsigned long long) audit->slabOrigin + sbn, auditedReferences, (pristine ? " (uninitialized)" : ""), audit->slabNumber, storedReferences); } /** * Verify all reference count entries in a given * packed_reference_sector against observed reference counts. Any * mismatches will generate a warning message. 
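 *
 * A block observed as a tree page (PROVISIONAL_REFERENCE_COUNT) is accepted
 * when its stored count is either 1 or MAXIMUM_REFERENCE_COUNT; an observed
 * count of zero is accepted when the stored count is provisional.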
* * @param audit The audit record for the slab * @param sector packed_reference_sector to check * @param entries Number of counts in this sector * @param startingOffset The starting offset within the slab * * @return The allocated count for this sector **/ static block_count_t verifyRefCountSector(SlabAudit *audit, struct packed_reference_sector *sector, block_count_t entries, slab_block_number startingOffset) { block_count_t allocatedCount = 0; for (block_count_t i = 0; i < entries; i++) { slab_block_number sbn = startingOffset + i; vdo_refcount_t observedReferences = audit->refCounts[sbn]; vdo_refcount_t storedReferences = sector->counts[i]; // If the observed reference is provisional, it is a block map tree page, // and there are two valid reference count values. if (observedReferences == PROVISIONAL_REFERENCE_COUNT) { if ((storedReferences == 1) || (storedReferences == MAXIMUM_REFERENCE_COUNT)) { allocatedCount++; continue; } reportRefCount(audit, sbn, true, false, observedReferences, storedReferences); continue; } if (observedReferences != storedReferences) { // Mismatch, but maybe the refcount is provisional and the proper // count is 0. if ((observedReferences == EMPTY_REFERENCE_COUNT) && (storedReferences == PROVISIONAL_REFERENCE_COUNT)) { continue; } reportRefCount(audit, sbn, false, false, observedReferences, storedReferences); } if (storedReferences > 0) { allocatedCount++; } } return allocatedCount; } /** * Verify all reference count entries in a given packed_reference_block * against observed reference counts. * Any mismatches will generate a warning message. * * @param audit The audit record for the slab * @param block packed_reference_block to check * @param blockEntries Number of counts in this block * @param startingOffset The starting offset within the slab * * @return The allocated count for this block **/ static block_count_t verifyRefCountBlock(SlabAudit *audit, struct packed_reference_block *block, block_count_t blockEntries, slab_block_number startingOffset) { block_count_t allocatedCount = 0; block_count_t entries = blockEntries; for (sector_count_t i = 0; (i < VDO_SECTORS_PER_BLOCK) && (entries > 0); i++) { block_count_t sectorEntries = min(entries, (block_count_t) COUNTS_PER_SECTOR); allocatedCount += verifyRefCountSector(audit, &block->sectors[i], sectorEntries, startingOffset); startingOffset += sectorEntries; entries -= sectorEntries; } return allocatedCount; } /** * Verify that the number of free blocks in the slab matches the summary's * approximate value. * * @param slabNumber The number of the slab to check * @param freeBlocks The actual number of free blocks in the slab **/ static void verifySummaryHint(slab_count_t slabNumber, block_count_t freeBlocks) { block_count_t fullnessHint = slabSummaryEntries[slabNumber].fullness_hint; block_count_t freeBlockHint = fullnessHint << hintShift; block_count_t hintError = (1ULL << hintShift); if ((freeBlocks < max(freeBlockHint, hintError) - hintError) || (freeBlocks >= (freeBlockHint + hintError))) { badSummaryHints++; if (verbose) { warnx("Slab summary reports roughly %llu free blocks in\n" "slab %u, instead of %llu blocks", (unsigned long long) freeBlockHint, slabNumber, (unsigned long long) freeBlocks); } } } /** * Verify that the reference counts for a given slab are consistent with the * block map. 
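 *
 * A slab whose summary entry has load_ref_counts false has never written
 * its reference counts; every audited count for such a slab must be zero,
 * and its summary hint is checked against a completely free slab.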
* * @param slabNumber The number of the slab to verify * @param buffer A buffer to hold the reference counts for the slab **/ static int verifySlab(slab_count_t slabNumber, char *buffer) { SlabAudit *audit = &slabs[slabNumber]; if (!slabSummaryEntries[slabNumber].load_ref_counts) { // Confirm that all reference counts for this pristine slab are 0. for (slab_block_number sbn = 0; sbn < slabDataBlocks; sbn++) { if (audit->refCounts[sbn] != 0) { reportRefCount(audit, sbn, false, true, audit->refCounts[sbn], 0); } } // Verify that the slab summary contains the expected free block count. verifySummaryHint(slabNumber, slabDataBlocks); return VDO_SUCCESS; } // Get the refCounts stored on this used slab. struct slab_depot_state_2_0 depot = vdo->states.slab_depot; int result = readFromLayer(audit->slabOrigin + slabDataBlocks, depot.slab_config.reference_count_blocks, buffer); if (result != VDO_SUCCESS) { warnx("Could not read reference count buffer for slab number %u\n", slabNumber); return result; } char *currentBlockStart = buffer; block_count_t freeBlocks = 0; slab_block_number currentOffset = 0; block_count_t remainingEntries = slabDataBlocks; while (remainingEntries > 0) { struct packed_reference_block *block = (struct packed_reference_block *) currentBlockStart; block_count_t blockEntries = min((block_count_t) COUNTS_PER_BLOCK, remainingEntries); block_count_t allocatedCount = verifyRefCountBlock(audit, block, blockEntries, currentOffset); freeBlocks += (blockEntries - allocatedCount); remainingEntries -= blockEntries; currentBlockStart += VDO_BLOCK_SIZE; currentOffset += blockEntries; } // Verify that the slab summary contains the expected free block count. verifySummaryHint(slabNumber, freeBlocks); return VDO_SUCCESS; } /** * Check that the reference counts are consistent with the block map. Warn for * any physical block whose reference counts are inconsistent. * * @return VDO_SUCCESS or some error. **/ static int verifyPBNRefCounts(void) { struct slab_config slabConfig = vdo->states.slab_depot.slab_config; size_t refCountBytes = (slabConfig.reference_count_blocks * VDO_BLOCK_SIZE); char *buffer; int result = vdo->layer->allocateIOBuffer(vdo->layer, refCountBytes, "slab reference counts", &buffer); if (result != VDO_SUCCESS) { warnx("Could not allocate %zu bytes for slab reference counts", refCountBytes); return result; } hintShift = vdo_get_slab_summary_hint_shift(vdo->slabSizeShift); for (slab_count_t slabNumber = 0; slabNumber < vdo->slabCount; slabNumber++) { result = verifySlab(slabNumber, buffer); if (result != VDO_SUCCESS) { break; } } vdo_free(buffer); return result; } /** * Audit a VDO by checking that its block map and reference counts are * consistent. * * @return true if the volume was fully consistent **/ static bool auditVDO(void) { if (vdo->states.vdo.state == VDO_NEW) { warnx("The VDO volume is newly formatted and has no auditable state"); return false; } if (vdo->states.vdo.state != VDO_CLEAN) { warnx("WARNING: The VDO was not cleanly shut down (it has state '%s')", vdo_get_state_name(vdo->states.vdo.state)); } // Get logical block count and populate observed slab reference counts. int result = examineBlockMapEntries(vdo, examineBlockMapEntry); if (result != VDO_SUCCESS) { return false; } // Load the slab summary data. result = readSlabSummary(vdo, &slabSummaryEntries); if (result != VDO_SUCCESS) { return false; } // Audit stored versus counted mapped logical blocks. 
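  // The saved value is the logical_blocks_used count recorded in the
  // recovery journal state; lbnCount was tallied from the block map leaf
  // entries by examineBlockMapEntry().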
block_count_t savedLBNCount = vdo->states.recovery_journal.logical_blocks_used; if (lbnCount == savedLBNCount) { warnx("Logical block count matched at %llu", (unsigned long long) savedLBNCount); } else { warnx("Logical block count mismatch! Expected %llu, got %llu", (unsigned long long) savedLBNCount, (unsigned long long) lbnCount); } // Now confirm the stored references of all physical blocks. result = verifyPBNRefCounts(); if (result != VDO_SUCCESS) { return false; } return ((lbnCount == savedLBNCount) && (badRefCounts == 0) && (badSummaryHints == 0)); } /**********************************************************************/ int main(int argc, char *argv[]) { static char errBuf[VDO_MAX_ERROR_MESSAGE_SIZE]; int result = vdo_register_status_codes(); if (result != VDO_SUCCESS) { errx(1, "Could not register status codes: %s", uds_string_error(result, errBuf, VDO_MAX_ERROR_MESSAGE_SIZE)); } result = processAuditArgs(argc, argv); if (result != VDO_SUCCESS) { exit(1); } result = makeVDOFromFile(filename, true, &vdo); if (result != VDO_SUCCESS) { errx(1, "Could not load VDO from '%s': %s", filename, uds_string_error(result, errBuf, VDO_MAX_ERROR_MESSAGE_SIZE)); } struct slab_depot_state_2_0 depot = vdo->states.slab_depot; physical_block_number_t slabOrigin = depot.first_block; slabDataBlocks = depot.slab_config.data_blocks; slab_count_t slabCount = vdo_compute_slab_count(depot.first_block, depot.last_block, vdo->slabSizeShift); for (slab_count_t i = 0; i < slabCount; i++) { SlabAudit *audit = &slabs[i]; audit->slabNumber = i; audit->slabOrigin = slabOrigin; slabOrigin += depot.slab_config.slab_blocks; // So firstError = min(firstError, x) will always do the right thing. audit->firstError = (slab_block_number) -1; result = vdo_allocate(slabDataBlocks, uint8_t, __func__, &audit->refCounts); if (result != VDO_SUCCESS) { freeAuditAllocations(); errx(1, "Could not allocate %llu reference counts: %s", (unsigned long long) slabDataBlocks, uds_string_error(result, errBuf, VDO_MAX_ERROR_MESSAGE_SIZE)); } } bool passed = auditVDO(); if (passed) { warnx("All pbn references matched.\n"); } else if (!verbose) { printErrorSummary(); } freeAuditAllocations(); exit(passed ? 0 : 1); } vdo-8.3.1.1/utils/vdo/vdodebugmetadata.c000066400000000000000000000517051476467262700201030ustar00rootroot00000000000000/* * Copyright Red Hat * * This program is free software; you can redistribute it and/or * modify it under the terms of the GNU General Public License * as published by the Free Software Foundation; either version 2 * of the License, or (at your option) any later version. * * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with this program; if not, write to the Free Software * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA * 02110-1301, USA. 
*/ #include #include #include #include #include "errors.h" #include "logger.h" #include "memory-alloc.h" #include "syscalls.h" #include "encodings.h" #include "status-codes.h" #include "types.h" #include "fileLayer.h" #include "parseUtils.h" #include "userVDO.h" #include "vdoVolumeUtils.h" static const char usageString[] = "[--help] [--pbn=] [--searchLBN=] [--version] filename"; static const char helpString[] = "vdoDebugMetadata - load a metadata dump of a VDO device\n" "\n" "SYNOPSIS\n" " vdoDebugMetadata [--pbn=] [--searchLBN=] \n" "\n" "DESCRIPTION\n" " vdoDebugMetadata loads the metadata regions dumped by vdoDumpMetadata.\n" " It should be run under GDB, with a breakpoint on the function\n" " doNothing.\n" "\n" " Variables vdo, slabSummary, slabs, and recoveryJournal are\n" " available, providing access to the VDO super block state, the slab\n" " summary blocks, all slab journal and reference blocks per slab,\n" " and all recovery journal blocks.\n" "\n" " Please note that this tool does not provide access to block map pages.\n" "\n" " Any --pbn argument(s) will print the slab journal entries for the\n" " given PBN(s).\n" "\n" " Any --searchLBN argument(s) will print the recovery journal entries\n" " for the given LBN(s). This includes PBN, increment/decrement, mapping\n" " state, recovery journal position information, and whether the \n" " recovery journal block is valid.\n" "\n"; static struct option options[] = { { "help", no_argument, NULL, 'h' }, { "pbn", required_argument, NULL, 'p' }, { "searchLBN", required_argument, NULL, 's' }, { "version", no_argument, NULL, 'V' }, { NULL, 0, NULL, 0 }, }; typedef struct { struct packed_slab_journal_block **slabJournalBlocks; struct packed_reference_block **referenceBlocks; } SlabState; typedef struct { struct recovery_block_header header; struct packed_journal_sector *sectors[VDO_SECTORS_PER_BLOCK]; } UnpackedJournalBlock; static UserVDO *vdo = NULL; static struct slab_summary_entry **slabSummary = NULL; static slab_count_t slabCount = 0; static SlabState *slabs = NULL; static UnpackedJournalBlock *recoveryJournal = NULL; static char *rawJournalBytes = NULL; static physical_block_number_t nextBlock; static const struct slab_config *slabConfig = NULL; static physical_block_number_t *pbns = NULL; static uint8_t pbnCount = 0; static logical_block_number_t *searchLBNs = NULL; static uint8_t searchLBNCount = 0; enum { MAX_PBNS = 255, MAX_SEARCH_LBNS = 255, }; /** * Explain how this program is used. * * @param progname Name of this program * @param usageOptionString Explanation **/ static void usage(const char *progname, const char *usageOptionsString) { fprintf(stderr, "Usage: %s %s\n", progname, usageOptionsString); exit(1); } /** * Get the filename (or "help") from the input arguments. * Print command usage if arguments are wrong. * * @param [in] argc Number of input arguments * @param [in] argv Array of input arguments * @param [out] filename Name of this VDO's file or block device * * @return VDO_SUCCESS or some error. **/ static int processArgs(int argc, char *argv[], char **filename) { int c; char *optionString = "hp:s:V"; while ((c = getopt_long(argc, argv, optionString, options, NULL)) != -1) { if (c == (int) 'h') { printf("%s", helpString); exit(0); } if (c == (int) 'V') { printf("%s version is: %s\n", argv[0], CURRENT_VERSION); exit(0); } if (c == (int) 'p') { // Limit to 255 PBNs for now. 
      if (pbnCount == MAX_PBNS) {
        errx(1, "Cannot specify more than %u PBNs", MAX_PBNS);
      }
      int result = parseUInt64(optarg, &pbns[pbnCount++]);
      if (result != VDO_SUCCESS) {
        warnx("Cannot parse PBN as a number");
        usage(argv[0], usageString);
      }
    }

    if (c == (int) 's') {
      // Limit to 255 search LBNs for now.
      if (searchLBNCount == MAX_SEARCH_LBNS) {
        errx(1, "Cannot specify more than %u search LBNs", MAX_SEARCH_LBNS);
      }
      int result = parseUInt64(optarg, &searchLBNs[searchLBNCount++]);
      if (result != VDO_SUCCESS) {
        warnx("Cannot parse search LBN as a number");
        usage(argv[0], usageString);
      }
    }
  }

  // Explain usage and exit.
  if (optind != (argc - 1)) {
    usage(argv[0], usageString);
  }
  *filename = argv[optind];
  return VDO_SUCCESS;
}

/**
 * This function provides an easy place to set a breakpoint.
 **/
__attribute__((__noinline__)) static void doNothing(void)
{
  __asm__("");
}

/**
 * Read blocks from the current position.
 *
 * @param [in]  count   How many blocks to read
 * @param [out] buffer  The buffer to read into
 *
 * @return VDO_SUCCESS or an error
 **/
static int readBlocks(block_count_t count, char *buffer)
{
  int result = vdo->layer->reader(vdo->layer, nextBlock, count, buffer);
  if (result != VDO_SUCCESS) {
    return result;
  }
  nextBlock += count;
  return result;
}

/**
 * Free a single slab state.
 *
 * @param state  A pointer to the state to free
 **/
static void freeState(SlabState *state)
{
  if (state == NULL) {
    return;
  }

  if (state->slabJournalBlocks != NULL) {
    for (block_count_t i = 0; i < slabConfig->slab_journal_blocks; i++) {
      vdo_free(state->slabJournalBlocks[i]);
      state->slabJournalBlocks[i] = NULL;
    }
  }

  if (state->referenceBlocks != NULL) {
    for (block_count_t i = 0; i < slabConfig->reference_count_blocks; i++) {
      vdo_free(state->referenceBlocks[i]);
      state->referenceBlocks[i] = NULL;
    }
  }

  vdo_free(state->slabJournalBlocks);
  vdo_free(state->referenceBlocks);
}

/**
 * Allocate a single slab state.
 *
 * @param [out] state  Where to store the allocated state
 **/
static int allocateState(SlabState *state)
{
  int result = vdo_allocate(slabConfig->slab_journal_blocks,
                            struct packed_slab_journal_block *,
                            __func__, &state->slabJournalBlocks);
  if (result != VDO_SUCCESS) {
    freeState(state);
    return result;
  }

  result = vdo_allocate(slabConfig->reference_count_blocks,
                        struct packed_reference_block *,
                        __func__, &state->referenceBlocks);
  if (result != VDO_SUCCESS) {
    freeState(state);
    return result;
  }

  PhysicalLayer *layer = vdo->layer;
  for (block_count_t i = 0; i < slabConfig->reference_count_blocks; i++) {
    char *buffer;
    result = layer->allocateIOBuffer(layer, VDO_BLOCK_SIZE,
                                     "reference count block", &buffer);
    if (result != VDO_SUCCESS) {
      freeState(state);
      return result;
    }
    state->referenceBlocks[i] = (struct packed_reference_block *) buffer;
  }

  for (block_count_t i = 0; i < slabConfig->slab_journal_blocks; i++) {
    char *buffer;
    result = layer->allocateIOBuffer(layer, VDO_BLOCK_SIZE,
                                     "slab journal block", &buffer);
    if (result != VDO_SUCCESS) {
      freeState(state);
      return result;
    }
    state->slabJournalBlocks[i] = (struct packed_slab_journal_block *) buffer;
  }
  return result;
}

/**
 * Allocate sufficient space to read the metadata dump.
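 *
 * This covers, for each slab, its reference count and slab journal blocks,
 * plus the recovery journal buffer and the slab summary blocks; block map
 * pages are never read by this tool.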
**/ static int allocateMetadataSpace(void) { slabConfig = &vdo->states.slab_depot.slab_config; int result = vdo_allocate(vdo->slabCount, SlabState, __func__, &slabs); if (result != VDO_SUCCESS) { errx(1, "Could not allocate %u slab state pointers", slabCount); } while (slabCount < vdo->slabCount) { result = allocateState(&slabs[slabCount]); if (result != VDO_SUCCESS) { errx(1, "Could not allocate slab state for slab %u", slabCount); } slabCount++; } PhysicalLayer *layer = vdo->layer; struct vdo_config *config = &vdo->states.vdo.config; size_t journalBytes = config->recovery_journal_size * VDO_BLOCK_SIZE; result = layer->allocateIOBuffer(layer, journalBytes, "recovery journal", &rawJournalBytes); if (result != VDO_SUCCESS) { errx(1, "Could not allocate %llu bytes for the journal", (unsigned long long) journalBytes); } result = vdo_allocate(config->recovery_journal_size, UnpackedJournalBlock, __func__, &recoveryJournal); if (result != VDO_SUCCESS) { errx(1, "Could not allocate %llu journal block structures", (unsigned long long) config->recovery_journal_size); } result = vdo_allocate(VDO_SLAB_SUMMARY_BLOCKS, struct slab_summary_entry *, __func__, &slabSummary); if (result != VDO_SUCCESS) { errx(1, "Could not allocate %d slab summary block pointers", VDO_SLAB_SUMMARY_BLOCKS); } for (block_count_t i = 0; i < VDO_SLAB_SUMMARY_BLOCKS; i++) { char *buffer; result = layer->allocateIOBuffer(layer, VDO_BLOCK_SIZE, "slab summary block", &buffer); if (result != VDO_SUCCESS) { errx(1, "Could not allocate slab summary block %llu", (unsigned long long) i); } slabSummary[i] = (struct slab_summary_entry *) buffer; } return result; } /** * Free the allocations from allocateMetadataSpace(). **/ static void freeMetadataSpace(void) { if (slabs != NULL) { for (slab_count_t i = 0; i < slabCount; i++) { freeState(&slabs[i]); } } vdo_free(slabs); slabs = NULL; vdo_free(rawJournalBytes); rawJournalBytes = NULL; vdo_free(recoveryJournal); recoveryJournal = NULL; if (slabSummary != NULL) { for (block_count_t i = 0; i < VDO_SLAB_SUMMARY_BLOCKS; i++) { vdo_free(slabSummary[i]); slabSummary[i] = NULL; } } vdo_free(slabSummary); slabSummary = NULL; } /** * Read the metadata into the appropriate places. **/ static void readMetadata(void) { /** * The dump tool dumps the whole block map of whatever size, or some LBNs, * or nothing, at the beginning of the dump. This tool doesn't currently know * how to read the block map, so we figure out how many other metadata blocks * there are, then skip back from the end of the file to the beginning of * that metadata. 
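 *
 * (Illustrative restatement of the arithmetic below: the first metadata
 * block read is at
 *   file blocks - slabCount * (reference_count_blocks + slab_journal_blocks)
 *               - recovery_journal_size - VDO_SLAB_SUMMARY_BLOCKS.)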
**/ block_count_t metadataBlocksPerSlab = (slabConfig->reference_count_blocks + slabConfig->slab_journal_blocks); struct vdo_config *config = &vdo->states.vdo.config; block_count_t totalNonBlockMapMetadataBlocks = ((metadataBlocksPerSlab * slabCount) + config->recovery_journal_size + VDO_SLAB_SUMMARY_BLOCKS); nextBlock = (vdo->layer->getBlockCount(vdo->layer) - totalNonBlockMapMetadataBlocks); for (slab_count_t i = 0; i < slabCount; i++) { SlabState *slab = &slabs[i]; for (block_count_t j = 0; j < slabConfig->reference_count_blocks; j++) { int result = readBlocks(1, (char *) slab->referenceBlocks[j]); if (result != VDO_SUCCESS) { errx(1, "Could not read reference block %llu for slab %u", (unsigned long long) j, i); } } for (block_count_t j = 0; j < slabConfig->slab_journal_blocks; j++) { int result = readBlocks(1, (char *) slab->slabJournalBlocks[j]); if (result != VDO_SUCCESS) { errx(1, "Could not read slab journal block %llu for slab %u", (unsigned long long) j, i); } } } int result = readBlocks(config->recovery_journal_size, rawJournalBytes); if (result != VDO_SUCCESS) { errx(1, "Could not read recovery journal"); } for (block_count_t i = 0; i < config->recovery_journal_size; i++) { UnpackedJournalBlock *block = &recoveryJournal[i]; struct packed_journal_header *packedHeader = (struct packed_journal_header *) &rawJournalBytes[i * VDO_BLOCK_SIZE]; block->header = vdo_unpack_recovery_block_header(packedHeader); for (uint8_t sector = 1; sector < VDO_SECTORS_PER_BLOCK; sector++) { block->sectors[sector] = vdo_get_journal_block_sector(packedHeader, sector); } } for (block_count_t i = 0; i < VDO_SLAB_SUMMARY_BLOCKS; i++) { readBlocks(1, (char *) slabSummary[i]); } } /** * Search slab journal for PBNs. **/ static void findSlabJournalEntries(physical_block_number_t pbn) { struct slab_depot_state_2_0 depot = vdo->states.slab_depot; if ((pbn < depot.first_block) || (pbn > depot.last_block)) { printf("PBN %llu out of range; skipping.\n", (unsigned long long) pbn); return; } block_count_t offset = pbn - depot.first_block; slab_count_t slabNumber = offset >> vdo->slabSizeShift; slab_block_number slabOffset = offset & vdo->slabOffsetMask; printf("PBN %llu is offset %d in slab %d\n", (unsigned long long) pbn, slabOffset, slabNumber); for (block_count_t i = 0; i < depot.slab_config.slab_journal_blocks; i++) { struct packed_slab_journal_block *block = slabs[slabNumber].slabJournalBlocks[i]; journal_entry_count_t entryCount = __le16_to_cpu(block->header.entry_count); for (journal_entry_count_t entryIndex = 0; entryIndex < entryCount; entryIndex++) { struct slab_journal_entry entry = vdo_decode_slab_journal_entry(block, entryIndex); if (slabOffset == entry.sbn) { printf("PBN %llu (%llu, %d) %s\n", (unsigned long long) pbn, (unsigned long long) __le64_to_cpu(block->header.sequence_number), entryIndex, vdo_get_journal_operation_name(entry.operation)); } } } } /** * Determine whether the given header describes a valid block for the * given journal, even if it is older than the last successful recovery * or reformat. A block is not "relevant" if it is unformatted, or has a * different nonce value. Use this for cases where it would not be * appropriate to use isValidRecoveryJournalBlock because we do want to * consider blocks with other recoveryCount values. 
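 *
 * In practice this checks that the block claims to be recovery journal
 * metadata and that its nonce matches this VDO's nonce, so blocks written
 * under a different nonce (e.g. by an earlier format) are excluded.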
* * @param header The unpacked block header to check * * @return True if the header is valid **/ static inline bool __must_check isBlockFromJournal(const struct recovery_block_header *header) { return ((header->metadata_type == VDO_METADATA_RECOVERY_JOURNAL) && (header->nonce == vdo->states.vdo.nonce)); } /** * Determine whether the sequence number is possible for the given * offset. Similar to isCongruentRecoveryJournalBlock(), but does not * run isValidRecoveryJournalBlock(). * * @param header The unpacked block header to check * @param offset An offset indicating where the block was in the journal * * @return True if the sequence number is possible **/ static inline bool __must_check isSequenceNumberPossibleForOffset(const struct recovery_block_header *header, physical_block_number_t offset) { block_count_t journal_size = vdo->states.vdo.config.recovery_journal_size; physical_block_number_t expectedOffset = vdo_compute_recovery_journal_block_number(journal_size, header->sequence_number); return (expectedOffset == offset); } /** * Search recovery journal for PBNs belonging to the given LBN. **/ static void findRecoveryJournalEntries(logical_block_number_t lbn) { struct block_map_slot desiredSlot = (struct block_map_slot) { .pbn = lbn / VDO_BLOCK_MAP_ENTRIES_PER_PAGE, .slot = lbn % VDO_BLOCK_MAP_ENTRIES_PER_PAGE, }; for (block_count_t i = 0; i < vdo->states.vdo.config.recovery_journal_size; i++) { UnpackedJournalBlock block = recoveryJournal[i]; for (sector_count_t j = 1; j < VDO_SECTORS_PER_BLOCK; j++) { const struct packed_journal_sector *sector = block.sectors[j]; for (journal_entry_count_t k = 0; k < sector->entry_count; k++) { struct recovery_journal_entry entry = vdo_unpack_recovery_journal_entry(§or->entries[k]); if ((desiredSlot.pbn == entry.slot.pbn) && (desiredSlot.slot == entry.slot.slot)) { bool isValidJournalBlock = isBlockFromJournal(&block.header); bool isSequenceNumberPossible = isSequenceNumberPossibleForOffset(&block.header, i); bool isSectorValid = vdo_is_valid_recovery_journal_sector(&block.header, sector, j); printf("found LBN %llu at offset %llu" " (block %svalid, sequence number %llu %spossible), " "sector %u (sector %svalid), entry %u " ": PBN %llu, %s, mappingState %u\n", (unsigned long long) lbn, (unsigned long long) i, (isValidJournalBlock ? "" : "not "), (unsigned long long) block.header.sequence_number, (isSequenceNumberPossible ? "" : "not "), j, (isSectorValid ? "" : "not "), k, (unsigned long long) entry.mapping.pbn, vdo_get_journal_operation_name(entry.operation), entry.mapping.state); } } } } } /** * Load from a dump file. * * @param filename The file name * * @return VDO_SUCCESS or an error code **/ static int __must_check readVDOFromDump(const char *filename) { PhysicalLayer *layer; int result = makeReadOnlyFileLayer(filename, &layer); if (result != VDO_SUCCESS) { char errBuf[VDO_MAX_ERROR_MESSAGE_SIZE]; warnx("Failed to make FileLayer from '%s' with %s", filename, uds_string_error(result, errBuf, VDO_MAX_ERROR_MESSAGE_SIZE)); return result; } // Load the geometry and tweak it to match the dump. struct volume_geometry geometry; result = loadVolumeGeometry(layer, &geometry); if (result != VDO_SUCCESS) { layer->destroy(&layer); char errBuf[VDO_MAX_ERROR_MESSAGE_SIZE]; warnx("VDO geometry read failed for '%s' with %s", filename, uds_string_error(result, errBuf, VDO_MAX_ERROR_MESSAGE_SIZE)); return result; } // Create the VDO. 
result = makeUserVDO(layer, &vdo); if (result != VDO_SUCCESS) { return result; } vdo->geometry = geometry; vdo->geometry.regions[VDO_DATA_REGION].start_block = 1; result = loadSuperBlock(vdo); if (result != VDO_SUCCESS) { freeUserVDO(&vdo); return result; } result = vdo_decode_component_states((u8 *) vdo->superBlockBuffer, &geometry, &vdo->states); if (result != VDO_SUCCESS) { freeUserVDO(&vdo); return result; } vdo->states.layout.start = 2; setDerivedSlabParameters(vdo); return VDO_SUCCESS; } /**********************************************************************/ int main(int argc, char *argv[]) { static char errBuf[VDO_MAX_ERROR_MESSAGE_SIZE]; int result = vdo_register_status_codes(); if (result != VDO_SUCCESS) { errx(1, "Could not register status codes: %s", uds_string_error(result, errBuf, VDO_MAX_ERROR_MESSAGE_SIZE)); } char *filename; result = vdo_allocate(MAX_PBNS, physical_block_number_t, __func__, &pbns); if (result != VDO_SUCCESS) { errx(1, "Could not allocate %zu bytes", sizeof(physical_block_number_t) * MAX_PBNS); } result = vdo_allocate(MAX_SEARCH_LBNS, logical_block_number_t, __func__, &searchLBNs); if (result != VDO_SUCCESS) { errx(1, "Could not allocate %zu bytes", sizeof(logical_block_number_t) * MAX_SEARCH_LBNS); } result = processArgs(argc, argv, &filename); if (result != VDO_SUCCESS) { exit(1); } result = readVDOFromDump(filename); if (result != VDO_SUCCESS) { errx(1, "Could not load VDO from '%s': %s", filename, uds_string_error(result, errBuf, VDO_MAX_ERROR_MESSAGE_SIZE)); } allocateMetadataSpace(); readMetadata(); // Print the nonce for this dump. printf("Nonce value: %llu\n", (unsigned long long) vdo->states.vdo.nonce); // For any PBNs specified, process them. for (uint8_t i = 0; i < pbnCount; i++) { findSlabJournalEntries(pbns[i]); } // Process any search LBNs. for (uint8_t i = 0; i < searchLBNCount; i++) { findRecoveryJournalEntries(searchLBNs[i]); } // This is a great line for a GDB breakpoint. doNothing(); // If someone runs the program manually, tell them to use GDB. if ((pbnCount == 0) && (searchLBNCount == 0)) { printf("%s", helpString); } freeMetadataSpace(); PhysicalLayer *layer = vdo->layer; freeUserVDO(&vdo); layer->destroy(&layer); exit(result); } vdo-8.3.1.1/utils/vdo/vdodumpblockmap.c000066400000000000000000000125271476467262700177710ustar00rootroot00000000000000/* * Copyright Red Hat * * This program is free software; you can redistribute it and/or * modify it under the terms of the GNU General Public License * as published by the Free Software Foundation; either version 2 * of the License, or (at your option) any later version. * * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with this program; if not, write to the Free Software * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA * 02110-1301, USA. 
*/ #include #include #include #include #include "errors.h" #include "logger.h" #include "encodings.h" #include "status-codes.h" #include "types.h" #include "blockMapUtils.h" #include "vdoVolumeUtils.h" static const char usageString[] = "[--help] [--lba=] [--version] "; static const char helpString[] = "vdoDumpBlockMap - dump the LBA->PBA mappings of a VDO device\n" "\n" "SYNOPSIS\n" " vdoDumpBlockMap [--lba=] \n" "\n" "DESCRIPTION\n" " vdoDumpBlockMap dumps all (or only the specified) LBA->PBA mappings\n" " from a cleanly shut down VDO device\n"; static struct option options[] = { { "help", no_argument, NULL, 'h' }, { "lba", required_argument, NULL, 'l' }, { "version", no_argument, NULL, 'V' }, { NULL, 0, NULL, 0 }, }; static logical_block_number_t lbn = 0xFFFFFFFFFFFFFFFF; static UserVDO *vdo; /** * Explain how this command-line function is used. * * @param progname Name of this program * @param usageOptionString Multi-line explanation **/ static void usage(const char *progname, const char *usageOptionsString) { fprintf(stderr, "Usage: %s %s\n", progname, usageOptionsString); exit(1); } /** * Get the filename (or "help") from the input arguments. * Print command usage if arguments are wrong. * * @param [in] argc Number of input arguments * @param [in] argv Array of input arguments * @param [out] filename Name of this VDO's file or block device * * @return VDO_SUCCESS or some error. **/ static int processDumpArgs(int argc, char *argv[], char **filename) { int c; char *optionString = "l:hV"; while ((c = getopt_long(argc, argv, optionString, options, NULL)) != -1) { if (c == (int) 'h') { printf("%s", helpString); exit(0); } if (c == (int) 'V') { printf("%s version is: %s\n", argv[0], CURRENT_VERSION); exit(0); } if (c == (int) 'l') { char *endptr; errno = 0; lbn = strtoull(optarg, &endptr, 0); if (errno == ERANGE || errno == EINVAL || endptr == optarg) { errx(1, "No LBA specified"); } } } // Explain usage and exit if (optind != (argc - 1)) { usage(argv[0], usageString); } *filename = argv[optind]; return VDO_SUCCESS; } /**********************************************************************/ static int dumpLBN(void) { physical_block_number_t pbn; enum block_mapping_state state; int result = findLBNMapping(vdo, lbn, &pbn, &state); if (result != VDO_SUCCESS) { warnx("Could not read mapping for lbn %llu", (unsigned long long) lbn); return result; } printf("%llu\t", (unsigned long long) lbn); switch (state) { case VDO_MAPPING_STATE_UNMAPPED: printf("unmapped \t%llu\n", (unsigned long long) pbn); break; case VDO_MAPPING_STATE_UNCOMPRESSED: printf("mapped \t%llu\n", (unsigned long long) pbn); break; default: printf("compressed \t%llu slot %u\n", (unsigned long long) pbn, state - VDO_MAPPING_STATE_COMPRESSED_BASE); break; } return VDO_SUCCESS; } /** * Print out a mapping from a block map page. * * Implements MappingExaminer. 
**/ static int dumpBlockMapEntry(struct block_map_slot slot, height_t height, physical_block_number_t pbn, enum block_mapping_state state) { if ((state != VDO_MAPPING_STATE_UNMAPPED) || (pbn != VDO_ZERO_BLOCK)) { printf("PBN %llu\t slot %u\t height %u\t" "-> PBN %llu (compression state %u)\n", (unsigned long long) slot.pbn, slot.slot, height, (unsigned long long) pbn, state); } return VDO_SUCCESS; } /**********************************************************************/ int main(int argc, char *argv[]) { static char errBuf[VDO_MAX_ERROR_MESSAGE_SIZE]; int result = vdo_register_status_codes(); if (result != VDO_SUCCESS) { errx(1, "Could not register status codes: %s", uds_string_error(result, errBuf, VDO_MAX_ERROR_MESSAGE_SIZE)); } char *filename; result = processDumpArgs(argc, argv, &filename); if (result != VDO_SUCCESS) { exit(1); } result = makeVDOFromFile(filename, true, &vdo); if (result != VDO_SUCCESS) { errx(1, "Could not load VDO from '%s': %s", filename, uds_string_error(result, errBuf, VDO_MAX_ERROR_MESSAGE_SIZE)); } result = ((lbn != 0xFFFFFFFFFFFFFFFF) ? dumpLBN() : examineBlockMapEntries(vdo, dumpBlockMapEntry)); freeVDOFromFile(&vdo); exit((result == VDO_SUCCESS) ? 0 : 1); } vdo-8.3.1.1/utils/vdo/vdodumpmetadata.c000066400000000000000000000262721476467262700177630ustar00rootroot00000000000000/* * Copyright Red Hat * * This program is free software; you can redistribute it and/or * modify it under the terms of the GNU General Public License * as published by the Free Software Foundation; either version 2 * of the License, or (at your option) any later version. * * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with this program; if not, write to the Free Software * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA * 02110-1301, USA. */ #include #include #include "errors.h" #include "fileUtils.h" #include "memory-alloc.h" #include "string-utils.h" #include "syscalls.h" #include "encodings.h" #include "status-codes.h" #include "types.h" #include "blockMapUtils.h" #include "fileLayer.h" #include "parseUtils.h" #include "physicalLayer.h" #include "userVDO.h" #include "vdoVolumeUtils.h" enum { STRIDE_LENGTH = 256, MAX_LBNS = 255, }; static const char usageString[] = "[--help] [--no-block-map] [--lbn=] [--version] vdoBacking" " outputFile"; static const char helpString[] = "vdodumpmetadata - dump the metadata regions from a VDO device\n" "\n" "SYNOPSIS\n" " vdodumpmetadata [--no-block-map] [--lbn=] " " \n" "\n" "DESCRIPTION\n" " vdodumpmetadata dumps the metadata regions of a VDO device to\n" " another file, to enable save and transfer of metadata from\n" " a VDO without transfer of the entire backing store.\n" "\n" " vdodumpmetadata will produce a large output file. The expected size is\n" " roughly equal to VDO's metadata size. A rough estimate of the storage\n" " needed is 1.4 GB per TB of logical space.\n" "\n" " If the --no-block-map option is used, the output file will be of size\n" " no higher than 130MB + (9 MB per slab).\n" "\n" " --lbn implies --no-block-map, and saves the block map page associated\n" " with the specified LBN in the output file. 
This option may be\n" " specified up to 255 times.\n" "\n"; static struct option options[] = { { "help", no_argument, NULL, 'h' }, { "lbn", required_argument, NULL, 'l' }, { "no-block-map", no_argument, NULL, 'b' }, { "version", no_argument, NULL, 'V' }, { NULL, 0, NULL, 0 }, }; static char *vdoBacking = NULL; static UserVDO *vdo = NULL; static char *outputFilename = NULL; static int outputFD = -1; static char *buffer = NULL; static bool noBlockMap = false; static uint8_t lbnCount = 0; static physical_block_number_t *lbns = NULL; /** * Explain how this command-line tool is used. * * @param progname Name of this program * @param usageOptionString Multi-line explanation **/ static void usage(const char *progname) { errx(1, "Usage: %s %s\n", progname, usageString); } /** * Release any and all allocated memory. **/ static void freeAllocations(void) { freeVDOFromFile(&vdo); try_sync_and_close_file(outputFD); vdo_free(buffer); vdo_free(lbns); buffer = NULL; } /** * Parse the arguments passed; print command usage if arguments are wrong. * * @param argc Number of input arguments * @param argv Array of input arguments **/ static void processArgs(int argc, char *argv[]) { int c; char *optionString = "hbl:V"; while ((c = getopt_long(argc, argv, optionString, options, NULL)) != -1) { switch (c) { case 'h': printf("%s", helpString); exit(0); case 'b': noBlockMap = true; break; case 'l': // lbnCount is a uint8_t, so we need to check that we don't // overflow it by performing this equality check before incrementing. if (lbnCount == MAX_LBNS) { errx(1, "Cannot specify more than %u LBNs", MAX_LBNS); } noBlockMap = true; int result = parseUInt64(optarg, &lbns[lbnCount++]); if (result != VDO_SUCCESS) { warnx("Cannot parse LBN as a number"); usage(argv[0]); } break; case 'V': printf("%s version is: %s\n", argv[0], CURRENT_VERSION); exit(0); default: usage(argv[0]); break; } } // Explain usage and exit if (optind != (argc - 2)) { usage(argv[0]); } vdoBacking = argv[optind++]; outputFilename = argv[optind++]; } /** * Copy blocks from the VDO backing to the output file. * * @param startBlock The block to start at in the VDO backing * @param count How many blocks to copy * * @return VDO_SUCCESS or an error **/ static int copyBlocks(physical_block_number_t startBlock, block_count_t count) { while ((count > 0)) { block_count_t blocksToWrite = min((block_count_t) STRIDE_LENGTH, count); int result = vdo->layer->reader(vdo->layer, startBlock, blocksToWrite, buffer); if (result != VDO_SUCCESS) { return result; } result = write_buffer(outputFD, buffer, blocksToWrite * VDO_BLOCK_SIZE); if (result != VDO_SUCCESS) { return result; } startBlock += blocksToWrite; count -= blocksToWrite; } return VDO_SUCCESS; } /** * Write a zero block to the output file. * * @return VDO_SUCCESS or an error **/ static int zeroBlock(void) { memset(buffer, 0, VDO_BLOCK_SIZE); return write_buffer(outputFD, buffer, VDO_BLOCK_SIZE); } /** * Copy the referenced page to the output file. * * Implements MappingExaminer. **/ static int copyPage(struct block_map_slot slot __attribute__((unused)), height_t height, physical_block_number_t pbn, enum block_mapping_state state) { if ((height == 0) || !isValidDataBlock(vdo, pbn) || (state == VDO_MAPPING_STATE_UNMAPPED)) { // Nothing to add to the dump. 
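/*
 * Explanatory note: height 0 entries map user data rather than block map
 * pages, and unmapped or out-of-range PBNs have no page to copy, so only
 * interior tree pages (height > 0) that reference a valid, mapped
 * data-region block are added to the dump.
 */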
return VDO_SUCCESS; } int result = copyBlocks(pbn, 1); if (result != VDO_SUCCESS) { warnx("Could not copy block map page %llu", (unsigned long long) pbn); } return result; } /**********************************************************************/ static void dumpGeometryBlock(void) { // Copy the geometry block. int result = copyBlocks(0, 1); if (result != VDO_SUCCESS) { errx(1, "Could not copy super block"); } } /**********************************************************************/ static void dumpSuperBlock(void) { struct volume_geometry geometry; int result = loadVolumeGeometry(vdo->layer, &geometry); if (result != VDO_SUCCESS) { errx(1, "Could not load geometry"); } // Copy the super block. result = copyBlocks(vdo_get_data_region_start(geometry), 1); if (result != VDO_SUCCESS) { errx(1, "Could not copy super block"); } } /**********************************************************************/ static void dumpBlockMap(void) { if (!noBlockMap) { // Copy the block map. struct block_map_state_2_0 *map = &vdo->states.block_map; int result = copyBlocks(map->root_origin, map->root_count); if (result != VDO_SUCCESS) { errx(1, "Could not copy tree root block map pages"); } result = examineBlockMapEntries(vdo, copyPage); if (result != VDO_SUCCESS) { errx(1, "Could not copy allocated block map pages"); } } else { // Copy any specific block map pages requested. for (size_t i = 0; i < lbnCount; i++) { physical_block_number_t pagePBN; int result = findLBNPage(vdo, lbns[i], &pagePBN); if (result != VDO_SUCCESS) { errx(1, "Could not read block map for LBN %llu", (unsigned long long) lbns[i]); } if (pagePBN == VDO_ZERO_BLOCK) { result = zeroBlock(); } else { result = copyBlocks(pagePBN, 1); } if (result != VDO_SUCCESS) { errx(1, "Could not copy block map for LBN %llu", (unsigned long long) lbns[i]); } } } } /**********************************************************************/ static void dumpSlabs(void) { // Copy the slab metadata. const struct slab_depot_state_2_0 depot = vdo->states.slab_depot; const struct slab_config slabConfig = depot.slab_config; block_count_t journalBlocks = slabConfig.slab_journal_blocks; block_count_t refCountBlocks = slabConfig.reference_count_blocks; for (slab_count_t i = 0; i < vdo->slabCount; i++) { physical_block_number_t slabStart = depot.first_block + (i * vdo->states.vdo.config.slab_size); physical_block_number_t origin = slabStart + slabConfig.data_blocks; int result = copyBlocks(origin, refCountBlocks + journalBlocks); if (result != VDO_SUCCESS) { errx(1, "Could not copy slab metadata"); } } } /**********************************************************************/ static void dumpRecoveryJournal(void) { // Copy the recovery journal. const struct partition *partition = getPartition(vdo, VDO_RECOVERY_JOURNAL_PARTITION, "Could not copy recovery journal, no partition"); int result = copyBlocks(partition->offset, vdo->states.vdo.config.recovery_journal_size); if (result != VDO_SUCCESS) { errx(1, "Could not copy recovery journal"); } } /**********************************************************************/ static void dumpSlabSummary(void) { // Copy the slab summary. 
const struct partition *partition = getPartition(vdo, VDO_SLAB_SUMMARY_PARTITION, "Could not copy slab summary, no partition"); int result = copyBlocks(partition->offset, VDO_SLAB_SUMMARY_BLOCKS); if (result != VDO_SUCCESS) { errx(1, "Could not copy slab summary"); } } /**********************************************************************/ int main(int argc, char *argv[]) { static char errBuf[VDO_MAX_ERROR_MESSAGE_SIZE]; int result = vdo_register_status_codes(); if (result != VDO_SUCCESS) { errx(1, "Could not register status codes: %s", uds_string_error(result, errBuf, VDO_MAX_ERROR_MESSAGE_SIZE)); } result = vdo_allocate(MAX_LBNS, physical_block_number_t, __func__, &lbns); if (result != VDO_SUCCESS) { errx(1, "Could not allocate %zu bytes", sizeof(physical_block_number_t) * MAX_LBNS); } processArgs(argc, argv); // Read input VDO. result = makeVDOFromFile(vdoBacking, true, &vdo); if (result != VDO_SUCCESS) { errx(1, "Could not load VDO from '%s'", vdoBacking); } // Allocate buffer for copies. size_t copyBufferBytes = STRIDE_LENGTH * VDO_BLOCK_SIZE; result = vdo->layer->allocateIOBuffer(vdo->layer, copyBufferBytes, "copy buffer", &buffer); if (result != VDO_SUCCESS) { errx(1, "Could not allocate %zu bytes", copyBufferBytes); } // Open the dump output file. result = open_file(outputFilename, FU_CREATE_WRITE_ONLY, &outputFD); if (result != UDS_SUCCESS) { errx(1, "Could not open output file '%s'", outputFilename); } dumpGeometryBlock(); dumpSuperBlock(); dumpBlockMap(); dumpSlabs(); dumpRecoveryJournal(); dumpSlabSummary(); freeAllocations(); exit(0); } vdo-8.3.1.1/utils/vdo/vdoforcerebuild.c000066400000000000000000000063101476467262700177510ustar00rootroot00000000000000/* * Copyright Red Hat * * This program is free software; you can redistribute it and/or * modify it under the terms of the GNU General Public License * as published by the Free Software Foundation; either version 2 * of the License, or (at your option) any later version. * * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with this program; if not, write to the Free Software * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA * 02110-1301, USA. */ #include #include #include #include #include #include "errors.h" #include "logger.h" #include "constants.h" #include "status-codes.h" #include "types.h" #include "vdoConfig.h" #include "fileLayer.h" static const char usageString[] = " [--help] filename"; static const char helpString[] = "vdoforcerebuild - prepare a VDO device to exit read-only mode\n" "\n" "SYNOPSIS\n" " vdoforcerebuild filename\n" "\n" "DESCRIPTION\n" " vdoforcerebuild forces an existing VDO device to exit read-only\n" " mode and to attempt to regenerate as much metadata as possible.\n" "\n" "OPTIONS\n" " --help\n" " Print this help message and exit.\n" "\n" " --version\n" " Show the version of vdoforcerebuild.\n" "\n"; // N.B. the option array must be in sync with the option string. 
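/*
 * Illustrative note: as written, optionString below is "h", so only -h has a
 * short form; --version is still recognized through the long-option table
 * (its val is 'V'), but a bare -V falls through to the usage message.
 */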
static struct option options[] = { { "help", no_argument, NULL, 'h' }, { "version", no_argument, NULL, 'V' }, { NULL, 0, NULL, 0 }, }; static char optionString[] = "h"; static void usage(const char *progname, const char *usageOptionsString) { errx(1, "Usage: %s%s\n", progname, usageOptionsString); } int main(int argc, char *argv[]) { static char errBuf[VDO_MAX_ERROR_MESSAGE_SIZE]; int result = vdo_register_status_codes(); if (result != VDO_SUCCESS) { errx(1, "Could not register status codes: %s", uds_string_error(result, errBuf, VDO_MAX_ERROR_MESSAGE_SIZE)); } int c; while ((c = getopt_long(argc, argv, optionString, options, NULL)) != -1) { switch (c) { case 'h': printf("%s", helpString); exit(0); break; case 'V': fprintf(stdout, "vdoforcerebuild version is: %s\n", CURRENT_VERSION); exit(0); break; default: usage(argv[0], usageString); break; }; } if (optind != (argc - 1)) { usage(argv[0], usageString); } char *filename = argv[optind]; PhysicalLayer *layer; // Passing 0 physical blocks will make a filelayer to fit the file. result = makeFileLayer(filename, 0, &layer); if (result != VDO_SUCCESS) { errx(result, "makeFileLayer failed on '%s'", filename); } result = forceVDORebuild(layer); if (result != VDO_SUCCESS) { char buf[VDO_MAX_ERROR_MESSAGE_SIZE]; errx(result, "forceRebuild failed on '%s': %s", filename, uds_string_error(result, buf, sizeof(buf))); } layer->destroy(&layer); } vdo-8.3.1.1/utils/vdo/vdoformat.c000066400000000000000000000474161476467262700166100ustar00rootroot00000000000000/* * Copyright Red Hat * * This program is free software; you can redistribute it and/or * modify it under the terms of the GNU General Public License * as published by the Free Software Foundation; either version 2 * of the License, or (at your option) any later version. * * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with this program; if not, write to the Free Software * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA * 02110-1301, USA. */ #include #include #include #include #include #include #include #include #include #include #include #include "errors.h" #include "fileUtils.h" #include "logger.h" #include "string-utils.h" #include "syscalls.h" #include "time-utils.h" #include "constants.h" #include "status-codes.h" #include "types.h" #include "vdoConfig.h" #include "fileLayer.h" #include "parseUtils.h" #include "userVDO.h" #include "vdoVolumeUtils.h" enum { MIN_SLAB_BITS = 13, DEFAULT_SLAB_BITS = 19, }; static const char usageString[] = " [--help] [options...] filename"; static const char helpString[] = "vdoformat - format a VDO device\n" "\n" "SYNOPSIS\n" " vdoformat [options] filename\n" "\n" "DESCRIPTION\n" " vdoformat formats the block device named by filename as a VDO device\n" " This is analogous to low-level device formatting. 
The device will not\n" " be formatted if it already contains a VDO, unless the --force flag is\n" " used.\n" "\n" " vdoformat can also modify some of the formatting parameters.\n" "\n" "OPTIONS\n" " --force\n" " Format the block device, even if there is already a VDO formatted\n" " thereupon.\n" "\n" " --help\n" " Print this help message and exit.\n" "\n" " --logical-size=\n" " Set the logical (provisioned) size of the VDO device to .\n" " A size suffix of K for kilobytes, M for megabytes, G for\n" " gigabytes, T for terabytes, or P for petabytes is optional. The\n" " default unit is megabytes.\n" "\n" " --slab-bits=\n" " Set the free space allocator's slab size to 2^ 4 KB blocks.\n" " must be a value between 13 and 23 (inclusive), corresponding\n" " to a slab size between 32 MB and 32 GB. The default value is 19\n" " which results in a slab size of 2 GB. This allocator manages the\n" " space VDO uses to store user data.\n" "\n" " The maximum number of slabs in the system is 8192, so this value\n" " determines the maximum physical size of a VDO volume. One slab is\n" " the minimum amount by which a VDO volume can be grown. Smaller\n" " slabs also increase the potential for parallelism if the device\n" " has multiple physical threads. Therefore, this value should be set\n" " as small as possible, given the eventual maximal size of the\n" " volume.\n" "\n" " --uds-memory-size=\n" " Specify the amount of memory, in gigabytes, to devote to the\n" " index. Accepted options are 0.25, 0.5, 0.50, 0.75, and all\n" " positive integers.\n" "\n" " --uds-sparse\n" " Specify whether or not to use a sparse index.\n" "\n" " --verbose\n" " Describe what is being formatted and with what parameters.\n" "\n" " --version\n" " Show the version of vdoformat.\n" "\n"; // N.B. the option array must be in sync with the option string. static struct option options[] = { { "force", no_argument, NULL, 'f' }, { "help", no_argument, NULL, 'h' }, { "logical-size", required_argument, NULL, 'l' }, { "slab-bits", required_argument, NULL, 'S' }, { "uds-memory-size", required_argument, NULL, 'm' }, { "uds-sparse", no_argument, NULL, 's' }, { "verbose", no_argument, NULL, 'v' }, { "version", no_argument, NULL, 'V' }, { NULL, 0, NULL, 0 }, }; static char optionString[] = "fhil:S:m:svV"; static void usage(const char *progname, const char *usageOptionsString) { errx(1, "Usage: %s%s\n", progname, usageOptionsString); } /**********************************************************************/ static void printReadableSize(size_t size) { const char *UNITS[] = { "B", "KB", "MB", "GB", "TB", "PB" }; unsigned int unit = 0; float floatSize = 0; while ((size >= 1024) && (unit < ARRAY_SIZE(UNITS) - 1)) { floatSize = (float)size / 1024; size = size / 1024; unit++; }; if (unit > 0) { printf("%4.2f %s", floatSize, UNITS[unit]); } else { printf("%zu %s", size, UNITS[unit]); } } /**********************************************************************/ static void describeCapacity(const UserVDO *vdo, uint64_t logicalSize, unsigned int slabBits) { if (logicalSize == 0) { printf("Logical blocks defaulted to %llu blocks.\n", (unsigned long long) vdo->states.vdo.config.logical_blocks); } struct slab_config slabConfig = vdo->states.slab_depot.slab_config; size_t totalSize = vdo->slabCount * slabConfig.slab_blocks * VDO_BLOCK_SIZE; size_t maxTotalSize = MAX_VDO_SLABS * slabConfig.slab_blocks * VDO_BLOCK_SIZE; printf("The VDO volume can address "); printReadableSize(totalSize); printf(" in %u data slab%s", vdo->slabCount, ((vdo->slabCount != 1) ? 
"s" : "")); if (vdo->slabCount > 1) { printf(", each "); printReadableSize(slabConfig.slab_blocks * VDO_BLOCK_SIZE); } printf(".\n"); if (vdo->slabCount < MAX_VDO_SLABS) { printf("It can grow to address at most "); printReadableSize(maxTotalSize); printf(" of physical storage in %u slabs.\n", MAX_VDO_SLABS); if (slabBits < MAX_VDO_SLAB_BITS) { printf("If a larger maximum size might be needed, use bigger slabs.\n"); } } else { printf("The volume has the maximum number of slabs and so cannot grow.\n"); if (slabBits < MAX_VDO_SLAB_BITS) { printf("Consider using larger slabs to allow the volume to grow.\n"); } } } static const char MSG_FAILED_SIG_OFFSET[] = "Failed to get offset of the %s" \ " signature on %s.\n"; static const char MSG_FAILED_SIG_LENGTH[] = "Failed to get length of the %s" \ " signature on %s.\n"; static const char MSG_FAILED_SIG_INVALID[] = "Found invalid data in the %s" \ " signature on %s.\n"; static const char MSG_SIG_DATA[] = "Found existing signature on %s at" \ " offset %s: LABEL=\"%s\" UUID=\"%s\" TYPE=\"%s\" USAGE=\"%s\".\n"; /********************************************************************** * Print info on existing signature found by blkid. If called with * force, print messages to stdout, otherwise print messages to stderr * * @param probe the current blkid probe location * @param filename the name of the file blkid is probing * @param force whether we called vdoformat with --force. * * @return VDO_SUCCESS or error. */ static int printSignatureInfo(blkid_probe probe, const char *filename, bool force) { const char *offset = NULL, *type = NULL, *magic = NULL, *usage = NULL, *label = NULL, *uuid = NULL; size_t len; int result = VDO_SUCCESS; result = blkid_probe_lookup_value(probe, "TYPE", &type, NULL); if (result == VDO_SUCCESS) { result = blkid_probe_lookup_value(probe, "SBMAGIC_OFFSET", &offset, NULL); if (result != VDO_SUCCESS) { fprintf(force ? stdout : stderr, MSG_FAILED_SIG_OFFSET, type, filename); } result = blkid_probe_lookup_value(probe, "SBMAGIC", &magic, &len); if (result != VDO_SUCCESS) { fprintf(force ? stdout : stderr, MSG_FAILED_SIG_LENGTH, type, filename); } } else { result = blkid_probe_lookup_value(probe, "PTTYPE", &type, NULL); if (result != VDO_SUCCESS) { // Unknown type. Ignore. return VDO_SUCCESS; } result = blkid_probe_lookup_value(probe, "PTMAGIC_OFFSET", &offset, NULL); if (result != VDO_SUCCESS) { fprintf(force ? stdout: stderr, MSG_FAILED_SIG_OFFSET, type, filename); } result = blkid_probe_lookup_value(probe, "PTMAGIC", &magic, &len); if (result != VDO_SUCCESS) { fprintf(force ? stdout : stderr, MSG_FAILED_SIG_LENGTH, type, filename); } usage = "partition table"; } if ((len == 0) || (offset == NULL)) { fprintf(force ? stdout : stderr, MSG_FAILED_SIG_INVALID, type, filename); } if (usage == NULL) { (void) blkid_probe_lookup_value(probe, "USAGE", &usage, NULL); } /* Return values ignored here, in the worst case we print NULL */ (void) blkid_probe_lookup_value(probe, "LABEL", &label, NULL); (void) blkid_probe_lookup_value(probe, "UUID", &uuid, NULL); fprintf(force ? stdout : stderr, MSG_SIG_DATA, filename, offset, label, uuid, type, usage); return VDO_SUCCESS; } /********************************************************************** * Check for existing signatures on disk using blkid. * * @param filename the name of the file blkid is probing * @param force whether we called vdoformat with --force. * * @return VDO_SUCCESS or error. 
*/ static int checkForSignaturesUsingBlkid(const char *filename, bool force) { int result = VDO_SUCCESS; blkid_probe probe = NULL; probe = blkid_new_probe_from_filename(filename); if (probe == NULL) { errx(1, "Failed to create a new blkid probe for device %s", filename); } blkid_probe_enable_partitions(probe, 1); blkid_probe_set_partitions_flags(probe, BLKID_PARTS_MAGIC); blkid_probe_enable_superblocks(probe, 1); blkid_probe_set_superblocks_flags(probe, BLKID_SUBLKS_LABEL | BLKID_SUBLKS_UUID | BLKID_SUBLKS_TYPE | BLKID_SUBLKS_USAGE | BLKID_SUBLKS_VERSION | BLKID_SUBLKS_MAGIC | BLKID_SUBLKS_BADCSUM); int found = 0; while (blkid_do_probe(probe) == VDO_SUCCESS) { found++; printSignatureInfo(probe, filename, force); } if (found > 0) { if (force) { printf("Formatting device already containing a known signature.\n"); } else { fprintf(stderr, "Cannot format device already containing a known signature!\n" "If you are sure you want to format this device again, use the\n" "--force option.\n"); result = EPERM; } } blkid_free_probe(probe); return result; } /********************************************************************** * Count the number of processes holding access to the device * * @param path the path to the holders sysfs directory. * @param force pointer to holder count variable. * * @return VDO_SUCCESS or error. */ static int countHolders(char *path, int *holders) { struct stat statbuf; int result = logging_stat(path, &statbuf, "Getting holder count"); if (result != UDS_SUCCESS) { fprintf(stderr, "Unable to get status of %s.\n", path); return result; } struct dirent *dirent; DIR *d = opendir(path); if (d == NULL) { fprintf(stderr, "Unable to open holders directory.\n"); return EPERM; } while ((dirent = readdir(d))) { if (strcmp(dirent->d_name, ".") && strcmp(dirent->d_name, "..")) { (*holders)++; } } closedir(d); return VDO_SUCCESS; } #define HOLDER_CHECK_RETRIES 25 #define HOLDER_CHECK_USLEEP_DELAY 200000 /********************************************************************** * Check that the device we are about to format is not in use by * something else. * * @param filename the name of the device we are checking * @param major the device's major number * @param minor the device's minor number * * @return VDO_SUCCESS or error. 
*/ static int checkDeviceInUse(char *filename, uint32_t major, uint32_t minor) { unsigned int retries = HOLDER_CHECK_RETRIES; int holders = 0; char *path; int result = vdo_alloc_sprintf(__func__, &path, "/sys/dev/block/%u:%u/holders", major, minor); if (result != VDO_SUCCESS) { return result; } result = countHolders(path, &holders); if (result != VDO_SUCCESS) { free(path); return result; } while (holders > 0 && retries--) { if (!retries) { fprintf(stderr, "The device %s is in use.\n", filename); free(path); return EPERM; } usleep(HOLDER_CHECK_USLEEP_DELAY); printf("Retrying in use check for %s.\n", filename); int result = countHolders(path, &holders); if (result != VDO_SUCCESS) { free(path); return result; } } free(path); return VDO_SUCCESS; } /**********************************************************************/ int main(int argc, char *argv[]) { static char errBuf[VDO_MAX_ERROR_MESSAGE_SIZE]; int result = vdo_register_status_codes(); if (result != VDO_SUCCESS) { errx(1, "Could not register status codes: %s", uds_string_error(result, errBuf, VDO_MAX_ERROR_MESSAGE_SIZE)); } uint64_t logicalSize = 0; // defaults to physicalSize unsigned int slabBits = DEFAULT_SLAB_BITS; UdsConfigStrings configStrings; memset(&configStrings, 0, sizeof(configStrings)); int c; uint64_t sizeArg; static bool verbose = false; static bool force = false; while ((c = getopt_long(argc, argv, optionString, options, NULL)) != -1) { switch (c) { case 'f': force = true; break; case 'h': printf("%s", helpString); exit(0); break; case 'l': result = parseSize(optarg, true, &sizeArg); if (result != VDO_SUCCESS) { usage(argv[0], usageString); } logicalSize = sizeArg; break; case 'S': result = parseUInt(optarg, MIN_SLAB_BITS, MAX_VDO_SLAB_BITS, &slabBits); if (result != VDO_SUCCESS) { warnx("invalid slab bits, must be %u-%u", MIN_SLAB_BITS, MAX_VDO_SLAB_BITS); usage(argv[0], usageString); } break; case 'm': configStrings.memorySize = optarg; break; case 's': configStrings.sparse = "1"; break; case 'v': verbose = true; break; case 'V': fprintf(stdout, "vdoformat version is: %s\n", CURRENT_VERSION); exit(0); break; default: usage(argv[0], usageString); break; }; } if (optind != (argc - 1)) { usage(argv[0], usageString); } char *filename = argv[optind]; struct stat statbuf; result = logging_stat_missing_ok(filename, &statbuf, "Getting status"); if (result != UDS_SUCCESS && result != ENOENT) { errx(1, "unable to get status of %s", filename); } if (!S_ISBLK(statbuf.st_mode)) { errx(1, "%s must be a block device", filename); } uint32_t major = major(statbuf.st_rdev); uint32_t minor = minor(statbuf.st_rdev); result = checkDeviceInUse(filename, major, minor); if (result != VDO_SUCCESS) { errx(1, "checkDeviceInUse failed on %s", filename); } int fd; result = open_file(filename, FU_READ_WRITE, &fd); if (result != UDS_SUCCESS) { errx(1, "unable to open %s", filename); } uint64_t physicalSize; if (ioctl(fd, BLKGETSIZE64, &physicalSize) < 0) { errx(1, "unable to get size of %s", filename); } if (physicalSize > MAXIMUM_VDO_PHYSICAL_BLOCKS * VDO_BLOCK_SIZE) { errx(1, "underlying block device size exceeds the maximum (%llu)", (unsigned long long) (MAXIMUM_VDO_PHYSICAL_BLOCKS * VDO_BLOCK_SIZE)); } result = close_file(fd, "cannot close file"); if (result != UDS_SUCCESS) { errx(1, "cannot close %s", filename); } struct vdo_config config = { .logical_blocks = logicalSize / VDO_BLOCK_SIZE, .physical_blocks = physicalSize / VDO_BLOCK_SIZE, .slab_size = 1 << slabBits, .slab_journal_blocks = DEFAULT_VDO_SLAB_JOURNAL_SIZE, .recovery_journal_size = 
DEFAULT_VDO_RECOVERY_JOURNAL_SIZE, }; if ((config.logical_blocks * VDO_BLOCK_SIZE) != (block_count_t) logicalSize) { errx(1, "logical size must be a multiple of block size %d", VDO_BLOCK_SIZE); } char errorBuffer[VDO_MAX_ERROR_MESSAGE_SIZE]; if (config.logical_blocks > MAXIMUM_VDO_LOGICAL_BLOCKS) { errx(1, "%llu requested logical space exceeds the maximum " "(%llu): %s", (unsigned long long) logicalSize, (unsigned long long) (MAXIMUM_VDO_LOGICAL_BLOCKS * VDO_BLOCK_SIZE), uds_string_error(VDO_OUT_OF_RANGE, errorBuffer, sizeof(errorBuffer))); } PhysicalLayer *layer; result = makeFileLayer(filename, config.physical_blocks, &layer); if (result != VDO_SUCCESS) { errx(1, "makeFileLayer failed on '%s'", filename); } // Check whether there's already something on this device already... result = checkForSignaturesUsingBlkid(filename, force); if (result != VDO_SUCCESS) { errx(1, "checkForSignaturesUsingBlkid failed on '%s'", filename); } struct index_config indexConfig; result = parseIndexConfig(&configStrings, &indexConfig); if (result != VDO_SUCCESS) { errx(1, "parseIndexConfig failed: %s", uds_string_error(result, errorBuffer, sizeof(errorBuffer))); } // Zero out the UDS superblock in case there's already a UDS there. char *zeroBuffer; result = layer->allocateIOBuffer(layer, VDO_BLOCK_SIZE, "zero buffer", &zeroBuffer); if (result != VDO_SUCCESS) { return result; } result = layer->writer(layer, 1, 1, zeroBuffer); if (result != VDO_SUCCESS) { return result; } if (verbose) { if (logicalSize > 0) { printf("Formatting '%s' with %llu logical and %llu" " physical blocks of %u bytes.\n", filename, (unsigned long long) config.logical_blocks, (unsigned long long) config.physical_blocks, VDO_BLOCK_SIZE); } else { printf("Formatting '%s' with default logical and %llu" " physical blocks of %u bytes.\n", filename, (unsigned long long) config.physical_blocks, VDO_BLOCK_SIZE); } } result = formatVDO(&config, &indexConfig, layer); if (result != VDO_SUCCESS) { const char *extraHelp = ""; if (result == VDO_TOO_MANY_SLABS) { extraHelp = "\nReduce the device size or increase the slab size"; } if (result == UDS_ASSERTION_FAILED) { result = VDO_BAD_CONFIGURATION; extraHelp = "\nInformation on the failure can be found in the logs"; } if (result == VDO_NO_SPACE) { block_count_t minVDOBlocks = 0; int calcResult = calculateMinimumVDOFromConfig(&config, &indexConfig, &minVDOBlocks); if (calcResult != VDO_SUCCESS) { errx(1, "Unable to calculate minimum required VDO size"); } else { uint64_t minimumSize = minVDOBlocks * VDO_BLOCK_SIZE; fprintf(stderr, "Minimum required size for VDO volume: %llu bytes\n", (unsigned long long) minimumSize); } } errx(1, "formatVDO failed on '%s': %s%s", filename, uds_string_error(result, errorBuffer, sizeof(errorBuffer)), extraHelp); } UserVDO *vdo; result = loadVDO(layer, true, &vdo); if (result != VDO_SUCCESS) { errx(1, "unable to verify configuration after formatting '%s'", filename); } // Display default logical size, max capacity, etc. describeCapacity(vdo, logicalSize, slabBits); freeUserVDO(&vdo); // Close and sync the underlying file. layer->destroy(&layer); } vdo-8.3.1.1/utils/vdo/vdolistmetadata.c000066400000000000000000000146321476467262700177660ustar00rootroot00000000000000/* * Copyright Red Hat * * This program is free software; you can redistribute it and/or * modify it under the terms of the GNU General Public License * as published by the Free Software Foundation; either version 2 * of the License, or (at your option) any later version. 
* * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with this program; if not, write to the Free Software * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA * 02110-1301, USA. */ #include #include #include #include "errors.h" #include "string-utils.h" #include "syscalls.h" #include "encodings.h" #include "status-codes.h" #include "types.h" #include "userVDO.h" #include "vdoVolumeUtils.h" static const char usageString[] = "[--help] [--version] "; static const char helpString[] = "vdoListMetadata - list the metadata regions on a VDO device\n" "\n" "SYNOPSIS\n" " vdoListMetadata \n" "\n" "DESCRIPTION\n" " vdoListMetadata lists the metadata regions of a VDO device\n" " as ranges of block numbers. Each range is on a separate line\n" " of the form:\n" " startBlock .. endBlock: label\n" " Both endpoints are included in the range, and are the zero-based\n" " indexes of 4KB VDO metadata blocks on the backing device.\n" "\n"; static struct option options[] = { { "help", no_argument, NULL, 'h' }, { "version", no_argument, NULL, 'V' }, { NULL, 0, NULL, 0 }, }; static char *vdoBackingName = NULL; static UserVDO *vdo = NULL; /** * Explain how this command-line tool is used. * * @param programName Name of this program * @param usageString Multi-line explanation **/ static void usage(const char *programName) { errx(1, "Usage: %s %s\n", programName, usageString); } /** * Parse the arguments passed; print command usage if arguments are wrong. * * @param argc Number of input arguments * @param argv Array of input arguments **/ static void processArgs(int argc, char *argv[]) { int c; while ((c = getopt_long(argc, argv, "hV", options, NULL)) != -1) { switch (c) { case 'h': printf("%s", helpString); exit(0); case 'V': printf("%s version is: %s\n", argv[0], CURRENT_VERSION); exit(0); default: usage(argv[0]); break; } } // Explain usage and exit if (optind != (argc - 1)) { usage(argv[0]); } vdoBackingName = argv[optind++]; } /** * List a range of metadata blocks on stdout. * * @param label The type of metadata * @param startBlock The block to start at in the VDO backing device * @param count The number of metadata blocks in the range **/ static void listBlocks(const char *label, physical_block_number_t startBlock, block_count_t count) { printf("%ld .. %ld: %s\n", startBlock, startBlock + count - 1, label); } /**********************************************************************/ static void listGeometryBlock(void) { // The geometry block is a single block at the start of the volume. listBlocks("geometry block", 0, 1); } /**********************************************************************/ static void listIndex(void) { // The index is all blocks from the geometry block to the super block, // exclusive. listBlocks("index", 1, vdo_get_data_region_start(vdo->geometry) - 1); } /**********************************************************************/ static void listSuperBlock(void) { // The SuperBlock is a single block at the start of the data region. 
listBlocks("super block", vdo_get_data_region_start(vdo->geometry), 1); } /**********************************************************************/ static void listBlockMap(void) { struct block_map_state_2_0 map = vdo->states.block_map; if (map.root_count > 0) { listBlocks("block map tree roots", map.root_origin, map.root_count); } } /**********************************************************************/ static void listSlabs(void) { struct slab_depot_state_2_0 depot = vdo->states.slab_depot; physical_block_number_t slabOrigin = depot.first_block; for (slab_count_t slab = 0; slab < vdo->slabCount; slab++) { // List the slab's reference count blocks. char buffer[64]; sprintf(buffer, "slab %u reference blocks", slab); listBlocks(buffer, slabOrigin + depot.slab_config.data_blocks, depot.slab_config.reference_count_blocks); // List the slab's journal blocks. sprintf(buffer, "slab %u journal", slab); listBlocks(buffer, vdo_get_slab_journal_start_block(&depot.slab_config, slabOrigin), depot.slab_config.slab_journal_blocks); slabOrigin += vdo->states.vdo.config.slab_size; } } /**********************************************************************/ static void listRecoveryJournal(void) { const struct partition *partition = getPartition(vdo, VDO_RECOVERY_JOURNAL_PARTITION, "no recovery journal partition"); listBlocks("recovery journal", partition->offset, vdo->states.vdo.config.recovery_journal_size); } /**********************************************************************/ static void listSlabSummary(void) { const struct partition *partition = getPartition(vdo, VDO_SLAB_SUMMARY_PARTITION, "no slab summary partition"); listBlocks("slab summary", partition->offset, VDO_SLAB_SUMMARY_BLOCKS); } /**********************************************************************/ int main(int argc, char *argv[]) { static char errBuf[VDO_MAX_ERROR_MESSAGE_SIZE]; int result = vdo_register_status_codes(); if (result != VDO_SUCCESS) { errx(1, "Could not register status codes: %s", uds_string_error(result, errBuf, VDO_MAX_ERROR_MESSAGE_SIZE)); } processArgs(argc, argv); // Read input VDO, without validating its config. result = readVDOWithoutValidation(vdoBackingName, &vdo); if (result != VDO_SUCCESS) { errx(1, "Could not load VDO from '%s'", vdoBackingName); } listGeometryBlock(); listIndex(); listSuperBlock(); listBlockMap(); listSlabs(); listRecoveryJournal(); listSlabSummary(); freeVDOFromFile(&vdo); exit(0); } vdo-8.3.1.1/utils/vdo/vdoreadonly.c000066400000000000000000000064051476467262700171260ustar00rootroot00000000000000/* * Copyright Red Hat * * This program is free software; you can redistribute it and/or * modify it under the terms of the GNU General Public License * as published by the Free Software Foundation; either version 2 * of the License, or (at your option) any later version. * * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with this program; if not, write to the Free Software * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA * 02110-1301, USA. 
*/ #include #include #include #include #include #include "errors.h" #include "fileUtils.h" #include "logger.h" #include "string-utils.h" #include "constants.h" #include "status-codes.h" #include "fileLayer.h" #include "physicalLayer.h" #include "vdoConfig.h" #include "vdoVolumeUtils.h" static const char usageString[] = " [--help] filename"; static const char helpString[] = "vdoreadonly - Puts a VDO device into read-only mode\n" "\n" "SYNOPSIS\n" " vdoreadonly filename\n" "\n" "DESCRIPTION\n" " vdoreadonly forces an existing VDO device into read-only\n" " mode.\n" "\n" "OPTIONS\n" " --help\n" " Print this help message and exit.\n" "\n" " --version\n" " Show the version of vdoreadonly.\n" "\n"; // N.B. the option array must be in sync with the option string. static struct option options[] = { { "help", no_argument, NULL, 'h' }, { "version", no_argument, NULL, 'V' }, { NULL, 0, NULL, 0 }, }; static char optionString[] = "h"; static void usage(const char *progname, const char *usageOptionsString) { errx(1, "Usage: %s%s\n", progname, usageOptionsString); } /**********************************************************************/ int main(int argc, char *argv[]) { static char errBuf[VDO_MAX_ERROR_MESSAGE_SIZE]; int result = vdo_register_status_codes(); if (result != VDO_SUCCESS) { errx(1, "Could not register status codes: %s", uds_string_error(result, errBuf, VDO_MAX_ERROR_MESSAGE_SIZE)); } int c; while ((c = getopt_long(argc, argv, optionString, options, NULL)) != -1) { switch (c) { case 'h': printf("%s", helpString); exit(0); break; case 'V': fprintf(stdout, "vdoreadonly version is: %s\n", CURRENT_VERSION); exit(0); break; default: usage(argv[0], usageString); break; }; } if (optind != (argc - 1)) { usage(argv[0], usageString); } char *filename = argv[optind]; PhysicalLayer *layer; result = makeFileLayer(filename, 0, &layer); if (result != VDO_SUCCESS) { errx(result, "makeFileLayer failed on '%s'", filename); } result = setVDOReadOnlyMode(layer); if (result != VDO_SUCCESS) { char buf[VDO_MAX_ERROR_MESSAGE_SIZE]; errx(result, "setting read-only mode failed on '%s': %s", filename, uds_string_error(result, buf, sizeof(buf))); } // Close and sync the underlying file. layer->destroy(&layer); } vdo-8.3.1.1/utils/vdo/vdorecover000077500000000000000000000207711476467262700165420ustar00rootroot00000000000000#!/bin/bash ## # Copyright Red Hat. # By Sweet Tea Dorminy, Awez Shaikh, Nikhil Kshirsagar # # Licensed under the GPL 2. See LICENSE in this repository. ## set -e _disableVDO(){ dmsetup reload $VDO_VOLUME_NAME --table "0 `blockdev --getsz $VDO_DEVICE` error" dmsetup resume $VDO_VOLUME_NAME } _enableOriginalVDO(){ dmsetup reload $VDO_VOLUME_NAME --table "${VDO_TABLE}" dmsetup resume $VDO_VOLUME_NAME } _cleanup(){ echo "Error detected, cleaning up..." 
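# Best-effort teardown, roughly in reverse order of setup: unmount the
# snapshot, restore any dependent device's original table, remove the
# snapshot/origin devices stacked above VDO and their loop device, then do
# the same for the snapshot under VDO and restore the original VDO table.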
umount $MOUNT_POINT || true if [[ -n $VDO_DEPENDENT_DEV_ORIGINAL_TABLE ]]; then dmsetup reload $VDO_DEPENDENT_DEV --table "${VDO_DEPENDENT_DEV_ORIGINAL_TABLE}" dmsetup resume $VDO_DEPENDENT_DEV fi if [[ -n $VDO_VOLUME_NAME ]]; then DEVICE_NAME=$VDO_VOLUME_NAME dmsetup remove $DEVICE_NAME-merge || true dmsetup remove $DEVICE_NAME-origin || true dmsetup remove $DEVICE_NAME-snap || true fi losetup -d $LOOPBACK1 || true rm $LOOPBACK_DIR/$DEVICE_NAME-tmp_loopback_file || true if [[ -n $VDO_TABLE ]]; then _disableVDO fi DEVICE_NAME=$(basename $VDO_BACKING) if [[ -n $DEVICE_NAME ]]; then dmsetup remove $DEVICE_NAME-merge || true dmsetup remove $DEVICE_NAME-origin || true dmsetup remove $DEVICE_NAME-snap || true fi if [[ -n $VDO_TABLE ]]; then _enableOriginalVDO fi losetup -d $LOOPBACK0 || true rm $LOOPBACK_DIR/$DEVICE_NAME-tmp_loopback_file || true } _waitForUserToDeleteStuff(){ MOUNT_POINT=$1 ANS='n' while [[ $ANS != y ]] ;do echo " " echo "Please remove some files from $MOUNT_POINT, then proceed" echo " " echo -n "Proceed? [y/n]: " read -n 1 ANS done } _fstrimAndPrompt(){ MOUNT_POINT=$1 local KEEPGOING=true while $KEEPGOING; do # If we weren't provided a $MOUNT_POINT, then don't try to run fstrim if [[ -n $MOUNT_POINT ]]; then fstrim $MOUNT_POINT || echo "Unable to run fstrim against $MOUNT_POINT" fi USED=$(vdostats $VDO_VOLUME_NAME | awk 'NR==2 {print $5}' | sed 's/%//') echo "Now down to just ${USED}% used" # If volume usage is (still) at 100%, then the device needs additional manual cleanup. if [[ $USED == 100 ]];then _waitForUserToDeleteStuff "$MOUNT_POINT" else KEEPGOING=false fi done } _fstrim(){ DEVICE=$1 MOUNT_POINT=$2 local NUMERATOR=$(dmsetup status $DEVICE | awk '{print $4}' | awk -F "/" '{print $1}') local DENOMINATOR=$(dmsetup status $DEVICE | awk '{print $4}' | awk -F "/" '{print $2}') if [[ $NUMERATOR -lt $DENOMINATOR ]]; then echo "Beginning space reclaim process -- running fstrim..." _fstrimAndPrompt $MOUNT_POINT else echo "No room on snapshot for fstrim!" fi } _unmount(){ MOUNT_POINT=$1 local UMNT=true while $UMNT; do umount $MOUNT_POINT UOUT=$(echo $?) 
if [[ $UOUT -ne 0 ]]; then echo "Process still has a open file or directory in $MOUNT_POINT" sleep 10 else UMNT=false fi done rmdir $MOUNT_POINT } _waitForMerge(){ DEVICE=$1 local KEEPGOING=true while $KEEPGOING; do local NUMERATOR=$(dmsetup status $DEVICE | awk '{print $4}' | awk -F "/" '{print $1}') local DENOMINATOR=$(dmsetup status $DEVICE | awk '{print $5}') if [[ $NUMERATOR -ne $DENOMINATOR ]];then printf "Merging, %u more chunks for %s\n" $((NUMERATOR - DENOMINATOR)) $DEVICE sleep 1 else KEEPGOING=false fi done } _mergeSnapshot(){ DEVICE=$1 DEVICE_NAME=$(basename $DEVICE) dmsetup remove $DEVICE_NAME-origin dmsetup suspend $DEVICE_NAME-snap #dmsetup create $VDO_VOLUME_NAME --table "$(echo $VDO_TABLE | awk "{\$5=\"${VDO_BACKING}\"; print }")" MERGE_TABLE=$(dmsetup table $DEVICE_NAME-snap | awk "{\$3=\"snapshot-merge\"; print }") dmsetup create $DEVICE_NAME-merge --table "$MERGE_TABLE" _waitForMerge $DEVICE_NAME-merge dmsetup remove $DEVICE_NAME-merge dmsetup remove $DEVICE_NAME-snap } _mergeDataSnap(){ PARENT=$1 if [[ -n $VDO_DEPENDENT_DEV ]]; then dmsetup suspend $VDO_DEPENDENT_DEV fi _mergeSnapshot $1 if [[ -n $VDO_DEPENDENT_DEV ]]; then dmsetup reload $VDO_DEPENDENT_DEV --table "${VDO_DEPENDENT_DEV_ORIGINAL_TABLE}" dmsetup resume $VDO_DEPENDENT_DEV fi losetup -d $LOOPBACK1 PARENT_NAME=$(basename $PARENT) rm $LOOPBACK_DIR/$PARENT_NAME-tmp_loopback_file } _mergeBackingSnap(){ _disableVDO _mergeSnapshot $VDO_BACKING _enableOriginalVDO losetup -d $LOOPBACK0 VDO_BACKING_NAME=$(basename $VDO_BACKING) rm $LOOPBACK_DIR/$VDO_BACKING_NAME-tmp_loopback_file } _mkloop(){ DEVICE=$1 LO_DEV_SIZE=${TMPFILESZ:-$(($(blockdev --getsz $DEVICE)*10/100))} DEVICE_NAME=$(basename $DEVICE) TMPFS=$(df -k $LOOPBACK_DIR | awk 'NR==2 {print $4}') if [[ TMPFS -lt LO_DEV_SIZE ]]; then echo "Not enough free space for Snapshot" echo "Specify LOOPBACK_DIR with free space or smaller TMPFILESZ in kb" exit 1 fi truncate -s ${LO_DEV_SIZE}M $LOOPBACK_DIR/$DEVICE_NAME-tmp_loopback_file LOOPBACK=$(losetup -f $LOOPBACK_DIR/$DEVICE_NAME-tmp_loopback_file --show) } _snap(){ DEVICE=$1 DEVICE_NAME=$(basename $DEVICE) dmsetup create $DEVICE_NAME-origin --table "0 `blockdev --getsz $DEVICE` snapshot-origin $DEVICE" _mkloop $DEVICE dmsetup create $DEVICE_NAME-snap --table "0 `blockdev --getsz $DEVICE` snapshot $DEVICE $LOOPBACK PO 4096 2 discard_zeroes_cow discard_passdown_origin" SNAP="/dev/mapper/${DEVICE_NAME}-snap" } _insertSnapUnderVDO(){ VDO_BACKING=$(echo $VDO_TABLE | cut -d' ' -f 5) _disableVDO _snap $VDO_BACKING LOOPBACK0=$LOOPBACK SNAP_UNDER_VDO=$SNAP SNAP_VDO_TABLE=$(echo $VDO_TABLE | awk "{ \$5=\"${SNAP_UNDER_VDO}\"; print }") dmsetup reload $VDO_VOLUME_NAME --table "${SNAP_VDO_TABLE}" dmsetup resume $VDO_VOLUME_NAME } _addSnapAboveVDO(){ _snap $VDO_DEVICE SNAP_OVER_VDO=$SNAP LOOPBACK1=$LOOPBACK } _tmpMount(){ DEVICE=$1 MOUNT_POINT=$(mktemp --tmpdir -d vdo-recover-XXXXXXXX) mount $1 $MOUNT_POINT echo $MOUNT_POINT } _mountVDOSnap(){ MOUNT=$(_tmpMount $SNAP_OVER_VDO) _fstrim $SNAP_OVER_VDO $MOUNT echo "Beginning commit of data changes" _unmount $MOUNT } _repointUpperDevicesOrMountVDO(){ # Check whether some other device is using VDO. If so, change that # device to point at the VDO snap and prompt the user to clean up; # else mount the VDO snap, fstrim, &c. 
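# Illustration with hypothetical names: if the VDO volume maps to 253:2 and a
# linear LV is stacked on it, that LV shows up in 'dmsetup deps' as depending
# on (253, 2); its table is temporarily rewritten to point at the -snap
# device instead, trimmed through that path, and restored further below.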
SNAP_OVER_BASENAME=$(basename $SNAP_OVER_VDO) VDO_MAJMIN=$(dmsetup ls | grep \\\b$VDO_VOLUME_NAME\\\s | cut -f 2 | sed -r 's/[()]//g') VDO_MAJMIN_DEPS_VERSION=$(echo $VDO_MAJMIN | sed "s/:/, /g") ORIGIN_OVER_BASENAME=$(echo $SNAP_OVER_BASENAME | sed 's/snap$/origin/') VDO_DEPENDENT_DEV=$(dmsetup deps | grep "${VDO_MAJMIN_DEPS_VERSION}"\ | grep -v \\\b$ORIGIN_OVER_BASENAME\\\b\ | grep -v \\\b$SNAP_OVER_BASENAME\\\b\ | cut -d':' -f 1) if [[ -n $VDO_DEPENDENT_DEV ]]; then echo "Detecting dependent device $VDO_DEPENDENT_DEV on $VDO_VOLUME_NAME -- manual intervention will be required" VDO_DEPENDENT_DEV_ORIGINAL_TABLE=$(dmsetup table $VDO_DEPENDENT_DEV) DEPENDENT_NEW_TABLE=$(echo $VDO_DEPENDENT_DEV_ORIGINAL_TABLE | sed "s#${VDO_MAJMIN}#${SNAP_OVER_VDO}#g; s#^${VDO_DEVICE}\$#${SNAP_OVER_VDO}#g") dmsetup reload $VDO_DEPENDENT_DEV --table "${DEPENDENT_NEW_TABLE}" dmsetup resume $VDO_DEPENDENT_DEV echo "You may want to remount, and run fstrim, on any filesystem" echo "mounted on ${VDO_DEPENDENT_DEV}." DEPENDENT_MOUNT=$(awk "/${VDO_DEPENDENT_DEV}/ {print \$2}" /proc/self/mounts) _fstrimAndPrompt "${DEPENDENT_MOUNT}" dmsetup reload $VDO_DEPENDENT_DEV --table "${VDO_DEPENDENT_DEV_ORIGINAL_TABLE}" dmsetup resume $VDO_DEPENDENT_DEV else _mountVDOSnap fi } _recoveryProcess(){ echo "Recovery process started" LOOPBACK_DIR=${LOOPBACK_DIR:-$(mktemp -d --tmpdir vdo-loopback-XXX)} _insertSnapUnderVDO _addSnapAboveVDO _repointUpperDevicesOrMountVDO echo "Beginning commit of data changes" _mergeDataSnap $VDO_VOLUME_NAME _mergeBackingSnap echo "Recovery process completed, $VDO_VOLUME_NAME is ${USED}% Used" } ####################################################################### VDO_DEVICE=$1 if [[ -z $1 ]] || [[ $1 == "--help" ]] || [[ $1 == "-h" ]]; then echo "Usage: ./vdo_recover {path to vdo device}" exit 1 else VDO_VOLUME_NAME=$(basename $VDO_DEVICE) if [[ $EUID -ne 0 ]]; then echo "$0: cannot open $VDO_DEVICE: Permission denied" 1>&2 exit 1 else for entry in $(dmsetup ls --target vdo) do if [ ${entry[@]} = $VDO_VOLUME_NAME ]; then if grep -qs "$VDO_DEVICE$" /proc/self/mounts ; then echo "$VDO_VOLUME_NAME appears mounted." grep "$VDO_DEVICE" /proc/self/mounts exit 1 else trap _cleanup 0 VDO_TABLE=$(dmsetup table $VDO_VOLUME_NAME) _recoveryProcess trap - 0 exit 0 fi else echo "$VDO_DEVICE not present" fi done echo "$VDO_DEVICE not detected -- not running?" exit 1 fi fi vdo-8.3.1.1/utils/vdo/vdostats.bash000066400000000000000000000022211476467262700171320ustar00rootroot00000000000000# bash completion for vdostats # # Copyright Red Hat # # This program is free software; you can redistribute it and/or # modify it under the terms of the GNU General Public License # as published by the Free Software Foundation; either version 2 # of the License, or (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA # 02110-1301, USA. # # TODO : Add device name at the end of completion of the commands. 
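# Illustrative example: with this completion loaded, typing
#   vdostats --hu<TAB>
# expands to "vdostats --human-readable", since that is the only entry in
# $opts matching the prefix.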
_vdostats() { local opts cur _init_completion || return COMPREPLY=() opts="--help --all --human-readable --si --verbose --version" cur="${COMP_WORDS[COMP_CWORD]}" case "${cur}" in *) COMPREPLY=( $(compgen -W "${opts}" -- ${cur}) ) ;; esac } complete -F _vdostats vdostats vdo-8.3.1.1/utils/vdo/vdostats.c000066400000000000000000000351761476467262700164560ustar00rootroot00000000000000/* * Copyright Red Hat * * This program is free software; you can redistribute it and/or * modify it under the terms of the GNU General Public License * as published by the Free Software Foundation; either version 2 * of the License, or (at your option) any later version. * * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with this program; if not, write to the Free Software * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA * 02110-1301, USA. */ #include #include #include #include #include #include #include #include #include #include #include #include "errors.h" #include "logger.h" #include "memory-alloc.h" #include "statistics.h" #include "status-codes.h" #include "vdoStats.h" static const char usageString[] = " [--help] [--version] [options...] [device [device ...]]"; static const char helpString[] = "vdostats - Display configuration and statistics of VDO volumes\n" "\n" "SYNOPSIS\n" " vdostats [options] [device [device ...]]\n" "\n" "DESCRIPTION\n" " vdostats displays configuration and statistics information for the given\n" " VDO devices. If no devices are given, it displays information about all\n" " VDO devices.\n" "\n" " The VDO devices must be running in order for configuration and\n" " statistics information to be reported.\n" "\n" "OPTIONS\n" " -h, --help\n" " Print this help message and exit.\n" "\n" " -a, --all\n" " For backwards compatibility. Equivalent to --verbose.\n" "\n" " --human-readable\n" " Display stats in human-readable form.\n" "\n" " --si\n" " Use SI units, implies --human-readable.\n" "\n" " -v, --verbose\n" " Include statistics regarding utilization and block I/O (bios).\n" "\n" " -V, --version\n" " Print the vdostats version number and exit.\n" "\n"; static struct option options[] = { { "help", no_argument, NULL, 'h' }, { "all", no_argument, NULL, 'a' }, { "human-readable", no_argument, NULL, 'r' }, { "si", no_argument, NULL, 's' }, { "verbose", no_argument, NULL, 'v' }, { "version", no_argument, NULL, 'V' }, { NULL, 0, NULL, 0 }, }; static char optionString[] = "harsvV"; enum style { STYLE_DF, STYLE_YAML, }; enum style style = STYLE_DF; static bool humanReadable = false; static bool si = false; static bool verbose = false; static bool headerPrinted = false; static int maxDeviceNameLength = 6; typedef struct dfStats { uint64_t size; uint64_t used; uint64_t available; int usedPercent; int savingPercent; } DFStats; typedef struct dfFieldLengths { int name; int size; int used; int available; int usedPercent; int savingPercent; } DFFieldLengths; typedef struct vdoPath { char name[NAME_MAX]; char resolvedName[NAME_MAX]; char resolvedPath[PATH_MAX]; } VDOPath; static VDOPath *vdoPaths = NULL; static int pathCount = 0; /********************************************************************** * Obtain the VDO device statistics. 
* * @param stats The device statistics * * @return A DFStats structure of device statistics * **/ static DFStats getDFStats(struct vdo_statistics *stats) { uint64_t size = stats->physical_blocks; uint64_t logicalUsed = stats->logical_blocks_used; uint64_t dataUsed = stats->data_blocks_used; uint64_t metaUsed = stats->overhead_blocks_used; uint64_t used = dataUsed + metaUsed; uint64_t available = size - used; int usedPercent = (int) (100.0 * used / size + 0.5); int savingPercent = 0; if (logicalUsed > 0) { savingPercent = (int) (100.0 * (logicalUsed - dataUsed) / logicalUsed); } return (DFStats) { .size = size, .used = used, .available = available, .usedPercent = usedPercent, .savingPercent = savingPercent, }; } /********************************************************************** * Display the size in human readable format. * * @param aFieldWidth The size field width * @param aSize The size to be displayed * **/ static void printSizeAsHumanReadable(const int aFieldWidth, const uint64_t aSize) { static const char UNITS[] = { 'B', 'K', 'M', 'G', 'T' }; double size = (double) aSize; int divisor = si ? 1000 : 1024; unsigned int i = 0; while ((size >= divisor) && (i < (ARRAY_SIZE(UNITS) - 1))) { size /= divisor; i++; } printf("%*.1f%c ", aFieldWidth - 1, size, UNITS[i]); } /********************************************************************** * Display the device statistics in DFStyle. * * @param path The device path * @param stats The device statistics * **/ static void displayDFStyle(const char *path, struct vdo_statistics *stats) { const DFFieldLengths fieldLength = {maxDeviceNameLength, 9, 9, 9, 4, 13}; char dfName[fieldLength.name + 1]; DFStats dfStats = getDFStats(stats); // Extract the device name. Use strdup for non const string. char *devicePath = strdup(path); strcpy(dfName, basename(devicePath)); free(devicePath); // Display the device statistics if (!headerPrinted) { printf("%-*s %*s %*s %*s %*s %*s\n", fieldLength.name, "Device", fieldLength.size, humanReadable ? "Size" : "1k-blocks", fieldLength.used, "Used", fieldLength.available, "Available", fieldLength.usedPercent, "Use%", fieldLength.savingPercent, "Space saving%"); headerPrinted = true; } if (stats->in_recovery_mode) { printf("%-*s %*" PRIu64 " %*s %*s %*s %*s\n", fieldLength.name, dfName, fieldLength.size, ((dfStats.size * stats->block_size) / 1024), fieldLength.used, "N/A", fieldLength.available, "N/A", (fieldLength.usedPercent - 1), "N/A", (fieldLength.savingPercent - 1), "N/A"); return; } if (humanReadable) { // Convert to human readable form (e.g., G, T, P) and // optionally in SI units (1000 as opposed to 1024). printf("%-*s ", fieldLength.name, dfName); // The first argument is the field width (provided as input // here to ease matching any future changes with the below format // string). 
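    // Illustrative example (made-up value, not output from a real device):
    // printSizeAsHumanReadable() renders 1073741824 bytes as "1.0G" with
    // the default divisor of 1024, but as "1.1G" when --si selects a
    // divisor of 1000.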
printSizeAsHumanReadable(fieldLength.size, dfStats.size * stats->block_size); printSizeAsHumanReadable(fieldLength.used, dfStats.used * stats->block_size); printSizeAsHumanReadable(fieldLength.available, dfStats.available * stats->block_size); } else { // Convert blocks to kb for printing printf("%-*s %*" PRIu64 " %*" PRIu64 " %*" PRIu64 " ", fieldLength.name, dfName, fieldLength.size, dfStats.size * stats->block_size / 1024, fieldLength.used, dfStats.used * stats->block_size / 1024, fieldLength.available, dfStats.available * stats->block_size / 1024); } if (dfStats.savingPercent < 0) { printf("%*d%% %*s\n", (fieldLength.usedPercent - 1), dfStats.usedPercent, (fieldLength.savingPercent - 1), "N/A"); } else { printf("%*d%% %*d%%\n", (fieldLength.usedPercent - 1), dfStats.usedPercent, (fieldLength.savingPercent - 1), dfStats.savingPercent); } } /********************************************************************** * Display the usage string. * * @param path The device path * @param name The dmsetup name * **/ static void usage(const char *progname, const char *usageOptionsString) { errx(1, "Usage: %s%s\n", progname, usageOptionsString); } /********************************************************************** * Parse the arguments passed; print command usage if arguments are wrong. * * @param argc Number of input arguments * @param argv Array of input arguments **/ static void process_args(int argc, char *argv[]) { int c; while ((c = getopt_long(argc, argv, optionString, options, NULL)) != -1) { switch (c) { case 'h': printf("%s", helpString); exit(0); break; case 'a': verbose = true; break; case 'r': humanReadable = true; break; case 's': si = true; humanReadable = true; break; case 'v': verbose = true; break; case 'V': printf("%s version is: %s\n", argv[0], CURRENT_VERSION); exit(0); break; default: usage(argv[0], usageString); break; }; } } /********************************************************************** * Free the allocated paths * **/ static void freeAllocations(void) { vdo_free(vdoPaths); } /********************************************************************** * Process the VDO stats for a single device. * * @param original The original name passed into vdostats * @param name The name of the vdo device to use in dmsetup message * **/ static void process_device(const char *original, const char *name) { struct vdo_statistics stats; char dmCommand[256]; sprintf(dmCommand, "dmsetup message %s 0 stats", name); FILE* fp = popen(dmCommand, "r"); if (fp == NULL) { freeAllocations(); errx(1, "'%s': Could not retrieve VDO device stats information", name); } char statsBuf[8192]; if (fgets(statsBuf, sizeof(statsBuf), fp) != NULL) { read_vdo_stats(statsBuf, &stats); switch (style) { case STYLE_DF: displayDFStyle(original, &stats); break; case STYLE_YAML: printf("%s : \n", original); vdo_write_stats(&stats); break; default: pclose(fp); freeAllocations(); errx(1, "unknown style %d", style); } } int result = pclose(fp); if ((WIFEXITED(result))) { result = WEXITSTATUS(result); } if (result != 0) { freeAllocations(); errx(1, "'%s': Could not retrieve VDO device stats information", name); } } /********************************************************************** * Transform device into a known vdo path and name, if possible. * * @param device The device name to search for. * * @return struct containing name and path if found, otherwise NULL. 
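 *
 * As a hypothetical example: a volume listed by "dmsetup ls --target vdo"
 * as "vdo0" with minor number 3 would be matched whether the caller passed
 * "vdo0", "dm-3", or a path such as "/dev/mapper/vdo0" that resolves to
 * "/dev/dm-3".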
* **/ static VDOPath *transformDevice(char *device) { for (int i = 0; i < pathCount; i++) { if (strcmp(device, vdoPaths[i].name) == 0) { return &vdoPaths[i]; } if (strcmp(device, vdoPaths[i].resolvedName) == 0) { return &vdoPaths[i]; } char buf[PATH_MAX]; char *path = realpath(device, buf); if (path == NULL) { continue; } if (strcmp(buf, vdoPaths[i].resolvedPath) == 0) { return &vdoPaths[i]; } } return NULL; } /********************************************************************** * Process the VDO stats for all VDO devices. * **/ static void enumerate_devices(void) { FILE *fp; size_t lineSize = 0; char *dmsetupLine = NULL; fp = popen("dmsetup ls --target vdo", "r"); if (fp == NULL) { errx(1, "Could not retrieve VDO device status information"); } pathCount = 0; while ((getline(&dmsetupLine, &lineSize, fp)) > 0) { pathCount++; } int result = pclose(fp); if ((WIFEXITED(result))) { result = WEXITSTATUS(result); } if (result != 0) { errx(1, "Could not retrieve VDO device status information"); } if (pathCount == 0) { errx(1, "Could not find any VDO devices"); } result = vdo_allocate(pathCount, struct vdoPath, __func__, &vdoPaths); if (result != VDO_SUCCESS) { errx(1, "Could not allocate vdo path structure"); } fp = popen("dmsetup ls --target vdo", "r"); if (fp == NULL) { freeAllocations(); errx(1, "Could not retrieve VDO device status information"); } lineSize = 0; dmsetupLine = NULL; int major, minor; int count = 0; while ((getline(&dmsetupLine, &lineSize, fp)) > 0) { int items = sscanf(dmsetupLine, "%s (%d, %d)", vdoPaths[count].name, &major, &minor); if (items != 3) { pclose(fp); freeAllocations(); errx(1, "Could not parse device mapper information"); } sprintf(vdoPaths[count].resolvedName, "dm-%d", minor); sprintf(vdoPaths[count].resolvedPath, "/dev/%s", vdoPaths[count].resolvedName); count++; } result = pclose(fp); if ((WIFEXITED(result))) { result = WEXITSTATUS(result); } if (result != 0) { freeAllocations(); errx(1, "Could not retrieve VDO device status information"); } } /********************************************************************** * Calculate max device name length to display * * @param name The name to get the length for * */ static void calculateMaxDeviceName(char *name) { int nameLength = strlen(name); maxDeviceNameLength = ((nameLength > maxDeviceNameLength) ? nameLength : maxDeviceNameLength); } /**********************************************************************/ int main(int argc, char *argv[]) { char errBuf[VDO_MAX_ERROR_MESSAGE_SIZE]; int result; result = vdo_register_status_codes(); if (result != VDO_SUCCESS) { errx(1, "Could not register status codes: %s", uds_string_error(result, errBuf, VDO_MAX_ERROR_MESSAGE_SIZE)); } process_args(argc, argv); if (verbose) { style = STYLE_YAML; } // Build a list of known vdo devices that we can validate against. 
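  // As an illustration of the expected format only (the device name is
  // made up), enumerate_devices() parses "dmsetup ls --target vdo" lines
  // such as
  //   vdo0    (253, 2)
  // recording the name, the resolved name "dm-2", and the resolved path
  // "/dev/dm-2" for the matching done below.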
enumerate_devices(); if (vdoPaths == NULL) { errx(2, "Could not collect list of known vdo devices"); } int numDevices = argc - optind; if (numDevices == 0) { // Set maxDeviceNameLength for (int i = 0; i < pathCount; i++) { calculateMaxDeviceName(vdoPaths[i].name); } // Process all VDO devices for (int i = 0; i < pathCount; i++) { process_device(vdoPaths[i].name, vdoPaths[i].name); } } else { // Set maxDeviceNameLength for (int i = optind; i < argc; i++) { calculateMaxDeviceName(basename(argv[i])); } // Process the input devices for (int i = optind; i < argc; i++) { VDOPath *path = transformDevice(argv[i]); if (path != NULL) { process_device(argv[i], path->name); } else { freeAllocations(); errx(1, "'%s': Not a valid running VDO device", argv[i]); } } } freeAllocations(); } vdo-8.3.1.1/vdo.spec000066400000000000000000000055071476467262700141520ustar00rootroot00000000000000%define spec_release 1 %define bash_completions_dir %{_datadir}/bash-completion/completions Summary: Management tools for Virtual Data Optimizer Name: vdo Version: 8.3.1.1 Release: %{spec_release}%{?dist} License: GPL-2.0-only URL: http://github.com/dm-vdo/vdo Source0: %{name}-%{version}.tgz Requires: kmod-kvdo >= 8.2 ExcludeArch: s390 ExcludeArch: ppc ExcludeArch: ppc64 ExcludeArch: i686 BuildRequires: device-mapper-devel BuildRequires: device-mapper-event-devel BuildRequires: gcc BuildRequires: libblkid-devel BuildRequires: libuuid-devel BuildRequires: make %ifarch %{valgrind_arches} BuildRequires: valgrind-devel %endif BuildRequires: zlib-devel # Disable an automatic dependency due to a file in examples/monitor. %define __requires_exclude perl %description Virtual Data Optimizer (VDO) is a device mapper target that delivers block-level deduplication, compression, and thin provisioning. This package provides the user-space management tools for VDO. %package support Summary: Support tools for Virtual Data Optimizer License: GPL-2.0-only Requires: libuuid >= 2.23 ExcludeArch: s390 ExcludeArch: ppc ExcludeArch: ppc64 ExcludeArch: i686 %description support Virtual Data Optimizer (VDO) is a device mapper target that delivers block-level deduplication, compression, and thin provisioning. This package provides the user-space support tools for VDO. 
%prep
%setup -q

%build
make

%install
make install DESTDIR=$RPM_BUILD_ROOT INSTALLOWNER= name=%{name} bindir=%{_bindir} \
    mandir=%{_mandir} defaultdocdir=%{_defaultdocdir} libexecdir=%{_libexecdir} \
    presetdir=%{_presetdir} python3_sitelib=/%{python3_sitelib} \
    sysconfdir=%{_sysconfdir} unitdir=%{_unitdir}

%files
%license COPYING
%{_bindir}/vdoforcerebuild
%{_bindir}/vdoformat
%{_bindir}/vdostats
%{bash_completions_dir}/vdostats
%dir %{_defaultdocdir}/%{name}
%dir %{_defaultdocdir}/%{name}/examples
%dir %{_defaultdocdir}/%{name}/examples/monitor
%doc %{_defaultdocdir}/%{name}/examples/monitor/monitor_check_vdostats_logicalSpace.pl
%doc %{_defaultdocdir}/%{name}/examples/monitor/monitor_check_vdostats_physicalSpace.pl
%doc %{_defaultdocdir}/%{name}/examples/monitor/monitor_check_vdostats_savingPercent.pl
%{_mandir}/man8/vdoforcerebuild.8*
%{_mandir}/man8/vdoformat.8*
%{_mandir}/man8/vdostats.8*

%files support
%{_bindir}/adaptlvm
%{_bindir}/vdoaudit
%{_bindir}/vdodebugmetadata
%{_bindir}/vdodumpblockmap
%{_bindir}/vdodumpmetadata
%{_bindir}/vdolistmetadata
%{_bindir}/vdoreadonly
%{_bindir}/vdorecover
%{_mandir}/man8/adaptlvm.8*
%{_mandir}/man8/vdoaudit.8*
%{_mandir}/man8/vdodebugmetadata.8*
%{_mandir}/man8/vdodumpblockmap.8*
%{_mandir}/man8/vdodumpmetadata.8*
%{_mandir}/man8/vdolistmetadata.8*
%{_mandir}/man8/vdoreadonly.8*
%{_mandir}/man8/vdorecover.8*

%changelog
* Thu Mar 13 2025 - Red Hat VDO Team - 8.3.1.1-1
- See https://github.com/dm-vdo/vdo.git